Bayesian Shrinkage and Padding Applied to Low-Sample Performances
How do we deal with low-sample performances? How much should we expect a player's performance to change as he gets more minutes?
An interesting question about our metrics at Gemini, recently asked by Rodrigo Picchioni, Lead Data Scout at Monaco, inspired me to write and share this post. I'm also curious how others in the industry handle this challenge.
"How much should we expect a player's performance to change as he gets more minutes? How does that affect the metrics?"
This question raised some interesting thoughts. I believe there are multiple ways to deal with this challenge, but I wanted to share our approach.
At Gemini, we believe that low-sample performances should be treated as uncertain. What does that mean? Why?
Our goal is to automatically adjust player statistics when a player hasn't played much, so we can avoid overvaluing small flashes of brilliance that might just be luck.
To that end, our approach is to apply statistical padding (shrinkage): by regressing toward the mean, we blend the player's observed signal with a baseline/prior.
But what is this baseline prior? And what exactly does regressing toward the mean involve?
In Bayesian terms, a prior is a probability distribution p(θ) over an unknown parameter θ. In our scenario, this prior is not necessarily a distribution, but rather a value that is estimated or assumed before seeing much of the player’s data.
Here, θ represents a measure of performance. For the sake of example, let's define performance as the well-known VAEP metric. In other words, θ can be interpreted as the player's true VAEP per90.
The prior mean represents our best guess for θ before observing much data. It could be the league's VAEP average, the player's position VAEP average, or something more elaborate, like the player's position VAEP average minus 1.5 times the standard deviation of the metric.
The prior strength, which controls how quickly this belief is overridden, corresponds to what we can think of as a minutes threshold: a denominator term that determines how much weight the baseline (prior mean) receives. Rather than applying a universal minutes threshold, however, we can calculate appropriate values per competition and position, since different positions and competitions exhibit different variance patterns.
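To make the group-specific priors concrete, here is a minimal Python sketch; the rows, column layout, and numbers are purely illustrative, not our actual data model:

```python
from collections import defaultdict

# Hypothetical rows: (competition, position, total VAEP, minutes played)
rows = [
    ("League A", "CB", 3.1, 2400),
    ("League A", "CB", 2.2, 1800),
    ("League A", "ST", 9.4, 2100),
    ("League B", "ST", 6.0, 1500),
]

# Prior mean per (competition, position): the group's aggregate VAEP per 90
sums = defaultdict(lambda: [0.0, 0.0])  # group -> [total value, total minutes]
for comp, pos, vaep, minutes in rows:
    sums[(comp, pos)][0] += vaep
    sums[(comp, pos)][1] += minutes

baselines = {group: v / m * 90 for group, (v, m) in sums.items()}
# The prior strength k could likewise be estimated per group (see below).
```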
Given these definitions, our implementation is best described as an empirical Bayes–style shrinkage. We do not store a full p(θ) distribution. Instead, we apply a deterministic weighted average that behaves like a Bayesian posterior mean. That is:
The padded (shrunk) estimate is computed as a weighted average between the player's observed performance and a baseline prior:
padded = w × playerValue + (1 - w) × baseline
where:
- padded = final shrunk estimate (it matches the form of a Bayesian posterior mean)
- w = n / (n + k) = weight assigned to the player's data
- n = player minutes
- k = prior strength (minutes threshold)
- playerValue = (Σ(value) / Σ(minutes)) × 90
- baseline = prior mean, i.e. our best guess for θ (e.g. league average, position average, etc.)
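As a minimal sketch of this computation in Python (the function name and signature are mine, for illustration, not Gemini's actual code):

```python
def padded_estimate(total_value, minutes, baseline, k):
    """Shrink an observed per-90 metric toward a baseline prior.

    total_value : sum of the raw metric (e.g. VAEP) across matches
    minutes     : total minutes played (n)
    baseline    : prior mean (e.g. league or position VAEP per 90)
    k           : prior strength, expressed in minutes
    """
    if minutes <= 0:
        return baseline  # no observed data: rely entirely on the prior
    player_value = total_value / minutes * 90  # observed per-90 rate
    w = minutes / (minutes + k)                # weight on the player's data
    return w * player_value + (1 - w) * baseline
```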
The prior strength k can be chosen heuristically or derived from the data.
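One common data-driven way to derive k (not necessarily the one we use) comes from split-half reliability: compute the metric for each player on two disjoint samples of n minutes each, correlate the two columns, and set the weight w = n / (n + k) equal to that correlation r, which gives k = n × (1 − r) / r. A sketch:

```python
import numpy as np

def prior_strength_from_split_half(half_a, half_b, minutes_per_half):
    """Estimate the prior strength k via split-half reliability.

    half_a, half_b   : per-90 values for the same players, computed on
                       two disjoint samples of equal size
    minutes_per_half : minutes in each sample (n)
    """
    r = np.corrcoef(half_a, half_b)[0, 1]  # reliability of an n-minute sample
    return minutes_per_half * (1 - r) / r  # solve r = n / (n + k) for k
```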
The key intuition is that as n increases, w approaches 1, and the padded estimate converges to the player's observed performance. Conversely, with limited minutes, the baseline carries more weight.
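To make the convergence concrete, a quick worked example (numbers invented) using the padded_estimate sketch above:

```python
# Baseline 0.30 VAEP per 90, prior strength k = 900 minutes.
# 300 minutes at an observed 0.60 per 90 -> w = 300 / 1200 = 0.25
print(padded_estimate(2.0, 300, baseline=0.30, k=900))    # 0.375
# Same observed rate over 2700 minutes -> w = 2700 / 3600 = 0.75
print(padded_estimate(18.0, 2700, baseline=0.30, k=900))  # 0.525
```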
Of course, it is important to consider that, regardless of padding, meaningful changes in a player's numbers will still happen because:
- True performance may be changing (adaptation, tactical role, etc.), and
- Even at, say, 2,000 minutes, some actions, especially rare events, still carry substantial variance.
In short:
- With few minutes/actions: the player signal is more uncertain → higher baseline weight → more regression to the mean.
- As minutes/actions accumulate: the player signal becomes more reliable → the player weight increases (baseline weight decreases) → less regression to the mean.
This approach improves player evaluation by going beyond simplistic minute thresholds and providing nuanced, mathematically grounded adjustments for limited playing time. As a result, the statistical padding methodology powers key analytical use cases:
- Performance Evaluation: Creates more stable metrics less susceptible to small sample variance
- Talent Identification: Helps distinguish between genuine talent and statistical noise
- Player Comparison: Enables fair comparison between players with different playing time
- League Translation: Forms the foundation for reliable cross-league performance comparison
- Time-Decay Analysis: Provides more stable inputs for longitudinal player evaluation
- Recruitment Analysis: Reduces the risk of overvaluing small-sample performances
What is Time Decay? How do we translate metrics across leagues? Subjects of upcoming posts...
P.S.: The work mentioned in this text was developed by Amod Sahasrabudhe, Gabriel Reis, João Lucas, Hugo Rios, Marc Garnica and me.