Overview
This guide explains convergence in mean (also called convergence in the r-th mean or Lr convergence). It shows how it connects to other stochastic convergence modes and offers practical tools to verify it and get rates.
In short, Xn converges to X in the r-th mean if the r-th moment of the error goes to zero: E[|Xn − X|^r] → 0 for r ≥ 1. You will see how to prove implications with standard inequalities, when mean-square (L2) convergence boils down to variance and bias, and how to make robust choices under heavy tails.
Convergence in mean sits inside a wider taxonomy that includes almost sure, in probability, and in distribution convergence; see Convergence of random variables.
The Lr viewpoint equips you with a norm-like distance on random variables. It unlocks tools from functional analysis and enables clear, quantitative error bounds.
Definitions: convergence in the r-th mean and the norm viewpoint
Convergence in the r-th mean formalizes the idea that average r-th power error vanishes. Formally, for r ≥ 1, we write Xn →Lr X if E[|Xn − X|^r] → 0 as n → ∞.
The special cases r = 1 and r = 2 are convergence in mean (L1) and mean-square convergence (L2), respectively.
It is helpful to view E[|Xn − X|^r] as the r-th power of a norm on the space of random variables with finite r-th moment: ||Y||r := (E|Y|^r)^(1/r).
For r ≥ 1, these Lp spaces are Banach spaces, i.e., complete normed spaces (a standard fact; see Lp space). Cauchy sequences in this metric have a limit in the space. This norm viewpoint lets you use the Hölder and Minkowski (triangle) inequalities to control errors.
Keep in mind that for 0 < r < 1, |·|^r is not convex and the usual triangle inequality fails. We revisit this quasi-norm regime later.
Why convergence in mean matters in statistics and machine learning
In statistics and ML, you often compare estimators or models via expected loss. Convergence in mean directly encodes that the expected r-th power error shrinks. It matches common losses like mean absolute error (MAE, r = 1) and mean squared error (MSE, r = 2).
It captures not just whether estimates get close, but how their average error decays. For estimator consistency, L2 convergence implies convergence in probability and controls both bias and variance.
For risk minimization, L1 is more robust to outliers (linear penalty), while L2 is more efficient under light tails (quadratic penalty). For example, the sample mean converges in L2 to the true mean when the variance is finite. With heavy tails you might prefer L1 criteria or median-based estimators to avoid being dominated by rare, large errors.
In model evaluation, Lr convergence governs how training error translates into expected predictive risk. A small L2 error means small variance and bias—key for squared-loss regression. Small L1 error maintains robustness, useful under heavy-tailed noise or adversarial contamination.
Relationships among modes of convergence
Different convergence notions form a hierarchy, with implications and gaps that need extra conditions to bridge. Lr convergence implies convergence in probability.
Stronger Ls convergence implies Lr when s > r ≥ 1. The converses generally fail without additional structure like uniform integrability, which can restore L1 convergence.
Lr ⇒ convergence in probability (via Markov)
Mean convergence provides a direct handle on tail probabilities. For any ε > 0 and r > 0, Markov’s inequality gives P(|Xn − X| > ε) ≤ E[|Xn − X|^r] / ε^r.
Therefore, Xn →Lr X implies Xn →p X, and the probability of large deviations inherits the Lr rate (see Markov's inequality).
As a rule of thumb, if E[|Xn − X|^r] ≤ C n^−α, then P(|Xn − X| > ε) ≤ C ε^−r n^−α.
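As a quick sanity check, the Markov bound can be verified by Monte Carlo. The sketch below is illustrative only: an Exponential(1) variable stands in for the error |Xn − X|, an assumption made purely for the demonstration.

```python
import random

# Monte Carlo check of Markov's bound: P(W > eps) <= E[W^r] / eps^r.
# W ~ Exponential(1) is a stand-in for the error |Xn - X| (an assumption).
random.seed(0)
n = 100_000
w = [random.expovariate(1.0) for _ in range(n)]

r, eps = 2.0, 1.5
tail = sum(1 for x in w if x > eps) / n   # empirical P(W > eps)
moment = sum(x ** r for x in w) / n       # empirical E[W^r]
bound = moment / eps ** r

assert tail <= bound
print(f"empirical tail {tail:.4f} <= Markov bound {bound:.4f}")
```

Here the true tail is exp(−1.5) ≈ 0.22 while the bound is E[W^2]/ε^2 ≈ 0.89: Markov is loose but always valid, which is exactly what makes it a safe bridge from moments to probabilities.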
If s > r ≥ 1, Ls ⇒ Lr (via Hölder)
Higher-moment convergence dominates lower-moment convergence. On a probability space (total mass 1), if E[|Xn − X|^s] → 0 with s > r ≥ 1, then E[|Xn − X|^r] → 0.
One way to see this is by Hölder or Lyapunov inequalities: ||Y||r ≤ ||Y||s, hence ||Xn − X||r ≤ ||Xn − X||s → 0 (for background, see Hölder's inequality). This also yields a quantitative bound: E[|Xn − X|^r] ≤ (E[|Xn − X|^s])^(r/s).
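The monotonicity of Lr norms on a probability space is easy to confirm empirically. This minimal sketch uses standard Gaussian samples (an arbitrary choice, assumed only for illustration) and checks ||Y||1 ≤ ||Y||2 ≤ ||Y||4:

```python
import random

# Empirical check of Lyapunov's inequality ||Y||_r <= ||Y||_s for r < s,
# valid on a probability space (sample averages play the role of expectations).
random.seed(1)
samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]

def lr_norm(vals, r):
    """Empirical Lr norm: (E|Y|^r)^(1/r) with uniform weights."""
    return (sum(abs(v) ** r for v in vals) / len(vals)) ** (1.0 / r)

n1, n2, n4 = lr_norm(samples, 1), lr_norm(samples, 2), lr_norm(samples, 4)
assert n1 <= n2 <= n4
print(f"||Y||_1={n1:.3f}  ||Y||_2={n2:.3f}  ||Y||_4={n4:.3f}")
```

For a standard normal the exact values are sqrt(2/π) ≈ 0.80, 1, and 3^(1/4) ≈ 1.32, so the ordering is strict here.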
When converses fail and how UI restores L1
Converse implications can fail dramatically. A classic counterexample is Xn = n with probability 1/n and 0 otherwise. Then Xn →p 0 because P(|Xn| > ε) = 1/n, yet E|Xn| = 1, so there is no L1 convergence to 0.
The issue is mass “spiking” to large values with shrinking probability. Uniform integrability (UI) rules out such spikes.
If Xn →p X and the family {|Xn|} is uniformly integrable, then Xn →L1 X (a Vitali-type principle; see Uniform integrability). Practical UI checks include boundedness in Lp for some p > 1, or domination by an integrable envelope Y with |Xn| ≤ Y almost surely.
In many estimation problems, verifying a moment bound with p > 1 is the simplest path to L1 convergence.
A toolkit for proving Lr convergence and obtaining rates
A small set of inequalities covers most Lr convergence proofs and rate bounds. You will use Jensen to exploit convexity, Hölder and Minkowski to decompose sums and handle products, and Markov/Chebyshev to move between moments and tail probabilities.
The patterns below are reusable in estimators, stochastic optimization, and time-series settings.
Jensen-based bounds for convex losses
Convexity simplifies r-th power bounds. For r ≥ 1, the map t ↦ |t|^r is convex, so |(1/n) ∑ ai|^r ≤ (1/n) ∑ |ai|^r for any real numbers ai.
Applied pointwise to random variables, this yields E| (1/n) ∑ Zi |^r ≤ (1/n) ∑ E|Zi|^r. As a quick corollary, for i.i.d. errors Zi with finite r-th moment, E|n^−1 ∑ Zi|^r ≤ E|Z|^r. For r = 2 and centered Zi, the exact identity is E|n^−1 ∑ Zi|^2 = Var(Z)/n.
Jensen also gives contraction under conditioning: E|E[Z | G]|^r ≤ E|Z|^r for r ≥ 1, by the conditional Jensen inequality applied to the convex map |·|^r. This is especially useful when you replace a random target with its conditional mean prediction: Lr risk cannot increase under conditioning.
Hölder and Minkowski to handle sums and norms
Hölder controls products and cross-terms, while Minkowski (triangle inequality in Lr) handles sums. If 1/p + 1/q = 1/r with p, q, r ≥ 1, then E|UV|^r ≤ ||U||p^r ||V||q^r by Hölder, and thus ||UV||r ≤ ||U||p ||V||q.
This makes product errors manageable when you know separate moment bounds on the factors. Minkowski yields ||∑k Yk||r ≤ ∑k ||Yk||r and hence ||Yn + Zn − (Y + Z)||r ≤ ||Yn − Y||r + ||Zn − Z||r (see Minkowski inequality).
That proves closure of Lr convergence under finite sums and extends to vector-valued variables via a norm on the ambient space. For sample means and linear estimators, combine these with independence- or variance-based steps to get sharper rates, especially in L2.
Markov/Chebyshev for tail-to-mean control
Tail bounds and moment bounds translate into each other. Markov gives P(|W| > ε) ≤ E[|W|^r] / ε^r, and Chebyshev is the r = 2 case with centered W: P(|W| > ε) ≤ Var(W) / ε^2.
Conversely, you can represent moments by tails: E|W|^r = r ∫0^∞ t^(r−1) P(|W| > t) dt. If the tails P(|Xn − X| > t) decay uniformly in n under an integrable envelope, this representation turns tail decay into Lr convergence.
For rates, combine moment control with a tail inequality. Example: if E[|Xn − X|^2] ≤ σ^2 / n (mean-square), then P(|Xn − X| > ε) ≤ σ^2 / (n ε^2). If you only have sub-exponential or sub-Gaussian tails, use those bounds directly to infer Lr rates via integration or known moment formulas.
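The Chebyshev rate for a sample mean is easy to validate by simulation. This sketch assumes standard Gaussian observations (so the true exceedance probability is far below the bound):

```python
import random

# Chebyshev rate for the sample mean:
#   P(|mean - mu| > eps) <= sigma^2 / (n * eps^2)
random.seed(3)
mu, sigma = 0.0, 1.0
n, eps, trials = 100, 0.25, 20_000

exceed = 0
for _ in range(trials):
    m = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if abs(m - mu) > eps:
        exceed += 1

empirical = exceed / trials
bound = sigma ** 2 / (n * eps ** 2)   # = 1 / 6.25 = 0.16
assert empirical <= bound
print(f"empirical {empirical:.4f} <= Chebyshev bound {bound:.4f}")
```

The gap between the empirical frequency (about 0.01 here) and the bound (0.16) is typical: Chebyshev uses only two moments, so sub-Gaussian tail bounds would be much tighter when available.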
Rate templates you can reuse
Useful patterns recur across estimators, and you can often read off Lr rates from moment assumptions.
- Sample mean with finite variance: If Xi are i.i.d. with Var(X) = σ^2, then E[(X̄n − μ)^2] = σ^2 / n, so L2 convergence is O(n^−1) in MSE and O(n^−1/2) in RMSE. For sub-Gaussian Xi, E|X̄n − μ|^r ≤ C_r σ^r n^(−r/2).
- Empirical risk minimization with Lipschitz loss: If ℓ is L-Lipschitz and Ẑn → Z in Lr, then ||ℓ(Ẑn) − ℓ(Z)||r ≤ L ||Ẑn − Z||r, so the Lr rate transfers up to the Lipschitz factor.
- Linear estimators: If θ̂n = A_n Y with bounded operator norm ||A_n||op ≤ C and Y has finite r-th moment, then ||θ̂n − θ||r ≤ C ||Y − EY||r up to bias terms; in L2 you can separate variance and bias to quantify rates.
- Martingale averages (sketch): If Mn is a square-integrable martingale with predictable quadratic variation ⟨M⟩n ≍ n, then E|Mn / n|^2 ≍ 1/n and L2 convergence follows; sharper Lr rates use Burkholder–Davis–Gundy but the 1/√n pattern persists.
Lp completeness and the Cauchy criterion
Completeness guarantees that “approaching something” in Lr means there is actually a limit X in the space. For r ≥ 1, Lp spaces are Banach spaces, hence complete.
Therefore, if (Xn) is Cauchy in Lr, there exists X with E[|Xn − X|^r] → 0. To use this in practice, verify the Cauchy property: for every ε > 0, find N such that for all m, n ≥ N, E[|Xm − Xn|^r] < ε.
A common way is to show summability of consecutive differences: if ∑n E[|Xn+1 − Xn|^r]^(1/r) < ∞, then (Xn) is Lr-Cauchy by the triangle inequality. Completeness then supplies the limit without having to guess X. You can later identify X via distributional or almost sure arguments.
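The summable-increments criterion can be illustrated numerically. In the sketch below (an assumed toy construction), Xn = Σ_{k≤n} Zk / 2^k with i.i.d. standard normal Zk, so ||X_{n+1} − X_n||2 = 2^{−(n+1)} is summable and the triangle inequality gives ||Xm − Xn||2 ≤ 2^{−n} for m > n:

```python
import random

# Lr-Cauchy via summable increments: Xn = sum_{k=1}^n Zk / 2^k, Zk ~ N(0,1).
# ||X_{n+1} - X_n||_2 = 2^{-(n+1)} is summable, so (Xn) is L2-Cauchy and
# ||Xm - Xn||_2 <= sum_{k>n} 2^{-k} = 2^{-n} for m > n.
random.seed(4)
trials = 10_000

def x_pair(n, m):
    """Sample (Xn, Xm) jointly for one realization of the Zk."""
    xn = xm = 0.0
    for k in range(1, m + 1):
        zk = random.gauss(0.0, 1.0)
        xm += zk / 2 ** k
        if k <= n:
            xn += zk / 2 ** k
    return xn, xm

n, m = 10, 30
dist_sq = sum((b - a) ** 2 for a, b in (x_pair(n, m) for _ in range(trials))) / trials
assert dist_sq ** 0.5 <= 2.0 ** (-n)   # empirical ||Xm - Xn||_2 within the bound
print(f"||X{m} - X{n}||_2 ~ {dist_sq ** 0.5:.2e} <= {2.0 ** (-n):.2e}")
```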
Stability and algebra of Lr convergence
Lr convergence behaves well under common operations, provided you track the needed moments. Lipschitz maps preserve rates, sums are straightforward by Minkowski, products require mixed-moment control via Hölder, and conditional expectation acts as a contraction in Lr.
Lipschitz and smooth transformations
If f is L-Lipschitz, then ||f(Xn) − f(X)||r ≤ L ||Xn − X||r for r ≥ 1. This is an Lr analog of the continuous mapping principle.
A Lipschitz constant thus passes both convergence and rates to the transformed variables. For differentiable f, the mean value theorem gives |f(Xn) − f(X)| ≤ |f′(ξn)| |Xn − X| for some ξn between Xn and X.
If |f′| has at most polynomial growth and you have corresponding moment bounds for Xn and X, Hölder yields ||f(Xn) − f(X)||r ≤ C ||Xn − X||p with r, p adjusted to handle the growth. In practice, check that E[|f′(Xn)|^q] stays bounded for some q > 1 to apply Hölder.
Sums, products, and conditional expectation
Algebraic closure comes from Minkowski, Hölder, and the linearity/contraction of conditional expectation.
- Sums: If Xn →Lr X and Yn →Lr Y, then Xn + Yn →Lr X + Y by Minkowski: ||(Xn + Yn) − (X + Y)||r ≤ ||Xn − X||r + ||Yn − Y||r.
- Products: If Xn →Lp X and Yn →Lq Y with 1/p + 1/q = 1/r and r ≥ 1, then Xn Yn →Lr XY. Hölder gives ||Xn Yn − XY||r ≤ ||Xn − X||p ||Yn||q + ||X||p ||Yn − Y||q, and since Yn →Lq Y forces supn ||Yn||q < ∞, both terms vanish.
- Conditional expectation: E[· | G] is an Lr contraction for r ≥ 1: ||E[Z | G]||r ≤ ||Z||r. Hence if Xn →Lr X, then E[Xn | G] →Lr E[X | G]. This stability under “smoothing” is central in regression and filtering.
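The Hölder product bound can be checked empirically with p = q = 2 and r = 1. The sketch below assumes a toy setup: X, Y standard normal and Xn, Yn perturbed versions with small additive noise.

```python
import random

# Holder product bound with p = q = 2, r = 1:
#   ||Xn*Yn - X*Y||_1 <= ||Xn - X||_2 * ||Yn||_2 + ||X||_2 * ||Yn - Y||_2
# Toy setup: X, Y ~ N(0,1) and Xn = X + noise/n, Yn = Y + noise/n.
random.seed(5)
trials, n = 100_000, 10

lhs = ex2 = dx2 = ey2n = dy2 = 0.0
for _ in range(trials):
    x, y = random.gauss(0, 1), random.gauss(0, 1)
    xn = x + random.gauss(0, 1) / n
    yn = y + random.gauss(0, 1) / n
    lhs += abs(xn * yn - x * y)       # |Xn*Yn - X*Y|
    ex2 += x * x                      # E[X^2]
    dx2 += (xn - x) ** 2              # E[(Xn - X)^2]
    ey2n += yn * yn                   # E[Yn^2]
    dy2 += (yn - y) ** 2              # E[(Yn - Y)^2]

lhs /= trials
rhs = (dx2 / trials) ** 0.5 * (ey2n / trials) ** 0.5 \
    + (ex2 / trials) ** 0.5 * (dy2 / trials) ** 0.5
assert lhs <= rhs
print(f"L1 product error {lhs:.4f} <= Holder bound {rhs:.4f}")
```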
Bias–variance decomposition and mean-square convergence to constants
Mean-square convergence to a constant c is governed by a simple identity: E[(Xn − c)^2] = Var(Xn) + (E[Xn] − c)^2. Thus Xn →L2 c if and only if Var(Xn) → 0 and E[Xn] → c.
This characterization cleanly separates stochastic variability from systematic error (bias). For the sample mean X̄n of i.i.d. variables with mean μ and variance σ^2, the bias is zero and Var(X̄n) = σ^2 / n, so X̄n →L2 μ at rate 1/n in MSE.
In linear regression with exogenous regressors and finite second moments, the OLS estimator has bias that vanishes under correct specification and variance that shrinks with sample size according to the design. L2 consistency follows by checking both terms.
The same checklist applies to shrinkage estimators. Ensure bias goes to zero at a rate that does not dominate variance.
Checklist: Var(Xn) → 0 and E[Xn] → c
A fast diagnostic for L2 → c is to verify two conditions:
- Bias: E[Xn] → c.
- Variance: Var(Xn) → 0.
For X̄n with E|X|^2 < ∞, E[X̄n] = μ and Var(X̄n) = σ^2 / n, so both conditions hold. For OLS under standard assumptions (exogeneity, no perfect multicollinearity, finite second moments), the estimator’s bias is zero and its variance shrinks roughly like 1/n times the inverse Gram matrix; this yields L2 convergence to the true parameter as n increases.
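Because the bias-variance identity is purely algebraic, it holds exactly for empirical moments as well (using the 1/n-normalized variance). The sketch below, with arbitrary illustrative Gaussian data, verifies it to floating-point precision:

```python
import random

# Bias-variance identity for mean-square error against a constant c:
#   E[(X - c)^2] = Var(X) + (E[X] - c)^2
# Algebraic, so it holds exactly for empirical moments (1/n variance).
random.seed(6)
xs = [random.gauss(2.0, 0.5) for _ in range(10_000)]
c = 1.8

n = len(xs)
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n   # population-style variance
mse = sum((x - c) ** 2 for x in xs) / n      # empirical mean-square error

assert abs(mse - (var + (mean - c) ** 2)) < 1e-9
print(f"MSE {mse:.5f} = Var {var:.5f} + bias^2 {(mean - c) ** 2:.5f}")
```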
Heavy-tailed case studies: when L1 holds but L2 fails
Heavy tails expose the difference between L1 and L2 convergence. The key is whether first and second moments exist. Pareto and stable distributions provide concrete thresholds that guide estimator choice and convergence guarantees.
Pareto(α) thresholds for L1 vs L2
For a Pareto distribution with tail index α > 0 (P(X > t) ≍ t^−α; see Pareto distribution), the first moment exists iff α > 1, and the second moment exists iff α > 2.
If Xi are i.i.d. Pareto(α) with α ∈ (1, 2], the sample mean still converges in L1 and almost surely to μ by a truncation argument. It does not converge in L2 because Var(X) is infinite. Consequently, MSE-based guarantees fail, while MAE-style measures still behave.
When α ≤ 1, even the mean is infinite. In this regime, the sample mean is unstable. L1 convergence to a finite constant cannot hold, and robust summaries (median, trimmed means) or scale-invariant criteria are preferable.
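The α thresholds can be made concrete via the tail representation E[X^r] = r ∫0^∞ t^(r−1) P(X > t) dt. The sketch below (a numerical integration whose grid is an assumption of the example) uses a Pareto law with scale 1 and α = 1.5: the r = 1 partial integrals stabilize near the true mean α/(α−1) = 3, while the r = 2 partial integrals keep growing without bound.

```python
# Pareto(alpha) with scale 1: P(X > t) = t^(-alpha) for t >= 1, else 1.
# Compare partial tail integrals of E[X^r] = r * int_0^T t^(r-1) P(X > t) dt.
def partial_moment(r, alpha, T, steps=200_000):
    """Midpoint-rule approximation of r * int_0^T t^(r-1) * P(X > t) dt."""
    h = T / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        surv = 1.0 if t < 1.0 else t ** (-alpha)
        total += r * t ** (r - 1) * surv * h
    return total

alpha = 1.5
m1_small, m1_big = partial_moment(1, alpha, 100), partial_moment(1, alpha, 10_000)
m2_small, m2_big = partial_moment(2, alpha, 100), partial_moment(2, alpha, 10_000)

assert m1_big - m1_small < 0.3    # first moment: tail contribution vanishes
assert m2_big - m2_small > 300    # second moment: still growing (diverges)
print(f"r=1 partials: {m1_small:.2f} -> {m1_big:.2f} (converges to 3)")
print(f"r=2 partials: {m2_small:.1f} -> {m2_big:.1f} (diverges)")
```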
α-stable domains and estimator behavior
α-stable laws with α ∈ (1, 2) have finite means but infinite variances; with α ≤ 1, even means do not exist (see Stable distribution).
For α ∈ (1, 2), expect L1-type convergence for averages and avoid L2 claims. For α ≤ 1, design estimators around medians or quantiles.
In regression with α-stable noise (α < 2), OLS can behave erratically. For α ∈ (1, 2) it remains unbiased under exogeneity but exhibits large dispersion and slow concentration. For α ≤ 1, the mean is undefined and “unbiasedness” is not meaningful.
Switching to L1 regression (least absolute deviations) or Huber losses improves stability. In practice, diagnose tail behavior and pick r accordingly: r = 1 or robust losses for α ≤ 2, r = 2 for sub-Gaussian or sub-exponential errors.
The quasi-norm regime 0 < r < 1: what changes and pitfalls
When 0 < r < 1, |·|^r is not convex and the triangle inequality fails, so ||·||r is a quasi-norm rather than a norm. Many convenient tools (Minkowski, Jensen in the usual direction) break, and you must avoid arguments that implicitly assume convexity.
That said, the Markov bound still works: P(|Xn − X| > ε) ≤ E[|Xn − X|^r] / ε^r, so Lr convergence still implies convergence in probability. However, closure under sums is delicate.
In fact, ||X + Y||r^r ≤ ||X||r^r + ||Y||r^r holds for all 0 < r ≤ 1 (subadditivity of t ↦ t^r), so the r-th powers of the quasi-norms remain subadditive even though the quasi-norms themselves are not. Transferring rates through nonlinear maps requires tailored bounds. If you can, lift to an Ls space with s ≥ 1 via moment assumptions, apply standard tools there, and then infer r-level statements.
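The pointwise subadditivity of t ↦ t^r for 0 < r ≤ 1 is the fact that saves additivity at the level of r-th powers; a quick exhaustive check over random reals (an illustrative sketch):

```python
import random

# Subadditivity of t -> t^r for 0 < r <= 1: |x + y|^r <= |x|^r + |y|^r.
# It is this inequality, not the triangle inequality, that survives in the
# quasi-norm regime, making ||.||_r^r subadditive.
random.seed(7)
r = 0.5
for _ in range(10_000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    assert abs(x + y) ** r <= abs(x) ** r + abs(y) ** r + 1e-12
print("subadditivity of t^0.5 verified on 10,000 random pairs")
```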
Beyond real-valued variables: vectors and Banach/Hilbert spaces
Convergence in mean extends seamlessly to random vectors and, more generally, to random elements in normed spaces. The definition replaces absolute value with the ambient norm: Xn →Lr X in Rd means E[||Xn − X||^r] → 0 for a chosen norm ||·||.
In finite dimensions, all norms are equivalent, so componentwise Lr convergence and norm-based Lr convergence are interchangeable up to constants. In infinite-dimensional settings (function spaces), use Bochner integrability and the geometry of Banach or Hilbert spaces to define moments and convergence.
Random vectors: componentwise and norm-based views
For Xn, X in Rd with Euclidean norm, E[||Xn − X||^2] → 0 is exactly L2 convergence in the multivariate sense. Because norms are equivalent, ||·||1, ||·||2, ||·||∞ yield the same convergence notion (though rates differ by constants).
Componentwise L2 convergence implies vector L2 convergence and vice versa. Stability properties (Lipschitz maps, sums, products via bilinear forms) carry over unchanged.
Bochner integrability in Banach/Hilbert spaces
For random elements in a Banach space (B, ||·||), define Lr convergence via E[||Xn − X||^r] → 0 with Bochner integrability: E[||X||^r] < ∞ and measurability.
In Hilbert spaces (e.g., L2[0,1]), inner products enable variance/bias decompositions mirroring finite dimensions. Many Lr tools extend with minimal change, but check moment conditions carefully, especially for nonlinear mappings.
How to choose r in practice: robustness vs efficiency
Choosing r is a trade-off between robustness and statistical efficiency. L2 (mean-square convergence) is optimal under Gaussian-like errors and leads to strong concentration and smooth optimization.
L1 (mean convergence) is robust to outliers and heavy tails, tolerating rare but large deviations that would explode L2 risk.
A practical guideline is:
- Use r = 2 (MSE) when tails are light (sub-Gaussian/sub-exponential), moments are abundant, and you value smooth gradients.
- Use r = 1 (MAE) or robust hybrids (Huber, quantile loss) when you suspect heavy tails, contamination, or want median-like behavior.
- Under model misspecification, L1 often aligns better with median targets, while L2 targets means; choose based on the parameter of interest and the tail index you infer from data.
Examples and counterexamples you should know
A few constructions clarify what does and does not follow in mean convergence. First, Lr ⇒ probability via Markov is a one-line bound: P(|Xn − X| > ε) ≤ E[|Xn − X|^r]/ε^r.
Second, Ls ⇒ Lr for s > r ≥ 1 uses Hölder monotonicity: ||Y||r ≤ ||Y||s. Two counterexamples are essential.
The “spike” example Xn = n with probability 1/n, else 0, shows convergence in probability can fail to imply L1. The other is that convergence in distribution alone never guarantees any Lr convergence: even when Xn and X are defined on the same space, Xn ⇒ X is compatible with E[|Xn − X|^r] staying bounded away from zero when moments are unstable.
These examples motivate adding uniform integrability or stronger moment conditions to recover L1 or L2 statements.
FAQs on convergence in mean
- What conditions guarantee that convergence in probability combined with uniform integrability implies L1 convergence? If Xn →p X and {|Xn|} is uniformly integrable (for example, supn E|Xn|^p < ∞ for some p > 1, or |Xn| ≤ Y with E[Y] < ∞), then Xn →L1 X by a Vitali-type theorem; see Uniform integrability.
- How can I bound E[|Xn − X|^r] to obtain a rate using Jensen, Hölder, or Minkowski? Use convexity for averages (Jensen), triangle inequality for sums (Minkowski), and mixed moments for products (Hölder). Example: for Lipschitz f, E|f(Xn) − f(X)|^r ≤ L^r E|Xn − X|^r; for sample means with finite variance, E|X̄n − μ|^2 = σ^2/n.
- When is mean-square convergence to a constant equivalent to variance going to zero and bias going to zero? Always: E[(Xn − c)^2] = Var(Xn) + (E[Xn] − c)^2, so Xn →L2 c iff Var(Xn) → 0 and E[Xn] → c.
- Does convergence in mean (L1) imply almost sure convergence, and what extra assumptions make it hold? L1 ⇒ probability, but not almost sure. To upgrade to almost sure, add summability of tail probabilities or martingale/summability conditions that force with-probability-one convergence.
- How do Lipschitz or smooth transformations f(.) affect Lr convergence of random variables? Lipschitz maps preserve Lr convergence and rates: ||f(Xn) − f(X)||r ≤ L ||Xn − X||r. For smooth f with controlled derivatives and suitable moments, Hölder transfers convergence with explicit constants.
- Can a sequence converge in probability but fail to converge in L1 even if its first moments are bounded? Yes. The spike example has supn E|Xn| = 1 yet Xn does not converge to 0 in L1; bounded first moments do not ensure uniform integrability.
- What changes when defining convergence in the r-th mean for 0 < r < 1 versus r ≥ 1? |·|^r is not convex and the triangle inequality fails, so several standard tools (Minkowski, Jensen in the usual direction) break. However, Markov still gives Lr ⇒ probability.
- How does convergence in mean extend to random vectors and to Hilbert/Banach space–valued random elements? Replace |·| with the ambient norm and require Bochner integrability: E[||Xn − X||^r] → 0 with E[||X||^r] < ∞. In finite dimensions, componentwise and norm-based views are equivalent.
- Which algebraic operations (sums, products, conditioning) preserve Lr convergence and under what moment conditions? Sums: always, by Minkowski. Products: need Hölder-compatible moments p, q with 1/p + 1/q = 1/r. Conditional expectation: contraction in Lr for r ≥ 1.
- How do I check that a sequence is Cauchy in Lr and use completeness to prove convergence? Show for any ε > 0 there is N with E[|Xm − Xn|^r] < ε for m, n ≥ N. A sufficient condition is ∑n ||Xn+1 − Xn||r < ∞. Completeness of Lp (r ≥ 1) then guarantees a limit.
- In practice, when should I choose L1 over L2 convergence criteria for estimator consistency and robustness to outliers? Choose L2 when tails are light and efficiency matters (squared loss, Gaussian noise). Choose L1 or robust losses when you expect heavy tails or contamination; L1 remains meaningful when L2 breaks (e.g., Pareto with α ∈ (1, 2]).
- What are concrete heavy-tailed examples where L1 holds but L2 fails, and how do parameter thresholds determine this? For Pareto(α) with α ∈ (1, 2], E|X| < ∞ but Var(X) = ∞, so averages converge in L1 (and almost surely) but not in L2; see Pareto distribution. For α-stable with α ∈ (1, 2), the mean exists but variance is infinite; L1 behavior persists while L2 fails (see Stable distribution).