Quotes and Notes from Statistical Theory: A Concise Introduction (2ed)

by Felix Abramovich and Ya’acov Ritov

Get the book from Routledge (here) or Amazon (here)

This is a truly excellent book explaining the underlying ideas, mathematics, and principles of major statistical concepts. Its organization is superb, and the authors’ commentary on why a theorem is so useful and how the presented ideas fit together and/or contrast is invaluable (and, quite frankly, better than I have seen anywhere else).

1 Introduction

Suppose the observed data is the sample \(\mathbf{y} = \{y_1, \ldots, y_n\}\) of size \(n\). We will model \(\mathbf{y}\) as a realization of an \(n\)-dimensional random vector \(\mathbf{Y} = \{Y_1, \ldots, Y_n\}\) with a joint distribution \(f_\mathbf{Y}(\mathbf{y})\). The true distribution of the data is rarely completely known; nevertheless, it can often be reasonable to assume that it belongs to some family of distributions \(\mathcal{F}\). We will assume that \(\mathcal{F}\) is a parametric family, that is, that we know the type of distribution \(f_\mathbf{Y}(\mathbf{y})\) up to some unknown parameter(s) \(\theta \in \Theta\), where \(\Theta\) is a parameter space. Typically, we will consider the case where \(y_1, \ldots, y_n\) are the results of independent identical experiments. In this case, \(Y_1, \ldots, Y_n\) can be treated as independent, identically distributed random variables with the common distribution \(f_\theta(y)\) from a parametric family of distributions \(\mathcal{F}_\theta\), \(\theta \in \Theta\).

Define the likelihood function \(L(\theta; \mathbf{y}) = P_\theta(\mathbf{y})\) — the probability of observing the given data \(\mathbf{y}\) for any possible value of \(\theta \in \Theta\). First assume that \(\mathbf{Y}\) is discrete. The value \(L(\theta; \mathbf{y})\) can be viewed as a measure of how likely the value \(\theta\) is given the observed data \(\mathbf{y}\). If \(L(\theta_1; \mathbf{y}) > L(\theta_2; \mathbf{y})\) for a given \(\mathbf{y}\), we can say that the value \(\theta_1\) is better suited to the data than \(\theta_2\). For a continuous random vector \(\mathbf{Y}\), the likelihood is defined through the joint density, \(L(\theta; \mathbf{y}) = f_\theta(\mathbf{y})\), and the likelihood ratio \(L(\theta_1; \mathbf{y})/L(\theta_2; \mathbf{y})\) shows the strength of the evidence in favor of \(\theta = \theta_1\) vs. \(\theta = \theta_2\).
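As a minimal sketch of this idea (assuming a hypothetical Bernoulli sample, not an example from the book), one can compute the likelihood at two candidate values of \(\theta\) and compare them:

```python
import numpy as np

# Hypothetical Bernoulli sample; theta is the success probability.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # 7 successes out of 10

def likelihood(theta, y):
    """L(theta; y) = prod_i theta^y_i * (1 - theta)^(1 - y_i) for i.i.d. Bernoulli data."""
    return np.prod(theta**y * (1 - theta)**(1 - y))

L1, L2 = likelihood(0.7, y), likelihood(0.4, y)
print(L1, L2, L1 / L2)  # L(0.7; y) > L(0.4; y): 0.7 is more likely for this sample
```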

A statistic \(T(\mathbf{Y})\) is any real or vector-valued function that can be computed from the data alone. A statistic \(T(\mathbf{Y})\) is sufficient for an unknown parameter \(\theta\) if the conditional distribution of all the data \(\mathbf{Y}\) given \(T(\mathbf{Y})\) does not depend on \(\theta\). In other words, given \(T(\mathbf{Y})\), no additional information on \(\theta\) can be extracted from \(\mathbf{y}\). This definition allows one to check whether a given statistic \(T(\mathbf{Y})\) is sufficient for \(\theta\), but it does not provide a constructive way to find one.

However, the Fisher-Neyman Factorization Theorem says that a statistic \(T(\mathbf{Y})\) is sufficient for \(\theta\) iff, for all \(\theta \in \Theta\), the likelihood factorizes as \(L(\theta; \mathbf{y}) = g(T(\mathbf{y}), \theta) \cdot h(\mathbf{y})\), where the function \(g(\cdot)\) depends on \(\theta\) and the statistic \(T(\mathbf{Y})\), while \(h(\mathbf{y})\) does not depend on \(\theta\). In particular, if the likelihood \(L(\theta; \mathbf{y})\) depends on the data only through \(T(\mathbf{Y})\), then \(T(\mathbf{Y})\) is a sufficient statistic for \(\theta\) and \(h(\mathbf{y}) = 1\).
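As a quick illustration (the standard Poisson example, not a quote from the book): for \(Y_1, \ldots, Y_n \sim Pois(\lambda)\),

\[
L(\lambda; \mathbf{y}) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} = \underbrace{e^{-n\lambda}\,\lambda^{\sum_{i=1}^n y_i}}_{g(T(\mathbf{y}),\,\lambda)} \cdot \underbrace{\Big(\prod_{i=1}^n y_i!\Big)^{-1}}_{h(\mathbf{y})},
\]

so \(T(\mathbf{Y}) = \sum_{i=1}^n Y_i\) is sufficient for \(\lambda\).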

A sufficient statistic is not unique. For example, the entire sample \(\mathbf{Y} = \{Y_1, \ldots, Y_n\}\) is always a (trivial) sufficient statistic. We may instead seek a minimal sufficient statistic, one achieving the maximal reduction of the data. A statistic \(T(\mathbf{Y})\) is called a minimal sufficient statistic if it is a function of any other sufficient statistic.

Another important property of a statistic is completeness. Let \(Y_1, \ldots, Y_n \sim f_\theta(y)\), where \(\theta \in \Theta\). A statistic \(T(\mathbf{Y})\) is complete if no statistic \(g(T)\) exists (except \(g(T)=0\)) such that \(E_\theta g(T) = 0\) for all \(\theta \in \Theta\). In other words, if \(E_\theta g(T) = 0\) for all \(\theta \in \Theta\), then necessarily \(g(T)=0\). Verifying completeness for a general distribution can be a nontrivial mathematical problem, but thankfully it is much simpler for the exponential family of distributions, which includes many of the “common” distributions. Completeness is useful in determining minimal sufficiency because if a sufficient statistic \(T(\mathbf{Y})\) is complete, then it is also minimal sufficient. (Note, however, that a minimal sufficient statistic may not necessarily be complete.)
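A standard illustration of how completeness is verified (the Bernoulli case, not quoted from the book): for \(Y_1, \ldots, Y_n \sim Bern(p)\), the statistic \(T = \sum_{i=1}^n Y_i \sim Bin(n, p)\) and

\[
E_p\, g(T) = \sum_{t=0}^{n} g(t)\binom{n}{t} p^t (1-p)^{n-t} = 0 \quad \text{for all } p \in (0,1)
\]

is, after dividing by \((1-p)^n\), a polynomial identity in \(p/(1-p)\). All its coefficients \(g(t)\binom{n}{t}\) must therefore vanish, so \(g \equiv 0\) and \(T\) is complete.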

A (generally multivariate) family of distributions \(\{f_\theta(\mathbf{y}): \theta \in \Theta\}\) is said to be a (one-parameter) exponential family if: (1) \(\Theta\) is an open interval, (2) the support of the distribution \(f_\theta\) does not depend on \(\theta\), and (3) \(f_\theta(\mathbf{y}) = \exp\{c(\theta)T(\mathbf{y}) + d(\theta) + S(\mathbf{y})\}\), where \(c(\cdot)\), \(T(\cdot)\), \(d(\cdot)\), and \(S(\cdot)\) are known functions; \(c(\theta)\) is usually called the natural parameter of the distribution. We say that \(f_\theta\), where \(\theta = (\theta_1, \ldots, \theta_p)\), belongs to a \(k\)-parameter exponential family by changing (3) such that \(f_\theta(\mathbf{y}) = \exp\{ \sum_{j=1}^k c_j(\theta)T_j(\mathbf{y}) + d(\theta) + S(\mathbf{y})\}\). The functions \(c(\theta) = \{c_1(\theta), \ldots, c_k(\theta)\}\) are the natural parameters of the distribution. (Note that the dimensionality \(p\) of the original parameter \(\theta\) is not necessarily the same as the dimensionality \(k\) of the natural parameter \(c(\theta)\).)
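For example (standard, not quoted from the book), the Bernoulli distribution forms a one-parameter exponential family:

\[
f_p(y) = p^y (1-p)^{1-y} = \exp\Big\{ y \ln\frac{p}{1-p} + \ln(1-p) \Big\}, \quad y \in \{0, 1\},
\]

with \(T(y) = y\), natural parameter \(c(p) = \ln\frac{p}{1-p}\) (the logit), \(d(p) = \ln(1-p)\), and \(S(y) = 0\).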

Consider a random sample \(Y_1, \ldots, Y_n\), where \(Y_i \sim f_\theta(y)\) and \(f_\theta\) belongs to a \(k\)-parameter exponential family of distributions. Then (1) the joint distribution of \(\mathbf{Y} = (Y_1, \ldots, Y_n)\) also belongs to the \(k\)-parameter exponential family, (2) \(T_\mathbf{Y} = (\sum_{i=1}^n T_1(Y_i), \ldots, \sum_{i=1}^n T_k(Y_i))\) is a sufficient statistic for \(c(\theta) = (c_1(\theta), \ldots, c_k(\theta))\) (and, therefore, for \(\theta\)), and (3) if some regularity conditions hold, then \(T_\mathbf{Y}\) is complete and therefore minimal sufficient (if the latter exists).

2 Point Estimation

Estimation of the unknown parameters of distributions from the data is one of the key issues in statistics. A (point) estimator \(\hat{\theta} = \hat{\theta}(\mathbf{Y})\) of an unknown parameter \(\theta\) is any statistic used for estimating \(\theta\). The value of \(\hat{\theta}(\mathbf{y})\) evaluated for a given sample is called an estimate. This is a general, somewhat trivial definition that does not say anything about the goodness of estimation; one would evidently be interested in “good” estimators.

Maximum Likelihood Estimation is the most widely used method of estimating parameters in parametric models. As we’ve discussed, \(L(\theta; \mathbf{y})\) is a measure of how likely the parameter value \(\theta\) is for the observed data \(\mathbf{y}\). It is only natural then to seek the “most likely” value of \(\theta\). The MLE \(\hat{\theta}\) of \(\theta\) is \(\hat{\theta} = \arg\max_{\theta\in\Theta} L(\theta; \mathbf{y})\) — the value of \(\theta\) that maximizes the likelihood. Often we are interested in a function of a parameter \(\xi = g(\theta)\). When \(g(\cdot)\) is 1:1, the MLE of \(\xi\) is the function \(g(\cdot)\) applied to the MLE of \(\theta\): \(\hat{\xi} = g(\hat{\theta})\); the proof is just a reparameterization of the likelihood in terms of \(\xi\) instead of \(\theta\). Although the concept of the MLE was motivated by an intuitively clear underlying idea, the justification for its use is much deeper. It is a really “good” method of estimation (to be discussed later on the topic of asymptotics).
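A minimal sketch (assuming a hypothetical exponential sample; the closed forms are standard, not quotes from the book) of maximizing the likelihood numerically and of the invariance property:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical i.i.d. Exp(lambda) sample; the MLE is known in closed form: 1 / ybar.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100)  # true rate lambda = 0.5

def neg_log_lik(lam):
    # -log L(lambda; y), where log L = n*log(lambda) - lambda * sum(y)
    return -(len(y) * np.log(lam) - lam * y.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1 / y.mean())  # numerical MLE vs the closed form 1/ybar
print(1 / res.x, y.mean())  # invariance: MLE of the mean g(lambda) = 1/lambda is ybar
```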

Another popular method of estimation is the Method of Moments. Its main idea is to express the population moments of the distribution of the data in terms of its unknown parameter(s) and equate them to the corresponding sample moments. MMEs have some known problems: consider a sample of size 4 from a uniform distribution \(U(0, \theta)\) with the observed sample 0.2, 0.6, 2, and 0.4. The MME is 1.6, which does not make much sense given the observed value of 2 in the sample. For these and related reasons, MMEs are less used than their MLE counterparts. However, MMEs are usually simpler to compute and can be used, for example, as reasonable initial values in numerical iterative procedures for MLEs. On the other hand, the Method of Moments does not require knowledge of the entire distribution of the data (up to the unknown parameters) but only its moments, and thus may be less sensitive to possible misspecification of a model.
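Reproducing the example above in a few lines (the \(U(0, \theta)\) facts \(E[Y] = \theta/2\) and \(\hat{\theta}_{MLE} = \max_i Y_i\) are standard):

```python
import numpy as np

# The U(0, theta) example: E[Y] = theta/2, so equating moments gives theta_hat = 2 * ybar.
y = np.array([0.2, 0.6, 2.0, 0.4])
mme = 2 * y.mean()  # 1.6 -- smaller than the observed value 2.0, which U(0, 1.6) cannot produce
mle = y.max()       # 2.0 -- the MLE for U(0, theta) is the sample maximum
print(mme, mle)
```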

The Method of Least Squares plays a key role in regression and analysis of variance. In a typical regression setup, we are given \(n\) observations \((\mathbf{x}_i, y_i)\), \(i=1, \ldots, n\) over \(m\) explanatory variables \(\mathbf{x} = (x_1, \ldots, x_m)\) and the response variable \(\mathbf{Y}\). We assume that \(y_i = g_\theta(\mathbf{x}_i) + \varepsilon_i\), \(i = 1, \ldots, n\), where the response function \(g_\theta(\cdot): \mathbb{R}^m \rightarrow \mathbb{R}\) has a known parametric form and depends on \(p \le n\) unknown parameters \(\theta = (\theta_1, \ldots, \theta_p)\). The LSE looks for a \(\hat{\theta}\) that yields the best fit \(g_{\hat{\theta}}(\mathbf{x})\) to the observed \(\mathbf{y}\) w.r.t. the Euclidean distance: \(\hat{\theta} = \arg\min_\theta \sum_{i=1}^n (y_i - g_\theta(\mathbf{x}_i))^2\). For linear regression, the solution is available in closed form. For non-linear regression, however, it can generally only be found numerically.
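A linear-regression sketch (hypothetical data; the closed form \(\hat{\theta} = (X^T X)^{-1} X^T \mathbf{y}\) is the standard normal-equations solution):

```python
import numpy as np

# Linear case: g_theta(x) = x @ theta, so the LSE solves the normal equations.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
theta_true = np.array([1.0, 2.0, -0.5])
y = X @ theta_true + rng.normal(scale=0.3, size=n)

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # numerically stable least squares
print(theta_hat)  # close to theta_true
```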

More generally than the three procedures above, one can consider any function \(\rho(\theta, y)\) as a measure of the goodness-of-fit and look for an estimator that maximizes or minimizes \(\sum_{i=1}^n \rho(\theta, y_i)\) w.r.t. \(\theta\). Such estimators are called M-estimators. It turns out that various well-known estimators can be viewed as M-estimators for a particular \(\rho(\theta, y)\), including the sample mean \(\bar{Y}\) as well as the MLE, the LSE, and a generalized version of the MME not yet discussed. As we’ll see when discussing asymptotics, M-estimators share many important asymptotic properties.
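For instance, taking \(\rho(\theta, y) = (y - \theta)^2\) recovers the sample mean:

\[
\frac{d}{d\theta}\sum_{i=1}^n (y_i - \theta)^2 = -2\sum_{i=1}^n (y_i - \theta) = 0 \;\Rightarrow\; \hat{\theta} = \bar{y},
\]

while \(\rho(\theta, y) = -\ln f_\theta(y)\) recovers the MLE and \(\rho(\theta, y) = (y - g_\theta(\mathbf{x}))^2\) the LSE.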

A natural question is how to compare various estimators. First, we should define a measure of goodness-of-estimation. Recall that any estimator \(\hat{\theta} = \hat{\theta}(Y_1, \ldots, Y_n)\) is a function of a random sample and therefore is a random variable itself with a certain distribution, expectation, variance, etc. A somewhat naive attempt to measure the goodness-of-estimation of \(\hat{\theta}\) would be to consider the error \(|\hat{\theta}-\theta|\). However, \(\theta\) is unknown and, as we have mentioned, an estimator \(\hat{\theta}\) is a random variable, hence the value \(|\hat{\theta}-\theta|\) will vary from sample to sample. It may be “small” for some samples and “large” for others, and therefore cannot be used as a proper criterion for the goodness-of-estimation of an estimator \(\hat{\theta}\). A more reasonable measure would be an average distance over all possible samples, that is, the mean absolute error \(E|\hat{\theta}-\theta|\), where the expectation is taken w.r.t. the joint distribution of \(\mathbf{Y} = (Y_1, \ldots, Y_n)\). It can indeed be used as a measure of goodness-of-estimation but, mostly due to the convenience of differentiation, the conventional measure is the mean squared error (MSE) given by \(MSE(\hat{\theta}, \theta) = E(\hat{\theta}-\theta)^2\).

The MSE can be decomposed into two components: \(MSE(\hat{\theta}, \theta) = Var(\hat{\theta}) + b^2(\hat{\theta},\theta)\). The first is the stochastic error (variance) and the second is a systematic or deterministic error (squared bias). Having defined the goodness-of-estimation measure by the MSE, one can compare different estimators and choose the one with the smallest MSE. However, since \(MSE(\hat{\theta}, \theta)\) typically depends on the unknown \(\theta\), it is common that no estimator is uniformly superior for all \(\theta \in \Theta\).
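The decomposition follows by adding and subtracting \(E\hat{\theta}\):

\[
E(\hat{\theta} - \theta)^2 = E\big((\hat{\theta} - E\hat{\theta}) + (E\hat{\theta} - \theta)\big)^2 = E(\hat{\theta} - E\hat{\theta})^2 + (E\hat{\theta} - \theta)^2 = Var(\hat{\theta}) + b^2(\hat{\theta}, \theta),
\]

where \(b(\hat{\theta}, \theta) = E\hat{\theta} - \theta\) is the bias; the cross term vanishes since \(E(\hat{\theta} - E\hat{\theta}) = 0\) while \(E\hat{\theta} - \theta\) is a constant.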

Ideally, a good estimator with a small MSE should have both low variance and low bias. However, it might be hard to achieve both. One common approach is to first control the bias component of the overall MSE and consider unbiased estimators. There is no general rule or algorithm for deriving an unbiased estimator. In fact, unbiasedness is a property of an estimator rather than a method of estimation. One usually checks an MLE or any other estimator for bias. Sometimes one can then modify the original estimator to “correct” its bias. Note that unlike ML estimation, unbiasedness is not invariant under nonlinear transformations of the original parameter: if \(\hat{\theta}\) is an unbiased estimator of \(\theta\), \(g(\hat{\theta})\) is generally a biased estimator of \(g(\theta)\). For example, \(E\hat{\theta}^2 = \theta^2 + Var(\hat{\theta}) > \theta^2\) whenever \(Var(\hat{\theta}) > 0\), so \(\hat{\theta}^2\) systematically overestimates \(\theta^2\).

What does unbiasedness of an estimator \(\hat{\theta}\) actually mean? Suppose we were observing not a single sample but all possible samples of size \(n\) from a sample space and were calculating the estimates \(\hat{\theta}_j\) for each one of them. Unbiasedness means that the average value of \(\hat{\theta}\) over the entire sample space is \(\theta\), but it does not yet guarantee that \(\hat{\theta}_j \approx \theta\) for each particular sample. The dispersion of the \(\hat{\theta}_j\)’s around their average value \(\theta\) might be large and, since in reality we have only a single sample, its particular value of \(\hat{\theta}\) might be quite far from \(\theta\). To ensure with high confidence that \(\hat{\theta}_j \approx \theta\) for any sample, we additionally need the variance \(Var(\hat{\theta})\) to be small.

An estimator \(\hat{\theta}\) is called a uniformly minimum variance unbiased estimator (UMVUE) of \(\theta\) if \(\hat{\theta}\) is unbiased and for any other unbiased estimator \(\tilde{\theta}\) of \(\theta\), \(Var(\hat{\theta}) \le Var(\tilde{\theta})\). If the UMVUE exists, it is necessarily unique. Recall that there is no general algorithm for obtaining unbiased estimators, and a UMVUE in particular. However, there exists a lower bound on the variance of an unbiased estimator, which can be used as a benchmark for evaluating its goodness.

Define the Fisher Information Number \(I(\theta) = E((\ln f_\theta(\mathbf{Y}))'_\theta)^2\). The derivative of the log density w.r.t. \(\theta\) is sometimes called the Score Function; thus, the Fisher Information Number is the expected square of the Score. The Cramer-Rao Lower Bound Theorem states that if \(T\) is an unbiased estimator of \(g(\theta)\), where \(g(\cdot)\) is differentiable, then \(Var(T) \ge (g'(\theta))^2 / I(\theta)\) or, more simply, when \(T\) is an unbiased estimator of \(\theta\), \(Var(T) \ge 1 / I(\theta)\). We are especially interested in the case where \(Y_1, \ldots, Y_n\) is a random sample from a distribution \(f_\theta(y)\). In that case, \(I(\theta) = nI^*(\theta)\), where \(I^*(\theta) = E((\ln f_\theta(Y))'_\theta)^2\) is the Fisher Information Number of \(f_\theta(y)\), and, therefore, for any unbiased estimator \(T\) of \(g(\theta)\), we have \(Var(T) \ge (g'(\theta))^2 / nI^*(\theta)\). There is another, usually more convenient formula for calculating the Fisher Information Number \(I(\theta)\) than its direct definition: \(I(\theta) = -E(\ln f_\theta(\mathbf{Y}))''_\theta\).
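A worked Bernoulli illustration (standard, not quoted from the book): for \(Y \sim Bern(p)\), \(\ln f_p(y) = y\ln p + (1-y)\ln(1-p)\), so

\[
(\ln f_p(y))'_p = \frac{y}{p} - \frac{1-y}{1-p} = \frac{y - p}{p(1-p)}, \qquad I^*(p) = \frac{E(Y-p)^2}{p^2(1-p)^2} = \frac{p(1-p)}{p^2(1-p)^2} = \frac{1}{p(1-p)}.
\]

The sample mean \(\bar{Y}\) is unbiased for \(p\) with \(Var(\bar{Y}) = p(1-p)/n = 1/(nI^*(p))\), so it attains the Cramer-Rao lower bound.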

It is important to emphasize that the CRLB theorem goes in one direction only: if the variance of an unbiased estimator does not achieve the Cramer-Rao lower bound, one still cannot claim that it is not a UMVUE. Nevertheless, the bound can be used as a benchmark for measuring the goodness of an unbiased estimator. One special result related to the exponential family: the Cramer-Rao lower bound can be achieved only for distributions from the exponential family.

The Cramer-Rao lower bound only allows one to evaluate the goodness of a proposed unbiased estimator but does not provide any constructive way to derive it. In fact, as we have argued, there is no such general rule at all. However, if one manages to obtain any initial (even crude) unbiased estimator, it may be possible to improve it. The Rao-Blackwell Theorem shows that if there is an unbiased estimator that is not a function of a sufficient statistic \(W\), one can construct another unbiased estimator based on \(W\) with an MSE not larger than the original one: let \(T\) be an unbiased estimator of \(\theta\) and \(W\) be a sufficient statistic for \(\theta\), and define \(T_1 = E(T|W)\); then \(T_1\) is an unbiased estimator of \(\theta\) and \(Var(T_1) \le Var(T)\). Thus, in terms of MSE, only unbiased estimators based on a sufficient statistic are of interest. This demonstrates again the strength of the notion of sufficiency.
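A simulation sketch of Rao-Blackwellization (the Poisson example is a textbook classic, though not quoted from the book): estimate \(g(\theta) = P_\theta(Y = 0) = e^{-\theta}\) for a \(Pois(\theta)\) sample, starting from the crude unbiased estimator \(T = 1\{Y_1 = 0\}\) and conditioning on the sufficient statistic \(W = \sum_i Y_i\):

```python
import numpy as np

# Given W = w, Y_1 ~ Bin(w, 1/n), so T1 = E(T | W) = (1 - 1/n)^W.
rng = np.random.default_rng(2)
theta, n, reps = 1.5, 20, 100_000
Y = rng.poisson(theta, size=(reps, n))

T = (Y[:, 0] == 0).astype(float)   # crude unbiased estimator of e^{-theta}
T1 = (1 - 1 / n) ** Y.sum(axis=1)  # Rao-Blackwellized estimator

print(np.exp(-theta))              # target: e^{-1.5} ~ 0.223
print(T.mean(), T1.mean())         # both approximately unbiased
print(T.var(), T1.var())           # Var(T1) is much smaller than Var(T)
```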

Does Rao-Blackwellization necessarily yield a UMVUE? Generally not. To guarantee a UMVUE, the additional requirement of completeness of the sufficient statistic \(W\) is needed. The Lehmann-Scheffe Theorem formalizes this: if \(T\) is an unbiased estimator of \(\theta\) and \(W\) is a complete sufficient statistic for \(\theta\), then \(T_1 = E(T|W)\) is the unique UMVUE of \(\theta\). Even without the Lehmann-Scheffe theorem, it can be shown under mild conditions that if the distribution of the data belongs to the exponential family and an unbiased estimator is a function of the corresponding sufficient statistic, it is a UMVUE. Note that despite its elegance, the application of the Rao-Blackwell Theorem in more complicated cases is quite limited. The two main obstacles are finding an initial unbiased estimator \(T\) and calculating the conditional expectation \(E(T|W)\).

3 Confidence Intervals, Bounds, and Regions

When we estimate a parameter, we essentially “guess” its value. This should be a “well educated guess,” hopefully the best of its kind (in whatever sense). However, statistical estimation is made with error. Presenting just the estimator, no matter how good it is, is usually not enough – one should give an idea of the estimation error. The words “estimation” and “estimate” (in contrast to “measure” and “value”) point to the fact that an estimate is an inexact appraisal of a value. Much effort in statistics is invested in trying to quantify the estimation error and in finding ways to express it in a well-defined way.

Any estimator has error; that is, if \(\theta\) is estimated by \(\hat{\theta}\), then \(\hat{\theta} - \theta\) is usually different from \(0\). The standard measure of error is the mean squared error \(MSE = E(\hat{\theta} - \theta)^2\). When the MSE is reported together with the value of the estimator, we get a feeling for how precise the estimate is. It is more common to quote the standard error defined by \(SE(\hat{\theta}) = \sqrt{MSE(\hat{\theta}, \theta)} = \sqrt{E(\hat{\theta}-\theta)^2}\). Unlike the MSE, the SE is measured in the same units as the estimator.

However, the MSE expresses an average squared estimation error but tells nothing about its distribution, which generally might be complicated. One would be interested, in particular, in the probability \(P(|\hat{\theta}-\theta| > c)\) that the estimation error exceeds a certain accuracy level \(c > 0\). Markov’s Inequality (applied to the squared error) enables us to translate the MSE into an upper bound on this probability: \(P(|\hat{\theta}-\theta| \ge c) \le MSE/c^2\). However, this bound is usually very conservative and the actual probability may be much smaller. Typically, a quite close approximation of the error distribution is obtained by the Central Limit Theorem (discussed later).
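A quick check of how conservative the bound is (a sketch assuming the error is exactly normal, \(\hat{\theta} - \theta \sim N(0, SE^2)\)):

```python
import numpy as np
from scipy.stats import norm

se = 0.2
for c in [0.2, 0.4, 0.6]:
    bound = se**2 / c**2                # Markov-type bound: MSE / c^2
    exact = 2 * (1 - norm.cdf(c / se))  # exact normal tail probability
    print(c, bound, exact)
# At c = 0.4 (two SEs) the bound gives 0.25, while the normal probability is ~0.046.
```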

Above, we suggested quoting the SE alongside the estimator itself. However, standard statistical practice is different, and it is not very intuitive. There are a few conceptual difficulties in being precise when talking about the error. Suppose we estimate \(\mu\) with \(\bar{Y}\) and we find \(\bar{Y}=10\) and \(SE=0.2\). We are quite confident that \(\mu\) is between \(9.6\) and \(10.4\) (that is, within two standard errors of \(\bar{Y}\)). However, this statement makes no probabilistic sense. The true \(\mu\) is either in the interval \((9.6,10.4)\) or it’s not. The unknown \(\mu\) is not a random variable – it has a single fixed value, whether or not it is known to the statistician.

The “trick” that has been devised is to move from a probabilistic statement about the unknown (but fixed-valued) parameter to a probabilistic statement about the method. We say something like: “The interval \((9.6,10.4)\) for the value \(\mu\) was constructed by a method which is 95% successful.” Since the method is usually successful, we have confidence in its output.

Let \(Y_1, \ldots, Y_n \sim f_\theta(y)\), \(\theta \in \Theta\). A \((1-\alpha)100\%\) Confidence Interval for \(\theta\) is a pair of scalar-valued statistics \(L=L(Y_1, \ldots, Y_n)\) and \(U=U(Y_1, \ldots, Y_n)\) such that \(P(L \le \theta \le U) \ge 1-\alpha\) for all \(\theta \in \Theta\), with the inequality as close to an equality as possible.

What is the right value of \(\alpha\)? Nothing in the statistical theory dictates a particular choice. But the standard values are \(0.10\), \(0.05\), and \(0.01\) with \(0.05\) being most common.

A pivot is a function \(\psi(Y_1, \ldots, Y_n; \theta)\) of the data and the parameters whose distribution does not depend on unknown parameters. Note that the pivot is not a statistic and cannot be calculated from the data, precisely because it depends on the unknown parameters. On the other hand, since its distribution does not depend on the unknown parameters, we can find an interval \(A_\alpha\) such that \(P(\psi(Y_1, \ldots, Y_n; \theta) \in A_\alpha) = 1 - \alpha\). Then we can invert the inclusion and define the interval (or region in general) \(C_\alpha = \{\theta : \psi(Y_1, \ldots, Y_n; \theta) \in A_\alpha \}\). The set \(C_\alpha\) is then a \((1-\alpha)100\%\) confidence set. Note that \(C_\alpha\) is a random set because it depends on the random sample.
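A concrete sketch for a normal sample (the t pivot is standard; the data here are simulated): \((\bar{Y} - \mu)/(S/\sqrt{n}) \sim t_{n-1}\) regardless of \((\mu, \sigma)\), so inverting \(|\psi| \le t_{n-1, 1-\alpha/2}\) yields the usual confidence interval for \(\mu\):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
y = rng.normal(loc=10, scale=2, size=25)
n, alpha = len(y), 0.05

ybar, s = y.mean(), y.std(ddof=1)   # sample mean and sample standard deviation
q = t.ppf(1 - alpha / 2, df=n - 1)  # t_{n-1, 1 - alpha/2} quantile
L, U = ybar - q * s / np.sqrt(n), ybar + q * s / np.sqrt(n)
print(L, U)                         # a 95% confidence interval for mu
```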

Does a pivot always exist? For a random sample \(Y_1, \ldots, Y_n\) from any continuous distribution with a cdf \(F_\theta(y)\), the value \(-\sum_{i=1}^n \ln F_\theta(Y_i) \sim \frac{1}{2} \chi^2_{2n}\) and, therefore, is a pivot. However, in the general case, it might be difficult (if at all possible) to invert the corresponding confidence interval for this pivot into a confidence interval for the original parameter of interest \(\theta\). More convenient pivots are easy to find when the distribution belongs to a scale-location family.
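One can verify this universal pivot by simulation (a sketch assuming an exponential sample; since \(F_\theta(Y_i) \sim U(0,1)\), the sum of the \(-\ln F_\theta(Y_i)\) terms is \(Gamma(n, 1) = \frac{1}{2}\chi^2_{2n}\)):

```python
import numpy as np
from scipy.stats import expon, chi2, kstest

rng = np.random.default_rng(4)
n, reps = 10, 50_000
Y = rng.exponential(scale=3.0, size=(reps, n))  # F_theta is fully known here
pivot = -np.log(expon.cdf(Y, scale=3.0)).sum(axis=1)

# Twice the pivot should follow chi^2_{2n}; the KS test should not reject.
print(kstest(2 * pivot, chi2(df=2 * n).cdf))
```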

Is a confidence interval for \(\theta\) necessarily unique? Definitely not! Different choices of pivots lead to different forms of confidence intervals. Moreover, even for a given pivot, one can typically construct an infinite set of confidence intervals at the same confidence level. What is the “best” choice for a confidence interval? A conventional, somewhat ad hoc approach is based on error symmetry: set \(P(L>\theta) = P(U<\theta) = \alpha/2\). A more appealing approach would be to seek a confidence interval of a minimal expected length; however, in general, this leads to a nonlinear minimization problem that might not have a solution in closed form.

A parameter of interest may be not the original parameter \(\theta\) but its function \(g(\theta)\). Let \((L,U)\) be a \((1-\alpha)100\%\) confidence interval for \(\theta\) and \(g(\cdot)\) a strictly increasing function. Then \((g(L),g(U))\) is a \((1-\alpha)100\%\) confidence interval for \(g(\theta)\).

The normal confidence intervals are undoubtedly the most important. We do not claim that most data sets are sampled from a normal distribution – this is definitely far from being true. However, what really matters is whether the distribution of an estimator \(\hat{\theta} = \hat{\theta}(Y_1, \ldots, Y_n)\) is close to normal rather than the distribution of \(Y_1, \ldots, Y_n\) themselves. When discussing asymptotics later, we argue that many estimators based on large or even medium sized samples are indeed approximately normal and, therefore, the normal confidence intervals can (at least approximately) be used.

4 Hypothesis Testing

To be added

5 Asymptotic Analysis

To be added

6 Bayesian Inference

To be added

7 Elements of Statistical Decision Theory

To be added

8 Linear Models

To be added

9 Nonparametric Estimation

To be added