Get the book from Routledge (here) or Amazon (here)

This is a truly excellent book explaining the underlying ideas, mathematics, and principles of major statistical concepts. Its organization is suburb, and the authors’ commentary on why a theorem is so useful and how the presented ideas fit together and/or contrast is invaluable (and, quite frankly, better than I have seen anywhere else).

1 Introduction

Suppose the observed data is the sample \(\mathbf{y} = \{y_1, \ldots, y_n\}\) of size \(n\). We will model \(\mathbf{y}\) as a realization of an \(n\)-dimensional random vector \(\mathbf{Y} = \{Y_1, \ldots, Y_n\}\) with a joint distribution \(f_\mathbf{Y}(\mathbf{y})\). The true distribution of the data is rarely completely known; nevertheless, it can often be reasonable to assume that it belongs to some family of distributions \(\mathcal{F}\). We will assume that \(\mathcal{F}\) is a parametric family, that is, that we know the type of distribution \(f_\mathbf{Y}(\mathbf{y})\) up to some unknown parameter(s) \(\theta \in \Theta\), where \(\Theta\) is a parameter space. Typically, we will consider the case where \(y_1, \ldots, y_n\) are the results of independent identical experiments. In this case, \(Y_1, \ldots, Y_n\) can be treated as independent, identically distributed random variables with the common distribution \(f_\theta(y)\) from a parametric family of distributions \(\mathcal{F}_\theta\), \(\theta \in \Theta\).

Define the likelihood function \(L(\theta; \mathbf{y}) = P_\theta(\mathbf{y})\) — the probability to observe the given data \(\mathbf{y}\) for any possible value of \(\theta \in \Theta\). First assume that \(\mathbf{Y}\) is discrete. The value \(L(\theta; \mathbf{y})\) can be viewed as a measure of likeliness of \(\theta\) to the observed data \(\mathbf{y}\). If \(L(\theta_1; \mathbf{y}) > L(\theta_2; \mathbf{y})\) for a given \(\mathbf{y}\), we can say that the value \(\theta_1\) for \(\theta\) is more suited to the data than \(\theta_2\). For a continuous random variable \(\mathbf{y}\), the likelihood ratio \(L(\theta_1; \mathbf{y})/L(\theta_2; \mathbf{y})\) shows the strength of the evidence in favor of \(\theta = \theta_1\) vs \(\theta = \theta_2\).

A statistic \(T(\mathbf{Y})\) is any real or vector-valued function that can be computed using the data alone. A statistic \(T(\mathbf{Y})\) is sufficient for an unknown parameter \(\theta\) if the conditional distribution of all the data \(\mathbf{Y}\) given \(T(\mathbf{Y})\) does not depend on \(\theta\). In other words, given \(T(\mathbf{Y})\) no other information on \(\theta\) can be extracted from \(\mathbf{y}\). This definition allows one to check whether a given statistic \(T(\mathbf{Y})\) is sufficient for \(\theta\), but it does not provide one with a constructive way to find it.

However, the Fisher-Neyman Factorization Theorem says that a statistic \(T(\mathbf{Y})\) is sufficient for \(\theta\) iff for all \(\theta \in \Theta\) that \(L(\theta, \mathbf{y}) = g(T(\mathbf{y}), \theta) \cdot h(\mathbf{y})\), where the function \(g(\cdot)\) depends on \(\theta\) and the statistic \(T(\mathbf{Y})\), while \(h(\mathbf{y})\) does not depend on \(\theta\). In particular, if the likelihood \(L(\theta; \mathbf{y})\) depends on data only through \(T(\mathbf{Y})\), then \(T(\mathbf{Y})\) is a sufficient statistic for \(\theta\) and \(h(\mathbf{y}) = 1\).

A sufficient statistic is not unique. For example, the entire sample \(\mathbf{Y} = \{Y_1, \ldots, Y_n\}\) is always a (trivial) sufficient statistic. We may seek a minimal sufficient statistic implying the maximal reduction of the data. A statistic \(T(\mathbf{Y})\) is called a minimal sufficient statistic if it is a function of any other sufficient statistic.

Another important property of a statistic is completeness. Let \(Y_1, \ldots, Y_n \sim f_\theta(y)\), where \(\theta \in \Theta\). A statistic \(T(\mathbf{Y})\) is complete if no statistic \(g(\mathbf{T})\) exists (except \(g(\mathbf{T})=0\)) such that \(E_\theta g(\mathbf{T}) = 0\) for all \(\theta \in \Theta\). In other words, if \(E_\theta g(\mathbf{T}) = 0\) for all \(\theta \in \Theta\), then necessarily \(g(\mathbf{T})=0\). To verify completeness for a general distribution can be a nontrivial mathematical problem, but thankfully it is much simpler for the exponential family of distributions that includes many of the “common” distributions. Completeness is a useful in determining minimal sufficiency because if a sufficient statistic \(T(\mathbf{Y})\) is complete, then it is also minimal sufficient. (Note, however, that a minimal sufficient statistic may not necessarily be complete.)

A (generally multivariate) family of distributions \({f_\theta(\mathbf{y}): \theta \in \Theta}\) is said to be an (one parameter) exponential family if: (1) \(\Theta\) is an open interval, (2) the support of the distribution \(f_\theta\) does not depend on \(\theta\), and (3) \(f_\theta(\mathbf{y}) = exp\{c(\theta)T(\mathbf{y}) + d(\theta) + S(\mathbf{y})\}\) where \(c(\cdot)\), \(T(\cdot)\), \(d(\cdot)\), and \(S(\cdot)\) are known functions; \(c(\theta)\) is usually called the natural parameter of the distribution. We say that \(f_\theta\) where \(\theta = (\theta_1, \ldots \theta_p)\) belongs to a \(k\)-parameter exponential family by changing (3) such that \(f_\theta(\mathbf{y}) = exp\{ \sum_{j=1}^k c_j(\theta)T_j(\mathbf{y}) + d(\theta) + S(\mathbf{y})\}\). The function \(c(\theta) = \{c_1(\theta), \ldots, c_k(\theta)\}\) are the natural parameters of the distribution. (Note that the dimensionality \(p\) of the original parameter \(\theta\) is not necessarily the same as the dimensionality \(k\) of the natural parameter \(c(\theta)\).)

Consider a random sample \(Y_1, \ldots, Y_n\), where \(Y_i \sim f_\theta(y)\) and \(f_\theta\) belongs to a \(k\)-parameter exponential family of distributions, then (1) the joint distribution of \(\mathbf{Y} = (Y_1, \ldots, Y_n)\) also belongs to the \(k\)-parameter exponential family, (2) \(T_\mathbf{Y} = (\sum_{i=1}^n T_1(Y_i), \ldots, \sum_{i=1}^n T_k(Y_i))\) is the sufficient statistic for \(c(\theta) = (c_1(\theta), \ldots, c_k(\theta))\) (and, therefore, for \(\theta\)), and (3) if some regularity conditions hold, then \(T_\mathbf{Y}\) is complete and therefore minimal sufficient (if the latter exists).

2 Point Estimation

Estimation of the unknown parameters of distributions from the data is one of the key issues in statistics. A (point) estimator \(\hat{\theta} = \hat{\theta}(\mathbf{Y})\) of an unknown parameter \(\theta\) is any statistic used for estimating \(\theta\). The value of \(\hat{\theta}(\mathbf{y})\) evaluated for a given sample is called an estimate. This is a general, somewhat trivial definition that does not say anything about the goodness of estimation; one would evidently be interested in “good” estimators.

Maximum Likelihood Estimation is the most used method of estimation of parameters in parametric models. As we’ve discussed, \(L(\theta; \mathbf{y})\) is the measure of likeliness of a parameter’s values \(\theta\) for the observed data \(\mathbf{y}\). It is only natural then to seek the “most likely” value of \(\theta\). The MLE \(\hat{\theta}\) of \(\theta\) is \(\hat{\theta} = \arg\max_{\theta\in\Theta} L(\theta; \mathbf{y})\) — the value of \(\theta\) that maximizes the likelihood. Often we are interested in a function of a parameter \(\xi = g(\theta)\). When \(g(\cdot)\) is 1:1, the MLE of \(\xi\) is the function \(g(\cdot)\) applied to the MLE of \(\theta\): \(\hat{\xi} = g(\hat{\theta})\), the proof of which is just a reparameterization of the likelihood in terms of \(\xi\) instead of \(\theta\). Although the conception of the MLE was motivated by an intuitively clear underlying idea, the justification for its use is much deeper. It is a really “good” method of estimation (to be discussed later on the topic of asymptotics).

Another popular method of estimation is the Method of Moments. Its main idea is based on expressing the population moments of the distribution of data in terms of its unknown parameter(s) and equating them to their corresponding sample moments. MMEs have some known problems: consider a sample of size 4 from a uniform distribution \(U(0, \theta)\) with the observed sample 0.2, 0.6, 2, and 0.4. The MME is 1.6, which does not make much sense given the observed value of 2 in the sample. For these and related reasons, the MMEs are less used than the MLE counterparts. However, MMEs are usually simpler to compute and can be used, for example, as reasonable initial values in numerical iterative procedures for MLEs. On the other hand, the Method of Moments does not require knowledge of the entire distribution of the data (up to the unknown parameters) but only its moments and thus may be less sensitive to possible misspecification of a model.

The Method of Least Squares play a key role in regression and analysis of variance. In a typical regression setup, we are given \(n\) observations \((\mathbf{x}_i, y_i)\), \(i=1, \ldots, n\) over \(m\) explanatory variables \(\mathbf{x} = (x_1, \ldots, x_m)\) and the response variable \(Y\). We assume that \(y_i = g_\theta(\mathbf{x}_i) + \varepsilon_i\), \(i = 1, \ldots, n\) where the response function \(g_\theta(\cdot): \mathbb{R}^m \rightarrow \mathbb{R}\) has a known parametric form and depends on \(p \le n\) unknown parameters \(\theta = (\theta_1, \ldots, \theta_p)\). The LSE looks for a \(\hat{\theta}\) that yields the best fit \(g_\hat{\theta}(\mathbf{x})\) to the observed \(\mathbf{y}\) w.r.t. the Euclidean distance: \(\hat{\theta} = \arg\min_\theta \sum_{i=1}^n (y_i - g_\theta(\mathbf{x}_i))^2\). For linear regression, the solution is available in closed form. For non-linear regression, however, it can generally be only found numerically.

More generally than the three above procedures, one can consider any function \(\rho(\theta, y)\) as a measure of the goodness-of-fit and look for an estimator that maximizes or minimizes \(\sum_{i=1}^n \rho(\theta, y_i)\) w.r.t. \(\theta\). Such estimators are called M-estimators. It turns out that various well-known estimators can be viewed as M-estimators for a particular \(\rho(\theta, y)\) including \(\bar{Y}\) for the sample mean as well as MLE, LSE, and a generalized version of MME not yet discussed. As we’ll see when discussing asymptotics, M-estimators share many important asymptotic properties.

A natural question is how to compare between various estimators. First, we should define a measure of goodness-of-estimation. Recall that any estimator \(\hat{\theta} = \hat{\theta}(Y_1, \ldots, Y_n)\) is a function of a random sample and therefore is a random variable itself with a certain distribution, expectation, variance, etc. A somewhat naive attempt to measure the goodness-of-estimation of \(\hat{\theta}\) would be to consider the error \(|\hat{\theta}-\theta|\). However, \(\theta\) is unknown and, as we have mentioned, an estimator \(\hat{\theta}\) is a random variable and hence the value \(|\hat{\theta}-\theta|\) will vary from sample to sample. It may be “small” for some of the samples, while “large” for others and therefore cannot be used as a proper criterion for goodness-of-estimation of an estimator \(\hat{\theta}\). A more reasonable measure would then be an average distance over all possible samples, that is, the mean absolute error \(E|\hat{\theta}-\theta|\), where the expectation is taken w.r.t. the joint distribution of \(\mathbf{Y} = (Y_1, \ldots, Y_n)\). It indeed can be used as a measure of goodness-of-estimation but usually, mostly due to convenience of differentiation, the conventional measure is the mean squared error (MSE) given by \(MSE(\hat{\theta}, \theta) = E(\hat{\theta}-\theta)^2\).

The MSE can be decomposed into two components: \(MSE(\hat{\theta}, \theta) = Var(\hat{\theta}) + b^2(\hat{\theta},\theta)\). The first is the stochastic error (variance) and the second is a systematic or deterministic error (bias). Having defined the goodness-of-estimation measure by MSE, one can compare different estimators and choose the one with the smallest MSE. However, since the \(MSE(\hat{\theta}, \theta)\) typically depends on the unknown \(\theta\), it is a common situation where no estimator is uniformly superior for all \(\theta \in \Theta\).

Ideally, a good estimator with a small MSE should have both low variance and low bias. However, it might be hard to have both. One of the common approaches is to first control the bias component of the overall MSE and to consider unbiased estimators. There is no general rule or algorithm for deriving an unbiased estimator. In fact, unbiasedness is a property of an estimator rather than a method of estimation. One usually checks an MLE or any other estimator for bias. Sometimes one can then modify the original estimator to “correct” its bias. Note that unlike ML estimation, unbiasedness is not invariant under nonlinear transformation of the original parameter: if \(\hat{\theta}\) is an unbiased estimator of \(\theta\), \(g(\hat{\theta})\) is generally a biased estimator for \(g(\theta)\).

What does unbiasedness of an estimator \(\hat{\theta}\) actually mean? Suppose we were observing not a single sample but all possible samples of size \(n\) from a sample space and calculating the estimates \(\hat{\theta}_j\) for each one of them. The unbiasedness means that the average value of \(\hat{\theta}\) over the entire sample space is \(\theta\), but it does not guarantee yet that \(\hat{\theta}_j \approx \theta\) for each particular sample. The dispersion of \(\hat{\theta}_j\)’s around their average value \(\theta\) might be large and, since in reality we have only a single sample, its particular value of \(\hat{\theta}\) might be quite away from \(\theta\). To ensure with high confidence that \(\hat{\theta}_j \approx \theta\) for any sample we need in addition for the variance \(Var(\hat{\theta})\) to be small.

An estimator \(\hat{\theta}\) is called a uniformly minimum variance unbiased estimator (UMVUE) of \(\theta\) if \(\hat{\theta}\) is unbiased and for any other unbiased estimator \(\tilde{\theta}\) of \(\theta\), \(Var(\hat{\theta}) \le Var(\tilde{\theta})\). If the UMVUE exists, it is necessarily unique. Recall that there is no general algorithm to obtain unbiased estimators in general and a UMVUE in particular. However, there exists a lower bound for a variance of an unbiased estimator, which can be used as a benchmark for evaluating its goodness.

Define the Fisher Information Number \(I(\theta) = E((\ln f_\theta(\mathbf{y}))'_\theta)^2\). The derivative of the log density is sometimes called the Score Function. Thus, the Fisher Information Number is the expected square of the Score. The Cramer-Rao Lower Bound Theorem states that if \(T\) is an unbiased estimator for \(g(\theta)\), where \(g(\cdot)\) is differentiable, then \(Var(T) \ge (g'(\theta))^2 / I(\theta)\) or more simply, when \(T\) is an unbiased estimator for \(\theta\), \(Var(T) \ge 1 / I(\theta)\). We are especially interested in the case where \(Y_1, \ldots, Y_n\) is a random sample from a distribution \(f_\theta(y)\). In that case, \(I(\theta) = nI^*(\theta)\) where \(I^*(\theta) = E((\ln f_\theta(y))'_\theta)^2\) is the Fisher Information Number of \(f_\theta(y)\), and, therefore, for any unbiased estimator \(T\) of \(g(\theta)\), we have that \(Var(T) \ge (g'_\theta(\theta))^2 / nI^*(\theta)\). There is another, usually more convenient formula for calculating the Fisher Information Number \(I(\theta)\) other than its direct definition: \(I(\theta) = -E(\ln f_\theta(\mathbf{Y}))''_\theta\).

It is important to emphasize that the CRLB theorem is one direction only: if the variance of an unbiased estimator does not achieve the Cramer-Rao lower bound, one still cannot claim that it is not an UMVUE (it might just be the UMVUE). Nevertheless, it can be used as a benchmark for measuring the goodness of an unbiased estimator. One special result related to the exponential family distribution: the Cramer-Rao lower bound is achieved only for distributions from the exponential family.

The Cramer-Rao lower bound allows one only to evaluate the goodness of a proposed unbiased estimator but does not provide any constructive way to derive it. In fact, as we have argued, there is no such general rule at all. However, if one manages to obtain any initial (even crude) unbiased estimator, it may be possible to improve it. The Rao-Blackwell Theorem shows that if there is an unbiased estimator that is not a function of a sufficient statistic \(W\), one can construct another unbiased estimator based on \(W\) with an MSE not larger than the original one: let \(T\) be an unbiased estimator of \(\theta\) and \(W\) be a sufficient statistic for \(\theta\), and define \(T_1 = E(T|W)\), then \(T_1\) is an unbiased estimator of \(\theta\) and \(Var(T_1) \le Var(T)\). Thus, in terms of MSE, only unbiased estimators based on a sufficient statistic are of interest. This demonstrates again a strong sense of the notion of sufficiency.

Does Rao-Blackwellization necessarily yield an UMVUE? Generally not. To guarantee UMVUE an additional requirement of completeness on a sufficient statistic \(W\) is needed. The Lehmann-Scheffe Theorem formalizes this: if \(T\) is an unbiased estimator of \(\theta\) and \(W\) is a complete sufficient statistic for \(\theta\), then \(T_1 = E(T|W)\) is the unique UMVUE of \(\theta\). Even without the Lehmann-Scheffe theorem, it can be shown under mild conditions that if the distribution of the data belongs to the exponential family and an unbiased estimator is a function of the corresponding sufficient statistic, it is an UMVUE. Note that despite its elegance, the application of the Rao-Blackwell Theorem in more complicated cases is quite limited. The two main obstacles are in finding an initial unbiased estimator \(T\) and calculating the conditional expectation \(E(T|W)\)

3 Confidence Intervals, Bounds, and Regions

When we estimate a parameter we essentially “guess” its value. This should be a “well educated guess,” hopefully the best of its kind (in whatever sense). However, statistical estimation is made with error. Presenting just the estimator, no matter how good it is, is usually not enough – one should give an idea about the estimation error. The words “estimation” and “estimate” (in contrast to “measure” and “value”) point to the fact that an estimate is an inexact appraisal of a value. Great efforts of statistics are invested in trying to quantify the estimation error and find ways to express it in a well-defined way.

Any estimator has error, that is, if \(\theta\) is estimated by \(\hat{\theta}\), then \(\hat{\theta} - \theta\) is usually different from \(0\). The standard measure of error is the mean squared error \(MSE = E(\hat{\theta} - \theta)^2\). When together with the value of the estimator we are given its MSE, we get a feeling of how precise the estimate is. It is more common to quote the standard error defined by \(SE(\hat{\theta}) = \sqrt{MSE(\hat{\theta}, \theta)} = \sqrt{E(\hat{\theta}-\theta)^2}\). Unlike the MSE, the SE is measured in the same units as the estimator.

However, the MSE expresses an average squared estimation error but tells nothing about its distribution, which generally might be complicated. One would be interested, in particular, in the probability \(P(|\hat{\theta}-\theta| > c)\) that the estimation error exceeds a certain accuracy level \(c > 0\). Markov’s Inequality enables us to translate the MSE into an upper bound on this probability: \(P(|\hat{\theta}-\theta| \ge c) \le MSE/c^2\). However, this bound is usually very conservative and the actual probability may be much smaller. Typically, a quite close approximation of the error distribution is obtained by the Central Limit Theorem (discussed later).

Above, we suggested quoting the SE besides the estimator itself. However, standard statistical practice is different and it is not very intuitive. There are a few conceptual difficulties in being precise when talking about the error. Suppose we estimate \(\mu\) with \(\bar{Y}\) and we find \(\bar{Y}=10\) and \(SE=0.2\). We are quite confident that \(\mu\) is between \(9.6\) and \(10.4\). However, this statement makes no probabilistic sense. The true \(\mu\) is either in the interval \((9.6,10.4)\) or it’s not. The unknown \(\mu\) is not a random variable – it has a single fixed value, whether or not it is known to the statistician.

The “trick” that has been devised is to move from a probabilistic statement about the unknown (but with a fixed value) parameter to a probabilistic statement about the method. We say something like: “The interval \((9.6,10.4)\) for the value \(\mu\) was constructed by a method which is 95% successful.” Since the method is usually successful, we have confidence in its output.

Let \(Y_1, \ldots, Y_n \sim f_\theta(y)\), \(\theta \in \Theta\). A \((1-\alpha)100\%\) Confidence Interval for \(\theta\) is the pair of scalar-valued statistics \(L=L(Y_1, \ldots, Y_n)\) and \(U=U(Y_1, \ldots, Y_n)\) such that \(P(L \le \theta \le U) \ge 1-\alpha\) for all \(\theta \in \Theta\) with the inequality as closs to an equality as possible.

What is the right value of \(\alpha\)? Nothing in the statistical theory dictates a particular choice. But the standard values are \(0.10\), \(0.05\), and \(0.01\) with \(0.05\) being most common.

A pivot is a function \(\psi(Y_1, \ldots, Y_n; \theta)\) of the data and the parameters, whose distribution does not depend on unknown parameters. Note, the pivot is not a statistic and cannot be calculated from the data, exactly because it depends on the unknown parameters. On the other hand, since its distribution does not depend on the unknown parameters, we can find an interval \(A_\alpha\) such that \(P(\psi(Y_1, \ldots, Y_n; \theta) \in A_\alpha) = 1 - \alpha\). Then we can invert the inclusion and define the interval (or region in general) \(C_\alpha = \{\theta : \psi(Y_1, \ldots, Y_n; \theta) \in A_\alpha \}\). The set \(C_\alpha\) is then a \((1-\alpha)100\%\) confidence set. Note that \(C_\alpha\) is a random set because it depends on the random sample.

Does a pivot always exist? For a random sample \(Y_1, \ldots, Y_n\) from any continuous distribution with a cdf \(F_\theta(y)\), the value \(-\sum_{i=1}^n \ln F_\theta(Y_i) \sim \frac{1}{2} \chi^2_{2n}\) and, therefore, is a pivot. However, in the general case, it might be difficult (if possible) to invert the corresponding confidence interval for this pivot to a confidence interval for the original parameter of interest \(\theta\). More convenient pivots are easy to find when the distribution belongs to a scale-location family.

Is a confidence interval for \(\theta\) necessarily unique? Definitely not! Different choices of pivots lead to different forms of confidence intervals. Moreover, even for a given pivot, one can typically construct an infinite set of confidence intervals at the same confidence level. What is the “best” choice for a confidence interval? A conventional, somewhat ad hoc approach is based on error symmetry: set \(P(L>\theta) = P(U<\theta) = \alpha/2\). A more appealing approach would be to seek a confidence interval of a minimal expected length; however, in general, this leads to a nonlinear minimization problem that might not have a solution in closed form.

A parameter of interest may be not the original parameter \(\theta\) but its function \(g(\theta)\). Let \((L,U)\) be a \((1-\alpha)100\%\) confidence interval for \(\theta\) and \(g(\cdot)\) a strictly increasing function. Then \((g(L),g(U))\) is a \((1-\alpha)100\%\) confidence interval for \(g(\theta)\).

The normal confidence intervals are undoubtedly the most important. We do not claim that most data sets are sampled from a normal distribution – this is definitely far from being true. However, what really matters is whether the distribution of an estimator \(\hat{\theta} = \hat{\theta}(Y_1, \ldots, Y_n)\) is close to normal rather than the distribution of \(Y_1, \ldots, Y_n\) themselves. When discussing asymptotics later, we argue that many estimators based on large or even medium sized samples are indeed approximately normal and, therefore, the normal confidence intervals can (at least approximately) be used.

4 Hypothesis Testing

In many situations the research question yields one of two possible answers “yes” or “no” and a statistician is required to choose the “correct” one basd on the data. The two possible answers can be viewed as two hypotheses—the null hypothesis (\(H_0\)) and the alternative (\(H_1\)). The process of chooseing between them based on the data is known as hypothesis testing.

More formally, suppose \(\mathbf{Y} \sim f_\theta(\mathbf{y})\) where \(f_\theta(\mathbf{y})\) belongs to a parametric family of distributions \(\mathcal{F}_\theta\), \(\theta \in \Theta\). Under the null hypothesis \(\theta \in \Theta_0 \subset \Theta\), while under the alternative \(\theta \in \Theta_1 \subset \Theta\), where \(\Theta_0 \cap \Theta_1 = \emptyset\). Given the data \(\mathbf{Y}\) we need to find a rule (test) to decide whether we accept or reject the null hypothese \(H_0\) based on a test statistic \(T(\mathbf{Y})\). Typicaly \(T(\mathbf{Y})\) is chosen in such a way that it tends to be small under \(H_0\): the larger it is, the stronger the evidence against \(H_0\) in favor of \(H_1\). The null hypothesis is then rejected if \(T(\mathbf{Y})\) is “too large” for it, that is, if \(T(\mathbf{Y}) > C\) for some critical value \(C\).

It is essential to understand that since the inference on the hypothesis is based on random data, there is always a possibility for a wrong decision: we might either erroneously reject \(H_0\) (Type I error) or accept it (Type II error). The key question in hypothesis testing is a proper choiceof a test statistic and the corresponding critical value that would minimize the probability of an erroneous decision. Unfortunately, as we shall see, it is generally impossible to minimize the probability of errors of both types: decreasing one of them comes at the expense of increasing the other.

The probability of a Type I error (erroneous rejection of the null hypothesis) also known as the (significance) level of the test is \(\alpha = P_{\theta_0} \left( T(\mathbf{Y}) \ge C \right) = P_{\theta_0} \left( \mathbf{Y} \in \Omega_1 \right)\). Similarly, the probability of a Type II error (erroneous acceptance of the null hypothesis) is \(\beta = P_{\theta_1} \left( T(\mathbf{Y}) < C \right) = P_{\theta_1} \left( \mathbf{Y} \in \Omega_0 \right)\). The power of a test is \(\pi = 1 - \beta = P_{\theta_1} \left( \mathbf{Y} \in \Omega_1 \right)\). Note that hypothesis testing induces a partition of the sample space \(\Omega\) into two disjoint regions: the rejection region \(\Omega_1 = \{ \mathbf{y}: T(\mathbf{y}) \ge C \}\) and its complement, the acceptance region \(\Omega_0 = \{ \mathbf{y}: T(\mathbf{y}) < C \}\).

The choice of \(C\) is usually not obvious. \(\alpha = P_{\theta_0} \left( T(\mathbf{Y}) \ge C \right)\) and \(\beta = P_{\theta_1} \left( T(\mathbf{Y}) < C \right)\). Increasing \(C\) leads to the reduction of the rejection region \(\Omega_1\) and therefore to a smaller \(\alpha\) but larger \(\beta\) and vice versa. It is impossible to find an optimal \(C\) that would minimize both \(\alpha\) and \(\beta.\) Typically, one controls the probability of the Type I error at some conventional low level (say, 0.05 or 0.01) and finds the corresponding \(C\) from the equation \(\alpha = P_{\theta_0} \left( T(\mathbf{Y}) \ge C \right)\). In this case, there is no longer symmetry between the two hypotheses and the choice of \(H_0\) becomes crucial. Two researchers analyzing the same data might come to different conclusions depending on their choices for the null and alternative. Although there are no formal rules, the null hypothesis usually represents the current or conventional state and a “strong evidence” should be provided to reject it in fabor of the alternative corresponding to a change.

Up until now, for a given test statistic \(T(\mathbf{Y})\), we calculate its value \(t_{obs}=T(\mathbf{y})\) for the observed data \(\mathbf{y}\) and compared \(t_{obs}\) with a critical value \(C\) corresponding to a significance level \(\alpha\), and rejected the null hypothesis if \(t_{obs} \ge C\). We would like to standardize \(t_{obs}\) to a scale which will tell us directly its “unlikeness” under the null. Namely, consider the probability under \(H_0\) that the test statistic \(T(\mathbf{Y})\) has a value of at least \(t_{obs}\). Such a probability is called a p-value, that is, \(\textrm{p-value} = P_{\theta_0} \left( T(\mathbf{Y}) \ge t_{obs} \right)\). A p-value can be viewed as the minimal significance level for which one would reject the null hypothesis for a given value of a test statistic.

So far, a test statistic \(T(\mathbf{Y})\) was chosen in advance and we discussed the corresponding probabilities of both error typoes, the power, p-value, critical value, etc. The core question in hypothesis testing, however, is the proper choice of \(T(\mathbf{Y})\). What is the “optimal” test?

Let \(\mathbf{Y} \sim f_\theta(\mathbf{y})\) and test two simple hypotheses \(H_0: \theta = \theta_0\) vs. \(H_1: \theta = \theta_1\). Define the likelihood ratio

\[ \lambda(\mathbf{Y}) = \frac{ L(\theta_1,; \mathbf{Y}) }{ L(\theta_0,; \mathbf{Y})} = \frac{ f_{\theta_1}(\mathbf{Y}) }{ f_{\theta_0}(\mathbf{Y}) } \]

The value of \(\lambda(\mathbf{Y})\), in fact, shows how much \(\theta_1\) is more likely for the observed data than \(\theta_0\). Obviously, the larger \(\lambda(\mathbf{Y})\), the stronger the evidence against the null. The corresponding likelihood ratio test (LRT) at a given level \(\alpha\) is of the form: reject \(H_0\) in favor of \(H_1\) if \(\lambda(\mathbf{Y}) \ge C\), where the critical value \(C\) satistifes \(P_{\theta_0} \left( \lambda(\mathbf{Y}) \ge C \right) = \alpha\). As it happens, the LRT is exactly the optimal test we are looking for that is, the one with maximal power (MP) among all tests at a signifiance level \(\alpha\), which is formalized as the Neyman-Pearson Lemma.

A main problem is in its practical implementation, specifically of finding critical value \(C\). For a fixed level \(\alpha\), \(C\) is the solution of \(P_{\theta_0} \left( \lambda(\mathbf{Y}) \ge C \right) = \alpha\), where the distribution of \(\lambda(\mathbf{Y})\) might be quite complicated. For large samples various asymptotic approximations for the distribution of \(\lambda(\mathbf{Y})\) can be used.

The situation in which both the null and the alternative hypotheses are simple is quite rare in practical applications — it is usual that at least the alternative hypothesis is composite. So we now have the power function \(\pi(\theta)\) of a test is the probability of rejecting \(H_0\) as a function of \(\theta\), that is, \(\pi(\theta) = P_\theta \left( \textrm{reject} H_0 \right) = P_\theta \left( \mathbf{Y} \in \Omega_1 \right)\). The significance level is now the maximal probability of a Type I error over all possible \(\theta \in \Theta_0\) (ie, the worst case scenario): \(\alpha = \sup_{\theta \in \Theta_0} \pi(\theta)\). And similarly the p-value is the \(\sup_{\theta \in \Theta_0} P_\theta \left( T(\mathbf{Y}) \ge t_{obs} \right)\).

As we have argued, the main problem in hypothesis testing is the proper choice of a test. We discussed that for simple hypothesis one compares the performance of two different tests with the same significance level by their powers and looks for the maximum power test. What happens with composite hypothesis, where there is no single power value but a power function? Consider two tests with power function \(\pi_1(\theta)\) and \(\pi_2(\theta)\) with the same significance level \(\alpha\). If \(\pi_1(\theta) > \pi_2(\theta)\) for all \(\theta \in \Theta_1\), we can say that the first test uniformly outperforms its counterpart. Unfortunately, this is not common. Rather it is typical for \(\pi_1(\theta) > \pi_2(\theta)\) for only some \(\theta \in \Theta_1\) and so neither test is uniformly better and we cannot prefer either of them. This is reminiscent of the situation of comparing estimators by MSE in Section 2.

The requirement of uniformly most powerfulness for all \(\theta \in \Theta_1\) among all possible tests at givel level \(\alpha\) is very strong. For example, there is a UMP for a one-sided test for the normal mean, but not for a two-sided test for the normal mean. And in multiparameter cases, UMP tests do not exist at all except for some singular cases. Nevertheless, it seems tempting to generalize the likelihood ratio test of Neyman-Pearson to composite hypotheses. The natural generalization of the likelihood ratio for composite hypotheses is

\[ \lambda^*(\mathbf{y}) = \frac{ \sup_{\theta\in\Theta_1} L(\theta_1,; \mathbf{Y}) }{ \sup_{\theta\in\Theta_0} L(\theta_0,; \mathbf{Y}) } = \frac{ \sup_{\theta\in\Theta_1} f_{\theta_1}(\mathbf{Y}) }{ \sup_{\theta\in\Theta_0} f_{\theta_0}(\mathbf{Y}) } \]

Usually it is more convenient to use the equivalent generalized likelihood ratio (GLR) statistic

\[ \lambda^*(\mathbf{y}) = \frac{ \sup_{\theta\in\Theta} L(\theta_1; \mathbf{Y}) }{ \sup_{\theta\in\Theta_0} L(\theta_0; \mathbf{Y}) } \]

(Note the change in subscript on \(\sup\) in the numerator). Also note that there is no proof that the GLR is UMP. But, as it happens, several well-known and widely used statistical tests are GLRs: one and two-sided t-tests for normal means, \(\chi^2\) test for normal variance, F-test for normal variance, F-tests for comparing nested regression models, Pearson’s goodness-of-fit \(\chi^2\) test, etc.

There is a duality between hypothesis testing and confidence intervals. If one has derived a \((1-\alpha)100\%\) confidence interval for a single unknown parameter \(\theta\) and then wants to test \(H_0: \theta = \theta_0\) against a two sided alternative \(H_1: \theta \ne \theta_0\) at level \(\alpha\), one should simply check whether \(\theta_0\) lies within the confidence interval and then accept \(H_0\) or (if outside the interval) reject \(H_0\). The duality extends straightforwardly to the multiparameter case.

For a discussion of sequential testing or multiple testing, see section 4.5 and 4.6.

5 Asymptotic Analysis

In many situations an estimator might be a complicated function of the data, and the exact calculation of its characteristics (e.g., bias, variance, and MSE) might be quite difficult if possible at all. Nevertheless, rather often, good approximate properties can be obtained if the sample size is “sufficiently large.” Intuitively, everyone would agree that a “good” estimator should improve and approach the estimated parameter as the sample size increases: the larger the pole, the more representative it results are. Conclusions based on a large sample have more ground than those based on a small sample. But what is the formal justification for this claim?

Similar issues arise with confidence intervals and hypothesis testing. To construct a confidence interval, we should know the distribution of the underlying statistic, and that may be difficult to calculate. Can we derive at least an approximate confidence interval, without knowing the exact distribution of the statistic for a “sufficiently large” sample? In hypothesis testing we need the distribution of the test statistic \(T(\mathbf{Y})\) under the null hypothesis in order to find the critical value for a given significance level \(\alpha\) by solving \(P_{\theta_0} \left( T(\mathbf{Y}) \ge C \right) = \alpha\) or to calculate the p-value. The statistic \(T(\mathbf{Y})\) might be a complicated function of the data whose distribution is hard to derive, if doable at all. Is it possible then to find its simple asymptotic approximation?

Statisticians like to answer these qustions by considering a sequence of similar problems that differ from one another only in their sample size. Let \(\hat{\theta}_n = \hat{\theta}(Y_1, \ldots, Y_n)\) be an estimator for \(\theta\) based on a sample of size \(n\). We can then study the asymptotic properties fo the sequence \(\{\hat\theta_n\}\) as \(n\) increases. A thorough history (not included here) of prerequisites includes discussion of convergence of a sequence of real numbers, then a sequence of vectors, then a sequence of functions, and finally the convergence of a sequence of estimators \(\{\hat\theta_n\}\) to a parameter \(\theta\). We consider three types of convergence of estimators: in mean-squared error, in probability, and in distribution.

Recall that the most common measure of goodness-of-estimation for \(\hat\theta_n\) is its mean squared error \(MSE(\hat\theta_n, \theta) = E(\hat\theta_n-\theta)^2\). We say that \(\hat\theta_n\) converges in MSE to \(\theta\) if \(MSE(\hat\theta_n, \theta) \rightarrow 0\) as the sample size \(n\) increases. We denote this as \(\hat\theta_n \stackrel{\textrm{MSE}}{\rightarrow} \theta\). Consistency in MSE, however, might be hard to prove and sometimes even too strong to ask for.

A more realistic approach is then to relax it by the weaker convergence in probability. The sequence of random variables \(\{Y_n\}\) is said to converge to \(c\) if for any \(\varepsilon > 0\) the probability \(P(|Y_n-c|>\varepsilon) \rightarrow 0\) as \(n \rightarrow \infty\). We write this as \(Y_n \stackrel{p}{\rightarrow} c\). With a proof of a continuous mapping theorem — i.e., that \(g(Y_n) \stackrel{p}{\rightarrow} g(c)\) — we an define consistency in probability of an estimator (usually just called consistency for brevity) as \(\hat\theta_n \stackrel{p}{\rightarrow} \theta\) as \(n \rightarrow \infty\).

The weak law of large numbers (WLLN) implies that a sample mean is always a consistent estimator of the expectation. Let \(Y_1, Y_2, \ldots\) be i.i.d. random variables with finite second moment. Then \(\bar{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i \stackrel{p}{\rightarrow} EY\) as \(n \rightarrow \infty\). The proof follows immediately from Chebyshev’s inequality: \(P(|\bar{Y}_n-\mu| \ge \varepsilon) \le \textrm{Var}(\bar{Y}_n) / \varepsilon^2 = \textrm{Var}(Y)/(n\varepsilon^2) \rightarrow 0\). Although consistency iteself is important, the rate of convergence also matters. One would obviously prefer a consistent estimator with a faster convergence rate. We will not consider these issues in detail but just menttion that in many situations, \(n^{-1/2}\) is the best available convergence rate for estimating a parameter; such estimator are called \(\sqrt{n}\)-consistent.

Consistency (in MSE or probability) considered in the previous sections still does not say anything about the asymptotic distribution of an estimator needed, for example, to construct an approximate confidence interval for an unknown parameter or to calculate the critical value in hypothesis testing. What we need is convergence in distribution. A sequence of random variables \(Y_1, Y_2, \ldots\) with cdf’s \(F_1(\cdot), F_2(\cdot),\ldots\) is said to converge in distribution to a random variable \(Y\) with cdf \(F(\cdot)\) if \(\lim_{n\rightarrow\infty}F_n(x) = F(x)\) for every continuity point \(x\) of \(F(\cdot)\). We denote this by \(Y_n \stackrel{\mathscr{D}}{\rightarrow} Y\). Convergence in distribution is important because it enables us to approximate probabilities. Quite often \(F_n\) is either complex or not explicitly known, while \(F\) is simple and known. If \(Y_n \stackrel{\mathscr{D}}{\rightarrow} F\), we can use \(F\) to calculate approximate probabilities related to \(Y_n\).

Primte examples of limiting distributions include Poisson (as an approximation for binomial random variables with large \(n\) and small \(p\)), the normal distribution (for properly saled means), the exponential distribution (for the minimum of many random variables), and the extreme value distribution (for the distribution of extremely large values). The fact tha tthese distributions are obtained as a limit is the reason why they are so useful.

Suppose \(Y_1, Y_2, \ldots\) are i.i.d. random variables with the common (finite) expectation \(\mu\) and variance \(\sigma^2\). Let \(\bar{Y}_n = \frac{1}{n}\sum_{i=1}^nY_i\). Then as \(n \rightarrow \infty\), \(\sqrt{n}(\bar{Y}_n - \mu) \stackrel{\mathscr{D}}{\rightarrow} \mathscr{N}(0,\sigma^2)\). This is a formal way to say that for sufficiently large \(n\), the distribution of a sample mean \(\bar{Y}_n\) is approximately normal with mean \(\mu\) and variance \(\sigma^2/n\). Note that it is always true that \(E\bar{Y}_n=\mu\) and \(\textrm{Var}(\bar{Y}_n)=\sigma^2/n\). The CLT adds that the distribution of \(\bar{Y}_n\) is approximately normal.

The CLT is a real refinement of the WLLN. Both consider the asymptotic behavior of \(\bar{Y}_n\). The WLLN only says that \(\bar{Y}_N \stackrel{p}{\rightarrow} \mu\), or that \(\bar{Y}_n\) is very close to \(\mu\) when \(n\) is large, but does not tell us how close. The CLT gives a precise answer. It is as if we look at the cdf of \(\bar{Y}_n\) through a magnifying glass: we center it around \(\mu\) and magnify by \(\sqrt{n}\). As \(n\) increases, the cdf looks more and more smooth, and becomes closer and closer to a normal cdf.

An estimator \(\hat{\theta}_n\) is alled a consistent asymptotically normal (CAN) estimator of \(\theta\) if \(\sqrt{n}(\hat{\theta}_n - \theta) \stackrel{\mathscr{D}}{\rightarrow} \mathscr{N}\left(0,\sigma^2(\theta)\right)\) for some asymptotic variance \(\sigma^2(\theta)\) that may generally depend on \(\theta\). If \(\hat{\theta}_n\) is a CAN estimator of \(\theta\), then it is also \(\sqrt{n}\)-consistent. A theorem heavily based on the Delta Method explains why so many estimators are CAN: if \(\hat{\theta}_n\) is a CAN estimator of \(\theta\) and \(g(\cdot)\) is any differentiable function with \(g'(\theta)\ne0\), then \(g(\hat{\theta}_n)\) is a CAN estimator of \(g(\theta)\).

The notations asymptotic mean and asymptotic variance for \(\theta\) and \(\sigma^2(\theta)\) might be somewhat misleading: although it is tempting to think that for a CAN estimator \(\hat{theta}_n\) of \(\theta\), \(E\hat{\theta}_n \rightarrow \theta\) and \(\textrm{Var}(\hat{\theta}_n) \rightarrow \sigma^2(\theta)\) as \(n\) increases, generally it is not true. The convergence in distribution does not necessarily yield convergence of moments.

Asymptotical normality of an estimator can be used to construct an approximate normal confidence interval when its exact distribution is complicated or even not completely known.

\[ P \left( \hat{\theta}_n - z_{\alpha/2} \frac{\sigma(\theta)}{\sqrt{n}} \le \theta \le \hat{\theta}_n - z_{\alpha/2} \frac{\sigma(\theta)}{\sqrt{n}} \right) = P\left( - z_{\alpha/2} \le \sqrt{n} \frac{\hat{\theta}_n - \theta}{\sigma(\theta)} \le z_{\alpha/2} \right) \rightarrow 1 - \alpha \]

This equation is not much use on its own since the asymptotic variance \(\sigma^2(\theta)\) generally also depends on \(\theta\) or is an additional unknown parameter. Intuitively, a possible approach would be to replace \(\sigma(\theta)\) by some estimator and how that it will not strongly affect the resulting asymptotic converage probability. As it happens, the idea indeed works in \(\hat{\sigma}_n\) is any consistent estimator of \(\sigma(\theta)\).

An example is given whereby \(\hat{\lambda}\) estimates \(\lambda\) but also \(\hat{\theta}\) estimates \(\theta = 1/\lambda\) and \(\hat{p}\) estimate \(p=e^{-\lambda}\). Asymptotically normal confidence intervals are constructed for \(\lambda\), \(\theta\), and \(p\). These approaches, unsurprisingly, lead to different results — there is no unique way to construct a confidence interval (as already mentioned in Chaper 3). Which one is better? One likely prefers a confidence interval with shorter expected length. For asymptotic confidence intervals, another (probably even more important) issue is the goodness of normal approximation for distributions of the underlying statistics. Another possible approach to deal with an unknown asymptotic variance \(\sigma^2(\theta)\) of a CAN estimator \(\hat{\theta}_n\) when constructing an asymptotic confidence interval for \(\theta\) is by the use of the variance stabilizing transformation \(g(\theta)\) whose asymptotic variance \(g'(\theta)\sigma^2(\theta)\) does not depend on \(\theta\). Evidently, this transformation \(g(\theta)\) should satisfy \(g'(\theta) = c/\sigma(\theta)\) for a chosen constant \(c\). A \((1-\alpha)100\%\) asymptotic confidence interval for the resulting \(g(\theta)\) is then \(g(\hat{\theta}_n) \pm z_{\alpha/2}\frac{c}{\sqrt{n}}\), which can be used to construct an asymptotic confidence interval for \(\theta\) itself.

((chapter unfinished))

6 Bayesian Inference

To be added

7 Elements of Statistical Decision Theory

To be added

8 Linear Models

To be added

9 Nonparametric Estimation

To be added