Quotes and Notes from Statistical Theory: A Concise Introduction (2ed)

by Felix Abramovich and Ya’acov Ritov

Get the book from Routledge (here) or Amazon (here)

This is a truly excellent book explaining the underlying ideas, mathematics, and principles of major statistical concepts. Its organization is superb, and the authors’ commentary on why a theorem is so useful and how the presented ideas fit together and/or contrast is invaluable (and, quite frankly, better than I have seen anywhere else).

1 Introduction

Suppose the observed data is the sample \(\mathbf{y} = \{y_1, \ldots, y_n\}\) of size \(n\). We will model \(\mathbf{y}\) as a realization of an \(n\)-dimensional random vector \(\mathbf{Y} = \{Y_1, \ldots, Y_n\}\) with a joint distribution \(f_\mathbf{Y}(\mathbf{y})\). The true distribution of the data is rarely completely known; nevertheless, it can often be reasonable to assume that it belongs to some family of distributions \(\mathcal{F}\). We will assume that \(\mathcal{F}\) is a parametric family, that is, that we know the type of distribution \(f_\mathbf{Y}(\mathbf{y})\) up to some unknown parameter(s) \(\theta \in \Theta\), where \(\Theta\) is a parameter space. Typically, we will consider the case where \(y_1, \ldots, y_n\) are the results of independent identical experiments. In this case, \(Y_1, \ldots, Y_n\) can be treated as independent, identically distributed random variables with the common distribution \(f_\theta(y)\) from a parametric family of distributions \(\mathcal{F}_\theta\), \(\theta \in \Theta\).

Define the likelihood function \(L(\theta; \mathbf{y}) = P_\theta(\mathbf{y})\) — the probability of observing the given data \(\mathbf{y}\) for any possible value of \(\theta \in \Theta\). First assume that \(\mathbf{Y}\) is discrete. The value \(L(\theta; \mathbf{y})\) can be viewed as a measure of how likely the value \(\theta\) is given the observed data \(\mathbf{y}\). If \(L(\theta_1; \mathbf{y}) > L(\theta_2; \mathbf{y})\) for a given \(\mathbf{y}\), we can say that the value \(\theta_1\) is better suited to the data than \(\theta_2\). For a continuous random vector \(\mathbf{Y}\), the likelihood is defined through the joint density, \(L(\theta; \mathbf{y}) = f_\theta(\mathbf{y})\), and the likelihood ratio \(L(\theta_1; \mathbf{y})/L(\theta_2; \mathbf{y})\) shows the strength of the evidence in favor of \(\theta = \theta_1\) vs. \(\theta = \theta_2\).
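As a minimal sketch of this idea (assuming a hypothetical Bernoulli sample, not an example from the book), one can compute the likelihood at two candidate values of \(\theta\) and compare them:

```python
import numpy as np

# Hypothetical Bernoulli sample; theta is the success probability.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # 7 successes out of 10

def likelihood(theta, y):
    """L(theta; y) = prod_i theta^y_i * (1 - theta)^(1 - y_i) for i.i.d. Bernoulli data."""
    return np.prod(theta**y * (1 - theta)**(1 - y))

L1, L2 = likelihood(0.7, y), likelihood(0.4, y)
print(L1, L2, L1 / L2)  # L(0.7; y) > L(0.4; y): 0.7 is more likely for this sample
```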

A statistic \(T(\mathbf{Y})\) is any real or vector-valued function that can be computed from the data alone. A statistic \(T(\mathbf{Y})\) is sufficient for an unknown parameter \(\theta\) if the conditional distribution of all the data \(\mathbf{Y}\) given \(T(\mathbf{Y})\) does not depend on \(\theta\). In other words, given \(T(\mathbf{Y})\), no additional information on \(\theta\) can be extracted from \(\mathbf{y}\). This definition allows one to check whether a given statistic \(T(\mathbf{Y})\) is sufficient for \(\theta\), but it does not provide a constructive way to find one.

However, the Fisher-Neyman Factorization Theorem says that a statistic \(T(\mathbf{Y})\) is sufficient for \(\theta\) iff, for all \(\theta \in \Theta\), the likelihood factorizes as \(L(\theta; \mathbf{y}) = g(T(\mathbf{y}), \theta) \cdot h(\mathbf{y})\), where the function \(g(\cdot)\) depends on \(\theta\) and the statistic \(T(\mathbf{Y})\), while \(h(\mathbf{y})\) does not depend on \(\theta\). In particular, if the likelihood \(L(\theta; \mathbf{y})\) depends on the data only through \(T(\mathbf{Y})\), then \(T(\mathbf{Y})\) is a sufficient statistic for \(\theta\) and \(h(\mathbf{y}) = 1\).
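As a quick illustration (the standard Poisson example, not a quote from the book): for \(Y_1, \ldots, Y_n \sim Pois(\lambda)\),

\[
L(\lambda; \mathbf{y}) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} = \underbrace{e^{-n\lambda}\,\lambda^{\sum_{i=1}^n y_i}}_{g(T(\mathbf{y}),\,\lambda)} \cdot \underbrace{\Big(\prod_{i=1}^n y_i!\Big)^{-1}}_{h(\mathbf{y})},
\]

so \(T(\mathbf{Y}) = \sum_{i=1}^n Y_i\) is sufficient for \(\lambda\).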

A sufficient statistic is not unique. For example, the entire sample \(\mathbf{Y} = \{Y_1, \ldots, Y_n\}\) is always a (trivial) sufficient statistic. We may instead seek a minimal sufficient statistic, one achieving the maximal reduction of the data. A statistic \(T(\mathbf{Y})\) is called a minimal sufficient statistic if it is a function of any other sufficient statistic.

Another important property of a statistic is completeness. Let \(Y_1, \ldots, Y_n \sim f_\theta(y)\), where \(\theta \in \Theta\). A statistic \(T(\mathbf{Y})\) is complete if no statistic \(g(T)\) exists (except \(g(T)=0\)) such that \(E_\theta g(T) = 0\) for all \(\theta \in \Theta\). In other words, if \(E_\theta g(T) = 0\) for all \(\theta \in \Theta\), then necessarily \(g(T)=0\). Verifying completeness for a general distribution can be a nontrivial mathematical problem, but thankfully it is much simpler for the exponential family of distributions, which includes many of the “common” distributions. Completeness is useful in determining minimal sufficiency because if a sufficient statistic \(T(\mathbf{Y})\) is complete, then it is also minimal sufficient. (Note, however, that a minimal sufficient statistic may not necessarily be complete.)
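A standard illustration of how completeness is verified (the Bernoulli case, not quoted from the book): for \(Y_1, \ldots, Y_n \sim Bern(p)\), the statistic \(T = \sum_{i=1}^n Y_i \sim Bin(n, p)\) and

\[
E_p\, g(T) = \sum_{t=0}^{n} g(t)\binom{n}{t} p^t (1-p)^{n-t} = 0 \quad \text{for all } p \in (0,1)
\]

is, after dividing by \((1-p)^n\), a polynomial identity in \(p/(1-p)\). All its coefficients \(g(t)\binom{n}{t}\) must therefore vanish, so \(g \equiv 0\) and \(T\) is complete.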

A (generally multivariate) family of distributions \(\{f_\theta(\mathbf{y}): \theta \in \Theta\}\) is said to be a (one-parameter) exponential family if: (1) \(\Theta\) is an open interval, (2) the support of the distribution \(f_\theta\) does not depend on \(\theta\), and (3) \(f_\theta(\mathbf{y}) = \exp\{c(\theta)T(\mathbf{y}) + d(\theta) + S(\mathbf{y})\}\), where \(c(\cdot)\), \(T(\cdot)\), \(d(\cdot)\), and \(S(\cdot)\) are known functions; \(c(\theta)\) is usually called the natural parameter of the distribution. We say that \(f_\theta\), where \(\theta = (\theta_1, \ldots, \theta_p)\), belongs to a \(k\)-parameter exponential family by changing (3) such that \(f_\theta(\mathbf{y}) = \exp\{ \sum_{j=1}^k c_j(\theta)T_j(\mathbf{y}) + d(\theta) + S(\mathbf{y})\}\). The functions \(c(\theta) = \{c_1(\theta), \ldots, c_k(\theta)\}\) are the natural parameters of the distribution. (Note that the dimensionality \(p\) of the original parameter \(\theta\) is not necessarily the same as the dimensionality \(k\) of the natural parameter \(c(\theta)\).)
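For example (standard, not quoted from the book), the Bernoulli distribution forms a one-parameter exponential family:

\[
f_p(y) = p^y (1-p)^{1-y} = \exp\Big\{ y \ln\frac{p}{1-p} + \ln(1-p) \Big\}, \quad y \in \{0, 1\},
\]

with \(T(y) = y\), natural parameter \(c(p) = \ln\frac{p}{1-p}\) (the logit), \(d(p) = \ln(1-p)\), and \(S(y) = 0\).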

Consider a random sample \(Y_1, \ldots, Y_n\), where \(Y_i \sim f_\theta(y)\) and \(f_\theta\) belongs to a \(k\)-parameter exponential family of distributions. Then (1) the joint distribution of \(\mathbf{Y} = (Y_1, \ldots, Y_n)\) also belongs to the \(k\)-parameter exponential family, (2) \(T_\mathbf{Y} = (\sum_{i=1}^n T_1(Y_i), \ldots, \sum_{i=1}^n T_k(Y_i))\) is a sufficient statistic for \(c(\theta) = (c_1(\theta), \ldots, c_k(\theta))\) (and, therefore, for \(\theta\)), and (3) if some regularity conditions hold, then \(T_\mathbf{Y}\) is complete and therefore minimal sufficient (if the latter exists).

2 Point Estimation

Estimation of the unknown parameters of distributions from the data is one of the key issues in statistics. A (point) estimator \(\hat{\theta} = \hat{\theta}(\mathbf{Y})\) of an unknown parameter \(\theta\) is any statistic used for estimating \(\theta\). The value of \(\hat{\theta}(\mathbf{y})\) evaluated for a given sample is called an estimate. This is a general, somewhat trivial definition that does not say anything about the goodness of estimation; one would evidently be interested in “good” estimators.

Maximum Likelihood Estimation is the most widely used method of estimating parameters in parametric models. As we’ve discussed, \(L(\theta; \mathbf{y})\) is a measure of how likely the parameter value \(\theta\) is for the observed data \(\mathbf{y}\). It is only natural then to seek the “most likely” value of \(\theta\). The MLE \(\hat{\theta}\) of \(\theta\) is \(\hat{\theta} = \arg\max_{\theta\in\Theta} L(\theta; \mathbf{y})\) — the value of \(\theta\) that maximizes the likelihood. Often we are interested in a function of a parameter \(\xi = g(\theta)\). When \(g(\cdot)\) is 1:1, the MLE of \(\xi\) is the function \(g(\cdot)\) applied to the MLE of \(\theta\): \(\hat{\xi} = g(\hat{\theta})\); the proof is just a reparameterization of the likelihood in terms of \(\xi\) instead of \(\theta\). Although the concept of the MLE was motivated by an intuitively clear underlying idea, the justification for its use is much deeper. It is a really “good” method of estimation (to be discussed later on the topic of asymptotics).
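A minimal sketch (assuming a hypothetical exponential sample; the closed forms are standard, not quotes from the book) of maximizing the likelihood numerically and of the invariance property:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical i.i.d. Exp(lambda) sample; the MLE is known in closed form: 1 / ybar.
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100)  # true rate lambda = 0.5

def neg_log_lik(lam):
    # -log L(lambda; y), where log L = n*log(lambda) - lambda * sum(y)
    return -(len(y) * np.log(lam) - lam * y.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1 / y.mean())  # numerical MLE vs the closed form 1/ybar
print(1 / res.x, y.mean())  # invariance: MLE of the mean g(lambda) = 1/lambda is ybar
```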

Another popular method of estimation is the Method of Moments. Its main idea is to express the population moments of the distribution of the data in terms of its unknown parameter(s) and equate them to the corresponding sample moments. MMEs have some known problems: consider a sample of size 4 from a uniform distribution \(U(0, \theta)\) with the observed sample 0.2, 0.6, 2, and 0.4. The MME is 1.6, which does not make much sense given the observed value of 2 in the sample. For these and related reasons, MMEs are less used than their MLE counterparts. However, MMEs are usually simpler to compute and can be used, for example, as reasonable initial values in numerical iterative procedures for MLEs. On the other hand, the Method of Moments does not require knowledge of the entire distribution of the data (up to the unknown parameters) but only its moments, and thus may be less sensitive to possible misspecification of a model.
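Reproducing the example above in a few lines (the \(U(0, \theta)\) facts \(E[Y] = \theta/2\) and \(\hat{\theta}_{MLE} = \max_i Y_i\) are standard):

```python
import numpy as np

# The U(0, theta) example: E[Y] = theta/2, so equating moments gives theta_hat = 2 * ybar.
y = np.array([0.2, 0.6, 2.0, 0.4])
mme = 2 * y.mean()  # 1.6 -- smaller than the observed value 2.0, which U(0, 1.6) cannot produce
mle = y.max()       # 2.0 -- the MLE for U(0, theta) is the sample maximum
print(mme, mle)
```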

The Method of Least Squares plays a key role in regression and analysis of variance. In a typical regression setup, we are given \(n\) observations \((\mathbf{x}_i, y_i)\), \(i=1, \ldots, n\) over \(m\) explanatory variables \(\mathbf{x} = (x_1, \ldots, x_m)\) and the response variable \(\mathbf{Y}\). We assume that \(y_i = g_\theta(\mathbf{x}_i) + \varepsilon_i\), \(i = 1, \ldots, n\), where the response function \(g_\theta(\cdot): \mathbb{R}^m \rightarrow \mathbb{R}\) has a known parametric form and depends on \(p \le n\) unknown parameters \(\theta = (\theta_1, \ldots, \theta_p)\). The LSE looks for a \(\hat{\theta}\) that yields the best fit \(g_{\hat{\theta}}(\mathbf{x})\) to the observed \(\mathbf{y}\) w.r.t. the Euclidean distance: \(\hat{\theta} = \arg\min_\theta \sum_{i=1}^n (y_i - g_\theta(\mathbf{x}_i))^2\). For linear regression, the solution is available in closed form. For non-linear regression, however, it can generally only be found numerically.
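A linear-regression sketch (hypothetical data; the closed form \(\hat{\theta} = (X^T X)^{-1} X^T \mathbf{y}\) is the standard normal-equations solution):

```python
import numpy as np

# Linear case: g_theta(x) = x @ theta, so the LSE solves the normal equations.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
theta_true = np.array([1.0, 2.0, -0.5])
y = X @ theta_true + rng.normal(scale=0.3, size=n)

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # numerically stable least squares
print(theta_hat)  # close to theta_true
```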

More generally than the three procedures above, one can consider any function \(\rho(\theta, y)\) as a measure of the goodness-of-fit and look for an estimator that maximizes or minimizes \(\sum_{i=1}^n \rho(\theta, y_i)\) w.r.t. \(\theta\). Such estimators are called M-estimators. It turns out that various well-known estimators can be viewed as M-estimators for a particular \(\rho(\theta, y)\), including the sample mean \(\bar{Y}\) as well as the MLE, the LSE, and a generalized version of the MME not yet discussed. As we’ll see when discussing asymptotics, M-estimators share many important asymptotic properties.
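For instance, taking \(\rho(\theta, y) = (y - \theta)^2\) recovers the sample mean:

\[
\frac{d}{d\theta}\sum_{i=1}^n (y_i - \theta)^2 = -2\sum_{i=1}^n (y_i - \theta) = 0 \;\Rightarrow\; \hat{\theta} = \bar{y},
\]

while \(\rho(\theta, y) = -\ln f_\theta(y)\) recovers the MLE and \(\rho(\theta, y) = (y - g_\theta(\mathbf{x}))^2\) the LSE.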

A natural question is how to compare various estimators. First, we should define a measure of goodness-of-estimation. Recall that any estimator \(\hat{\theta} = \hat{\theta}(Y_1, \ldots, Y_n)\) is a function of a random sample and therefore is a random variable itself with a certain distribution, expectation, variance, etc. A somewhat naive attempt to measure the goodness-of-estimation of \(\hat{\theta}\) would be to consider the error \(|\hat{\theta}-\theta|\). However, \(\theta\) is unknown and, as we have mentioned, an estimator \(\hat{\theta}\) is a random variable, hence the value \(|\hat{\theta}-\theta|\) will vary from sample to sample. It may be “small” for some samples and “large” for others, and therefore cannot be used as a proper criterion for the goodness-of-estimation of an estimator \(\hat{\theta}\). A more reasonable measure would be an average distance over all possible samples, that is, the mean absolute error \(E|\hat{\theta}-\theta|\), where the expectation is taken w.r.t. the joint distribution of \(\mathbf{Y} = (Y_1, \ldots, Y_n)\). It can indeed be used as a measure of goodness-of-estimation but, mostly due to the convenience of differentiation, the conventional measure is the mean squared error (MSE) given by \(MSE(\hat{\theta}, \theta) = E(\hat{\theta}-\theta)^2\).

The MSE can be decomposed into two components: \(MSE(\hat{\theta}, \theta) = Var(\hat{\theta}) + b^2(\hat{\theta},\theta)\). The first is the stochastic error (variance) and the second is a systematic or deterministic error (squared bias). Having defined the goodness-of-estimation measure by the MSE, one can compare different estimators and choose the one with the smallest MSE. However, since \(MSE(\hat{\theta}, \theta)\) typically depends on the unknown \(\theta\), it is common that no estimator is uniformly superior for all \(\theta \in \Theta\).
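The decomposition follows by adding and subtracting \(E\hat{\theta}\):

\[
E(\hat{\theta} - \theta)^2 = E\big((\hat{\theta} - E\hat{\theta}) + (E\hat{\theta} - \theta)\big)^2 = E(\hat{\theta} - E\hat{\theta})^2 + (E\hat{\theta} - \theta)^2 = Var(\hat{\theta}) + b^2(\hat{\theta}, \theta),
\]

where \(b(\hat{\theta}, \theta) = E\hat{\theta} - \theta\) is the bias; the cross term vanishes since \(E(\hat{\theta} - E\hat{\theta}) = 0\) while \(E\hat{\theta} - \theta\) is a constant.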

Ideally, a good estimator with a small MSE should have both low variance and low bias. However, it might be hard to achieve both. One common approach is to first control the bias component of the overall MSE and consider unbiased estimators. There is no general rule or algorithm for deriving an unbiased estimator. In fact, unbiasedness is a property of an estimator rather than a method of estimation. One usually checks an MLE or any other estimator for bias. Sometimes one can then modify the original estimator to “correct” its bias. Note that unlike ML estimation, unbiasedness is not invariant under nonlinear transformations of the original parameter: if \(\hat{\theta}\) is an unbiased estimator of \(\theta\), \(g(\hat{\theta})\) is generally a biased estimator of \(g(\theta)\). For example, \(E\hat{\theta}^2 = \theta^2 + Var(\hat{\theta}) > \theta^2\) whenever \(Var(\hat{\theta}) > 0\), so \(\hat{\theta}^2\) systematically overestimates \(\theta^2\).

What does unbiasedness of an estimator \(\hat{\theta}\) actually mean? Suppose we were observing not a single sample but all possible samples of size \(n\) from a sample space and were calculating the estimates \(\hat{\theta}_j\) for each one of them. Unbiasedness means that the average value of \(\hat{\theta}\) over the entire sample space is \(\theta\), but it does not yet guarantee that \(\hat{\theta}_j \approx \theta\) for each particular sample. The dispersion of the \(\hat{\theta}_j\)’s around their average value \(\theta\) might be large and, since in reality we have only a single sample, its particular value of \(\hat{\theta}\) might be quite far from \(\theta\). To ensure with high confidence that \(\hat{\theta}_j \approx \theta\) for any sample, we additionally need the variance \(Var(\hat{\theta})\) to be small.

An estimator \(\hat{\theta}\) is called a uniformly minimum variance unbiased estimator (UMVUE) of \(\theta\) if \(\hat{\theta}\) is unbiased and for any other unbiased estimator \(\tilde{\theta}\) of \(\theta\), \(Var(\hat{\theta}) \le Var(\tilde{\theta})\). If the UMVUE exists, it is necessarily unique. Recall that there is no general algorithm for obtaining unbiased estimators, and a UMVUE in particular. However, there exists a lower bound on the variance of an unbiased estimator, which can be used as a benchmark for evaluating its goodness.

Define the Fisher Information Number \(I(\theta) = E((\ln f_\theta(\mathbf{Y}))'_\theta)^2\). The derivative of the log density w.r.t. \(\theta\) is sometimes called the Score Function; thus, the Fisher Information Number is the expected square of the Score. The Cramer-Rao Lower Bound Theorem states that if \(T\) is an unbiased estimator of \(g(\theta)\), where \(g(\cdot)\) is differentiable, then \(Var(T) \ge (g'(\theta))^2 / I(\theta)\) or, more simply, when \(T\) is an unbiased estimator of \(\theta\), \(Var(T) \ge 1 / I(\theta)\). We are especially interested in the case where \(Y_1, \ldots, Y_n\) is a random sample from a distribution \(f_\theta(y)\). In that case, \(I(\theta) = nI^*(\theta)\), where \(I^*(\theta) = E((\ln f_\theta(Y))'_\theta)^2\) is the Fisher Information Number of \(f_\theta(y)\), and, therefore, for any unbiased estimator \(T\) of \(g(\theta)\), we have \(Var(T) \ge (g'(\theta))^2 / nI^*(\theta)\). There is another, usually more convenient formula for calculating the Fisher Information Number \(I(\theta)\) than its direct definition: \(I(\theta) = -E(\ln f_\theta(\mathbf{Y}))''_\theta\).
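A worked Bernoulli illustration (standard, not quoted from the book): for \(Y \sim Bern(p)\), \(\ln f_p(y) = y\ln p + (1-y)\ln(1-p)\), so

\[
(\ln f_p(y))'_p = \frac{y}{p} - \frac{1-y}{1-p} = \frac{y - p}{p(1-p)}, \qquad I^*(p) = \frac{E(Y-p)^2}{p^2(1-p)^2} = \frac{p(1-p)}{p^2(1-p)^2} = \frac{1}{p(1-p)}.
\]

The sample mean \(\bar{Y}\) is unbiased for \(p\) with \(Var(\bar{Y}) = p(1-p)/n = 1/(nI^*(p))\), so it attains the Cramer-Rao lower bound.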

It is important to emphasize that the CRLB theorem goes in one direction only: if the variance of an unbiased estimator does not achieve the Cramer-Rao lower bound, one still cannot claim that it is not a UMVUE. Nevertheless, the bound can be used as a benchmark for measuring the goodness of an unbiased estimator. One special result related to the exponential family: the Cramer-Rao lower bound can be achieved only for distributions from the exponential family.

The Cramer-Rao lower bound only allows one to evaluate the goodness of a proposed unbiased estimator but does not provide any constructive way to derive it. In fact, as we have argued, there is no such general rule at all. However, if one manages to obtain any initial (even crude) unbiased estimator, it may be possible to improve it. The Rao-Blackwell Theorem shows that if there is an unbiased estimator that is not a function of a sufficient statistic \(W\), one can construct another unbiased estimator based on \(W\) with an MSE not larger than the original one: let \(T\) be an unbiased estimator of \(\theta\) and \(W\) be a sufficient statistic for \(\theta\), and define \(T_1 = E(T|W)\); then \(T_1\) is an unbiased estimator of \(\theta\) and \(Var(T_1) \le Var(T)\). Thus, in terms of MSE, only unbiased estimators based on a sufficient statistic are of interest. This demonstrates again the strength of the notion of sufficiency.
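A simulation sketch of Rao-Blackwellization (the Poisson example is a textbook classic, though not quoted from the book): estimate \(g(\theta) = P_\theta(Y = 0) = e^{-\theta}\) for a \(Pois(\theta)\) sample, starting from the crude unbiased estimator \(T = 1\{Y_1 = 0\}\) and conditioning on the sufficient statistic \(W = \sum_i Y_i\):

```python
import numpy as np

# Given W = w, Y_1 ~ Bin(w, 1/n), so T1 = E(T | W) = (1 - 1/n)^W.
rng = np.random.default_rng(2)
theta, n, reps = 1.5, 20, 100_000
Y = rng.poisson(theta, size=(reps, n))

T = (Y[:, 0] == 0).astype(float)   # crude unbiased estimator of e^{-theta}
T1 = (1 - 1 / n) ** Y.sum(axis=1)  # Rao-Blackwellized estimator

print(np.exp(-theta))              # target: e^{-1.5} ~ 0.223
print(T.mean(), T1.mean())         # both approximately unbiased
print(T.var(), T1.var())           # Var(T1) is much smaller than Var(T)
```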

Does Rao-Blackwellization necessarily yield a UMVUE? Generally not. To guarantee a UMVUE, the additional requirement of completeness of the sufficient statistic \(W\) is needed. The Lehmann-Scheffe Theorem formalizes this: if \(T\) is an unbiased estimator of \(\theta\) and \(W\) is a complete sufficient statistic for \(\theta\), then \(T_1 = E(T|W)\) is the unique UMVUE of \(\theta\). Even without the Lehmann-Scheffe theorem, it can be shown under mild conditions that if the distribution of the data belongs to the exponential family and an unbiased estimator is a function of the corresponding sufficient statistic, it is a UMVUE. Note that despite its elegance, the application of the Rao-Blackwell Theorem in more complicated cases is quite limited. The two main obstacles are finding an initial unbiased estimator \(T\) and calculating the conditional expectation \(E(T|W)\).

3 Confidence Intervals, Bounds, and Regions

When we estimate a parameter, we essentially “guess” its value. This should be a “well educated guess,” hopefully the best of its kind (in whatever sense). However, statistical estimation is made with error. Presenting just the estimator, no matter how good it is, is usually not enough – one should give an idea of the estimation error. The words “estimation” and “estimate” (in contrast to “measure” and “value”) point to the fact that an estimate is an inexact appraisal of a value. Much effort in statistics is invested in trying to quantify the estimation error and in finding ways to express it in a well-defined way.

Any estimator has error; that is, if \(\theta\) is estimated by \(\hat{\theta}\), then \(\hat{\theta} - \theta\) is usually different from \(0\). The standard measure of error is the mean squared error \(MSE = E(\hat{\theta} - \theta)^2\). When the MSE is reported together with the value of the estimator, we get a feeling for how precise the estimate is. It is more common to quote the standard error defined by \(SE(\hat{\theta}) = \sqrt{MSE(\hat{\theta}, \theta)} = \sqrt{E(\hat{\theta}-\theta)^2}\). Unlike the MSE, the SE is measured in the same units as the estimator.

However, the MSE expresses an average squared estimation error but tells nothing about its distribution, which generally might be complicated. One would be interested, in particular, in the probability \(P(|\hat{\theta}-\theta| > c)\) that the estimation error exceeds a certain accuracy level \(c > 0\). Markov’s Inequality (applied to the squared error) enables us to translate the MSE into an upper bound on this probability: \(P(|\hat{\theta}-\theta| \ge c) \le MSE/c^2\). However, this bound is usually very conservative and the actual probability may be much smaller. Typically, a quite close approximation of the error distribution is obtained by the Central Limit Theorem (discussed later).
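A quick check of how conservative the bound is (a sketch assuming the error is exactly normal, \(\hat{\theta} - \theta \sim N(0, SE^2)\)):

```python
import numpy as np
from scipy.stats import norm

se = 0.2
for c in [0.2, 0.4, 0.6]:
    bound = se**2 / c**2                # Markov-type bound: MSE / c^2
    exact = 2 * (1 - norm.cdf(c / se))  # exact normal tail probability
    print(c, bound, exact)
# At c = 0.4 (two SEs) the bound gives 0.25, while the normal probability is ~0.046.
```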

Above, we suggested quoting the SE alongside the estimator itself. However, standard statistical practice is different, and it is not very intuitive. There are a few conceptual difficulties in being precise when talking about the error. Suppose we estimate \(\mu\) with \(\bar{Y}\) and we find \(\bar{Y}=10\) and \(SE=0.2\). We are quite confident that \(\mu\) is between \(9.6\) and \(10.4\) (that is, within two standard errors of \(\bar{Y}\)). However, this statement makes no probabilistic sense. The true \(\mu\) is either in the interval \((9.6,10.4)\) or it’s not. The unknown \(\mu\) is not a random variable – it has a single fixed value, whether or not it is known to the statistician.

The “trick” that has been devised is to move from a probabilistic statement about the unknown (but fixed-valued) parameter to a probabilistic statement about the method. We say something like: “The interval \((9.6,10.4)\) for the value \(\mu\) was constructed by a method which is 95% successful.” Since the method is usually successful, we have confidence in its output.

Let \(Y_1, \ldots, Y_n \sim f_\theta(y)\), \(\theta \in \Theta\). A \((1-\alpha)100\%\) Confidence Interval for \(\theta\) is a pair of scalar-valued statistics \(L=L(Y_1, \ldots, Y_n)\) and \(U=U(Y_1, \ldots, Y_n)\) such that \(P(L \le \theta \le U) \ge 1-\alpha\) for all \(\theta \in \Theta\), with the inequality as close to an equality as possible.

What is the right value of \(\alpha\)? Nothing in the statistical theory dictates a particular choice. But the standard values are \(0.10\), \(0.05\), and \(0.01\) with \(0.05\) being most common.

A pivot is a function \(\psi(Y_1, \ldots, Y_n; \theta)\) of the data and the parameters whose distribution does not depend on unknown parameters. Note that the pivot is not a statistic and cannot be calculated from the data, precisely because it depends on the unknown parameters. On the other hand, since its distribution does not depend on the unknown parameters, we can find an interval \(A_\alpha\) such that \(P(\psi(Y_1, \ldots, Y_n; \theta) \in A_\alpha) = 1 - \alpha\). Then we can invert the inclusion and define the interval (or region in general) \(C_\alpha = \{\theta : \psi(Y_1, \ldots, Y_n; \theta) \in A_\alpha \}\). The set \(C_\alpha\) is then a \((1-\alpha)100\%\) confidence set. Note that \(C_\alpha\) is a random set because it depends on the random sample.
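A concrete sketch for a normal sample (the t pivot is standard; the data here are simulated): \((\bar{Y} - \mu)/(S/\sqrt{n}) \sim t_{n-1}\) regardless of \((\mu, \sigma)\), so inverting \(|\psi| \le t_{n-1, 1-\alpha/2}\) yields the usual confidence interval for \(\mu\):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
y = rng.normal(loc=10, scale=2, size=25)
n, alpha = len(y), 0.05

ybar, s = y.mean(), y.std(ddof=1)   # sample mean and sample standard deviation
q = t.ppf(1 - alpha / 2, df=n - 1)  # t_{n-1, 1 - alpha/2} quantile
L, U = ybar - q * s / np.sqrt(n), ybar + q * s / np.sqrt(n)
print(L, U)                         # a 95% confidence interval for mu
```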

Does a pivot always exist? For a random sample \(Y_1, \ldots, Y_n\) from any continuous distribution with a cdf \(F_\theta(y)\), the value \(-\sum_{i=1}^n \ln F_\theta(Y_i) \sim \frac{1}{2} \chi^2_{2n}\) and, therefore, is a pivot. However, in the general case, it might be difficult (if at all possible) to invert the corresponding confidence interval for this pivot into a confidence interval for the original parameter of interest \(\theta\). More convenient pivots are easy to find when the distribution belongs to a scale-location family.
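One can verify this universal pivot by simulation (a sketch assuming an exponential sample; since \(F_\theta(Y_i) \sim U(0,1)\), the sum of the \(-\ln F_\theta(Y_i)\) terms is \(Gamma(n, 1) = \frac{1}{2}\chi^2_{2n}\)):

```python
import numpy as np
from scipy.stats import expon, chi2, kstest

rng = np.random.default_rng(4)
n, reps = 10, 50_000
Y = rng.exponential(scale=3.0, size=(reps, n))  # F_theta is fully known here
pivot = -np.log(expon.cdf(Y, scale=3.0)).sum(axis=1)

# Twice the pivot should follow chi^2_{2n}; the KS test should not reject.
print(kstest(2 * pivot, chi2(df=2 * n).cdf))
```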

Is a confidence interval for \(\theta\) necessarily unique? Definitely not! Different choices of pivots lead to different forms of confidence intervals. Moreover, even for a given pivot, one can typically construct an infinite set of confidence intervals at the same confidence level. What is the “best” choice for a confidence interval? A conventional, somewhat ad hoc approach is based on error symmetry: set \(P(L>\theta) = P(U<\theta) = \alpha/2\). A more appealing approach would be to seek a confidence interval of a minimal expected length; however, in general, this leads to a nonlinear minimization problem that might not have a solution in closed form.

A parameter of interest may be not the original parameter \(\theta\) but its function \(g(\theta)\). Let \((L,U)\) be a \((1-\alpha)100\%\) confidence interval for \(\theta\) and \(g(\cdot)\) a strictly increasing function. Then \((g(L),g(U))\) is a \((1-\alpha)100\%\) confidence interval for \(g(\theta)\).

The normal confidence intervals are undoubtedly the most important. We do not claim that most data sets are sampled from a normal distribution – this is definitely far from being true. However, what really matters is whether the distribution of an estimator \(\hat{\theta} = \hat{\theta}(Y_1, \ldots, Y_n)\) is close to normal rather than the distribution of \(Y_1, \ldots, Y_n\) themselves. When discussing asymptotics later, we argue that many estimators based on large or even medium sized samples are indeed approximately normal and, therefore, the normal confidence intervals can (at least approximately) be used.

4 Hypothesis Testing

To be added

5 Asymptotic Analysis

To be added

6 Bayesian Inference

To be added

7 Elements of Statistical Decision Theory

To be added

8 Linear Models

To be added

9 Nonparametric Estimation

To be added