CH 7 and 9 Estimation and
Statistical Inference
Recall our various clinical trial examples. What would we say is the probability that
a future patient will respond successfully to treatment after we observe the results
from a collection of other patients? This is the kind of question that statistical
inference is designed to address. In general, statistical inference consists of making
probabilistic statements about unknown quantities. For example, we can compute
means, variances, quantiles, probabilities, and some other quantities yet to be
introduced concerning unobserved random variables and unknown parameters
of distributions. Our goal will be to say what we have learned about the unknown
quantities after observing some data that we believe contain relevant information.
Here are some other examples of questions that statistical inference can try to
answer. What can we say about whether a machine is functioning properly after we
observe some of its output? In a civil lawsuit, what can we say about whether there
was discrimination after observing how different ethnic groups were treated? The
methods of statistical inference, which we shall develop to address these questions,
are built upon the theory of probability covered in the earlier chapters of this text.
Probability and Statistical Models In the earlier chapters of this book, we discussed the theory and methods of probability. As new concepts in probability were introduced, we also introduced examples of the use of these concepts in problems that we shall now recognize as statistical inference. Before discussing statistical inference formally, it is useful to remind ourselves of those probability concepts that will underlie inference.
There is another group of statisticians who believe that in many problems it is not appropriate to assign a distribution to a parameter but claim instead that the true value of the parameter is a certain fixed number whose value happens to be unknown to the experimenter. These statisticians would assign a distribution to a parameter only when there is extensive previous information about the relative frequencies with which similar parameters have taken each of their possible values in past experiments. If two different scientists could agree on which past experiments were similar to the present experiment, then they might agree on a distribution to be assigned to the parameter. For example, suppose that the proportion θ of defective items in a certain large manufactured lot is unknown. Suppose also that the same manufacturer has produced many such lots of items in the past and that detailed records have been kept about the proportions of defective items in past lots. The relative frequencies for past lots could then be used to construct a distribution for θ. Statisticians who would argue this way are said to adhere to the frequentist philosophy of statistics and are called frequentists. The frequentists rely on the assumption that there exist infinite sequences of random variables in order to make sense of most of their probability statements. Once one assumes the existence of such an infinite sequence, one finds that the parameters of the distributions being used are limits of functions of the infinite sequences, just as do the Bayesians described above. In this way, the parameters are random variables because they are functions of random variables. The point of disagreement between the two groups is whether it is useful or even possible to assign a distribution to such parameters. Both Bayesians and frequentists agree on the usefulness of families of distributions for observations indexed by parameters. 
Bayesians refer to the distribution indexed by parameter value θ as the conditional distribution of the observations given that the parameter equals θ. Frequentists refer to the distribution indexed by θ as the distribution of the observations when θ is the true value of the parameter. The two groups agree that whenever a distribution can be assigned to a parameter, the theory and methods to be described in this chapter are applicable and useful. In Sections 7.2–7.4, we shall explicitly assume that each parameter is a random variable and we shall assign it a distribution that represents the probabilities that the parameter lies in various subsets of the parameter space. Beginning in Sec. 7.5, we shall consider techniques of estimation that are not based on assigning distributions to parameters.
7.2 Prior and Posterior Distributions The distribution of a parameter before observing any data is called the prior distribution of the parameter. The conditional distribution of the parameter given the observed data is called the posterior distribution. If we plug the observed values of the data into the conditional p.f. or p.d.f. of the data given the parameter, the result is a function of the parameter alone, which is called the likelihood function.
Summary The prior distribution of a parameter describes our uncertainty about the parameter before observing any data. The likelihood function is the conditional p.d.f. or p.f. of the data given the parameter when regarded as a function of the parameter with the observed data plugged in. The likelihood tells us how much the data will alter our uncertainty. Large values of the likelihood correspond to parameter values where the posterior p.d.f. or p.f. will be higher than the prior. Low values of the likelihood occur at parameter values where the posterior will be lower than the prior. The posterior distribution of the parameter is the conditional distribution of the parameter given the data. It is obtained using Bayes’ theorem for random variables, which we first saw on page 148. We can predict future observations that are conditionally independent of the observed data given θ by using the conditional version of the law of total probability that we saw on page 163.
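The relationship between prior, likelihood, and posterior can be sketched numerically. The example below is only an illustration under assumed inputs that do not come from the text: 7 successes and 3 failures in a clinical trial, and a uniform prior on the success probability θ. It evaluates prior times likelihood on a grid and normalizes the result.

```python
# Grid approximation of the posterior p.d.f. of a success probability theta.
# Assumptions (not from the text): 7 successes, 3 failures, uniform prior on [0, 1].

def grid_posterior(successes, failures, grid_size=1001):
    """Return (grid, posterior p.d.f. values) on an evenly spaced grid in [0, 1]."""
    step = 1.0 / (grid_size - 1)
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # uniform prior p.d.f.
    # Likelihood: the Bernoulli p.f. of the data, viewed as a function of theta.
    likelihood = [t ** successes * (1.0 - t) ** failures for t in grid]
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    normalizer = sum(unnormalized) * step  # Riemann-sum approximation of the integral
    posterior = [u / normalizer for u in unnormalized]
    return grid, posterior


grid, posterior = grid_posterior(successes=7, failures=3)
step = grid[1] - grid[0]
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * step
posterior_mode = grid[posterior.index(max(posterior))]
print(f"posterior mean ~ {posterior_mean:.3f}, posterior mode ~ {posterior_mode:.3f}")
```

Under these assumptions the posterior mean is also the predictive probability that a future patient, conditionally independent of the observed ones given θ, responds successfully.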
7.3 Conjugate Prior Distributions For each of the most popular statistical models, there exists a family of distributions for the parameter with a very special property. If the prior distribution is chosen to be a member of that family, then the posterior distribution will also be a member of that family. Such a family of distributions is called a conjugate family. Choosing a prior distribution from a conjugate family will typically make it particularly simple to calculate the posterior distribution.
Summary For each of several different statistical models for data given the parameter, we found a conjugate family of distributions for the parameter. These families have the property that if the prior distribution is chosen from the family, then the posterior distribution is a member of the family. For data with distributions related to the Bernoulli, such as binomial, geometric, and negative binomial, the conjugate family for the success probability parameter is the family of beta distributions. For data with distributions related to the Poisson process, such as Poisson, gamma (with known first parameter), and exponential, the conjugate family for the rate parameter is the family of gamma distributions. For data having a normal distribution with known variance, the conjugate family for the mean is the normal family. We also described the use of improper priors. Improper priors are not true probability distributions, but if we pretend that they are, we will compute posterior distributions that approximate the posteriors that we would have obtained using proper conjugate priors with extreme values of the prior hyperparameters.
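With a conjugate prior, computing the posterior reduces to updating the hyperparameters. The sketch below collects the standard update rules for three of the cases named above. The function names and the sample inputs are illustrative assumptions, and the gamma distribution is parameterized by shape and rate.

```python
# Conjugate hyperparameter updates. Function names and inputs are hypothetical.

def beta_update(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior with Bernoulli data gives a beta posterior."""
    return alpha + successes, beta + failures


def gamma_update(alpha, beta, counts):
    """Gamma(shape alpha, rate beta) prior with Poisson counts gives a gamma posterior."""
    return alpha + sum(counts), beta + len(counts)


def normal_update(prior_mean, prior_var, data, known_var):
    """Normal prior on the mean, normal data with known variance: normal posterior."""
    n = len(data)
    sample_mean = sum(data) / n
    post_var = 1.0 / (1.0 / prior_var + n / known_var)
    post_mean = post_var * (prior_mean / prior_var + n * sample_mean / known_var)
    return post_mean, post_var


print(beta_update(1, 1, successes=7, failures=3))
print(gamma_update(2, 1, counts=[3, 0, 2]))
print(normal_update(0.0, 1.0, data=[2.0, 2.0], known_var=1.0))
```

In each case the posterior stays in the same family as the prior, so no integration is needed.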
7.4 Bayes Estimators An estimator of a parameter is some function of the data that we hope is close to the parameter. A Bayes estimator is an estimator that is chosen to minimize the posterior mean of some measure of how far the estimator is from the parameter, such as squared error or absolute error.
Summary An estimator of a parameter θ is a function δ of the data X. If X = x is observed, the value δ(x) is called our estimate, the observed value of the estimator δ(X). A loss function L(θ, a) is designed to measure how costly it is to use the value a to estimate θ. A Bayes estimator δ*(X) is chosen so that a = δ*(x) provides the minimum value of the posterior mean of L(θ, a). That is, E[L(θ, δ*(x)) | x] = min_a E[L(θ, a) | x]. If the loss is squared error, L(θ, a) = (θ − a)^2, then δ*(x) is the posterior mean of θ, E(θ | x). If the loss is absolute error, L(θ, a) = |θ − a|, then δ*(x) is a median of the posterior distribution of θ. For other loss functions, locating the minimum might have to be done numerically.
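The two special cases can be checked numerically by minimizing the posterior expected loss directly. The sketch below assumes a small made-up discrete posterior (the support points and probabilities are not from the text) and searches a grid of candidate estimates a.

```python
# Numerical check that the posterior mean minimizes expected squared error and a
# posterior median minimizes expected absolute error.
# The discrete posterior below is a made-up illustration.

support = [0.1, 0.2, 0.5, 0.9]   # possible values of theta
probs = [0.3, 0.3, 0.2, 0.2]     # posterior probabilities of those values


def posterior_expected_loss(a, loss):
    """Posterior mean of loss(theta, a) under the discrete posterior above."""
    return sum(p * loss(theta, a) for theta, p in zip(support, probs))


def squared_error(theta, a):
    return (theta - a) ** 2


def absolute_error(theta, a):
    return abs(theta - a)


candidates = [i / 1000 for i in range(1001)]  # candidate estimates a in [0, 1]
bayes_squared = min(candidates, key=lambda a: posterior_expected_loss(a, squared_error))
bayes_absolute = min(candidates, key=lambda a: posterior_expected_loss(a, absolute_error))
posterior_mean = sum(theta * p for theta, p in zip(support, probs))

print(bayes_squared, posterior_mean)  # minimizer under squared error vs. posterior mean
print(bayes_absolute)                 # minimizer under absolute error
```

For this posterior the cumulative probability first reaches 0.5 at θ = 0.2, so 0.2 is the median, and the grid search for absolute error lands on the same point. The same brute-force search is one simple option for loss functions that have no closed-form minimizer.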
7.5 Maximum Likelihood Estimators Maximum likelihood estimation is a method for choosing estimators of parameters that avoids using prior distributions and loss functions. It chooses as the estimate of θ the value of θ that provides the largest value of the likelihood function.
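As a small illustration, suppose service times are modeled as exponential with rate θ (the data values below are made up). The log-likelihood is n log θ − θ Σx, which is maximized at θ = n / Σx. The sketch compares that closed form with a generic numerical maximizer. The ternary search is just one simple choice that works here because the log-likelihood has a single peak.

```python
import math

# Maximum likelihood for the rate of an exponential distribution.
# The data are hypothetical service times.

def exponential_log_likelihood(theta, data):
    """Log of the likelihood theta^n * exp(-theta * sum(data))."""
    return len(data) * math.log(theta) - theta * sum(data)


def maximize_unimodal(f, lo, hi, iterations=200):
    """Ternary search for the maximizer of a single-peaked function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


service_times = [0.5, 1.5, 1.0, 2.0, 1.0]
numeric_mle = maximize_unimodal(
    lambda t: exponential_log_likelihood(t, service_times), 1e-6, 50.0
)
closed_form_mle = len(service_times) / sum(service_times)
print(f"numeric: {numeric_mle:.4f}, closed form: {closed_form_mle:.4f}")
```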
7.6 Properties of Maximum Likelihood Estimators In this section, we explore several properties of M.L.E.’s, including:
• The relationship between the M.L.E. of a parameter and the M.L.E. of a function of that parameter
• The need for computational algorithms
• The behavior of the M.L.E. as the sample size increases
• The lack of dependence of the M.L.E. on the sampling plan
We also introduce a popular alternative method of estimation (method of moments) that sometimes agrees with maximum likelihood, but can sometimes be computationally simpler.
Summary The M.L.E. of a function g(θ) is g(θ̂), where θ̂ is the M.L.E. of θ. For example, if θ is the rate at which customers are served in a queue, then 1/θ is the average service time. The M.L.E. of 1/θ is 1 over the M.L.E. of θ. Sometimes we cannot find a closed form expression for the M.L.E. of a parameter and we must resort to numerical methods to find or approximate the M.L.E. In most problems, the sequence of M.L.E.’s, as sample size increases, converges in probability to the parameter. When data are collected in such a way that the decision to stop collecting data is based solely on the data already observed or on other considerations that are not related to the parameter, then the M.L.E. will not depend on the sampling plan. That is, if two different sampling plans lead to proportional likelihood functions, then the value of θ that maximizes one likelihood will also maximize the other.
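The point about sampling plans can be illustrated with a standard pair of designs. The numbers are assumptions for the sketch: 3 successes in 12 Bernoulli trials, observed either with the number of trials fixed in advance (binomial) or by sampling until the third success (negative binomial). The two likelihoods differ only by a constant factor, so they are maximized at the same value of p.

```python
import math

# Two sampling plans that give proportional likelihoods for a success probability p.
# The counts (3 successes in 12 trials) are hypothetical.

def binomial_likelihood(p, trials=12, successes=3):
    """Number of trials fixed in advance."""
    return math.comb(trials, successes) * p ** successes * (1 - p) ** (trials - successes)


def negative_binomial_likelihood(p, trials=12, successes=3):
    """Sample until the last required success, which occurs on the final trial."""
    return (math.comb(trials - 1, successes - 1)
            * p ** successes * (1 - p) ** (trials - successes))


grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p in (0, 1)
mle_binomial = max(grid, key=binomial_likelihood)
mle_negative_binomial = max(grid, key=negative_binomial_likelihood)
print(mle_binomial, mle_negative_binomial)

# Invariance: the M.L.E. of the odds p / (1 - p) is obtained by plugging in the M.L.E. of p.
mle_odds = mle_binomial / (1 - mle_binomial)
print(mle_odds)
```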
7.7 Sufficient Statistics In the first six sections of this chapter, we presented some inference methods that are based on the posterior distribution of the parameter or on the likelihood function alone. There are other inference methods that are based neither on the posterior distribution nor on the likelihood function. These methods are based on the conditional distributions of various functions of the data (i.e., statistics) given the parameter. There are many statistics available in a given problem, some more useful than others. Sufficient statistics turn out to be the most useful in some sense.
7.8 Jointly Sufficient Statistics When a parameter θ is multidimensional, sufficient statistics will typically need to be multidimensional as well. Sometimes, no one-dimensional statistic is sufficient even when θ is one-dimensional. In either case, we need to extend the concept of sufficient statistic to deal with cases in which more than one statistic is needed in order to be sufficient.
8.1 The Sampling Distribution of a Statistic A statistic is a function of some observable random variables, and hence is itself a random variable with a distribution. That distribution is its sampling distribution, and it tells us what values the statistic is likely to assume and how likely it is to assume those values prior to observing our data. When the distribution of the observable data is indexed by a parameter, the sampling distribution is specified as the distribution of the statistic for a given value of the parameter.
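A sampling distribution can be approximated by simulation. The sketch below assumes normal observations with mean θ = 10, standard deviation 2, and sample size 25 (all illustrative choices), draws many samples, and records the sample mean of each. Theory says the sample mean is then normal with mean θ and standard deviation 2/√25 = 0.4, and the simulated values should come out close to that.

```python
import random
import statistics

# Monte Carlo approximation of the sampling distribution of the sample mean.
# theta, sigma, n, and the number of replications are illustrative assumptions.

def simulate_sample_means(theta, sigma, n, replications, seed=0):
    """Return one sample mean per simulated sample of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(replications):
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    return means


sample_means = simulate_sample_means(theta=10.0, sigma=2.0, n=25, replications=10000)
center = sum(sample_means) / len(sample_means)
spread = statistics.stdev(sample_means)
print(f"mean of sample means ~ {center:.2f}, standard deviation ~ {spread:.2f}")
```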
Chapter 9 Hypothesis Testing
In general, hypothesis testing concerns trying to decide whether a parameter θ lies in one subset of the parameter space or in its complement. When θ is one-dimensional, at least one of the two subsets will typically be an interval, possibly degenerate. In this section, we introduce the notation and some common methodology associated with hypothesis testing. We also demonstrate an equivalence between hypothesis tests and confidence intervals.
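The equivalence between tests and confidence intervals can be sketched in the simplest setting, which is an assumption made here for illustration: normal data with known standard deviation. A level 0.05 two-sided test of the hypothesis that the mean equals μ0 rejects exactly when μ0 falls outside the 95 percent confidence interval. The data and the value of sigma below are made up.

```python
import math
from statistics import NormalDist

# Duality between a two-sided z test and a confidence interval.
# Assumes normal data with KNOWN standard deviation; data and sigma are hypothetical.

def z_interval(data, sigma, level=0.95):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = sum(data) / len(data)
    half_width = z * sigma / math.sqrt(len(data))
    return center - half_width, center + half_width


def z_test_rejects(data, sigma, mu0, alpha=0.05):
    """Two-sided level-alpha test of the hypothesis that the mean equals mu0."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    statistic = (sum(data) / len(data) - mu0) / (sigma / math.sqrt(len(data)))
    return abs(statistic) > z


data = [9.8, 10.4, 10.1, 9.6, 10.6, 10.3, 9.9, 10.2, 10.5]
low, high = z_interval(data, sigma=0.5)
for mu0 in [9.5, 10.0, 10.4, 10.8]:
    inside = low <= mu0 <= high
    print(mu0, "inside interval:", inside, "| test rejects:", z_test_rejects(data, 0.5, mu0))
```

Each hypothesized value is either inside the interval and not rejected, or outside the interval and rejected.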
9.5 The t Test We begin the treatment of several special cases of testing hypotheses about parameters of a normal distribution. In this section, we handle the case in which both the mean and the variance are unknown. We develop tests for hypotheses concerning the mean. These tests will be based on the t distribution.
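A minimal sketch of the computation: the statistic is √n (x̄ − μ0) divided by the sample standard deviation (computed with n − 1 in the denominator), and it is compared with a quantile of the t distribution with n − 1 degrees of freedom. The data and μ0 below are made up, and the critical value 2.262 is the 0.975 quantile for 9 degrees of freedom, hard-coded from a standard table because the Python standard library has no t quantile function.

```python
import math
import statistics

# One-sample t statistic for hypotheses about a normal mean with unknown variance.
# Data and mu0 are hypothetical.

def t_statistic(data, mu0):
    """sqrt(n) * (sample mean - mu0) / sample standard deviation (n - 1 denominator)."""
    n = len(data)
    return math.sqrt(n) * (statistics.mean(data) - mu0) / statistics.stdev(data)


data = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.7, 5.2, 5.9]
t_stat = t_statistic(data, mu0=5.2)

T_CRITICAL_9_DF = 2.262  # 0.975 quantile of t with 9 degrees of freedom (table value)
reject = abs(t_stat) >= T_CRITICAL_9_DF
print(f"t = {t_stat:.3f}, reject at level 0.05 (two-sided): {reject}")
```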
9.6 Comparing the Means of Two Normal Distributions It is very common to compare two distributions to see which has the higher mean or just to see how different the two means are. When the two distributions are normal, the tests and confidence intervals based on the t distribution are very similar to the ones that arose when we considered a single distribution.
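The two-sample statistic can be sketched the same way, under the usual assumption that both normal distributions share a common unknown variance. The two samples below are made up, and the critical value 2.306 is the 0.975 quantile of the t distribution with 8 degrees of freedom, hard-coded from a standard table.

```python
import math

# Two-sample t statistic with a pooled variance estimate.
# Assumes equal variances in the two normal distributions; data are hypothetical.

def pooled_t_statistic(x, y):
    """Return (t statistic, degrees of freedom) for comparing two means."""
    m, n = len(x), len(y)
    mean_x, mean_y = sum(x) / m, sum(y) / n
    ss_x = sum((v - mean_x) ** 2 for v in x)  # sum of squared deviations in x
    ss_y = sum((v - mean_y) ** 2 for v in y)  # sum of squared deviations in y
    df = m + n - 2
    pooled_var = (ss_x + ss_y) / df
    return (mean_x - mean_y) / math.sqrt(pooled_var * (1 / m + 1 / n)), df


x = [5.1, 4.9, 5.6, 5.8, 6.0]
y = [4.8, 5.0, 4.6, 5.2, 4.9]
t_stat, df = pooled_t_statistic(x, y)

T_CRITICAL_8_DF = 2.306  # 0.975 quantile of t with 8 degrees of freedom (table value)
print(f"t = {t_stat:.3f} on {df} degrees of freedom, "
      f"reject equal means at level 0.05: {abs(t_stat) >= T_CRITICAL_8_DF}")
```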
9.7 The F Distributions In this section, we introduce the family of F distributions. This family is useful in two different hypothesis-testing situations. The first situation is when we wish to test hypotheses about the variances of two different normal distributions. These tests, which we shall derive in this section, are based on a statistic that has an F distribution. The second situation will arise in Chapter 11 when we test hypotheses concerning the means of more than two normal distributions.
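For the first situation, the statistic is the ratio of the two sample variances. If the two normal distributions have equal variances, this ratio has an F distribution with m − 1 and n − 1 degrees of freedom. The sketch below uses made-up samples, and the critical value 6.39 is the 0.95 quantile of the F distribution with 4 and 4 degrees of freedom, hard-coded from a standard table.

```python
import statistics

# Variance-ratio statistic for comparing the variances of two normal distributions.
# Data are hypothetical.

def variance_ratio(x, y):
    """Return (ratio of sample variances, numerator df, denominator df)."""
    return statistics.variance(x) / statistics.variance(y), len(x) - 1, len(y) - 1


x = [5.1, 4.9, 5.6, 5.8, 6.0]
y = [4.8, 5.0, 4.6, 5.2, 4.9]
ratio, df_num, df_den = variance_ratio(x, y)

F_CRITICAL_4_4 = 6.39  # 0.95 quantile of F with 4 and 4 degrees of freedom (table value)
# One-sided test: the null hypothesis says the first variance is no larger than the second.
reject = ratio >= F_CRITICAL_4_4
print(f"V = {ratio:.2f} with ({df_num}, {df_den}) degrees of freedom, reject: {reject}")
```

With samples this small the test has little power, so a failure to reject says little about whether the variances are actually equal.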