Econometrics course outline
INTRO
Chapter 1 discusses the scope of econometrics and raises general issues that arise in the
application of econometric methods. Section 1.1 provides a brief discussion of
the purpose and scope of econometrics and how it fits into economic analysis.
Section 1.2 provides examples of how one can start with an economic theory and build a
model that can be estimated using data. Section 1.3 examines the kinds of data sets that
are used in business, economics, and other social sciences. Section 1.4 provides an intuitive
discussion of the difficulties associated with the inference of causality in the social
sciences.
1.1 What Is Econometrics?
Imagine that you are hired by your state government to evaluate the effectiveness of a
publicly funded job training program. Suppose this program teaches workers various ways
to use computers in the manufacturing process. The twenty-week program offers courses
during nonworking hours. Any hourly manufacturing worker may participate, and enrollment
in all or part of the program is voluntary. You are to determine what, if any, effect
the training program has on each worker’s subsequent hourly wage.
Now, suppose you work for an investment bank. You are to study the returns on different
investment strategies involving short-term U.S. Treasury bills to decide whether they
comply with implied economic theories.
The task of answering such questions may seem daunting at first. At this point, you
may only have a vague idea of the kind of data you would need to collect. By the end of
this introductory econometrics course, you should know how to use econometric methods
to formally evaluate a job training program or to test a simple economic theory.
Econometrics is based upon the development of statistical methods for estimating
economic relationships, testing economic theories, and evaluating and implementing government
and business policy. The most common application of econometrics is the forecasting
of such important macroeconomic variables as interest rates, inflation rates, and
gross domestic product. Whereas forecasts of economic indicators are highly visible and
often widely published, econometric methods can be used in economic areas that have
nothing to do with macroeconomic forecasting. For example, we will study the effects of political campaign expenditures on voting outcomes. We will consider the effect of school
spending on student performance in the field of education. In addition, we will learn how
to use econometric methods for forecasting economic time series.
Econometrics has evolved as a separate discipline from mathematical statistics
because the former focuses on the problems inherent in collecting and analyzing
nonexperimental economic data. Nonexperimental data are not accumulated through
controlled experiments on individuals, firms, or segments of the economy.
(Nonexperimental data are sometimes called observational data, or retrospective data,
to emphasize the fact that the researcher is a passive collector of the data.) Experimental data
are often collected in laboratory environments in the natural sciences, but they are much
more difficult to obtain in the social sciences. Although some social experiments can be
devised, it is often impossible, prohibitively expensive, or morally repugnant to conduct
the kinds of controlled experiments that would be needed to address economic issues. We
give some specific examples of the differences between experimental and nonexperimental
data in Section 1.4.
Naturally, econometricians have borrowed from mathematical statisticians whenever
possible. The method of multiple regression analysis is the mainstay in both fields, but its
focus and interpretation can differ markedly. In addition, economists have devised new
techniques to deal with the complexities of economic data and to test the predictions of
economic theories.
1.2 Steps in Empirical Economic Analysis
Econometric methods are relevant in virtually every branch of applied economics. They
come into play either when we have an economic theory to test or when we have a relationship
in mind that has some importance for business decisions or policy analysis. An
empirical analysis uses data to test a theory or to estimate a relationship.
How does one go about structuring an empirical economic analysis? It may seem
obvious, but it is worth emphasizing that the first step in any empirical analysis is the
careful formulation of the question of interest. The question might deal with testing a
certain aspect of an economic theory, or it might pertain to testing the effects of a government
policy. In principle, econometric methods can be used to answer a wide range of
questions.
In some cases, especially those that involve the testing of economic theories, a formal
economic model is constructed. An economic model consists of mathematical equations
that describe various relationships. Economists are well known for their building of
models to describe a vast array of behaviors. For example, in intermediate microeconomics,
individual consumption decisions, subject to a budget constraint, are described by
mathematical models. The basic premise underlying these models is utility maximization.
The assumption that individuals make choices to maximize their well-being, subject to
resource constraints, gives us a very powerful framework for creating tractable economic
models and making clear predictions. In the context of consumption decisions, utility
maximization leads to a set of demand equations. In a demand equation, the quantity
demanded of each commodity depends on the price of the good, the prices of substitute
and complementary goods, the consumer’s income, and the individual’s characteristics
that affect taste. These equations can form the basis of an econometric analysis of
consumer demand.
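As a purely illustrative sketch (the specific linear form is an assumption, not taken from any particular theory), one such demand equation might look like

quantity = β0 + β1(price) + β2(price of substitutes) + β3(income) + u,

where u collects the unobserved taste characteristics.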
Part 1 Regression analysis with cross-sectional data
Part 1 of the text covers regression analysis with cross-sectional data. It builds
upon a solid base of college algebra and basic concepts in probability and
statistics. Appendices A, B, and C contain complete reviews of these topics.
Chapter 2 begins with the simple linear regression model, where we explain one
variable in terms of another variable. Although simple regression is not widely used
in applied econometrics, it is used occasionally and serves as a natural starting point
because the algebra and interpretations are relatively straightforward.
Chapters 3 and 4 cover the fundamentals of multiple regression analysis, where we
allow more than one variable to affect the variable we are trying to explain. Multiple
regression is still the most commonly used method in empirical research, and so these
chapters deserve careful attention. Chapter 3 focuses on the algebra of the method of
ordinary least squares (OLS), while also establishing conditions under which the OLS
estimator is unbiased and best linear unbiased. Chapter 4 covers the important topic of
statistical inference.
Chapter 5 discusses the large sample, or asymptotic, properties of the OLS
estimators. This provides justification for the inference procedures in Chapter 4 when
the errors in a regression model are not normally distributed. Chapter 6 covers some
additional topics in regression analysis, including advanced functional form issues, data
scaling, prediction, and goodness-of-fit. Chapter 7 explains how qualitative information
can be incorporated into multiple regression models.
Chapter 8 illustrates how to test for and correct the problem of heteroskedasticity,
or nonconstant variance, in the error terms. We show how the usual OLS statistics can
be adjusted, and we also present an extension of OLS, known as weighted least squares,
that explicitly accounts for different variances in the errors. Chapter 9 delves further
into the very important problem of correlation between the error term and one or more
of the explanatory variables. We demonstrate how the availability of a proxy variable
can solve the omitted variables problem. In addition, we establish the bias and inconsistency
in the OLS estimators in the presence of certain kinds of measurement errors in the
variables. Various data problems are also discussed, including the problem of outliers.
Chapter 2 The simple regression model
The simple regression model can be used to study the relationship between two variables.
For reasons we will see, the simple regression model has limitations as a
general tool for empirical analysis. Nevertheless, it is sometimes appropriate as an
empirical tool. Learning how to interpret the simple regression model is good practice for
studying multiple regression, which we will do in subsequent chapters.
2.1 Definition of the Simple Regression Model
Much of applied econometric analysis begins with the following premise: y and x are two
variables, representing some population, and we are interested in “explaining y in terms
of x,” or in “studying how y varies with changes in x.” We discussed some examples in
Chapter 1, including: y is soybean crop yield and x is amount of fertilizer; y is hourly wage
and x is years of education; and y is a community crime rate and x is number of police
officers.
In writing down a model that will “explain y in terms of x,” we must confront three
issues. First, since there is never an exact relationship between two variables, how do we
allow for other factors to affect y? Second, what is the functional relationship between
y and x? And third, how can we be sure we are capturing a ceteris paribus relationship
between y and x (if that is a desired goal)?
We can resolve these ambiguities by writing down an equation relating y to x. A simple
equation is
y = β0 + β1x + u. [2.1]
Equation (2.1), which is assumed to hold in the population of interest, defines the simple
linear regression model. It is also called the two-variable linear regression model or
bivariate linear regression model because it relates the two variables x and y. We now discuss
the meaning of each of the quantities in (2.1). [Incidentally, the term “regression” has
origins that are not especially important for most modern econometric applications, so we
will not explain it here. See Stigler (1986) for an engaging history of regression analysis.]
When related by (2.1), the variables y and x have several different names used interchangeably,
as follows: y is called the dependent variable, the explained variable, the response variable, the predicted variable, or the regressand; x is called the independent
variable, the explanatory variable, the control variable, the predictor variable,
or the regressor. (The term covariate is also used for x.) The terms “dependent variable”
and “independent variable” are frequently used in econometrics. But be aware that the
label “independent” here does not refer to the statistical notion of independence between
random variables (see Appendix B).
The terms “explained” and “explanatory” variables are probably the most descriptive.
“Response” and “control” are used mostly in the experimental sciences, where the
variable x is under the experimenter’s control. We will not use the terms “predicted variable”
and “predictor,” although you sometimes see these in applications that are purely
about prediction and not causality. Our terminology for simple regression is summarized
in Table 2.1.
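Table 2.1 Terminology for simple regression
y: dependent variable, explained variable, response variable, predicted variable, regressand
x: independent variable, explanatory variable, control variable, predictor variable, regressor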
The variable u, called the error term or disturbance in the relationship, represents
factors other than x that affect y. A simple regression analysis effectively treats all factors
affecting y other than x as being unobserved. You can usefully think of u as standing for
“unobserved.”
Equation (2.1) also addresses the issue of the functional relationship between y and x.
If the other factors in u are held fixed, so that the change in u is zero, Δu = 0, then x has a
linear effect on y:
Δy = β1Δx if Δu = 0. [2.2]
Thus, the change in y is simply β1 multiplied by the change in x. This means that β1 is the
slope parameter in the relationship between y and x, holding the other factors in u fixed;
it is of primary interest in applied economics. The intercept parameter β0, sometimes
called the constant term, also has its uses, although it is rarely central to an analysis.
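To make equations (2.1) and (2.2) concrete, here is a minimal simulation sketch in Python (assuming the numpy and statsmodels libraries are available; the parameter values and variable names are invented for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1000
beta0, beta1 = 1.0, 0.5                # illustrative population parameters

x = rng.normal(10, 2, size=n)          # e.g., years of education
u = rng.normal(0, 1, size=n)           # unobserved factors affecting y
y = beta0 + beta1 * x + u              # the population model (2.1)

X = sm.add_constant(x)                 # add an intercept column
results = sm.OLS(y, X).fit()           # ordinary least squares
print(results.params)                  # estimates close to beta0 and beta1

With many observations, the estimated slope is close to β1, the ceteris paribus effect in (2.2): holding the factors in u fixed, a one-unit change in x changes y by β1.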
Chapter 3 Multiple regression analysis: Estimation
In Chapter 2, we learned how to use simple regression analysis to explain a dependent
variable, y, as a function of a single independent variable, x. The primary drawback in
using simple regression analysis for empirical work is that it is very difficult to draw
ceteris paribus conclusions about how x affects y: the key assumption, SLR.4—that all
other factors affecting y are uncorrelated with x—is often unrealistic.
Multiple regression analysis is more amenable to ceteris paribus analysis because
it allows us to explicitly control for many other factors that simultaneously affect the
dependent variable. This is important both for testing economic theories and for evaluating
policy effects when we must rely on nonexperimental data. Because multiple regression
models can accommodate many explanatory variables that may be correlated, we can
hope to infer causality in cases where simple regression analysis would be misleading.
Naturally, if we add more factors to our model that are useful for explaining y, then
more of the variation in y can be explained. Thus, multiple regression analysis can be used
to build better models for predicting the dependent variable.
An additional advantage of multiple regression analysis is that it can incorporate fairly
general functional form relationships. In the simple regression model, only one function
of a single explanatory variable can appear in the equation. As we will see, the multiple
regression model allows for much more flexibility.
Section 3.1 formally introduces the multiple regression model and further discusses
the advantages of multiple regression over simple regression. In Section 3.2, we demonstrate
how to estimate the parameters in the multiple regression model using the method of
ordinary least squares. In Sections 3.3, 3.4, and 3.5, we describe various statistical properties
of the OLS estimators, including unbiasedness and efficiency.
The multiple regression model is still the most widely used vehicle for empirical
analysis in economics and other social sciences. Likewise, the method of ordinary least
squares is popularly used for estimating the parameters of the multiple regression model.
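As a hedged sketch of what this looks like in practice (again assuming numpy and statsmodels; the variables and coefficients are invented), a multiple regression with two explanatory variables can be estimated as follows:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
educ = rng.normal(13, 2, size=n)           # years of education
exper = rng.normal(10, 5, size=n)          # years of experience
u = rng.normal(0, 1, size=n)
wage = 1.0 + 0.6 * educ + 0.1 * exper + u  # population model with two regressors

X = sm.add_constant(np.column_stack([educ, exper]))
results = sm.OLS(wage, X).fit()
print(results.params)                      # intercept and the two slope estimates

The coefficient on educ now measures the effect of education holding experience fixed, which is exactly the ceteris paribus interpretation that simple regression cannot deliver when the regressors are correlated.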
Chapter 4 Multiple regression analysis: Inference
This chapter continues our treatment of multiple regression analysis. We now turn
to the problem of testing hypotheses about the parameters in the population regression
model. We begin by finding the distributions of the OLS estimators under the
added assumption that the population error is normally distributed. Sections 4.2 and 4.3
cover hypothesis testing about individual parameters, while Section 4.4 discusses how to
test a single hypothesis involving more than one parameter. We focus on testing multiple
restrictions in Section 4.5 and pay particular attention to determining whether a group of
independent variables can be omitted from a model.
4.1 Sampling Distributions of the OLS Estimators
Up to this point, we have formed a set of assumptions under which OLS is unbiased;
we have also derived and discussed the bias caused by omitted variables. In Section 3.4,
we obtained the variances of the OLS estimators under the Gauss-Markov assumptions.
In Section 3.5, we showed that this variance is smallest among linear unbiased estimators.
Knowing the expected value and variance of the OLS estimators is useful for describing
the precision of the OLS estimators. However, in order to perform statistical inference,
we need to know more than just the first two moments of the β̂j; we need to know the full
sampling distribution of the β̂j. Even under the Gauss-Markov assumptions, the distribution
of β̂j can have virtually any shape.
When we condition on the values of the independent variables in our sample, it is
clear that the sampling distributions of the OLS estimators depend on the underlying
distribution of the errors. To make the sampling distributions of the β̂j tractable, we now
assume that the unobserved error is normally distributed in the population. We call this
the normality assumption.
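A small Monte Carlo sketch (illustrative only, assuming numpy) of what the normality assumption buys us: when the errors are normal, the OLS slope estimator has a normal sampling distribution across repeated samples, conditional on the regressors:

import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 5000
beta0, beta1 = 2.0, 0.7
x = rng.normal(0, 1, size=n)               # fixed regressors (we condition on them)

slopes = np.empty(reps)
for r in range(reps):
    u = rng.normal(0, 1, size=n)           # normally distributed errors
    y = beta0 + beta1 * x + u
    xd = x - x.mean()
    slopes[r] = (xd @ (y - y.mean())) / (xd @ xd)   # OLS slope formula

print(slopes.mean(), slopes.std())         # centered at beta1

A histogram of slopes would show the bell shape that the normality assumption delivers exactly, for any sample size.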
Chapter 5 Multiple regression analysis: OLS asymptotics
In Chapters 3 and 4, we covered what are called finite sample, small sample, or exact
properties of the OLS estimators in the population model
y = β0 + β1x1 + β2x2 + ... + βkxk + u. [5.1]
For example, the unbiasedness of OLS (derived in Chapter 3) under the first four
Gauss-Markov assumptions is a finite sample property because it holds for any sample
size n (subject to the mild restriction that n must be at least as large as the total number
of parameters in the regression model, k + 1). Similarly, the fact that OLS is the best
linear unbiased estimator under the full set of Gauss-Markov assumptions (MLR.1
through MLR.5) is a finite sample property.
In Chapter 4, we added the classical linear model Assumption MLR.6, which states
that the error term u is normally distributed and independent of the explanatory variables.
This allowed us to derive the exact sampling distributions of the OLS estimators
(conditional on the explanatory variables in the sample). In particular, Theorem 4.1
showed that the OLS estimators have normal sampling distributions, which led directly
to the t and F distributions for t and F statistics. If the error is not normally distributed,
the distribution of a t statistic is not exactly t, and an F statistic does not have an exact
F distribution for any sample size.
In addition to finite sample properties, it is important to know the asymptotic
properties or large sample properties of estimators and test statistics. These properties
are not defined for a particular sample size; rather, they are defined as the sample size
grows without bound. Fortunately, under the assumptions we have made, OLS has
satisfactory large sample properties. One practically important finding is that even
without the normality assumption (Assumption MLR.6), t and F statistics have
approximately t and F distributions, at least in large sample sizes. We discuss this in
more detail in Section 5.2, after we cover the consistency of OLS in Section 5.1.
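A hedged simulation sketch of this large sample result (numpy assumed; the setup is invented): even with decidedly non-normal errors, the t statistic for a slope rejects at roughly the nominal 5% rate once the sample is large:

import numpy as np

rng = np.random.default_rng(2)

def t_stat_for_slope(n, rng):
    x = rng.normal(0, 1, size=n)
    u = rng.exponential(1.0, size=n) - 1.0   # skewed, non-normal errors with mean zero
    y = 1.0 + 0.0 * x + u                    # true slope is zero
    xd = x - x.mean()
    b1 = (xd @ (y - y.mean())) / (xd @ xd)   # OLS slope
    resid = (y - y.mean()) - b1 * xd         # OLS residuals
    s2 = (resid @ resid) / (n - 2)           # error variance estimate
    se = np.sqrt(s2 / (xd @ xd))             # standard error of the slope
    return b1 / se

for n in (10, 1000):
    ts = np.array([t_stat_for_slope(n, rng) for _ in range(2000)])
    print(n, np.mean(np.abs(ts) > 1.96))     # approaches about 0.05 as n grows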
Because the material in this chapter is more difficult to understand, and because one
can conduct empirical work without a deep understanding of its contents, this chapter may
be skipped. However, we will necessarily refer to large sample properties of OLS when
we relax the homoskedasticity assumption in Chapter 8 and when we delve into estimation
using time series data in Part 2. Furthermore, virtually all advanced econometric methods
derive their justification using large-sample analysis, so readers who will continue into
Part 3 should be familiar with the contents of this chapter.
Chapter 6 Multiple regression analysis: Further issues
This chapter brings together several issues in multiple regression analysis that we
could not conveniently cover in earlier chapters. These topics are not as fundamental
as the material in Chapters 3 and 4, but they are important for applying multiple
regression to a broad range of empirical problems.
6.1 Effects of Data Scaling on OLS Statistics
In Chapter 2 on bivariate regression, we briefly discussed the effects of changing the units of
measurement on the OLS intercept and slope estimates. We also showed that changing the
units of measurement did not affect R-squared. We now return to the issue of data scaling
and examine the effects of rescaling the dependent or independent variables on standard
errors, t statistics, F statistics, and confidence intervals.
We will discover that everything we expect to happen does happen. When variables
are rescaled, the coefficients, standard errors, confidence intervals, t statistics, and
F statistics change in ways that preserve all measured effects and testing outcomes.
Although this is no great surprise—in fact, we would be very worried if it were not the
case—it is useful to see what occurs explicitly. Often, data scaling is used for cosmetic
purposes, such as to reduce the number of zeros after a decimal point in an estimated
coefficient. By judiciously choosing units of measurement, we can improve the appearance
of an estimated equation while changing nothing that is essential.
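A hedged illustration of the scaling result (numpy and statsmodels assumed; the data are simulated): measuring a regressor in thousands instead of single units multiplies its coefficient and standard error by 1,000 while leaving the t statistic and R-squared unchanged:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 300
x = rng.normal(50_000, 10_000, size=n)            # e.g., income in dollars
y = 0.5 + 0.0002 * x + rng.normal(0, 1, size=n)

fit1 = sm.OLS(y, sm.add_constant(x)).fit()        # income in dollars
fit2 = sm.OLS(y, sm.add_constant(x / 1000)).fit() # income in thousands of dollars

print(fit1.params[1] * 1000, fit2.params[1])      # same number
print(fit1.tvalues[1], fit2.tvalues[1])           # identical t statistics
print(fit1.rsquared, fit2.rsquared)               # identical R-squared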
Chapter 7 Regression with qualitative information and dummy variables
In previous chapters, the dependent and independent variables in our multiple regression
models have had quantitative meaning. Just a few examples include hourly wage
rate, years of education, college grade point average, amount of air pollution, level
of firm sales, and number of arrests. In each case, the magnitude of the variable conveys
useful information. In empirical work, we must also incorporate qualitative factors into
regression models. The gender or race of an individual, the industry of a firm (manufacturing,
retail, and so on), and the region in the United States where a city is located (South,
North, West, and so on) are all considered to be qualitative factors.
Most of this chapter is dedicated to qualitative independent variables. After we
discuss the appropriate ways to describe qualitative information in Section 7.1, we show
how qualitative explanatory variables can be easily incorporated into multiple regression
models in Sections 7.2, 7.3, and 7.4. These sections cover almost all of the popular ways
that qualitative independent variables are used in cross-sectional regression analysis.
In Section 7.5, we discuss a binary dependent variable, which is a particular kind of
qualitative dependent variable. The multiple regression model has an interesting
interpretation in this case and is called the linear probability model. While much
maligned by some econometricians, the simplicity of the linear probability model makes
it useful in many empirical contexts. We will describe its drawbacks in Section 7.5, but
they are often secondary in empirical work.
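As a hedged sketch of how a dummy variable enters a regression (numpy and statsmodels assumed; the variables and effect sizes are invented):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 400
female = rng.integers(0, 2, size=n)        # binary (dummy) explanatory variable
educ = rng.normal(13, 2, size=n)
wage = 2.0 - 1.5 * female + 0.5 * educ + rng.normal(0, 1, size=n)

X = sm.add_constant(np.column_stack([female, educ]))
print(sm.OLS(wage, X).fit().params)        # coefficient on female is an intercept shift

The coefficient on the dummy measures the difference in average wage between the two groups, holding education fixed, which is the standard interpretation developed in Sections 7.2 through 7.4.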
Chapter 8 Heteroskedasticity
The homoskedasticity assumption, introduced in Chapter 3 for multiple regression,
states that the variance of the unobserved error, u, conditional on the explanatory
variables, is constant. Homoskedasticity fails whenever the variance of the unobserved
factors changes across different segments of the population, where the segments
are determined by the different values of the explanatory variables. For example, in a
savings equation, heteroskedasticity is present if the variance of the unobserved factors
affecting savings increases with income.
In Chapters 4 and 5, we saw that homoskedasticity is needed to justify the usual t tests,
F tests, and confidence intervals for OLS estimation of the linear regression model, even
with large sample sizes. In this chapter, we discuss the available remedies when heteroskedasticity
occurs, and we also show how to test for its presence. We begin by briefly
reviewing the consequences of heteroskedasticity for ordinary least squares estimation.
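A hedged sketch of both a test and a remedy (numpy and statsmodels assumed; the savings equation is simulated so that heteroskedasticity is present by construction):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(5)
n = 500
income = rng.uniform(10, 100, size=n)
u = rng.normal(0, 0.1 * income)            # error variance grows with income
savings = 1.0 + 0.2 * income + u

X = sm.add_constant(income)
fit = sm.OLS(savings, X).fit()

lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, X)
print(lm_pval)                             # small p-value flags heteroskedasticity

robust = sm.OLS(savings, X).fit(cov_type="HC1")
print(robust.bse)                          # heteroskedasticity-robust standard errors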
Chapter 9 Specification and data issues
In Chapter 8, we dealt with one failure of the Gauss-Markov assumptions. While
heteroskedasticity in the errors can be viewed as a problem with a model, it is a
relatively minor one. The presence of heteroskedasticity does not cause bias or
inconsistency in the OLS estimators. Also, it is fairly easy to adjust confidence intervals
and t and F statistics to obtain valid inference after OLS estimation, or even to get more
efficient estimators by using weighted least squares.
In this chapter, we return to the much more serious problem of correlation between
the error, u, and one or more of the explanatory variables. Remember from Chapter 3 that
if u is, for whatever reason, correlated with the explanatory variable xj, then we say that xj
is an endogenous explanatory variable. We also provide a more detailed discussion of
three reasons why an explanatory variable can be endogenous; in some cases, we discuss
possible remedies.
We have already seen in Chapters 3 and 5 that omitting a key variable can cause
correlation between the error and some of the explanatory variables, which generally
leads to bias and inconsistency in all of the OLS estimators. In the special case that the
omitted variable is a function of an explanatory variable in the model, the model suffers
from functional form misspecification.
We begin in the first section by discussing the consequences of functional form misspecification
and how to test for it. In Section 9.2, we show how the use of proxy variables
can solve, or at least mitigate, omitted variables bias. In Section 9.3, we derive and explain
the bias in OLS that can arise under certain forms of measurement error. Additional data
problems are discussed in Section 9.4.
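A hedged simulation sketch of the measurement error result (numpy assumed): with classical measurement error in the explanatory variable, the OLS slope is biased toward zero (attenuation):

import numpy as np

rng = np.random.default_rng(6)
n = 100_000
xstar = rng.normal(0, 1, size=n)           # true explanatory variable
y = 1.0 + 1.0 * xstar + rng.normal(0, 1, size=n)

e = rng.normal(0, 1, size=n)               # classical measurement error
x = xstar + e                              # we observe only a noisy version of xstar

xd = x - x.mean()
b1 = (xd @ (y - y.mean())) / (xd @ xd)     # OLS slope using the mismeasured x
print(b1)   # about 0.5, not 1.0: attenuation factor Var(xstar) / (Var(xstar) + Var(e))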
All of the procedures in this chapter are based on OLS estimation. As we will see,
certain problems that cause correlation between the error and some explanatory variables
cannot be solved by using OLS on a single cross section. We postpone a treatment of
alternative estimation methods until Part 3.