Textbook summaries: Chapters 1-3


1.4 Summary We will use set theory for the mathematical model of events. Outcomes of an experiment are elements of some sample space S, and each event is a subset of S. Two events both occur if the outcome is in the intersection of the two sets. At least one of a collection of events occurs if the outcome is in the union of the sets. Two events cannot both occur if the sets are disjoint. An event fails to occur if the outcome is in the complement of the set. The empty set stands for every event that cannot possibly occur. The collection of events is assumed to contain the sample space, the complement of each event, and the union of each countable collection of events.

1.5 The Definition of Probability Summary We have presented the mathematical definition of probability through the three axioms. The axioms require that every event have nonnegative probability, that the whole sample space have probability 1, and that the union of an infinite sequence of disjoint events have probability equal to the sum of their probabilities. Some important results to remember include the following:
- If A1, ..., Ak are disjoint, then Pr(A1 ∪ ... ∪ Ak) = Pr(A1) + ... + Pr(Ak).
- Pr(A^c) = 1 − Pr(A).
- A ⊂ B implies that Pr(A) ≤ Pr(B).
- Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
It does not matter how the probabilities were determined. As long as they satisfy the three axioms, they must also satisfy the above relations as well as all of the results that we prove later in the text.

1.7 Counting Methods Summary A simple sample space is a finite sample space S such that every outcome in S has the same probability. If there are n outcomes in a simple sample space S, then each one must have probability 1/n. The probability of an event E in a simple sample space is the number of outcomes in E divided by n. In the next three sections, we will present some useful methods for counting numbers of outcomes in various events.
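As a small illustration of the rule for simple sample spaces (the two-dice setting below is our own, not an example from the text), the probability of an event is just a count divided by the total number of outcomes:

```python
from fractions import Fraction
from itertools import product

# Simple sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
n = len(outcomes)

# Pr(E) = (number of outcomes in E) / n; here E = "the sum is 7".
event = [o for o in outcomes if o[0] + o[1] == 7]
prob = Fraction(len(event), n)
print(prob)  # 1/6
```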

Summary Suppose that the following conditions are met:
- Each element of a set consists of k distinguishable parts x1, ..., xk.
- There are n1 possibilities for the first part x1.
- For each i = 2, ..., k and each combination (x1, ..., x(i−1)) of the first i − 1 parts, there are ni possibilities for the ith part xi.
Under these conditions, there are n1 · n2 · · · nk elements of the set. The third condition requires only that the number of possibilities for xi be ni no matter what the earlier parts are. For example, for i = 2, it does not require that the same n2 possibilities be available for x2 regardless of what x1 is. It only requires that the number of possibilities for x2 be n2 no matter what x1 is. In this way, the general rule includes the multiplication rule, the calculation of permutations, and sampling with replacement as special cases. For permutations of m items k at a time, we have ni = m − i + 1 for i = 1, ..., k, and the ni possibilities for part i are just the ni items that have not yet appeared in the first i − 1 parts. For sampling with replacement from m items, we have ni = m for all i, and the m possibilities are the same for every part. In the next section, we shall consider how to count elements of sets in which the parts of each element are not distinguishable.
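The general rule and its two special cases can be sketched in a few lines; the choices m = 10 and k = 3 below are arbitrary, picked only for illustration:

```python
import math

def count_by_stages(ns):
    # General counting rule: multiply the number of possibilities at each stage.
    return math.prod(ns)

m, k = 10, 3

# Permutations of m items taken k at a time: n_i = m - i + 1.
perm = count_by_stages([m - i + 1 for i in range(1, k + 1)])
print(perm)       # 720, agreeing with math.perm(10, 3)

# Sampling with replacement from m items, k draws: n_i = m for every i.
with_repl = count_by_stages([m] * k)
print(with_repl)  # 1000 = 10**3
```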

1.8 Combinatorial Methods Summary We showed that the number of size-k subsets of a set of size n is (n choose k) = n!/[k!(n − k)!]. This turns out to be the number of possible samples of size k drawn without replacement from a population of size n as well as the number of arrangements of n items of two types with k of one type and n − k of the other type. We also saw several examples in which more than one counting technique was required at different points in the same problem. Sometimes, more than one technique is required to count the elements of a single set.
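The binomial coefficient formula translates directly into code; as a sanity check, it should agree with the standard library's `math.comb`:

```python
import math

def binom(n, k):
    # (n choose k) = n! / [k! (n - k)!], the number of size-k subsets.
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

print(binom(5, 2))                      # 10
print(binom(5, 2) == math.comb(5, 2))   # True
```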

1.9 Multinomial Coefficients Summary Multinomial coefficients generalize binomial coefficients. The coefficient (n choose n1, ..., nk) = n!/(n1! · · · nk!) is the number of ways to partition a set of n items into distinguishable subsets of sizes n1, ..., nk where n1 + ... + nk = n. It is also the number of arrangements of n items of k different types for which ni are of type i for i = 1, ..., k. Example 1.9.4 illustrates another important point to remember about computing probabilities: There might be more than one correct method for computing the same probability.
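A short sketch of the multinomial coefficient; the word "MISSISSIPPI" below is a standard illustration of arrangements of items of several types, not an example taken from the text:

```python
import math

def multinomial(n, sizes):
    # n! / (n1! n2! ... nk!); requires n1 + ... + nk = n.
    assert sum(sizes) == n
    result = math.factorial(n)
    for ni in sizes:
        result //= math.factorial(ni)
    return result

# Arrangements of the 11 letters of "MISSISSIPPI": 1 M, 4 I's, 4 S's, 2 P's.
print(multinomial(11, [1, 4, 4, 2]))  # 34650
```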

1.10 The Probability of a Union of Events Summary We generalized the formula for the probability of the union of two arbitrary events to the union of finitely many events. As an aside, there are cases in which it is easier to compute Pr(A1 ∪ ... ∪ An) as 1 − Pr(A1^c ∩ ... ∩ An^c), using the fact that (A1 ∪ ... ∪ An)^c = A1^c ∩ ... ∩ An^c.
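The complement trick is easiest to see with independent events; the four-rolls-of-a-die setting below is our own illustration, not an example from the text:

```python
from fractions import Fraction

# Pr(at least one A_i) = 1 - Pr(no A_i occurs). With independent rolls,
# Pr(no six in four rolls of a fair die) = (5/6)^4.
p_no_six = Fraction(5, 6) ** 4
p_at_least_one = 1 - p_no_six
print(p_at_least_one)  # 671/1296
```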

Chapter 2 Conditional Probability Summary The revised probability of an event A after learning that event B (with Pr(B) > 0) has occurred is the conditional probability of A given B, denoted by Pr(A|B) and computed as Pr(A ∩ B)/Pr(B). Often it is easy to assess a conditional probability, such as Pr(A|B), directly. In such a case, we can use the multiplication rule for conditional probabilities to compute Pr(A ∩ B) = Pr(B) Pr(A|B). All probability results have versions conditional on an event B with Pr(B) > 0: just change all probabilities so that they are conditional on B in addition to anything else they were already conditional on. For example, the multiplication rule for conditional probabilities becomes Pr(A1 ∩ A2|B) = Pr(A1|B) Pr(A2|A1 ∩ B). A partition is a collection of disjoint events whose union is the whole sample space. To be most useful, a partition is chosen so that an important source of uncertainty is reduced if we learn which one of the partition events occurs. If the conditional probability of an event A is available given each event in a partition, the law of total probability tells us how to combine these conditional probabilities to get Pr(A).
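The law of total probability with a two-event partition {B, B^c} can be sketched as follows; the numerical probabilities are made up purely for illustration:

```python
from fractions import Fraction

# Hypothetical inputs: Pr(B) and the conditional probabilities of A.
pr_B = Fraction(3, 10)           # Pr(B)
pr_A_given_B = Fraction(1, 2)    # Pr(A | B)
pr_A_given_Bc = Fraction(1, 5)   # Pr(A | B^c)

# Law of total probability: Pr(A) = Pr(B) Pr(A|B) + Pr(B^c) Pr(A|B^c).
pr_A = pr_B * pr_A_given_B + (1 - pr_B) * pr_A_given_Bc
print(pr_A)  # 29/100
```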

2.2 Independent Events Summary A collection of events is independent if and only if learning that some of them occur does not change the probabilities that any combination of the rest of them occurs. Equivalently, a collection of events is independent if and only if the probability of the intersection of every subcollection is the product of the individual probabilities. The concept of independence has a version conditional on another event. A collection of events is independent conditional on B if and only if the conditional probability of the intersection of every subcollection given B is the product of the individual conditional probabilities given B. Equivalently, a collection of events is conditionally independent given B if and only if learning that some of them (and B) occur does not change the conditional probabilities given B that any combination of the rest of them occurs. The full power of conditional independence will become more apparent after we introduce Bayes' theorem in the next section.
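The product characterization of independence can be checked by brute force on a finite sample space; the two-dice events below are our own illustration:

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes of rolling two fair dice.
S = list(product(range(1, 7), repeat=2))

def pr(event):
    # Probability of an event given as a predicate on outcomes.
    return Fraction(sum(event(o) for o in S), len(S))

def A(o): return o[0] % 2 == 0            # first die is even
def B(o): return (o[0] + o[1]) % 2 == 0   # the sum is even

# Independence: Pr(A ∩ B) = Pr(A) Pr(B).
print(pr(lambda o: A(o) and B(o)) == pr(A) * pr(B))  # True
```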


2.3 Bayes' Theorem Summary Bayes' theorem tells us how to compute the conditional probability of each event in a partition given an observed event A. A major use of partitions is to divide the sample space into pieces small enough that a collection of events of interest becomes conditionally independent given each event in the partition.
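A minimal sketch of Bayes' theorem over a three-event partition {B1, B2, B3}; all of the prior and likelihood values are hypothetical:

```python
from fractions import Fraction

priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]        # Pr(Bi)
likelihoods = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]  # Pr(A | Bi)

# Denominator from the law of total probability: Pr(A) = sum_j Pr(Bj) Pr(A|Bj).
pr_A = sum(p * l for p, l in zip(priors, likelihoods))

# Bayes' theorem: Pr(Bi | A) = Pr(Bi) Pr(A|Bi) / Pr(A).
posteriors = [p * l / pr_A for p, l in zip(priors, likelihoods)]
print(pr_A)                   # 19/50
print(sum(posteriors) == 1)   # True: the posteriors form a distribution
```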

2.4 The Gambler's Ruin Problem Summary We considered a gambler and an opponent who each start with finite amounts of money. The two then play a sequence of games against each other until one of them runs out of money. We were able to calculate the probability that each of them would be the first to run out of money as a function of the probability of winning each game and of how much money each has at the start.
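The standard gambler's-ruin formula can be coded directly: if the gambler starts with i dollars, the two fortunes total k dollars, and each game is won with probability p (q = 1 − p), the probability of reaching k before 0 is i/k when p = 1/2 and ((q/p)^i − 1)/((q/p)^k − 1) otherwise. The numbers below are illustrative:

```python
from fractions import Fraction

def win_probability(i, k, p):
    # Probability the gambler reaches fortune k before going broke,
    # starting from i, when each game is won with probability p.
    q = 1 - p
    if p == Fraction(1, 2):
        return Fraction(i, k)
    r = q / p
    return (r**i - 1) / (r**k - 1)

# Fair game: the chance is proportional to the starting fortune.
print(win_probability(3, 10, Fraction(1, 2)))    # 3/10
# Slightly unfavorable game: the chance drops noticeably.
print(win_probability(3, 10, Fraction(49, 100)))
```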

Chapter 3 Random Variables and Distributions

3.1 Random Variables and Discrete Distributions Summary A random variable is a real-valued function defined on a sample space. The distribution of a random variable X is the collection of all probabilities Pr(X ∈ C) for all subsets C of the real numbers such that {X ∈ C} is an event. A random variable X is discrete if there are at most countably many possible values for X. In this case, the distribution of X can be characterized by the probability function (p.f.) of X, namely, f(x) = Pr(X = x) for x in the set of possible values. Some distributions are so famous that they have names. One collection of such named distributions is the collection of uniform distributions on finite sets of integers. A more famous collection is the collection of binomial distributions, whose parameters are n and p, where n is a positive integer and 0 < p < 1, having p.f. (3.1.4). The binomial distribution with parameters n = 1 and p is also called the Bernoulli distribution with parameter p. The names of these distributions also characterize the distributions.
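The binomial p.f. referred to as (3.1.4) is f(x) = C(n, x) p^x (1 − p)^(n−x) for x = 0, ..., n; a quick sketch, with n = 5 and p = 0.3 chosen arbitrarily:

```python
import math

def binomial_pf(x, n, p):
    # Binomial p.f.: f(x) = C(n, x) p^x (1 - p)^(n - x), x = 0, ..., n.
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

# With n = 1 this is the Bernoulli p.f.: f(1) = p.
print(binomial_pf(1, 1, 0.3))  # 0.3

# The p.f. sums to 1 over the possible values 0, ..., n.
print(sum(binomial_pf(x, 5, 0.3) for x in range(6)))
```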

3.2 Continuous Distributions Summary A continuous distribution is characterized by its probability density function (p.d.f.). A nonnegative function f is the p.d.f. of the distribution of X if, for every interval [a, b], Pr(a ≤ X ≤ b) = ∫_a^b f(x) dx. Continuous random variables satisfy Pr(X = x) = 0 for every value x. If the p.d.f. of a distribution is constant on an interval [a, b] and is 0 off the interval, we say that the distribution is uniform on the interval [a, b].
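For the uniform distribution the integral of the constant density reduces to a length ratio, which makes for a simple sketch (the interval [0, 4] below is an arbitrary choice):

```python
def uniform_pdf(x, a, b):
    # Constant density 1/(b - a) on [a, b], zero elsewhere.
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_prob(c, d, a, b):
    # Pr(c <= X <= d) for X uniform on [a, b]: length of overlap over (b - a).
    lo, hi = max(a, c), min(b, d)
    return max(hi - lo, 0.0) / (b - a)

print(uniform_prob(1.0, 2.0, 0.0, 4.0))  # 0.25
print(uniform_prob(1.5, 1.5, 0.0, 4.0))  # 0.0: Pr(X = x) = 0 for continuous X
```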

3.3 Cumulative Distribution Functions Summary The c.d.f. F of a random variable X is F(x) = Pr(X ≤ x) for all real x. This function is continuous from the right. If we let F(x−) equal the limit of F(y) as y approaches x from below, then F(x) − F(x−) = Pr(X = x). A continuous distribution has a continuous c.d.f. and F′(x) = f(x), the p.d.f. of the distribution, for all x at which F is differentiable. A discrete distribution has a c.d.f. that is constant between the possible values and jumps by f(x) at each possible value x. The quantile function F^(−1)(p) is equal to the smallest x such that F(x) ≥ p for 0 < p < 1.
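For a discrete distribution, the step-function c.d.f. and the "smallest x with F(x) ≥ p" quantile are easy to sketch; the values and probabilities below are made up for illustration:

```python
from fractions import Fraction

values = [1, 2, 4]
probs = [Fraction(1, 5), Fraction(1, 2), Fraction(3, 10)]

def cdf(x):
    # F(x) = Pr(X <= x): constant between possible values, jumps by f(v) at each v.
    return sum((p for v, p in zip(values, probs) if v <= x), Fraction(0))

def quantile(p):
    # F^(-1)(p) = the smallest x such that F(x) >= p, for 0 < p < 1.
    total = Fraction(0)
    for v, pr in zip(values, probs):
        total += pr
        if total >= p:
            return v

print(cdf(2))                    # 7/10
print(quantile(Fraction(1, 2)))  # 2
```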

3.4 Bivariate Distributions Summary The joint c.d.f. of two random variables X and Y is F(x, y) = Pr(X ≤ x and Y ≤ y). The joint p.d.f. of two continuous random variables is a nonnegative function f such that the probability of the pair (X, Y) being in a set C is the integral of f(x, y) over the set C, if the integral exists. The joint p.d.f. is also the second mixed partial derivative of the joint c.d.f. with respect to both variables. The joint p.f. of two discrete random variables is a nonnegative function f such that the probability of the pair (X, Y) being in a set C is the sum of f(x, y) over all points in C. A joint p.f. can be strictly positive at countably many pairs (x, y) at most. The joint p.f./p.d.f. of a discrete random variable X and a continuous random variable Y is a nonnegative function f such that the probability of the pair (X, Y) being in a set C is obtained by summing f(x, y) over all x such that (x, y) ∈ C for each y and then integrating the resulting function of y.

3.5 Marginal Distributions Summary Let f(x, y) be a joint p.f., joint p.d.f., or joint p.f./p.d.f. of two random variables X and Y. The marginal p.f. or p.d.f. of X is denoted by f1(x), and the marginal p.f. or p.d.f. of Y is denoted by f2(y). To obtain f1(x), compute Σ_y f(x, y) if Y is discrete or ∫_{−∞}^{∞} f(x, y) dy if Y is continuous. Similarly, to obtain f2(y), compute Σ_x f(x, y) if X is discrete or ∫_{−∞}^{∞} f(x, y) dx if X is continuous. The random variables X and Y are independent if and only if f(x, y) = f1(x)f2(y) for all x and y. This is true regardless of whether X and/or Y is continuous or discrete. A sufficient condition for two continuous random variables to be independent is that R = {(x, y) : f(x, y) > 0} be rectangular with sides parallel to the coordinate axes and that f(x, y) factor into separate functions of x and of y on R.
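For a discrete pair, the marginals are row and column sums of the joint table, and the factorization test for independence can be checked exhaustively; the joint table below is hypothetical:

```python
from fractions import Fraction

# Hypothetical joint p.f. f(x, y) stored as a table.
f = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

xs = {x for x, _ in f}
ys = {y for _, y in f}
f1 = {x: sum(f[(x, y)] for y in ys) for x in xs}  # marginal of X: sum over y
f2 = {y: sum(f[(x, y)] for x in xs) for y in ys}  # marginal of Y: sum over x

# Independence holds if and only if f(x, y) = f1(x) f2(y) for all (x, y).
print(all(f[(x, y)] == f1[x] * f2[y] for x in xs for y in ys))  # True
```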

3.6 Conditional Distributions Summary The conditional distribution of one random variable X given an observed value y of another random variable Y is the distribution we would use for X if we were to learn that Y = y. When dealing with the conditional distribution of X given Y = y, it is safe to behave as if Y were the constant y. If X and Y have joint p.f., p.d.f., or p.f./p.d.f. f(x, y), then the conditional p.f. or p.d.f. of X given Y = y is g1(x|y) = f(x, y)/f2(y), where f2 is the marginal p.f. or p.d.f. of Y. When it is convenient to specify a conditional distribution directly, the joint distribution can be constructed from the conditional together with the other marginal. For example, f(x, y) = g1(x|y)f2(y) = f1(x)g2(y|x). In this case, we have versions of the law of total probability and Bayes' theorem for random variables that allow us to calculate the other marginal and conditional. Two random variables X and Y are independent if and only if the conditional p.f. or p.d.f. of X given Y = y is the same as the marginal p.f. or p.d.f. of X for all y such that f2(y) > 0. Equivalently, X and Y are independent if and only if the conditional p.f. or p.d.f. of Y given X = x is the same as the marginal p.f. or p.d.f. of Y for all x such that f1(x) > 0.
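In the discrete case, g1(x|y) = f(x, y)/f2(y) is a direct computation from the joint table; the table below is hypothetical:

```python
from fractions import Fraction

# Hypothetical joint p.f. of a discrete pair (X, Y).
f = {
    (0, 0): Fraction(1, 6), (0, 1): Fraction(1, 3),
    (1, 0): Fraction(1, 3), (1, 1): Fraction(1, 6),
}

def f2(y):
    # Marginal p.f. of Y: sum the joint p.f. over x.
    return sum(p for (x, yy), p in f.items() if yy == y)

def g1(x, y):
    # Conditional p.f. of X given Y = y (requires f2(y) > 0).
    return f[(x, y)] / f2(y)

print(g1(0, 0))            # 1/3
print(g1(0, 0) + g1(1, 0)) # 1: each conditional p.f. sums to 1 over x
```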

3.7 Multivariate Distributions Summary A finite collection of random variables is called a random vector. We have defined joint distributions for arbitrary random vectors. Every random vector has a joint c.d.f. Continuous random vectors have a joint p.d.f. Discrete random vectors have a joint p.f. Mixed-distribution random vectors have a joint p.f./p.d.f. The coordinates of an n-dimensional random vector X are independent if the joint p.f., p.d.f., or p.f./p.d.f. f(x) factors into ∏_{i=1}^n fi(xi). We can compute marginal distributions of subvectors of a random vector, and we can compute the conditional distribution of one subvector given the rest of the vector. We can construct a joint distribution for a random vector by piecing together a marginal distribution for part of the vector and a conditional distribution for the rest given the first part. There are versions of Bayes' theorem and the law of total probability for random vectors. An n-dimensional random vector X has coordinates that are conditionally independent given Z if the conditional p.f., p.d.f., or p.f./p.d.f. g(x|z) of X given Z = z factors into ∏_{i=1}^n gi(xi|z). There are versions of Bayes' theorem, the law of total probability, and all future theorems about random variables and random vectors conditional on an arbitrary additional random vector.

3.8 Functions of a Random Variable Summary We learned several methods for determining the distribution of a function of a random variable. For a random variable X with a continuous distribution having p.d.f. f, if r is strictly increasing or strictly decreasing with differentiable inverse s (i.e., s(r(x)) = x and s is differentiable), then the p.d.f. of Y = r(X) is g(y) = f(s(y))|ds(y)/dy|. A special transformation allows us to transform a random variable X with the uniform distribution on the interval [0, 1] into a random variable Y with an arbitrary continuous c.d.f. G by Y = G^(−1)(X). This method can be used in conjunction with a uniform pseudo-random number generator to generate random variables with arbitrary continuous distributions.
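A sketch of the inverse-c.d.f. method, using the exponential distribution as the target (our choice of example): its c.d.f. is G(y) = 1 − e^(−y), so G^(−1)(x) = −log(1 − x), and applying this to uniform draws yields exponential samples:

```python
import math
import random

def inverse_exponential_cdf(x):
    # G^(-1)(x) = -log(1 - x) for the exponential(1) c.d.f. G(y) = 1 - exp(-y).
    return -math.log(1.0 - x)

random.seed(0)  # fixed seed so the run is reproducible
# Transform uniform [0, 1] pseudo-random numbers into exponential samples.
sample = [inverse_exponential_cdf(random.random()) for _ in range(100_000)]
print(sum(sample) / len(sample))  # close to 1, the exponential(1) mean
```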

3.9 Functions of Two or More Random Variables Summary We extended the construction of the distribution of a function of a random variable to the case of several functions of several random variables. If one only wants the distribution of one function r1 of n random variables, the usual way to find this is to first find n − 1 additional functions r2, ..., rn so that the n functions together compose a one-to-one transformation. Then find the joint p.d.f. of the n functions, and finally find the marginal p.d.f. of the first function by integrating out the extra n − 1 variables. The method is illustrated for the cases of the sum and the range of several random variables.

3.10 Markov Chains Summary A Markov chain is a stochastic process, a sequence of random variables giving the states of the process, in which the conditional distribution of the state at the next time given all of the past states depends on the past states only through the most recent state. For Markov chains with finitely many states and stationary transition distributions, the transitions over time can be described by a matrix giving the probabilities of transition from the state indexing the row to the state indexing the column (the transition matrix P). The initial probability vector v gives the distribution of the state at time 1. The transition matrix and initial probability vector together allow calculation of all probabilities associated with the Markov chain. In particular, P^n gives the probabilities of transitions over n time periods, and vP^n gives the distribution of the state at time n + 1. A stationary distribution is a probability vector v such that vP = v. Every finite Markov chain with stationary transition distributions has at least one stationary distribution. For many Markov chains, there is a unique stationary distribution, and the distribution of the chain after n transitions converges to the stationary distribution as n goes to ∞.
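A minimal sketch of these calculations for a two-state chain; the transition probabilities below are made up for illustration. For this P the stationary distribution solves vP = v and works out to (5/6, 1/6):

```python
# Hypothetical transition matrix: row i gives the distribution of the
# next state when the current state is i.
P = [[0.9, 0.1],
     [0.5, 0.5]]
v = [1.0, 0.0]  # initial probability vector: start in state 0 at time 1

def step(v, P):
    # One transition: (vP)_j = sum_i v_i * P[i][j].
    return [sum(v[i] * P[i][j] for i in range(len(v))) for j in range(len(P[0]))]

# vP^n is the distribution of the state at time n + 1; iterate 50 transitions.
for _ in range(50):
    v = step(v, P)
print(v)  # converges toward the stationary distribution (5/6, 1/6)

# The stationary distribution is (approximately) unchanged by one more step.
print(step([5 / 6, 1 / 6], P))
```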




