Assumption: familiarity with set theory and proofs.
Source(s):
Most of this material is derived from "Mathematical Statistics" by Wackerly.
Some of this material is also derived from "Probability and Statistics for Engineering and the Sciences", by Jay Devore, but to a much lesser degree.
The Wackerly book has more formulas (instead of tables), introductions to the mn rule, and other important concepts of combinatorics and statistics.
- Probability Definition: Events, Sample Points and Sequencing Events Techniques
- How to calculate probability: Combinations, Permutations, Bayes Theorem
- Expected Value, Variance, Standard Deviation, Quartiles
- Discrete Random Variables
- Discrete Probability Distributions: Binomial
- Discrete Probability Distributions: Geometric
- Discrete Probability Distributions: Hypergeometric
- Discrete Probability Distributions: Negative Binomial
- Discrete Probability Distributions: Poisson
- Continuous Random Variables
- Probability Distributions "Distribution Functions" for all types of variables
- What is Density? A Mathematician's Perspective (and prep for Density Functions)
- Probability Density Functions: PDF
- Expected Value for a Continuous Random Variable
- Cumulative Distribution Functions (CDFs)
- Uniform Probability Distribution
- Normal Probability Distribution
- Standard Normal Distribution
- Gamma and Exponential Distributions
- Multivariate (Bivariate, Joint) Probability Distributions
- Marginal and Conditional Probability Distributions
- Independent Random Variables
- Expected Value of a Function of Random Variables
- Covariance of Two Random Variables
- Central Limit Theorem
Probability is the likelihood that an event will occur.
Events
The probability of an event $E$ is the cardinality of the event $|E|$ divided by the cardinality of the sample space $|S|$ (the "universe" $S$) that the event lives in:
$P(E) = \dfrac{|E|}{|S|}$
• For any event, the probability is nonnegative.
• The probability of the entire sample space is 1.
• For mutually exclusive events, the probability that at least one occurs is the sum of their individual probabilities.
Law of Total Probability: for a partition $B_1, B_2, \dots, B_k$ of the sample space with $P(B_i) > 0$,
$P(A) = \sum_{i=1}^{k} P(A \mid B_i) P(B_i)$
Law of Conditional Probability:
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$, provided $P(B) > 0$
Independent Events:
$P(A \cap B) = P(A) P(B)$, or equivalently, $P(A \mid B) = P(A)$
One really interesting quality about independent events: events that constrain one another are dependent;
"negative number" versus "positive number" (for the same draw) are dependent events.
Mutually Exclusive is not Independent
Take the "negative number" versus "positive number" setup: "if A, then not B."
Here, the events are mutually exclusive, and therefore dependent: knowing A occurred tells you B cannot occur.
Multiplicative: $P(A \cap B) = P(A) P(B \mid A)$.
If A and B are independent, $P(A \cap B) = P(A) P(B)$.
Additive: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
If A and B are mutually exclusive, $P(A \cup B) = P(A) + P(B)$.
• Complement rule: $P(\bar{A}) = 1 - P(A)$.
• The complement of "at most one" is "at least two."
• The complement of "at least one type" is "only one type."
The Wackerly probability book is great, and describes the sample-point method for calculating probability.
One example is to toss a pair of dice. The sample space, via the mn rule, contains $6 \times 6 = 36$ equally likely sample points.
There will be a list of events such as $E_1 = (1, 1)$, $E_2 = (1, 2)$, and so on, each with probability $\dfrac{1}{36}$.
See the Wackerly book for more details on this technique, as well as sequenced events.
Another technique, after the sample-point technique, is sequencing events (counting ordered sequences of trials).
Ordering n items:
$n!$ ways.
Combinations: Order Doesn't Matter
The number of combinations of $n$ items taken $r$ at a time is $\binom{n}{r} = \dfrac{n!}{r!(n - r)!}$.
Examples: Out of the set $S = \{A, B, C\}$, selections of size three with repetition allowed would include $AAA$, $AAB$, $ABC$, etc., and $ABA = BAA$ because order doesn't matter. When order doesn't matter, you don't need to count as many things; e.g., if $AAB$ is equivalent to $ABA$, then those selections count as one element of the set, not two.
Permutations: Order Matters
The number of permutations of $n$ items taken $r$ at a time is $P^n_r = \dfrac{n!}{(n - r)!}$.
Note that the denominator is smaller than in combinations. The number of permutations is much larger because order matters, so we have to count every ordering.
Examples: Out of the set $S = \{A, B, C\}$, arrangements of size three with repetition allowed would include $AAA$, $AAB$, $ABC$, etc., and $ABA \neq BAA$.
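A quick numeric sketch in R using the base functions `factorial()` and `choose()`; the numbers (5 items taken 2 at a time) are made up for illustration:

```r
factorial(5)                     # 5! = 120 ways to order 5 distinct items

choose(5, 2)                     # combinations: 5! / (2! * 3!) = 10, order doesn't matter
factorial(5) / factorial(5 - 2)  # permutations: 5! / 3! = 20, order matters
```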
Bayes Theorem:
Usually used for inversion techniques: "find the probability of a cause, given an effect."
Let $B_1, B_2, \dots, B_k$ be a partition of the sample space, with $P(B_i) > 0$ for each $i$, and let $A$ be an event with $P(A) > 0$.
Then,
$P(B_j \mid A) = \dfrac{P(A \mid B_j) P(B_j)}{\sum_{i=1}^{k} P(A \mid B_i) P(B_i)}$
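A minimal sketch in R with hypothetical numbers (a 1% prior, 95% true-positive rate, 5% false-positive rate; none of these come from the notes), showing the "probability of a cause, given effect" inversion:

```r
p_d      <- 0.01    # P(cause): prior probability of the condition
p_pos_d  <- 0.95    # P(effect | cause): true-positive rate
p_pos_nd <- 0.05    # P(effect | no cause): false-positive rate

# Denominator: law of total probability
p_pos <- p_pos_d * p_d + p_pos_nd * (1 - p_d)

# Bayes theorem: P(cause | effect)
p_pos_d * p_d / p_pos   # ~0.16
```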
Cardinality
Cardinality is the number of elements in a Set.
Expected Value,
$\mu$ or $E[Y]$: The average
Expected value or mean is a calculation whose computation differs depending on whether the random variable is discrete or continuous and on its probability distribution.
Variance,
$\sigma^2$ : Dispersion From the Mean
Variance is a measure of how far a set of numbers "spreads out" from the mean or average value.
Standard Deviation,
$\sigma$: Typical spread of values around the mean
A low standard deviation means values are close to the mean; a high standard deviation means values are more spread out.
Quartiles:
A measure in statistics; we've heard "upper quartile," etc. There are three actual quartiles: the first is the 25th percentile, then the 50th (the median), and the 75th; the four "quarters" are just the groups of data that fall between those cut points.
Expected Value or Mean of a Discrete Random Variable
$E[Y] = \mu = \sum_{\forall y} y \, p(y)$
Variance of a Discrete Random Variable,
$\sigma^2$
$Var[Y] = \sigma^2 = E[(Y - \mu)^2] = \sum_{\forall y} (y - \mu)^2 \, p(y)$
Hacking variance:
$Var[Y] = E[Y^2] - [E(Y)]^2$
A trick that's nice to know.
Standard Deviation of a Discrete Random Variable,
$\sigma$
$\sigma = \sqrt{Var[Y]} = \sqrt{\sigma^2}$
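A minimal sketch in R, assuming a made-up pmf on the values 0 through 3, showing how the mean, variance (both forms), and standard deviation are computed:

```r
y <- 0:3
p <- c(0.1, 0.2, 0.4, 0.3)    # hypothetical pmf; sum(p) is 1

mu     <- sum(y * p)          # E[Y] = sum of y * p(y)
sigma2 <- sum((y - mu)^2 * p) # Var[Y] = sum of (y - mu)^2 * p(y)

sum(y^2 * p) - mu^2           # "hacked" variance: E[Y^2] - (E[Y])^2, same value
sqrt(sigma2)                  # standard deviation
```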
Discrete random variables take scalar, discrete values; their distribution functions are stepwise. They are best described via a pmf.
pmf: Probability "mass" function
A pmf gives the probability that a discrete random variable takes a particular value.
This could be denoted as $P(Y = y)$, or more concretely, $P(Y = 1)$,
for example.
Probability mass functions will depend on the particular problem you're trying to solve.
Axioms of pmfs and discrete random variable probabilities:
- Each possible value of the random variable must be assigned a nonnegative probability, with $0 \leq p(y) \leq 1$;
- All of the probabilities must sum to a total probability of 1.
Essentially, a binomial experiment is a series of $n$ repeated, identical, and independent trials, each resulting in either success or failure, with success probability $p$ on every trial; the random variable $Y$ counts the number of successes.
Distribution:
Using the binomial probability distribution formula, we know that for $n$ trials with success probability $p$, the pmf is represented by:
$p(y) = \binom{n}{y} p^y (1 - p)^{n - y}$
Or, more canonically, let $q = 1 - p$. Then
$p(y) = \binom{n}{y} p^y q^{n - y}$, for $y = 0, 1, 2, \dots, n$.
Mean, Variance, Std Deviation of Binomial:
$\mu = np, \qquad \sigma^2 = npq, \qquad \sigma = \sqrt{npq}$
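A quick sketch in R using the base binomial functions; the parameters $n = 10$, $p = 0.3$ are made up for illustration:

```r
n <- 10; p <- 0.3

dbinom(3, size = n, prob = p)          # pmf: P(Y = 3)
choose(n, 3) * p^3 * (1 - p)^(n - 3)   # same value, straight from the formula
pbinom(3, size = n, prob = p)          # CDF: P(Y <= 3)

n * p                                  # mean np
n * p * (1 - p)                        # variance npq
```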
The geometric probability distribution is built on the binomial idea of a series of identical, independent trials with success probability $p$; the geometric distribution of a random variable $Y$ is where the value of $Y$ is the number of the trial on which the first success occurs.
Looking at the sample space (Wackerly 3.5), we see that
$E_1: S$
$E_2: F, S$
$E_3: F, F, S$
...
$E_k: F, F, F, \dots, S$ with success on the $k$th trial,
where there are $k - 1$ failures followed by the first success.
As such, $P(Y = k) = q^{k-1} p$.
Geometric Probability Distribution:
$p(y) = q^{y-1} p$, for $y = 1, 2, 3, \dots$, where $q = 1 - p$ and $0 \leq p \leq 1$.
Mean, Variance, Std Deviation of Geometric Distribution:
$\mu = \dfrac{1}{p}, \qquad \sigma^2 = \dfrac{1 - p}{p^2}, \qquad \sigma = \sqrt{\dfrac{1 - p}{p^2}}$
Proofs for these are in the Wackerly book chapter 3.5 and are interesting.
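One caution when checking these in R: `dgeom` counts the number of failures before the first success, not the trial number of the first success, so Wackerly's $P(Y = y)$ corresponds to `dgeom(y - 1, p)`. A sketch with a made-up $p = 0.2$:

```r
p <- 0.2
y <- 4                      # first success on trial 4

dgeom(y - 1, prob = p)      # R's parameterization: y - 1 failures, then a success
(1 - p)^(y - 1) * p         # same value, from q^(y-1) * p

1 / p                       # mean
(1 - p) / p^2               # variance
```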
Distribution:
For random sampling of a sample of size $n$, without replacement, from a finite population of $N$ elements, $r$ of which are "successes" and $N - r$ of which are "failures," the hypergeometric pmf is
$p(y) = \dfrac{\binom{r}{y}\binom{N - r}{n - y}}{\binom{N}{n}}$
The denominator: counting the number of ways to select a subset of $n$ elements from the population of $N$.
Then for the numerator, we think of choosing $y$ of the $r$ successes, and the remaining $n - y$ from the $N - r$ failures.
Mean, Variance, Std Deviation of Hypergeometric:
$\mu = \dfrac{nr}{N}, \qquad \sigma^2 = n \left(\dfrac{r}{N}\right)\left(\dfrac{N - r}{N}\right)\left(\dfrac{N - n}{N - 1}\right)$
Then if we define $p = \dfrac{r}{N}$ and $q = 1 - p$, the variance is $npq \cdot \dfrac{N - n}{N - 1}$.
Note the factor $\dfrac{N - n}{N - 1}$.
As $N \to \infty$ with $n$ fixed, this factor approaches 1.
So for larger population sizes, the variance of the hypergeometric distribution is approximately the same as the binomial's, e.g. $npq$.
As $n$ approaches $N$, the factor $\dfrac{N - n}{N - 1}$ approaches 0,
so the hypergeometric distribution's variance is smaller than that of the binomial distribution, which would have variance $npq$.
Having lesser variance can be a good thing, so we can see how the hypergeometric distribution is useful for cases where the sample size approaches the population size. "For sampling from a finite population" such as, quality control, genetic hypothesis testing, or statistical hypothesis testing.
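The R call the notes use later, `dhyper(y, r, N - r, n)`, maps directly onto this parameterization (successes in the population, failures in the population, sample size). A sketch with made-up numbers, N = 20 items of which r = 5 are defective, sampling n = 4:

```r
N <- 20; r <- 5; n <- 4
y <- 2                                               # exactly 2 defectives in the sample

dhyper(y, r, N - r, n)                               # P(Y = 2)
choose(r, y) * choose(N - r, n - y) / choose(N, n)   # same value, from the formula

n * r / N                                            # mean
n * (r / N) * ((N - r) / N) * ((N - n) / (N - 1))    # variance
```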
Recall the geometric distribution, which is finding the probability of the first success. The negative binomial distribution focuses on the use case for multiple successes occurring.
Depending on the textbook you are using, this is either counting the number of failures, or counting the trial where the $r$th success occurs.
The "rth success".
Distribution (case 1, Wackerly): $Y$ is the number of the trial on which the $r$th success occurs:
$p(y) = \binom{y - 1}{r - 1} p^r q^{y - r}$, for $y = r, r + 1, r + 2, \dots$
Distribution (case 2, Devore): $X$ is the number of failures that precede the $r$th success:
$nb(x; r, p) = \binom{x + r - 1}{r - 1} p^r (1 - p)^x$, for $x = 0, 1, 2, \dots$
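R's `dnbinom(x, size, prob)` uses the Devore-style parameterization, where `x` is the number of failures before the `size`-th success; to evaluate Wackerly's trial-number version, shift by $r$. A sketch with made-up $r = 3$, $p = 0.4$:

```r
r <- 3; p <- 0.4

# Devore / R: probability of exactly 5 failures before the 3rd success
dnbinom(5, size = r, prob = p)

# Wackerly: probability the 3rd success lands on trial y = 8 (i.e., y - r = 5 failures)
y <- 8
dnbinom(y - r, size = r, prob = p)
choose(y - 1, r - 1) * p^r * (1 - p)^(y - r)   # same value, from the formula
```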
The Poisson probability distribution, used for rare events over a period of time (or space), is also used to approximate the binomial distribution, since the binomial distribution converges to the Poisson. The Poisson can approximate the binomial when $n$ is large and $p$ is small, with $\lambda = np$.
The Poisson distribution's probability function is
$p(y) = \dfrac{\lambda^y e^{-\lambda}}{y!}$, for $y = 0, 1, 2, \dots$, with $\mu = \sigma^2 = \lambda$.
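A small comparison sketch in R; the parameters n = 1000, p = 0.002 (so $\lambda = np = 2$) are made up to fit the "large n, small p" use case:

```r
n <- 1000; p <- 0.002
lambda <- n * p                          # Poisson approximation uses lambda = np

dbinom(3, size = n, prob = p)            # exact binomial P(Y = 3)
dpois(3, lambda = lambda)                # Poisson approximation
lambda^3 * exp(-lambda) / factorial(3)   # same value, from the formula
```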
Continuous random variables are defined on a continuum, e.g. an interval.
Take the real number line: within any interval there are infinitely (indeed uncountably) many possible values.
Hence, the axioms of probability for continuous variables cannot be the same as those for discrete variables.
- If each possible value of the random variable must be assigned a probability,
- And each possible value is a subset of an infinite set within an interval,
- Then the nonzero probabilities cannot all sum to 1, as there are infinitely many possible values.
- Therefore a new set of axioms for continuous random variables must be defined, as follows.
From Wackerly 4.2, this is an important note about the definition of distribution functions, because distribution functions, e.g. cumulative distributions or probability distributions, can be for ANY random variable, whether discrete or continuous:
"Before we can state a formal definition for a continuous random variable, we must define the distribution function (or cumulative distribution function) associated with a random variable."
Let $Y$ denote any random variable. Then $F(y) = P(Y \leq y)$; for example, $F(2) = P(Y \leq 2)$.
The nature of the distribution function associated with a random variable, determines whether the variable is discrete or continuous.
- Discrete random variables have a stepwise function.
- Continuous random variables have a continuous function.
- Continuous random variables have a smooth curve graph, which can be thought of as the limiting shape of histograms (Riemann-sum-like approximations).
- Variables are continuous if their distribution functions are; there is a lot of real-analysis continuity material here regarding "absolute continuity." More importantly,
- For a continuous random variable $Y$, $\forall y \in \mathbb{R}, P(Y = y) = 0$; that is, continuous random variables have zero probability at discrete points.
Wackerly uses the example of daily rainfall: the probability of exactly 2.312 inches, a single point, is essentially zero; the probability of between 2 and 3 inches, an interval, is quite likely.
Semantics and Idioms of the R language for probability distributions (considered separately from the pure mathematical theory):
Note that in R, the "density function," invoked via dhyper(y, r, N-r, n), returns a discrete random variable's pmf value, such as in our hypergeometric example; there's a bit of naming oddness here, since R uses the "d" (density) prefix even for discrete random variables.
Also in R, the "probability distribution function" (the CDF) is invoked via phyper(4, r, N-r, n).
And a preparation for density functions in probability.
Note: This is often considered grad-student level Real Analysis work, and the real numbers can arguably be constructed in various ways; the Dedekind cuts are merely my personal favorite.
I ran across this material in Jay Cummings' Real Analysis book, a book that's $20 on Amazon and is used by the Wrath of Math (an excellent YouTube math channel).
If you'd prefer to have a social life, you can skip this section, but frankly, without density in Real Analysis, density functions in probability are a bit nonsensical to me.
Recall Real Analysis, and that the real numbers can be constructed via Dedekind cuts of rational numbers link; recall that "rationals are dense in the reals," stack exchange, Wikipedia dense set and topology here.
We could also say "density of the rationals in the reals."
Basically, there are a lot of "density" discussions with the real numbers, as such.
Take any interval on the real number line. "Subdivide" that interval into many "subdivisions."
There are "infinite" real numbers, or subdivisions, in that interval (countably many rationals, uncountably many reals).
The big picture is, they're infinite, or close enough to infinite that it doesn't matter.
This is what "density" looks like. (The articles above are about this, regarding the real numbers, as well as rational and irrational numbers, and constructing the real number line from a hybrid of rational and irrational numbers like Dedekind, which is very fun Real Analysis stuff).
So, that's what "density" is: take an interval on the real number line, subdivide it quite a lot into infinite subdivisions, and hey, that's "dense."
Continuous variables are analyzed on an interval, so we care about density in that interval, as the previous section discusses.
PDF: Probability Density Function
A PDF is a function that provides a "likelihood" that a continuous random variable's value is close to that of the value of a sample, or multiple samples.
For more on PDFs, see Wikipedia PDF article.
Probability density: Probability per unit length that RV is near one or more samples.
Probability density is the probability per unit length. "While the absolute likelihood for a continuous random variable to take on any particular value is 0 (since there is an infinite set of possible values to begin with), the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would be close to one sample compared to the other sample." (Wikipedia)
PDF formula: The PDF of a continuous random variable $Y$ is the function $f(y)$ such that, for an interval $[a, b]$ with $a \leq b$,
$P(a \leq Y \leq b) = \int_a^b f(y)\, dy$.
That is, the probability that the continuous random variable falls within an interval
is the area under the curve of the density function between $a$ and $b$.
PDF Axioms:
- The total area under the curve of $f(x)$, from $-\infty$ to $\infty$, is 1:
That is, $\int_{-\infty}^{\infty} f(x)\, dx = 1$.
(Continuous variables have a "smooth curve" graph.)
This axiom is analogous to the discrete RV's probabilities all summing to 1.
- $f(x) \geq 0, \forall x$. The density is nonnegative everywhere.
Mean or Expected Value of a continuous random variable:
$E[Y] = \mu = \int_{-\infty}^{\infty} y\, f(y)\, dy$
Similarly, for a function $g(Y)$,
$E[g(Y)] = \int_{-\infty}^{\infty} g(y)\, f(y)\, dy$
Variance of a continuous random variable with PDF $f(x)$:
$\sigma^2 = Var[X] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx = E[X^2] - \mu^2$
The CDF for a continuous random variable $Y$ with PDF $f(y)$ is
$F(y) = P(Y \leq y) = \int_{-\infty}^{y} f(t)\, dt$
Using $F(x)$ to compute probabilities:
Let $X$ be a continuous random variable with PDF $f(x)$ and CDF $F(x)$.
Then, for any number $a$, $P(X > a) = 1 - F(a)$, and for any numbers $a \leq b$, $P(a \leq X \leq b) = F(b) - F(a)$.
Relating PDF and CDF via the fundamental theorem of calculus:
If $X$ is a continuous random variable with PDF $f(x)$ and CDF $F(x)$,
then at every $x$ at which the derivative $F'(x)$ exists, $F'(x) = f(x)$.
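A small numerical sketch in R, using a made-up exponential density (rate 2) to check these PDF/CDF relationships with `integrate()`:

```r
f <- function(x) 2 * exp(-2 * x)     # hypothetical PDF on [0, infinity)

integrate(f, lower = 0, upper = Inf)$value   # total area under the curve = 1 (axiom)

a <- 0.5; b <- 1.5
integrate(f, lower = a, upper = b)$value     # P(a <= X <= b), area under f
(1 - exp(-2 * b)) - (1 - exp(-2 * a))        # F(b) - F(a) from the closed-form CDF, same value
```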
In a uniform distribution, every possible outcome is equiprobable - for example, handing out a dollar to random passersby without discernment.
Uniform Distributions look like a "block" most of the time, where probability is constant within an interval.
Uniform Distributions for Discrete Random Variables
The probability of each outcome is 1 divided by the total number of outcomes.
Use cases include the possible outcomes of rolling a 6-sided die,
probability of drawing a particular suit within a deck of cards,
flipping a coin, etc.
All of these are equiprobable discrete cases.
Uniform Distributions for Continuous Random Variables
This can include a random number generator, temperature ranges, and many use cases with an infinite number of possible outcomes within an interval of measurement.
For the continuous random variables, we'll present the probability density function, the cumulative distribution function, and mean and variance.
PDF:
The PDF of the uniform distribution on $[\theta_1, \theta_2]$ is
$f(y) = \dfrac{1}{\theta_2 - \theta_1}$ for $\theta_1 \leq y \leq \theta_2$, and $0$ elsewhere.
In the uniform distribution, the probability over a subinterval is proportional to the length of that subinterval.
CDF:
$F(y) = 0$ for $y < \theta_1$, $\quad F(y) = \dfrac{y - \theta_1}{\theta_2 - \theta_1}$ for $\theta_1 \leq y \leq \theta_2$, $\quad F(y) = 1$ for $y > \theta_2$
$\mu, \sigma^2$:
$\mu = \dfrac{\theta_1 + \theta_2}{2}, \qquad \sigma^2 = \dfrac{(\theta_2 - \theta_1)^2}{12}$
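A quick sketch in R for a made-up uniform distribution on $[\theta_1, \theta_2] = [2, 10]$:

```r
theta1 <- 2; theta2 <- 10

dunif(5, min = theta1, max = theta2)   # PDF: 1 / (theta2 - theta1) = 0.125 everywhere in [2, 10]
punif(7, min = theta1, max = theta2)   # CDF: (7 - 2) / 8 = 0.625

(theta1 + theta2) / 2                  # mean
(theta2 - theta1)^2 / 12               # variance
```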
This is the famous "bell curve," the most widely used probability distribution, where the mean is at the center, and standard deviation depicts width around that mean of the curve, indicating its variance - or, its volatility. This relation to volatility helps us understand the bell curve's importance in measuring the relative stability of a metric.
The normal distribution is common in statistics, economics, and finance.
The accumulation of many small deviations around the mean creates the bell shape.
The Normal Distribution for a continuous random variable has the PDF:
$f(y) = \dfrac{1}{\sigma\sqrt{2\pi}} e^{-(y - \mu)^2 / (2\sigma^2)}, \quad -\infty < y < \infty$
Parameters of the Normal Distribution:
$\mu, \sigma$
We consider $\mu$ the mean, which locates the center of the distribution;
we consider $\sigma$ the standard deviation, which measures its spread.
The notation $Y \sim N(\mu, \sigma^2)$ means $Y$ is normally distributed with mean $\mu$ and variance $\sigma^2$.
Area under the normal density function from a to b:
$P(a \leq Y \leq b) = \int_a^b \dfrac{1}{\sigma\sqrt{2\pi}} e^{-(y - \mu)^2 / (2\sigma^2)}\, dy$, which has no closed form, so we use tables or software.
R code: pnorm, qnorm
$pnorm(y, \mu, \sigma) \Rightarrow P(Y \leq y)$, the CDF evaluated at $y$.
$qnorm(p, \mu, \sigma) \Rightarrow$ the $p$th quantile, i.e. the value $y_p$ such that $P(Y \leq y_p) = p$.
Z Values: Distance in standard deviations from the mean
This is the normal distribution with parameter values $\mu = 0, \sigma = 1$.
The PDF of a continuous random variable with the standard normal distribution is:
$f(z) = \dfrac{1}{\sqrt{2\pi}} e^{-z^2 / 2}, \quad -\infty < z < \infty$
The "z-curve" is the standard normal curve.
Z-scores: How many std dev from the mean a value is; areas under the curve
68-95-99 rule:
68% of the distribution is within one standard deviation; 95% within two; 99% within three.
So,
• 68% of all scores fall within $\mu \pm \sigma$ (i.e., $-1 \leq z \leq 1$)
• 95% of all scores fall within $\mu \pm 2\sigma$
• 99% of all scores fall within $\mu \pm 3\sigma$
• and 50% of all scores fall below the mean ($z \leq 0$)
Z-notation for z-critical values; percentiles
The notation $z_\alpha$ denotes the z-value with area $\alpha$ to its right under the standard normal curve; equivalently, $z_\alpha$ is the $100(1 - \alpha)$th percentile of the standard normal distribution.
Standardizing (nonstandard) distributions:
Standardizing converts any normal variable to the standard normal with $\mu = 0, \sigma = 1$.
Recall that distance from the mean in standard deviations was the z-score, $z = \dfrac{y - \mu}{\sigma}$.
This is similar; the "standardized variable" is $Z = \dfrac{Y - \mu}{\sigma}$.
• Subtracting $\mu$ shifts the center of the distribution to 0.
• Dividing by $\sigma$ rescales the spread so that the standard deviation is 1.
Standard normal distribution axioms:
• $P(Z \leq z) = \Phi(z)$, the standard normal CDF.
Then, when we see $\Phi$, that means to use the probability distribution tables:
• $P(Z > z) = 1 - \Phi(z)$
• $P(a \leq Z \leq b) = \Phi(b) - \Phi(a)$
• The CDF of Z = $\Phi(z) = \int_{-\infty}^{z} \dfrac{1}{\sqrt{2\pi}} e^{-t^2 / 2}\, dt$, whose values are tabulated.
**Please note the normal distribution markdown file to see an application
of the axioms of std normal distribution, as that is the best way to learn.**
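A quick sketch of these identities in R with `pnorm`/`qnorm` (which default to the standard normal); the $N(100, 15^2)$ example at the end is made up to show standardizing:

```r
pnorm(1.96)               # Phi(1.96), about 0.975
1 - pnorm(1.96)           # upper-tail area, P(Z > 1.96)
pnorm(2) - pnorm(-2)      # P(-2 <= Z <= 2), about 0.954 (the "95" in 68-95-99)
qnorm(0.975)              # z critical value, about 1.96

# Standardizing a nonstandard normal: P(Y <= 110) for Y ~ N(100, 15^2)
pnorm(110, mean = 100, sd = 15)
pnorm((110 - 100) / 15)   # same value via Z = (Y - mu) / sigma
```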
Standard Normal Approximation of Binomial:
An interesting quality of the normal distribution is that its curve approximates the binomial distribution's histogram when that histogram isn't "too skewed." For these cases, use the normal approximation.
Normal approximation:
Use a normal curve with $\mu = np$ and $\sigma = \sqrt{npq}$; with the continuity correction,
$P(Y \leq y) \approx \Phi\left(\dfrac{y + 0.5 - np}{\sqrt{npq}}\right)$
This approximation is adequate if the binomial histogram is not too skewed; a common rule of thumb is $np \geq 10$ and $nq \geq 10$.
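A quick numeric check in R; n = 50, p = 0.4 are made-up values that satisfy the rule of thumb:

```r
n <- 50; p <- 0.4; q <- 1 - p
mu <- n * p; sigma <- sqrt(n * p * q)

pbinom(25, size = n, prob = p)     # exact binomial P(Y <= 25)
pnorm((25 + 0.5 - mu) / sigma)     # normal approximation with continuity correction
```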
The gamma distribution, like the Poisson, is often used for waiting times and other measurements during temporal intervals.
Exponential Distribution:
With rate parameter $\lambda$ (equivalently, scale parameter $1/\lambda$):
- $\mu = \dfrac{1}{\lambda}$, and $\sigma^2 = \dfrac{1}{\lambda^2}$
- PDF: $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x \geq 0$, else $0$
- CDF: $F(x; \lambda) = 1 - e^{-\lambda x}$ for $x > 0$, else $0$
Gamma Distribution
With parameters $\alpha$ (shape) and $\beta$ (scale):
- PDF: $f(y; \alpha, \beta) = \dfrac{y^{\alpha - 1} e^{-y/\beta}}{\beta^{\alpha}\Gamma(\alpha)}$ for $y \geq 0$,
  where the gamma function is $\Gamma(\alpha) = \int_0^{\infty} y^{\alpha - 1} e^{-y}\, dy$
- PDF, Standard Gamma Distribution ($\beta = 1$): $f(y; \alpha) = \dfrac{y^{\alpha - 1} e^{-y}}{\Gamma(\alpha)}$
- CDF (standard gamma, the incomplete gamma function): $F(y; \alpha) = \int_0^{y} \dfrac{t^{\alpha - 1} e^{-t}}{\Gamma(\alpha)}\, dt$
- $\mu = \alpha\beta$
- $\sigma^2 = \alpha\beta^2$
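A small sketch in R covering both distributions; note that `dgamma`/`pgamma` default to a rate parameter, so the scale parameterization ($\beta$) above is passed via `scale =`. The parameter values are made up:

```r
# Exponential with rate lambda = 0.5
lambda <- 0.5
pexp(3, rate = lambda)       # CDF: 1 - exp(-lambda * 3)
1 - exp(-lambda * 3)         # same value, from the formula
1 / lambda                   # mean (variance is 1 / lambda^2)

# Gamma with shape alpha = 2 and scale beta = 3
alpha <- 2; beta <- 3
dgamma(4, shape = alpha, scale = beta)                        # density at y = 4
4^(alpha - 1) * exp(-4 / beta) / (beta^alpha * gamma(alpha))  # same value, from the formula
alpha * beta                 # mean
alpha * beta^2               # variance
```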
Until now we've seen univariate probability distributions. The same basic axioms and rules tend to apply to multivariate distributions.
Example: toss a pair of dice.
The sample space, by the mn rule, has $6 \times 6 = 36$ equally likely sample points,
with events such as $(1, 1), (1, 2), \dots, (6, 6)$.
Hence, the bivariate probability function is $p(y_1, y_2) = P(Y_1 = y_1, Y_2 = y_2) = \dfrac{1}{36}$ for each pair.
Joint or Bivariate PMFs for multiple discrete random variables: the probability of a joint event is the sum of $p(y_1, y_2) = P(Y_1 = y_1, Y_2 = y_2)$ over its sample points.
- Axioms: probabilities are all nonnegative, and all probabilities sum to 1.
Example: For tossing two dice, find the probability of an event such as $Y_1 = 2$ and $Y_2 \leq 2$.
Simply sum the probabilities: $p(2, 1) + p(2, 2) = \dfrac{2}{36} = \dfrac{1}{18}$.
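A quick sketch in R, building the two-dice joint pmf from the notes' example as a 6-by-6 matrix and summing over an event:

```r
# Joint pmf of two fair dice: every (y1, y2) pair has probability 1/36
joint <- matrix(1 / 36, nrow = 6, ncol = 6,
                dimnames = list(y1 = 1:6, y2 = 1:6))

sum(joint)                      # axiom check: probabilities sum to 1

# P(Y1 = 2, Y2 <= 2): sum the probabilities of the qualifying sample points
sum(joint["2", c("1", "2")])    # 2/36
```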
Joint or Bivariate CDFs for two jointly continuous random variables are double integrals:
$F(y_1, y_2) = P(Y_1 \leq y_1, Y_2 \leq y_2) = \int_{-\infty}^{y_1} \int_{-\infty}^{y_2} f(t_1, t_2)\, dt_2\, dt_1$
"To find p1(y1), we sum p(y1, y2) over all values of y2 and hence accumulate the probabilities on the y1 axis (or margin)." - Wackerly
Bivariate events such as $(Y_1 = y_1, Y_2 = y_2)$ can be collapsed onto one axis (or margin) to get the distribution of a single variable.
Marginal Probability Functions: fix one variable, iterate (sum, integrate) over the other; accumulate.
- Discrete pmf: $p_x(x) = \sum_{\forall y} p(x, y), \forall x$.
- Continuous PDF: $f_x(x) = \int_{\forall y} f(x, y)\, dy$.
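Continuing the two-dice sketch, marginals in R are just row or column sums of the joint pmf matrix (accumulating over the other variable):

```r
joint <- matrix(1 / 36, nrow = 6, ncol = 6)   # joint pmf of two fair dice

rowSums(joint)   # marginal pmf of Y1: sum over y2; each entry is 1/6
colSums(joint)   # marginal pmf of Y2: sum over y1
```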
We know that bivariate or joint events such as $(Y_1 = y_1, Y_2 = y_2)$ are intersections of univariate events.
Generally, conditioning works as in the univariate case: $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$.
Less generally, for random variables:
Conditional, Discrete: $p(y_1 \mid y_2) = P(Y_1 = y_1 \mid Y_2 = y_2) = \dfrac{p(y_1, y_2)}{p_2(y_2)}$, provided $p_2(y_2) > 0$.
Conditional, Continuous: $f(y_1 \mid y_2) = \dfrac{f(y_1, y_2)}{f_2(y_2)}$, provided $f_2(y_2) > 0$.
If Y1 and Y2 are independent, the joint probability can be written as the product of the marginal probabilities:
$p(y_1, y_2) = p_1(y_1)\, p_2(y_2)$ in the discrete case, and $f(y_1, y_2) = f_1(y_1)\, f_2(y_2)$ in the continuous case.
This is the same as in univariate situations: multiply the value of the function of the variables by the joint (density/mass) function, then sum or integrate.
$E[g(Y_1, Y_2)] = \sum_{\forall y_1} \sum_{\forall y_2} g(y_1, y_2)\, p(y_1, y_2)$ for discrete variables, or the corresponding double integral $\int \int g(y_1, y_2)\, f(y_1, y_2)\, dy_2\, dy_1$ for continuous variables.
Covariance and Correlation are measures of dependency. The larger the absolute covariance, the stronger the (linear) dependence; zero covariance means zero correlation.
If $Y_1$ and $Y_2$ are random variables with means $\mu_1$ and $\mu_2$, their covariance is
$Cov(Y_1, Y_2) = E[(Y_1 - \mu_1)(Y_2 - \mu_2)]$
After some algebra, we can see that's also
$Cov(Y_1, Y_2) = E[Y_1 Y_2] - \mu_1 \mu_2$
Positive covariance indicates proportionality; negative covariance indicates inverse proportionality.
Since covariance is hard to interpret on its own (it depends on the scale of measurement), we often use the correlation coefficient instead:
$\rho = \dfrac{Cov(Y_1, Y_2)}{\sigma_1 \sigma_2}, \qquad -1 \leq \rho \leq 1$
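A simulation sketch in R with a made-up linear dependence, showing `cov()`, the $E[Y_1 Y_2] - \mu_1\mu_2$ shortcut, and `cor()` agreeing (up to sampling error and the sample-variance denominator):

```r
set.seed(1)
y1 <- rnorm(1e5)
y2 <- 2 * y1 + rnorm(1e5)                # y2 depends on y1, so covariance is positive

cov(y1, y2)                              # sample covariance, about 2
mean(y1 * y2) - mean(y1) * mean(y2)      # E[Y1*Y2] - mu1*mu2, essentially the same number
cor(y1, y2)                              # correlation coefficient, between -1 and 1
```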
Random Sampling for any distribution:
If the sample size $n$ is large, the Central Limit Theorem says the sample mean $\bar{Y}$ is approximately normally distributed with mean $\mu$ and standard deviation $\dfrac{\sigma}{\sqrt{n}}$, regardless of the shape of the original distribution.
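A simulation sketch in R: sample means of draws from a very non-normal (exponential) distribution still look normal, as the Central Limit Theorem promises; the sample size and replicate count are made up:

```r
set.seed(42)
n <- 40                                              # size of each sample
means <- replicate(10000, mean(rexp(n, rate = 1)))   # 10,000 sample means

mean(means)   # about 1, the exponential's mean mu
sd(means)     # about 1 / sqrt(n), i.e. sigma / sqrt(n)
hist(means, breaks = 50, main = "Sample means are approximately normal")
```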