Science Library - free educational site

Probability distributions


Scientists and researchers in fields as diverse as sociology, history, and economics all have at least one thing in common: data.

An empirical science, such as physics or chemistry, can run experiments and make observations to test hypotheses. That is because it often deals directly with natural phenomena, which can be isolated in a laboratory.

However, other fields do not always have such a convenient means to test theories. It is not that easy to put a group of consumers in a laboratory and 'experiment' on their economic behaviour. Where possible, this is done, but in most cases the best means of testing ideas is to collect data and do one form or another of statistical analysis.

Statistics look very precise, with their neat tables of multi-decimal place figures, but behind them there are inevitably assumptions. There are many things that statistics cannot do: predict earthquakes, explain 'black swan' phenomena, create failsafe business models...

Furthermore, the same set of data can be processed and presented to produce seemingly contradictory results. There is always selectivity, weighting of importance, and even deliberate manipulation (aka politics).


Very large sets of data are often reduced to a 'model', which attempts to make the vast diversity of information comprehensible, while developing a means of understanding a system well enough to be able to predict future events. Climate science is a notoriously complex field which uses sophisticated modelling techniques to try to make sense of what is happening to our planet before it is too late. The interpretation of these models is not always scientific in nature.

Data Types

Data can be of two basic types:

  • Qualitative: Qualitative data uses words, and can also be called categorical data. Questionnaires often rely on qualitative questions, such as: 'How would you rank the quality of service?' Qualitative data needs to be interpreted, and its conversion to quantitative data (resulting in statements like '48% of people are satisfied with the service') can misrepresent the true situation.

  • Quantitative: Quantitative data relates to counting and measurement, and therefore involves numbers. 'How many times do you use the service in a week?' can result in a more usable statistic than qualitative interpretations. Statistical variables from quantitative data can be of two types: discrete or continuous.

Discrete Random Variables

Figure: Bar chart of discrete variable frequencies

A discrete variable can take only distinct, countable values, whereas a continuous variable can take any value within a range. For example, the number of cars of each colour which pass the school gate is a discrete variable, and the speed of those cars is a continuous variable.

A sample is a subset of a population, and a random sample has the characteristics that: (i) all elements have an equal chance of being selected, (ii) the sample is broadly representative of the population.

The probability distribution function (PDF) of a discrete variable $X$ has the properties:

$0≤f(x)≤1$ and $∑f(x)=1$

Mean and Expected Value

For a random variable $X$, the mean μ (or expected value $E(X)$), is given by:

$$μ = ∑xP(X=x)$$
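As a minimal sketch of the two properties of a discrete distribution and of the expected value formula, the Python below uses a fair six-sided die. The die and the helper names `is_valid_distribution` and `expected_value` are assumptions chosen for illustration only.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, so P(X = x) = 1/6 for x = 1..6.
die = {x: Fraction(1, 6) for x in range(1, 7)}

def is_valid_distribution(dist):
    """Check 0 <= f(x) <= 1 for every x, and that the probabilities sum to 1."""
    return all(0 <= p <= 1 for p in dist.values()) and sum(dist.values()) == 1

def expected_value(dist):
    """Mean of a discrete random variable: the sum of x * P(X = x)."""
    return sum(x * p for x, p in dist.items())

valid = is_valid_distribution(die)
mu = expected_value(die)
print(valid, mu)
```

Exact fractions are used so that the check that the probabilities sum to 1 is not affected by floating-point rounding.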

The mean or expected value is a measure of central tendency. For a data set in which the value $x_i$ occurs with frequency $f_i$, it is $μ = {∑↙{i=1}↖{k} f_ix_i}/n$

where $∑↙{i=1}↖{k} f_ix_i$ is the sum of the data values, and $n = ∑↙{i=1}↖{k} f_i$ is the number of data values in the population.

The median is the middle value when the data are arranged in order of size. With an even number of data values, the median is the average of the middle two values.

e.g. the data set 2, 3, 3, 4, 5, 5, 7, 8, 9:

Mean = $μ = {∑↙{i=1}↖{k} f_ix_i}/n = {2+3+3+4+5+5+7+8+9}/9 = {46}/9 ≈ 5.11$

Median = middle value of the ordered data: 2, 3, 3, 4, 5, 5, 7, 8, 9. With nine values, the median is the fifth value, which is 5.

Mode: there are two modes, 3 and 5. So this dataset is bimodal.

The mode of a PDF is the value of $x$ for which the PDF is maximum. There may be more than one mode for a function.
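The mean, median and modes of the example data set can be checked with Python's standard `statistics` module. This is only a quick sketch; the variable names are arbitrary.

```python
import statistics

# The example data set from the text.
data = [2, 3, 3, 4, 5, 5, 7, 8, 9]

mean = statistics.mean(data)        # sum of the values divided by their count
median = statistics.median(data)    # middle value of the sorted data
modes = statistics.multimode(data)  # every value tied for the highest frequency

print(round(mean, 2), median, modes)  # → 5.11 5 [3, 5]
```

`multimode` (Python 3.8 or later) is used rather than `mode` because this data set is bimodal.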

Variance and standard deviation

The variance : $σ^2 = ∑(x-μ)^2P(X=x)$.

This could also be written as: $σ^2 = Var(X) = E(X^2) - (E(X))^2 = ∑x^2P(X=x) - μ^2$

The variance describes how much the individual values vary from the mean of the data set. It is the mean of the squared differences between the values and the mean.

$$σ^2 = {∑↙{i=1}↖{k} f_i(x_i - μ)^2}/n = {∑↙{i=1}↖{k} f_ix_i^2}/n - μ^2$$

where $n = ∑↙{i=1}↖{k} f_i$

The standard deviation, $σ$, is the square root of the variance.
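The sketch below evaluates the population variance in both of its frequency forms, using the data set 2, 3, 3, 4, 5, 5, 7, 8, 9 from the mean and median example. The variable names are assumptions for illustration.

```python
import math
from collections import Counter

data = [2, 3, 3, 4, 5, 5, 7, 8, 9]
freq = Counter(data)    # maps each distinct value x_i to its frequency f_i
n = sum(freq.values())  # n is the sum of the frequencies

mu = sum(f * x for x, f in freq.items()) / n

# Definition: the mean of the squared deviations from the mean.
var_definition = sum(f * (x - mu) ** 2 for x, f in freq.items()) / n

# Equivalent form: the mean of the squares minus the square of the mean.
var_shortcut = sum(f * x ** 2 for x, f in freq.items()) / n - mu ** 2

sd = math.sqrt(var_definition)
print(round(var_definition, 2), round(var_shortcut, 2), round(sd, 2))
```

Both forms give 422/81, about 5.21, so the standard deviation is about 2.28. The second form is often quicker by hand, but it can lose precision in floating-point arithmetic when the mean is large compared with the spread.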

In terms of probability, the variance of $X$ is $σ^2 = ∑(x-μ)^2P(X=x)$

$σ^2 = Var(X) = E(X^2) - (E(X))^2$ where $E(X^2) = ∑x^2P(X=x)$

The standard deviation (sd) of $X$ is $σ = √{Var(X)}$

Binomial Distribution

$P(X=r)=(\table n;r) p^rq^{n-r}$, where $X ∼ B(n,p)$, $r= 0, 1,..., n$, and $q=1-p$.
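The binomial formula translates directly into code. The sketch below uses `math.comb` for the binomial coefficient. The function name `binomial_pmf` and the coin-toss numbers are assumptions for illustration.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p): comb(n, r) * p^r * q^(n - r), with q = 1 - p."""
    q = 1 - p
    return comb(n, r) * p ** r * q ** (n - r)

# Hypothetical example: exactly 3 heads in 10 tosses of a fair coin.
prob = binomial_pmf(3, 10, 0.5)
print(prob)  # → 0.1171875

# The probabilities over r = 0, 1, ..., n should sum to 1.
total = sum(binomial_pmf(r, 10, 0.5) for r in range(11))
```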

Normal Distribution

If $X ∼ N(μ, σ^2)$ and $Z={X-μ}/{σ}$, then $Z∼N(0,1)$.
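Standardising lets any normal probability be read from the standard normal distribution. The sketch below checks this numerically with `statistics.NormalDist` from the Python standard library. The values μ = 100, σ = 15 and x = 130 are hypothetical.

```python
from statistics import NormalDist

# Hypothetical parameters: X ~ N(100, 15^2). NormalDist takes sigma, not sigma^2.
mu, sigma = 100.0, 15.0
x = 130.0

z = (x - mu) / sigma  # standardised value Z = (X - mu) / sigma

p_direct = NormalDist(mu, sigma).cdf(x)  # P(X <= x) computed from N(mu, sigma^2)
p_standard = NormalDist(0, 1).cdf(z)     # P(Z <= z) computed from N(0, 1)

print(z, round(p_direct, 4), round(p_standard, 4))
```

The two probabilities agree, which is why a single table of the standard normal distribution is enough for any μ and σ.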

$p_x$ = probability distribution P(X = x) of the discrete random variable X

f(x) = probability density function of the continuous random variable X

E(X) = the expected value of the random variable X

Var(X) = the variance of the random variable X

$μ$ = population mean

$σ$ = population standard deviation

$σ^2$ = population variance = $ {∑↙{i=1}↖k f_i(x_i - μ)^2}/n$, where $n = ∑↙{i=1}↖k f_i$

$x↖{‾}$ = sample mean

Bayes' Theorem

$P(B|A)={P(B∩A)}/{P(A)} = {P(B)×P(A|B)}/{P(B)×P(A|B)+P(B')×P(A|B')}$
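A short numerical sketch of Bayes' theorem follows. The three input probabilities are invented purely for illustration, and the function name `bayes_posterior` is an assumption.

```python
def bayes_posterior(p_b, p_a_given_b, p_a_given_not_b):
    """P(B|A) = P(B)P(A|B) / (P(B)P(A|B) + P(B')P(A|B'))."""
    p_not_b = 1 - p_b
    numerator = p_b * p_a_given_b
    # The denominator is P(A), by the law of total probability.
    denominator = numerator + p_not_b * p_a_given_not_b
    return numerator / denominator

# Hypothetical numbers: P(B) = 0.01, P(A|B) = 0.95, P(A|B') = 0.05.
posterior = bayes_posterior(0.01, 0.95, 0.05)
print(round(posterior, 3))  # → 0.161
```

With these made-up inputs, event A raises the probability of B from 1% to only about 16%, because B is rare to begin with.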

Poisson Distribution

$X ∼ Po(m)$ has PDF given by $f(x)=P(X=x)={e^{-m}m^x}/{x!}$

where $x=0, 1, 2, ...$

If $X ∼ Po(m)$ then $E(X) = Var(X)=m$
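The Poisson formula and the property that the mean and variance both equal $m$ can be checked numerically. In the sketch below the value m = 3 is hypothetical, and the infinite sums are truncated at x = 59, where the remaining terms are negligibly small.

```python
from math import exp, factorial

def poisson_pmf(x, m):
    """P(X = x) for X ~ Po(m): e^(-m) * m^x / x!"""
    return exp(-m) * m ** x / factorial(x)

m = 3.0          # hypothetical rate parameter
xs = range(60)   # truncation of x = 0, 1, 2, ...; the tail beyond is negligible

total = sum(poisson_pmf(x, m) for x in xs)
mean = sum(x * poisson_pmf(x, m) for x in xs)
variance = sum(x ** 2 * poisson_pmf(x, m) for x in xs) - mean ** 2

print(round(total, 6), round(mean, 6), round(variance, 6))
```

The variance is computed as $E(X^2) - (E(X))^2$, the same identity given in the variance section.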

Content © Andrew Bone. All rights reserved. Created : December 31, 2014
