Science Library - free educational site

Probability distributions

Data

Scientists and researchers in fields as diverse as sociology, history, and economics all have at least one thing in common: data.

An empirical science, such as physics or chemistry, can run experiments and make observations to test hypotheses. That is because it often deals directly with natural phenomena, which can be isolated in a laboratory.

However, other fields do not always have such a convenient means to test theories. It is not that easy to put a group of consumers in a laboratory and 'experiment' on their economic behaviour. Where possible, this is done, but in most cases the best means of testing ideas is to collect data and do one form or another of statistical analysis.

Statistics look very precise, with their neat tables of figures quoted to many decimal places, but behind them there are inevitably assumptions. There are also many things that statistics cannot do: predict earthquakes, explain 'black swan' phenomena, or create failsafe business models.

Furthermore, the same set of data can be processed and presented to produce seemingly contradictory results. There is always selectivity, weighting of importance, and even deliberate manipulation (aka politics).

Modelling

Very large sets of data are often reduced to a 'model', which attempts to make the vast diversity of information comprehensible, while developing a means of understanding a system well enough to be able to predict future events. Climate science is a notoriously complex field which uses sophisticated modelling techniques to try to make sense of what is happening to our planet before it is too late. The interpretation of these models is not always scientific in nature.

Data Types

Data can be of two basic types:

  • Qualitative: data expressed in words, also called categorical data. Questionnaires often rely on qualitative questions, such as: 'How would you rank the quality of service?' Qualitative data needs to be interpreted, and its conversion to quantitative data (resulting in statements like '48% of people are satisfied with the service') can misrepresent the true situation.

  • Quantitative: data related to counting and measurement, and therefore involving numbers. 'How many times do you use the service in a week?' can result in a more usable statistic than qualitative interpretations. Statistical variables from quantitative data can be of two types: discrete or continuous.

Discrete Random Variables

[Figure: bar chart of discrete variable frequencies]

A discrete variable can take only distinct, separate values (typically counts), whereas a continuous variable can take any value within a range. For example, the number of cars of each colour which pass the school gate is a discrete variable, and the speed of those cars is a continuous variable.

A sample is a subset of a population, and a random sample has the characteristics that: (i) all elements have an equal chance of being selected, (ii) the sample is broadly representative of the population.
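Property (i) can be illustrated with a short Python sketch. The population of 100 numbered elements and the sample size of 10 are assumptions made purely for illustration; `random.sample` draws without replacement, giving every element the same chance of selection.

```python
import random

# Hypothetical population: 100 elements labelled 1 to 100.
population = list(range(1, 101))

# A seeded generator makes the draw repeatable; the seed value is arbitrary.
rng = random.Random(42)

# Simple random sample of 10 elements, drawn without replacement.
sample = rng.sample(population, 10)
print(sample)
```

Equal selection chances do not guarantee that one particular sample is representative; larger samples tend to resemble the population more closely.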

The probability distribution function (PDF) of a discrete variable $X$ has the properties:

$0≤f(x)≤1$ and $∑f(x)=1$
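As a minimal sketch, the two properties can be checked in Python. The fair six-sided die is an assumed example, and `is_valid_pmf` is a hypothetical helper name, not a library function.

```python
from fractions import Fraction

# Assumed example: f(x) for the score on a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def is_valid_pmf(f):
    """Return True if every probability lies in [0, 1] and they sum to 1."""
    return all(0 <= p <= 1 for p in f.values()) and sum(f.values()) == 1

print(is_valid_pmf(pmf))  # True
```

Exact fractions are used so that the sum is exactly 1, with no floating-point rounding.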

Mean and Expected Value

For a random variable $X$, the mean $μ$ (or expected value $E(X)$) is given by:

$$μ = ∑xP(X=x)$$
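The formula can be sketched in Python as follows. The fair die is again an assumed example, and `expected_value` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumed example: X is the score on a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expected_value(f):
    """E(X) = sum of x * P(X = x) over all values x."""
    return sum(x * p for x, p in f.items())

mu = expected_value(pmf)
print(mu)  # 7/2
```

The expected value 3.5 is not itself a possible score: it is the long-run average over many throws.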

The mean or expected value is a measure of central tendency. For a data set in which the value $x_i$ occurs with frequency $f_i$, it is $μ = {∑↙{i=1}↖{k} f_ix_i}/n$

where $∑↙{i=1}↖{k} f_ix_i$ is the sum of the data values, and $n = ∑↙{i=1}↖{k} f_i$ is the number of data values in the population.

The median is the middle value when the data are arranged in order of size. With an even number of data values, the median is the average of the middle two values.

e.g. the data set 2, 3, 3, 4, 5, 5, 7, 8, 9:

Mean = $μ = {∑↙{i=1}↖{k} f_ix_i}/n = {2+3+3+4+5+5+7+8+9}/9 = 46/9 ≈ 5.11$

Median = middle (fifth) value of the ordered data: 2, 3, 3, 4, 5, 5, 7, 8, 9. The median value is 5.

Mode: there are two modes, 3 and 5. So this dataset is bimodal.

The mode of a PDF is the value of $x$ for which the PDF is maximum. There may be more than one mode for a function.
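The three measures for the data set 2, 3, 3, 4, 5, 5, 7, 8, 9 can be checked with Python's standard `statistics` module. This is only a sketch; `multimode` requires Python 3.8 or later.

```python
import statistics

data = [2, 3, 3, 4, 5, 5, 7, 8, 9]

mean = statistics.mean(data)        # 46 / 9
median = statistics.median(data)    # middle (5th) of the 9 ordered values
modes = statistics.multimode(data)  # every value with the highest frequency

print(round(mean, 2), median, modes)  # 5.11 5 [3, 5]
```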

Variance and standard deviation

The variance: $σ^2 = ∑(x-μ)^2P(X=x)$.

This could also be written as: $σ^2 = Var(X) = E(X^2) - (E(X))^2 = ∑x^2P(X=x) - μ^2$

The variance is a measure of how much the individual values vary from the mean of the data set. It is the mean of the squares of the differences of the values from the mean.

$$σ^2 = {∑↙{i=1}↖{k} f_i(x_i - μ)^2}/n = {∑↙{i=1}↖{k} f_ix_i^2}/n - μ^2$$

where $n = ∑↙{i=1}↖{k} f_i$

The standard deviation, $σ$, is the square root of the variance.

In terms of probability, the variance of the random variable $X$ is $σ^2 = ∑(x-μ)^2P(X=x)$

$σ^2 = Var(X) = E(X^2) - (E(X))^2$, where $E(X^2) = ∑x^2P(X=x)$

The standard deviation (sd) of $X$ is $σ = √{Var(X)}$
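As a sketch, both forms of the population variance can be computed in Python. The data set 2, 3, 3, 4, 5, 5, 7, 8, 9 is reused here as an assumed example and treated as a whole population (divisor $n$).

```python
import math

data = [2, 3, 3, 4, 5, 5, 7, 8, 9]
n = len(data)
mu = sum(data) / n

# Definition: mean of the squared deviations from the mean.
var_definition = sum((x - mu) ** 2 for x in data) / n

# Shortcut: mean of the squares minus the square of the mean.
var_shortcut = sum(x * x for x in data) / n - mu ** 2

sigma = math.sqrt(var_definition)
print(round(var_definition, 2), round(sigma, 2))  # 5.21 2.28
```

The two forms agree up to floating-point rounding. The standard library's `statistics.pvariance` and `statistics.pstdev` return the same population quantities.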

Binomial Distribution

$P(X=r)=(\table n;r) p^rq^{n-r}$, where $X ∼ B(n,p)$, $r= 0, 1,..., n$, and $q=1-p$.
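The binomial formula translates directly into Python using `math.comb` (Python 3.8 or later). The function name `binomial_pmf` and the coin-tossing figures are assumptions chosen for illustration.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p): C(n, r) * p^r * q^(n - r), with q = 1 - p."""
    q = 1 - p
    return comb(n, r) * p**r * q**(n - r)

# Assumed example: number of heads in 10 tosses of a fair coin.
n, p = 10, 0.5
print(binomial_pmf(3, n, p))  # 0.1171875

# The probabilities over r = 0, 1, ..., n should sum to 1.
total = sum(binomial_pmf(r, n, p) for r in range(n + 1))
```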

Normal Distribution

If $X ∼ N(μ, σ^2)$ and $Z={X-μ}/{σ}$, then $Z∼N(0,1)$.
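Standardisation can be checked numerically with `statistics.NormalDist` from the Python standard library. The values of $μ$, $σ$ and $x$ below are assumptions chosen for illustration. Note that `NormalDist` takes the standard deviation, not the variance, as its second argument.

```python
from statistics import NormalDist

# Assumed example values: X ~ N(mu, sigma^2) with mu = 100 and sigma = 15.
mu, sigma = 100, 15
x = 130

z = (x - mu) / sigma  # standardised value

# P(X <= x) computed directly, and via the standard normal Z ~ N(0, 1).
p_direct = NormalDist(mu, sigma).cdf(x)
p_standard = NormalDist(0, 1).cdf(z)

print(z)                     # 2.0
print(round(p_standard, 4))  # 0.9772
```

Both probabilities agree, which is why a single table of the standard normal distribution serves for every $μ$ and $σ$.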

Notation

$p_x$ = probability distribution P(X = x) of the discrete random variable X

f(x) = probability density function of the continuous random variable X

E(X) = the expected value of the random variable X

Var(X) = the variance of the random variable X

$μ$ = population mean

$σ$ = population standard deviation

$σ^2$ = population variance = $ {∑↙{i=1}↖k f_i(x_i - μ)^2}/n$, where $n = ∑↙{i=1}↖k f_i$

$x↖{‾}$ = sample mean

Bayes' Theorem

$P(B|A)={P(B∩A)}/{P(A)} = {P(B)×P(A|B)}/{P(B)×P(A|B)+P(B')×P(A|B')}$
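A worked example of the theorem, sketched in Python. All of the probabilities are hypothetical: $B$ is 'has a condition' with $P(B) = 0.01$, and $A$ is 'tests positive', with $P(A|B) = 0.95$ and $P(A|B') = 0.05$. The function name `bayes` is also an assumption.

```python
def bayes(p_b, p_a_given_b, p_a_given_not_b):
    """P(B|A) = P(B)P(A|B) / (P(B)P(A|B) + P(B')P(A|B'))."""
    numerator = p_b * p_a_given_b
    denominator = numerator + (1 - p_b) * p_a_given_not_b
    return numerator / denominator

# Hypothetical figures: 1% prevalence, 95% true-positive rate,
# 5% false-positive rate.
p_b_given_a = bayes(0.01, 0.95, 0.05)
print(round(p_b_given_a, 3))  # 0.161
```

With these assumed figures, only about 16% of positive results belong to people who actually have the condition, because false positives from the much larger group $B'$ outnumber the true positives.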

Poisson Distribution

$X ∼ Po(m)$ has PDF given by $f(x)=P(X=x)={e^{-m}m^x}/{x!}$

where $x=0, 1, 2, ...$

If $X ∼ Po(m)$ then $E(X) = Var(X)=m$
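The Poisson formula, and the result that the mean and variance both equal $m$, can be checked numerically. In this sketch the value $m = 3$ is an assumption, `poisson_pmf` is a hypothetical helper name, and the infinite sums are truncated at $x = 59$, beyond which the terms are negligible for this $m$.

```python
from math import exp, factorial

def poisson_pmf(x, m):
    """P(X = x) for X ~ Po(m): e^(-m) * m^x / x!"""
    return exp(-m) * m**x / factorial(x)

m = 3.0         # assumed parameter value
xs = range(60)  # truncation of x = 0, 1, 2, ...

mean = sum(x * poisson_pmf(x, m) for x in xs)
variance = sum(x * x * poisson_pmf(x, m) for x in xs) - mean**2

print(round(mean, 6), round(variance, 6))  # 3.0 3.0
```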

Content © Andrew Bone. All rights reserved. Created : December 31, 2014
