My notes of the Udacity class - Intro to Descriptive Statistics

I just completed Udacity's online class “Intro to Descriptive Statistics”. Overall, the class is well polished and focuses on key concepts. IMO, it is a good first step if you want to refresh on the basics of Descriptive Statistics. There are also a lot of quizzes to test your understanding of the concepts taught, and how they apply to real-world problems.

Below are my notes from the class.

0. Some definitions

  • Expected Value of a random variable (aka the mean): E[X] = Σᵢ xᵢ P(xᵢ)

where P(xᵢ) is the probability that the variable takes the value xᵢ.

  • Variance is a measure of dispersion: σ² = E[(X − μ)²] (std: σ = √σ²)
  • Covariance: relationship (linear) between 2 random variables: Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
  • Correlation (coefficient) is unit-less and between -1 and 1: ρ = Cov(X, Y) / (σ_X σ_Y)
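These definitions are easy to check with a short computation. A minimal sketch, using made-up values and probabilities chosen purely for illustration:

```python
import math

# Illustrative discrete random variable: values and their probabilities
# (the numbers are made up purely for this example).
values = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]

# Expected value: E[X] = sum of x_i * P(x_i)
mean = sum(x * p for x, p in zip(values, probs))

# Variance: E[(X - mu)^2]; the standard deviation is its square root.
var = sum(p * (x - mean) ** 2 for x, p in zip(values, probs))
std = math.sqrt(var)

# Covariance and correlation of two paired samples (ys = 2 * xs,
# a perfect linear relationship, so the correlation should be 1).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / len(xs))
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / len(ys))
corr = cov / (sx * sy)

print(mean, var, std, cov, corr)
```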

1. Population and Sample


The standard deviation of a population is given by:

    σ = √( Σᵢ (xᵢ − μ)² / N )

where N is the size of the population.

If we treat a set of data as a sample randomly picked from a population, then the standard deviation of that sample, computed with Bessel’s correction, is:

    s = √( Σᵢ (xᵢ − x̄)² / (n − 1) )

where n is the size of the sample.

We will see later that there is a relationship between σ and the Standard Error (SE = σ/√n): so knowing for example the standard error of a sample enables us to deduce the standard deviation of the population it is drawn from.
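The difference between the two formulas (divide by N vs. divide by n − 1) can be seen in a few lines; the data below is purely illustrative:

```python
import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data
n = len(data)
mu = sum(data) / n

# Population standard deviation: divide by N
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)

# Sample standard deviation with Bessel's correction: divide by n - 1
s = math.sqrt(sum((x - mu) ** 2 for x in data) / (n - 1))

print(sigma, s)  # s is slightly larger than sigma
```

Bessel’s correction always makes the sample estimate a bit larger, compensating for the fact that deviations are measured from the sample mean rather than the (unknown) population mean.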

2. Shape of the distribution


2.1. Normal distribution

In a normal distribution:

  • μ (mean) = median = mode

  • it is symmetric about the mean

In a normal distribution, 68% of the data lies in the range [μ − σ, μ + σ], and 95% of the data lies in the range [μ − 2σ, μ + 2σ].

The bell-shaped curve of the normal distribution is called the “PDF”, short for “Probability Density Function”, where the area under the curve represents probability (a percentile, when measured from the left).
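The 68/95 rule can be checked empirically by drawing random samples; the values of μ and σ below are arbitrary choices for the sketch:

```python
import random

# Empirical check of the 68/95 rule by sampling from a normal
# distribution (mu and sigma are arbitrary illustrative values).
random.seed(0)
mu, sigma = 10.0, 2.0
samples = [random.gauss(mu, sigma) for _ in range(100_000)]

# Fraction of samples within 1 and 2 standard deviations of the mean.
within_1 = sum(mu - sigma <= x <= mu + sigma for x in samples) / len(samples)
within_2 = sum(mu - 2 * sigma <= x <= mu + 2 * sigma for x in samples) / len(samples)

print(within_1, within_2)  # roughly 0.68 and 0.95
```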

2.2. Skewed distribution

Another type of distribution is the skewed distribution. Within that family, there can be distributions skewed to the right, or skewed to the left.

The distribution below is skewed to the right.

3. z-score


For a normal distribution, we define the z-score of a raw score x as the number of standard deviations it lies below or above the population mean: z = (x − μ) / σ.

We can look up the probability using the z-table. The z-table gives the probability of drawing a value less than z standard deviations above the mean.

You might ask: why do we need to define the z-score, yet another statistical parameter?

The z-score is a way to standardize the score so that one can look up the probability in a single table. Imagine that you are dealing with several distributions (different means and standard deviations). In order to work out the probability, you would need a probability table for each distribution. That’s very inconvenient! One option is to standardize all the distribution curves: each distribution is centered at 0 (by subtracting μ), and then divided by σ.
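A minimal sketch of this standardization, using math.erf to reproduce the value a z-table stores; the score, mean, and standard deviation below are hypothetical values chosen for illustration:

```python
import math

def z_score(x, mu, sigma):
    """Number of standard deviations x lies above (or below) the mean."""
    return (x - mu) / sigma

def phi(z):
    """Standard normal CDF: the probability a z-table reports for z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical example: a score of 110 in a distribution with
# mean 100 and standard deviation 10.
z = z_score(110, 100, 10)  # 1.0: one standard deviation above the mean
p = phi(z)                 # about 0.8413, matching the z-table entry for 1.00

print(z, p)
```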

4. Central Limit Theorem


Let’s assume a population with mean μ and standard deviation σ. Let’s say we take a sample from that population and calculate the sample mean. We take another sample of the same size and calculate the new sample mean. If we keep doing that 100+ times, then the distribution of those sample means (the sampling distribution) looks like a normal distribution.

  • The mean of the sampling distribution is about the same as the mean μ of the population.

  • The standard deviation of the sampling distribution is called the Standard Error (SE):

    SE = σ / √n

where n is the sample size.

The theorem still holds true independently of the shape of the population distribution (multi-modal, skewed, …).
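This can be illustrated with a simulation on a deliberately skewed (exponential) population; the rate, sample size, and number of samples below are arbitrary choices for the sketch:

```python
import math
import random

# CLT sketch: draw many samples from a skewed (exponential) population
# and check that the sample means are centred on the population mean
# with spread close to sigma / sqrt(n).
random.seed(1)
lam = 1.0              # exponential rate; its mean and std are both 1/lam
pop_mean = 1.0 / lam
pop_std = 1.0 / lam

n = 25                 # size of each sample
num_samples = 20_000   # how many samples we draw
sample_means = [
    sum(random.expovariate(lam) for _ in range(n)) / n
    for _ in range(num_samples)
]

mean_of_means = sum(sample_means) / num_samples
se = math.sqrt(
    sum((m - mean_of_means) ** 2 for m in sample_means) / num_samples
)

print(mean_of_means)                # close to pop_mean = 1.0
print(se, pop_std / math.sqrt(n))   # both close to 0.2
```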

5. Quiz


This quiz is a good example of how the z-score can be used:

A normally distributed population has a mean μ and standard deviation σ. What is the probability of randomly selecting a sample of size 4 that has a mean greater than 110?

From σ and n = 4, we calculate the standard error: SE = σ/√n = σ/2.

The z-score of that sample mean is z = (110 − μ)/SE, which for the quiz’s numbers works out to z = 1. From the z-table, we find that the probability associated with z-score = 1 is 0.8413.

This probability is for a sample that has a mean smaller than 110 (i.e. it is the area under the curve from −∞ to 110). So the probability that the sample has a mean greater than 110 is: P = 1 - 0.8413 = 0.1587.
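The calculation can be checked numerically. The hypothetical values μ = 100 and σ = 20 below are chosen for illustration; any pair satisfying (110 − μ)/(σ/2) = 1 gives z = 1:

```python
import math

# Hypothetical population parameters chosen so that the z-score is 1;
# any mu, sigma with (110 - mu) / (sigma / 2) = 1 would work equally well.
mu, sigma = 100.0, 20.0
n = 4
x_bar = 110.0

se = sigma / math.sqrt(n)   # standard error of the sample mean
z = (x_bar - mu) / se       # z = 1.0

# Standard normal CDF via erf: probability of a mean *smaller* than x_bar.
p_less = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # about 0.8413
p_greater = 1.0 - p_less                             # about 0.1587

print(z, p_less, p_greater)
```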