Standard Deviation in Python

The standard deviation measures how much the values of a set deviate (or are dispersed) from their mean (or average).

Consider the following set of values:

100
50
30
65
76
93
53
28
61
7

The idea is to compare each value with the mean (56.3), for example by subtracting the mean from the value.

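import numpy as np
# Each value sits in its own row, which gives the column layout printed below
values = [[100], [50], [30], [65], [76], [93], [53], [28], [61], [7]]
data = np.array(values)
distances = data - data.mean()
print(f"Distances:\n{distances}")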

Distances:
[[ 43.7]
 [ -6.3]
 [-26.3]
 [  8.7]
 [ 19.7]
 [ 36.7]
 [ -3.3]
 [-28.3]
 [  4.7]
 [-49.3]]

Note that these distances can be negative. As in our discussion of linear regression and the least squares method, we can square them so that no contribution is negative.

To summarize the dispersion in a single number, we could add up the squared distances. Finally, to recover the original units, we take the square root of the sum.

OBSERVATION: as the number of values considered increases, this quantity keeps growing, regardless of how dispersed the values are. Hence, we normalize the sum by the number of values.
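The observation can be checked with a minimal sketch. It assumes the ten values listed earlier, and `root_sum_of_squares` is a helper name introduced here purely for illustration. Repeating the data leaves its spread unchanged, yet the unnormalized quantity keeps growing with the number of values.

```python
import numpy as np

values = np.array([100, 50, 30, 65, 76, 93, 53, 28, 61, 7], dtype=float)

def root_sum_of_squares(x):
    """Square root of the summed squared distances to the mean (no normalization)."""
    return np.sqrt(np.sum((x - x.mean()) ** 2))

# Repeating the data keeps its spread the same but increases the number of values.
for repeats in (1, 4, 100):
    repeated = np.tile(values, repeats)
    print(f"N = {repeated.size:4d}: unnormalized = {root_sum_of_squares(repeated):8.2f}, "
          f"normalized = {repeated.std():.2f}")
```

Quadrupling the number of values doubles the unnormalized quantity, while the normalized one stays the same.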

The standard deviation is then calculated using the formula

\begin{equation*} \sigma = \sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N}}. \end{equation*}
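As a minimal sketch, the formula can be evaluated step by step with `numpy`. The variable names below are illustrative choices, and the ten values listed earlier are assumed as input.

```python
import numpy as np

values = np.array([100, 50, 30, 65, 76, 93, 53, 28, 61, 7], dtype=float)

n = values.size                          # N, the number of values
mean = values.mean()                     # x-bar, the mean of the values
squared_distances = (values - mean) ** 2
sigma = np.sqrt(squared_distances.sum() / n)

print(f"Manual SD: {sigma}")
```

The result should agree with the normalized value that `values.std()` returns by default.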

Calculation in `python`

print(f"Values: \n {values}")

Values: 
 [[100], [50], [30], [65], [76], [93], [53], [28], [61], [7]]

import numpy as np
print(f"Numpy SD: {np.array(values).std()}")

Numpy SD: 27.777868888739466

import pandas as pd
print(f"Pandas SD: {pd.DataFrame(values).std()}")

Pandas SD: 0    29.280445
dtype: float64

import statistics
import numpy as np
# Flatten the column of values into a 1-D array of floats
values = np.array(values).flatten().astype(float)
print(f"Statistics SD: {statistics.stdev(values)}")

Statistics SD: 29.28044474464902

Why do the results differ?

There are in fact two kinds of standard deviation: population and sample. The population standard deviation applies when the data cover the entire population, while the sample standard deviation is used when the data are only a sample drawn from a larger population.

Their formulas differ only in the denominator:

\begin{align*} \sigma_p & = \sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N}}, & \sigma_s & = \sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N - 1}}. \end{align*}

For large amounts of data, the numerical difference between the two shrinks, because N and N - 1 become nearly equal.
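The shrinking difference can be illustrated with a short sketch. It assumes the ten values from the beginning of the text, and `sd_gap` is a hypothetical helper introduced here. The data set is enlarged by repetition, so that only N changes and not the spread.

```python
import numpy as np

values = np.array([100, 50, 30, 65, 76, 93, 53, 28, 61, 7], dtype=float)

def sd_gap(repeats):
    """Difference between sample and population SD for the data repeated `repeats` times."""
    data = np.tile(values, repeats)
    sigma_p = data.std(ddof=0)   # denominator N
    sigma_s = data.std(ddof=1)   # denominator N - 1
    return sigma_s - sigma_p

for repeats in (1, 10, 1000):
    print(f"N = {10 * repeats:5d}: sigma_s - sigma_p = {sd_gap(repeats):.6f}")
```

The gap is always positive, since dividing by N - 1 yields the larger value, but it tends to zero as N grows.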

From the above results we infer that by default numpy computes the population standard deviation, while pandas and statistics compute the sample standard deviation.

Table 1: Methods used to compute both standard deviations in the three `python` libraries.
	`numpy`	`pandas`	`statistics`
\(\sigma_p\)	`.std()`	`.std(ddof=0)`	`pstdev()`
\(\sigma_s\)	`.std(ddof=1)`	`.std()`	`stdev()`
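As a quick consistency check, the sketch below calls each library once for the population and once for the sample standard deviation. It uses `ddof=1` to request the sample version from `numpy` and `ddof=0` to request the population version from `pandas`. The dictionary names are illustrative, and a flat list of the ten values is assumed.

```python
import statistics

import numpy as np
import pandas as pd

values = [100, 50, 30, 65, 76, 93, 53, 28, 61, 7]

# Population standard deviation (denominator N)
population = {
    "numpy": np.array(values).std(),
    "pandas": pd.Series(values).std(ddof=0),
    "statistics": statistics.pstdev(values),
}

# Sample standard deviation (denominator N - 1)
sample = {
    "numpy": np.array(values).std(ddof=1),
    "pandas": pd.Series(values).std(),
    "statistics": statistics.stdev(values),
}

for name in population:
    print(f"{name:>10}: population = {population[name]:.6f}, sample = {sample[name]:.6f}")
```

Within each dictionary, all three libraries should agree up to floating-point rounding.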

Reflection

It is not enough to know the code to get a result. It is important to grasp how the code works.
