Standard Deviation in Python
The standard deviation measures how much the values of a set deviate (or spread out) from their mean (or average).
Suppose we have the following set of values:
| 100 |
| 50 |
| 30 |
| 65 |
| 76 |
| 93 |
| 53 |
| 28 |
| 61 |
| 7 |
The idea is to compare each value with the mean (56.3), for example by subtracting the mean from the value.
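import numpy as np

# The ten values from the table above, stored as a single column
values = [[100], [50], [30], [65], [76], [93], [53], [28], [61], [7]]
# Subtract the mean (56.3) from every value
distances = np.array(values) - np.mean(values)
print(f"Distances:\n{distances}")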
Distances:
[[ 43.7]
 [ -6.3]
 [-26.3]
 [  8.7]
 [ 19.7]
 [ 36.7]
 [ -3.3]
 [-28.3]
 [  4.7]
 [-49.3]]
Note that these distances can be negative, so, as in our discussion of linear regression and the least-squares method, we can square them to obtain only non-negative values.
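A quick check makes the need for squaring concrete: the signed distances from the mean cancel each other out, so their plain sum says nothing about the spread. The sketch below uses only the standard library; the names `distances` and `squared` are illustrative choices, not part of any library.

```python
values = [100, 50, 30, 65, 76, 93, 53, 28, 61, 7]
mean = sum(values) / len(values)  # 56.3

# Signed distance of each value from the mean
distances = [x - mean for x in values]
# Squaring removes the sign, so nothing can cancel any more
squared = [d ** 2 for d in distances]

# The signed distances cancel (up to floating-point rounding) ...
print(f"Sum of distances: {sum(distances):.6f}")
# ... while the squared distances are all non-negative and accumulate.
print(f"Sum of squared distances: {sum(squared):.2f}")
```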
To summarize the dispersion in a single number, we could add up the squared distances. Finally, to get back to the original units, we take the square root of the sum.
OBSERVATION: as the number of values increases, this quantity keeps growing, regardless of how dispersed the values are, because every additional value adds a non-negative term to the sum. Hence, we normalize the sum by dividing it by the number of values, N, before taking the square root.
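The effect can be seen by repeating the same ten values several times: the spread of the data does not change, yet the unnormalized quantity keeps growing. This is a minimal sketch using only the standard library; the helper names `root_sum_of_squares` and `normalized` are hypothetical and chosen for this illustration.

```python
import math

values = [100, 50, 30, 65, 76, 93, 53, 28, 61, 7]


def root_sum_of_squares(data):
    """Square root of the summed squared distances from the mean (no normalization)."""
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data))


def normalized(data):
    """Same quantity, but the sum is divided by the number of values first."""
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))


for copies in (1, 10, 100):
    data = values * copies  # same spread, only more values
    print(
        f"N={len(data):5d}  "
        f"unnormalized={root_sum_of_squares(data):8.2f}  "
        f"normalized={normalized(data):6.2f}"
    )
```

The unnormalized column grows with N while the normalized column stays the same, which is the behaviour the normalization is meant to achieve.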
The standard deviation is then calculated using the formula
\begin{equation*} \sigma = \sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N}}. \end{equation*}
Calculation in Python
print(f"Values: \n {values}")
Values: [[100], [50], [30], [65], [76], [93], [53], [28], [61], [7]]
import numpy as np
print(f"Numpy SD: {np.array(values).std()}")
Numpy SD: 27.777868888739466
import pandas as pd
print(f"Pandas SD: {pd.DataFrame(values).std()}")
Pandas SD: 0    29.280445
dtype: float64
import statistics
import numpy as np

# statistics.stdev expects a flat sequence of numbers
values = np.array(values).flatten().astype(float)
print(f"Statistics SD: {statistics.stdev(values)}")
Statistics SD: 29.28044474464902
Why do the results differ?
There are in fact two kinds of standard deviation: population and sample. The population standard deviation applies when the data set is the entire population, while the sample standard deviation applies when the data are only a subset (a sample) of the population.
Their formulas differ only in the denominator:
\begin{align*} \sigma_p & = \sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N}}, & \sigma_s & = \sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N - 1}}. \end{align*}
For large amounts of data, their numerical difference shrinks.
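Both formulas can be evaluated by hand to see where the two different library results come from, and how the gap closes as N grows. This is a minimal sketch using only the standard library; the helper names `population_sd` and `sample_sd` and the synthetic Gaussian data (mean 50, spread 30, fixed seed) are assumptions made purely for illustration.

```python
import math
import random


def population_sd(data):
    """Standard deviation with N in the denominator."""
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))


def sample_sd(data):
    """Standard deviation with N - 1 in the denominator."""
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (len(data) - 1))


values = [100, 50, 30, 65, 76, 93, 53, 28, 61, 7]
print(f"Population SD: {population_sd(values)}")
print(f"Sample SD:     {sample_sd(values)}")

# Illustrative synthetic data: the gap between the two results shrinks with N
random.seed(0)
gaps = []
for n in (10, 1_000, 100_000):
    data = [random.gauss(50, 30) for _ in range(n)]
    gap = sample_sd(data) - population_sd(data)
    gaps.append(gap)
    print(f"N={n:7d}  sample - population = {gap:.6f}")
```

Since the two formulas share the same numerator, their ratio is exactly \(\sqrt{N/(N-1)}\), which tends to 1 as N grows.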
From the above results we infer that by default numpy computes the population standard deviation, while pandas and statistics compute the sample standard deviation.
| | numpy | pandas | statistics |
|---|---|---|---|
| \(\sigma_p\) | .std() | .std(ddof=0) | pstdev() |
| \(\sigma_s\) | .std(ddof=1) | .std() | stdev() |
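The calls for each variant can be checked against one another on the ten values used throughout. The sketch below assumes numpy and pandas are installed; the `ddof` argument ("delta degrees of freedom") is subtracted from N in the denominator, so `ddof=0` gives the population value and `ddof=1` the sample value. The dictionary names `population` and `sample` are illustrative.

```python
import statistics

import numpy as np
import pandas as pd

values = [100, 50, 30, 65, 76, 93, 53, 28, 61, 7]
array = np.array(values)
series = pd.Series(values)

# Population standard deviation (denominator N)
population = {
    "numpy": float(array.std()),          # ddof defaults to 0
    "pandas": float(series.std(ddof=0)),  # override the pandas default
    "statistics": statistics.pstdev(values),
}

# Sample standard deviation (denominator N - 1)
sample = {
    "numpy": float(array.std(ddof=1)),    # override the numpy default
    "pandas": float(series.std()),        # ddof defaults to 1
    "statistics": statistics.stdev(values),
}

print(f"Population: {population}")
print(f"Sample:     {sample}")
```

Passing `ddof` explicitly, even when it matches the default, is a cheap way to make the intended variant visible to the next reader.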
Reflection
It is not enough to know the code to get a result. It is important to grasp how the code works.