Standard Deviation - Machine Learning

Variance serves as a measure of the dispersion or spread of data within a dataset. It quantifies the average of the squared differences between each data point and the mean of the dataset. Squaring the differences makes every deviation positive and gives larger deviations proportionally more weight, so the variance captures the overall variability in the data.

To calculate the variance, one sums up the squared differences between each data point and the mean, then divides the sum by the total number of data points. Dividing by the number of data points gives the population variance; the sample variance divides by one less than that number. The result is the average squared distance of the data points from the mean, which lets analysts assess how widely the data is spread.
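The calculation described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the sample values are made up, and `population_variance` is a hypothetical helper name rather than a NumPy function. It divides by the number of data points, which matches the default behaviour of `numpy.var`.

```python
import numpy as np


def population_variance(values):
    """Average of the squared differences from the mean (divides by n)."""
    values = np.asarray(values, dtype=float)
    center = values.mean()
    squared_diffs = (values - center) ** 2
    return squared_diffs.sum() / len(values)


# Made-up sample values, chosen only for illustration
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(sample))  # 4.0
print(np.var(sample))               # 4.0
```

The mean of the sample is 5, the squared differences sum to 32, and 32 divided by 8 data points gives 4.0.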


Python Standard Deviation

The standard deviation, often abbreviated as "std", is the square root of the variance, so it is expressed in the same units as the data. It provides a concise numerical description of how spread out the values in a dataset are around the mean. As an intuitive and widely used measure, the standard deviation conveys how far data points typically deviate from the mean, thereby revealing the distribution's overall variability.
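The link between the two measures can be checked directly. The sketch below uses made-up sample values and assumes the default settings of `numpy.var` and `numpy.std`, which divide by the number of data points; passing `ddof=1` switches to the sample version that divides by one less.

```python
import numpy as np

# Made-up sample values, chosen only for illustration
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

variance = np.var(sample)
std_dev = np.std(sample)

print(variance)           # 4.0
print(std_dev)            # 2.0
print(np.sqrt(variance))  # 2.0

# ddof=1 divides by n - 1 instead of n (the sample standard deviation)
sample_std = np.std(sample, ddof=1)
print(sample_std > std_dev)  # True
```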

How to find the Standard Deviation of a dataset

from numpy.random import randn
import numpy as np

# draw 1000 samples from the standard normal distribution
data = randn(1000)
# calling np.mean/np.std avoids overwriting the functions with their results
mean, std = np.mean(data), np.std(data)
print(" Mean of Dataset : " , mean)
print(" Standard Deviation : " , std)

Mean of Dataset : 0.06621379297509393
Standard Deviation : 1.0152439350441724

The standard deviation can be used to identify outliers in your data. If a value lies more than a chosen number of standard deviations away from the mean (3 is a common choice), that data point is flagged as an outlier.

A low standard deviation implies that the majority of data points cluster closely around the mean, signifying a more concentrated distribution. Conversely, a high standard deviation indicates that the data points are more dispersed and span a wider range from the mean, characterizing a more scattered distribution.
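The contrast between a low and a high standard deviation can be seen by comparing two datasets that share a mean but differ in spread. The following is a rough sketch: the seed, sizes, and scale values are assumptions picked for illustration, and the names `tight` and `wide` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Two made-up datasets with the same centre but different spread
tight = rng.normal(loc=50, scale=1, size=1000)
wide = rng.normal(loc=50, scale=10, size=1000)

for name, values in (("tight", tight), ("wide", wide)):
    print(name, "mean:", round(values.mean(), 2), "std:", round(values.std(), 2))
```

Both means should come out close to 50, while the standard deviation of `wide` should be roughly ten times that of `tight`.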

How to find outliers?

Z-scores can quantify the unusualness of an observation when your data follow the normal distribution. A z-score is the number of standard deviations an observation lies from the mean. The following code calculates the cut-off for identifying outliers as more than 3 standard deviations from the mean.

cutOff = std * 3
lowerVal, upperVal = mean - cutOff, mean + cutOff
print(lowerVal, " , " , upperVal)

-2.9836523600783376 , 3.0015065497718862

The next step is to identify the outliers, that is, the values that fall outside the lower and upper limits.

outliers = [x for x in data if x < lowerVal or x > upperVal]
print('Identified outliers: %d' % len(outliers))
print(outliers)

Identified outliers: 3
[-3.1031745665407713, -3.416992780294376, -3.673596518800302]

Alternatively, you can keep only the values that lie within the defined limits, which filters the outliers out of the sample.

outliers_removed = [x for x in data if x > lowerVal and x < upperVal]
print('Non-outlier observations: %d' % len(outliers_removed))

Non-outlier observations: 997

Full Source

from numpy.random import randn
import numpy as np

# 10000 samples with mean 50 and standard deviation 5
data = 5 * randn(10000) + 50
mean, std = np.mean(data), np.std(data)
# outliers lie more than 3 standard deviations from the mean
cutOff = std * 3
lowerVal, upperVal = mean - cutOff, mean + cutOff
outliers = [x for x in data if x < lowerVal or x > upperVal]
print('Identified outliers: %d' % len(outliers))
outliers_removed = [x for x in data if x > lowerVal and x < upperVal]
print('Non-outlier observations: %d' % len(outliers_removed))
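The same 3-standard-deviation rule can also be written in terms of z-scores and a NumPy boolean mask instead of list comprehensions. This is one possible sketch rather than the only way to do it: the fixed seed and the names `z_scores` and `is_outlier` are assumptions introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed; an assumption for repeatability
data = 5 * rng.standard_normal(10000) + 50

# z-score: how many standard deviations each value lies from the mean
z_scores = (data - data.mean()) / data.std()

# Boolean mask marking values more than 3 standard deviations away
is_outlier = np.abs(z_scores) > 3

outliers = data[is_outlier]
outliers_removed = data[~is_outlier]

print("Identified outliers: %d" % len(outliers))
print("Non-outlier observations: %d" % len(outliers_removed))
```

Because the mask and its negation are complementary, every value lands in exactly one of the two arrays, and the comparison runs over the whole array at once rather than element by element.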

Conclusion

Both the variance and the standard deviation offer valuable insights into the dispersion and variability present in the dataset, enabling researchers, analysts, and decision-makers to comprehend the extent of spread within the data. They continue to be vital tools in statistical analysis, enriching our understanding of datasets and paving the way for informed interpretations and conclusions.