Standard Deviation - Machine Learning

Variance and standard deviation (std) both measure how spread out the values in a dataset are. The mean is the sum of all the values divided by the number of values. Variance is the average of the squared differences from the mean, and the standard deviation is the square root of the variance.
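As a worked example of these definitions, the sketch below computes the mean, variance, and standard deviation by hand for a small made-up list of values (the numbers are illustrative and not taken from any real dataset) and cross-checks the result against Python's statistics module. It divides by the number of values, which is the population form and also what NumPy's std does by default.

```python
import math
import statistics

# Hypothetical sample values, chosen so the arithmetic comes out evenly
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Mean: add all the values together and divide by how many there are
mean_value = sum(values) / len(values)

# Variance: the average of the squared differences from the mean
squared_diffs = [(x - mean_value) ** 2 for x in values]
variance = sum(squared_diffs) / len(values)

# Standard deviation: the square root of the variance
std_dev = math.sqrt(variance)

print("Mean:", mean_value)               # 5.0
print("Variance:", variance)             # 4.0
print("Standard deviation:", std_dev)    # 2.0

# Cross-check against the standard library's population standard deviation
assert math.isclose(std_dev, statistics.pstdev(values))
```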
Python Standard Deviation
Standard deviation is a number that describes how spread out the values in a dataset are. A low standard deviation means that most of the values are close to the mean; a high standard deviation means that the values are spread out over a wider range.
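To make the contrast concrete, here is a small sketch with two made-up lists that share the same mean but differ in spread. The names tight and wide and their values are purely illustrative.

```python
import numpy as np

# Two hypothetical datasets with the same mean of 50
tight = np.array([48, 49, 50, 51, 52])   # values cluster near the mean
wide = np.array([10, 30, 50, 70, 90])    # values spread over a wider range

print("Means:", tight.mean(), wide.mean())
print("Standard deviations:", tight.std(), wide.std())
```

Both lists average 50, but the second has a much larger standard deviation (roughly 28.3 against roughly 1.4) because its values sit further from the mean.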

How to find the Standard Deviation of a dataset

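from numpy.random import randn
from numpy import mean
from numpy import std

# Draw 1000 samples from the standard normal distribution
data = randn(1000)
# Note: this rebinds the names mean and std from the NumPy functions to the computed values
mean, std = mean(data), std(data)
print(" Mean of Dataset : " , mean)
print(" Standard Deviation : " , std)

Mean of Dataset : 0.06621379297509393
Standard Deviation : 1.0152439350441724
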
The standard deviation can be used to identify outliers in your data. If a value lies more than a chosen number of standard deviations away from the mean, that data point is flagged as an outlier.
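As a rough illustration of that rule, the sketch below measures how far a single value lies from the mean in units of standard deviation. The mean of 50, the standard deviation of 5, the threshold of 3, and the helper name deviations_from_mean are all assumptions made for the example.

```python
def deviations_from_mean(value, mean_value, std_dev):
    """Return how many standard deviations `value` lies from the mean."""
    return abs(value - mean_value) / std_dev


# Hypothetical summary statistics and threshold
mean_value, std_dev, threshold = 50.0, 5.0, 3.0

for value in (52.0, 68.0):
    distance = deviations_from_mean(value, mean_value, std_dev)
    label = "outlier" if distance > threshold else "not an outlier"
    print(value, "is", distance, "standard deviations away:", label)
```

With these numbers, 52 lies 0.4 standard deviations from the mean and is kept, while 68 lies 3.6 standard deviations away and is flagged.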

How to find outliers?

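Z-scores quantify how unusual an observation is when your data follow a normal distribution: a z-score is the number of standard deviations a value lies from the mean. For normally distributed data, about 99.7% of values fall within 3 standard deviations of the mean, which is why 3 is a common cut-off. The following code calculates the cut-off for identifying outliers as values more than 3 standard deviations from the mean.

cutOff = std * 3
lowerVal, upperVal = mean - cutOff, mean + cutOff
print(lowerVal, " , " , upperVal)

-2.9836523600783376 , 3.0015065497718862
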
The next step is to identify the values that fall outside the lower and upper limits.

outliers = [x for x in data if x < lowerVal or x > upperVal]
print('Identified outliers: %d' % len(outliers))
print(outliers)

Identified outliers: 3
[-3.1031745665407713, -3.416992780294376, -3.673596518800302]

Alternatively, you can keep only the values in the sample that lie within the defined limits.

# Inclusive bounds make this the exact complement of the outlier test
outliers_removed = [x for x in data if lowerVal <= x <= upperVal]
print('Non-outlier observations: %d' % len(outliers_removed))

Non-outlier observations: 997

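The same split can also be expressed with z-scores and NumPy boolean masks instead of list comprehensions. This is a sketch rather than part of the walkthrough: the seeded generator, the seed value, and the variable names are assumptions chosen so the run is repeatable.

```python
import numpy as np

# Seeded generator so repeated runs give the same sample (seed is arbitrary)
rng = np.random.default_rng(seed=42)
data = rng.standard_normal(1000)

# z-score: how many standard deviations each value lies from the mean
z_scores = (data - data.mean()) / data.std()

# Boolean mask marking values more than 3 standard deviations from the mean
is_outlier = np.abs(z_scores) > 3

outliers = data[is_outlier]
kept = data[~is_outlier]

print("Identified outliers:", outliers.size)
print("Non-outlier observations:", kept.size)
```

Because the mask and its negation cover every element exactly once, the two counts always add up to the size of the sample.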
Full Source
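from numpy.random import randn
from numpy import mean
from numpy import std

# Generate 10000 samples with a mean of about 50 and a standard deviation of about 5
data = 5 * randn(10000) + 50
# Compute summary statistics (this rebinds mean and std to the computed values)
mean, std = mean(data), std(data)
# Define the outlier limits as 3 standard deviations either side of the mean
cutOff = std * 3
lowerVal, upperVal = mean - cutOff, mean + cutOff
# Identify the outliers
outliers = [x for x in data if x < lowerVal or x > upperVal]
print('Identified outliers: %d' % len(outliers))
# Keep only the values within the limits
outliers_removed = [x for x in data if lowerVal <= x <= upperVal]
print('Non-outlier observations: %d' % len(outliers_removed))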