# Standard Deviation - Machine Learning

Variance and **Standard Deviation** (std) both measure the spread of the data in a dataset. Variance is the average of the **squared differences** from the mean, and the standard deviation is the square root of the variance. The mean is the sum of all the numbers divided by how many numbers there are.

**Standard Deviation** is a number that describes how spread out the values in a dataset are. A low standard deviation means that most of the values are **close to the mean**, and a high standard deviation means that the values are spread out over a wider range.
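To make these definitions concrete, here is a minimal sketch that computes the mean, variance, and standard deviation by hand and compares the results with NumPy. The sample values and variable names are illustrative assumptions. By default, `np.var` and `np.std` divide by the number of values (`ddof=0`), which matches the "average of the squared differences" definition used here.

```python
import numpy as np

# Illustrative values (an assumption for this sketch)
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Mean: sum of the values divided by how many there are
manual_mean = sum(values) / len(values)

# Variance: average of the squared differences from the mean
manual_variance = sum((x - manual_mean) ** 2 for x in values) / len(values)

# Standard deviation: square root of the variance
manual_std = manual_variance ** 0.5

print("Mean:", manual_mean)
print("Variance:", manual_variance)
print("Standard deviation:", manual_std)

# NumPy's defaults (ddof=0) use the same formulas
print("Variance matches NumPy:", bool(np.isclose(manual_variance, np.var(values))))
print("Std matches NumPy:", bool(np.isclose(manual_std, np.std(values))))
```

For these values the mean is 5.0, the variance is 4.0, and the standard deviation is 2.0. If you need the sample standard deviation instead (dividing by N - 1), pass `ddof=1` to `np.std`.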

## How to find the Standard Deviation of a dataset

```python
import numpy as np
from numpy.random import randn

# Draw 1000 samples from the standard normal distribution
data = randn(1000)

# Call np.mean / np.std so the result variables do not shadow the functions
mean, std = np.mean(data), np.std(data)
print("Mean of Dataset :", mean)
print("Standard Deviation :", std)
```

```
Mean of Dataset : 0.06621379297509393
Standard Deviation : 1.0152439350441724
```

The standard deviation can be used to identify **outliers** in your data. If a value lies more than a chosen number of **standard deviations** away from the mean, that data point is identified as an outlier.

## How to find outliers?

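Z-scores can quantify how unusual an observation is when your data follow the **normal distribution**. A z-score is the number of standard deviations a value lies from the mean. Because `randn` generates new random data on every run, the exact numbers will vary; the limits printed below come from a different run than the mean and standard deviation shown earlier. The following code calculates the **cut-off** for identifying outliers as values more than 3 standard deviations from the mean.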

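```python
# Cut-off distance: 3 standard deviations
cutOff = std * 3

# Lower and upper limits around the mean
lowerVal, upperVal = mean - cutOff, mean + cutOff
print(lowerVal, ",", upperVal)
```

```
-2.9836523600783376 , 3.0015065497718862
```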
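The same cut-off can be expressed directly as z-scores: a value lies outside the limits when the absolute value of its z-score is greater than 3. The sketch below is a minimal illustration that generates its own data; the fixed seed and the names `sample`, `z_scores`, and `is_outlier` are assumptions made for this example.

```python
import numpy as np

# Fixed seed is an assumption, used only so the sketch is repeatable
rng = np.random.default_rng(42)
sample = rng.standard_normal(1000)

sample_mean, sample_std = np.mean(sample), np.std(sample)

# z-score: how many standard deviations each value lies from the mean
z_scores = (sample - sample_mean) / sample_std

# |z| > 3 expresses the same rule as the lower/upper limits
is_outlier = np.abs(z_scores) > 3

# Limit-based version of the rule, for comparison
lower, upper = sample_mean - 3 * sample_std, sample_mean + 3 * sample_std
outside_limits = (sample < lower) | (sample > upper)

print("Outliers by z-score:", int(is_outlier.sum()))
print("Outliers by limits:", int(outside_limits.sum()))
```

Working with z-scores keeps the threshold independent of the units of the data, so the same `3` applies whether the values are centred on 0 or on 50.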

The next step is to identify the **outliers** that fall outside the defined lower and upper limits.

```python
# Keep only the values that fall outside the limits
outliers = [x for x in data if x < lowerVal or x > upperVal]
print('Identified outliers: %d' % len(outliers))
print(outliers)
```

```
Identified outliers: 3
[-3.1031745665407713, -3.416992780294376, -3.673596518800302]
```

Alternatively, you can **filter out** the values that are not within the defined limits and keep the rest of the sample.

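```python
# Keep values within the limits (inclusive, so this is the exact complement of the outlier test)
outliers_removed = [x for x in data if lowerVal <= x <= upperVal]
print('Non-outlier observations: %d' % len(outliers_removed))
```

```
Non-outlier observations: 997
```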
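Since the data is already a NumPy array, the same filtering can also be done with a boolean mask instead of a list comprehension. This is a minimal sketch under stated assumptions: it generates its own data with a fixed seed, and the names `within_limits`, `kept`, and `removed` are illustrative.

```python
import numpy as np

# Fixed seed is an assumption, used only so the sketch is repeatable
rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)

# Limits at 3 standard deviations from the mean
cut_off = 3 * np.std(sample)
lower, upper = np.mean(sample) - cut_off, np.mean(sample) + cut_off

# Boolean mask: True where a value lies within the limits (inclusive)
within_limits = (sample >= lower) & (sample <= upper)

kept = sample[within_limits]      # non-outlier observations
removed = sample[~within_limits]  # outliers

print("Non-outlier observations:", kept.size)
print("Identified outliers:", removed.size)
```

The mask and its negation split the sample into two parts that always add up to the original size, and both results stay NumPy arrays for further processing.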

**Full Source**

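```python
import numpy as np
from numpy.random import randn  # this import was missing in the listing

# 10,000 values drawn from a normal distribution with mean 50 and standard deviation 5
data = 5 * randn(10000) + 50
mean, std = np.mean(data), np.std(data)

# Limits at 3 standard deviations from the mean
cutOff = std * 3
lowerVal, upperVal = mean - cutOff, mean + cutOff

# Identify the outliers
outliers = [x for x in data if x < lowerVal or x > upperVal]
print('Identified outliers: %d' % len(outliers))

# Remove the outliers
outliers_removed = [x for x in data if lowerVal <= x <= upperVal]
print('Non-outlier observations: %d' % len(outliers_removed))
```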
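If you need this check in more than one place, the steps can be wrapped in a small function. The helper below, `split_outliers`, is a hypothetical name introduced for this sketch and is not part of NumPy; the `n_std` parameter and the fixed seed are likewise assumptions.

```python
import numpy as np

def split_outliers(values, n_std=3.0):
    """Hypothetical helper: split values into (kept, outliers) using mean +/- n_std * std."""
    values = np.asarray(values, dtype=float)
    center, spread = values.mean(), values.std()
    lower, upper = center - n_std * spread, center + n_std * spread
    mask = (values >= lower) & (values <= upper)
    return values[mask], values[~mask]

# Same kind of data as the full source, with a fixed seed (an assumption) for repeatability
rng = np.random.default_rng(1)
data = 5 * rng.standard_normal(10000) + 50

kept, outliers = split_outliers(data)
print('Identified outliers: %d' % outliers.size)
print('Non-outlier observations: %d' % kept.size)

# Small hand-checkable case: twenty 10s and a single 100
small_kept, small_outliers = split_outliers([10] * 20 + [100])
print(small_outliers)
```

In the small case, the single value of 100 lies more than 3 standard deviations above the mean, so it is returned as the only outlier while the twenty 10s are kept.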
