Outliers — Where are they?

Sushrut Shitoot
May 1, 2020

In the previous post we briefly discussed what outliers are, how they get introduced into the data and finally a few scenarios to understand how to approach them in different contexts. To recap, outliers are data points within a data series that are considerably far away from the majority of the data points.

The question still remains — how do we identify outliers in a given data series?

There are quite a few ways to identify outliers. In fact, outlier detection is one of the key steps in data processing when preparing the analytical dataset for your machine learning algorithms.

In this post I want to cover 4 methods for outlier detection that I have found quite effective:

1. Visualization
2. Standard deviation
3. Inter-quartile range
4. Percentile thresholds

1. Visualization
Data visualization is probably one of the most important parts of data science. Plotting your data brings out patterns that are not evident from just eyeballing the raw values.
Box plots, histograms and fitted line charts are the popular methods you can try out to detect the presence of outliers.
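To make this concrete, here is a minimal sketch of a histogram drawn as plain text, so it runs without any plotting library. The function name `text_histogram` and the sample `readings` are made up for illustration; in practice you would more likely reach for a plotting library's box plot or histogram.

```python
def text_histogram(values, bins=10, width=40):
    """Return (counts, lines) for a simple equal-width text histogram."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    counts = [0] * bins
    for x in values:
        # Map each value to a bin; the maximum value goes into the last bin.
        idx = min(int((x - lo) / span * bins), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, c in enumerate(counts):
        left_edge = lo + span * i / bins
        bar = "#" * round(c / peak * width)
        lines.append(f"{left_edge:8.1f} | {bar} {c}")
    return counts, lines


# Hypothetical sample: 19 ordinary readings and one suspicious value.
readings = [10, 12, 11, 9, 10, 11, 10, 9, 12, 10,
            11, 10, 9, 11, 10, 12, 9, 10, 11, 95]

counts, lines = text_histogram(readings)
print("\n".join(lines))
```

A lone bar sitting far away from the bulk of the data, separated by empty bins, is the visual cue that a value deserves a closer look.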

2. Standard deviation
Many real-world data series are assumed to roughly follow a normal distribution. Leveraging this assumption, the data series is standardized using a Z-transformation: subtract the mean from each value and divide by the standard deviation, so that the transformed series has mean 0 and standard deviation 1. A Z-score represents how many standard deviations a data point lies from the mean of the distribution. For data that is close to normal, almost all Z-scores fall between -3 and +3. In the presence of outliers, you will most likely see much higher or lower values (e.g. +20.73, -7.7).
For a normal distribution, about 99.7% of the data points fall within 3 standard deviations of the mean. So if a Z-score lies beyond 3 in either direction, the value is highly unusual, with an extremely low likelihood of occurrence, and should be investigated further.
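As a rough sketch of the Z-score check, using only Python's standard library: the helper name `zscore_outliers`, the default threshold of 3 and the sample `readings` are assumptions chosen for illustration, and the population standard deviation is used here (the sample standard deviation would work just as well on larger data).

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return (value, z_score) pairs whose |z| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can stand out
    flagged = []
    for x in values:
        z = (x - mu) / sigma
        if abs(z) > threshold:
            flagged.append((x, z))
    return flagged


# Hypothetical sample: 19 ordinary readings and one suspicious value.
readings = [10, 12, 11, 9, 10, 11, 10, 9, 12, 10,
            11, 10, 9, 11, 10, 12, 9, 10, 11, 95]

for value, z in zscore_outliers(readings):
    print(f"{value} has a Z-score of {z:.2f}")
```

One thing to keep in mind with very small series: a single point cannot reach a Z-score of 3 when there are only about ten observations, so the threshold of 3 only becomes meaningful with more data.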

3. Inter-quartile range (IQR)
The inter-quartile range of a data series is the difference between the third quartile (Q3) and the first quartile (Q1), i.e. the spread of the middle 50% of the values. Conventionally, data points falling more than 1.5 times the IQR below Q1 or above Q3 are considered to be outliers. These threshold points are referred to as the inner fences. This works well in cases where outliers are expected to be extremely rare. In cases where there is some ambiguity regarding the potential outliers, a more lenient criterion can be used instead, widening the thresholds to 3 times the IQR so that only the most extreme points are labeled as outliers. These thresholds are known as the outer fences.
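The fences can be sketched in a few lines with the standard library. The function names `iqr_fences` and `iqr_outliers` and the sample `values` are hypothetical, and the quartiles come from `statistics.quantiles` with its default method; other tools may compute quartiles slightly differently and so give slightly different fences.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences at k times the IQR beyond Q1 and Q3."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [x for x in values if x < lower or x > upper]


# Hypothetical sample with one mild and one extreme high value.
values = [7, 8, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13, 19, 40]

print("inner fences:", iqr_fences(values, k=1.5))
print("beyond inner fences:", iqr_outliers(values, k=1.5))
print("beyond outer fences:", iqr_outliers(values, k=3.0))
```

In this sample both 19 and 40 fall beyond the inner fences, while only 40 falls beyond the outer fences, which shows how the 3 times IQR rule keeps only the most extreme points.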

4. Percentile thresholds
This is quite intuitive actually. Depending on the data series, the values in the first and the last few percentiles are deemed to be outliers. However, not every scenario merits this, so the method must be used with caution. Personally, I treat the bottom and top 5% as outliers for strict labeling and the bottom and top 2% for more liberal labeling.
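A minimal sketch of percentile thresholds, again with the standard library only. The function name `percentile_outliers`, its `tail_percent` parameter and the evenly spread sample data are assumptions for illustration; the percentile cut points come from `statistics.quantiles` with `method="inclusive"`.

```python
from statistics import quantiles


def percentile_outliers(values, tail_percent=5):
    """Flag values below the tail_percent-th or above the
    (100 - tail_percent)-th percentile. tail_percent is an integer 1..49."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    low = cuts[tail_percent - 1]          # e.g. 5th percentile
    high = cuts[100 - tail_percent - 1]   # e.g. 95th percentile
    return [x for x in values if x < low or x > high]


# Hypothetical, perfectly evenly spread data: the integers 1 to 100.
data = list(range(1, 101))

print("strict (5%):", percentile_outliers(data, tail_percent=5))
print("liberal (2%):", percentile_outliers(data, tail_percent=2))
```

The sample data contains nothing unusual, yet values are flagged at both ends anyway. That is the reason for the caution above: percentile thresholds always label a fixed share of the data, whether or not true outliers exist.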

The above list is in no way an exhaustive set of means to detect outliers. However, these are quick and quite effective considering the effort required to apply them.

Keep in mind that ‘outlier detection’ only tells you whether your data set has outliers and which of the observations can be considered outliers. We still need to explore our options regarding what we can do about them. Watch out for the next post where I’ll be talking about the next logical step i.e. ‘outlier treatment’.
