Outliers – Innocent bystanders or outlaws?
Outliers are troublemakers, but they are also an interesting set of data points. Depending on the context of the problem, they can make or break the deal. It is imperative that they get special attention whenever you try to solve a problem with data.
Let’s start with what they are.
Outliers are data points that lie at an abnormal distance from the rest of the data. What counts as an abnormal distance, however, is subjective and depends on the context. There are different ways to identify outliers within a dataset. We’ll take a look at them at a later time.
But let’s get back to outliers for now.
Why are we even talking about outliers? What happens if you retain some of these data points while you process the data and build some cool machine learning models?
An interesting fact about machine learning models: quite a few of them are sensitive to the range of the data, and some make strict assumptions about the distribution that the data follows. As a result, outliers in the data may distort model training, produce incorrect parameter estimates and bias the model results. They can also lead the model to give undue emphasis or weight to observations that occur extremely rarely.
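To make this concrete, here is a minimal sketch that uses only the Python standard library. The data is made up for illustration (a perfect line y = 2x with one corrupted reading), and `ols_slope` is a hypothetical helper rather than a library function. It shows how a single outlier shifts the mean and the slope of a least-squares fit, while the median stays put.

```python
from statistics import mean, median


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Made-up data: y = 2x exactly, for x = 1..10
xs = list(range(1, 11))
clean_ys = [2 * x for x in xs]

# Same series, but the last reading was recorded as 200 instead of 20
dirty_ys = clean_ys[:-1] + [200]

clean_slope = ols_slope(xs, clean_ys)
dirty_slope = ols_slope(xs, dirty_ys)

print("mean:  ", mean(clean_ys), "->", mean(dirty_ys))      # the mean moves
print("median:", median(clean_ys), "->", median(dirty_ys))  # the median does not
print("slope: ", clean_slope, "->", round(dirty_slope, 2))  # the fitted slope moves
```

One bad value out of ten is enough to make the fitted slope several times steeper, which is the kind of parameter distortion described above.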
So now that we have seen what implications outliers might have if left untreated in a dataset, let’s take a minute to think about how they usually get there in the first place.
I have borrowed these pointers from a very interesting blog post [1] that sheds some light on why you might encounter outliers in your datasets. To summarize, your data may contain outliers for the following reasons:
- Human errors: Errors due to data entry
- Instrumental errors: Errors due to incorrect measurement
- Experimental errors: Errors committed while extracting data
- Processing errors: Data mutations due to incorrect data handling
- Sampling errors: Assimilation or collation of data from multiple sources
- Intentional placement of outliers for testing
- Natural variations in the data
It is extremely important that the data scientist understands what the outliers mean in the real world in the context of the problem at hand.
Scenario 1: In a classroom setting, a student who scores 99% on a test and another student who scores 9% on the same test are both outliers. The teacher grading the papers notices that these exceptional data points lie far away from the class average of 72%. The teacher decides that both are exceptional cases and need to be highlighted. The effort that student 1 put into preparing for the exam needs to be recognized and appreciated. Student 2, who scored 9%, may need additional help in understanding the concepts.
Insight: Within the same data series, outliers can be treated differently.
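The teacher's reasoning can be sketched in a few lines of Python. Everything here is an assumption for illustration: the scores are invented so that the class average works out to 72, the 25-point margin is an arbitrary cutoff, and `triage_scores` is a hypothetical helper.

```python
from statistics import mean


def triage_scores(scores, margin=25):
    """Flag scores at least `margin` points away from the class average.

    `margin` is an assumed cutoff, not a rule; pick one that suits the context.
    High and low outliers get different follow-up actions.
    """
    avg = mean(list(scores.values()))
    actions = {}
    for student, score in scores.items():
        if score - avg >= margin:
            actions[student] = "recognize"
        elif avg - score >= margin:
            actions[student] = "offer extra help"
    return actions


# Invented class of ten students whose average is 72
scores = {
    "student 1": 99, "student 2": 9, "student 3": 70, "student 4": 75,
    "student 5": 78, "student 6": 80, "student 7": 74, "student 8": 79,
    "student 9": 76, "student 10": 80,
}

actions = triage_scores(scores)
print(actions)
```

Both outliers are flagged by the same distance check, but each one is mapped to a different action.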
Scenario 2: A pharmaceutical company wants to send promotional content about its product to physicians. But the team observes that the age of a particular physician, physician 1, was incorrectly entered as 4, an issue caused by incorrect data entry. Another physician, physician 2, is 89 years old and is still practicing. (Can’t thank them enough in the current situation as well as otherwise). But he is not too comfortable with the digital intervention that companies are bringing in these days and has opted out of the study. The team decides to consider only the most reliable observations with no ambiguity and to drop these 2 physicians from the analysis.
Insight: Within the same data series, outliers can also be treated similarly.
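As a rough sketch of that decision, the snippet below drops both physicians even though the reasons differ. The minimum plausible age, the `Physician` record and the third physician are assumptions added for illustration only.

```python
from dataclasses import dataclass

# Assumed lower bound for a practicing physician's age (illustrative only)
MIN_PLAUSIBLE_AGE = 22


@dataclass
class Physician:
    name: str
    age: int
    opted_out: bool = False


def drop_reason(p):
    """Return why a record is considered unreliable, or None if it is kept."""
    if p.age < MIN_PLAUSIBLE_AGE:
        return "implausible age (likely data entry error)"
    if p.opted_out:
        return "opted out"
    return None


physicians = [
    Physician("physician 1", age=4),
    Physician("physician 2", age=89, opted_out=True),
    Physician("physician 3", age=45),  # hypothetical record with no issues
]

kept = [p.name for p in physicians if drop_reason(p) is None]
dropped = {p.name: drop_reason(p) for p in physicians if drop_reason(p)}
print("kept:", kept)
print("dropped:", dropped)
```

Note that the 89-year-old is not dropped because of his age, which is a genuine value, but because he opted out.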
Scenario 3: A credit card company notices that for the last 4 consecutive months, a user has increased spending on the credit card by almost 400%, which puts him among the top 10 spenders for the last quarter. Such a pattern has been seen for a few other customers as well, and the top 10 spenders’ list has changed completely with 6 first-time entrants for their respective credit categories. The company decides to send them promotions and offers regarding credit card upgrades based on their spending capacity.
Insight: Within the same data series, outliers were of special interest and were retained.
Hence, not all outliers are outlaws that might cause trouble. Some are gifted personalities that might make your life more interesting. It is worth weighing their relevance to the problem. In some cases, you will need to coax them into providing insights, while in others you may need to strong-arm them when they create a nuisance in your analysis.
Watch out for the next post where we will discuss how we can detect / identify outliers.
References
[1]: https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561