A data point that is significantly different from other data points in a data set is considered an outlier. If you find an outlier in your event log, should you remove it before you continue with your process mining analysis?
In process mining terms, an outlier can mean many different things:
- A case that has a much longer duration than others (see the sketch after this list)
- An event with a timestamp that lies in the future or way in the past
- A case that has many more events than other cases
- A variant that exhibits unique behavior
- An attribute value that occurs only a few times, or much more often than other values
- Activities that occur in a different order than what you normally see
- A case that starts or ends at an unusual point in the process
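As an illustration of the first kind of outlier, here is a minimal sketch in Python with pandas that flags cases with an unusually long duration. The column names (`case_id`, `activity`, `timestamp`), the file name, and the 1.5×IQR cut-off are assumptions, not part of any specific tool, and you would adapt them to your own data.

```python
import pandas as pd

# Hypothetical event log: one row per event with a case ID, an activity
# name, and a timestamp (column and file names are assumptions).
log = pd.read_csv("event_log.csv", parse_dates=["timestamp"])

# Per-case statistics: duration and number of events.
cases = log.groupby("case_id").agg(
    start=("timestamp", "min"),
    end=("timestamp", "max"),
    num_events=("activity", "size"),
)
cases["duration_days"] = (cases["end"] - cases["start"]).dt.total_seconds() / 86400

# Flag cases far above the interquartile range. This is one common
# heuristic, not a fixed rule: what counts as "too long" depends on your process.
q1, q3 = cases["duration_days"].quantile([0.25, 0.75])
outliers = cases[cases["duration_days"] > q3 + 1.5 * (q3 - q1)]
print(outliers.sort_values("duration_days", ascending=False).head())
```

The same grouping also gives you the number of events per case, so cases with many more events than others can be flagged in the same way.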
In machine learning, outliers are sometimes removed from the data sample during a cleaning step to improve the model. So, what about process mining: Should you remove such outliers when you find them to better represent the mainstream behavior of your process?
It depends.
First, you need to check whether the outlier is a data quality problem or whether it really happened in the process. As a rule of thumb, you should remove outliers that are caused by data quality issues and keep the ones that truly happened.
For example, one reason that a case has a much longer duration than others could be that it contains an event with a zero timestamp (such as 1900, 1970, or 2999). Zero timestamps can be errors or indicate that an activity has not happened yet. Either way, they do not reflect the actual time of the activity and, therefore, are misleading.
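To make this concrete, here is a small sketch of how such placeholder timestamps could be spotted before you decide what to remove. The column names and the plausible date range are assumptions that you would adapt to your own data.

```python
import pandas as pd

log = pd.read_csv("event_log.csv", parse_dates=["timestamp"])

# "Zero" timestamps typically show up as dates far outside the period in
# which the process could plausibly have run (e.g., 1900, 1970, or 2999).
earliest_plausible = pd.Timestamp("2010-01-01")
latest_plausible = pd.Timestamp.today() + pd.Timedelta(days=1)

suspicious = log[(log["timestamp"] < earliest_plausible) |
                 (log["timestamp"] > latest_plausible)]

# Inspect first, then decide whether to drop only these events or the
# whole cases they belong to.
print(suspicious[["case_id", "activity", "timestamp"]])
print("Affected cases:", suspicious["case_id"].nunique())
```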
Another reason could be that the one case that took 20 times as long as you would expect (for example, 20 months instead of 4 weeks) really belongs to a crazy customer case that took multiple rounds, lots of ping pong between different departments, and simply an unusually long time to resolve. This is part of the process reality.
When you should remove outliers
You should clean up your outliers in the following situations:
- Zero timestamps first need to be investigated. Then, based on the situation, you decide whether to remove just the event with the zero timestamp or the whole case.
- If you have a very long case because events with missing case IDs were collected under one and the same placeholder case ID, you need to remove this case.
- If you have activities that occur in a different order, first investigate the root cause. For example, if the different order is caused by activities with the same timestamp, re-sort the data set and import it again (no events need to be removed). If it is due to different timestamp granularities, import the data again at the most coarse-grained level. If it is due to different clocks, the differences need to be resolved before the data sets are merged.
- Cases that have an unusual start or end point are most likely not errors but simply incomplete cases. Nevertheless, if you want to analyze the end-to-end process, you should remove incomplete cases to prepare your data set for the analysis (see the sketch after this list).
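Here is a sketch of two of these clean-up steps in pandas: re-sorting same-timestamp events and filtering out incomplete cases. The secondary sort column (`sort_order`) and the end activity names are assumptions about your data, not something that exists in every log.

```python
import pandas as pd

log = pd.read_csv("event_log.csv", parse_dates=["timestamp"])

# Re-sort events that share the same timestamp within a case, using a
# hypothetical secondary sort key ('sort_order') that reflects the logical
# order of the activities. Then write the file out for re-import.
log = log.sort_values(["case_id", "timestamp", "sort_order"])
log.to_csv("event_log_resorted.csv", index=False)

# Keep only complete cases for an end-to-end analysis: here, cases whose
# last activity is one of the known end points of the process (assumed names).
end_activities = {"Order completed", "Order cancelled"}
last_activity = log.groupby("case_id")["activity"].last()
complete_ids = last_activity[last_activity.isin(end_activities)].index
complete_log = log[log["case_id"].isin(complete_ids)]
print(f"Kept {complete_log['case_id'].nunique()} of {log['case_id'].nunique()} cases")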
Be mindful of how much data you remove in the cleaning process. If too much is removed, the remaining data set may no longer be representative.
And keep in mind that not all data quality problems are outliers! For example, the recorded timestamps may not reflect the actual time of the activities but still look entirely normal.
When you should keep outliers
The idea behind keeping outliers that reflect what really happened is that you want to see the whole picture of the process. Sometimes, the exceptions in the process are the most interesting result of your analysis, especially when they imply compliance issues or security risks (say, a violation of the segregation of duties rule).
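As an illustration, such a segregation of duties check could look like the following sketch: it flags cases in which the same person performed two activities that should be done by different people. The activity names ('Approve invoice', 'Pay invoice') and the 'resource' column are assumptions about your log.

```python
import pandas as pd

log = pd.read_csv("event_log.csv", parse_dates=["timestamp"])

# Who approved and who paid in each case? (Activity and column names are
# assumptions; adapt them to your own process and log.)
approvers = log[log["activity"] == "Approve invoice"].groupby("case_id")["resource"].agg(set)
payers = log[log["activity"] == "Pay invoice"].groupby("case_id")["resource"].agg(set)

# A violation: at least one person appears in both sets for the same case.
both = approvers.to_frame("approved_by").join(payers.to_frame("paid_by"), how="inner")
violations = both[[bool(a & p) for a, p in zip(both["approved_by"], both["paid_by"])]]
print(f"{len(violations)} cases where the same person approved and paid")
```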
For example, you should keep outliers in the following situations:
- Cases with an unusually long duration that really took that long.
- Variants that exhibit unusual behavior, if that behavior really happened. In fact, auditors often deliberately filter their data set in such a way that they only see the low-frequency variants, because they are interested in the exceptional cases (see the sketch after this list).
- Activities that actually occurred in a different order.
- Even if activities occur in a different order due to a data quality problem such as missing timestamps for activity repetitions, you would not remove these cases but interpret the results with the knowledge of the underlying data issue.
- There are analyses for which incomplete cases should not be removed (for example, when you want to see where the currently running cases are in the process).
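Here is a minimal sketch of the variant filter mentioned above: it keeps only the cases that follow a rare variant. The 1% cut-off is a working assumption, not a standard value.

```python
import pandas as pd

log = pd.read_csv("event_log.csv", parse_dates=["timestamp"])
log = log.sort_values(["case_id", "timestamp"])

# A variant is the sequence of activities within a case.
variants = log.groupby("case_id")["activity"].agg(" -> ".join)
variant_counts = variants.value_counts()

# Keep only cases that follow a rare variant, e.g. one that covers fewer
# than 1% of all cases (the cut-off is an assumption, adjust it as needed).
rare = variant_counts[variant_counts / len(variants) < 0.01].index
rare_log = log[log["case_id"].isin(variants[variants.isin(rare)].index)]
print(f"{variants.isin(rare).sum()} cases follow a rare variant")
```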
At the same time, there are reasons to specifically address - and sometimes even remove - outliers although they are “real”. For example:
- When you analyze performance measures such as case durations or waiting times, it is a good idea to use the median instead of the mean, because the median is less influenced by outliers (see the sketch after this list).
- When you try to understand a complex process, simplification strategies such as looking at the main process or focusing on the frequent variants are needed to get an overview. This is part of taking different views on your process during the analysis.
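For illustration, computing both the mean and the median case duration from the same log (column and file names assumed as before) makes the difference visible:

```python
import pandas as pd

log = pd.read_csv("event_log.csv", parse_dates=["timestamp"])

# Case duration = time between the first and the last event of the case.
case_times = log.groupby("case_id")["timestamp"].agg(["min", "max"])
durations = case_times["max"] - case_times["min"]

# A few extreme but real cases can pull the mean far away from the typical
# case, while the median stays close to what most cases experience.
print("Mean case duration:  ", durations.mean())
print("Median case duration:", durations.median())
```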
So, if outliers really happened in the process, then you generally want to keep them, because you want to see everything that is really there (just like you don't need a minimum number of data points to perform a process mining analysis). But you want to be aware of them in the analysis.