This is the seventh article in our series on data quality problems for process mining. You can find an overview of all articles in the series here.
Last year, a Dutch insurance company completed the process mining analysis of several of their processes. For some processes, it went well and they could get valuable insights out of it. However, for the bulk of their most important core processes, they realized that the workflow system was not used in the way it was intended to be used.
What happened was that the employees took the dossier for a claim to their desk, worked on it there, and put it in a pile with other claims. At the end of the week, they then went to the IT system and logged in the information — Essentially documenting the work they had done earlier.
This way of working has two problems:
- It shows that the system is not supporting the case worker in what they have to do. Otherwise they would want to use the system to guide them along. Instead, the documentation in the system is an additional, tedious task that is delayed as much as possible.
- Of course, this also means that the timestamps that are recorded in the system do not represent the actual time when the activities in the process really happened. So, doing a process mining analysis based on this data is close to useless.
The company is now working on improving the system to better support their employees, and to — eventually — also be able to restart their process mining initiative again.
You might encounter such problems in different areas. For example, a doctor may be walking around all day, speak with patients, write prescriptions, etc. And then by the end of the day she sits down in her office and writes up the performed tasks for the administrative system. Another example is that the timestamps of a particular process step are manually provided and people make typos when entering them.
So, what can you do if you find that your data has the problem that the recorded time does not reflect the actual time of the activities?
How to fix:
First of all, you need to become aware that your data has this problem. That’s why the data validation step is so important (more on data validation sessions in a later article).
Once you can make an assessment of the severity of the gap between the recorded timestamps in your data and the actual timestamps of the recorded activities, you need to decide whether (a) the problem is localized or predictable, or (b) all-encompassing and too big to analyze the data in any useful way.
If the problem is only affecting a certain activity or part in your process (localized), you may choose to discard these particular activities for not being reliable enough. Afterwards, you can still analyze the rest of the process.
If the offset is not that big and predictable (like the doctor writing up her activities at the end of the day), you can choose to perform your analysis on a more coarse-grained scale. For example, you will know that it does not make sense to analyze the activities of the doctor in the hospital on the hour- or minute-level (even if the recorded timestamps carry the minutes, technically). But you can still analyze the process on a day-level.
Finally, if the problem is too big and you don’t know when any of the activities actually happened (like in the example of the insurance company), you may have to decide that the data is not good enough to use for your process mining analysis at the moment.