Data Quality Problems in Process Mining and What To Do About Them — Part 11: Data Validation Session with Domain Expert

This is the eleventh article in our series on data quality problems for process mining. You can find an overview of all articles in the series here.

A common and unfortunate process mining scenario goes like this: You present a process problem that you have found in your process mining analysis to a group of process managers. They look at your process map and point out that this can’t be true. You dig into the data and find out that a data quality problem was, in fact, the cause of the process pattern that you discovered.

The problem with this scenario is that, even if you then go and fix the data quality problem, the trust that you have lost on the business side often cannot be won back. They won’t trust your future results either, because “the data is all wrong”. That’s a pity, because there could have been great opportunities in analyzing and improving this process!

To avoid this, we recommend planning a dedicated data validation session with a process or domain expert before you start the actual analysis phase of your project. To manage expectations, communicate explicitly that the purpose of the session is not yet to analyze the process, but to ensure that the data quality is good before you proceed with the analysis itself.

You can ask both a domain expert and a data expert to participate in the session. The input of the domain expert is the most important here, because you want to spot problems in the data from the perspective of the process owner for whom you are performing the analysis (you can book a separate meeting with a data expert to walk through your data questions later). Ideally, your domain expert has access to the operational system during the session, so that you can look up individual cases together if needed.

To organize the data validation session with the domain expert, you can do the following:

  • Start by briefly explaining what process mining is. Show at most 5 slides and consider giving a very short demo with a clean and simple example. Unless they have recently attended a presentation about process mining, you should assume that they either do not know what process mining is at all or only have a vague idea.
  • Then, restate the purpose of the session and explain that you want to validate the data with them and collect potential issues and questions on the way.
  • Consider asking them to draw a very simple process map (just boxes and arrows) of the process from their perspective, with at most 7 steps, on a flip chart or whiteboard. This will be useful as a reference point when you are trying to understand the meaning of certain process steps later on in the meeting.
  • Show them the data in raw format (for example, in Excel) and explain where you got the data and how it was extracted. Point out the Case ID, Activity, and Timestamp columns that you are using.
  • Then, import the data in front of their eyes and go over the summary information (showing the timeframe of the data, the attributes, etc.). Afterwards, look at the process map and inspect the top variants with them. Look at some example cases and ask them: “Does this make sense to you?” Write down any issues that they mention.
  • If you find strange patterns in the process behavior, filter the data to get to some example cases for further context. Simplify the process map if needed (see this article on simplification strategies) and interactively look into the issues that you find together. Try to find answers to questions right in the session if possible and otherwise write them up as an action point.
  • If you can, look up a few cases in the operational system together (many systems allow you to search by case number, or customer number, and inspect the history of an individual case) and compare them with the case sequences that you find in Disco to see whether they match up as expected.
  • Of course, you may have already run into questions yourself while going through the data quality checklist before this data validation session. You can go through them with the domain expert to see whether they have some explanations for the problems that you have observed.
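
The checks in the session are done interactively in the process mining tool, but the underlying logic is simple enough to sketch in a few lines. The following Python example is only an illustration under stated assumptions: the event log rows, the column order, the timestamp format, and the function names are all invented for this sketch. It shows the timeframe of the data, the most frequent variants, and a comparison of individual cases against the sequence that the domain expert reads out from the operational system.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical event log rows: (case ID, activity, timestamp).
# In practice these would come from the extracted file; the values are invented.
EVENTS = [
    ("C1", "Register", "2024-01-02 09:00"),
    ("C1", "Check", "2024-01-02 10:30"),
    ("C1", "Close", "2024-01-03 08:15"),
    ("C2", "Register", "2024-01-02 11:00"),
    ("C2", "Close", "2024-01-04 16:45"),
    ("C3", "Register", "2024-01-05 09:20"),
    ("C3", "Check", "2024-01-05 13:00"),
    ("C3", "Close", "2024-01-06 10:00"),
]

TS_FORMAT = "%Y-%m-%d %H:%M"  # assumed timestamp pattern of the extract


def timeframe(events):
    """Return the earliest and latest timestamp in the log."""
    stamps = [datetime.strptime(ts, TS_FORMAT) for _, _, ts in events]
    return min(stamps), max(stamps)


def build_traces(events):
    """Group events by case ID and order each case by timestamp.

    The sort is stable, so events with identical timestamps keep
    the order in which they appear in the extract.
    """
    by_case = defaultdict(list)
    for case_id, activity, ts in events:
        by_case[case_id].append((datetime.strptime(ts, TS_FORMAT), activity))
    return {
        case: [activity for _, activity in sorted(rows, key=lambda r: r[0])]
        for case, rows in by_case.items()
    }


def top_variants(traces, n=3):
    """Return the n most frequent activity sequences with their case counts."""
    return Counter(tuple(trace) for trace in traces.values()).most_common(n)


def mismatches(traces, reference):
    """Return cases whose sequence in the log differs from the sequence
    noted from the operational system, as (found, expected) pairs."""
    return {
        case: (traces.get(case), list(expected))
        for case, expected in reference.items()
        if traces.get(case) != list(expected)
    }


traces = build_traces(EVENTS)
print(timeframe(EVENTS))
print(top_variants(traces))

# Sequences that the domain expert looked up in the operational system (invented).
reference = {
    "C1": ["Register", "Check", "Close"],
    "C2": ["Register", "Check", "Close"],
}
print(mismatches(traces, reference))
```

In this invented example, case C2 would be flagged because the "Check" step that the domain expert sees in the operational system is missing from the extracted data. A mismatch like this is exactly the kind of finding to write down as an action point.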

You may find that the domain expert brings up questions about the process that are relevant for the analysis itself. This is great and you should write them down, but do not get side-tracked by the analysis. Steer the session back to your data quality questions to make sure you achieve the goal of this meeting: to validate the data quality and uncover any issues with the data that might need to be cleaned up.

After the validation session, follow up on all of the discovered data problems and investigate them. Also, keep track of which of your original process questions may be affected by the data quality issues that you found. Document the actions that you have taken, or intend to take, to fix them.
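
A spreadsheet is usually enough for this follow-up, but the structure of such an issue log can also be sketched in code. The example below is a minimal sketch with hypothetical names (`DataIssue`, `questions_at_risk`) and invented issues. It records each data problem together with the analysis questions it may affect and the action taken, and then lists the questions that are still touched by an unresolved issue.

```python
from dataclasses import dataclass


@dataclass
class DataIssue:
    """One data quality problem found before or during the validation session."""
    description: str
    affected_questions: list  # analysis questions this issue may distort
    action: str = ""          # what was done, or is planned, to fix it
    resolved: bool = False


def questions_at_risk(issues):
    """Return the analysis questions affected by at least one unresolved issue."""
    at_risk = set()
    for issue in issues:
        if not issue.resolved:
            at_risk.update(issue.affected_questions)
    return sorted(at_risk)


# Invented examples, for illustration only.
ISSUES = [
    DataIssue(
        description="'Check' events are missing for some cases",
        affected_questions=["How often is the check step skipped?"],
        action="Ask the data expert how these events are extracted",
    ),
    DataIssue(
        description="'Close' timestamps only have day precision",
        affected_questions=["How long does closing take?"],
        action="Request the full timestamp in the next extract",
    ),
    DataIssue(
        description="Duplicate rows in the extract",
        affected_questions=["What is the case volume per month?"],
        action="Removed the duplicates before import",
        resolved=True,
    ),
]

for question in questions_at_risk(ISSUES):
    print(question)
```

Keeping the link between issues and questions explicit makes it easier to tell, for each analysis result, whether it can already be presented or still depends on an open data problem.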