As Niels pointed out, analyzing faulty data can not only have unpleasant effects like losing the trust of the process manager. In application areas like healthcare, it can have serious consequences that put people at risk.
In our latest Process Mining Café, we spoke with Kanika Goel from Queensland University of Technology and Niels Martin from Hasselt University about data quality. If you missed the live broadcast or want to re-watch the café, you can now watch the recording here.
First, we discussed why general data quality frameworks like the DAMA dimensions are insufficient when we talk about data quality in process mining: Process mining data has temporal relations as multiple events are linked to a case and ordered in time. This is why there are specific categorizations of data quality problems for process mining in the literature (see links below).
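To make the temporal aspect concrete, here is a minimal sketch (not from the café itself; the column names and example values are assumptions) of an event log where each event belongs to a case and carries a timestamp, together with a simple check for events that are recorded out of chronological order — one of the process-mining-specific quality problems that generic frameworks miss:

```python
from datetime import datetime

# A minimal event log: each event belongs to a case and carries a timestamp.
event_log = [
    {"case": "A1", "activity": "Register", "timestamp": datetime(2023, 5, 1, 9, 0)},
    {"case": "A1", "activity": "Approve",  "timestamp": datetime(2023, 5, 1, 8, 30)},  # earlier than "Register"!
    {"case": "B2", "activity": "Register", "timestamp": datetime(2023, 5, 1, 10, 0)},
]

def out_of_order_cases(log):
    """Return the ids of cases whose recorded events are not in
    ascending timestamp order (a common event log quality problem)."""
    by_case = {}
    for event in log:
        by_case.setdefault(event["case"], []).append(event["timestamp"])
    return sorted(
        case for case, stamps in by_case.items()
        if stamps != sorted(stamps)
    )

print(out_of_order_cases(event_log))  # → ['A1']
```

A generic completeness or validity check would pass this log; only looking at the ordering of events *within* a case reveals the problem.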
We then discussed several practical data quality examples and current research approaches along the four phases of dealing with data quality problems:
-
Detection. Checklists like our data quality checklist (click on the image below to see the complete checklist) help to detect problems in your data set.
Furthermore, Kanika and Niels discussed research approaches that support automated and domain knowledge-assisted data quality checks.
-
Cleaning. After finding and investigating the data quality problems, the data needs to be corrected. You can often do this cleaning step with the process mining tool (see the checklist above for examples). But sometimes, you must go back to the source data to fix it.
Kanika told us about a research project that repairs activity labels with a gamification and crowdsourcing approach.
-
Analyzing the cleaned data. Before you analyze the cleaned data, make sure to check whether the data is still representative! For example, if you had to remove 90% of the cases due to data quality problems, you cannot assume that the remaining 10% represent the entire process. It is also a good idea to create a new baseline for the cleaned data as the basis for your analysis (see Step 2 in this article for an example).
Kanika and Niels see that people often forget that the data has been cleaned and analyze the cleaned data as they would the initial data. They developed an approach that enhances the original data with annotations to maintain awareness about the performed data cleaning and transformation steps.
-
Root causes and prevention. We discussed that process mining newcomers should not expect their data to be perfect. You work with the data that you have. And often, detecting data quality issues is a valuable insight in itself! Strive for data that is “fit for use” and improve your data quality along the way.
To get at the root causes of data quality problems, you sometimes have to go outside the technical systems and include social and organizational dimensions like peer pressure and performance incentives. We discussed a research framework that captures the root causes of data quality problems in a holistic manner (see all the links to the discussed papers below).
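The representativeness check from the third phase above can be sketched as follows (a minimal illustration with hypothetical numbers; the 50% warning threshold is an assumption, not a recommendation from the café):

```python
def check_representativeness(total_cases, cleaned_cases, threshold=0.5):
    """Warn when data cleaning removed so many cases that the
    remaining log may no longer represent the whole process.
    The threshold is an assumed, adjustable cut-off."""
    retained = cleaned_cases / total_cases
    if retained < threshold:
        return (f"Only {retained:.0%} of cases remain after cleaning - "
                "re-check whether the log is still representative.")
    return f"{retained:.0%} of cases retained."

# Hypothetical example: 90% of the cases were removed due to quality problems.
print(check_representativeness(total_cases=10_000, cleaned_cases=1_000))
```

In the 90%-removed scenario from the discussion above, such a check would flag that the remaining 10% cannot simply be assumed to represent the entire process.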
Finally, we took a step back and looked at the broader field of data governance, where data quality is just one aspect. Niels and Kanika shared an example from ongoing research that reveals that process mining-specific approaches are needed in other data governance areas as well.[^1]
Thanks again to Kanika and Niels and all of you for joining us!
Links
Here are the links that we mentioned during the session:
-
The Six Primary Dimensions for Data Quality Assessment by the DAMA UK Working Group on ‘Data Quality Dimensions’ describes the general data quality dimensions of completeness, consistency, uniqueness, validity, accuracy, and timeliness.
-
Wanna improve process mining results? by J.C. Bose and W.M.P. van der Aalst was the first process mining-specific data quality problem categorization.
-
The paper Event Log Imperfection Patterns for Process Mining by S. Suriadi, R. Andrews, M.T. Wynn and A.H.M. ter Hofstede builds on this categorization and identifies common data quality patterns based on practical experience.
-
Our data quality checklist for process mining helps you find and clean problems that are common in event log data.
-
The paper Enhancing event log quality: Detecting and quantifying timestamp imperfections by D. Fischer, K. Goel, R. Andrews, C. van Dun, M.T. Wynn and M. Röglinger presents an automated approach for detecting and quantifying timestamp-related issues.
-
The article DaQAPO: Supporting flexible and fine-grained event log quality assessment by N. Martin, G. Van Houdt and G. Janssenswillen introduces an R-package that supports event log quality assessments.
-
The paper Collaborative and Interactive Detection and Repair of Activity Labels in Process Event Logs by S. Sadeghianasl, A.H.M. ter Hofstede, S. Suriadi and S. Turkay shows how gamification and crowdsourcing can be leveraged for repairing activity labels.
-
The article Quality-Informed Process Mining: A Case for Standardised Data Quality Annotations by K. Goel, S.J.J. Leemans, N. Martin and M.T. Wynn proposes data quality annotations for event logs.
-
The paper An expert lens on data quality in process mining by R. Andrews, F. Emamjome, A.H.M. ter Hofstede and H. Reijers investigates root causes for data quality problems (see also our Process Mining Café with Hajo Reijers about this topic).
You can contact us anytime via cafe@fluxicon.com if you have questions or suggestions for the café.
[^1]: This study is currently under review and is not publicly available yet. We will link to the paper here once it becomes available. You can also follow Niels on Twitter to keep up with their research.