Data Suitability Checklist for Process Mining

Once you start looking for process mining data within your organization, you will come across data sets and need to determine whether they are suitable for process mining or not.

Perhaps you have found an existing report and want to see if that data extract is usable for your process mining project. Or you have requested a data set from your IT department and now need to judge whether it fulfills the requirements for a process mining analysis.

What exactly do you need to look for? Here is a checklist with the questions that you can go through to assess the suitability of your data. You can also download this PDF version to print it out and check off each point.

Checklist Data Suitability

  1. Structured data? Do you have data with columns and rows?

  2. Case ID, Activity, and Timestamp columns available? Do you have at least one column each that can serve as your case ID, your activity name, and your timestamp? A timestamp is not needed if your data is already sorted (see point 5).

  3. Same case ID in multiple rows? Does the same case ID show up in more than one row at least sometimes? If each row has a unique case ID, your data is either not usable or you may need to reformat it.

  4. Different activities in the same case? Does the activity name change at least sometimes within the same case? If the activity field does not change over time, it does not contain the history and you need to look for another activity column.

  5. Different timestamps in the same case? Does the timestamp change at least sometimes within the same case? If the timestamp field does not change over time, it does not contain the history and cannot be used as your timestamp column. You can import your data without timestamps if it is already sorted.

  6. Date and time in one column? Are the date and the time portion of your timestamp placed in the same column? Because you can have multiple timestamps, each timestamp needs to be contained in a single column.

  7. Data in one file? If your data was distributed across multiple files (for example, because it comes from different IT systems), have you combined it into one file?

  8. Different timestamp patterns in separate columns? If you have timestamps with different timestamp patterns, are they placed in different columns?

  9. Activity names human-readable? Are your activity names understandable (not just a numeric value like an action code, or a transaction number)?

  10. Activity names generalized enough? Does the same activity in another case have the same activity label (not just a free-text field that is filled differently every time)?
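Points 3 to 5 can also be checked automatically before you look at the data by hand. The following is only a minimal sketch in Python, not part of Disco: the function name `check_event_log`, the column names `case`, `activity`, and `time`, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict


def check_event_log(rows, case_col, activity_col, timestamp_col):
    """Answer checklist points 3, 4, and 5 for a list of row dictionaries.

    All names here are hypothetical; adapt the column names to your data.
    """
    row_counts = defaultdict(int)   # number of rows per case ID
    activities = defaultdict(set)   # distinct activity names per case ID
    timestamps = defaultdict(set)   # distinct timestamp values per case ID

    for row in rows:
        case = row[case_col]
        row_counts[case] += 1
        activities[case].add(row[activity_col])
        timestamps[case].add(row[timestamp_col])

    return {
        # Point 3: does the same case ID show up in more than one row, at least sometimes?
        "same_case_id_in_multiple_rows": any(n > 1 for n in row_counts.values()),
        # Point 4: does the activity name change within the same case, at least sometimes?
        "activities_change_within_case": any(len(a) > 1 for a in activities.values()),
        # Point 5: does the timestamp change within the same case, at least sometimes?
        "timestamps_change_within_case": any(len(t) > 1 for t in timestamps.values()),
    }


# Made-up sample data for illustration only.
sample = [
    {"case": "A1", "activity": "Create order", "time": "2024-01-05 09:00"},
    {"case": "A1", "activity": "Approve order", "time": "2024-01-05 11:30"},
    {"case": "A2", "activity": "Create order", "time": "2024-01-06 08:15"},
]

result = check_event_log(sample, "case", "activity", "time")
for check, passed in result.items():
    print(f"{check}: {'Yes' if passed else 'No'}")
```

To run the same checks on a CSV file, the rows could be read with `csv.DictReader` from the Python standard library and passed to the function as a list.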

Can you answer ‘Yes’ to all of the points above? Then you can import your data into Disco and continue by checking the quality of your data before starting the actual process mining analysis.
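If the answer to point 6 is 'No' because the date and the time sit in two separate columns, they can be merged into one column before the import. This is a minimal sketch under stated assumptions: the helper `merge_date_time` and the column names `date`, `time`, and `timestamp` are hypothetical and need to be adapted to your data.

```python
def merge_date_time(rows, date_col, time_col, target_col="timestamp"):
    """Return new rows in which the date and time columns are joined into one column."""
    merged = []
    for row in rows:
        # Copy every column except the two that are being merged.
        new_row = {k: v for k, v in row.items() if k not in (date_col, time_col)}
        # Join date and time with a single space, e.g. "2024-01-05 09:00".
        new_row[target_col] = f"{row[date_col]} {row[time_col]}"
        merged.append(new_row)
    return merged


# Made-up sample data for illustration only.
split_rows = [
    {"case": "A1", "activity": "Create order", "date": "2024-01-05", "time": "09:00"},
    {"case": "A1", "activity": "Approve order", "date": "2024-01-05", "time": "11:30"},
]

merged_rows = merge_date_time(split_rows, "date", "time")
for row in merged_rows:
    print(row)
```

The original rows are left unchanged, so the merged result can be compared against the source before it is written to a new file.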

Anne Rozinat

Market, customers, and everything else

Anne knows how to mine a process like no other. She has conducted a large number of process mining projects with companies such as Philips Healthcare, Océ, ASML, Philips Consumer Lifestyle, and many others.