This is Flux Capacitor, the company weblog of Fluxicon.

Data Quality Problems In Process Mining And What To Do About Them — Missing Complete Timestamps for Ongoing Activities

This is the 13th article in our series on data quality problems for process mining. You can find an overview of all articles in the series here.

If you have both ‘start’ and ‘complete’ timestamps in your data set, you can sometimes encounter situations where the ‘complete’ timestamp is missing for activities that are still running.

For example, take a look at the data snippet below. Two process steps were performed for case ID 1938. The second activity that was recorded for this case is ‘Analyze Purchase Requisition’. It has a ‘start’ timestamp, but the ‘complete’ timestamp is empty because the activity has not completed yet (it is ongoing).

Missing Complete Timestamp

In principle, this is not a problem. After importing the data set, you can simply analyze the process map, the variants, etc., as you usually would. When you look at a concrete case, the duration of any activity that has not completed yet is shown as “instant” (see the history for case ID 1938 in the screenshot below).

Activity duration is instant

However, this does become a problem when you analyze the activity duration statistics (see screenshot below). The “instant” durations are included in the calculation, so they distort the mean and the median duration of the activity. You therefore want to exclude the activities that are still ongoing from the activity duration statistics.

The activity duration statistics are affected by this
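To make the effect concrete, here is a minimal sketch with made-up numbers. It assumes that an “instant” duration is counted as zero; the durations and the number of ongoing events are hypothetical and not taken from the data set shown above.

```python
from statistics import mean, median

# Hypothetical durations (in minutes) of completed instances of one activity.
completed = [30, 45, 60, 90]

# Hypothetical number of events that have a start but no complete timestamp.
ongoing_count = 2

# Assumption: each ongoing event shows up as "instant", i.e. with duration 0.
with_instant = completed + [0] * ongoing_count

print("completed only:", mean(completed), median(completed))
print("with instant:  ", mean(with_instant), median(with_instant))
```

In this toy example, both the mean and the median drop once the zero-length entries are included, although no activity actually got any faster.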

How to fix:

  1. Import your data set again and configure only the complete timestamp as a ‘Timestamp’ column (keep the start timestamp column as an attribute via the ‘Other’ configuration). This will remove all events for which the complete timestamp is missing.
  2. Export your data set as a CSV file and import it into Disco again, this time with both the start and the complete timestamp columns configured as ‘Timestamp’ columns.

Your activity duration statistics will now only be based on those activities that actually have both a start and a complete timestamp.
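If you prefer to clean the file before importing it at all, the same filtering can be approximated outside of Disco. The following is only a sketch of that alternative, not the procedure described above. The column names, the first activity name, and all timestamps in the sample are assumptions made for illustration, and `drop_ongoing` is a hypothetical helper.

```python
import csv
import io

# Made-up sample in the spirit of the snippet above. The column names, the
# first activity name and all timestamps are assumptions for illustration.
SAMPLE = """Case ID,Activity,Start Timestamp,Complete Timestamp
1938,Create Purchase Requisition,2011-01-03 09:00,2011-01-03 09:20
1938,Analyze Purchase Requisition,2011-01-03 10:15,
"""


def drop_ongoing(rows, complete_col="Complete Timestamp"):
    """Keep only the rows whose complete timestamp is non-empty."""
    return [row for row in rows if (row.get(complete_col) or "").strip()]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
finished = drop_ongoing(rows)

# Write the filtered rows back out as CSV text that can then be imported.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(finished)
print(out.getvalue())
```

With a real file, you would read from and write to files on disk instead of `io.StringIO`, and adjust `complete_col` to the name of your own complete timestamp column.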

