This is Flux Capacitor, the company weblog of Fluxicon.
You can find more articles here.

You should follow us on Twitter here.

Data Quality Problems In Process Mining And What To Do About Them — Part 5: Same Timestamps for Different Activities 4

Same Timestamps

This is the fifth article in our series on data quality problems for process mining. Make sure you also read the previous articles on formatting errors, missing data, Zero timestamps, and wrong timestamp configurations.

In the previous article on wrong timestamp configurations we have seen how timestamp problems can influence the process flows and the process variants. One reason for why timestamps can cause problems is that they are not sufficiently different. For example, if you only have a date (and no time) then it may easily happen that two activities within the same case happen on the same day. As a result you don’t know in which order they actually happened!

Take a look at the following example: We can see a simple document signing process with four activities and three cases.

Same Timestamps example

The order of the rows in each case is arbitrary. When importing this data set, the sequence of events is determined based on the timestamps. For example, the sequence of the steps ‘Created’ and ‘Sent to Customer’ for case 1 is reversed (compared to the original file), because the dates reflect that the two steps have happened in the opposite order (see screenshot below).

Case 1: First two events are re-sorted based on timestamp (click to enlarge)

However, if two activities happen at the same time (on the same day in this example), then Disco does not know in which order they actually occurred. So, it keeps the order in which they appear in the original file. Because the order of the activities in the example file is random, this creates some additional variation in the process map (and in the variants) that should not be there.

For example, the three cases in the above example come from a purely sequential process. However, because sometimes multiple steps happen on the same day, and the order between them is arbitrary, you can see some additional interleavings in the process map. They reflect the different orderings of the same timestamp activities in the file (see screenshot below).

Same Timestamps introduce more complexity than is necessarily there (click to enlarge)

So, if you don’t have sufficiently fine-granular timestamps to determine the order of all activities, or if you have many steps in your process occurring exactly at the same time, it often creates more complexity than is already there. What can you do to distinguish the real process complexity from the one just caused by the same timestamp problem?

How to fix: You can either leave out events that have the same timestamps by choosing a “representative” event (see strategy 1 below), or you can try pre-sorting the data (see strategies 2-4 below) to reduce the variation that is caused by the same timestamp activities.

Strategy 1: “Representative” (Leaving out events)

The reason for ‘Same Timestamp’ activities is not always an insufficient level of granularity in the timestamp pattern. Sometimes, it is simply a fact that many events are logged at the same time.

Imagine, for example, a workflow system in a municipality, where the service employee types in the new address of a citizen who moved to a new apartment. After the street, street number, postal code, city, etc., fields in the screen have been filled, they press ‘Next’ to finalize the registration change and print the receipt.

In the history log of the workflow system, you will most likely see individual records of the changes to each of these fields (for example, a record of the ‘Old value’ and the ‘New value’ of the ‘Street’ attribute). However, all of them may have the same timestamp, which is the time when the employee pressed the ‘Next’ button and the data field changes were all finalized (at once).

Below, you can see another example of a highly automated process. Many steps happen at the same time.

Same Timestamps: Choose representative event (click to enlarge)

However, you may not need all of these detailed events and can choose one of them to represent the whole subsequence. For example, in the case below the first of the four highlighted events could stand for the sequence of four. You can deselect the other steps via the ‘Keep selected’ option in the Attribute filter.

In general, focusing on just a few – the most relevant – milestone activities is one of the most effective methods to trim down the data set to more meaningful variants if you have too many – See also Strategy No. 9 in this article about How to Manage Complexity in Process Mining.

Strategy 2: Sorting based on sequence number

Sometimes you actually have information about in which sequence the activities occurred in some kind of sequence number attribute. This is great, because you can now sort your data set based on the sequence number (see below) and avoid the whole Same Timestamp Activities problem altogether.

Same Timestamps: Sorting based on sequence number (click to enlarge)

Because Disco uses the sequence from the activities in your original file for the events that have the same timestamp, this pre-sorting step will influences the order in which the variants and the process flows are formed and, therefore, fix the random order of the Same Timestamp Activities.

Strategy 3: Sorting based on activity name

Of course you don’t always have a sequence number that you can rely on for sorting the data. So what else can you do?

Another way that often helps is that you can pre-sort the data simply based on the activity name. The idea is that at least the activities that have the same timestamp (and are sometinmes in this and sometimes in that order) are now always in the same order, even if the order itself does not make much sense.

This is easy to do: Simply sort the data based on your activity column before importing it. However, sometimes this strategy can also backfire, because you may – accidentally – introduce wrong orders in same timestamp activities that by coincidence were fine before.

For example, consider the outcome of sorting the data based on activity name for the document signing process above:

Same Timestamps: Sorting based on activity name (click to enlarge)

It has helped to reduce the variation in the beginning of the process, but at the same time it has introduced a reverse order for the activities ‘Document Signed’ and ‘Response Received’ for case 3 (which have the same timestamps but were in the right order by coincidence in the original file).

Strategy 4: Sorting based on ideal sequence

To influence the order of the Same Timestamp Activities in the “right” way, you can analyze those process sequences in your data that are formed by actual differences in the timestamp. You can also talk to a domain expert to help you understand what the ideal sequence of the process would be.

For example, if you look at case 2 in the document signing process, then you can see that the sequence is fully determined by different timestamps (see screenshot below).

Same Timestamps: Understanding the ideal sequence (click to enlarge)

We are now going to use this ideal sequence to influence the sorting of the original data. One simple way to do it is to pre-face the activity names by a sequence number reflecting their place in the ideal sequence (i.e., ‘1 – Created’, ‘2 – Sent to Customer’, ‘3 – Response Received’, and ‘4 – Document Signed’) by using Find and Replace.

Same Timestamps: Pre-facing the activity names with a sequence number (click to enlarge)

After adding the sequence numbers, you can simply sort the original data by the activity column (see below).

Same Timestamps: Sorting based on ideal sequence reduces unwanted variation (click to enlarge)

This will bring the activities in the ideal sequence. When you now import the data in Disco, you should only see deviations from the ideal sequence if the timestamps actually reflect that.

Comments (4)

[…] article) Part 2: Missing Data Part 3: Zero Timestamps Part 4: Wrong Timestamp Configuration Part 5: Same Timestamp Activities Part 6: To be […]

Great article Anne. I have also run across the situation where the activities are associated with both course time stamp granularity and sequence numbers. One technique I’ve used in the past is to append the time with the sequence number as a millisecond / second / minute value before I import the data set into disco.

David Hudgins
AT&T

Yes, that’s another possibility. Thanks for the addition, David!

Great article Anne,

Alphabetical will work for now, but good to know about other solutions.

Marc


Leave a reply