Assess the Suitability of Your Data

Once you start looking for process mining data within your organization, you will be faced with data sets for which you need to determine whether they are suitable for process mining or not.

Perhaps you have found an existing report and want to see if that data extract is usable for your process mining project. Or you have requested a data set from your IT department (see Checklist: Prepare Your Own Data Extraction) and now you need to assess whether it fulfills the requirements for a process mining analysis.

What do you need to check? Here are some common questions you can go through and some typical situations you might encounter. For each question, we let you know when you can use the data “as is”, when it is not usable, and when it is usable after making some data transformations. Make sure to also look at the Detect and Fix Data Quality Problems chapter afterwards.

Structured Data?

The first thing that you need to check is whether you have structured data. Especially for technical people, the term “event log” that is often used to describe a process mining data set can be misleading. They might think of a more technical log, like a web server log.

The problem is that technical logs are often unstructured in the sense that they do not contain clear columns and rows. For example, Figure 1 shows such a log file snippet. These log files are written by the computer but they are typically meant to be read by a human (for example, by a programmer to debug a problem).

../_images/Log-Data-1.png

Figure 1: Unstructured log data like this one is not suitable for process mining.

Instead, you need structured data with columns and rows in Excel, as a CSV or text file, or in a database. It does not matter what the delimiting character is. For example, Disco can read CSV files where the delimiting character is a comma (","), a semicolon (";"), a tab ("\t"), or a pipe ("|") character (see also Required format for CSV, Excel and TXT Files). But you need structured data to be able to analyze it with a process mining tool.

In Figure 2 you can see an example data snippet that is suitable for process mining, both opened in Excel (left side) and in a text editor (right side).

../_images/Log-Data-2.png

Figure 2: You need structured data with columns and rows in Excel, as a CSV file, or in a database.

If you currently have unstructured data then you first need to pre-process the data to get it into a structured, CSV-like format.
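Exactly how to do that depends entirely on your log format, but as an illustration, here is a minimal Python sketch (using a purely hypothetical log line format and file names) that parses free-text log lines into a structured CSV file with case ID, timestamp, and activity columns:

  import csv
  import re

  # Hypothetical log line format: "ORDER-360 2016-03-14 09:30:01 Order created"
  line_pattern = re.compile(r"(ORDER-\d+) (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.+)")

  with open("application.log") as log_file, \
       open("event_log.csv", "w", newline="") as csv_file:
      writer = csv.writer(csv_file)
      writer.writerow(["Case ID", "Timestamp", "Activity"])  # header row
      for line in log_file:
          match = line_pattern.match(line.strip())
          if match:  # skip lines that do not fit the expected pattern
              writer.writerow(match.groups())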

Minimum Requirements Fulfilled?

The next thing you need to check is whether your data fulfills the minimum requirements for process mining (see also The Minimum Requirements for an Event Log). This means that:

  1. You need at least one column that can be used as case ID (see also Case ID).
  2. There is at least one column that can be used as activity name (see also Activity).
  3. If you don’t have a timestamp column: Are the rows sorted in the order in which the activities occurred? If not, you will need a sequence number (or a timestamp) to sort them (see also Timestamp).
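If your rows are not yet in the right order and you have no timestamp, you can often use an existing sequence number (for example, an auto-incrementing event ID) to sort them before the import. Here is a minimal sketch with the pandas Python library; the file names, the ';' delimiter, and the column names ‘Case ID’ and ‘Event ID’ are placeholders that you would adapt to your own data:

  import pandas as pd

  df = pd.read_csv("event_log.csv", sep=";")

  # Sort the events per case by their sequence number so that the row order
  # reflects the order in which the activities actually occurred
  df = df.sort_values(["Case ID", "Event ID"])
  df.to_csv("event_log_sorted.csv", sep=";", index=False)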

Same Case ID in Multiple Rows?

For the column that you want to use as case ID, check whether the same case ID appears in multiple rows. You can do this by sorting the data set based on the case ID column (first make a copy of your file to preserve the original order of the data set).

What you would like to see is that there are at least some case IDs that occur in multiple rows. For example, in Figure 2 you can see that lines 2 to 12 all refer to the same order number ‘Case360’.
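If you prefer not to sort the file manually, you can also run this check with a few lines of pandas (a sketch; the file name, the ';' delimiter, and the column name ‘Case ID’ are placeholders for your own data):

  import pandas as pd

  df = pd.read_csv("event_log.csv", sep=";")

  # Count how many rows each candidate case ID covers
  counts = df["Case ID"].value_counts()
  print(counts.head(10))  # the case IDs with the most rows
  print((counts > 1).sum(), "case IDs appear in more than one row")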

If you find that every case ID appears exactly once, then you can check the following:

  • Is your data set structured in such a way that you have one case per row and the activity timestamps in separate columns as shown in Figure 3?

    Then you can use it for your process mining analysis, but you need to restructure your data from a column-based format to a row-based format first (a sketch of this restructuring follows after this list). See an example for how to do that and learn more about the data quality problems in column-based data sets in the chapter on Missing Timestamps For Activity Repetitions.

../_images/MissingTimestampsActivityRepetitions-1.png

Figure 3: Example data set where the activities are in columns rather than rows.

  • If your activity information is not in a column-based format and you still have just one case ID per row in your data set, then most likely the field that you thought could be your case ID is just an event ID and does not help you to correlate the steps that belong to the same process instance.

    Look through your data set to see if you can find another field that does repeat across multiple rows (for example, a resource ID), which could be used as a case ID dimension instead.

  • If you don’t find any case ID candidate column where the same ID appears in more than one row, then your data set is too aggregated: You just have one event for each case (probably the end result or the current status), and this is not enough to discover the process flow.

    Go back to your IT department and ask them for historical data (see also Checklist: Prepare Your Own Data Extraction).
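Coming back to the first case above (one case per row with the activity timestamps in separate columns, as in Figure 3): here is a minimal pandas sketch of the restructuring from a column-based to a row-based format. The activity column names ‘Received’, ‘Checked’, and ‘Approved’, as well as the file names and the ';' delimiter, are hypothetical placeholders:

  import pandas as pd

  df = pd.read_csv("cases.csv", sep=";")

  # Unpivot the activity timestamp columns into one row per case and activity
  long_df = df.melt(
      id_vars=["Case ID"],                             # columns to keep as-is
      value_vars=["Received", "Checked", "Approved"],  # hypothetical activity columns
      var_name="Activity",
      value_name="Timestamp",
  )

  # Drop activities that never happened (empty timestamps) and sort per case
  long_df = long_df.dropna(subset=["Timestamp"]).sort_values(["Case ID", "Timestamp"])
  long_df.to_csv("event_log.csv", sep=";", index=False)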

Activity History?

When you look for a field that can be your activity name, you may encounter a situation like the one shown in Figure 4: The status is the same for each event in the case.

../_images/MissingActivityHistory.png

Figure 4: It looks like this data set has a column that can be used for the activity name, but the status does not change over time.

In this situation, you do have a column that tells you something about the process step, or the status, of each case. However, you don’t have the historical information about the status changes that happened over time. Often, such a field contains only the current status (or the last activity that happened) for each case, and this is not enough for process mining, where you need the history of the activities.

If the activity name or status column never changes over the course of a case, then you cannot use the column as your activity name. You need to go back to the system administrator and ask them whether you can get the historical information on this field.

You can also look for other columns in your data set to see whether they contain information that does change over time (like an organizational unit or resource, so that you can analyze the transfers of work between different departments or people).
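A quick way to check which columns actually change over the course of a case is to count the distinct values per case for each column, for example with pandas (a sketch with placeholder file, delimiter, and column names). The same check also works for the timestamp columns discussed in the next section:

  import pandas as pd

  df = pd.read_csv("event_log.csv", sep=";")

  # For each column, count in how many cases the value changes at least once
  for column in df.columns:
      if column == "Case ID":
          continue
      changing_cases = (df.groupby("Case ID")[column].nunique() > 1).sum()
      print(f"{column}: changes within {changing_cases} cases")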

Timestamp History?

The same can happen with the timestamp fields. At first, it might seem as if you had many different timestamp columns in your data set. But does any of them change over time for the same case? Or are they all the same like in Figure 5?

../_images/MissingTimestamps.png

Figure 5: It looks like this data set has a column that can be used as a timestamp, but the timestamp does not change over time.

If your timestamp field never changes over the course of a case, then it is a data field, but not a timestamp field in the sense that you need for your process mining analysis. If you only have timestamp columns that never change, then effectively you don’t have a timestamp column at all.

If your data is sorted in such a way that the events are in the right order, then you can still use it for your process mining analysis. When you import a data set into Disco without configuring a timestamp column, the order of the events in the source data is used to order them (see also Import your data without a timestamp configuration). So, even without a timestamp you can still analyze the process flow and the variants, based on the sequence information in the imported data set, but you won’t be able to do a performance analysis.

Date and Time in one Column?

You can have more than one timestamp column (see also Multiple Timestamp Columns). If you configure multiple columns as a ‘Timestamp’ column, Disco will parse each of them to determine the start and completion times for the activities. Because each timestamp column is expected to contain a complete timestamp on its own, Disco cannot get the full timestamp if your source data has one timestamp distributed over two columns (see the Date and Time columns in Figure 6).

../_images/DateAndTime.png

Figure 6: If you have your date and time in multiple columns, you need to combine them into one column before importing your data set in Disco.

If your data set is not too big, you can simply open it in Excel and create a third column (see Date & Time in Figure 6) that combines the two. The CONCATENATE function in Excel can be used to do that. After you have created your combined timestamp column, you can save or export the data set as a CSV file and configure the Date & Time column as your timestamp column in Disco.

In some cases the CONCATENATE formula does not work because the Date and Time values are interpreted by Excel as a number. To solve this problem, you have two options:

  • Make sure that the Date and Time columns are interpreted as text. For example, instead of writing B3 within the formula, write TEXT(B3;"HH:MM:SS").
  • Add the two columns by using the formula =A3+B3. Then right-click on cell C3 and select ‘Format Cells’. In the dialog box, select the Number tab, choose Custom from the category list, enter the format DD/MM/YYYY HH:MM:SS under Type, and click OK.

If your data set is too big to reformat in Excel, you can use an ETL tool to combine both columns into one.
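If you prefer a script over Excel or an ETL tool, combining the two columns can also be done with a few lines of pandas (a sketch; the file names, the ';' delimiter, and the column names ‘Date’ and ‘Time’ are placeholders):

  import pandas as pd

  # Read everything as text so that the date and time values are not reinterpreted
  df = pd.read_csv("event_log.csv", sep=";", dtype=str)

  # Concatenate the date and time strings into one combined timestamp column
  df["Date & Time"] = df["Date"].str.strip() + " " + df["Time"].str.strip()
  df.to_csv("event_log_combined.csv", sep=";", index=False)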

If reformatting your data is not an option, then just use the date as your timestamp (see the Same Timestamp Activities chapter if you should end up with a lot of activities on the same date as a result).

Different Timestamp Patterns?

Especially if you combine timestamps from different IT systems in your process mining data set, they can easily come in different timestamp patterns. Luckily, Disco does not force you to provide timestamps in a specific format. Instead, you can simply tell Disco how it should read your timestamps by configuring the timestamp pattern during the import step.

This works in the following way:

  1. You select your timestamp column (it will be highlighted in blue).
  2. You press the ‘Pattern…’ button in the upper right corner.
  3. Now you will see a dialog (see Figure 7) with a sample of the timestamps in your data (on the left side) and a preview of how Disco currently interprets these timestamps (on the right side).

In most cases, Disco will automatically discover your timestamp pattern correctly. But if it has not recognized your timestamp, then you can start typing the pattern in the text field at the top. The preview is updated automatically while you are typing, so that you can check whether the date and time are picked up correctly.

You can use the legend on the right side to see which letters refer to the hours, minutes, months, etc. Pay attention to upper case and lower case, because it makes a difference: For example, ‘M’ stands for month while ‘m’ stands for minute, so the pattern dd/MM/yyyy HH:mm:ss would match a timestamp like 14/03/2016 09:30:00. The legend shows only the most important pattern elements, but you can find a full list of patterns (including examples) at [SimpleDateFormat].

../_images/Different-Timestamp-Patterns-1.png

Figure 7: Disco lets you configure the timestamp pattern of your data rather than forcing your data set to come in a particular timestamp pattern (see also Configuring Timestamp Patterns).

But what do you do if you have combined data from different sources, and they come with different timestamp patterns?

Let’s look at the example data snippet in Figure 8 and Figure 9, which contains just a few events for one case. As you can see, the first event has only a creation date and it is in a different timestamp format than the other workflow timestamps.

../_images/Different-Timestamp-Patterns-2.png

Figure 8: Example data set with two different timestamp patterns in CSV format.

../_images/Different-Timestamp-Patterns-3.png

Figure 9: Example data set with two different timestamp patterns opened in Excel.

So, how do you deal with such different timestamp patterns in your data?

In fact, this is really easy: All you have to do is make sure that the differently formatted timestamps end up in different columns. Then, you can configure a different timestamp pattern for each column.

For example, the screenshot in Figure 7 shows you the pattern configuration for the workflow timestamp. And in Figure 10 you can see the timestamp pattern for the creation date.

../_images/Different-Timestamp-Patterns-4.png

Figure 10: Configuration of the second timestamp pattern in the example data set.

Now both columns have been configured as timestamps, each with a different pattern (see Figure 11) and you can click the ‘Start import’ button. Disco will pick the correct timestamp for each event.

../_images/Different-Timestamp-Patterns-5.png

Figure 11: Two columns have been configured as timestamp and different timestamp patterns have been assigned for each of them.

So, there is no need to change the date or time format in the source data (which can be quite a headache). All you have to do is to make sure they go into different columns.

If you received a data set with different timestamp patterns in a single column, try duplicating the timestamp column and use one timestamp pattern for the first and the other timestamp pattern for the duplicated column. Disco will figure out which timestamp pattern is matched and pick one or the other accordingly.
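The duplication itself can again be done in Excel or with a small script before the import, for example (a pandas sketch with placeholder file and column names):

  import pandas as pd

  df = pd.read_csv("event_log.csv", sep=";", dtype=str)

  # Create an identical copy of the timestamp column, so that a second
  # timestamp pattern can be configured for it during the import in Disco
  df["Timestamp 2"] = df["Timestamp"]
  df.to_csv("event_log_duplicated.csv", sep=";", index=False)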

Many-to-many Relationships Between Different Case IDs?

One typical challenge in, for example, ERP systems is that the data is organized around business objects rather than processes. In this case you need to piece these business objects (for example, document types in SAP) together before you can start mining.

The first challenge is that a common case ID must be created for the end-to-end process before you can analyze the complete process with process mining. For example, the process may consist of the following phases:

  1. Sales order: traced by Sales order ID
  2. Delivery: traced by Delivery ID
  3. Invoicing: traced by Invoicing ID

To be able to analyze the complete process, all three phases must be correlated for the same case in one case ID column. For example, if a foreign key with the Sales order ID reference exists in the delivery and invoice phase, these references can be used for correlation and the case ID of the Sales order can be used as the overall case ID for the complete process.
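To make this correlation step more concrete, here is a rough pandas sketch. It assumes hypothetical file and column names, that each extract already contains ‘Activity’ and ‘Timestamp’ columns, and that the delivery and invoice extracts carry the ‘Sales order ID’ as a foreign key:

  import pandas as pd

  orders = pd.read_csv("sales_order_events.csv", sep=";")   # has a 'Sales order ID' column
  deliveries = pd.read_csv("delivery_events.csv", sep=";")  # has 'Delivery ID' and 'Sales order ID'
  invoices = pd.read_csv("invoice_events.csv", sep=";")     # has 'Invoicing ID' and 'Sales order ID'

  # Use the sales order ID as the overall case ID for all three phases
  frames = []
  for df in (orders, deliveries, invoices):
      df = df.rename(columns={"Sales order ID": "Case ID"})
      frames.append(df[["Case ID", "Activity", "Timestamp"]])

  event_log = pd.concat(frames).sort_values(["Case ID", "Timestamp"])
  event_log.to_csv("end_to_end_event_log.csv", sep=";", index=False)

Note that if a delivery refers to more than one sales order (the many-to-many situation discussed below), the delivery extract needs one row per sales order reference, so that the delivery events are automatically duplicated for each order.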

A second—and somewhat trickier—challenge is that often there is not a clear one-to-one relationship between the sub case IDs. Instead, you may encounter so-called many-to-many relationships. In many-to-many relationships each object can be related to multiple objects of the other type. For example, a book can be written by multiple authors, but an author can also write multiple books.

Imagine the following situation: A sales order can be split into multiple deliveries (see illustration on the left below in Figure 12). To construct the event log from the perspective of the sales order, in this case both deliveries should be associated with the same case ID (see middle in Figure 12). The resulting process map after process mining is shown on the right in Figure 12.

Going further down the chain, a delivery can also be split into multiple invoices, and so on. The same principle applies.

../_images/Many-To-Many-1.png

Figure 12: If you create your data set from the perspective of the sales order and there was a split delivery, two deliveries are associated to the same order.

Conversely, it may also be the case that a delivery can combine multiple sales orders (see illustration on the left in Figure 13).

In this case, again, to construct the event log from the perspective of the sales order, the combined delivery should be duplicated to reflect the right process for each case (see middle in Figure 13). As a result, the complete process is shown for each sales order and, for example, performance measurements between the different steps can be made (no performance measurements can be made in process mining between different cases).

The resulting process map is shown in Figure 13 on the right.

../_images/Many-To-Many-2.png

Figure 13: But if you create your data set from the perspective of the sales order and there was one delivery for two orders, then the delivery needs to be associated to both orders.

To illustrate what would happen when the delivery is only associated to the first sales order, consider the example in Figure 14.

It looks as if there was no delivery for sales order 2, which is not the case.

In turn, you need to be aware that the number of deliveries in the mapping above may be higher than the actual number of deliveries that took place: There were not two deliveries, just one!

../_images/Many-To-Many-3.png

Figure 14: If you would associate the delivery only to one order, it would look like one order was delivered while the second one was not.

The point is that there is no way around this. Wil van der Aalst sometimes calls this “flattening reality” [AalstChapter4] (like putting a 3D-world in a 2D-picture). You need to choose which perspective you want to take on your process.

What you can take away is the following:

  • Sometimes, multiple pieces of data need to be connected before you can start mining the end-to-end process.
  • You need to think about the perspective that you want to take on your process (for example, the sales order or the delivery perspective).
  • Often, different views can be taken during the extraction and may be needed for the analysis.

In many situations, putting together your data set from a database-centric system is not only a technical challenge: The decisions that you make along the way also have a great impact on your process mining analyses. To fully understand the choices that are involved and the impact they will have, we recommend the step-by-step guide by Mieke Jans [Jans].

‘Old Value’ Vs. ‘New Value’ Format?

Some systems (for example, database histories or CRM audit trail tables) will give you the status changes not in a single ‘Activity’ or ‘Status’ column but in a ‘New value’ and ‘Old value’ format.

Take a look at the example data snippet in Figure 15. Instead of one Activity or Status column, you have two columns showing the “old” and the “new” status. For example, in line no. 2 the status is changed from ‘New’ to ‘Opened’ in the first step of case 1.

../_images/OldNew-1.png

Figure 15: Data set in ‘Old Value’ / ‘New Value’ format.

The question is how to deal with data in this format when you import it into the process mining tool. Here are three possible solutions.

Solution 1

Should you use both the ‘Old value’ and the ‘New value’ column as the activity column and join them together?

This would be solution no. 1 and leads to the process picture shown in Figure 16.

../_images/OldNew-2.png

Figure 16: You could combine the ‘Old Value’ and ‘New Value’ columns to get a detailed process view of all the combinations.

All combinations of old and new statuses are considered here. This can make sense, but it quickly leads to inflated process maps with many different activity nodes for all the combinations.

Solution 2

Normally, you would like to see the process map as a flow between the different status changes. So, what happens if you just choose the ‘Old value’ as the activity when importing your data set?

You would get the process map shown in Figure 17.

../_images/OldNew-3.png

Figure 17: If you use the ‘Old Value’ column as the activity name then you will see all status changes except the last one in the process map.

The process map shows the process flow through the different status changes as expected, but there is one problem: You miss the very last status in every case (which is recorded in the ‘New value’ column).

For example, for case 2 the process flow goes from ‘Opened’ directly to the end point (omitting the ‘Aborted’ status it changed into in the last event).

Solution 3

Alternatively, you can import just the ‘New value’ column as the activity column and get the process map shown in Figure 18.

../_images/OldNew-4.png

Figure 18: If you use the ‘New Value’ column as the activity name then you will see all status changes except the first one in the process map.

This way, you see all the different end points of the process. For example, some cases end with the status ‘Closed’ while others end as ‘Aborted’. But now you miss the very first status of each case (the ‘New’ status).

In this example, all cases change from ‘New’ to ‘Opened’. So, missing the ‘New’ in the beginning is less of a problem compared to missing the different end statuses. Therefore, solution 3 would be the preferred solution in this case. But in other situations, the opposite might be the case.

Filtering Based on Endpoints

Note that you can still use the values of the column that you did not use as the activity name to filter incomplete cases with the Endpoints Filter.

For example, if you used Solution 2 (see above) but want to remove all cases that ended with ‘New value’ = ‘Aborted’, you can configure the desired end status based on the ‘New value’ attribute with the Endpoints filter as shown in Figure 19.

../_images/OldNew-5.png

Figure 19: Even if you don’t see either the first or the last status change in the process map, you can still filter based on their value with the Endpoints Filter.

Summary

In summary, what you can take away from this is the following:

  • If you encounter the ‘Old value / New value’ situation, often just using one of the two columns is preferred to get the expected view of status changes in the process map.
  • If you choose the ‘Old value’ column, you will lose the very last status change in each case.
  • If you choose the ‘New value’ column, you will miss the very first status in each case.
  • You can still filter start and end points based on the attribute column that you did not use for the activity name.

In most situations, this is enough and you can use your ‘Old value / New value’ data just as it is. If, however, you really need to see both the very first and the very last status in your process flow, then you would need to reformat your source data into the standard process mining format and add the missing start or end status as an extra row.
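For example, if you chose the ‘New value’ column as your activity name (Solution 3) and also want to see the initial status of each case, a pandas sketch along the following lines (with placeholder file and column names) could add that missing first status as an extra row per case:

  import pandas as pd

  df = pd.read_csv("audit_trail.csv", sep=";")
  df = df.sort_values(["Case ID", "Timestamp"])

  # Build one extra row per case from the 'Old value' of its first event,
  # so that the initial status (e.g. 'New') also shows up in the process map
  start_rows = df.drop_duplicates(subset="Case ID", keep="first").copy()
  start_rows["New value"] = start_rows["Old value"]

  # Put the added start rows in front of the existing events of each case
  # (the stable sort keeps them before the first event with the same timestamp)
  df = pd.concat([start_rows, df]).sort_values(["Case ID", "Timestamp"], kind="mergesort")
  df.to_csv("audit_trail_with_start.csv", sep=";", index=False)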

Here are two additional tips about ‘Old value / New value’-formatted data sets:

  • If you have a timestamp in your data set, then how you should interpret the delays in your process map depends on whether you choose the ‘Old value’ or the ‘New value’ column as your activity name (see also Different Moments in Time).
  • Often, there are additional rows in the data set that indicate other data field changes in addition to the status changes in the process. For example, if the customer address field was changed, then the audit trail will show the old address and the new address in the ‘Old value / New value’ format. To remove such data change events that are less relevant from a process perspective, make sure that you include an additional column in your data set that tells you which field was changed (for example, the ‘Address’ or the ‘Status’ field). You can then use the Attribute Filter in ‘Keep selected’ mode to keep only the status changes and remove the events that refer to changes you are less interested in.
[Jans] Mieke Jans. From Relational Database to Valuable Event Logs For Process Mining Purposes: A Procedure. Hasselt University, Belgium, 2016. URL: https://files.fluxicon.com//Articles/Mieke-JANS-PROCEDURE.pdf
[AalstChapter4] Wil van der Aalst. Process Mining book, Chapter 4: Getting the Data. URL: https://www.slideshare.net/wvdaalst/process-mining-chapter04gettingthedata
[SimpleDateFormat] Documentation and examples for the Java SimpleDateFormat on Oracle's documentation page. URL: http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html