Data Requirements¶

One of the big advantages of process mining is that it starts with the data that is already there, and usually it starts very simple. There is no need to first set up a data collection framework. Instead you can use data that accumulates as a byproduct of the increasing automation and digitization of your business processes. These data are collected right now by the various IT systems you already have in place to support your business.

Sometimes people are worried that they do not have the right data, but in practice this is rarely the case. There are so many workflow systems, CRM systems, ERP systems, delivery notes, request, complaint, ticketing, or order systems, etc. — so most organizations have lots of data.

The starting point for process mining is a so-called event log. But what exactly is an event log? Where do event logs come from? And how do you know whether your data satisfies the requirements to apply process mining? This is what this introductory chapter is about.

What you will learn:

What kind of data is needed to do a process mining analysis.

The three key requirements for an event log.

How much data you need.

How difficult it is to get data for process mining.

The Mental Model for Process Mining¶

The core idea of process mining is to analyze data from a process perspective. You want to answer questions such as “What does my As-is process currently look like?”, “Is there waste, or are there unnecessary steps that could be eliminated?”, “Where are the bottlenecks?”, and “Are there deviations from the rules and prescribed processes?”.

To be able to do that, process mining approaches data with a mental model that maps the data to a process view.

To understand what this means, let us first take a look at another mental model: The mental model for classification techniques in data mining.

Imagine that you have a widget factory and you want to understand which kinds of customers are buying your widgets. On the left in Figure 1, you see a very simple example of a data set. There are columns for the attributes Name, Salary, Sex, Age, and Buy widget. Each row forms one instance in the data set. An instance is a learning example that can be used for learning the classification rules.

Figure 1: Data mining example: The classification target class needs to be configured.

Before the classification algorithm can be started, one needs to determine which of the columns is the target class. Because we want to find out who is buying the widgets, we would make the Buy widget column the classification target. A data mining tool would then be able to construct a decision tree like the one depicted on the right in Figure 1.

The result shows that only males with a high salary are buying the widgets. If we wanted to derive rules for another attribute, for example to predict how old the customers who buy our widgets typically are, then the Age column would be the classification target.

For process mining, we have a slightly different mental model, because we look at the data from a process perspective.

In Figure 2, you see a simplified example data set from a call center process. In contrast to the data mining example above, an individual row does not represent a complete process instance, but just an event. Because a data set that is used for process mining consists of events, this kind of data is often referred to as an event log. In an event log:

Each event corresponds to an activity that was executed in the process.

Multiple events are linked together in a process instance or case.

Logically, each case forms a sequence of events—ordered by their timestamp.

From the data sample in Figure 2, you can see why even doing simple process-related analyses, such as measuring the frequency of process flow variants, or the time between activities, is impossible using standard tools such as Excel. Process instances are scattered over multiple rows in a spreadsheet (not necessarily sorted!) and can only be linked by adopting a process-oriented meta model.

Figure 2: Process mining input data: Case ID, Activity and Timestamp need to be identified.

For example, if you look at the highlighted rows 6-9 in Figure 2, you can see one process instance (case9705) that starts with the status Registered on 20 October 2009, moves on to At specialist and In progress, and ends with the status Completed on 19 November 2009.
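The grouping and ordering that this mental model implies can be sketched in a few lines of Python. This is only an illustration, not how any particular process mining tool works internally. The rows are hypothetical: only the first and last date of case 9705 are taken from the text above, while the intermediate dates and the second case (9701) are made up.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event rows (case ID, activity, timestamp), deliberately unsorted.
# Only the start and end date of case 9705 come from the text; the rest is made up.
events = [
    ("9705", "In progress",   "2009-11-02"),
    ("9701", "Registered",    "2009-10-19"),
    ("9705", "Registered",    "2009-10-20"),
    ("9705", "Completed",     "2009-11-19"),
    ("9701", "Completed",     "2009-10-21"),
    ("9705", "At specialist", "2009-10-26"),
]

def build_traces(rows):
    """Group events by case ID and order the events of each case by timestamp."""
    cases = defaultdict(list)
    for case_id, activity, ts in rows:
        cases[case_id].append((datetime.strptime(ts, "%Y-%m-%d"), activity))
    return {cid: [activity for _, activity in sorted(evs)] for cid, evs in cases.items()}

traces = build_traces(events)
print(traces["9705"])  # ['Registered', 'At specialist', 'In progress', 'Completed']
```

Once the rows are linked per case like this, questions such as "which sequences occur how often?" become simple counting exercises.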

The basis of process mining is to look at historical process data precisely with such a “process lens”. The basic concept is actually quite simple, and it is one of the big advantages that process mining does not depend on specific automation technology or specific systems. It is a source system-agnostic technology, precisely because it is centered around the process-oriented mental model explained above. This way, it can be applied to a wide range of processes, including but not limited to customer service processes, system monitoring, healthcare, IT services, enterprise or financial processes.

The Minimum Requirements for an Event Log¶

According to the mental model described before, you need to identify at least the following three elements: Case ID, Activity, and Timestamp (see Figure 3). These three elements allow you to take a process perspective on the data.

Figure 3: The three minimum requirements for process mining: A Case ID, an Activity name and at least one Timestamp column.

In the rest of this section, these key ingredients are described in more detail.

Case ID¶

A case is a specific instance of your process. What precisely the meaning of a case is in your situation depends on your process. For example:

In a purchasing process, the handling of one purchase order is one case.

In a hospital, this would be the patient going through a diagnosis and treatment process.

In a call center process, a case would be related to a particular service request number.

For every event, you have to know which case it refers to, so that the process mining tool can compare several executions of the process to one another. So, you need to have one or more columns that together uniquely identify a single execution of your process. They form a case identifier (case ID).

Note

The case ID determines the scope of the process.

Be aware that the case ID influences your process scope. It determines where your process starts and where it ends. In fact, there may be more than one way to set up your case ID.

For example, in a customer service process you could set up the case ID in two different ways. First of all, you might see the processing of a particular service request (SR) as the process you want to analyze. Then the SR number is your case ID (see Figure 4).

Figure 4: Each service request number (SR1 and SR2) will be interpreted as a separate case (note how the start and the end frequencies at the dashed lines indicate that there are two cases).

At the same time, you may want to see the overall process for a customer as your process scope—the same customer may have gone through multiple service requests over time. Then the customer ID is your case ID (see Figure 5).

Figure 5: If the customer ID is used as a case, then all five steps in the example data set are mapped to a single case.

Both alternatives are logical and can make sense, depending on your analysis goals. In your project you can take different views on the process and analyze it from different perspectives (see also Take Different Perspectives On Your Process). The important part for now is that you have at least one column that can be used to distinguish your process instances and serve as a case ID.
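To make the effect of the case ID choice concrete, here is a minimal sketch with hypothetical data that mirrors the situation in Figures 4 and 5: one customer with two service requests and five steps in total. The column layout and the identifiers (C1, SR1, SR2) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (customer ID, service request number, activity).
rows = [
    ("C1", "SR1", "Registered"),
    ("C1", "SR1", "In progress"),
    ("C1", "SR1", "Completed"),
    ("C1", "SR2", "Registered"),
    ("C1", "SR2", "Completed"),
]

def group_by_case(rows, case_column):
    """Group the activities by the column (given as index) chosen as case ID."""
    cases = defaultdict(list)
    for row in rows:
        cases[row[case_column]].append(row[2])
    return dict(cases)

by_request = group_by_case(rows, case_column=1)   # SR number as case ID
by_customer = group_by_case(rows, case_column=0)  # customer ID as case ID

print(len(by_request), len(by_customer))  # 2 1
```

The same five rows yield two short cases or one long case, which is exactly the difference in process scope described above.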

Activity¶

An activity forms one step in your process. For example, a document authoring process may consist of the steps ‘Create’, ‘Update’, ‘Submit’, ‘Approve’, ‘Request rework’, ‘Revise’, ‘Publish’, ‘Discard’ (performed by different people such as authors and editors). Some of these steps might occur more than once for a single case while not all of them need to happen every time.

There should be names for different process steps or status changes that were performed in the process. If you have only one entry (one row) for each case, then your data is not detailed enough. Your data needs to be on the transactional level (you should have access to the history of each case) and should not be aggregated to the case level.
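A quick sanity check for this requirement is to count the rows per case. The following sketch assumes that you have already read the case ID column into a plain list; the function name is hypothetical.

```python
from collections import Counter

def looks_aggregated(case_ids):
    """Return True if every case has exactly one row, i.e. there is no history per case."""
    counts = Counter(case_ids)
    return bool(counts) and all(n == 1 for n in counts.values())

print(looks_aggregated(["A", "B", "C"]))       # True: one row per case
print(looks_aggregated(["A", "A", "B", "B"]))  # False: several events per case
```

If the check returns True, the data is most likely aggregated to the case level and you need to look for a more detailed source.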

Events can sometimes record not only activities you care about, but also less interesting technical information. Look for events which describe the interesting activities for your process from the business perspective. While you can also filter out less relevant events later in the analysis, it is important to make sure that the relevant process steps are captured in your data.

Note

The activity names determine the steps in your process map and their granularity.

Be aware that the chosen activity influences the level of granularity with which you are looking at your process. Again, there may be multiple columns that can be combined to form the activity name, and there may be multiple alternative views on what constitutes an activity.

For example, in the document authoring process above, you might have additional information in an extra column about the level of required rework (major vs. minor) in the ‘Request rework’ step. If you just use the regular process step column as your activity, then ‘Request rework’ will show up as one activity node in your process map (see Figure 6).

Figure 6: Just using the process step name gives you a high-level view of the process.

If, however, you decide to include the level of rework (major or minor) in the activity name, then two different process steps ‘Request rework - major’ and ‘Request rework - minor’ will appear in the process map (see Figure 7).

Figure 7: Including the rework type in the activity name produces a more fine-granular view of the process.

Like before, both alternatives can make sense. During your analysis you can take different views to answer different questions (see also Take Different Perspectives On Your Process). To get started, it is important that you have at least one column that can be used to distinguish your process steps and serve as an activity name.
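The two views from Figures 6 and 7 can be illustrated with a small sketch that builds the activity name either from the process step alone or from the step combined with the rework level. The rows and the function name are hypothetical.

```python
# Hypothetical rows: (process step, rework level).
# The level is only filled in for 'Request rework' steps.
rows = [
    ("Create", ""),
    ("Submit", ""),
    ("Request rework", "major"),
    ("Revise", ""),
    ("Submit", ""),
    ("Request rework", "minor"),
    ("Revise", ""),
    ("Submit", ""),
    ("Approve", ""),
]

def activity_name(step, level, detailed):
    """Return the activity name, optionally refined by the rework level."""
    if detailed and level:
        return f"{step} - {level}"
    return step

coarse = {activity_name(step, level, detailed=False) for step, level in rows}
fine = {activity_name(step, level, detailed=True) for step, level in rows}

print(len(coarse), len(fine))  # 5 6
print(sorted(fine - coarse))   # ['Request rework - major', 'Request rework - minor']
```

The coarse view has one 'Request rework' activity, while the detailed view splits it into two distinct activities.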

Timestamp¶

The third important prerequisite for process mining is to have at least one timestamp column that indicates when each of the activities took place. This is not only important for analyzing the timing behavior of the process but also to establish the order of the activities in your event log.

Note

If you don’t have a sequentialized log file, you need timestamps to determine the order of the activities in your process.

Sometimes, you have a start and a complete timestamp for each activity in the process. This is good. It allows you to analyze the processing time of an activity (the time someone actively spent on performing that task), also called execution time or activity handling time. Refer to Including Multiple Timestamp Columns to learn how to include multiple timestamps in Disco.

If you have just one timestamp then you can still analyze the time between two process steps, but you will not be able to see how long each of the activities took (the processing time will appear to be instant). Don’t worry if that’s the case for your data set. In fact, many systems just record one timestamp and you will be able to perform most of the analyses that you want to do.
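The difference between the two situations can be sketched as follows. The example assumes a single hypothetical case with a start and a complete timestamp per activity. With only one timestamp per activity, the processing time cannot be computed and only the time between steps remains.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Hypothetical events of one case: (activity, start, complete).
case = [
    ("Create",  "2024-03-01 09:00", "2024-03-01 09:30"),
    ("Submit",  "2024-03-01 11:00", "2024-03-01 11:05"),
    ("Approve", "2024-03-04 10:00", "2024-03-04 10:20"),
]

def parse(ts):
    return datetime.strptime(ts, FMT)

# Processing time: complete minus start of the same activity.
processing = {activity: parse(end) - parse(start) for activity, start, end in case}

# Time between steps: start of the next activity minus completion of the previous one.
waiting = [
    (prev[0], nxt[0], parse(nxt[1]) - parse(prev[2]))
    for prev, nxt in zip(case, case[1:])
]

print(processing["Create"])  # 0:30:00
print(waiting[0])            # ('Create', 'Submit', datetime.timedelta(seconds=5400))
```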

If you don’t have any timestamp at all, there is a good chance that you can still perform a process mining analysis: Check whether the order of the events in your file corresponds to the order in which the steps took place (or whether you can find a sequence number that you can use to bring them into the right order). The timestamp is the only one of the three requirements that you can leave out when importing your data. Disco will simply use the order of the rows in your data file to determine the sequence of the steps for each case.

Finally, don’t worry about the format of the timestamps. Disco can adapt to your timestamp pattern and there is no need to bring your timestamps into a fixed format (see Configuring Timestamp Patterns for further details).

Other Columns¶

Additional columns may be available for your process and we recommend including them. Such additional attributes provide context and can be used in the analysis as well. For example, there may be attributes that provide important information that is necessary to answer the questions that you have about your process.

Which attributes are relevant for you depends on your domain and use case. Some typical additional attributes are:

What kind of product the service request in a call center was about (or the order in a sales or repair process). Include this attribute if you want to compare the performance for different product categories.

There may be process categories that are already defined. For example, in IT services there are different processes for managing incidents, change orders, and for fulfillment or field service. By including the process type you can separate the data and analyze the corresponding processes in isolation.

The channel through which a lead came in (email or ad campaign, coupon, etc.) is often relevant for sales processes. Similarly, for repair services new requests may come in through the dealer, the call center, or the web portal.

Processes can vary for different partners. For example, you may want to compare the process at different repair shops in a service process.

Domain-specific characteristics are influencing processes: In a repair service, there are different requirements for warranty vs. out-of-warranty repairs. In a hospital, the disease of a patient determines the precise diagnosis and treatment process, and so on.

Which person or department handled the activity. This information is needed for organizational handover analysis, which may reveal communication patterns and inefficiencies at the hand-over points between organizational units.

If you are analyzing data from a multinational company and want to compare processes in different countries, then the country information needs to be pulled out of the source data as well.

The value of an order is relevant for many purchasing processes, because depending on the amount of money that is involved different anti-fraud rules will apply.

These are just examples. Include all attributes that you find relevant because they can improve the significance and value of your analysis.

How Much Data Do You Need?¶

Once you know what kind of data you need, you often wonder how much data you should extract for your process mining analysis. Contrary to typical data mining and statistics techniques, there is no real minimum amount of data that is necessary to get a process mining result. Even if you just had, say, five cases, you could put this tiny data sample into the process mining tool and it would show you which paths these five cases went through.

The main concern is to get a representative sample of data for your process. The amount of data you should extract also depends on the questions that you want to answer. For example, if you want to understand the typical process, then, at a certain point, adding more data will not give you any new insights. However, if you are looking for exceptions or irregularities that are important from a compliance angle, you probably want to check the data of the whole audit year to catch everything that went wrong in the audited period.

There are two main ways to extract data for process mining, which we will look at in the following.

All data in a certain timeframe¶

The best way to extract data for process mining is to get all recorded activities over a certain time period as illustrated in Figure 8.

Figure 8: Extracting all data recorded in a specific time period (here 3 months).

After the data has been imported, you will notice that some cases are incomplete. This means that either the beginning or the end of the case is missing (due to the selected timeframe). For example, Case 1 and Case 10 lack some activities at the start and Case 5 and Case 10 miss some activities at the end. However, you can simply use the Endpoints Filter in Disco to remove incomplete cases and focus on the full process.
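The idea behind removing incomplete cases can be sketched as follows. This is not Disco's implementation of the Endpoints Filter, only an illustration of the principle. The traces and the sets of valid start and end activities are assumptions.

```python
# Hypothetical traces (case ID -> ordered activities) cut from a fixed time window.
traces = {
    "Case 1": ["In progress", "Completed"],                # start is missing
    "Case 2": ["Registered", "In progress", "Completed"],  # complete
    "Case 5": ["Registered", "In progress"],               # end is missing
}

START_ACTIVITIES = {"Registered"}  # assumed valid start points
END_ACTIVITIES = {"Completed"}     # assumed valid end points

def complete_cases(traces, starts, ends):
    """Keep only cases whose first and last activity are valid endpoints."""
    return {
        case_id: activities
        for case_id, activities in traces.items()
        if activities and activities[0] in starts and activities[-1] in ends
    }

print(list(complete_cases(traces, START_ACTIVITIES, END_ACTIVITIES)))  # ['Case 2']
```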

Getting all data in a certain timeframe is the preferred way to extract data for process mining, because it gives you a full picture about everything that is going on in the selected time period.

If you want to use this method, the question is: how long should the timeframe be?

Your main concern should be to get a representative number of cases that fall completely within the captured time period. As a rule of thumb, it is recommended to try to get data for at least 3 months. Depending on the run time of a single process instance it may be better to get data for up to a year (or more). For example, if your process usually needs 5–6 months to complete (think of a legal case in court, or a public building permit process), then a 3-month-long sample will not get you even one complete process instance.

So, it really depends on how long a case in your process is typically running. You want to get a representative set of cases and you need to keep some room to catch the usual few, long-running outlier instances as well.

If you are still unsure how much data you need to extract, use the following formula based on the expected throughput time for your process:

Formula: timeframe = expected completion time x 4 x 5

The baseline is the expected process completion time for a typical case. The x 4 ensures that you have enough data to see four cases that were started and completed one after another (of course there will be others in between). The x 5 accounts for the occasional long-running cases (20/80 rule) and makes sure that cases taking up to five times longer than expected still fit into the extracted time window.

For example, if the expected completion time of a typical case in your process is 5 days, then the formula yields 100 days = 5 days x 4 x 5, which is approximately 3 months of data. If, however, a typical process is completed in just a few minutes, then extracting a couple of hours of data may be enough.
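The rule of thumb is easy to turn into a small helper. The function and parameter names below are hypothetical; the two factors are the ones from the formula above.

```python
from datetime import timedelta

def extraction_timeframe(expected_completion, sequential_cases=4, outlier_factor=5):
    """timeframe = expected completion time x 4 x 5 (rule of thumb from the text)."""
    return expected_completion * sequential_cases * outlier_factor

# A typical case takes 5 days: about 3 months of data.
print(extraction_timeframe(timedelta(days=5)).days)  # 100

# A typical case takes 10 minutes: a few hours of data.
print(extraction_timeframe(timedelta(minutes=10)))   # 3:20:00
```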

This formula is just a starting point. The more you know about your process, the better you will be able to judge the amount of data you should extract.

All cases started or completed in a certain timeframe¶

In many situations it is not possible to get all data in a large timeframe. One reason can be that due to the long process run times the timeframe would be too big. Another reason can be that you have to follow your cases through different places in a database (see also How Difficult Is it to Get Data For Process Mining?) and it is simply not possible to extract everything.

In this case, it is best to select all cases that were either started or completed in a certain timeframe.

Figure 9 illustrates the scenario of extracting all cases that were started in a timeframe of one month. The start point of the case, which is the timestamp of the first activity in the process, determines whether the case is selected or not.

This is the most suitable method if you are interested in analyzing “fall-out” in your process. For example, in an order handling process you are most likely interested in seeing which customer orders, or purchase interests, were actually completed (rather than discontinued). So, you would extract all orders that have been created in a certain time period.

Figure 9: Extracting all cases that were started in a certain timeframe (here 1 month).

Figure 10 illustrates the scenario of extracting all cases that were completed in a timeframe of one month. The end point of the case, which is the timestamp of the last activity in the process, determines whether the case is selected or not.

This is the most suitable method if you are interested in analyzing compliance questions in your process. For example, in a purchasing process you may want to find out whether all invoices that were paid in a particular month went through the required approval process. So, you would extract all activities related to invoices that have been paid in a certain time period. [1]

Figure 10: Extracting all cases that were completed in a certain timeframe (here 1 month).

The advantage of selecting cases based on a start or end point is that you will extract all activities that occurred for the selected cases. For example, note that Case 5 was included in its entirety in the scenario of Figure 9. In contrast, the activities of Case 5 that occurred outside of the selected timeframe in the extraction scenario in Figure 8 were “chopped off”. Similarly, Case 1 was included in its entirety in the scenario of Figure 10, while the activities that occurred before the selected timeframe were left out in the scenario of Figure 8.
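The three extraction scenarios can be compared side by side in a small sketch. The events and the one-month window are hypothetical, and the case names only loosely follow the figures. For simplicity, the sketch treats the earliest and latest event of a case as its start and end point.

```python
from datetime import date

# Hypothetical events: (case ID, activity, date).
events = [
    ("Case 1", "Registered", date(2024, 1, 20)),
    ("Case 1", "Completed",  date(2024, 2, 5)),
    ("Case 2", "Registered", date(2024, 2, 10)),
    ("Case 2", "Completed",  date(2024, 2, 20)),
    ("Case 5", "Registered", date(2024, 2, 25)),
    ("Case 5", "Completed",  date(2024, 3, 15)),
]
WINDOW = (date(2024, 2, 1), date(2024, 2, 29))

def in_window(d, window=WINDOW):
    return window[0] <= d <= window[1]

def events_in_window(events):
    """All events recorded in the window; cases at the boundary get truncated."""
    return [e for e in events if in_window(e[2])]

def cases_by_endpoint(events, endpoint):
    """All events of cases whose first (min) or last (max) event lies in the window."""
    dates = {}
    for case_id, _, d in events:
        dates.setdefault(case_id, []).append(d)
    selected = {cid for cid, ds in dates.items() if in_window(endpoint(ds))}
    return [e for e in events if e[0] in selected]

started = cases_by_endpoint(events, min)    # cases started in the window
completed = cases_by_endpoint(events, max)  # cases completed in the window

print(sorted({e[0] for e in started}))    # ['Case 2', 'Case 5']
print(sorted({e[0] for e in completed}))  # ['Case 1', 'Case 2']
print(len(events_in_window(events)))      # 4
```

In the window-based extraction, Case 1 loses its first event and Case 5 its last one, whereas the endpoint-based selections always return complete cases.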

However, ideally you can choose a large enough timeframe according to the scenario of Figure 8 that these boundary cases are insignificant. The advantage of extracting all activities in a certain timeframe is that this method preserves the full range of analysis possibilities. You can still analyze both fall-out and compliance questions by separating cases that were started or completed in a certain (sub) timeframe with the Timeframe Filter in Disco later on.

Whichever data selection method was used, you should know how the data was extracted, because you need to take that information into account when you interpret the results. For example, if only cases started or completed in a certain timeframe were extracted, then you need to be aware that activities related to other cases may be performed in the timeframe that you are analyzing. However, you will not see these activities in your data set. Therefore, any analysis related to the work load and utilization of resources will be distorted.

How Difficult Is it to Get Data For Process Mining?¶

Understandably, one of the first questions a new process miner has is “How much effort will it be to get data for process mining?”. Unfortunately, there is no simple answer to this question. It depends.

Figure 11: Some systems provide a history that can directly be used for process mining, while for others you first have to create your event log from custom queries.

Essentially, there are two types of systems (see also Figure 11):

Systems that provide a history. On the easier end of the spectrum are systems that record case histories. For example, Customer Relationship Management (CRM), IT Service Management (ITSM), order management and workflow systems typically fall into this category.

Often, these systems have either an audit trail, a history table, or history logs available in one form or another. For example, in an ITSM system each case has a ticket number and status changes are recorded for each incident, change request or problem ticket it relates to.

In many situations, these history tables can be readily exported as a CSV file and directly imported in the process mining tool without any pre-processing.

Systems for which you have to construct the history. On the more difficult end of the spectrum you will find database-centric systems like Enterprise Resource Planning (ERP) systems and some types of legacy systems. In these systems, the case histories are not readily available. You need to find the relevant data in the business database tables and create your own event log for the analysis.

However, you can start simple in the following way: Ask the process owner to draw a simple process map on the whiteboard that contains just the most important 5-10 activities in the process. You can then go to the IT department and ask them to look for timestamps related to these activities in the business database. The thing is that you are not typically looking for technical logs but you want to analyze the process from a business perspective. And almost all the important process milestones will be recorded in the business data in some way. For example, in a sales process the date when a sales proposal was sent to the prospect will be recorded somewhere. Ask them to extract the timestamps for these activities together with the case information and you have your event log ready for process mining.

Do not limit yourself just to systems that easily provide a history for export, or put it off until you have modern systems that record perfect data. The beauty of process mining is that it works for any IT system through the simple meta model (Case ID, Activity name, and Timestamp) discussed above. Especially for older systems the insights can be even greater, because the processes are not explicit and workflows are hard-coded in the application. We have seen some amazing process mining analyses of legacy systems from the 1980s, which were certainly not built to support process mining.

Another challenge that you may encounter is that the process is supported by multiple IT systems. In this situation, the data is also distributed over these different systems. For example, the data about the purchasing process in the Hands-on Tutorial came from just one system. However, you can easily imagine a situation where the purchase requests are handled in the ERP system and the invoices are managed through a separate financial system.

If you want to combine data from two (or more) systems, the most important thing to look for is how you can follow a case across these different systems. For example, if the ERP system uses a purchase order number for each case and the financial system uses an invoice number for each case, then you need to find a back reference in the invoice to the purchase order it refers to. Also here you can start simple: For a very first test start with just one system (knowing that some parts of the process cannot be shown yet). This way, you can quickly start to learn more about process mining, demonstrate what is possible, and include more data in the next iteration.
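Following a case across two systems via a back reference can be sketched like this. All identifiers, activity names, and the layout of the extracts are hypothetical. The sketch assumes that each invoice refers to exactly one purchase order.

```python
# Hypothetical extracts from two systems: (document number, activity, ISO date).
erp_events = [
    ("PO-1", "Create purchase order", "2024-01-03"),
    ("PO-1", "Approve purchase order", "2024-01-05"),
    ("PO-2", "Create purchase order", "2024-01-04"),
]
finance_events = [
    ("INV-9", "Receive invoice", "2024-01-20"),
    ("INV-9", "Pay invoice", "2024-02-01"),
]
# Back reference stored with the invoice: invoice number -> purchase order number.
invoice_to_po = {"INV-9": "PO-1"}

def combine(erp, finance, back_ref):
    """Re-key the finance events to the purchase order number and merge both logs."""
    merged = list(erp)
    for invoice, activity, ts in finance:
        po = back_ref.get(invoice)
        if po is not None:  # invoices without a back reference are skipped
            merged.append((po, activity, ts))
    # ISO dates sort chronologically as strings.
    return sorted(merged, key=lambda e: (e[0], e[2]))

log = combine(erp_events, finance_events, invoice_to_po)
print([activity for case_id, activity, _ in log if case_id == "PO-1"])
```

After the merge, the purchase order number serves as the case ID for the events from both systems.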

Note that especially for ERP systems, you often need to follow your case through cross-references between different document types (like, purchase order, delivery, invoice, etc.) even though you are extracting the data from a single system. Take a look at the 2015 Process Mining Camp presentation by Mieke Jans, which describes how you create an event log from any ERP system in ten steps [2].

Be aware that the preparation of data for process mining often involves choices (see also How Much Data Do You Need?). For example, you may find situations where there are many-to-many relationships between sales orders and deliveries. A many-to-many relationship means that there is no 1-to-1 mapping between one sales order and one delivery. Instead, multiple orders can be combined in one delivery or a single order may be split up into multiple deliveries. In such situations you need to make decisions [3] about how to prepare the data, depending on what kind of process mining analysis you want to do. You may even have to create multiple data sets to create different views and answer different questions. Therefore, even if you did not extract the data yourself, it is very important to understand how the data was extracted (and the choices that were made).
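The following sketch shows why such a choice matters. It takes a hypothetical set of links between sales orders and deliveries and builds two alternative views, one per case notion. It is only one possible way to look at the problem and does not reproduce the approach from the article in [3].

```python
# Hypothetical many-to-many links between sales orders and deliveries.
links = [("SO-1", "D-1"), ("SO-2", "D-1"), ("SO-2", "D-2")]

def documents_per_case(links, case_side):
    """Attach each linked document to the chosen case notion (0 = order, 1 = delivery)."""
    other_side = 1 - case_side
    cases = {}
    for link in links:
        cases.setdefault(link[case_side], []).append(link[other_side])
    return cases

order_view = documents_per_case(links, case_side=0)
delivery_view = documents_per_case(links, case_side=1)

print(order_view)     # {'SO-1': ['D-1'], 'SO-2': ['D-1', 'D-2']}
print(delivery_view)  # {'D-1': ['SO-1', 'SO-2'], 'D-2': ['SO-2']}
```

In the order view, delivery D-1 shows up in two cases, so its activities would be counted twice. In the delivery view, the same happens to order SO-2. Neither view is wrong, but you need to know which one you are looking at when you interpret frequencies.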

Don’t be frightened by all these challenges. Often, people are very concerned about the data at first but then they find out that much more data is available than they initially thought. Watch out for data warehouses in your organization that have been set up to serve data analytics teams in another context. If nothing works, start by manually recording 10-20 cases from the operational system in the following way: Look up the history for each individual case through the user interface of the system used by the people who work in this process and write up the activities and their timestamps in an Excel sheet by hand.

There is no need to get the perfect data right from the beginning. Especially when you are still learning about process mining, it is best to first get a data sample quickly and see which insights you can get from it and what the limitations are. Then iterate and get more or better data in the next step.

To get started, take a look at the data extraction checklist in the next chapter (see Checklist: Prepare Your Own Data Extraction).

Footnotes

[1]

Note that the illustration in Figure 10 assumes that all ten cases have actually reached the selected end activity (like, for example, “Pay invoice”). If there were purchase orders that had not reached this end point, then they would not be selected even if their last activity fell into the selected time period. That is why fall-out (cases that have stopped in the middle of the process) cannot be analyzed well with this extraction method.

[2]	Mieke Jans, From Relational Database to Valuable Event Logs for Process Mining Purposes: A Procedure: https://files.fluxicon.com//Articles/Mieke-JANS-PROCEDURE.pdf (Watch Mieke’s talk at Process Mining Camp in 2015 here: https://fluxicon.com/camp/2015/mieke).

[3]	Article on Dealing with Many-To-Many Relationships in Process Mining: https://fluxicon.com/blog/2015/04/dealing-with-many-to-many-relationships-in-data-extraction-for-process-mining/.