Process Mining in The Assurance Practice — Applications and Requirements

This is a guest article by Suzanne Stoof and Nils Schuijt from KPMG and Bas van Beek from PGGM based on an article that has previously appeared in Compact magazine. If you have a guest article or process mining case study that you would like to share as well, please contact us via

PGGM provides Assurance Standard 3402 and Assurance Standard 3000 reports that are specific for each customer. Within PGGM, process mining is used to show that a number of processes can also be tested for multiple clients at once because these processes are generic for multiple pension funds.

We describe PGGM's experiences with process mining based on a practical example. Specifically, we describe the impact on the auditor's work for the Assurance Standard 3402 and Standard 3000 reports, and the conditions that have to be met. We also outline how process mining can be deployed to perform the audit more efficiently and with higher quality in the future.


PGGM is one of the largest pension administration organizations in the Netherlands. It is responsible for the management of the pension administration for multiple pension funds, including the Pension Fund Care and Welfare (PFZW). To demonstrate to its customers that processes are controlled properly, the PGGM Service Organization Control (SOC) provides reports in accordance with the Assurance Standard 3402 and the Assurance Standard 3000. These Assurance Standard 3402 and Standard 3000 Reports are provided specifically for each pension fund.

PGGM and their auditors have discussed the options that may exist to shape the process of testing the internal control measures for the SOC reporting more efficiently. PGGM wants to keep providing separate Assurance Standard 3402 and Standard 3000 reports per pension fund. To be able to test a process in a multi-client fashion, it is important that it can be demonstrated that these processes and corresponding control measures are performed in a generic way for all pension funds. In this context, process mining can help by showing that certain processes are indeed performed in the same way for multiple pension funds. That is why PGGM started to experiment with process mining. Their aim was to achieve both more efficiency and a higher quality for their audits.

Process mining in the audit practice

Within the audit practice [Rama16], process mining can be deployed during multiple phases in the audit process:

  1. During walkthroughs. For this, process mining is used to visualize the walkthrough based on the event data. The advantage of this is that not only the happy flow but all possible paths within a process are mapped.
  2. As a basis for sampling or partial observations. By doing this, it is possible to audit only items with a higher risk, for example, because they do not follow the happy flow, but go through an alternative path.
  3. For compliance checking. With this, control measures like a four-eyes principle can be tested in a process for the entire population, for example.

Process mining was initially deployed to perform a line audit of the processes at four PGGM customers. Subsequently, these four process flows were put next to each other to demonstrate that each of the four pension funds follows exactly the same steps within the process.

Experiences with process mining at PGGM

PGGM has established a multidisciplinary process mining project team with expertise in the domain of pension processes as well as in process analysis and data analysis.

The first phase of the experiment was focused on the exploration of the possibilities of process mining and the tooling. The added value of process mining quickly became visible as it provided insight into the actual execution of the processes, including the bottlenecks. For instance, it became clear that activities existed that were forwarded many times and without any need, and that the waiting times at the transfer of work between departments were long. PGGM was able to solve these bottlenecks by a redesign of the process flow. Other examples of initiated process improvements are:

  • reduction of the lead time and the creation of customer value by the elimination of activities that do not provide added value to the process;
  • realization of better process control by insight in first time right;
  • design of a multi-client process execution instead of a fund-specific implementation;
  • application of Robotic Process Automation in processes. This means that repeating human activities within administrative processes are performed by software robots.

The next step was to examine how process mining can be deployed to obtain insight into process controls. That was performed based on the principles that process mining results in:

  1. a more efficient implementation of the controls;
  2. time-saving of the audit work for the second and third level;
  3. in the long term probably a greater assurance, because entire populations are checked instead of partial observations.

Process mining can provide additional certainty, because it is based on a comprehensive analysis of the entire population. Therefore, the selection of partial observations, which is often the current methodology, becomes superfluous. Instead, all the activities and underlying relations in the entire population are shown. One example of the application of process mining on an entire population is the confirmation whether all letters sent to the participants were checked by an employee. Another example is the check whether for each change a segregation of duties rule was followed.
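A check on the entire population, such as the segregation of duties example above, can be sketched in a few lines of Python. This is only an illustration and not the analysis PGGM performed: the event log, the activity names 'Change entered' and 'Change approved', and the employee names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical event log: (case_id, activity, employee).
# Activity and employee names are illustrative assumptions.
events = [
    ("chg-1", "Change entered", "alice"),
    ("chg-1", "Change approved", "bob"),
    ("chg-2", "Change entered", "carol"),
    ("chg-2", "Change approved", "carol"),  # same person did both steps
    ("chg-3", "Change entered", "dave"),    # never approved
]


def four_eyes_violations(events, first="Change entered", second="Change approved"):
    """Return the case IDs where the second step is missing or was
    performed by someone who also performed the first step."""
    performers = defaultdict(lambda: defaultdict(set))
    for case_id, activity, employee in events:
        performers[case_id][activity].add(employee)

    violations = []
    for case_id, by_activity in performers.items():
        enterers = by_activity.get(first, set())
        approvers = by_activity.get(second, set())
        if not approvers or enterers & approvers:
            violations.append(case_id)
    return sorted(violations)


violations = four_eyes_violations(events)
print(violations)
```

Because every case in the log is evaluated, no sample has to be drawn; only the listed exceptions need to be followed up.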

Limiting factors of process mining are often (as PGGM experienced as well) that the data architecture is not designed for simple use of process mining. The data preparation takes a lot of time, because the required information is stored in different systems. Furthermore, not all manual activities within the workflow system are logged, which means that not all processes can be covered by the data. A well-structured data-architecture is essential to make optimal use of a process mining tool.

The case of the ‘Disbursement’ process

We explain the application of process mining at PGGM in more detail based on a practical example: the Disbursement process.

The starting point for the process mining analysis was a consultation with all parties involved in the disbursement process within PGGM. The purpose of this consultation was to determine the viability of the multi-client execution of the audit work. As a result of the consultation, it has been concluded that the Disbursement process would be eligible to be performed multi-client. The actual viability should, inter alia, be demonstrated by process mining.

In the disbursement process, the pension rights and awards of participants are converted into an actual disbursement. An important part therein is the conversion of the gross amount awarded to the net disbursement rights: The gross/net-calculation. Furthermore, the process includes various checks and authorizations that are necessary due to the nature of the process. The disbursement process includes three main activities (see Figure 1).

Figure 1: The disbursement process


The first step in the analysis was the creation of an event log. As data source, the payment and financial systems were used. Subsequently, the data was loaded into the process mining-tool.

The first results based on the event log were not satisfactory yet, upon which the event log was enriched with data from other sources, where the auditor was able to follow the data trail. In the end, the final event log that was created resulted in the overview as shown in Figure 2.

Figure 2: Outcome process mining


The outcome of the analysis in Figure 2 shows that the process flows of the four pension funds A, B, C, and D are identical. First, a gross file is generated in the system in which the pension rights are administered (process step: Gross file). The gross file records the gross pension rights. In the next step, the gross pension rights are converted into net payment rights. This calculation is performed by an external party (process step: Gross net calc). Subsequently, the net disbursement file is received back (process step: Net file). Next, checks verify that the gross/net calculation was done properly, after which the net disbursement file is authorized and approved (process step: Authorization). Finally, the disbursement is handed over to the payment department, which has the payment made by the bank (process step: Disbursement).

Another approach that can be applied is that process mining shows the entire process flow. The cases included in the ‘happy flow’ are considered to be in control. What is interesting are any exceptions that become visible. These non-‘happy flow’ paths have to be analyzed and explained, because they are undesirable in the context of process control. As visible in Figure 2, in this process, no exceptions existed.

By means of the analysis and the outcome, as shown in Figure 2, it is demonstrated that the processes are performed identically for multiple pension funds. Making use of process mining, it has been demonstrated that all activities in the process, regardless of which pension fund, follow the same process flow. For the documentation of this conclusion, a description of the log and the way of data extraction from the workflow-tool is included. It is also described which filters have been used in the process mining tool, and the controls are plotted on the process map. In addition to the use of process mining, the analysis is further substantiated by interviews with subject matter experts, a walkthrough, and inspection of, inter alia, operational instructions, policies, and manuals.
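The comparison of the process flows per pension fund can also be expressed as a small script. The sketch below is an assumption-laden illustration, not the tooling used at PGGM: the log layout (fund, case ID, step order, activity) and the generated example data are hypothetical, and only the five step names are taken from the description of Figure 2.

```python
from collections import defaultdict

# Step names follow the description of Figure 2; everything else is hypothetical.
STEPS = ["Gross file", "Gross net calc", "Net file", "Authorization", "Disbursement"]


def build_log():
    """Create an example log with two disbursement runs for each of four funds."""
    rows = []
    for fund in "ABCD":
        for run in (1, 2):
            for order, step in enumerate(STEPS):
                rows.append((fund, f"{fund}-{run}", order, step))
    return rows


def variants_per_fund(rows):
    """Return, per fund, the set of distinct activity sequences (variants)."""
    traces = defaultdict(list)
    for fund, case_id, _order, activity in sorted(rows, key=lambda r: (r[0], r[1], r[2])):
        traces[(fund, case_id)].append(activity)

    result = defaultdict(set)
    for (fund, _case_id), trace in traces.items():
        result[fund].add(tuple(trace))
    return dict(result)


variants = variants_per_fund(build_log())
# The funds follow the same flow if they all share one and the same set of variants.
identical = len({frozenset(v) for v in variants.values()}) == 1
print(identical)
```

If one fund showed an additional or different variant, the comparison would fail and that variant would have to be explained.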

Based on the experiences of PGGM, the following lessons learned were derived:

  • Ensure an appropriate design of the data architecture;
  • Take advantage of the existing knowledge in the organization and activate it. Think of data analysts, SQL-specialists, process analysts and auditors;
  • Do not only focus on process mining but make use of a combination of data analysis techniques;
  • Experiment and be receptive to new insights and techniques.

Impact on the audit work of the auditor and requirements

During the preliminary stage, PGGM and their auditors talked a lot about the conditions and opportunities to apply process mining in the context of the Assurance Standard 3402/3000-audit, to show that a certain process is generically applied for multiple pension funds.

PGGM wishes to keep the Assurance Standard 3402/3000-reports specific per pension fund. In the event that several processes will be tested multi-client, it is essential that it can be demonstrated that these processes and corresponding control measures actually take place in a generic way for all pension funds.

For this, from the auditor's point of view, a number of matters are important. They are:

  • Scoping. Beforehand, consideration should be given to the scoping, i.e. which pension funds, processes, process steps, etcetera, belong to the audit object;
  • Being able to demonstrate the reliability of the data that is used is of importance. For instance, not all systems are yet able to unlock the data that can be used for process mining;
  • Procedures other than process mining provide additional audit evidence to determine if the process and the control measures are generic, including the review of process descriptions;
  • Explanation of this approach in the Assurance Standard 3402/3000 report.

Because PGGM uses two different applications to perform the pension administrations, it was decided that a generic methodology cannot be followed for all pension funds. For the four pension funds whose pension administration is performed within one application, it was decided to investigate the multi-client approach further.

With the help of process mining, it can be demonstrated that the processes follow the same flow for all four pension funds. This shows that the processes and corresponding control measures in the application are performed in a generic way. To the auditor, it was important that PGGM had clearly documented how it came to this conclusion. Among other things, this means that PGGM had to show the auditor how it had performed the analyses using process mining, and which conclusions had been drawn. The analysis and explanation of the exceptions were then repeated by the auditor. It was also important that the reliability of the data, including the population on which the process mining was based, could be determined. It must be traceable how the data (the so-called 'information produced by the entity') was obtained from the system, and it must be shown that the data is correct and complete. For this, among other things, it must be guaranteed that no manual adjustments were made after the data was downloaded from the pension administration.

To the auditor, it is also important to confirm that the processes that will be treated as multi-client are carried out by one team, instead of by specific customer teams. Specific customer teams would imply the risk that certain audits could still be performed in another way. Based on process descriptions, we have established that there is one Shared Service Center that performs the processes in a generic way for all pension funds.

From the point of view of the auditor, it is also important that the Standard 3402/3000 report clearly explains to its users that not all processes were tested individually for that specific user, but that a number of processes were tested based on a multi-client approach. Both PGGM and the auditor clearly explain this in the report. Process mining can thus generate added value for the user of the Assurance Standard 3402/3000 report. In addition to the written explanation, it is recommended to inform the pension funds about this approach orally and in good time during periodic discussions.


Currently, we are also looking into the future and investigating, among other things, the possibilities to integrate process mining into the control measures. An example is that an employee of the pension administration uses process mining to determine whether any exceptions to the standard process exist for an entire population over a certain period. If there are exceptions, the employee analyzes them. An advantage of this method is that the complete population is considered in the execution of the control measure. Furthermore, the auditors can also base their conclusions on entire populations instead of on a number of selected partial observations.

In this way, assurance can be provided based on the complete population in an efficient way, which can also generate added value for the user of the Assurance Standard 3402/3000-report. Additionally, process mining could be deployed as a continuous monitoring tool, where the data could be loaded repeatedly to directly detect deviations within the process.
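A control that looks for exceptions to the standard process in each newly loaded batch of data could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the reference flow reuses the five step names from Figure 2, while the case IDs, the batch contents, and the function name are hypothetical.

```python
# Reference flow based on the step names in Figure 2.
HAPPY_FLOW = ("Gross file", "Gross net calc", "Net file", "Authorization", "Disbursement")


def first_deviation(trace, reference=HAPPY_FLOW):
    """Return the index of the first step that differs from the reference,
    or None if the trace matches the reference exactly."""
    for index, (seen, expected) in enumerate(zip(trace, reference)):
        if seen != expected:
            return index
    if len(trace) != len(reference):
        # One sequence is a strict prefix of the other.
        return min(len(trace), len(reference))
    return None


# Hypothetical batch for one period: case ID -> activities in time order.
batch = {
    "p-1": ["Gross file", "Gross net calc", "Net file", "Authorization", "Disbursement"],
    "p-2": ["Gross file", "Gross net calc", "Net file", "Disbursement"],  # no authorization
    "p-3": ["Gross file", "Gross net calc"],                              # not finished yet
}

exceptions = {}
for case_id, trace in batch.items():
    position = first_deviation(trace)
    if position is not None:
        exceptions[case_id] = position

print(exceptions)
```

Note that this simple comparison also flags cases that are still in progress, so a real control would have to decide how to treat unfinished cases at the end of the period.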


PGGM deployed process mining, in consultation with KPMG, during the audit of its Assurance Standard 3402 reports. This demonstrated that four of the pension funds follow the same process and use the same controls within the process. Process mining provides insight into the entire population, while the auditor usually makes use of partial observations. The next steps in the implementation of process mining at PGGM concern both the combination with other processes and the introduction of process mining as an audit tool within the Assurance Standard 3402/3000 reporting. By deploying process mining as a control, continuous monitoring also comes a step closer.

  1. [Rama16] E. Ramezani Taghiabadi, P.N.M. Kromhout, and M. Nagelkerke, Process mining: Let DATA describe your process, Compact 2016/4, 2016.

Process Mining Meets Football! How Does a Football Team Possess The Ball On The Pitch?


Finding the right perspective on the process is one of the challenges you can face when applying process mining. In most cases we already have an idea of what to expect from the process, but in some cases it is not so easy to find the perspective that yields valuable insights.

At Process Mining Camp in June 2019, Hadi Sotudeh, a PDEng student at Jheronimus Academy of Data Science (JADS), shared his experience applying process mining to the World Cup 2018 dataset. He has now written up his analysis in this article and is interested in any thoughts or feedback you may have. You can reach him on LinkedIn or via email.

If you have a guest article or process mining case study that you would like to share as well, please contact us via

The data set

What does football have to do with process mining? Nothing at all, but I noticed that the Statsbomb dataset (see Figure 1) fulfilled the requirements to at least try [1]. I was especially interested in how a football team possesses the ball on the pitch. If I could answer this question, I would be able to give the coaching staff insights into interesting patterns of play that help them develop counter strategies.

Figure 1: An example fragment of the Statsbomb data set


First, we need to introduce the ball possession process:

Ball possession process is a sequence of on-ball actions taken by one team from the beginning of possession until the end of it (losing or scoring).

For example, in the video above the ball possession starts with the defenders and, after several on-ball actions such as passes and dribbles, the forward fails to score. This ball possession sequence occurred in the match between Iran and Portugal at the World Cup 2018. The sequence, with its most important attributes after preprocessing the data set, is shown in Table 1.

Table 1: The recorded ball possession sequence in the data set


As shown in the table above, this ball possession sequence has attributes such as:

  • case ID: all actions in a ball possession sequence have the same case ID
  • action: name of the on-ball action such as pass and pressure
  • type: type of ball action such as Free Kick, Open Play, Shot Faced
  • play pattern: each sequence has one play pattern such as “Free Kick” and “From Corner”
  • recipient: the receiving player
  • start Time and end Time: start and end time of the action
  • period: which half of the match
  • duration: duration of the action
  • possession team: the team in ball possession
  • team action: the action-taking team
  • player: the action-taking player
  • body part: the body part of the action-taking player
  • start_X and start_Y: the start location of the action on the pitch with respect to a reference
  • end_X and end_Y: the end location of the action on the pitch with respect to a reference
  • result: the outcome of the action

The preprocessed process mining datasets of the World Cup 2018 can be downloaded here per match and aggregated per team.

We chose Belgium (see Figure 2) to analyze their ball possession process because they had seven matches and this, of course, will provide a richer dataset.

Mapping the ball possession process

Figure 2: Belgium in the World Cup 2018


To map the ball possession process, we need to think about how to assign the three process mining parameters: case ID, activity, and timestamp. The ball possession sequence number was already chosen as our case ID. The on-ball action was initially taken as the activity, and the start time of the action was chosen as the timestamp. After importing the dataset into Disco, all the activities and the most important paths are shown in the discovered process map (see Figure 3).

Figure 3: All activities in the ball possession process with the most important paths
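To make the mapping concrete, the sketch below assigns case ID, activity, and timestamp columns and counts how often one activity directly follows another within a case. This only illustrates the kind of information a process map is built from; it is not how Disco works internally. The column names and example rows are hypothetical, and the timestamps are assumed to be zero-padded strings that sort chronologically.

```python
from collections import Counter, defaultdict

# Hypothetical rows; column names are modelled on the attribute list above.
rows = [
    {"case_id": 1, "action": "Pass", "start_time": "00:00:05.120"},
    {"case_id": 1, "action": "Ball Receipt", "start_time": "00:00:06.400"},
    {"case_id": 1, "action": "Pass", "start_time": "00:00:07.900"},
    {"case_id": 2, "action": "Pass", "start_time": "00:00:20.000"},
    {"case_id": 2, "action": "Shot", "start_time": "00:00:22.300"},
]


def directly_follows(rows, case_col="case_id", activity_col="action", ts_col="start_time"):
    """Count how often activity A is directly followed by activity B in a case."""
    cases = defaultdict(list)
    for row in rows:
        cases[row[case_col]].append((row[ts_col], row[activity_col]))

    counts = Counter()
    for events in cases.values():
        trace = [activity for _ts, activity in sorted(events)]
        counts.update(zip(trace, trace[1:]))
    return counts


counts = directly_follows(rows)
print(counts)
```

Choosing a different column as the activity (for example the player) changes the perspective without changing the rest of the code.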


Figure 4 shows the ball possession process for 594 cases with 295 variants. The first variant includes 48 cases (only one pass), the second variant includes 38 cases (2 passes), and the third variant consists of 3 passes.

Figure 4: Top three variants of the ball possession process
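A variant is simply a distinct sequence of activities, so the variant ranking can be reproduced by counting identical traces. The following sketch uses a handful of made-up possession sequences; the case IDs and activities are hypothetical and do not come from the World Cup data.

```python
from collections import Counter


def variant_counts(traces):
    """Count how many cases share exactly the same activity sequence."""
    return Counter(tuple(trace) for trace in traces.values())


# Hypothetical possession sequences: case ID -> on-ball actions in time order.
traces = {
    1: ["Pass"],
    2: ["Pass"],
    3: ["Pass", "Pass"],
    4: ["Pass", "Carry", "Shot"],
    5: ["Pass", "Pass"],
    6: ["Pass"],
}

# Variants ordered from most to least frequent.
ranked = variant_counts(traces).most_common()
for variant, cases in ranked:
    print(cases, " -> ".join(variant))
```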


The players as activities

We can also have another perspective on the process by taking the players as activities and see the way they interact on the pitch, see Figure 5.

Figure 5: Belgium top 11 players interaction during the ball possession process on the pitch with most frequent interactions


The interaction of the Belgian players on the pitch has 594 cases with 525 variants (see Figure 6).

Figure 6: Top three variants of Belgium players’ interaction during the ball possession process on the pitch


The top variant includes 10 cases in which Kevin De Bruyne was involved in sequences with only one action. The second and third variants also consist of one action, taken by Jan Vertonghen and Thibaut Courtois respectively.

Unfolding loops

If we look back at the first perspective, where we mapped the on-ball action as the activity in the process (see Figure 3 above), we can see that there is a very dominant self-loop on the Pass activity (see Figure 7 below).

Figure 7: Self-loop in ‘Pass’ activity


The collapsing of repetitions into loops is useful in most situations, but now we want to dive deeper into the ‘Pass’ patterns. To do this, we need to “unfold” this loop.

We applied the unfolding technique described in this article. This simply means that we change a sequence from Pass, Pass, Pass (which will be collapsed into a single ‘Pass’ activity with a self-loop) to Pass1, Pass2, Pass3 (which will be shown as a sequence of ‘Pass’ activities after each other).

After adding the repetition number in a Python script, we import the data back into Disco and choose both the on-ball action and the newly added sequence number of the repetition as the activity name. The full process map (100% activities and 100% paths) is shown in Figure 8.

Figure 8: Complete unfolded process map
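The original script is not shown in this article, but the unfolding step can be sketched as follows. The sketch assumes that the repetition number counts the occurrences of an activity within one case, so that Pass, Pass, Pass becomes Pass1, Pass2, Pass3; the function name and the example sequence are hypothetical.

```python
from collections import defaultdict


def unfold(trace):
    """Attach to each activity the number of times it has occurred so far
    within the same case (assumed meaning of the repetition number)."""
    seen = defaultdict(int)
    unfolded = []
    for activity in trace:
        seen[activity] += 1
        unfolded.append((activity, seen[activity]))
    return unfolded


# One hypothetical ball possession sequence.
unfolded = unfold(["Pass", "Pass", "Carry", "Pass"])

# Joining action and repetition number gives the unfolded activity name.
labels = [f"{activity}{number}" for activity, number in unfolded]
print(labels)
```

Keeping the repetition number in a separate column, as in the tuples above, allows both columns to be selected together as the activity name during import.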


As one would expect from a football game, the process map is very complicated. By only focusing on 50% of the activities and the most important paths, we get a readable process map that we can now further analyze (see Figure 9).

Figure 9: Unfolded process map with 50% of the activities and the most important paths


Distinguishing different types of ball possession

In another exploration, we want to focus on subprocesses inside the ball possession process. In a football match, ball possessions can be of different types, such as from the goalkeeper, from a corner, from a free kick, or from a throw-in. The process after a corner will obviously differ from one that starts with the goalkeeper.

We concatenate the type of the sequence with the on-ball action to form the activity name. This added dimension makes it easier to focus only on interesting subsets of the map (see Figure 10).

Figure 10: Subprocesses of the ball possession process with the type of sequence dimension
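The concatenation, and a filter on the type of sequence, can be sketched like this. The play pattern values, the set of patterns treated as set pieces, and the underscore separator are assumptions made for the illustration; the actual values in the dataset may be spelled differently.

```python
# Assumed play pattern values that count as set pieces (hypothetical spelling).
SET_PIECES = {"From Corner", "From Free Kick", "From Throw In"}

# Hypothetical events with the sequence type stored per event.
events = [
    {"case_id": 7, "play_pattern": "From Throw In", "action": "Pass"},
    {"case_id": 8, "play_pattern": "Regular Play", "action": "Pass"},
    {"case_id": 8, "play_pattern": "Regular Play", "action": "Shot"},
]


def with_pattern_label(event):
    """Return a copy of the event with a combined activity name."""
    labelled = dict(event)
    labelled["activity"] = f'{event["action"]}_{event["play_pattern"]}'
    return labelled


labelled = [with_pattern_label(event) for event in events]

# Keep only sequences that did not start from a set piece.
open_play = [event for event in labelled if event["play_pattern"] not in SET_PIECES]
print([event["activity"] for event in open_play])
```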


The coaching staff can create a filter to remove set pieces, such as corners and free kicks, and focus only on the parts that they are interested in.

Another application of process mining is that the coaching staff can go to the case explorer and see which undesired sequences have happened on the pitch, to identify irregular patterns that need to be prevented (see Figures 11 and 12).

Figure 11: Exploring undesired sequences


Figure 12: Finding players involved in those sequences


By exploring the variants one by one on the left-hand side of Figure 11, we can see that there are 11 sequences that had only one on-ball action (Pass_Throw-in-1). Let's keep only these sequences by filtering them and drill down to find the reason.

Figure 12 shows us a picture of different players and how they performed on the selected sequences. For example, Jan Vertonghen was involved in five sequences out of 11 that ended after throwing the ball in. We can select those sequences and drill down to see what has happened (see Figure 13).

Figure 13: An example of a ball throw-in sequence that Vertonghen was involved in


By selecting each case, we can see what the other attributes of that sequence are. For example, one of the sequences happened against Panama and belongs to the second period and 73rd minute. Now, we can connect the event-log to the match video and see what has happened on the pitch.

Here, we can now go to the exact time and watch that frame carefully.

As you can see, Vertonghen tried to start the throw-in with a long pass, which was not successful, and Panama took over possession of the ball.

This way, the coaching team does not need to watch the whole match from beginning to end. They can focus only on the interesting pieces and save time. This application is also useful when your team plays against unknown teams: as a member of the coaching team, you do not need to watch all of the opponent's matches completely.


After transforming the data, we were able to explore the actions of the players but found that there was not one dominant pattern. We took other perspectives to discover patterns. For example, we were able to look at the interactions between individual players. Because football interactions do not follow a typical (standard) process, finding the right level of detail is one of the challenges in getting insights. Taking various perspectives can help to learn new things about the opponent's pattern of play, or help a team to learn from its mistakes.

As always, it is a good idea to look back and see how we came to this point. When we look back at how we defined the process, we realize that, maybe, we can further redefine the process, right?

For example, one of the next steps could be redefining the process as:

Ball possession process is a sequence of zones where the ball moves in and out on the pitch from the beginning of possession until the end of it.

This view requires defining zones on the football pitch. One example of it is shown in Figure 14.

Figure 14: Dividing the pitch into different zones (activities)


Finding out the right way to divide the pitch into meaningful zones, and relevant questions that we can answer using process mining that are also interesting for coaching staff, is what we can do next.
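As a starting point for such a zone-based view, the sketch below maps pitch coordinates to zones and turns a possession into a sequence of zones. Everything in it is an assumption made for illustration: the pitch dimensions of 120 by 80 units, the grid of 6 by 3 zones, the zone names, and the example coordinates. The division shown in Figure 14 may well be different.

```python
# Assumed pitch dimensions in the coordinate system of the data.
PITCH_LENGTH, PITCH_WIDTH = 120.0, 80.0


def zone(x, y, cols=6, rows=3):
    """Map a pitch location to a zone name in a cols x rows grid.
    Locations on the far edge are assigned to the last column or row."""
    col = min(int(x / PITCH_LENGTH * cols), cols - 1)
    row = min(int(y / PITCH_WIDTH * rows), rows - 1)
    return f"Z{row * cols + col + 1}"


def zone_trace(points):
    """Turn a list of (x, y) locations into the sequence of zones the ball
    moves through, recording a zone only when the ball enters it."""
    trace = []
    for x, y in points:
        current = zone(x, y)
        if not trace or trace[-1] != current:
            trace.append(current)
    return trace


# Hypothetical start locations of the actions in one possession.
possession = [(10, 10), (15, 12), (50, 40), (110, 70)]
print(zone_trace(possession))
```

With the zone as the activity, each ball possession becomes a path across the pitch rather than a list of actions.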

We have shown that process mining can be a powerful tool to explore football data. However, finding the right perspective to answer questions is not always obvious.

Data can be molded into multiple representations, which in turn allow us to take various perspectives of the process. Finding the right perspective is an iterative process that can be best explored by trying different things.

  1. This dataset is provided by Statsbomb for research purposes on their GitHub page.

How to Analyze Open Cases With Process Mining

This article previously appeared in the Process Mining News. Sign up here to receive more articles about the practical application of process mining.

One of the first things that you learn in the process mining methodology is how to filter out incomplete cases to get an overview about what the regular end-to-end process looks like.

Incomplete cases are process instances that have not finished yet. They are somewhere “in the middle” of the process. You typically remove such incomplete cases, for example, when you analyze the average case duration, because the case duration of incomplete cases is misleading. The time between the first and the last event in an incomplete case can appear to be very short, but in fact only a fraction of the process has taken place so far: if you had waited a few more days or weeks, more activities would likely have taken place.

But what if you are exactly interested in those incomplete cases?

For example, you may want to know how long they have been open, how long nothing has happened since the last activity or status change, and which statuses accumulate the most, and the most severe, open cases without any subsequent action. These may be cases where the customer, unnoticed by the company, has already been waiting for a long time and is about to be disappointed.

In this article, we show you how you can include the perspective of open cases in your process mining analysis. We provide detailed step-by-step instructions (download Disco if you have not done so yet) to follow along.

1. Apply filter to focus on incomplete cases

As a first step, we need to filter our data set to focus on the incomplete cases. One typical way to do that is to use the Endpoints filter, where you can first select the expected endpoints, and then invert the selection (by pressing the half-filled circle next to the search icon in the upper right corner of the filter settings).

Another way to filter incomplete cases is to check that none of the expected milestones in the process has been reached, using the Forbidden mode in the Attribute filter. For example, in a customer refund process, these milestones may be activities such as Canceled, Order completed, and Payment issued, because they indicate that the refund order is not open anymore for the customer (see below).

The difference between using the Attribute filter and using the Endpoints filter is that with the Forbidden mode of the Attribute filter we do not care about what exactly the last step in the process was. Instead, we want to base our incompleteness evaluation on the fact that a specific activity has not (yet) occurred. Read The Different Meanings of “Finished” to learn more about the differences between these definitions for complete cases.

For the refund process, we use an Attribute filter in Forbidden mode, in which we select the milestone activities that indicate a completion, a cancellation, a payment, or a rejection of the refund request. This removes all cases that have reached one of these milestones somewhere in the process. In addition, we combine this Attribute filter with an Endpoints filter that removes all refund requests for which we are currently waiting for the customer in the ‘Missing documents requested’ activity (see screenshot below).
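The logic of this filter combination can be illustrated outside of Disco with a short sketch. It is not Disco's implementation. The three milestone names and the 'Missing documents requested' activity come from the text above; the name of the rejection milestone is not given there and is therefore left out, and the example cases and the 'Refund requested' activity are hypothetical.

```python
# Milestones named in the text; a rejection milestone would be added here as well.
MILESTONES = {"Canceled", "Order completed", "Payment issued"}
WAITING_ENDPOINT = "Missing documents requested"


def open_cases(traces, milestones=MILESTONES, excluded_endpoints=(WAITING_ENDPOINT,)):
    """Keep cases that never reached a milestone (Forbidden mode) and that
    do not currently end in an activity where we wait for the customer."""
    result = {}
    for case_id, trace in traces.items():
        if milestones.intersection(trace):
            continue  # a milestone occurred somewhere in the case
        if trace and trace[-1] in excluded_endpoints:
            continue  # waiting for the customer, not for us
        result[case_id] = trace
    return result


# Hypothetical refund cases: case ID -> activities in time order.
traces = {
    "r1": ["Refund requested", "Invoice modified"],
    "r2": ["Refund requested", "Payment issued"],
    "r3": ["Refund requested", "Missing documents requested"],
    "r4": ["Refund requested", "Canceled", "Invoice modified"],
}

still_open = open_cases(traces)
print(sorted(still_open))
```

Case r4 shows the difference with an endpoint-based definition: its last step is not a milestone, but it is still removed because a milestone was reached earlier in the case.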

2. Export the filtered data set as a CSV file

The result is a process view that contains only those cases that are still open. As we can see, ca. 36% of the cases are incomplete in this data set (see screenshot below).

In this view, you can already see what the last activities were for all these open cases: The dashed lines leading to the endpoint indicate how many cases performed that particular activity as the last step in the process so far. For example, we can see that 20 times the activity Invoice modified was the very last step that was performed.

However, what you cannot see here is for how long they have already been waiting in this state. The problem is that when you measure the case duration in process mining, then you always look at the time between the very first and the very last event in each case, irrespective of how long ago that “last event” was performed.

To find out how long these open cases have been idle after the last step (and for how long they have been open in total), we are going to use a trick and simply add a Today timestamp to the data. To do that, first export the incomplete cases data set using the Export CSV file button (see lower right corner in screenshot above).

3. Export the list of cases as a CSV file

We will need to add this artificial Today timestamp to the end of each of the open cases. To quickly get a list of the case IDs, switch to the Statistics tab, right-click somewhere in the Cases overview table and choose the Export to CSV option (see screenshot below).

This will export a list of all open cases in a CSV file, one row per case.

4. Copy the Case IDs from the exported list of cases

Now, open the list of Case IDs that you just exported in Excel and select and copy the case IDs in the Case ID column to the clipboard (see screenshot below).

5. Append the Case IDs and add ‘Today’ timestamp

Paste the case IDs from the clipboard below the last row in the exported data file from Step 2 (see screenshot below).

Then, type the activity name Today in the activity column for the first newly added row. Furthermore, add a Today-timestamp to the timestamp column (see screenshot below). Make sure that you use exactly the same date and time pattern format as the other timestamps in your data set.

Which Today-timestamp should you use? If you have extracted your data set fairly recently (and you would assume that most cases that appear to be open in the data set are still open now), you can actually simply use your current date. Otherwise, look up the latest timestamp of the whole data set via the End timestamp in the overview statistics and use that date and timestamp to be precise. For example, 24 January 2012 was the last timestamp in the customer refund process.

Finally, copy the Today-activity name and the timestamp cells and paste them into the remaining newly added rows (see screenshot below).
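If you prefer to script Steps 3 to 5 instead of copying and pasting in Excel, a minimal sketch could look as follows. The column names, the timestamp pattern, and the sample rows are assumptions for illustration; in practice you would read the file exported in Step 2 and use its actual pattern.

```python
import csv
import io
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed pattern; must match the exported file

# Stand-in for the CSV exported in Step 2 (column names are assumptions).
exported = """Case ID,Activity,Timestamp
c2,Refund requested,2012-01-03 10:00:00
c2,Invoice modified,2012-01-04 11:00:00
c5,Refund requested,2012-01-20 09:30:00
c5,Shipment via logistics partner,2012-01-24 15:45:00
"""

rows = list(csv.DictReader(io.StringIO(exported)))

# Use the latest timestamp in the data set as "Today" (the precise option).
today = max(datetime.strptime(r["Timestamp"], TS_FORMAT) for r in rows)

# One extra "Today" event per open case, written in the same timestamp pattern.
case_ids = list(dict.fromkeys(r["Case ID"] for r in rows))  # unique, order kept
today_rows = [
    {"Case ID": cid, "Activity": "Today", "Timestamp": today.strftime(TS_FORMAT)}
    for cid in case_ids
]

out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["Case ID", "Activity", "Timestamp"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows + today_rows)
result_csv = out.getvalue()
print(result_csv)
```

Formatting the Today timestamp with the same pattern that was used for parsing guarantees that the new rows match the other timestamps in the file.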

6. Re-import the data to analyze open cases

If you now save your file and import it again into Disco, you will see that a new Today activity has appeared at the very end of the process (see screenshot below).

The main difference, however, will be in the performance analysis.

For example, if you switch to a combination of total and mean duration¹ in the performance view of the process map (see screenshot below), then you will see that one of the major places in the process where cases are stuck is after the Shipment via logistics partner activity. On average, open cases have been inactive in this place for more than 13 days.

Another example is the case duration statistics, which now reflect the accurate time that these incomplete cases have actually been open so far (see screenshot below). For example, the average time that incomplete cases have been open in this data set is 24.9 days.
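To make the effect of the Today timestamp concrete, here is a small worked sketch with two hypothetical open cases (the case IDs and dates are invented; only the Today date of 24 January 2012 comes from the refund example). It compares the duration measured between the first and last event with the idle time and the total open time.

```python
from datetime import datetime
from statistics import mean

TS_FORMAT = "%Y-%m-%d %H:%M"
today = datetime(2012, 1, 24)  # last timestamp of the refund data set

# Hypothetical open cases: case ID -> list of (activity, timestamp), in order.
open_cases = {
    "c2": [("Refund requested", "2012-01-02 00:00"),
           ("Invoice modified", "2012-01-04 00:00")],
    "c5": [("Refund requested", "2012-01-10 00:00"),
           ("Shipment via logistics partner", "2012-01-14 00:00")],
}


def parse(ts):
    return datetime.strptime(ts, TS_FORMAT)


report = {}
for case_id, trace in open_cases.items():
    first, last = parse(trace[0][1]), parse(trace[-1][1])
    report[case_id] = {
        "last_activity": trace[-1][0],
        "measured_days": (last - first).days,  # first-to-last event only
        "idle_days": (today - last).days,      # waiting since the last step
        "open_days": (today - first).days,     # duration once Today is appended
    }

mean_open_days = mean(r["open_days"] for r in report.values())
print(report, mean_open_days)
```

For case c2, the measured duration is only 2 days, although the case has been idle for 20 days and open for 22 days in total. This gap is exactly what the appended Today event makes visible.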

  1. Read our article on How to perform a bottleneck analysis with process mining to learn why this combination can be useful for identifying the big impact areas for delays in your process.

Process Mining Transformations — Part 6: Relabeling Activities

This is the 6th article in our series on typical process mining data preparation tasks. You can find an overview of all articles in the series here.

Out of the three minimum data requirements for process mining, the activity name is crucial to visualize the steps in the process. It shows you which activities took place and in which sequence.

There are situations in which the activity name is only captured on a very technical level by the IT system (e.g., as an action code, a transaction number, or some other cryptic label). This is a problem, not only because it makes it difficult for business users to understand the process map, but also because it becomes close to impossible for the process mining analyst to interpret what they are seeing. Therefore, we recommend always taking the time to enrich such technical activity labels with human-readable activity names.

For example, take a look at the following data set extracted by a Brazilian IT Service Management department (see below). The ‘task sequence’ column represents the status changes of the tickets in the IT Service Management system.

When you import the data into Disco to discover the process map¹, you find that the activity names are shown as numbers (see below). For example, the first activity at the top is shown as ‘10’, the second one as ‘20’, etc.

This is not practical because, unless you are so familiar with the IT system that you “think in” task sequence codes yourself, you will have a hard time understanding and interpreting this process.

Even having a translation table on your desk and looking up individual activities (to see which activity belongs to which status code) is not a good idea, because the process maps that you discover with process mining get complicated very quickly all by themselves. You need to be able to build up a mental model of the process to deal with this complexity in your analysis.

So, in this article we show you step by step how you can add meaningful activity names to a data set that only has cryptic activity labels.

Step 1: Export the activities

First, you can export the list of all the different activities that are contained in your data set. To do this, you can go to the ‘Activities’ view in the ‘Statistics’ tab in Disco. Simply right-click somewhere in the activity statistics table and use the ‘Export CSV…’ option to save the activity statistics as a CSV file (see below).

You can then open the exported file in Excel (see below).

The ‘Frequency’ and ‘Relative frequency’ statistics are not needed for this use case and you can delete those columns.

Step 2: Mapping the activities

In the next step, you can add a new column and give the Excel sheet to the IT administrator of the system from which you extracted the data. Ask them to add a short description for each of the technical activity labels in your list.

Alternatively, you can also fill in a meaningful activity name yourself by looking at example cases and the process map together with a domain expert.

For example, for the IT Service Management process from before, a column ‘ActivityLabel_PT’ has been added for the Portuguese activity name and another column ‘ActivityLabel_EN’ for the English activity name (see above).

Step 3: Apply the new mapping to your dataset

Now that we have the mapping, we need to apply it to the source data. Here, we show you two simple ways of how to do this in Excel. We will share alternative ways of relabeling activity names for data sets that are too large to be manipulated in Excel in an upcoming article.

The easiest way is to just use the ‘Find and Replace’ functionality in Excel (see below).

  • Copy and paste the column with the technical activity code. Choose a new heading for the new column to indicate that this is the new activity name.
  • Select the new column (to make sure only fields in this column are being replaced) and open the ‘Find and replace’ tool in Excel.
  • Don't forget to check the ‘Find entire cells only’ option, otherwise you may only replace part of the text.
  • Copy and paste the first technical activity code in the ‘Find’ and its new human-readable name in the ‘Replace with’ field.
  • Press ‘Replace All’.
  • Continue until all technical activity codes in the new column have been replaced.

The ‘Find and Replace’ method becomes a bit tedious if you have a large number of different activities. In such situations, it is better to use the VLOOKUP function in Excel.²

To do this:

  • Add a new tab called ‘Mapping’ to the source Excel file and copy the result from Step 2 above (without headings) to this new tab.
  • Then, go back to your source data tab and add a new column including a heading for the relabeled activity.
  • Add the following formula =VLOOKUP(C2,Mapping!A:C,2,FALSE) in the first cell of the newly created column.
  • You can then automatically apply this formula to all the rows in the new column by double-clicking on the bottom right corner of this cell.

In the screen above both the Portuguese as well as the English activity names have been added to the data in this way.
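The same exact-match lookup can also be sketched in Python, which may be handy when a file is awkward to handle in Excel. This is a minimal sketch, not the method from the upcoming article: the codes 10 and 20 come from the example above, but the activity names, the case IDs, and the timestamps are invented for illustration.

```python
import csv
import io

# Hypothetical mapping from Step 2: technical code -> Portuguese and English name.
mapping_csv = """task sequence,ActivityLabel_PT,ActivityLabel_EN
10,Abrir chamado,Open ticket
20,Atribuir chamado,Assign ticket
"""

# Hypothetical source data; code 30 is deliberately missing from the mapping.
source_csv = """Case ID,task sequence,Timestamp
t1,10,2019-01-02 09:00
t1,20,2019-01-02 09:30
t2,10,2019-01-03 14:00
t2,30,2019-01-03 15:00
"""

mapping = {
    row["task sequence"]: row for row in csv.DictReader(io.StringIO(mapping_csv))
}

relabeled, unmapped = [], set()
for row in csv.DictReader(io.StringIO(source_csv)):
    code = row["task sequence"]
    labels = mapping.get(code)  # exact match, like VLOOKUP with FALSE
    if labels is None:
        unmapped.add(code)      # keep track of codes that still need a name
        row["ActivityLabel_EN"] = code
    else:
        row["ActivityLabel_EN"] = labels["ActivityLabel_EN"]
    relabeled.append(row)

print([r["ActivityLabel_EN"] for r in relabeled], unmapped)
```

Collecting the unmapped codes instead of failing makes it easy to send the remaining gaps back to the IT administrator or domain expert.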

Step 4: Import the data with the new label

Now, you can save the result from the previous step as a CSV file from Excel and import the CSV file into Disco.

For the IT Service Management data set we can choose whether we want to see the Portuguese or the English activity names in the process map (see below).

You can still also use the technical activity label as the activity name if you want to. To do this, simply configure both columns as ‘Activity’ during the import step. For example, in the screen above we have included both the ‘task sequence’ column as well as the ‘ActivityLabel_EN’ column into the activity name.

The resulting process map contains activity names with the combination of both column values as shown below.

Finally, validate that your process after the mapping is the same as before. The relabeling should not change the process itself (just the names of the activities).

For example, the process map above is exactly the same as the one that we got in the very beginning. The only difference is that we now have meaningful activity names displayed in the process map.
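Besides comparing the process maps visually, a simple sanity check is to verify that the number of distinct activities and the number of variants are unchanged after the relabeling. The sketch below assumes a hypothetical log and mapping. It catches the typical mistake of giving two technical codes the same name, which would silently merge activities.

```python
def variants(log, label=lambda activity: activity):
    """Return the set of distinct activity sequences in the log."""
    return {tuple(label(a) for a in trace) for trace in log.values()}


def relabeling_preserves_process(log, mapping):
    """Check that relabeling keeps the activity count and the variant count."""
    codes = {a for trace in log.values() for a in trace}
    names = {mapping[a] for a in codes}
    same_activity_count = len(names) == len(codes)
    same_variant_count = len(variants(log, mapping.__getitem__)) == len(variants(log))
    return same_activity_count and same_variant_count


# Hypothetical log (case ID -> sequence of technical codes) and mappings.
log = {
    "t1": ["10", "20", "40"],
    "t2": ["10", "30", "40"],
    "t3": ["10", "20", "40"],
}
good = {"10": "Open ticket", "20": "Assign ticket",
        "30": "Escalate ticket", "40": "Close ticket"}
bad = {**good, "30": "Assign ticket"}  # two codes collapsed into one name

print(relabeling_preserves_process(log, good))
print(relabeling_preserves_process(log, bad))
```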

  1. Note that the process map has been simplified and, therefore, the numbers do not add up to 100%. You can learn more about when and how complex process maps can be simplified in our guide on Simplification Strategies for Process Mining.

  2. The VLOOKUP method also has the advantage that you can create more complicated mappings. For example, the original IT Service Management data set from this example actually had different activity names for the same task sequence codes depending on the IT Service Category. In such a situation, you can define the mapping as a combination of fields rather than a 1:1 mapping.
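As an illustration of such a combined mapping, the lookup key can be a pair of fields instead of a single code. The category values, the column name 'category', and the activity names below are hypothetical.

```python
# Hypothetical mapping keyed on (IT Service Category, task sequence code).
mapping = {
    ("Hardware", "10"): "Register hardware request",
    ("Software", "10"): "Register software request",
}


def relabel(row):
    """Look up the name for the combination of fields; fall back to the code."""
    key = (row["category"], row["task sequence"])
    return mapping.get(key, row["task sequence"])


rows = [
    {"category": "Hardware", "task sequence": "10"},
    {"category": "Software", "task sequence": "10"},
    {"category": "Network", "task sequence": "10"},
]
labels = [relabel(r) for r in rows]
print(labels)
```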

Recap of Process Mining Camp 2019

For eight years, it has been an amazing experience for us to welcome process miners from all over the world at the annual Process Mining Camp. This year's camp was fantastic, too! The atmosphere was great and there were a lot of inspiring talks by process mining professionals from many different areas.

Here is a short summary of this year's camp. Sign up at the camp mailing list to be notified about next year's camp and to receive the video recordings once they become available.

Opening Keynote

Anne Rozinat, co-founder of Fluxicon, opened the camp by emphasizing that it is an exciting time to be a process miner. The field is growing faster than ever before on a global scale. Fluxicon is very proud that professionals from 40 (!) different countries joined camp over the years to share best practices. It is also exciting to see that our academic initiative has exceeded the 600 universities mark.

For the professional, having a good tool for process mining is essential, but developing process mining as a discipline is the key to unlock the true potential. Besides extracting, preparing, and validating the data, you need to identify the best candidate process for process mining. Furthermore, you need to consider the impact and the ethical aspects of such an initiative. Then, you start your analysis by exploring the data and discovering the process, but you also have to choose the right moment to move into a more targeted analysis. Finally, being able to translate the insights into a solid business case and actual process change is crucial to realize the improvement opportunities.

For us at Fluxicon it is still amazing to see how people react when they first find out about process mining. It brings us back to the days when we were experimenting and could see our ideas work in practice for the first time. It is wonderful to see that process mining keeps on spreading across the globe; it is literally (almost) everywhere.

Process Miner of the Year

Kevin Joinson from GlaxoSmithKline was awarded the Process Miner of the Year Award. He developed a new approach for cost deployment in manufacturing.

Cost deployment is a method from World Class Manufacturing, where an industrial engineering approach is taken to understand the cost of losses within an organisation (based on 100% of the cost). A key success factor was the involvement of the Subject Matter Experts (SMEs) and the initial segmentation of the data.

One of the results of Kevin's work is that the processing time of the quality management processes could be improved by 22%. We will share his winning contribution with you in more detail in an upcoming, dedicated article.

Freerk Jilderda ASML, The Netherlands

Freerk Jilderda from ASML kicked off with the first talk of the day. ASML is the leading developer of photolithography systems for the semiconductor industry. The machines are developed and assembled in Veldhoven and shipped to customers all over the world. Availability of the machines is crucial and, therefore, Freerk started a project to reduce the recovery time.

A recovery is a procedure of tests and calibrations to get the machine back up and running after repairs or maintenance. The ideal recovery is described by a procedure containing a sequence of 140 steps. After they identified the recoveries from the machine logging, they used process mining to compare the recoveries with the procedure to identify the key deviations. In this way they were able to find steps that are not part of the expected recovery procedure and improve the process.

Jozef Gruzman & Claus Mitterlehner Raiffeisen Bank International, Austria

The second speakers were Claus Mitterlehner and Jozef Gruzman from Raiffeisen Bank International. They started process mining 12 months ago as a part of their smart automation portfolio to derive insights from process-related data at the bank. Since then, they have been able to apply process mining to various processes, such as corporate lending, credit card and mortgage applications, incident management and service desk, procure to pay, and many more.

Based on their experience they have developed a standard approach for black-box process discoveries. Using process mining, they first explore and review the processes on their own (prior to the in-depth analysis with the subject matter experts). They illustrated their approach and the deliverables they create for the business units based on the customer lending process.

Zvi Topol MuyVentive, United States

Zvi Topol from MuyVentive was the third speaker of the day. He explored process mining for a completely new use case: the improvement of conversational interfaces.

Chatbots and voice interfaces such as Amazon Echo and Google Home are changing the way we interact with computers. Using natural language processing and machine learning, data scientists can detect the intents during the course of a conversation. Zvi added process mining on top of the detected intents to visualize the conversational flows. He showed how the discovery of conversational patterns can help to improve the customer experience of these conversational interfaces.

Bas van Beek & Frank Nobel PGGM, The Netherlands

As the fourth speakers of the day, Bas van Beek and Frank Nobel showed how they made an impact with process mining at the Dutch pension provider PGGM. The process lies at the heart of most of their process improvement initiatives and it is always a multi-disciplinary effort. However, the nature of each initiative is quite different.

Some projects are more focused on the redesign or implementation of an IT solution. Others require extensive involvement from the business to change the way of working. Frank showed the difference in approach with two examples. Afterwards, Bas showed an example where they used process mining for compliance purposes. Because they were able to demonstrate that certain individual funds actually follow the same process, they could group these funds and simplify the audits by using generic controls.

Mark Pijnenburg & Carmen Bratosin Philips Healthcare & ESI, The Netherlands

The fifth speakers, Mark Pijnenburg and Carmen Bratosin, applied process mining to the usage of MRI machines by physicians. Understanding the actual usage patterns in the field is especially interesting to improve the system requirements and to increase the test coverage based on real-life behavior for these machines.

But it is not easy, because the technical logging produced by the MRI machines is only available on a technical log level used for debugging. Furthermore, each physician has their own preferences regarding the machine setup for certain exams (adding to the complexity).

However, this did not stop Carmen and Mark. They started to select the key activities in the technical log, then aligned them with the user interface elements, and finally matched them with the steps described by the American College of Radiology to get them onto the abstraction level a radiology expert would understand. Following this approach, they were able to compare the actual usage with pre-defined exam cards.

Sudhendu Rai AIG, United States

Sudhendu Rai, lead scientist and head of data driven process optimization at AIG, was the sixth speaker. He developed a ‘Process Wind Tunnel’ framework to evaluate and optimize process structure and parameters using real-world data prior to committing to a final process design. Not to test the aerodynamic qualities of aircraft models, but to test the qualities of future state processes.

The initial model needs to reflect the reality as closely as possible. Process mining is a great way to discover the key steps that need to be part of the simulation model. Furthermore, process mining helps to determine the probabilities of transitions and the distribution of the process times to populate the model.

Sudhendu then developed “What-If” scenarios that reflected alternative process re-designs of the current process. Using discrete event simulation he tested the impact of each scenario before making the decision to implement a change in the actual process. In this way he was able to find the best scenario and could reduce the cycle time from 12 days to 5 days, increasing the throughput by over 30%.

Boris Nikolov Vanderlande, The Netherlands

The seventh speaker, Boris Nikolov, presented the application of process mining in logistic process automation. As a process improvement engineer, Boris supports customers by solving problems and by implementing new systems for baggage handling or parcel sorting and routing.

One of the customers in the parcel distribution center called Boris to solve a problem of recirculating parcels. Normally, parcels entering the system are scanned and routed to the right locations. However, a percentage of parcels kept circulating. Using the standard checks, he was not able to find the problem quickly and therefore tried to analyze it using process mining. In this way, he was able to find that the lookup of the parcel location in the ERP was delayed, so the location was not known in time for the parcel to be routed correctly.

Besides solving problems, he also used process mining in the design stage of new baggage handling systems for airports. In order to save time, they develop simulation models to test if the design meets customer requirements. Data produced by the simulation models provided great insight when testing failure scenarios and helped to improve standard operating procedures.

Hadi Sotudeh JADS, The Netherlands

Sometimes, we see an application of process mining that nobody thought of before. Hadi Sotudeh, PDEng student at JADS, had such an example when he applied process mining to data from the 2018 World Cup in football.

After transforming the data, he was able to explore the actions of the players but found that there was not one dominant pattern. He tried various approaches and perspectives to discover patterns. He was able to look at interactions with individual players, at zones in the field, and at the patterns for a particular outcome (goal or throw-in). Because the football interactions do not follow a typical (standard) process, finding the right level is one of the challenges in getting insights. Taking various perspectives can help to learn new things about the opponent's pattern of play, or help a team to learn from mistakes.

Wil van der Aalst RWTH Aachen, Germany

Wil van der Aalst gave the closing keynote at camp. He started by giving an overview of the progress that has been made in the process mining field over the past 20 years. Process mining unlocks great potential but also comes with a huge responsibility. Responsible data science focuses on positive technological breakthroughs and aims to prevent “pollution” by “bad data science”.

Wil gave us a sneak peek at current responsible process mining research from the area of ‘fairness’ (how to draw conclusions from data that are fair without sacrificing accuracy too much) and ‘confidentiality’ (how to analyze data without revealing secrets). While research can provide some solutions by developing new techniques, understanding these risks is a responsibility of the process miner.

Second Day: Workshops

The majority of the campers stayed for the second day to join one of the four workshops. In the workshops, (1) Rudi Niks explained how to improve digital processes when using process mining in each of the stages in the Lean Six Sigma improvement methodology. (2) Wesley Wierz and Rick van Buuren guided the workshop participants through the steps of extracting event logs from an ERP. (3) Andrés Jiménez Ramírez and Hajo Reijers discussed the combination of Robotic Process Automation (RPA) and process mining in their workshop. (4) Anne Rozinat taught the participants how to answer 20 typical process mining questions.

And, of course, during the breaks people got the chance to discuss and learn from each other.

We would like to thank everyone for the wonderful time at camp, and we can't wait to see you all again next year!

Photos by Lieke Vermeulen