Process Mining Meets Football! How Does a Football Team Possess The Ball On The Pitch?

Finding the right perspective on the process is one of the challenges you can face when applying process mining. In most cases, we already have an idea of what to expect from the process, but sometimes it's not so easy to find the right perspective to get valuable insights.

At Process Mining Camp in June 2019, Hadi Sotudeh, a PDEng student at Jheronimus Academy of Data Science (JADS), shared his experience of applying process mining to the World Cup 2018 dataset. He has now written down his analysis in this article and is interested in any thoughts or feedback you may have. You can reach him on LinkedIn or via email.

If you have a guest article or process mining case study that you would like to share as well, please contact us via anne@fluxicon.com.

The data set

What does football have to do with process mining? Nothing at all, but I noticed that the Statsbomb dataset (see Figure 1) fulfilled the requirements to at least try1. I was especially interested in how a football team possesses the ball on the pitch. By answering this question, I would be able to give coaching staff insights into interesting patterns of play, which they can use to develop counter strategies.

Figure 1: An example fragment of the Statsbomb data set

First, we need to introduce the ball possession process:

The ball possession process is a sequence of on-ball actions taken by one team from the beginning of possession until the end of it (losing the ball or scoring).

For example, in the video above, the ball possession starts with the defenders and, after several on-ball actions such as passes and dribbles, the forward player fails to score. This ball possession sequence occurred in the match between Iran and Portugal at the World Cup 2018. The sequence, with its most important attributes after preprocessing the data set, is shown in Table 1 (click on the image to see a larger version).

Table 1: The recorded ball possession sequence in the data set

As shown in the table above, this ball possession sequence has attributes such as:

  • case ID: all actions in a ball possession sequence have the same case ID
  • action: name of the on-ball action such as pass and pressure
  • type: type of ball action such as Free Kick, Open Play, Shot Faced
  • play pattern: each sequence has one play pattern such as “Free Kick” and “From Corner”
  • recipient: the receiving player
  • start Time and end Time: start and end time of the action
  • period: which half of the match
  • duration: duration of the action
  • possession team: the team in ball possession
  • team action: the action-taking team
  • player: the action-taking player
  • body part: the body part of the action-taking player
  • start_X and start_Y: the start location of the action on the pitch with respect to a reference
  • end_X and end_Y: the end location of the action on the pitch with respect to a reference
  • result: the outcome of the action

The preprocessed process mining datasets of the World Cup 2018 can be downloaded here, both per match and aggregated per team.

We chose to analyze Belgium's ball possession process (see Figure 2), because they played seven matches, which provides a richer dataset.

Mapping the ball possession process

Figure 2: Belgium in the World Cup 2018

To map the ball possession process, we need to think about how to assign the three process mining parameters case-id, activity and timestamp. The ball possession sequence number was already chosen as our case-id. Then, the on-ball action was initially taken as the activity, and the start time of the action was chosen as the timestamp. After importing the dataset into Disco, all the activities and the most important paths are shown in the discovered process map (see Figure 3 - Click to enlarge).

Figure 3: All activities in the ball possession process with the most important paths
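For readers who want to prepare the data in a script rather than configuring the columns in Disco, here is a minimal Python (pandas) sketch of the case-id, activity, and timestamp mapping described above. The file names are hypothetical and the column names follow Table 1:

    import pandas as pd

    # A minimal sketch: keep the three process mining essentials from the
    # preprocessed data set (file names are assumptions, columns follow Table 1).
    df = pd.read_csv("world_cup_2018_belgium.csv")

    event_log = df[["case ID", "action", "start Time"]].rename(columns={
        "case ID": "Case ID",       # ball possession sequence number
        "action": "Activity",       # on-ball action, e.g. Pass or Pressure
        "start Time": "Timestamp",  # start time of the action
    })

    event_log.to_csv("belgium_event_log.csv", index=False)

The renaming is optional, since Disco lets you assign these roles during the import step.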

Figure 4 shows the ball possession process for 594 cases with 295 variants. The first variant covers 48 cases (only one pass), the second variant covers 38 cases (two passes), and the third variant covers sequences of three passes.

Figure 4: Top three variants of the ball possession process

The players as activities

We can also take another perspective on the process by using the players as activities and seeing how they interact on the pitch (see Figure 5).

Figure 5: Interaction of Belgium's top 11 players on the pitch during the ball possession process, with the most frequent interactions

The Belgium players' interaction process on the pitch has 594 cases with 525 variants (see Figure 6).

Figure 6: Top three variants of Belgium players’ interaction during the ball possession process on the pitch

The first variant includes 10 cases in which Kevin De Bruyne performed the only action in the sequence. The second and third variants also consist of one action, but taken by different players: Jan Vertonghen and Thibaut Courtois, respectively.

Unfolding loops

If we look back at the first perspective, where we mapped the on-ball action as the activity in the process (see Figure 3 above), we can see that there is a very dominant self-loop on the Pass activity (see Figure 7 below).

Figure 7: Self-loop in ‘Pass’ activity

The collapsing of repetitions into loops is useful in most situations, but now we want to dive deeper into the ‘Pass’ patterns. To do this, we need to “unfold” this loop.

We applied the unfolding technique described in this article. This simply means that we change a sequence from Pass, Pass, Pass (which will be collapsed into a single ‘Pass’ activity with a self-loop) to Pass1, Pass2, Pass3 (which will be shown as a sequence of ‘Pass’ activities after each other).
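A minimal Python sketch of this unfolding step could look as follows. The column and file names are assumptions (matching the sketch above), and the events are expected to be sorted by case and time:

    import pandas as pd

    # A minimal sketch of the unfolding step: number consecutive repetitions
    # of the same action within a case, so that Pass, Pass, Pass becomes
    # Pass1, Pass2, Pass3.
    df = pd.read_csv("belgium_event_log.csv")

    # A new "run" starts whenever the action or the case changes
    new_run = (
        (df["Activity"] != df["Activity"].shift())
        | (df["Case ID"] != df["Case ID"].shift())
    )

    # Number the events within each run: 1, 2, 3, ...
    df["repetition"] = df.groupby(new_run.cumsum()).cumcount() + 1

    # Unfolded activity name, e.g. "Pass1", "Pass2", "Pass3"
    df["Activity unfolded"] = df["Activity"] + df["repetition"].astype(str)

    df.to_csv("belgium_event_log_unfolded.csv", index=False)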

After adding the repetition number in the Python script, we import the data back into Disco, choosing both the on-ball action and the newly added repetition number as the activity name. The full process map (100% activities and 100% paths) is shown in Figure 8.

Figure 8: Complete unfolded process map

As one would expect from a football game, the process map is very complicated. By only focusing on 50% of the activities and the most important paths, we get a readable process map that we can now further analyze (see Figure 9).

Figure 9: Unfolded process map with 50% of the activities and the most important paths

Distinguishing different types of ball possession

In another exploration, we want to focus on subprocesses inside the ball possession process. In a football match, ball possessions can have different types, such as from the goalkeeper, from a corner, from a free kick, from a throw-in, etc. Obviously, the process after a corner should differ from one that starts at the goalkeeper.

We concatenate the type of the sequence with the on-ball action as the activity name to make it easier to focus only on interesting subsets of the map based on this added dimension (see Figure 10 - Click on the image to see the full picture).

Figure 10: Subprocesses of the ball possession process with the type of sequence dimension
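One possible implementation of this concatenation, building on the unfolding sketch above (the ‘play pattern’ column is taken from Table 1; the exact separator convention is an assumption based on the labels visible in Figure 11):

    # A minimal sketch: combine the on-ball action, the play pattern of the
    # sequence, and the repetition number into one activity name, yielding
    # labels such as "Pass_Throw-in-1".
    df["Activity typed"] = (
        df["Activity"] + "_" + df["play pattern"] + "-" + df["repetition"].astype(str)
    )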

The coaching staff can create a filter on cases that are not set pieces, such as corners and free kicks, to focus only on the parts that they are interested in.

Another application of process mining is that the coaching staff can go to the case explorer and see which undesired sequences have happened on the pitch, to identify irregular patterns that need to be prevented (see Figures 11 and 12).

Figure 11: Exploring undesired sequences

Figure 12: Finding players involved in those sequences

By exploring the variants one by one on the left-hand side of Figure 11, we can see that there are 11 sequences that had only one on-ball action (Pass_Throw-in-1). Let's keep only these sequences by filtering them, and drill down to find the reason, as sketched below.
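In a script, the same selection could be made as follows (a sketch, assuming the typed activity names from the concatenation step above):

    # A minimal sketch: keep only the cases whose entire sequence consists
    # of the single activity "Pass_Throw-in-1".
    counts = df.groupby("Case ID")["Activity typed"].agg(["size", "first"])
    selected_ids = counts[(counts["size"] == 1) & (counts["first"] == "Pass_Throw-in-1")].index
    selected = df[df["Case ID"].isin(selected_ids)]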

Figure 12 shows the different players and how they performed in the selected sequences. For example, Jan Vertonghen was involved in five out of the 11 sequences that ended right after the throw-in. We can select those sequences and drill down to see what happened (see Figure 13).

Figure 13: An example of a throw-in sequence that Vertonghen was involved in

By selecting each case, we can see the other attributes of that sequence. For example, one of the sequences happened against Panama, in the second half, in the 73rd minute. Now, we can connect the event log to the match video and see what happened on the pitch.

Here, we can now go to the exact time and watch that frame carefully.

As you can see, Vertonghen tried to start the throw-in with a long pass, which was unsuccessful, and Panama took over possession.

This way, the coaching team does not need to watch the whole match from beginning to end. They can focus only on the interesting parts and save time. This application is also interesting when your team plays against unknown teams: as a member of the coaching staff, you will not need to watch all of the opponent's matches in full.

Conclusion

After transforming the data, we were able to explore the actions of the players, but found that there was not one dominant pattern. We therefore took other perspectives to discover patterns, for example, by looking at the interactions of individual players. Because football interactions do not follow a typical (standard) process, finding the right level of abstraction is one of the challenges in getting insights. Taking various perspectives can help a team to learn new things about the opponent's patterns of play, or to learn from its own mistakes.

As always, it is a good idea to look back and see how we came to this point. When we revisit how we defined the process, we realize that we can perhaps redefine it further.

For example, one of the next steps could be redefining the process as:

The ball possession process is a sequence of zones that the ball moves through on the pitch, from the beginning of possession until the end of it.

This view requires defining zones on the football pitch. One example is shown in Figure 14.

Figure 14: Dividing the pitch into different zones (activities)
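As an illustration, here is a rough Python sketch of such a zone mapping. It assumes Statsbomb's 120 × 80 pitch coordinate system; the three-by-three zone scheme itself is just one possible choice:

    # A rough sketch: map event coordinates to one of nine pitch zones that
    # could serve as activity names (assumes a 120 x 80 pitch).
    def zone(x: float, y: float) -> str:
        third = ["Defensive", "Middle", "Attacking"][min(int(x // 40), 2)]
        lane = ["Left", "Center", "Right"][min(int(y // (80 / 3)), 2)]
        return f"{third} {lane}"

    df["Zone"] = [zone(x, y) for x, y in zip(df["start_X"], df["start_Y"])]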

Finding the right way to divide the pitch into meaningful zones, and identifying relevant questions that we can answer with process mining and that are also interesting for the coaching staff, is what we can do next.

We have shown that process mining can be a powerful tool to explore football data. However, finding the right perspective to answer questions is not always obvious.

Data can be molded into multiple representations, which in turn allow us to take various perspectives of the process. Finding the right perspective is an iterative process that can be best explored by trying different things.


  1. This dataset is provided by Statsbomb for research purposes on their GitHub page. ↩︎

How to Analyze Open Cases With Process Mining

This article previously appeared in the Process Mining News. Sign up here to receive more articles about the practical application of process mining.

One of the first things that you learn in the process mining methodology is how to filter out incomplete cases to get an overview of what the regular end-to-end process looks like.

Incomplete cases are process instances that have not finished yet. They are somewhere “in the middle” of the process. You typically remove such incomplete cases, for example, when you analyze the average case duration, because the case duration of incomplete cases is misleading. The time between the first and the last event in an incomplete case can appear to be very short, but in fact only a fraction of the process has taken place so far: If you had waited a few more days, or weeks, then more activities would likely have taken place.

But what if you are exactly interested in those incomplete cases?

For example, you may want to know how long they have been open, how long nothing has happened since the last activity or status change, and which statuses accumulate the most (and the most severe) open cases without any action afterwards. These may be cases where the customer, unnoticed by the company, has already been waiting for a long time and is about to be disappointed.

In this article, we show you how you can include the perspective of open cases in your process mining analysis. We provide detailed step-by-step instructions (download Disco if you have not done so yet) to follow along.

1. Apply filter to focus on incomplete cases

As a first step, we need to filter our data set to focus on the incomplete cases. One typical way to do that is to use the Endpoints filter, where you can first select the expected endpoints, and then invert the selection (by pressing the half-filled circle next to the search icon in the upper right corner of the filter settings).

Another way to filter incomplete cases is to check that none of the expected milestones in the process has been reached, using the Forbidden mode of the Attribute filter. For example, in a customer refund process, these milestones may be activities such as Canceled, Order completed, and Payment issued, because they indicate that the refund order is no longer open for the customer (see below - click on the screenshot to see a larger version).

The difference between using the Attribute filter and using the Endpoints filter is that with the Forbidden mode of the Attribute filter we do not care about what exactly the last step in the process was. Instead, we want to base our incompleteness evaluation on the fact that a specific activity has not (yet) occurred. Read The Different Meanings of “Finished” to learn more about the differences between these definitions for complete cases.

For the refund process, we use an Attribute filter in Forbidden mode, in which we select the milestone activities that indicate a completion, a cancellation, a payment, or a rejection of the refund request. This removes all cases that have reached one of these milestones somewhere in the process. In addition, we combine this Attribute filter with an Endpoints filter that removes all refund requests for which we are currently waiting for the customer in the ‘Missing documents requested’ activity (see screenshot below).
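To make the logic of this incompleteness check concrete, here is a minimal Python (pandas) sketch. The file, column, and activity names follow the example above, but are assumptions:

    import pandas as pd

    # A minimal sketch: a case counts as open if none of the milestone
    # activities has occurred anywhere in it.
    df = pd.read_csv("refund_process.csv")
    milestones = {"Canceled", "Order completed", "Payment issued"}

    # For each case, check whether any milestone activity occurred
    reached = df.groupby("Case ID")["Activity"].agg(
        lambda activities: bool(milestones & set(activities))
    )

    open_cases = df[df["Case ID"].isin(reached.index[~reached])]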

2. Export the filtered data set as a CSV file

The result is a process view that contains only those cases that are still open. As we can see, ca. 36% of the cases are incomplete in this data set (see screenshot below).

In this view, you can already see what the last activities were for all these open cases: The dashed lines leading to the endpoint indicate how many cases performed that particular activity as the last step in the process so far. For example, we can see that the activity Invoice modified was the very last step in 20 cases.

However, what you cannot see here is for how long they have already been waiting in this state. The problem is that when you measure the case duration in process mining, you always look at the time between the very first and the very last event in each case, irrespective of how long ago that “last event” was performed.

To find out how long these open cases have been idle after the last step (and for how long they have been open in total), we are going to use a trick and simply add a Today timestamp to the data. To do that, first export the incomplete cases data set using the Export CSV file button (see lower right corner in screenshot above).

3. Export the list of cases as a CSV file

We will need to add this artificial Today timestamp to the end of each of the open cases. To quickly get a list of the case IDs, switch to the Statistics tab, right-click somewhere in the Cases overview table and choose the Export to CSV option (see screenshot below).

This will export a list of all open cases in a CSV file, one row per case.

4. Copy the Case IDs from the exported list of cases

Now, open the list of Case IDs that you just exported in Excel and select and copy the case IDs in the Case ID column to the clipboard (see screenshot below).

5. Append the Case IDs and add ‘Today’ timestamp

Paste the case IDs from the clipboard below the last row in the exported data file from Step 2 (see screenshot below).

Then, type the activity name Today in the activity column for the first newly added row. Furthermore, add a Today-timestamp to the timestamp column (see screenshot below). Make sure that you use exactly the same date and time format as the other timestamps in your data set.

Which Today-timestamp should you use? If you have extracted your data set fairly recently (and you would assume that most cases that appear to be open in the data set are still open now), you can simply use the current date. Otherwise, look up the latest timestamp of the whole data set via the End timestamp in the overview statistics and use that date and time to be precise. For example, 24 January 2012 was the last timestamp in the customer refund process.

Finally, copy the Today activity name and the timestamp cells and paste them into the remaining newly added rows (see screenshot below).
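If you prefer scripting over Excel, steps 3 to 5 can also be performed in a few lines of Python. This is a sketch with assumed file and column names:

    import pandas as pd

    # A minimal sketch: append one artificial "Today" event to every open
    # case, stamped with the latest timestamp in the data set.
    df = pd.read_csv("open_cases.csv", parse_dates=["Timestamp"])

    today_rows = pd.DataFrame({
        "Case ID": df["Case ID"].unique(),
        "Activity": "Today",
        "Timestamp": df["Timestamp"].max(),
    })

    # Re-exporting writes all timestamps in one consistent format
    pd.concat([df, today_rows], ignore_index=True).to_csv(
        "open_cases_with_today.csv", index=False
    )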

6. Re-import the data to analyze open cases

If you now save your file and import it again into Disco, you will see that a new Today activity has appeared at the very end of the process (see screenshot below).

The main difference, however, will be in the performance analysis.

For example, if you switch to a combination of total and mean duration1 in the performance view of the process map (see screenshot below), then you will see that one of the major places in the process where cases are stuck is after the Shipment via logistics partner activity. On average, open cases have been inactive in this place for more than 13 days.

Another example is the case duration statistics, which now reflect the accurate time that these incomplete cases have actually been open so far (see screenshot below). For example, the average time that incomplete cases have been open in this data set is 24.9 days.


  1. Read our article on How to perform a bottleneck analysis with process mining to learn why this combination can be useful for identifying the big impact areas for delays in your process. ↩︎

Process Mining Transformations — Part 6: Relabeling Activities

This is the 6th article in our series on typical process mining data preparation tasks. You can find an overview of all articles in the series here.

Out of the three minimum data requirements for process mining, the activity name is crucial to visualize the steps in the process. It shows you which activities took place and in which sequence.

There are situations in which the activity name is only captured on a very technical level by the IT system (e.g., as an action code, a transaction number, or some other cryptic label). This is a problem, not only because it makes it difficult for business users to understand the process map, but also because it becomes close to impossible for the process mining analyst to interpret what they are seeing. Therefore, we recommend always taking the time to enrich such technical activity labels with human-readable activity names.

For example, take a look at the following data set extracted by a Brazilian IT Service Management department (see below). The ‘task sequence’ column represents the status changes of the tickets in the IT Service Management system.

When you import the data into Disco to discover the process map1, you find that the activity names are shown as numbers (see below). For example, the first activity at the top is shown as ‘10’, the second one as ‘20’, etc. (click on the process map to see a larger version).

This is not practical: unless you are so familiar with the IT system that you “think in” task sequence codes yourself, you will have a hard time understanding and interpreting this process.

Even having a translation table on your desk and looking up individual activities (to see which activity belongs to which status code) is not a good idea, because the process maps that you discover with process mining already get complicated very quickly by themselves. You need to be able to build up a mental model of the process to deal with this complexity in your analysis.

So, in this article we show you step by step how you can add meaningful activity names to a data set that only has cryptic activity labels.

Step 1: Export the activities

First, you can export the list of all the different activities that are contained in your data set. To do this, you can go to the ‘Activities’ view in the ‘Statistics’ tab in Disco. Simply right-click somewhere in the activity statistics table and use the ‘Export CSV…’ option to save the activity statistics as a CSV file (see below).

You can then open the exported file in Excel (see below).

The ‘Frequency’ and ‘Relative frequency’ statistics are not needed for this use case and you can delete those columns.

Step 2: Mapping the activities

In the next step, you can add a new column and give the Excel sheet to the IT administrator of the system from which you extracted the data. Ask them to add a short description for each of the technical activity labels in your list.

Alternatively, you can also fill in a meaningful activity name yourself by looking at example cases and the process map together with a domain expert.

For example, for the IT Service Management process from before, a column ‘ActivityLabel_PT’ has been added with the Portuguese activity name, and another column ‘ActivityLabel_EN’ with the English activity name (see above).

Step 3: Apply the new mapping to your dataset

Now that we have the mapping, we need to apply it to the source data. Here, we show you two simple ways of how to do this in Excel. We will share alternative ways of relabeling activity names for data sets that are too large to be manipulated in Excel in an upcoming article.

The easiest way is to just use the ‘Find and Replace’ functionality in Excel (see below).

  • Copy and paste the column with the technical activity code. Choose a new heading for the new column to indicate that this is the new activity name.
  • Select the new column (to make sure only fields in this column are being replaced) and open the ‘Find and replace’ tool in Excel.
  • Don't forget to check the ‘Find entire cells only’ option, otherwise you may replace only part of the text.
  • Copy and paste the first technical activity code in the ‘Find’ and its new human-readable name in the ‘Replace with’ field.
  • Press ‘Replace All’.
  • Continue until all technical activity codes in the new column have been replaced.

The ‘Find and Replace’ method becomes a bit tedious if you have a large number of different activities. In such situations, it is better to use the VLOOKUP function in Excel.2

To do this:

  • Add a new tab called ‘Mapping’ to the source Excel file and copy the result from Step 2 above (without headings) to this new tab.
  • Then, go back to your source data tab and add a new column including a heading for the relabeled activity.
  • Add the following formula =VLOOKUP(C2,Mapping!A:C,2,FALSE) in the first cell of the newly created column.
  • You can then automatically apply this formula to all the rows in the new column by double-clicking on the bottom right corner of this cell.

In the screenshot above, both the Portuguese and the English activity names have been added to the data in this way.
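As a preview of scripted alternatives for larger data sets, the same mapping can also be applied with a short Python (pandas) script. The file names are assumptions; the ‘task sequence’ and label columns follow the example above:

    import pandas as pd

    # A minimal sketch: join the mapping table from Step 2 onto the source data.
    events = pd.read_csv("itsm_events.csv")
    mapping = pd.read_csv("activity_mapping.csv")  # task sequence + labels

    # A left join keeps every event, even if its code is missing from the mapping
    relabeled = events.merge(mapping, on="task sequence", how="left")

    # Report codes without a translation, so they can be added to the mapping
    missing = relabeled["ActivityLabel_EN"].isna()
    print(relabeled.loc[missing, "task sequence"].unique())

    relabeled.to_csv("itsm_events_relabeled.csv", index=False)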

Step 4: Import the data with the new label

Now, you can save the result from the previous step as a CSV file from Excel and import the CSV file into Disco.

For the IT Service Management data set we can choose whether we want to see the Portuguese or the English activity names in the process map (see below).

You can also still use the technical activity label as part of the activity name if you want to. To do this, simply configure both columns as ‘Activity’ during the import step. For example, in the screenshot above we have included both the ‘task sequence’ column and the ‘ActivityLabel_EN’ column in the activity name.

The resulting process map contains activity names with the combination of both column values as shown below.

Finally, validate that the process after the relabeling is the same as before. The relabeling should not change the process itself, just the names of the activities.

For example, the process map above is exactly the same as the one that we got in the very beginning. The only difference is that we now have meaningful activity names displayed in the process map.


  1. Note that the process map has been simplified and, therefore, the numbers do not add up to 100%. You can learn more about when and how complex process maps can be simplified in our guide on Simplification Strategies for Process Mining. ↩︎

  2. The VLOOKUP method also has the advantage that you can create more complicated mappings. For example, the original IT Service Management data set from this example actually had different activity names for the same task sequence codes depending on the IT Service Category. In such a situation, you can define the mapping as a combination of fields rather than a 1:1 mapping. ↩︎

Recap of Process Mining Camp 2019

For eight years, it has been an amazing experience for us to welcome process miners from all over the world at the annual Process Mining Camp. This year's camp was fantastic as well! The atmosphere was great and there were a lot of inspiring talks by process mining professionals from many different areas.

Here is a short summary of this year's camp. Sign up for the camp mailing list to be notified about next year's camp and to receive the video recordings once they become available.

Opening Keynote

Anne Rozinat, co-founder of Fluxicon, opened the camp by emphasizing that it is an exciting time to be a process miner. The field is growing faster than ever before on a global scale. Fluxicon is very proud that professionals from 40 (!) different countries joined camp over the years to share best practices. It is also exciting to see that our academic initiative has exceeded the 600 universities mark.

For the professional, having a good tool for process mining is essential, but developing process mining as a discipline is the key to unlock the true potential. Besides extracting, preparing, and validating the data, you need to identify the best candidate process for process mining. Furthermore, you need to consider the impact and the ethical aspects of such an initiative. Then, you start your analysis by exploring the data and discovering the process, but you also have to choose the right moment to move into a more targeted analysis. Finally, being able to translate the insights into a solid business case and actual process change is crucial to realize the improvement opportunities.

For us at Fluxicon it is still amazing to see how people react when they first find out about process mining. It brings us back to the days when we were experimenting and could see our ideas work in practice for the first time. It is wonderful to see that process mining keeps on spreading across the globe; it is literally (almost) everywhere.

Process Miner of the Year

Kevin Joinson from GlaxoSmithKline was awarded the Process Miner of the Year award. He developed a new approach for cost deployment in manufacturing.

Cost deployment is a method from World Class Manufacturing, where an industrial engineering approach is taken to understand the cost of losses within an organisation (based on 100% of the cost). A key success factor was the involvement of the Subject Matter Experts (SMEs) and the initial segmentation of the data.

One of the results of Kevin's work is that the processing time of the quality management processes could be improved by 22%. We will share his winning contribution with you in more detail in an upcoming, dedicated article.

Freerk Jilderda ASML, The Netherlands

Freerk Jilderda from ASML kicked off with the first talk of the day. ASML is the leading developer of photolithography systems for the semiconductor industry. The machines are developed and assembled in Veldhoven and shipped to customers all over the world. Availability of the machines is crucial and, therefore, Freerk started a project to reduce the recovery time.

A recovery is a procedure of tests and calibrations to get the machine back up and running after repairs or maintenance. The ideal recovery is described by a procedure containing a sequence of 140 steps. After they identified the recoveries from the machine logging, they used process mining to compare the recoveries with the procedure to identify the key deviations. In this way they were able to find steps that are not part of the expected recovery procedure and improve the process.

Jozef Gruzman & Claus Mitterlehner Raiffeisen Bank International, Austria

The second speakers were Claus Mitterlehner and Jozef Gruzman from Raiffeisen Bank International. They started process mining 12 months ago as part of their smart automation portfolio to derive insights from process-related data at the bank. Since then, they have been able to apply process mining to various processes, such as corporate lending, credit card and mortgage applications, incident management and service desk, procure-to-pay, and many more.

Based on their experience they have developed a standard approach for black-box process discoveries. Using process mining, they first explore and review the processes on their own (prior to the in-depth analysis with the subject matter experts). They illustrated their approach and the deliverables they create for the business units based on the customer lending process.

Zvi Topol MuyVentive, United States

Zvi Topol from MuyVentive was the third speaker of the day. He explored process mining for a completely new use case: the improvement of conversational interfaces.

Chatbots and voice interfaces such as Amazon Echo and Google Home are changing the way we interact with computers. Using natural language processing and machine learning, data scientists can detect the intents during the course of a conversation. Zvi added process mining on top of the detected intents to visualize the conversational flows. He showed how the discovery of conversational patterns can help to improve the customer experience of these conversational interfaces.

Bas van Beek & Frank Nobel PGGM, The Netherlands

As the fourth speakers of the day, Bas van Beek and Frank Nobel showed how they made an impact with process mining at the Dutch pension provider PGGM. Process mining lies at the heart of most of their process improvement initiatives, and it is always a multi-disciplinary effort. However, the nature of each initiative is quite different.

Some projects are more focused on the redesign or implementation of an IT solution. Others require extensive involvement from the business to change the way of working. Frank showed the difference in approach with two examples. Afterwards, Bas showed an example where they used process mining for compliance purposes. Because they were able to demonstrate that certain individual funds actually follow the same process, they could group these funds and simplify the audits by using generic controls.

Mark Pijnenburg & Carmen Bratosin Philips Healthcare & ESI, The Netherlands

The fifth speakers, Mark Pijnenburg and Carmen Bratosin, applied process mining to the usage of MRI machines by physicians. Understanding the actual usage patterns in the field is especially interesting for improving the system requirements and increasing the test coverage of these machines based on real-life behavior.

But it is not easy, because the logging produced by the MRI machines is only available on a technical level used for debugging. Furthermore, each physician has their own preferences regarding the machine setup for certain exams (adding to the complexity).

However, this did not stop Carmen and Mark. They started by selecting the key activities in the technical log, then aligned them with the user interface elements, and finally matched them with the steps described by the American College of Radiology to bring them to an abstraction level that a radiology expert would understand. Following this approach, they were able to compare the actual usage with pre-defined exam cards.

Sudhendu Rai AIG, United States

Sudhendu Rai, lead scientist and head of data driven process optimization at AIG, was the sixth speaker. He developed a ‘Process Wind Tunnel’ framework to evaluate and optimize process structure and parameters using real-world data prior to committing to a final process design. Not to test the aerodynamic qualities of aircraft models, but to test the qualities of future state processes.

The initial model needs to reflect the reality as closely as possible. Process mining is a great way to discover the key steps that need to be part of the simulation model. Furthermore, process mining helps to determine the probabilities of transitions and the distribution of the process times to populate the model.

Sudhendu then developed “What-If” scenarios that reflected alternative process re-designs of the current process. Using discrete event simulation he tested the impact of each scenario before making the decision to implement a change in the actual process. In this way he was able to find the best scenario and could reduce the cycle time from 12 days to 5 days, increasing the throughput by over 30%.

Boris Nikolov Vanderlande, The Netherlands

The seventh speaker, Boris Nikolov, presented the application of process mining in logistic process automation. As a process improvement engineer, Boris supports customers by solving problems and by implementing new systems for baggage handling or parcel sorting and routing.

One of the customers, a parcel distribution center, called Boris to solve a problem of recirculating parcels. Normally, parcels entering the system are scanned and routed to the right locations. However, a percentage of parcels kept circulating. Using the standard checks, he was not able to find the problem quickly, and he therefore tried to analyze it using process mining. In this way he was able to find that the lookup of the parcel locations in the ERP was delayed, so that parcels were not known in time to be routed to the right location.

Besides solving problems, he also used process mining in the design stage of new baggage handling systems for airports. In order to save time, they develop simulation models to test if the design meets customer requirements. Data produced by the simulation models provided great insight when testing failure scenarios and helped to improve standard operating procedures.

Hadi Sotudeh JADS, The Netherlands

Sometimes, we see an application of process mining that nobody thought of before. Hadi Sotudeh, PDEng student at JADS, had such an example when he applied process mining to data from the 2018 World Cup in football.

After transforming the data, he was able to explore the actions of the players, but found that there was not one dominant pattern. He took various other perspectives to discover patterns: he was able to look at the interactions of individual players, at zones on the pitch, and at the patterns for a particular outcome (goal or throw-in). Because football interactions do not follow a typical (standard) process, finding the right level of abstraction is one of the challenges in getting insights. Taking various perspectives can help a team to learn new things about the opponent's patterns of play, or to learn from its own mistakes.

Wil van der Aalst RWTH Aachen, Germany

Wil van der Aalst gave the closing keynote at camp. He started by giving an overview of the progress that has been made in the process mining field over the past 20 years. Process mining unlocks great potential but also comes with a huge responsibility. Responsible data science focuses on positive technological breakthroughs and aims to prevent “pollution” by “bad data science”.

Wil gave us a sneak peek at current responsible process mining research from the area of ‘fairness’ (how to draw conclusions from data that are fair without sacrificing accuracy too much) and ‘confidentiality’ (how to analyze data without revealing secrets). While research can provide some solutions by developing new techniques, understanding these risks is a responsibility of the process miner.

Second Day: Workshops

The majority of the campers stayed for the second day to join one of the four workshops. In the workshops, (1) Rudi Niks explained how to improve digital processes by using process mining in each of the stages of the Lean Six Sigma improvement methodology. (2) Wesley Wierz and Rick van Buuren guided the workshop participants through the steps of extracting event logs from an ERP. (3) Andrés Jiménez Ramírez and Hajo Reijers discussed the combination of Robotic Process Automation (RPA) and process mining in their workshop. (4) Anne Rozinat taught the participants how to answer 20 typical process mining questions.

And, of course, during the breaks people got the chance to discuss and learn from each other.

We would like to thank everyone for the wonderful time at camp, and we can't wait to see you all again next year!


Photos by Lieke Vermeulen

Wil van der Aalst at Process Mining Camp 2018

Process Mining Camp is just one week away (see an overview of the speakers here) and there are just a few tickets left. So, if you want to come, you should reserve your seat now!

To get ready for this year's camp, we have started to release the videos from last year. If you have missed them before, you can still watch the videos of Fran Batchelor from UW Health, Niyi Ogunbiyi from Deutsche Bank, Dinesh Das from Microsoft, Wim Kouwenhoven from the City of Amsterdam, Olga Gazina from Euroclear, and Marc Tollens from KLM.

The final speaker at Process Mining Camp 2018 was Wil van der Aalst, the founding father of process mining. In his closing keynote, Wil talked about the updated skill set that process and data scientists need today. Since process mining research started in Eindhoven in the late 1990s, the availability of suitable data has increased tremendously, which makes it even more important that this data can and will be used in an appropriate and responsible manner.

This requires dedicated capabilities from the process miner in each stage of the analysis pipeline: processing and analyzing data, and being responsible about the effects on people and on business models. When you look for people who are skilled in all of these technical areas, as well as in soft skills like communication and ethics, you are looking for (as they would say in the Netherlands) a sheep with five legs; in other words, something very rare. Becoming a data scientist requires a lot of effort to learn all the skills that are needed to live up to these high expectations.

As ambassadors of process mining, we also have the responsibility to use the right terms. Wil sees a clear distinction between Artificial Intelligence (AI), machine learning, and data mining. At the same time, one could argue that process mining is data mining, but the underlying techniques are very different. So, saying that process mining is part of data mining, or AI, doesn't make any sense.

There are incredible expectations around AI and Big Data, which is very dangerous as we have seen in past AI winters. We should be careful not to overpromise and try to be realistic. The incredible successes of machine learning techniques like deep learning are, for example, still limited to very specific fields.

Some in the media and big technology companies use terms like Artificial Intelligence, Machine Learning, and Deep Learning interchangeably. They might argue that you don't need process mining because you can just put an event log into a deep neural network and a process model will come out. There is, however, not one deep learning algorithm that can discover a process model. Instead, process mining combines the fields and methods of process science and data science. This makes it even more challenging for us to cover all required skills.

But do you, as a professional, need to know how a car works internally in order to drive it? It depends on what you want to accomplish. For example, if you need to drive fast around the Nürburgring, it can be very useful. Also, if you need to select a car, it would certainly be useful to know about its internals. Process mining is a relatively young technology. Therefore, it is useful to know how it works in order to select the right tool, and to use it to maximum effect.

So what kind of skills do you need as a process miner? You need to be able to extract and clean the data, spend time on the analysis, and interpret the results. This is not easy. All the involved parties need to invest the time to determine what the process maps actually mean, so that they can really trust their interpretation. The sheep with five legs would be the ideal process miner, but in most cases this is not realistic.

Traditionally, you will often rely on collaboration between a data-driven expert and a business/domain-driven expert. However, you can also think about more hybrid process mining profiles. Some experts can integrate technological skills into their domain knowledge, while other data scientists can be process mining experts who are especially skilled at performing specific types of analysis in a particular domain.

Do you want to know what kind of process miner you could become? Watch Wil's talk now!


If you can't attend Process Mining Camp this year, you should sign up for the Camp mailing list to receive the presentations and video recordings afterwards.