Process Mining Transformations — Part 2: Unfold Loops for Activity Repetitions

This is the 2nd article in our series on typical process mining data preparation tasks. You can find an overview of all articles in the series here.

In the previous article, we showed how loops can be split up into individual cases. The same principle can also be useful when looking at looping activities.

For example, let’s take a look at the purchasing process in Figure 1. When we analyze the performance of this process, we can see that some cases do not meet the SLA of 21 days throughput time. It seems that the two ‘Amend’ activities could be an important factor in these delays, not only because of the long average waiting times but also because some of the cases go through the ‘Amend’ step multiple times: At least one case went through the ‘Amend Request for Quotation Requester’ step 12 times!

Figure 1: Fragment of the process map for the purchasing process. The primary metric that is shown in the map is ‘Mean duration’ while the secondary metric is ‘Maximum repetitions’.

The nature of a loop (or cycle) is that even if the same activity is repeated within the same case, it is represented by the same activity node in the process map. For example, the secondary metric in the process map in Figure 1 shows that the activity ‘Analyze Request for Quotation’ was performed up to 14 times within a single case. But each of these iterations is represented by the same activity in the map.

In order to understand the impact of these repetitions in more detail, we would like to “unfold” each repetition to take a deeper dive into the repetition patterns.

In this article, we show you how you can achieve this. We will “unfold” each repetition of the activities ‘Analyze Request for Quotation’, ‘Amend Request for Quotation Requester’ and ‘Amend Request for Quotation Requester Manager’ into a separate activity node to analyze the impact of these repetitions in more detail.

Step 1: Transform your data

When you look at case 1212 in Figure 2 below, you can see that the ‘Analyze Request for Quotation’ activity (highlighted in green) and the ‘Amend Request for Quotation’ activity (highlighted in blue) were repeated multiple times. This means that in the context of the process map from Figure 1 this case moves up and down between the highlighted activity nodes. We would like to unfold the looping activities to get more visibility into the repetition pattern.

Figure 2: Example case 1212 with repeating activity pattern (click on the image to see a larger version).

To make things even more complex, the ‘Amend’ activity can either be performed by the Requester (see light blue highlights for activity ‘Amend Request for Quotation Requester’ in Figure 2) or by the Manager (see dark blue highlight for activity ‘Amend Request for Quotation Requester Manager’ in Figure 2). However, for our specific analysis we do not want to make this distinction. We care about how many amendments were made in total, regardless of whether they were made by the requester or the manager.

To be able to analyze each repetition, we need to add a sequence number to each iteration of these activities within the same case. Similar to the approach of unfolding loops for cases, we will add a counter to each occurrence of the repetition.

Previously, we showed you how you can do the heavy lifting in Python. In this example, we show you how you can do the same with an ETL tool. ETL tools have the advantage that you do not need to be a programmer to perform data transformations. We use the ETL tool KNIME, but you can use any other ETL tool or programming language of your preference to get the same result.

With special thanks to Eddy van der Geest, who contributed the solution to this specific data transformation question, you can find the KNIME workflow below (see Figure 3). You can also download the data set here and download the KNIME workflow here to follow the example of this article.

Figure 3: KNIME workflow that adds a counter for each repeated occurrence of an ‘Amend’ and ‘Analyze’ activity (click on the image to see a larger version or download the KNIME workflow to follow the steps yourself).

This workflow loads the dataset from the purchasing process and adds the sequence number for each occurrence of an ‘Analyze Request for Quotation’ activity within the same case as a new column to the right (see green highlighted rows in Figure 4). Furthermore, it keeps a joint counter for the repetition of either the ‘Amend Request for Quotation Requester’ or the ‘Amend Request for Quotation Requester Manager’ activities in another new column (see blue highlighted rows in Figure 4).

Figure 4: The result of the data preparation step for case 1212. You can see that two columns have been added that include a counter for the ‘Amend’ and ‘Analyze’ activity repetitions.
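
If you prefer to stay in a scripting environment rather than an ETL tool, the same two counters can also be computed in a few lines of Python. Here is a minimal pandas sketch; the file and column names are assumptions for this example and need to be adapted to your own data set:

import pandas as pd

# Load the purchasing event log and make sure the events of each case
# are in chronological order (file and column names are assumptions)
df = pd.read_csv("purchasing_process.csv", parse_dates=["Timestamp"])
df = df.sort_values(["Case ID", "Timestamp"], kind="stable")

# Counter for each occurrence of 'Analyze Request for Quotation' within a case
is_analyze = (df["Activity"] == "Analyze Request for Quotation").astype(int)
df["Analyze_SequenceNr"] = is_analyze.groupby(df["Case ID"]).cumsum() * is_analyze

# Joint counter for both 'Amend' variants within a case
is_amend = df["Activity"].isin([
    "Amend Request for Quotation Requester",
    "Amend Request for Quotation Requester Manager",
]).astype(int)
df["Amend_SequenceNr"] = is_amend.groupby(df["Case ID"]).cumsum() * is_amend

df.to_csv("purchasing_with_counters.csv", index=False)

Multiplying by the 0/1 indicator keeps the running count only on the rows of the repeated activity itself; all other rows get a 0.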

Based on this transformed data set, we can now analyze our loop pattern in more detail.

Step 2: Analyze the activity repetitions

To actually unfold the loop in the process map in a visual way, we include both the ‘Amend’ sequence number column as well as the ‘Analyze’ sequence number column into the activity name when we import the transformed data set into Disco (see screenshot in Figure 5 below).

Figure 5: The three highlighted columns are all configured as ‘Activity’ (note the little letter symbol in the header) and, therefore, will be concatenated (combined together) into the activity name.

As a result, we have unfolded each activity occurrence in the loop pattern from Figure 1 (see Figure 6 for the same map but with the repetitions unfolded).

For example, rather than one activity with the name ‘Analyze Request for Quotation’ we can now see a separate activity for each iteration. ‘Analyze Request for Quotation-1’ is the first occurrence, ‘Analyze Request for Quotation-2’ the second occurrence, and so on.
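
If you want to pre-compute such unfolded activity names yourself instead of letting Disco concatenate the columns, a short sketch building on the counter columns from Step 1 could look like this (again an assumption-based illustration, not the exact Disco behavior):

# Per row, at most one of the two counters is non-zero (see Step 1)
counter = df[["Analyze_SequenceNr", "Amend_SequenceNr"]].max(axis=1)

# Append the counter to the activity name where a repetition counter exists,
# e.g., 'Analyze Request for Quotation' with counter 2 becomes
# 'Analyze Request for Quotation-2'
df["Activity Unfolded"] = df["Activity"].where(
    counter == 0, df["Activity"] + "-" + counter.astype(str)
)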

Figure 6: Unfolded loop pattern from Figure 1 (click on the image to see a larger version of the map).

The process map has become much bigger now, but for our purposes it is helpful to see in detail how the repeating activities follow each other and in which combinations.

We can now also answer our initial questions about the amendments. For example, say that we want to know how many cases had three or more amendments (by the requester or the manager combined). To answer this question, we can simply add an Attribute filter in ‘Mandatory’ mode for the ‘Amend_SequenceNr’ field (see Figure 7 below).

Figure 7: Filter for all cases that had three or more repetitions of an ‘Amend’ activity.
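
If you want to cross-check such a filter result outside of Disco, a quick sketch on the transformed data set could look like this (assuming the data frame df with the Amend_SequenceNr column from the sketch in Step 1):

# Highest amendment counter that each case reached
max_amend = df.groupby("Case ID")["Amend_SequenceNr"].max()

# Share of cases with three or more amendments
share = (max_amend >= 3).mean()
print(f"{share:.0%} of the cases had three or more amendments")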

After applying the filter, we can see that 14% of the cases had three or more amendments (see Figure 8 below).

Figure 8: As a result, we find that 14% of the cases had at least three ‘Amend’ activities and can analyze this subset of the process in more detail.

The throughput time of the cases that had three or more amendments can now be compared with the overall case durations to see whether they take longer.

And because the loop pattern has been unfolded, we can see exactly how much time passes, for example, between the fourth amendment and the fifth ‘Analyze’ activity. We can also play the animation over the unfolded process map, and so on.

It’s generally useful to have repetitions collapsed into a single activity in the process map to get a more compact overview, but sometimes unfolding these activity repetitions is exactly what you might want to do to get to the bottom of your loop patterns.

PGGM Saves Time With Process Mining

This is a guest article by Frank Nobel from Finext and by Henri Martens from PGGM, a pension fund service provider. See the Dutch version here. If you have a guest article or process mining case study that you would like to share as well, please contact us via anne@fluxicon.com.

PGGM, one of the largest pension providers in the Netherlands, wants to make its processes more efficient and reduce the costs of the accountant. To do this, the company has researched the added value of process mining. And with success: the organization expects time savings of 66% for the first, second and third line checks of the processes that were studied in the experiment.

Process mining is a new method for process improvement. All related actions and turnaround times of a process are mapped out based on data.

Time to ask Henri Martens, Manager Shared Service Center Extra Services at PGGM, a couple of questions.

Image: Henri Martens (PGGM) at Finext Round Table event

1. Why did you start with process mining?

In the discussions with our accountant KPMG, process mining was suggested as a possibility to reduce the costs of the complex accountability processes and, therefore, also the costs of the accountant. With this savings potential in mind, I started the experiment.

I am always open to innovative techniques and have been motivated by the experiences of KPMG. A lot of time is spent at PGGM on accountability reports to show our clients that we are managing our processes accurately. Afterwards, it then takes the accountant substantial time to assess the files (and their creation).

2. What value does process mining provide to PGGM?

Process mining provides insight into the process. It shows all actions (recorded in the systems) and their underlying relationships. For example, with process mining we show that all sent letters have been checked. Furthermore, we also show that a second employee has performed this check.

The audit, which is part of the accountability reports, can be carried out faster with process mining and is more complete than a random check. However, we did not stop at this one application: we have also used the power of process mining in other departments and processes of PGGM. It soon became clear that process mining offers more than just becoming more efficient in executing audits; it also provides valuable opportunities to identify improvements in all processes.

3. What was the most important success factor of the experiment?

The most important success factor was to put together a multidisciplinary team with the right combination of expertise. It is important to have the right competencies in-house when you use process mining. It’s a combination of data mining and process analysis.

Within the team we have several experienced colleagues who know the process execution very well, but fewer colleagues with data affinity. I went searching for someone with data affinity who could work full time on process mining and ended up at Finext. The team was then complete for us to start experimenting with the process mining tooling. Furthermore, I have given the team the time and freedom necessary to develop the experiment themselves and show that process mining has added value for PGGM.

4. What benefits did applying process mining bring you?

We expect at least 66% time savings for the first, second and third line checks of the relevant processes. With the use of process mining we have established that essential processes are carried out in the same way for several clients. This evidence ensures that less accountability documentation is needed.

In addition, the analyses have led to more insight into the actual execution of our processes. We have then also been able to implement various process improvements. For example, we were able to apply Robotic Process Automation (RPA) to one of our analyzed processes, whereas this was not considered useful before.

We have now completed the analyses of several processes at different departments. These analyses have been the starting point of process improvements due to the additional insight based on facts from the systems. After an improvement has been implemented, a second check is made to show the improvement that was actually realized.

5. How will you use process mining in the future?

We will use process mining in the future to continuously implement improvements in many more processes. We are now seeking collaboration with the robotization and data science disciplines. This joint group outlines the frameworks for the future use of process mining throughout all of PGGM.

During the experiment, process mining was applied in different ways, both for ad-hoc analyses as well as for long-term solutions. In the near future, it is especially important to continue using process mining in the organization. On the one hand for monitoring the processes for continuous improvements, and on the other hand for implementing standard audit reports.

——————————–

Download Interview Case Study: PGGM Saves Time With Process Mining

You can download this interview as a PDF here for easier printing or sharing with others.






Become the Process Miner of the Year 2018!

Two years ago, we introduced the Process Miner of the Year awards to help you showcase your best work and share it with the process mining community. After Veco won the award in 2016, our friends at Telefonica became the Process Miner of the Year 2017 (read the full case and watch the video recording here).

This year, we will continue the tradition and the best submission will receive the Process Miner of the Year award at this year’s Process Mining Camp, on 19 June in Eindhoven.

Have you completed a successful process mining project in the past months that you are really proud of? A project that went so well, or produced such amazing results, that you cannot stop telling anyone around you about it? You know, the one that propelled process mining to a whole new level in your organization? We are pretty sure that a lot of you are thinking of your favorite project right now, and that you can’t wait to share it.

What we are looking for

We want to highlight process mining initiatives that are inspiring, captivating, and interesting. Projects that demonstrate the power of process mining, and the transformative impact it can have on the way organizations go about their work and get things done.

There are a lot of ways in which a process mining project can tell an inspiring story. To name just a few:

  • Process mining has transformed your organization, and the way you work, in an essential way.
  • There has been a huge impact with a big ROI, for example through cost savings or efficiency gains.
  • You found an unexpected way to apply process mining, for example in a domain that nobody approached before you.
  • You were faced with enormous challenges in your project, but you found creative ways to overcome them.
  • You developed a new methodology to make process mining work in your organization, or you successfully integrated process mining into your existing way of working.

Of course, maybe your favorite project is inspiring and amazing in ways that can’t be captured by the above examples. That’s perfectly fine! If you are convinced that you have done some great work, don’t hesitate: Write it up, and submit it, and take your chance to be the Process Miner of the Year 2018!

How to enter the contest

You can either send us an existing write-up of your project, or you can write about your project from scratch. It is probably better to start from scratch, since we are not looking for a white paper, but rather an inspiring story, in your own words.

In any case, you should download this Word document, which contains some more information on how to get started. You can use it either as a guide, or as a template for writing down your story.

When you are finished, send your submission to info@fluxicon.com no later than 30 April 2018.

We can’t wait to read about your process mining projects!

Case Study: Customer Journey Mining

This is a guest article by Yeong Shin Lee from PMIG and Yongil Lee from LOEN Entertainment.

If you have a process mining case study that you would like to share as well, please contact us via anne@fluxicon.com.

Summary

Korean Internet companies hold voluminous log data that records users’ service usage behavior. If they can effectively utilize it, they can gain a competitive edge for maximizing their earnings. Yet, most of them are still at an early stage, in which they identify users’ rough characteristics by performing simple statistical analyses.

LOEN Entertainment runs Melon, which is the largest online music streaming service in South Korea. They adopted process mining with Disco to analyze their mobile app’s log data. LOEN analyzed new users’ journeys during the day when they signed up with a KakaoTalk account. KakaoTalk is a free mobile instant messaging application for smartphones with free text and free call features. KakaoTalk is used by 93% of smartphone owners in South Korea.

They categorized new users into five segments based on their behavioral patterns and clearly identified the reason why each segment signed up. Furthermore, building on the analysis results, the company is planning to conduct a targeted marketing campaign for increasing each segment’s CVR (Conversion Rate). The company judges that the process mining analysis using Disco plays a key role in understanding new customers and is likely to contribute to maximizing earnings.

Company & Service

With the spread of smartphones, the Korean digital music market has sharply grown, now reaching about $900 million. Melon’s market share is more than 60% and it has secured more than 34 million users and 4.5 million paying customers. It started as SK Telecom’s music service in 2004, when the digital music market was still in its early stages. Later, SK Telecom transferred the service to its subsidiary, LOEN Entertainment.

Kakao took the subsidiary over in January 2016. In collaboration with Kakao, LOEN is now focusing on securing new users. A user with a KakaoTalk account can use Melon’s service without a separate registration process (See Figure 1).

Figure 1: Melon’s Mobile App (Left) and its Login Screen (Right).

Furthermore, they conducted a campaign through which KakaoTalk’s paid emoticons are given to paying Melon subscribers at no cost.

To understand the behavior of new users who signed up with a KakaoTalk account and to increase their CVR, LOEN Entertainment, without getting external consulting, performed a process mining project after adopting Disco. An in-house data analyst prepared the data for process mining and a marketer set the direction of analysis and conducted the process mining analysis using her domain knowledge.

Process

The process that was analyzed is a new user’s journey within the mobile app during the day when they signed up. The reasons for choosing this process are as follows:

  1. First, the process is closely related with the company’s strategic direction, focusing on enlarging its customer base in concert with its parent company (i.e., Kakao).
  2. Second, increasing new users’ CVR contributes to its profit enlargement.
  3. Finally, segmenting new subscribers based on their behavioral patterns and identifying their registration intent helps to maintain long-term relationships with them.

Data

The project team extracted log data from a Hadoop system that records mobile app users’ service usage behavior. Then, the team pre-processed the data and imported it into Disco. ‘User Sequence Number’ and ‘Menu Name’ were configured as case id and activity, respectively.

Due to Disco’s full Unicode support, the team could easily understand the discovered process map with the activity names in Korean. Furthermore, with the help of Disco’s powerful filters a lot of the pre-processing could be done in the process mining tool itself, which reduces the time and effort for the overall process mining analysis.

Results

When the data analysis team uses a general web log analyzer, it can identify a certain page that a user visited, and its previous and subsequent pages. In contrast, process mining provides an end-to-end process map, repetition patterns, and the duration between pages (menus). Therefore, the team could identify exactly how users use the mobile app service.

By employing the process mining capabilities of Disco, the team analyzed the customer journeys of new users and categorized them, based on their usage pattern, into five customer segments.

Segment 1 is the group of customers who paid a fee for the music service. The process map of this segment is shown in Figure 2 (see next page). The rectangles represent the activities (here, menu names) and the arrows between them show the order in which the pages were visited by the customers. The darker the activities and the thicker the arrows, the more frequently these parts of the process are followed.

Figure 2: Simplified process map of the page flow for the first customer segment (note that the English page names were overlaid for clarity; furthermore some activity names as well as the frequency and performance metrics have been redacted for confidentiality reasons).

Segments 2-5 are the customer groups who did not pay for the music service. The team discovered their process maps and was able to clearly identify the customers’ registration intent through the maps. Based on these insights from the process mining analysis, strategies to increase the CVR have been developed.

Impact

The team judges that the process mining project was a full success. It divided new users into five (previously unidentified) customer segments. For each segment, they could clearly identify the registration intent and the key pages that were visited.

Now, the team is planning to conduct a targeted marketing campaign, customized for each segment, on the key pages that each segment visited frequently. After conducting the campaign, the team will identify how much each segment’s CVR has improved. For the CVR targets that are not achieved, the team will perform a process mining analysis to analyze the customer behavior and find out the root causes of why the target CVR was not achieved. After this initial project, Melon’s process mining analyses using Disco have now become a daily improvement activity.

——————————–

Download Case Study: Customer Journey Mining

You can download this case study as a PDF here for easier printing or sharing with others.

Process Mining Transformations — Part 1: Unfold Loops for Cases

Ideally, your data is in perfect shape and you can immediately use it for your process mining analysis without any changes. Unfortunately, there are many situations where this is not the case and you actually need to prepare your data set a little bit to be able to answer your analysis questions.

In this series, we will be looking at typical process mining data transformation tasks. Via step-by-step instructions, we will show you exactly how you can accomplish these data preparation steps for your own data:

Part 1: Unfold Loops for Cases (this article)
Part 2: Unfold Loops for Activity Repetitions
Part 3: To be continued…

Unfold Loops for Cases

If you have a ‘loop’ in your process then this means that a certain process step is repeated more than once. While, strictly speaking, the term loop refers to what is also called a ‘self-loop’ (a direct repetition), the term is typically more loosely used to refer to cycles in general in the context of process maps.

Loops are often interesting for a process mining analyst, because they help to spot rework and inefficiencies in the process (see our article on how to identify rework in process mining here).

But sometimes, loops can also get in the way of answering your process mining questions. For example, imagine a process where a tool such as a heavy-duty power drill can be rented for specialized construction work. To trace the movement of the tools, a barcode has been attached to each drill. The barcode provides a unique identifier for each tool and serves as our process mining case ID.

In addition, the following status changes are tracked with a timestamp for each tool: ‘Pickup’ (a tool is picked up by a customer), ‘Return’ (the tool is returned by the customer), ‘Ready for pickup’ (the tool is back in the store and available for a new rental cycle by a new customer), and ‘Intervention’ (the tool needs to be repaired).

The process map below shows the process that is discovered for this data set by Disco (click on the image to see a larger version).1

As you can see in the process map by following the thick paths, there is a very dominant loop in this process: Each of the 31,592 tools is picked up, returned, and prepared for the next customer several times (see the red arrow that points to the place where the tool rental cycle is restarted for the next customer).

The problem with this loop is that some questions cannot be answered from this process perspective. For example, what if you want to know:

How many times did it take more than two days before a tool was ready for pickup after it was returned by the customer?

Right now, we can only answer this question based on how many tools took more than two days at least once between ‘Return’ and ‘Ready for pickup’, because the tool’s barcode is currently our case ID.

To understand how many times in total a tool took more than two days between ‘Return’ and ‘Ready for pickup’ we need to shift the case ID perspective from the tool ID to a single rental cycle. But to do this, we need a “rental cycle counter” for each tool.

Here is how you can achieve this and break up a loop in your process into multiple case IDs.

Step 1: Sort your dataset

In this first step, you need to make sure that your data is sorted based on your case ID (here the tool’s barcode) and the timestamps. It is not important that the case IDs themselves are in a particular order, but all events that belong to the same case need to be grouped together and appear in the right chronological sequence.

There are several ways to do this. For example, you can sort the data in Excel, in your database, or via an ETL tool. But the simplest way of all is to just import your data into Disco and export it as a CSV file again. You will see that the result is a neatly sorted event log.
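
If you prefer to do the sorting in a script, a minimal pandas sketch could look like this (the file and column names are assumptions based on this example):

import pandas as pd

# Load the raw event log (file and column names are assumptions)
df = pd.read_csv("tool_rentals.csv", parse_dates=["Timestamp"])

# Group all events of the same tool together and order them chronologically;
# the stable sort keeps the original order of events with identical timestamps
df = df.sort_values(["Case ID", "Timestamp"], kind="stable")

df.to_csv("tool_rentals_sorted.csv", index=False)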

Step 2: Transform your data

When you look at the sorted data set (see below), you can see how a single tool ID (here ‘Case 10’) goes through multiple cycles of ‘Pickup’, ‘Return’, and ‘Ready for pickup’.

To be able to analyze each rental cycle separately, this loop needs to be broken up into multiple case IDs: We want to start a new case each time that the cycle repeats again. So, in addition to knowing that the drill with the barcode ‘Case 10’ was rented out, we also want to know whether it was rented out the first, the second, or the 100th time.

Because we do not have such a rental cycle counter yet, we will add it ourselves in this data transformation step. I have used a Python script to generate the sequence counter. But you can do the same with a Visual Basic script or any other programming language of your preference.

To preserve the flexibility to decide later where exactly the rental cycle restarts (at ‘Pickup’, ‘Return’, or ‘Ready for pickup’?), I have simply added a loop counter for each of these activities.

Here is my Python code snippet:

import csv

previous_caseID = 0
Seqnr1 = 0  # repetition counter for 'Pickup'
Seqnr2 = 0  # repetition counter for 'Return'
Seqnr3 = 0  # repetition counter for 'Ready for pickup'

print("Start data transformation")

infile = open('tool_rentals.csv', 'r', newline='')
csv_f = csv.reader(infile)

ofile = open('result.csv', 'w', newline='')
writer = csv.writer(ofile, delimiter=',', quoting=csv.QUOTE_NONE, escapechar='\\')

for row in csv_f:
    current_caseID = row[0]
    current_activity = row[2]

    if str(previous_caseID) != str(current_caseID):
        # a new case starts: reset the sequence numbers
        Seqnr1 = 0
        Seqnr2 = 0
        Seqnr3 = 0

    if str(current_activity) == 'Pickup':
        Seqnr1 = Seqnr1 + 1

    if str(current_activity) == 'Return':
        Seqnr2 = Seqnr2 + 1

    if str(current_activity) == 'Ready for pickup':
        Seqnr3 = Seqnr3 + 1

    # if it's the header row then write the header row
    if current_caseID == 'Case ID':
        # write the header
        mylist = [row[0], row[1], 'Repetition_of_pickup', 'Repetition_of_return', 'Repetition_of_ready_for_pickup', row[2]]
    else:
        # write the values
        mylist = [row[0], row[1], str(Seqnr1), str(Seqnr2), str(Seqnr3), current_activity]

    # write the row to the output csv file
    writer.writerow(mylist)

    # update the case ID for the next iteration
    previous_caseID = current_caseID

print("Transformation completed")

# close the file readers/writers
infile.close()
ofile.close()

The result of this transformation is a new data set with three additional columns, which count the number of repetitions for the activities ‘Pickup’, ‘Return’ and ‘Ready for pickup’ for each case, respectively (see below).

Step 3: Pick the right perspective and analyze

Let’s say that we want to start a new rental cycle with each ‘Pickup’ activity. This means that, for example, the case with the tool ID ‘Case 10’ should be broken up into multiple cases such as ‘Case 10-0’ (no ‘Pickup’ has occurred yet), ‘Case 10-1’ (the drill has been picked up once), ‘Case 10-2’ (the drill has been picked up a second time), ‘Case 10-3’ (the drill has been picked up a third time), etc.

Each of these cases are much shorter (see the red arrows in the screenshot below) than the previous, very long case ‘Case 10’.

Now that we have added the repetition counter columns, taking this perspective is easy: We can simply configure both the ‘Case ID’ column (this is the tool ID from the barcode) and the new ‘Repetition_of_pickup’ column as a Case ID column in the import step (note the little Case ID symbol in the header row of both columns in the screenshot below).
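
If your tool of choice does not support such composite case IDs, you can also pre-compute the combined ID yourself before the import. A minimal sketch, assuming the transformed data set from Step 2 is loaded as a pandas data frame:

# Combine the tool ID and the pickup repetition counter into one case ID,
# e.g., 'Case 10' with repetition counter 2 becomes 'Case 10-2'
df["Rental Cycle ID"] = (
    df["Case ID"].astype(str) + "-" + df["Repetition_of_pickup"].astype(str)
)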

After importing the data into Disco, we remove all tool rental cycle cases that do not start with the ‘Pickup’ activity or that do not reach the ‘Ready for pickup’ activity in their cycle (see our article on ‘how to deal with incomplete cases’ here). This leaves us with 261,594 rental cycles for all tools together (see below).

Out of these 261,594 cases, we can now answer our original question and determine how many times a tool was not ‘Ready for pickup’ again after the ‘Return’ activity within two days. One way to answer this question is to use the Follower filter (see screenshot below).

After applying this filter, we can see that in 83% of the cases it took more than two days2 to have the tool ready for pickup again (see below).
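
As a cross-check outside of Disco, the same number can be computed directly from the transformed data set. A sketch under the assumption that each rental cycle contains at most one ‘Return’ and one ‘Ready for pickup’ event, using the hypothetical ‘Rental Cycle ID’ column from above and with the incomplete cycles already filtered out:

import pandas as pd

# One row per rental cycle with the timestamp of each activity
pivot = df.pivot_table(index="Rental Cycle ID", columns="Activity",
                       values="Timestamp", aggfunc="first")

# Share of cycles where the tool took more than two days to be ready again
late = (pivot["Ready for pickup"] - pivot["Return"]) > pd.Timedelta(days=2)
print(f"{late.mean():.0%} of the rental cycles took more than two days")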

So, if having the tool ready for pickup within two days is our ambition, then currently only 17% of the rental cycles meet this goal and we need to find ways to improve our process.


  1. Note that this process has multiple start and end points, because the data set was extracted for a certain timeframe. Different tools were in different stages of the rental cycle at the beginning and at the end of the data set.  
  2. Note that we are looking at calendar days in this example. If we wanted to analyze this question based on business days, we could do this by removing weekends and holidays using the TimeWarp functionality in Disco as shown here.  
Process Mining Camp on 19 & 20 June — Save the Date!

Have you always wanted to meet other process miners in person? Perhaps you followed the MOOC and would like to share your experiences with people who are also just starting out. Or you have already worked with process mining for several years and now you want to learn from other organizations about how they made the next step?

Open your agenda right now and mark the date: Process Mining Camp takes place again on 19 & 20 June in Eindhoven1 this year!

Process Mining Camp is not your run-of-the-mill, corporate conference but a community meet-up with a unique flair. The campers are really nice people who do not just brag about their successes but also share their pitfalls and failures, from which you can learn even more than from stories that go well. In addition, you will get lots of ideas about new approaches and use cases that you have not considered before.

For the seventh time, process mining enthusiasts from all around the world will come together in the birthplace of process mining. Last year, more than 220 people from 24 different countries came to camp to listen to their peers, share their ideas and experiences, and make new friends in the global process mining community.

Like last year, this year’s Process Mining Camp will run for two days:

  • The first day (19 June) will be a day full of inspiring practice talks from different companies, as you have seen from previous camps.
  • On the second day (20 June), we will have a hands-on workshop day. Here, smaller groups of participants will get the chance to dive into various process mining topics in depth, guided by an experienced expert.

Mark these dates in your calendar and sign up for the camp mailing list here to be notified when ticket sales open! Even if you can’t make it this year, you should sign up to receive the presentations and video recordings as soon as they become available.

We can’t wait to see you in Eindhoven on 19 June!


  1. Eindhoven is located in the south of the Netherlands. Next to its local airport, it can also be reached easily from Amsterdam’s Schiphol airport (direct connection from Schiphol every 15 minutes, the journey takes about 1h 20 min).  
Privacy, Security and Ethics in Process Mining — Part 4: Establish a Collaborative Culture

This is the 4th and last article in our series on privacy, security and ethics in process mining. You can find an overview of all articles in the series here.

Perhaps the most important ingredient in creating a responsible process mining environment is to establish a collaborative culture within your organization. Process mining can make the flaws in your processes very transparent, much more transparent than some people may be comfortable with. Therefore, you should include change management professionals, for example, Lean practitioners who know how to encourage people to tell each other “the truth”, in your team (see also our article on Success Criteria for Process Mining).

Furthermore, be careful how you communicate the goals of your process mining project and involve relevant stakeholders in a way that ensures their perspective is heard. The goal is to create an atmosphere, where people are not blamed for their mistakes (which only leads to them hiding what they do and working against you) but where everyone is on board with the goals of the project and where the analysis and process improvement is a joint effort. 


Do: 


  • Make sure that you verify the data quality before going into the data analysis, ideally by involving a domain expert already in the data validation step (see Data Validation Session). This way, you can build trust among the process managers that the data reflects what is actually happening and ensure that you have the right understanding of what the data represents.
  • Work in an iterative way and present your findings as a starting point for discussion in each iteration. Give people the chance to explain why certain things are happening and let them ask additional questions (to be picked up in the next iteration). This will help to improve the quality and relevance of your analysis as well as increase the buy-in of the process stakeholders in the final results of the project. 


Don’t:


  • Jump to conclusions. You can never assume that you know everything about the process. For example, slower teams may be handling the difficult cases, people may deviate from the process for good reasons, and you may not see everything in the data (for example, there might be steps that are performed outside of the system). By consistently using your observations as a starting point for discussion, and by allowing people to join in the interpretation, you can start building trust and the collaborative culture that process mining needs to thrive.
  • Force any conclusions that you expect, or would like to have, by misrepresenting the data (or by stating things that are not actually supported by the data). Instead, keep track of the steps that you have taken in the data preparation and in your process mining analysis. If there are any doubts about the validity or questions about the basis of your analysis, you can always go back and show, for example, which filters have been applied to the data to come to the particular process view that you are presenting.
Privacy, Security and Ethics in Process Mining — Part 3: Anonymization

This is the 3rd article in our series on privacy, security and ethics in process mining. You can find an overview of all articles in the series here.

If you have sensitive information in your data set, instead of removing it you can also consider the use of anonymization techniques. When you anonymize a set of values, then the actual values (for example, the employee names “Mary Jones”, “Fred Smith”, etc.) will be replaced by another value (for example, “Resource 1”, “Resource 2”, etc.).

If the same original value appears multiple times in the data set, then it will be replaced with the same replacement value (“Mary Jones” will always be replaced by “Resource 1”). This way, anonymization allows you to obfuscate the original data but it preserves the patterns in the data set for your analysis. For example, you will still be able to analyze the workload distribution across all employees without seeing the actual names.
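
Conceptually, this kind of anonymization is just a consistent lookup table that is built up while scanning through the data. A minimal Python sketch of the idea (the file and column names are assumptions):

import csv
import itertools
from collections import defaultdict

# Each distinct original value gets a stable replacement value,
# assigned in the order of first appearance
numbers = itertools.count(1)
mapping = defaultdict(lambda: f"Resource {next(numbers)}")

with open("event_log.csv", newline="") as infile, \
        open("event_log_anonymized.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # 'Mary Jones' always becomes the same 'Resource N'
        row["Resource"] = mapping[row["Resource"]]
        writer.writerow(row)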

Some process mining tools (Disco and ProM) include anonymization functionality. This means that you can import your data into the process mining tool and select which data fields should be anonymized. For example, you can choose to anonymize just the Case IDs, the resource name, attribute values, or the timestamps. Then you export the anonymized data set and you can distribute it among your team for further analysis. 


Do:

  • Determine which data fields are sensitive and need to be anonymized (see also the list of common process mining attributes and how they are impacted if anonymized below).
  • Keep in mind that despite the anonymization certain information may still be identifiable. For example, there may be just one patient having a very rare disease, or the birthday information of your customer combined with their place of birth may narrow down the set of possible people so much that the data is not anonymous anymore.

Don’t: 


  • Anonymize the data before you have cleaned your data, because after the anonymization the data cleaning may not be possible anymore. For example, imagine that slightly different customer category names are used in different regions but they actually mean the same. You would like to merge these different names in a data cleaning step. However, after you have anonymized the names as “Category 1”, “Category 2”, etc. the data cleaning cannot be done anymore.
  • Anonymize fields that do not need to be anonymized. While anonymization can help to preserve patterns in your data, you can easily lose relevant information. For example, if you anonymize the Case ID in your incident management process, then you cannot look up the ticket number of the incident in the service desk system anymore. By establishing a collaborative culture around your process mining initiative (see part 4) and by working in a responsible, goal-oriented way, you can often work openly with the original data that you have within your team.

Anonymization of Common Process Mining Fields

Here is an overview of the typical process mining attributes and why you might want (or might not want) to anonymize them: 


Resource name

Removing the names of the employees working in the process is one of the more common anonymization steps. It can help to decrease friction and put employees more at ease when you involve them in a joint analysis workshop. Anonymizing employee names certainly is a must if you make your data publicly available in some form.

Be aware that it may still be possible to trace back individual employees. For example, if you look up a concrete case based on the case ID in the operational system, you will see the actual resource names there.

Finally, keep in mind that anonymizing employee names for an internal process mining analysis also removes valuable information. For example, if you identify process deviations or an interesting process pattern, normally the first step is to speak with the employees who were involved in this case to understand what happened and learn from them. 


Case ID

Anonymizing the case ID is a must if it contains sensitive information. For example, if you analyze the income tax return process at the tax office, then the case ID will be a combination of the social security number of the citizen and the year of the tax declaration. You will have to replace the social security information for obvious reasons.

However, for data sets where the case ID is less sensitive it is a good idea to keep it in place as it is. The benefit will be that you can look up individual cases in the operational system to verify your analysis or obtain additional information. Losing this link will limit your ability to perform root cause analyses and take action on the process problems that you discover. 


Activity name

Normally, you would not anonymize the activity name itself. The activities are the process steps that appear in the process map and in the variant sequences in the Process Mining tool. The reason why you do not want to replace the activity names by, for example, “Activity 1”, “Activity 2”, “Activity 3”, etc., is that most processes become very complex very quickly and without the activity names you have no chance to build a mental model and understand the process flows you are analyzing. Your analysis becomes useless.

Keeping the activity names in full is usually not a problem, because they describe a generic process step (like “Email sent”). However, especially if you have many different activity names in your data, you should review them to ensure they contain no confidential information (e.g., “Email sent by lawyer X”).

Other Attributes

Sensitive information is often contained in additional attribute columns. For example, even if you are analyzing an internal ordering process, there might be additional data fields revealing information about the customer.

You can either completely remove data columns that you don’t need, or you can anonymize their values. Keep the attribute columns that are not sensitive in their original form, because they can contain important context information when you inspect individual cases during your Process Mining analysis.

Finally, be aware that sensitive information can also be hidden in a ‘Notes’ attribute or some other kind of free-text field, where the employees write down additional information about the case or the process step. Simply anonymizing such a free-text field would be useless, because the whole text would be replaced by “Value 1”, “Value 2”, etc. To preserve the usefulness of the free-text field while removing sensitive information requires more work in the data pre-processing step and is not something that process mining tools can do for you automatically. 


Timestamps

Sometimes, the time at which a particular activity happened already reveals too much information and would make it possible to identify one of your business entities in an unwanted way. In such situations, you can anonymize the timestamps by applying an offset. This means that a certain number of days, hours, and minutes will be added to the actual timestamps to create new (now anonymized) timestamps.
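
A minimal pandas sketch of such an offset (the offset value and the column names are arbitrary assumptions):

import pandas as pd

df = pd.read_csv("event_log.csv", parse_dates=["Timestamp"])

# Shift all timestamps by the same fixed offset: the durations between
# events are preserved, but the absolute dates and times are obfuscated
offset = pd.Timedelta(days=37, hours=5, minutes=23)
df["Timestamp"] = df["Timestamp"] + offset

df.to_csv("event_log_anonymized.csv", index=False)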

Keep in mind that some of the process patterns may change when you analyze data sets with anonymized timestamps. For example, you might see activities appear at other times of the day than you would see in the original data set. For this reason, timestamp anonymization is mostly used if data sets are prepared for public release and not if you analyze a process within your company.

Privacy, Security and Ethics in Process Mining — Part 2: Responsible Handling of Data

This is the 2nd article in our series on privacy, security and ethics in process mining. You can find an overview of all articles in the series here.

As with any other data analysis technique, you must be careful with the data once you have obtained it. In many projects, nobody thinks about the data handling until it is brought up by the security department. Be the person who thinks about the appropriate level of protection and has a clear plan in place even before the data is collected.

Do:

  • Have external parties sign a Non Disclosure Agreement (NDA) to ensure the confidentiality of the data. This holds, for example, for consultants you have hired to perform the process mining analysis for you, or for researchers who are participating in your project. Contact your legal department for this. They will have standard NDAs that you can use.
  • Make sure that the hard drive of your laptop, external hard drives, and USB sticks that you use to transfer the data and your analysis results are encrypted.

Don’t:

  • Give the data set to your co-workers before you have checked what is actually in the data. For example, it could be that the data set contains more information than you requested, or that it contains sensitive data that you did not think about. For instance, the names of doctors and nurses might be mentioned in a free-text medical notes attribute. Make sure you remove or anonymize (see part 3) all sensitive data before you pass it on.
  • Upload your data to a cloud-based process mining tool without checking that your organization allows you to upload this kind of data. Instead, use a desktop-based process mining tool (like Disco or ProM) to analyze your data locally, or get the cloud-based process mining vendor to set up an on-premise version of their software within your organization. This is also true for cloud-based storage services like Dropbox: Don’t just store data or analysis results in the cloud even if it is convenient.
Privacy, Security and Ethics in Process Mining — Part 1: Clarify Your Goal

[This article previously appeared in the Process Mining News – Sign up now to receive regular articles about the practical application of process mining.]

When I moved to the Netherlands 12 years ago and started grocery shopping at one of the local supermarket chains, Albert Heijn, I initially resisted getting their Bonus card (a loyalty card for discounts), because I did not want the company to track my purchases. I felt that using this information would help them to manipulate me by arranging or advertising products in a way that would make me buy more than I wanted to. It simply felt wrong.

The truth is that no data analysis technique is intrinsically good or bad. It is always in the hands of the people using the technology to make it productive and constructive. For example, while supermarkets could use the information tracked through the loyalty cards of their customers to make sure that we have to take the longest route through the store to get our typical items (passing by as many other products as possible), they can also use this information to make the shopping experience more pleasant, and to offer more products that we like.

Most companies have started to use data analysis techniques to analyze their data in one way or the other. These data analyses can bring enormous opportunities for the companies and for their customers, but with the increased use of data science the question of ethics and responsible use also grows more dominant. Initiatives like the Responsible Data Science seminar series1 take on this topic by raising awareness and encouraging researchers to develop algorithms that have concepts like fairness, accuracy, confidentiality, and transparency built in2.

Process Mining can provide you with amazing insights about your processes, and fuel your improvement initiatives with inspiration and enthusiasm, if you approach it in the right way. But how can you ensure that you use process mining responsibly? What should you pay attention to when you introduce process mining in your own organization?

In this article series, we provide you with four guidelines that you can follow to prepare your process mining analysis in a responsible way.

1. Clarify Goal of the Analysis (this article)
2. Responsible Handling of Data
3. Consider Anonymization
4. Establish a Collaborative Culture

1. Clarify Goal of the Analysis

The good news is that in most situations Process Mining does not need to evaluate personal information, because it usually focuses on the internal organizational processes rather than, for example, on customer profiles. Furthermore, you are investigating the overall process patterns. For example, a process miner is typically looking for ways to organize the process in a smarter way to avoid unnecessary idle times rather than trying to make people work faster.

However, as soon as you would like to better understand the performance of a particular process, you often need to know more about other case attributes that could explain variations in process behaviors or performance. And people might become worried about where this will leave them.

Therefore, already at the very beginning of the process mining project, you should think about the goal of the analysis. Be clear about how the results will be used. Think about what problem you are trying to solve and what data you need to solve this problem.

Do:

  • Check whether there are legal restrictions regarding the data. For example, in Germany employee-related data cannot be used and typically simply would not be extracted in the first place. If your project relates to analyzing customer data, make sure you understand the restrictions and consider anonymization options (see part 3).
  • Consider establishing an ethical charter (see an example charter contributed by Léonard Studer here) that states the goal of the project, including what will and what will not be done based on the analysis. For example, you can clearly state that the goal is not to evaluate the performance of the employees. Communicate to the people who are responsible for extracting the data what these goals are and ask for their assistance to prepare the data accordingly.

Don’t:

  • Start out with a fuzzy idea and simply extract all the data you can get. Instead, think about what problem you are trying to solve and what data you actually need to solve it. Your project should focus on business goals that can get the support of the process managers you work with (see part 4).
  • Make your first project too big. Instead, focus on one process with a clear goal. If you make the scope of your project too big, people might block it or work against you while they do not yet even understand what process mining can do.

  1. Responsible Data Science (RDS) initiative: http://www.responsibledatascience.org  
  2. Watch Wil van der Aalst’s presentation on Responsible Data Science at Process Mining Camp 2016: https://www.youtube.com/watch?v=ewQbmINuXeU