The CeBIT is the world’s largest and most international computer expo and takes place next week in Hannover, Germany. We are excited to be there and to have the opportunity to introduce many more people to process mining, and to show them Disco live in action.
We will be there the whole week from 16–20 March at Hall 3 Stand H36. We have also been invited to give daily process mining lectures as part of the gfo-Symposium, which features a broad range of process analysis and process management topics.
You can see the full program of the gfo-Symposium (in German) here.
The exact times of our process mining lectures are shown here.
If you would like to attend the CeBIT but do not have a ticket yet, just let us know and we can arrange a free ticket for you.
For those of you coming to Hannover next week, make sure to stop by at Hall 3 Stand H36 and say hello!
This is the second part in a series about managing complexity in process mining. We recommend reading Part I first if you have not seen it yet.
Part II: Remove Incomplete Cases
Removing incomplete cases seems like a pre-analysis clean-up step, but read on to learn why it is also relevant as a simplification strategy.
Strategy 3) Remove Incomplete Cases
Imagine you just got a new data set and simply want to make a first process map. You typically do not want to get into a detailed analysis right away. For example, you often want to first validate that the extracted data is right, or you might need to quickly show the process owner a first picture of what the discovered process looks like.
Obviously, a complex process map gets in your way when you try to do that.
Now, while filtering incomplete cases is a typical preparation step for your actual analysis, you might also want to check whether you have incomplete cases to get a simpler process map. Here is why.
In many cases, the data that is freshly extracted from the IT system contains cases that are not yet finished. They are in a certain state now, and if we waited longer, new process steps would appear. The same can happen with incomplete start points of the process (things may have happened before the data extraction window).
For the analysis of, for example, process durations, it is very important to remove incomplete cases, because otherwise you will be treating half-finished cases as “particularly fast”, wrongly reducing the average process duration. But incomplete cases can also inflate your process map layout by adding many additional paths to the process end point.
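To see the effect on the numbers, consider this small Python sketch (the event log, case names, and timestamps are made up for illustration; this is not how Disco works internally). The unfinished case looks deceptively fast and drags down the mean duration:

```python
from datetime import datetime
from statistics import mean

# Hypothetical event log: case id -> chronological list of (activity, timestamp)
log = {
    "case1": [("Order created", datetime(2015, 3, 1)),
              ("Order completed", datetime(2015, 3, 11))],   # 10 days
    "case2": [("Order created", datetime(2015, 3, 2)),
              ("Order completed", datetime(2015, 3, 14))],   # 12 days
    # Incomplete case: the data was extracted before it finished
    "case3": [("Order created", datetime(2015, 3, 9)),
              ("Invoice modified", datetime(2015, 3, 10))],  # only 1 day so far
}

def duration_days(events):
    return (events[-1][1] - events[0][1]).days

all_durations = [duration_days(e) for e in log.values()]
complete = [e for e in log.values() if e[-1][0] == "Order completed"]
complete_durations = [duration_days(e) for e in complete]

print(mean(all_durations))       # misleadingly fast (the incomplete case counts)
print(mean(complete_durations))  # 11 days, the true picture
```

The half-finished case contributes a 1-day "duration" to the average, even though it is still running.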
To understand why, take a look at the process map below. It shows that next to the regular end activity ‘Order completed’ there are several other activities that were performed as the last step in the process — showing up as dashed lines leading to the end point at the bottom of the map. For example, ‘Invoice modified’ was the last step in the process for 20 cases (see below). This does not sound like a real end activity for the process, does it?
To remove incomplete cases, you can just add an Endpoints filter in Disco and select the start and end activities that are valid start and end points in your process (see below).
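Conceptually, the Endpoints filter keeps only those cases whose first and last recorded activities are among the valid start and end activities. A minimal sketch of this idea in plain Python (the activity names and cases are hypothetical; Disco does this for you via the filter dialog):

```python
# Valid endpoints for our (made-up) order process
valid_starts = {"Order created"}
valid_ends = {"Order completed", "Order rejected"}

def is_complete(events):
    """events: chronologically ordered list of (activity, timestamp)."""
    return events[0][0] in valid_starts and events[-1][0] in valid_ends

cases = {
    "case1": [("Order created", 1), ("Order completed", 5)],
    "case2": [("Order created", 2), ("Invoice modified", 3)],  # incomplete
}

# Keep only cases with valid start and end activities
filtered = {cid: ev for cid, ev in cases.items() if is_complete(ev)}
print(sorted(filtered))  # ['case1']
```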
The resulting process map will be simpler, because the graph layout becomes simpler (see below).
So, even if you are in a hurry and not really in the analysis phase yet, it is worth trying to remove incomplete cases if you are faced with too much complexity in your process.
That was strategy No. 3. Watch out for Part III, where we explain how dividing up your data can help simplify your process maps.
Have you ever imported a data set into your process mining tool, only to get a complex “spaghetti” process? Often, real-life processes are so complex that the resulting process maps are too complicated to interpret and use.
For example, the process that you get might look like the picture above.
The problem with this picture is not that it is wrong; in fact, this is the true process if you look at it in its entirety. The problem is that this process map is not useful, because it is too complicated to derive any useful insights or actionable information from it.
What we need to do is to break this up and to simplify the process map to get more manageable pieces.
In this series, you will learn 9 simplification strategies for complex process maps that will help you get the analysis results that you need. We show you how you can apply these strategies in the process mining software Disco (download the free demo version from the Disco website to follow along with the instructions).
The 9 strategies are grouped into the following four parts. You can find the first two strategies in today’s article below. The remaining parts will be released over the coming days and linked from here.
Part I: Quick Simplification Methods (this article)
Part II: Remove Incomplete Cases
Part III: Divide and Conquer
Part IV: Leaving Out Details
Let’s get started!
Part I: Quick Simplification Methods
First, we look at two simplification methods that you can use to quickly get to a simpler process map.
Strategy 1) Interactive Simplification Sliders
The first one is to use the interactive simplification sliders that are built into the map view in Disco (see below).
The Disco miner is based on Christian’s Fuzzy Miner, which was the first mining algorithm to introduce the “map metaphor”, including advanced features like seamless process simplification and highlighting of frequent activities and paths. However, the Disco miner has been further developed in many ways.
One important difference is that if you pull both the Activities and the Paths sliders up to 100% then you see an exact representation of the process. The complete picture of the process is shown, exactly as it happened. This is very important as a reference point and a one-to-one match of your data to understand the process map.
However, without applying any of the simplification strategies discussed later, the complete process is often too complex to view at 100% detail.
This is where the interactive simplification sliders can give you a quick overview of the process. We recommend starting by pulling down the Paths slider, which gradually reduces the arcs in the process map by hiding less frequent transitions between activities.
At the lowest point, you only see the most important process flows, and you can see that the “spaghetti” process map from above has been simplified greatly, already yielding a very readable and understandable process map (see below).
What you will notice is that some of the paths that are shown can still be quite infrequent. For example, in the following fragment you see that there are two paths with a frequency of just 2 (see below). The reason is that the Paths simplification slider is smart enough to take the process context into account and sees that these paths connect the very infrequent activity ‘Request rejected L3’, which occurred just 4 times (see below). It would not be very useful to have infrequent activities “flying around”, disconnected from the rest of the process.
The Paths slider is very important, because it allows you to see everything that has happened in your process (all the activities that were performed), but still get a readable process map with the main flows between them.
Often, you will find that getting a quick process map with all the activities shown (Activities slider up at 100%) and only the main process flows (Paths slider down at lowest point, or slightly up, depending on the complexity of the process) will give you the best results.
However, if you have many activities, or if you want to further simplify the process map, you can also reduce the number of activities by pulling down the Activities slider (see below).
At the lowest point, the Activities slider shows you only the activities from the most frequent process variant (see also strategy No. 2 in the next section). This means that only the activities that were performed on the most frequent path from the very beginning to the very end of the process are shown. So, this really shows you the main flow of the process (now also abstracting from less frequent activities, not only from less frequent paths).
For example, the “spaghetti” process map from the beginning could be greatly simplified to just the main activities ‘Order created’ and ‘Missing documents requested’ by pulling down the Activities slider (see below).
Strategy 2) Focusing on the Main Variants
An alternative method to quickly get a simplified process map is to focus on the main variants of the process. You find the variants in the Cases view in Disco.
For example, one case from the most frequent variant (Variant 1) is shown in the screenshot below: There are just two activities in the process, first ‘Order created’ and then ‘Missing documents requested’ (so, most cases are actually, strangely, waiting for feedback from the customer, but we are not focusing on this at the moment).
If you look at the case frequencies and the percentages for the variants, then you can see that the most frequent variant covers 12.41% of the process, the second most frequent covers 5.16%, etc. What you will find in more structured processes is that the Top 5 or Top 10 variants often already cover 70–80% of your process. So, the idea is to directly leverage the variants to simplify the process.
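The variant statistics themselves are easy to reproduce outside of Disco. As a conceptual sketch in plain Python (the cases and activity names below are made up), a variant is simply a distinct activity sequence, and coverage is the share of cases that follow the top variants:

```python
from collections import Counter

# Hypothetical mapping from case id to its activity sequence;
# in Disco, a "variant" is a distinct sequence of activities.
cases = {
    "c1": ("Order created", "Missing documents requested"),
    "c2": ("Order created", "Missing documents requested"),
    "c3": ("Order created", "Order completed"),
    "c4": ("Order created", "Invoice modified", "Order completed"),
}

variants = Counter(cases.values())
total = sum(variants.values())

# Coverage of the Top 2 variants
top = variants.most_common(2)
coverage = sum(count for _, count in top) / total
print(f"{coverage:.0%}")  # 75%
```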
Note: This strategy only works for structured processes. In unstructured processes (for example, for patient diagnosis and treatment processes in a hospital, or for click-streams on a website) you often do not have any dominant variants at all. Every case is unique.
In such unstructured processes, variant-based simplification is completely useless, but the interactive simplification sliders from the previous section still work (they always work).
You can easily focus on the main variants in Disco by using the Variation filter (see below). For example, here we focus on the Top 5 variants by only keeping the variants that have a support of 50 cases or more.
Only the Top 5 variants are kept, and we see that these few (out of 446) variants cover 29% of the cases.
If you now switch back from the Cases view to the Map view, you can see the process map just for those 5 variants (see below).
The trick here is that, this way, you can easily create a process map with 100% detail (notice that both the Activities and Paths sliders are pulled up completely) – but of course only for the variants that are kept by the filter.
This method can be particularly useful if you need to quickly export a process map for people who are not familiar with process mining. If you export the process map with 100% detail then all the numbers add up (no paths are hidden) and you do not need to explain what “spaghetti” processes are and why the process map needs to be simplified. You can simply send them the exported PDF of the process map and say, for example, “This is how 80% of our process flows” (depending on how much of the process your variant selection covers).
Note, however, that less frequent activities are often hidden in the more exceptional variants, and you do not see them when you focus on the main variants. Use the interactive simplification sliders from the previous section to quickly get a simplified map with the complete overview of what happens in your process.
These were two quick simplification strategies. Watch out for Part II, where we explain how removing incomplete cases can help simplify your process maps.
We are happy to announce the immediate release of Disco 1.8.0!
This update to Disco adds a number of new functionalities, making your process analysis even more powerful and expressive. Beyond the new features, though, the focus of this release is to further improve the performance, stability, and robustness of Disco, and to provide a reliable and even more capable platform going forward.
Since we have reengineered the native integration of Disco from the ground up, this update cannot be installed automatically. Please go to fluxicon.com/disco and download the updated installer package for your platform in order to install the Disco 1.8.0 update.
If you would like to learn more about the new features in Disco 1.8.0, and the changes we have made under the hood, please keep on reading.
Process Map Animation is one of the most popular features in Disco. If you need to quickly demonstrate the power of process mining to a colleague, your manager, or a client, there is no better way to get their attention than showing them a process map come to life.
But animation is not just a showy demo feature that is nice to look at. It provides a dynamic perspective that makes understanding bottlenecks and process changes much easier. Synchronized animation, a new feature in Disco 1.8.0, adds a new dimension of insight to animation.
Regular animation in Disco replays your event log data on the current model, just as it happened in your data. In contrast, synchronized animation starts to replay all cases in your data at the same time. This allows you to analyze at what point in the case execution the hot spots and bottlenecks in your process are most prominent, and to compare your process performance across the set of cases in your data.
You can choose between regular and synchronized animation by right-clicking the animation button in Disco’s process map view.
Improved Median Support
In Disco 1.6.0, we introduced support for the median, both in process map duration and in the statistics view. In many situations, the median (also known as the 50th percentile) gives you a much better idea of the typical characteristics of a process than the arithmetic mean, especially for data sets that contain extreme outliers.
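The difference is easy to demonstrate. In the following Python snippet (with made-up duration values), a single extreme outlier dominates the mean, while the median still reflects the typical case:

```python
from statistics import mean, median

# Hypothetical activity durations in hours; one case is an extreme outlier.
durations = [2, 3, 3, 4, 4, 5, 400]

print(mean(durations))    # dominated by the outlier (~60 hours)
print(median(durations))  # 4 hours, the typical duration
```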
While the median is very useful for analysis, it is quite demanding to determine, both in terms of computing power and regarding memory requirements. So far, we have used a very advanced technique to compute medians in Disco, which can estimate the value of the median with a very low error margin, while keeping the memory requirements very low. This is important, because Disco needs to compute a lot of medians at the same time (for example, for a process map, we need to compute the median for each activity, and also for each path between them) and for huge data sets.
However, there are some situations, in which we have very few measurements for a median (for example, when an activity or path occurs only a few dozen times in the data). When those few measurements are very skewed, i.e. if they are very unevenly distributed, the computed median in Disco could differ significantly from the precise median. This is not a bug in the traditional sense, which is to say that the median estimation in Disco works as expected. Rather, this discrepancy reflects the skewed measurement space reflected in the data. Still, it can be confusing to the analyst, and as such we treated it as a bug.
To address this, in Disco 1.8.0, we have completely reengineered the computation of medians. We now use a new algorithm that can compute the precise median all over Disco with significantly reduced memory footprint. When you have a huge or complex data set, and Disco runs low on available memory, it will automatically transition selected medians to a more memory-efficient calculation method. By automatically selecting those medians, where the transition yields the lowest error in the estimated median, Disco ensures that, even when you are memory-constrained, you will get the best results possible for all your data.
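Disco's actual algorithm is not public, but the general idea of trading a little precision for a much smaller memory footprint can be sketched as follows: keep only a fixed-size random sample of the measurements (reservoir sampling) and estimate the median from the sample. This is a conceptual illustration only, not Disco's method:

```python
import random
from statistics import median

def reservoir_median(stream, k=1000, seed=42):
    """Estimate the median of a stream while holding at most k values in memory."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if len(sample) < k:
            sample.append(x)
        else:
            # Each element survives in the sample with probability k / (i + 1)
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x
    return median(sample)

# 200,000 measurements, but only k values held in memory at any time.
est = reservoir_median(range(200_000), k=1000)
print(est)  # close to the true median of 99999.5
```

With more skewed data or very small samples, the estimate can drift from the precise median, which mirrors the behavior described above.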
All median calculations that have been transitioned to the more memory-efficient calculation method are now highlighted throughout Disco by being prefixed with a tilde. For example, in the image above, the path with the “~ 142 milliseconds” median duration has been estimated, while the other paths (with “3.9 d” and “71.1 mins”) are precise. This makes it easy for the analyst to see which medians are precise and which have been estimated.
Unless you are working with very large data sets, you will probably never see an estimated median in Disco. And even when you do, in all likelihood the estimated median will differ only very slightly from the precise median, or not at all. And for those rare situations when you absolutely do require total precision of all medians in a huge data set, you can simply increase the memory available to Disco in the control center.
This new median calculation system in Disco 1.8.0 provides the best of both worlds. Wherever possible, you get an absolutely precise median with the minimum memory footprint and best system performance. Whenever that is not possible, Disco automatically reduces the precision for those measurement points where it makes the least difference. In that way, you will get nearly precise medians also for very large data sets. And the best part is, since Disco makes all these decisions automatically, you will never need to worry.
Minimum Duration Perspective
Analyzing the performance of a process in a process map is one of the most important and useful functionalities of Disco. For each activity and path, you can either display the total duration over all cases, inspect the typical duration using either the mean or median duration, or you can display the maximum duration observed in your data.
In Disco 1.8.0, we are adding the minimum duration for all activities and paths. This can be useful if you want to see the “best case scenario”, e.g. if you want to know how fast an activity can be completed if all goes well.
On the other hand, the minimum duration can also highlight problems. If, for example, an activity that checks for authorization from a manager has a minimum duration of only 10 milliseconds, you know that you are either dealing with a suspicious situation, such as fraud, or that there are problems recording your log data.
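As a simple illustration (with hypothetical measurements, not Disco's implementation), collecting the minimum duration per activity is straightforward, and a near-zero minimum for a manual check immediately stands out:

```python
from collections import defaultdict

# Hypothetical per-activity duration measurements (in seconds).
events = [
    ("Check authorization", 0.01),   # suspiciously fast for a manual step!
    ("Check authorization", 3600),
    ("Check authorization", 5400),
    ("Create order", 120),
    ("Create order", 90),
]

stats = defaultdict(lambda: {"min": float("inf"), "max": 0.0})
for activity, d in events:
    s = stats[activity]
    s["min"] = min(s["min"], d)
    s["max"] = max(s["max"], d)

# A near-zero minimum for an authorization check is a red flag worth investigating.
print(stats["Check authorization"]["min"])  # 0.01
```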
The minimum duration is available either from the drop-down menu in the Performance perspective, or by clicking on an activity or path, in Disco’s map view.
Disco 1.8.0 now fully supports Mac OS X devices with Retina screens. So, if you have a Mac with a Retina screen, every part of Disco will now look even better and razor-sharp.
On the Mac, Disco now also uses the latest version of Java, improving the performance, reliability, and security of using Disco on Mac OS X.
The 1.8.0 update also includes a number of other features and bug fixes, which improve the functionality, reliability, and performance of Disco. Please find a list of the most important further changes below.
- Improved CSV Import user interface performance and fidelity.
- Improved flexibility of timestamp parser when importing CSV data.
- Improved table view performance in the user interface.
- Improved diagnostics information that can be sent from feedback or error dialogs, for better and faster problem resolution.
- Fixed a bug that could prevent certain recipes from being loaded.
- Fixed a bug that could prevent loading logs with large numbers of cases and variants.
- Redesigned context dialog popovers.
- Improved launch process and OS integration for Windows and Mac OS X.
- Improved Overdrive performance when mining process maps on machines with multiple CPU cores.
- Improved performance of creating process map animations on machines with multiple CPU cores.
A Happy New Year everyone! We start the year by looking back to 2014 for our annual Process Mining at BPM post.
In 2014, there was an insane number of process mining papers at the BPM conference. As always, we have looked through all the main conference and workshop papers to find the ones that are related to process mining, and contacted the authors of the papers that were not yet publicly available.
You can find full-paper links to the publications below and we will keep adding new links from authors who have not responded yet. If we missed something please let us know.
The BPM conference is a very competitive conference, with hundreds of papers submitted to the main track and only around 20 of them accepted. It’s incredible that 14 of them fall into the process mining research area. Below you find the links to the papers and the slides, along with a short summary:
Discovering Target-Branched Declare Constraints by Claudio Di Ciccio, Fabrizio Maria Maggi, and Jan Mendling from Vienna University of Economics and Business, Austria, and University of Tartu, Estonia (download slides)
An alternative to discovering process models is to discover a set of declarative rules, restricting the allowed behavior “from the outside” rather than explicitly outlining the paths that are possible. However, a challenge for complex processes is that the discovery of declarative processes often also results in hundreds of constraints (another encounter of the so-called “spaghetti” problem). The work of Claudio and his colleagues addresses the explosion of branching constraints by mining Target-Branched constraints.
Crowd-Based Mining of Reusable Process Model Patterns by Carlos Rodríguez, Florian Daniel, and Fabio Casati from the University of Trento, Italy (download slides)
Rather than discovering process models from data, Carlos and his colleagues investigate the discovery of model patterns, for example to provide recommendations during process modeling. While there are automated methods such as frequent sub-graph mining, they explore an approach where the pattern identification is implemented through humans in a crowdsourcing environment. The approach is tested to discover data flow-based mashup models.
A Recommender System for Process Discovery by Joel Ribeiro, Josep Carmona, Mustafa Mısır, and Michele Sebag from Universitat Politècnica de Catalunya, Spain and TAO, INRIA Saclay – CNRS – LRI, Universite Paris Sud XI, Orsay, France (download slides)
There are dozens of different process mining algorithms with different strengths and weaknesses, and even built on different formalisms (e.g., Petri nets, BPMN, EPC, Causal nets). So, selecting the right one and using it correctly is a daunting task. Joel and his colleagues have worked out a recommender system to find the best discovery algorithm for the data at hand. This way, the users can get a recommendation for which algorithm to use. Log features such as the average trace length and measures such as fitness and precision are the basis for the recommendation.
Beyond Tasks and Gateways: Discovering BPMN Models with Subprocesses, Boundary Events and Activity Markers by Raffaele Conforti, Marlon Dumas, Luciano García-Bañuelos, and Marcello La Rosa from Queensland University of Technology, Australia, and University of Tartu, Estonia (download slides)
Existing process mining techniques generally produce flat process models. The authors developed a technique for automatically discovering BPMN models containing subprocesses (based on a set of attributes that includes keys to identify (sub)process instances, and foreign keys to identify relations between parent and child processes), interrupting and non-interrupting boundary events, and activity markers. The discovered process models are more modular, but also more accurate and less complex than those obtained with flat process discovery methods.
A Genetic Algorithm for Process Discovery Guided by Completeness, Precision and Simplicity by Borja Vázquez-Barreiros, Manuel Mucientes, and Manuel Lama from the University of Santiago de Compostela, Spain (download slides)
The authors present a new genetic process discovery algorithm with a hierarchical fitness function that takes into account completeness, precision and simplicity. The algorithm has been tested with 21 different logs and was compared with two state of the art algorithms.
Constructs Competition Miner: Process Control-Flow Discovery of BP-Domain Constructs by David Redlich, Thomas Molka, Wasif Gilani, Gordon Blair, and Awais Rashid from SAP Research Center Belfast, Lancaster University, and University of Manchester, United Kingdom (download slides)
A new process discovery algorithm is proposed that follows a top-down approach to directly mine a process model which consists of common business process constructs (in a language familiar to the business analyst rather than Petri nets or other languages preferred by academic scholars). The discovered process model represents the main behaviour and is based on a competition of the supported constructs.
Mining Resource Scheduling Protocols by Arik Senderovich, Matthias Weidlich, Avigdor Gal, and Avishai Mandelbaum from Technion, Israel and Imperial College London, United Kingdom (download slides)
Their contribution fits under the umbrella of operational process mining, similar to other techniques aiming to predict wait times and case completion times. The paper focuses on service processes, where performance analysis is particularly important, and does not only take the load information into account but also the order of activities that a service provider follows when serving customers. A data mining technique and one based on queueing heuristics are tested on a large real-life data set from the telecom sector.
Temporal Anomaly Detection in Business Processes by Andreas Rogge-Solti from Vienna University of Economics and Business, Austria and Gjergji Kasneci from Hasso Plattner Institute, University of Potsdam, Germany (download slides)
This paper focuses on temporal aspects of anomalies in business processes. The goal is to detect temporal outliers in activity durations for groups of interdependent activities automatically from event traces. To detect such anomalies, the authors propose a Bayesian model that can be automatically inferred from the Petri net representation of a business process.
A General Framework for Correlating Business Process Characteristics by Massimiliano de Leoni, Wil van der Aalst, and Marcus Dees from the University of Padua, Italy, Eindhoven University of Technology, The Netherlands, and Uitvoeringsinstituut Werknemersverzekeringen (UWV), The Netherlands (download slides)
The authors provide a general framework for deriving and correlating process characteristics and therewith unify existing ad-hoc solutions for specific process questions. First, they show how the desired process characteristics can be derived and linked to events. Then, they show that we can derive the selected dependent characteristic from a set of independent characteristics for a selected set of events.
The Automated Discovery of Hybrid Processes by Fabrizio Maria Maggi from University of Tartu, Estonia, Tijs Slaats from IT University of Copenhagen, Denmark, and Hajo A. Reijers from Eindhoven University of Technology, The Netherlands (download slides)
This paper presents an automated discovery technique for hybrid process models: Less-structured process parts with a high level of variability can be described in a more compact way using a declarative language. Procedural process modeling languages seem more suitable to describe structured and stable processes. The proposed technique discovers a hybrid process model, where each of its sub-processes may be specified in a declarative or procedural fashion, leading to overall more compact models.
Declarative Process Mining: Reducing Discovered Models Complexity by Pre-Processing Event Logs by Pedro H. Piccoli Richetti, Fernanda Araujo Baião, and Flávia Maria Santoro from the Federal University of the State of Rio de Janeiro, Brazil (download slides)
The authors present a new discovery approach for declarative models that aims to address the problem that existing declarative mining approaches still produce models that are hard to understand, both due to their size and to the high number of restrictions of the process activities. Their approach reduces declarative model complexity by aggregating activities according to inclusion and hierarchy semantic relations.
SECPI: Searching for Explanations for Clustered Process Instances by Jochen De Weerdt and Seppe vanden Broucke from KU Leuven, Belgium (download slides)
Trace clustering is an approach that groups similar process instances together, but it usually does not provide insight into the basis on which these groups were formed. This paper presents a technique that assists users in understanding a trace clustering solution by finding a minimal set of control-flow characteristics whose absence would prevent a process instance from remaining in its current cluster.
Business Monitoring Framework for Process Discovery with Real-Life Logs by Mari Abe and Michiharu Kudo from IBM Research, Tokyo, Japan (download slides)
This paper proposes a monitoring framework for process discovery that simultaneously extracts the process instances and metrics in a single pass through the event log. Instances of monitoring contexts are linked at runtime, which allows to build process models from different metrics without reading huge logs again.
Predictive Task Monitoring for Business Processes by Cristina Cabanillas, Claudio Di Ciccio, and Jan Mendling from the Institute for Information Business in Vienna, Austria, and Anne Baumgrass from the Hasso Plattner Institute at the University of Potsdam, Germany (download slides)
Event logs of running processes can be used as input for predictions around business processes. The authors extend this idea by also including misbehaviour patterns on the level of singular tasks associated with external events such as from GPS or RFID systems and demonstrate the use case based on a scenario from the smart logistics area.
The workshops always take place the day before the main conference starts; they are smaller, have a specific theme, and also provide the space to explore and discuss new ideas. Normally, the BPI workshop is the main target for process mining papers, but last year the theme ran like a red thread through almost all of the workshops:
The 10th International Workshop on Business Process Intelligence (BPI), as always, had lots of process mining contributions:
The 7th Workshop on Business Process Management and Social Software (BPMS2) focused on social software as a new paradigm and had one process mining paper:
The 3rd Workshop on Data- & Artifact-centric BPM (DAB) specializes on data-centric processes and also had a contribution in the process mining area:
The 2nd International Workshop on Decision Mining & Modeling for Business Processes (DeMiMoP) looked specifically into decisions in relation to processes and had three process mining papers:
The 3rd Workshop on Security in Business Processes (SBP) featured two process mining contributions plus a practitioner keynote on the topic:
Finally, the 3rd International Workshop on Theory and Applications of Process Visualization (TaProViz) also had a practitioner keynote on process mining and two more papers in this area:
More Process Mining
There was actually even more process mining going on than we can cover here. Andrea Burattin received the Best Process Mining PhD thesis award. There were demos. CKM Advisors won the BPI Challenge (the team of Gabriele Cacciola from the University of Calabria won the student challenge). The annual IEEE Task Force meeting took place. And we had an awesome process mining party.
What you can see from all the new contributions above is that process mining is a more active research area than ever before. It’s an exciting area to work in, and there are still so many topics that have not been addressed yet.
This year’s BPM conference takes place in Innsbruck. If you are a researcher, you should mark the deadlines and try to be there!
Get process mining news plus extra practitioner articles straight into your inbox
In the process mining news, we publish this list of collected process mining web links on the blog, with extra material in the e-mail edition.
Process Mining on the Web
Here are some pointers to new process mining discussions and articles, in no particular order:
To make sure you are not missing anything, here is a list of the upcoming process mining events we are aware of:
Would you like to share a process mining-related pointer to an article, event, or discussion? Let us know about it.
Happy holidays, everyone!
This is a guest post by John Hansen, Author of the blog www.processmining.dk, and Claudia Billing from Copenhagen Airports A/S. Both share their experience from applying process mining to a process at Copenhagen Airports based on Bag-tag data extracted from the Bag-tag system.
If you have a process mining case study that you would like to share as well, please contact us at email@example.com.
Process and Data
Everyone has dropped off and picked up luggage at the airport, but what happens behind the scenes?
Every bag that is checked in or transferred through the airport gets a bag-tag that contains valuable information about the destination flight. All bags are handled in the baggage sortation factory, ensuring that they end up on the right flight on time.
The Bag-tag is scanned multiple times on its way from check-in, through the baggage factory, and to the aircraft. Furthermore, and you may not know this: when customers arrive early at the airport, their luggage is not sent directly to the place where it will be picked up for loading onto the aircraft. Instead, it is first sent to a storage facility (a kind of “baggage hotel”) for some time before it is retrieved again.
The process needs to meet several performance KPIs. Because of the different scenarios (different destinations, without storage vs. with storage, etc.) the process can vary significantly, and the process mining project was started to take a closer look at what exactly the process looks like based on the Bag-tag scan data.
The approach that was taken in the project was to look at the results in iterative cycles, in close collaboration with the domain expert. This way, first analysis results were obtained in an exploratory manner and then refined in the following iterations.
For example, one challenge was to understand and simplify the data from a spaghetti-like process overview into meaningful details by filtering and slicing the process data.
Figure 1: Overall process (starting point of the analysis)
Figure 2: Detail-view of the process after applying filters for focusing on specific aspect
Also, different perspectives were taken on the data, which made it possible to explore different questions and analysis views. Overall, the knowledge about the desired process and the operational KPIs guided the analysis.
From the process map and the related process statistics, questions such as “Where are the bottlenecks?” and “Are those primarily in the baggage factory belts or in the surrounding events?” could be answered. Furthermore, the Bag Throughput KPI was analysed and possible reasons for discrepancies from the target values were determined.
In one of the analysis perspectives, the location where the bag was scanned, redirected, etc. was incorporated into the activity (see above). This perspective made it possible to easily see the performance in the process steps related to locations.
For example, the average number of minutes from the operated check-in to the bag being seen for the first time in the baggage factory. Or the average time luggage was stored due to early arrival. Information like this is valuable to get a full picture of the overall process, and having it right at hand is a huge advantage.
This overview then helped to identify the challenge areas and likely root causes. It also helped to rule out other root causes. For example, the process bottlenecks were generally not related to the baggage factory belt performance.
Although there was not a specific hypothesis to check prior to the process mining analysis, it was possible to identify valuable insights very quickly. It was a big advantage that not all the questions needed to be defined upfront. Instead, the Copenhagen Airports analysts particularly valued that while the main bag process was mapped out quickly, it was still possible to uncover and analyze variations from the main process in detail in an explorative way.
This way, it was possible to learn more about the process and discover new insights in each iteration. By seeing the actual process without assumptions, and digging into the actual process patterns that were discovered, analyses could be done much quicker and in much more detail than in a question-answer-based, traditional way.
In summary, the takeaway points are:
- An overall process overview was obtained quickly and interesting facts were easy to identify. For example, weekends have more circulations than other days.
- It was possible to identify likely reasons for KPI discrepancies.
- Because areas with potential process challenges could be identified prior to a more in-depth analysis, the analysis could be concentrated on those areas. This is in contrast to traditional approaches, where the process areas that are analysed in detail are not necessarily those with the most challenges.
- The easy and fast way of looking at the process from different perspectives (for example considering the locations vs. not considering the locations) revealed many new insights. The perspective could shift from KPIs and bottlenecks, to process performance related to locations.
- Root cause analyses could be done quickly based on the evidence. For example, the process bottlenecks were generally not related to the baggage factory belt performance.
- It was possible to compare process performance for special days (e.g. days with mechanical breakdowns) to average or good days.
- It was fast and easy to get an overview of the process performance.
- As with all data analyses, the process mining analysis is dependent on getting the right data, which was improved iteratively. It’s an advantage to start quickly with what you have and then to enhance the data in the iterative work.
John Hansen, Author of the blog www.processmining.dk
Claudia Billing, Copenhagen Airports A/S
This is a guest post by Nicholas Hartman (see further information about the author at the bottom of the page).
If you have a process mining article or case study that you would like to share as well, please contact us at firstname.lastname@example.org.
Data Preparation for Process Mining
This is the first in a four-part series on best practices in data preparation for process mining analytics. While it may be tempting to launch right into extensive analytics as soon as the raw data arrives, doing so without appropriate preparation will likely cause many headaches later. In the worst case, results could be false or have little relevance to real-world activities.
This series won’t cover every possible angle or issue, but it will focus on a broad range of practical advice derived from successful process mining projects. The 4 pieces in this series are:
- Human vs. Machine – Understanding the unintentional influence that people, processes and systems can have on raw event log data
- Are we on time? – Working with timestamps in event logs (spoiler alert: UTC is your friend)
- Are we seeing the whole picture? – Linking sub-process and other relevant contextual data into event logs
- Real data isn’t picture perfect – Missing data, changing step names and new software versions are just a few of the things that can wreak havoc on a process mining dataset… we’ll discuss weathering the storm
Part I: Human vs. Machine
Whenever we launch into a process mining project, our teams first identify and collect all the available log data from all the relevant event tracking and management systems. (We also aim to collect a lot of additional tangential data, but I’ll talk about that more in a subsequent post).
After loading the data into our analytical environment, but before diving into the analysis, we first closely scrutinize the event logs against the people, systems and processes these logs are meant to represent.
Just because the process manual says there are 10 steps doesn’t mean there are 10 steps in the event log. Even subtler, and potentially more dangerous from an analytical standpoint, is the fact that just because an event was recorded in the event log doesn’t mean that it translates into a meaningful process action in reality.
We consider this event log scrutiny one of the most important preparations for process mining. Failure to give this step a team’s full attention, and adjust processing mining approaches based on the outcome of this review, can quickly lead to misleading or just flat out wrong analytical conclusions.
We could write a whole book on all these different issues we’ve encountered, but below is a summary of some of the more common items we come across and things that anyone doing process mining is likely to encounter.
Within process mining output, we often refer to a loopback between two steps as ‘ping-pong’ behavior. Such behavior within a process is usually undesirable and can represent cases of re-work or an over-specialization of duties amongst teams completing a process. However, to avoid misidentifying such inefficiencies, a detailed understanding of how people, processes and systems interrelate is necessary before launching into the analysis.
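As a minimal sketch of what detecting such loopbacks could look like (the trace and activity names here are made-up examples, not from the actual project), a ping-pong is simply an A → B → A pattern between two distinct activities:

```python
def ping_pongs(trace):
    """Find loopbacks: any A -> B -> A pattern between two distinct activities."""
    return [
        (a, b)
        for (a, b, c) in zip(trace, trace[1:], trace[2:])
        if a == c and a != b
    ]

# Hypothetical trace with a ticket transfer bouncing between two teams:
trace = ["open", "team_A", "team_B", "team_A", "closed"]
print(ping_pongs(trace))  # [('team_A', 'team_B')]
```

Counting such pairs across all cases gives a first impression of where loopbacks concentrate, but as the text below explains, each pattern still needs to be interpreted in context.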
Take the following example on IT incident management tickets as illustrated in Figure 1:
Figure 1: A closed ticket that is re-opened and then re-closed is an example of ping-pong behavior.
In this case a ticket is closed, but then at a later date the status is changed back to opened and then closed again. Many would quickly make a hypothesis that the re-opening of the ticket was a result of the original issue not being correctly resolved. Indeed this is a common issue, often caused by support staff placed under pressure to meet overly simplistic key performance indicators (KPIs) that push for targets on the time to close a ticket, but that don’t also measure the quality or completeness of the work output.
During one recent project our team was investigating just such a scenario. However, because they had done the appropriate due diligence up front in investigating how people interacted with the process management systems, they also understood that there were some more benign behaviors that could produce the same event log pattern. We identified cases of tickets being re-opened and plotted the distribution of the time the ticket remained re-opened. The result (shown illustratively in Figure 2) revealed that there were two distinct distributions: one where the re-open period was very brief, and another with a much longer period of hours or days.
Figure 2: Distribution of the number of tickets relative to the length of the ticket’s re-open period.
Upon closer inspection we found that the brief re-open period was dominated by bookkeeping activities and was an unintended by-product of some nuances in the way that the ticketing system worked. Occasionally managers, or those who worked on the ticket, would need to update records on a previously closed ticket (e.g., to place the ticket into the correct issue or resolution category for reporting). However, once a ticket was ‘closed’ the system no longer allowed any changes to the ticket record. To get around this, system users would re-open the ticket, make the bookkeeping update to the record, and then re-close the ticket, often just a few seconds later.
Strictly from a process standpoint this still represented a ping-pong, and a potential inefficiency, but one very different from the type of re-work we were looking for. By understanding how human interaction with the process system was actually creating the event logs, we were able to proactively adjust our analytics to segment these bookkeeping cases within the analytics, in this case through a combination of the length of the re-opened period and some free text comments on the tickets.
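The segmentation by re-open duration described above can be sketched as follows. This is an illustrative example, not the project’s actual code: the event tuples and the 5-minute threshold are assumptions chosen for the example; in practice the threshold would be read off the observed distribution.

```python
from datetime import datetime, timedelta

# Hypothetical event log: (ticket_id, status, timestamp), sorted by time.
events = [
    ("T1", "closed",   datetime(2014, 5, 1, 10, 0)),
    ("T1", "reopened", datetime(2014, 5, 3, 9, 0)),
    ("T1", "closed",   datetime(2014, 5, 3, 9, 0, 20)),  # re-closed after 20 seconds
    ("T2", "closed",   datetime(2014, 5, 2, 14, 0)),
    ("T2", "reopened", datetime(2014, 5, 4, 8, 0)),
    ("T2", "closed",   datetime(2014, 5, 5, 16, 0)),     # re-closed after 32 hours
]

def reopen_durations(events):
    """Collect how long each ticket stayed re-opened."""
    open_since = {}
    durations = []
    for ticket, status, ts in events:
        if status == "reopened":
            open_since[ticket] = ts
        elif status == "closed" and ticket in open_since:
            durations.append((ticket, ts - open_since.pop(ticket)))
    return durations

# Split the bimodal distribution at a threshold to separate likely
# bookkeeping re-opens from genuine re-work candidates.
THRESHOLD = timedelta(minutes=5)
bookkeeping = [t for t, d in reopen_durations(events) if d <= THRESHOLD]
rework = [t for t, d in reopen_durations(events) if d > THRESHOLD]
print(bookkeeping, rework)  # ['T1'] ['T2']
```

In a real analysis the threshold would be validated against the plotted distribution (as in Figure 2) and combined with other signals, such as the free-text comments mentioned above.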
After performing the re-work analysis exclusive of bookkeeping activities, the team was able to identify and quantify major inefficiencies that were impacting productivity. In one particular case, almost a quarter of the ticket transfers between two teams were unnecessary, yet had repeatedly escalated into ping-pongs due to unclear ownership of particular roles, responsibilities and functions within that support organization.
Figure 3 highlights some other event log anomalies that can be caused by the way people and processes interact with the system generating the log files.
Figure 3: Examples of additional types of event log anomalies that can be caused by the way people interact with systems.
A – Skipping the Standard Process Entry Point
Event logs often show processes that do not start at the intended start point. It’s important to understand up front if this is an accurate representation of reality, or a quirk of how data is being recorded in the event logs.
An example of this can be found in loan processing protocols. The normal procedure might be that a loan request is originated and supporting documents added to the record before being sent off for the first round of document review. However, process mining may show that some loans skip this first step and their first appearance in the system is at the document review stage.
In this example, reasons for such observations could include:
- Loans from certain offices originate in a legacy system that then only creates a record in the main system starting at step 2
- Some special loans are still handled manually and passed off for document review without first logging an entry in the central system so the first record in this system starts at step 2
- Some loan requests are actually split into sub-requests during step 1, but some users forget to update child records with the parent request number, making it appear like these child loan requests start at step 2
If something similar is occurring in your process dataset, it is important to make sure that any analysis considers that the raw event logs around this point in the process give incomplete information relative to what’s happening in reality. Event logs often only record what’s happening directly within the system creating the log, while the process under study may also be creating data in other locations. There are some ways to fill in these missing gaps or otherwise adjust the analysis accordingly, which we’ll discuss in a later article.
B – Skipping Steps
Process logs also often skip steps. Analysis of such scenarios is often desirable because it can highlight deviations from standard procedure. However, the absence of a step in the event log doesn’t mean the step didn’t happen.
Returning to the earlier example of support desk tickets, teams that aren’t disciplined in keeping ticket records up to date will sometimes abandon a ticket record for a period, and then return at a later date to close out the ticket. This is another example of a behavior that’s often caused by an imbalanced focus on narrow KPIs (e.g., focusing too much on the time to close a ticket can cause teams to be very quick at closing tickets, but not recording much about what happened between opening and closure). A ticket may be ‘created’ but never ‘assigned’ or ‘in progress,’ instead jumping right to ‘closed.’ This course of action can occasionally be legitimate (e.g., in cases where a ticket is opened accidentally and then immediately closed), but before performing analysis it’s important to understand when and why the data shows such anomalies.
If this is the first time a dataset has been used to conduct process mining there’s a good chance that it will contain such regions of missing or thin data. Often, management teams are unaware that such gaps exist within the data and one of the most beneficial outputs of initial process analytics can be the identification of such gaps to improve the quality and quantity of data available for future ongoing analysis.
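A quick scan for such missing-step cases can be sketched as below. The lifecycle step names and the example traces are assumptions for illustration; any real ticketing system will have its own status vocabulary.

```python
# Expected lifecycle steps for a ticket (assumed names for this example).
EXPECTED = ["created", "assigned", "in progress", "closed"]

# Hypothetical traces: one complete case, one that jumps straight to 'closed'.
cases = {
    "T1": ["created", "assigned", "in progress", "closed"],
    "T2": ["created", "closed"],
}

def missing_steps(trace, expected=EXPECTED):
    """Return the expected steps that never appear in a case's trace."""
    seen = set(trace)
    return [step for step in expected if step not in seen]

anomalies = {c: missing_steps(t) for c, t in cases.items() if missing_steps(t)}
print(anomalies)  # {'T2': ['assigned', 'in progress']}
```

Reporting how often each step is missing, and for which teams or time periods, is one way to surface the data gaps mentioned above to management.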
C – Rapidly Progressing Through Steps
Related to the previous case are situations where a process quickly skips through a number of steps at a speed that is inconsistent with what’s expected. Some systems will not allow steps to be skipped, and thus users looking to jump an item ahead are forced to cycle through multiple statuses in quick succession.
Such rapid cycling through steps is often legitimate, such as when a system completes a series of automation steps.
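Flagging such rapid transitions is straightforward once timestamps are available. The sketch below uses an assumed 10-second threshold and made-up status names; the appropriate cutoff depends on what speed is plausible for manual work in the system under study.

```python
from datetime import datetime, timedelta

# Hypothetical trace: statuses cycled through within a few seconds.
trace = [
    ("created",     datetime(2014, 5, 1, 9, 0, 0)),
    ("assigned",    datetime(2014, 5, 1, 9, 0, 2)),
    ("in progress", datetime(2014, 5, 1, 9, 0, 3)),
    ("closed",      datetime(2014, 5, 1, 9, 0, 5)),
]

def rapid_transitions(trace, max_gap=timedelta(seconds=10)):
    """Return consecutive step pairs completed faster than max_gap."""
    return [
        (a, b)
        for (a, ta), (b, tb) in zip(trace, trace[1:])
        if tb - ta <= max_gap
    ]

flags = rapid_transitions(trace)
print(flags)
# [('created', 'assigned'), ('assigned', 'in progress'), ('in progress', 'closed')]
```

Whether a flagged run represents legitimate automation or a user forcing an item through the process then has to be judged from context, as discussed above.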
Final Note on KPIs
At several points throughout this piece I mentioned KPIs and the impact they can have on how people complete processes and use systems. It’s also important to be on the lookout for how some of the observed differences between reality and event logs can have unintended impacts on such KPIs. Specifically, is the KPI actually even measuring what it’s marketed as measuring? There will always be some outliers, but given that many process KPIs were created without conducting thorough process mining beforehand, it’s often the case that a process miner will find some KPIs that are based on flawed calculations—especially where a simple metric like average or median is masking a complex scenario in which a significant subset of the measurements are not relevant to the ‘performance’ intended to be measured.
Checklist for Success
In closing, here are a few key questions to ask yourself before launching into analysis:
- Do you understand how the event logs are generated, and specifically how humans and automated processes impact what’s recorded in the event log?
- For any anomalies revealed during initial process mining, do you understand all the actual actions that cause the observed phenomena?
- Are there any currently deployed KPIs that could be adversely impacted by the observed differences between the event logs and reality?
In the next installment of this series we’ll take a closer look at timestamps and some of the important relationships between timestamps and event logs.
Nicholas Hartman is a data scientist and director at CKM Advisors in New York City. He was also a speaker at Process Mining Camp and his team won the BPI Challenge this year.
More information is available at www.ckmadvisors.com
As you may have heard, the first process mining MOOC ‘Process Mining: Data science in Action’ is starting next week. MOOC stands for Massive Open Online Course and it is basically a web-based course that allows anyone all over the world to follow the lessons by watching the video lectures and solving assignments.
The process mining MOOC is very exciting for several reasons.
The lecturer of this course is none other than the godfather of process mining himself: There is no better person from whom you could learn about process mining than prof. Wil van der Aalst. The course is based on his process mining book and on top of that Wil is an excellent lecturer. Usually only the students of the Technical University in Eindhoven have the opportunity to take a full course with him, but now everybody can.
This MOOC will also amplify what many of us in the process mining community are doing: Making even more people aware of process mining and its benefits. We are trying to do our part by making process mining accessible to practitioners with Disco, and by evangelizing the topic through our academic initiative, the process mining camp, our blog, presentations, and everywhere we go. But also many of you are spreading the word about process mining by introducing it at your company, showing it to your friends and colleagues, and by sharing your experiences.
Process mining is one of the most interesting and useful data science disciplines around, and it is kind of amazing that still only a small number of people know about it. The MOOC will help introduce many more people to our field — So far, more than 22,000 people have signed up already. It is simply incredible how far the process mining community has come in the last few years!
We are proud to be part of the process mining movement, and it is our honor to support this MOOC course: The students will be using ProM and Disco to do the practical exercises. We think that Disco will help to show the participants that using process mining is easy, and that they can get started right away.
Let’s continue spreading the word about this MOOC. We, for one, are looking forward to welcoming a whole lot of new faces to the worldwide process mining community!
Yesterday, all submissions to the BPI Challenge 2014 were published on the Challenge website.
The winners this year have been the team of CKM Advisors. Yes, they already won the BPI Challenge in 2012, and they were the runner-up last year!
Picking CKM’s contribution as the winner of this year’s competition was a unanimous decision. One of the jury members commented on their work as follows:
I like how 13 clear patterns were defined, how a decision tree was presented to distinguish between them and how this served as a basis for the analysis, prediction and presentation of the results.
This year’s BPI challenge was particularly difficult as many of the questions were in fact outside of the classical process mining space, reaching further into the data mining and data science area than in the previous years.
CKM, with their data science background, directly tackled these questions and identified patterns for how changes impact the IT service level at the bank by leading to new interactions with the service desk and incidents. You can take a look at their winning submission here.
To honor their achievement, the winners received a special trophy (see above). This beautiful award has been hand-crafted by Felix Günther, after an original concept and design. It is made from a single branch of a plum tree, which symbolizes the “log” that was analyzed in the challenge. The copper inlay stands for the precious information that was mined from the log.
Furthermore, this year for the first time there was a student competition category at the BPI challenge. The winners of the student competition are Gabriele Cacciola from the University of Calabria in Italy, and Raffaele Conforti and Hoang Nguyen from the Queensland University of Technology in Australia. You can read their winning submission here. As a prize, they have received an iPad.
It’s your turn, now!
People often ask us how they can practice their process mining skills. The BPI challenge data sets are a great way to do that. And on top of having the chance to play with some real data, you can also read the submissions of all the participants to see their solutions. We also recommend taking a look at the previous BPI Challenges from 2013, 2012, and 2011.
For this year’s challenge, the following submissions were selected for publication and have been made available on the BPI challenge website yesterday. Here they are in order of scoring by the jury.
- CKM Advisors, USA: Pierre Buhler, Rob O’Callaghan, Soline Aubry, Danielle Dejoy, Emily Kuo, Natalie Shoup, Inayat Khosla, Mark Ginsburg, Nicholas Hartman and Nicholas Mcbride
- GRADIENT ECM, Slovakia: Jan Suchy and Milan Suchy
- UWV and Consultrend, The Netherlands: Marcus Dees and Femke van den End
- National Research Council, Canada: Scott Buffett, Bruno Emond and Cyril Goutte
- KPMG Advisory, Belgium: Peter Van den Spiegel, Leen Dieltjens, Liese Blevi, Jan Verdickt, Paul Albertini and Tim Provinciael
- ChangeGroup, Denmark: John Hansen
- Research Center for Artificial Intelligence, Germany: Tom Thaler, Sönke Knoch, Nico Krivograd, Peter Fettke and Peter Loos
- Pontificia Universidad Católica, Chile: Michael Arias, Mauricio Arriagada, Eric Rojas, Cecilia Saint-Pierre and Marcos Sepúlveda
- University of Calabria, Italy, and QUT, Australia: Gabriele Cacciola, Raffaele Conforti and Hoang Nguyen
- Federal University of the State of Rio de Janeiro, Brazil: Pedro Richetti, Bruna Brandão and Guilherme Lopes
- Myongji University, Korea: Seung Won Hong, Ji Yun Hwang, Dan Bi Kim, Hyeoung Seok Choi, Seo Jin Choi and Suk Hyun Hong
We congratulate not only the winners but all participants of the BPI Challenge for their great work and contribution to advancing the process mining area. Thank you!