Sign up for our webinar with TransWare to learn about the challenges of getting high-quality data from SAP. They will demonstrate their process mining integration server (for mixed SAP and non-SAP system landscapes).
TransWare has built an integration to Disco via our Airlift interface. In this webinar, they will explain the background, capabilities, and the set-up of their solution.
Thursday, 5 November 2015 @ 17:00 CET
Process mining introduction
Challenges of good quality data extraction from SAP
TransWare process mining integration server (for mixed SAP and non-SAP system landscapes)
If you want to know more about how to get data out of SAP for process mining purposes, and how you can integrate non-SAP systems into the analysis, sign up for the webinar here!
Imagine that your data science team is supposed to help find the cause of a growing number of complaints in the customer service process. They delve into the service portal data and generate a series of charts and statistics for the distribution of complaints over the different departments and product groups. However, in order to solve the problem, the weaknesses in the process itself must be identified and communicated to the department.
You then include the CRM data and, with the help of Process Mining, you are quickly able to identify unwanted loops and delays in the process. And these deviations are even displayed automatically as a graphical process map! The head of the CS department can see at first glance what the problem is, and can immediately take corrective measures.
Right here is where we see an increasing enthusiasm for Process Mining across all industries: The data analyst can not only quickly provide answers but also speak the language of the Process Manager and visually display the discovered process problems.
Data scientists deftly move through a whole range of technologies. They know that 80% of the work consists of the processing and cleaning of data. They know how to work with SQL, NoSQL, ETL tools, statistics, scripting languages such as Python, data mining tools, and R. But for many of them Process Mining is not yet part of the data science toolbox.
What is Process Mining?
Process Mining is a relatively young technology, which was developed about 15 years ago at Eindhoven University of Technology by the research group of Prof. Wil van der Aalst. Given the name, it seems to be related to the much older area of ‘data mining’. Historically, however, Process Mining has its origin in the field of business process management, and current data mining tools contain no process mining technology.
So what exactly is Process Mining?
Process Mining allows us to map and analyze complete processes based on the digital traces they leave in information systems. A process is a sequence of steps. Therefore, the following three requirements must be met in order to use Process Mining:
Case ID: A case ID must identify the process instance, a specific execution of the process (for example, a customer number, order number, or patient ID).
Activity: For each process the most important steps or status changes in the process must be logged. These mostly can be found in the business data of a database in the IT system (e.g., the date of an offer to the customer in the sales process).
Timestamp: For every process step you need a timestamp to bring the process sequence for each case in the correct order.
If you find these three elements in your IT system, Process Mining can supply a correct representation of the process in the blink of an eye. The visualisation of the process is generated directly from the historical raw data.
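As a sketch of what such an event log looks like in practice, here is a minimal example in plain Python (the column layout and activity names are purely illustrative, not a fixed standard): each event carries the three required elements, and sorting the events of a case by timestamp reconstructs its sequence of steps.

```python
# A minimal event log: each row is one process step with the three
# required elements - case ID, activity name, and timestamp.
from datetime import datetime

event_log = [
    # (case_id, activity, timestamp)
    ("order-1", "Create order", datetime(2015, 10, 1, 9, 0)),
    ("order-1", "Ship goods",   datetime(2015, 10, 2, 14, 30)),
    ("order-2", "Create order", datetime(2015, 10, 1, 10, 15)),
    ("order-1", "Send invoice", datetime(2015, 10, 2, 9, 0)),
    ("order-2", "Cancel order", datetime(2015, 10, 1, 16, 45)),
]

# Group the events per case and order them by timestamp to
# reconstruct the sequence of steps for each process instance.
cases = {}
for case_id, activity, ts in event_log:
    cases.setdefault(case_id, []).append((ts, activity))

traces = {
    case_id: [activity for _, activity in sorted(events)]
    for case_id, events in cases.items()
}

print(traces["order-1"])  # ['Create order', 'Send invoice', 'Ship goods']
```

This grouping and sorting is exactly what a process mining tool does for you behind the scenes when it builds the process map from the raw data.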
What You Can Do With Process Mining
Process Mining is not a reporting tool, but an analysis tool. It enables you to quickly analyse even very complex processes. For example, so-called click streams from websites show how visitors navigate a webpage (and where they “drop out” or “wander around” due to poor usability of the page). Or take the new workflow system in your company, which has only recently been introduced and for which the department now wants to know how many processes really follow the redesigned, streamlined process path.
You can display the activity flow as well as the transfer between departments in different views of the process, identify bottlenecks, and investigate unwanted or long-running paths within the process.
These process views can also be animated to help in the communication with the department: the actual processes based on the timestamps from the data are ‘replayed’ and show in a very tangible way where the problems in the process are.
Why Data Scientists Should Become Familiar with Process Mining
Data science teams around the world are starting to look into Process Mining because:
Process Mining fills a gap that is not covered by existing data mining, statistics, and visualization tools. For example, data mining techniques can extract decision trees, predictions, or frequent patterns, but they cannot display complete processes.
Data scientists, with their skills to extract, link, and prepare data, are ideally equipped to exploit the full potential of Process Mining. For example, in a ‘Customer Journey’ analysis, the data from different IT systems, such as the CRM data, the calls in the call center of a bank, and the interactions with the customer advisor in the branch, must be linked with each other.
Analytical results must be communicated with the business. Data Science Teams do not analyse data for themselves, but to solve problems and issues for the business. If these questions revolve around processes, then charts and statistics are only meaningful in a limited way and are often too abstract. Process Mining allows you to provide a visual representation to the process owner, and also to directly profit from their domain knowledge in interactive analysis workshops. This allows you to find and implement solutions quickly.
Are you curious and want to know more about Process Mining? We recommend the following links:
Two free online courses (so-called MOOCs) have recently started that offer an introduction to the topic of Process Mining:
The ‘Process mining: Data science in Action’ MOOC at Coursera is a course given by Prof. Wil van der Aalst himself and provides a comprehensive picture of the foundations and the background of Process Mining algorithms: www.coursera.org/course/procmin
We are happy to announce the immediate release of Disco 1.9!
This update makes a lot of foundational changes to the platform underlying Disco to pave the way for future developments that are in the works, but it is also a productivity release that will make your daily work with Disco even more of a breeze than it is right now. The power of process mining, and of Disco in particular, is the capability to explore unknown and complex processes very quickly. Starting from a data set that you don’t fully understand yet, you can take different views on your process — in an iterative manner — until you get the full picture. This update will help you to get there even faster.
Disco will automatically download and install this update the next time you run it, if you are connected to the internet. You can of course also download and install the updated installer packages manually from fluxicon.com/disco.
If you want to make yourself familiar with the changes and new additions in Disco 1.9, we have made a video that should give you a nice overview. Please keep reading if you want the full details of what is new in Disco 1.9.
An important aspect of process mining is that you not only discover the actual process based on data, but that — for any problem that you find in your analysis — you can always go back to a concrete example. Inspecting individual cases helps to understand the context, formulate hypotheses about the root cause of the issue, and enables you to take action by talking to the people who are involved and can tell you more.
Quickly inspect case details via right-click on case statistics table
One typical scenario in this exploration is to look up some extreme cases in the Cases table of the Overview statistics. For example, by clicking on the different table headers, you can bring the cases that take the longest time (or the most steps), or the ones that are particularly fast (or take the fewest steps), to the top.
In Disco 1.9 you can now quickly inspect cases from the case statistics overview in the following way: right-click the case you are interested in and choose ‘Show case details’ (see screenshots above). You are immediately taken to the detailed history for that case.
Select case IDs via the Attribute filter
In addition, you can now also filter for specific cases based on their case ID.
In most situations, you want to filter cases based on certain characteristics (such as long case durations). However, sometimes it can also be useful to directly choose a set of cases you want to focus on.
A new entry below the other attributes in your data set brings up the list of all case IDs in the Attribute filter and you can select the ones that you want to keep (see screenshot above).
Variants are sequences of steps through the process from the beginning to the end. If two cases have taken the same path through the process, then they belong to the same variant. Because there are often a few dominant variants, for example, 20% of the variants covering 80% of the cases (indicating the mainstream behavior), the variant analysis is useful to understand the main scenarios of the process. However, at the same time there are typically many more variants than people expect, and the improvement potential often lies in the less frequent variants (the exceptional behavior of the process).
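The idea behind variants can be illustrated with a small sketch (plain Python, invented data): a variant is simply the ordered sequence of activities a case went through, so counting cases per variant is a grouping step.

```python
from collections import Counter

# Each case mapped to its ordered sequence of activities (invented data).
traces = {
    "c1": ("A", "B", "C"),
    "c2": ("A", "B", "C"),
    "c3": ("A", "C"),
    "c4": ("A", "B", "B", "C"),  # a rework loop: activity B repeated
}

# Cases with identical activity sequences belong to the same variant.
variant_counts = Counter(traces.values())

# List the variants from most to least frequent.
for variant, count in variant_counts.most_common():
    print(" -> ".join(variant), f"({count} cases)")
```

Even in this toy example you can see the typical pattern: one dominant variant covers half of the cases, while the remaining cases are spread over less frequent, exceptional variants.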
Because the variant analysis is such a useful tool, it is easily one of the most popular functionalities in Disco. And now with Disco 1.9 the variant analysis has become even more useful.
Quickly inspect the variant details via right-click on variant statistics table
You can now quickly inspect the variant details from the variant statistics overview, much in the same way that you can jump to a particular case, as shown in the Case Analysis section above.
Simply right-click on the variant that you want to explore and choose ‘Show variant details’ (see screenshots above). You are immediately taken to the variant with all the cases that follow that variant.
Select variants via the Attribute filter
Furthermore, you can now also explicitly filter variants. Previously, you could already filter variants based on their frequency with the Variation filter, for example to focus on the mainstream or the exceptional cases. But what if your ideal process consists of variants 1, 2, 3, and 5, because variant 4 is quite frequent but represents an unwanted path that you do not want to include?
With Disco 1.9 you can now explicitly filter variants in the following way: Similar to the new Case ID filter shown above, you will find a new entry at the bottom of the attribute list in the Attribute filter. Simply select the variants you want to keep and apply the filter (see screenshot above).
Filter short-cuts are already a great source of productivity in Disco. For example, you can already directly click on an activity in the process map, a path between two activities, or the dashed lines leading to the start and end points. These short-cuts allow you to jump to a pre-configured filter that focuses on all cases that perform that activity (or follow that path, or start or end at the chosen endpoints), which you only have to apply to inspect the results.
Now three additional short-cuts have become available with Disco 1.9.
Add a pre-configured Attribute filter directly from the Statistics tab
Imagine that you are analyzing a customer service process, where refund requests can come in via different channels. You want to focus on the process for the Callcenter channel.
You can now simply right-click on the attribute value that you want to filter and choose the ‘Filter for Callcenter’ short-cut (see screenshot above) to automatically add a pre-configured filter, which has the right attribute and attribute value already selected.
Add pre-configured Case ID and Variant filters directly from the Statistics overview
The same filter short-cut functionality has also been added for the new Case ID and Variant filters, which were introduced in the Case Analysis and Variant Analysis sections above. Simply right-click on the case or the variant you want to filter and the filter will be automatically added with the right pre-configuration.
There is an even faster way than filter short-cuts in Disco: searching. A search can be incredibly useful if you just want to inspect some examples where a certain activity occurs, or where a particular organizational group or any kind of custom attribute value is involved.
Disco features a lightning fast full-text search in the upper right corner of the Cases tab. As soon as you start typing, Disco will search live through all your data and highlight where it finds cases that contain your search text.
Automatically search for attribute values via right-click
The search short-cut makes it now even easier to benefit from Disco’s search capability. For example, let’s say that we are looking at the BPI Challenge 2015 data set of building permit process data and we discover a less-frequent activity ‘partly permit’. We are wondering in which context that step typically happens.
With Disco 1.9, you can simply right-click the activity name and choose ‘Search for partly permit’. Disco will enter the search text for you, and you will be immediately taken to the Cases tab, where the searched activity is highlighted in the cases in which it was found.
Search for anything directly from Cases view
This works for any attribute value, and also while you are inspecting cases in the Cases tab itself. For example, assume that in one of the cases you see another activity ‘by law’ that occurs on the same day, and you want to see some more examples where that happens. Simply right-click and use the short-cut to trigger the new search.
Process mining is a tool that fills a piece in the puzzle, by providing a process view on the data at hand. Data scientists or process improvement analysts often use additional tools, such as statistics tools, traditional data mining tools, or even Excel, to complement their process mining analysis with different perspectives.
All analysis results can be exported from Disco: the process maps, charts and statistics, individual cases, and the filtered log data. However, until now the variants could only be exported in the form of the variant statistics.
With Disco 1.9 you can now not only export the variant statistics (including the actual activity sequences for each variant) but also the raw data including the variant information. This opens up new possibilities, such as running correlation analyses with data mining tools or using the Disco output to create a custom deliverable.
Export the variant information with the Case Statistics overview via right-click on the table
Exporting your data set will now include variant information
You can now export the variant information from Disco with your raw data in two different ways:
Export the case statistics (which now include the variant information) via right-click on the Cases table,
Export your log data, now enriched with variant information, via the Export button in the lower right corner of Disco.
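Conceptually, enriching the log amounts to assigning each case a variant ID based on its activity sequence and writing that ID back onto every row of the raw data. A hedged sketch of the idea in plain Python (the column layout and variant numbering are illustrative, not Disco’s actual export format):

```python
# Raw event rows (case ID, activity); timestamps omitted for brevity,
# and rows are assumed to be already in chronological order per case.
rows = [
    ("c1", "A"), ("c1", "B"),
    ("c2", "A"), ("c2", "B"),
    ("c3", "A"), ("c3", "C"),
]

# Reconstruct each case's activity sequence.
traces = {}
for case_id, activity in rows:
    traces.setdefault(case_id, []).append(activity)

# Assign variant IDs in order of first appearance.
variant_ids = {}
case_variant = {}
for case_id, seq in traces.items():
    key = tuple(seq)
    variant_ids.setdefault(key, len(variant_ids) + 1)
    case_variant[case_id] = variant_ids[key]

# Enrich every raw row with its case's variant ID.
enriched = [(case_id, activity, case_variant[case_id])
            for case_id, activity in rows]
print(enriched[0])  # ('c1', 'A', 1)
```

With the variant ID attached to every row, the enriched data can be fed straight into a statistics or data mining tool, for example to correlate variants with case attributes.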
Improved Formatting for Large Frequencies
Disco is highly optimized for the kind of data that process mining needs and can process very large data sets very quickly. But especially if you have imported a data set with many millions of records, inspecting the frequency statistics can become a game of counting zeros to understand what numbers you are looking at.
The new Thousands Separator makes large numbers easier to read
To make reading large numbers easier, a thousands separator has been introduced across the board in Disco 1.9. For example, in the above screenshot you can see a data set with 100 million records, in which the ‘start’ activity was performed 3.9 million times.
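The same readability trick is a one-liner in most languages if you ever post-process exported numbers yourself; in Python, for example:

```python
count = 3_900_000

# Format with a thousands separator instead of a bare digit run.
print(f"{count:,}")        # 3,900,000
print(f"{100_000_000:,}")  # 100,000,000
```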
More Powerful Trim Mode in Endpoints Filter
Disco’s powerful set of filters allows you to quickly zoom into your data in many different ways. By working directly on the raw data, Disco’s capabilities extend way beyond the simple drill-downs that you see in BI tools based on prepared queries and aggregated data cubes.
For example, the Trim mode in the Endpoints filter allows you to focus on arbitrary segments of your process by cutting off all events that happen before and after the indicated endpoints.
The Trim mode in the Endpoints filter now allows you to focus on either the first or the longest subset based on your endpoints
With Disco 1.9 the Trim mode becomes more powerful. It lets you determine what should happen if you have selected multiple end event markers (or if your end event appears multiple times in the same case). You can now choose between:
Trim longest: Cuts to the sequence between the first occurrence of one of your start events and the last occurrence of one of your end events (previous trim-mode).
Trim first: Cuts to the first sequence between your chosen start and end events.
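The difference between the two modes can be sketched on a single activity sequence (plain Python, hypothetical events; ‘S’ marks a chosen start event and ‘E’ a chosen end event):

```python
def trim_longest(trace, starts, ends):
    """Keep the span from the FIRST chosen start to the LAST chosen end."""
    first = min(i for i, a in enumerate(trace) if a in starts)
    last = max(i for i, a in enumerate(trace) if a in ends)
    return trace[first:last + 1]

def trim_first(trace, starts, ends):
    """Keep the span from the FIRST chosen start to the FIRST chosen end after it."""
    first = min(i for i, a in enumerate(trace) if a in starts)
    last = next(i for i, a in enumerate(trace) if i >= first and a in ends)
    return trace[first:last + 1]

# The chosen end event 'E' occurs twice in this case.
trace = ["X", "S", "A", "E", "B", "E", "Y"]

print(trim_longest(trace, {"S"}, {"E"}))  # ['S', 'A', 'E', 'B', 'E']
print(trim_first(trace, {"S"}, {"E"}))    # ['S', 'A', 'E']
```

Trim first is the natural choice when the segment you care about can repeat within a case and you only want its first occurrence; Trim longest keeps everything between the outermost endpoints, as before.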
New Audit Report Export
In addition to process improvement teams, auditors increasingly use Disco to analyze processes for their audits. Their focus is typically less on performance (like detecting bottlenecks) and more on compliance questions, like detecting deviations from the allowed process, violations of segregation-of-duty rules, or missing mandatory steps. All of these compliance issues can be easily analyzed with Disco, and you can get a nice overview of typical auditing questions in this presentation given by Youri Soons at Process Mining Camp 2013.
One thing that is really important in the work of auditors is that they need to document their work. They document the original data and the findings of the audit, but also the steps that they took to arrive at those findings, to make it possible to verify and reproduce them after the fact.
Therefore, we have added a new audit report export in Disco 1.9. The audit report bundles the machine-readable (and re-usable) recipe with a human-readable filter report and the resulting data set in a Zip file, ready to be attached to your audit documentation.
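Conceptually, such a bundle can be assembled with standard tooling. Here is a hedged sketch of the idea using Python’s zipfile module (the file names and contents are illustrative, not Disco’s actual report layout):

```python
import io
import zipfile

# Illustrative contents: a machine-readable recipe, a human-readable
# filter report, and the resulting (filtered) data set.
recipe = '{"filters": [{"type": "attribute", "keep": ["Callcenter"]}]}'
report = "Filter report\n1. Attribute filter: keep channel = Callcenter\n"
dataset = "case_id,activity,timestamp\nc1,Create order,2015-10-01T09:00\n"

# Bundle all three artifacts into a single Zip archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
    bundle.writestr("recipe.json", recipe)
    bundle.writestr("filter-report.txt", report)
    bundle.writestr("filtered-log.csv", dataset)

# The archive now holds everything needed to verify the analysis.
buffer.seek(0)
with zipfile.ZipFile(buffer) as bundle:
    print(bundle.namelist())
```

Keeping the machine-readable recipe next to the human-readable report is what makes the findings both verifiable by a reviewer and re-runnable on the original data.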
Audit report can be exported from the Empty Filter Result screen
Another problem is that, as an auditor, you often check for compliance rules that turn out not to be violated. For example, you may find that not a single case remains in the data set after you apply your filter to check for a segregation-of-duty violation.
That’s a good result, but how can you document it? With Disco 1.9 you can now also export the audit report directly from the empty filter result dialog (see screenshot above).
Process Map With Fixed Percentage
The last feature will be useful if you want to repeat analyses based on new data sets. For example, after an improvement project you want to look at the new process and see how effective the improvements actually were.
While you can already re-use your filter settings via recipes from the previous project to quickly re-run the analyses on the new data, you sometimes also want to re-create the process maps based on exactly the same level of detail (you can learn more about how the detail sliders in the Map view work in this article). And moving the sliders is a cumbersome way to hit the exact percentage point that you want to see.
Explicit Percentages for detail sliders in map view
With Disco 1.9 you can now explicitly set the desired percentage points for the Activities and the Paths sliders in the map view, by clicking on their respective percentages below the sliders (see screenshot above).
The 1.9 update also includes a number of other features and bug fixes, which improve the functionality, reliability, and performance of Disco. Please find a list of the most important further changes below.
CSV Import: Improved accuracy and reliability of CSV auto-detection.
CSV Import: Improved timestamp parsing and timestamp pattern auto-detection.
CSV Export: Enhanced CSV Export Format for better Excel compatibility.
Bug fixes: Fixes several minor issues and user interface inconsistencies.
Stability: Fixes a stability issue observed with some newer Java versions.
We want to thank all of you for using Disco, and for providing a continuous stream of great feedback to us!
Most of the changes in this release can be directly traced back to a conversation with one of our customers, a support email, or in-app feedback submitted from Disco. Without that feedback, it would be impossible for us to keep Disco so stable and fast. And, even more importantly, your feedback enables us to concentrate our efforts on changes that make Disco even better for you: More relevant for the problems you try to solve, and a better, more efficient, and just more fun companion for your work.
We hope that you like Disco 1.9, and we keep looking forward to your feedback!
A brand-new MOOC called Fundamentals of BPM is starting up next week on Monday, 12 October 2015. It has been developed by the Queensland University of Technology (QUT) in Brisbane, Australia, and is taking a theoretically founded but also very practical and practitioner-oriented approach. You can get a look behind the scenes in this BPTrends article on the new MOOC.
The MOOC is based on the textbook “Fundamentals of Business Process Management”, which has been adopted in over 100 educational institutions worldwide. It includes a practical segment on process mining as well as process mining case studies, exercises, theoretical backgrounds, and a video interview with Wil van der Aalst.
We are very happy that the MOOC organizers have chosen our process mining software Disco as the process mining software to be used in the MOOC. Fluxicon is supporting the MOOC by providing training licenses for the participants, who can use Disco to follow the process mining exercises and to explore their own processes to learn more about what process mining can do. You can sign up for the MOOC here.
We spoke with Marcello La Rosa, one of the instructors in the MOOC and professor and Academic Director for corporate programs and partnerships at the Information Systems school of the Queensland University of Technology (QUT) in Brisbane, Australia.
Interview with Marcello
It’s great to see that you have included a section on process mining in the new MOOC ‘Fundamentals of BPM’. Process mining is an important part of a holistic approach to process management, because it closes the loop and lets people evaluate how the processes are really performed, and where the weaknesses and improvement opportunities are.
In the process mining section of the MOOC, you will also report on a project carried out at Suncorp. Can you tell us more about that project?
One of the case studies discussed in the MOOC is related to a process mining project that Queensland University of Technology conducted with Suncorp Commercial Insurance in 2012. The objective of that study was to identify the reasons why certain low-value claims would take too long to be processed, as opposed to others of the same type, which were handled within reasonable times.
The company had formulated different hypotheses about the reasons for these inefficiencies, but the process changes based on these hypotheses had not led to any measurable improvements. Process mining provided the tipping point.
In a nutshell, we extracted the data related to six months of execution of the two variants of this claims handling process from Suncorp’s claims management system, discovered the respective process models using Disco, and identified the differences between these two models.
In fact, it was found that in the slow variant the process would clog up at a couple of activities due to rework and repetition. These findings were then supported by a statistical analysis of the differences, and the data was replayed on top of the discovered models to build a business case. Enroll in the MOOC to find out more about how Suncorp managed to use process mining to improve its business processes.
What is the most important impact that process mining has in your opinion in the organizations that are using it?
The speed of reaction, which has increased dramatically. Now organizations can get to the bottom of their process weaknesses in much less time. For example, the project with Suncorp was completed in less than six months.
This faster response time is possible because process mining is changing the way business process management (BPM) is done. As we will see in the course, process mining offers a new entry point into the BPM lifecycle, through the monitoring of process execution data, which is traditionally the last phase in a typical BPM project.
This, on the one hand, allows analysts to quickly discover process models — with the advantage that such models are based on the evidence of the data and are thus not prone to human bias. On the other hand, it offers an opportunity to jump directly to the analysis phase, without necessarily relying on a process model, to find out where process weaknesses are.
Who can benefit from participating in the new MOOC and why should they sign up?
This course is open to anyone who has an interest in improving organizational performance.
It will be useful to those who have already worked in the area of business process management (BPM) and would like to consolidate and expand their learnings, since this is the first course that offers a comprehensive overview of the BPM lifecycle (from process identification all the way to process monitoring). But given that no prior knowledge is required, this course also provides a great opportunity for professionals and students who are new to the field to learn about the exciting discipline of BPM. This is achieved by combining a gentle introduction to the subject with more advanced topics, which offer many opportunities for deepening the content.
Last but not least, the variety of learning media (short videos, activities, quizzes, readings, interviews, project work) will ensure that following this MOOC is fun!
Did you miss the Coursera MOOC ‘Process Mining: Data Science in Action’ the last time around? Or did you have to drop out because you did not have the time to complete it? You are in luck, because the Process Mining MOOC starts again today, on October 7, 2015. It’s a free online course, where you can watch video lectures and test your knowledge through online quizzes.
Fluxicon is supporting the MOOC by providing training licenses for our process mining software Disco. The new edition of the MOOC will also include a real-life process mining session that gives you a taste of how you can solve real process problems in your organisation with process mining. You can sign up here.
We spoke with Prof. Wil van der Aalst, who created the MOOC, about how online classes compare to regular class-room studies and what established process mining analysts can get out of following the course.
Interview with Wil
The MOOC ‘Process Mining: Data Science in Action’ is starting again on 7 October in its third edition. So far, already more than 65,000 people have participated in the MOOC. That is an incredible success. Now, there will be many more new people who will come in contact with process mining for the first time. We have also heard from several people who had to drop out of one of the previous courses and who will now be taking it again.
What do you think are the advantages and what are the disadvantages of learning about a topic like process mining in an online course? Are there things that are easier and things that you see that are more difficult for online learners compared with your regular university classroom courses?
The main advantage of taking an online course is that it is not bound to a fixed location and time. It is amazing to see people from over 200 countries participating in a course. We are reaching people that would never have had the opportunity to study process mining otherwise (because of location and time constraints). It has helped to create awareness: Many BPM practitioners and Data Scientists still do not know that these powerful techniques are available and directly applicable.
However, MOOCs do not replace classrooms. Studying is also a social process. Personal contact between teachers and students is important. Students who study in groups can ask questions and motivate each other. MOOCs try to mimic this through a forum, but it is not the same thing. Nevertheless, it is interesting to see the interactions between participants in the forum of the Process Mining MOOC.
Yes, the forums have been very active, and it has been great to see how people discuss the material and help each other out.
What can a practitioner who is already actively working with process mining still learn from the MOOC, why should they participate?
The topic of process mining is quite broad and extends far beyond automated process discovery. The MOOC provides a rather complete view of the spectrum and will help practitioners to think of analysis opportunities they would otherwise not see (conformance checking, data-aware process mining, predictions, etc.).
It is also important to have a basic understanding of how the algorithms work and what the foundational limitations and trade-offs are. When pushing the discovery button of your favorite process mining tool, one should understand process discovery in order to interpret the results and to get the diagnostics one is looking for. For example, there is always a trade-off between fitness, precision, generalization, and simplicity. Understanding these trade-offs is important when you are confronted with “Spaghetti models”.
What do you recommend to people who – after finishing the MOOC – want to take the next step? What should they do?
There is a lot of material available. Of course people should study the book “Process Mining: Discovery, Conformance and Enhancement of Business Processes”. The website http://www.processmining.org/ also provides many pointers.
People say “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it”. We should avoid that they say the same about process mining. Process mining is very practical and the threshold to get started is much lower than for most other technologies.
Everyone knows the saying that you can lie with statistics. One of the themes around the responsible use of statistics is that correlation does not imply causation. For example, the above graph from the Spurious correlations book illustrates how ridiculously unrelated things can be correlated.
Another problem that is less frequently mentioned is that you get what you measure. This is the inverse take on the popular “you can’t know what you don’t measure” and hints at the fact that the way you measure influences your results.
To understand the ‘you get what you measure’ problem, take a look at the following process from the customer service department at a large Internet company. It shows the contact moments that customers had with the support team over various channels (phone, web, email, chat).
The key metric that was used in the team to monitor the service performance was the First Contact Resolution Rate (FCR). The FCR measures how many of the customer problems the team could solve within the first contact with the customer, that is, without the customer having to call back again. In the process map below you can see that out of 21,304 inbound calls only 540 resulted in repeat calls. The overall FCR was an impressive 98%.
However, the process mining analysis was done with the Service Request number as the Case ID. The Service Request ID is a unique identifier that is automatically assigned to each new service case by the Siebel CRM system. A deeper analysis revealed that all service requests were closed pretty quickly, typically within 3 days.
If the customer did call back after 3 days, a new service request was opened. So, the process above shows the flow of the service requests, but it does not show the real service process the customers went through.
To shift the perspective, the same data was then imported again into Disco. This time, the Customer ID was used as a Case ID. You can see how the process changes if you look at it from this new perspective.
In reality, only 17,065 cases were started by an inbound call. Over 3,000 calls were actually repeat calls that had merely been counted as new service requests. With this new view, the true FCR dropped to 82%.
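The effect of the chosen case ID on such a metric can be illustrated with a small Python sketch. The event rows and the simplified FCR definition here are purely hypothetical, not the actual data from this case study:

```python
# Hypothetical mini event log: (service_request_id, customer_id, activity)
events = [
    ("SR1", "C1", "Inbound Call"),
    ("SR2", "C1", "Inbound Call"),   # same customer calling back -> new SR
    ("SR3", "C2", "Inbound Call"),
]

def first_contact_resolution(events, case_index):
    """Simplified FCR: share of cases with exactly one inbound call.
    The case notion is chosen via case_index (0 = service request, 1 = customer)."""
    calls_per_case = {}
    for row in events:
        case = row[case_index]
        calls_per_case[case] = calls_per_case.get(case, 0) + 1
    resolved = sum(1 for n in calls_per_case.values() if n == 1)
    return resolved / len(calls_per_case)

print(first_contact_resolution(events, 0))  # per service request: looks perfect
print(first_contact_resolution(events, 1))  # per customer: the repeat call shows up
```

With the service request as the case ID, every case looks resolved on first contact; with the customer as the case ID, the repeat call becomes visible and the rate drops.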
The customer service example demonstrates how the perspective that you take on the process influences the results. And while Disco allows you to take different views on the process very quickly, it is your responsibility as a process mining analyst to make sure that you explore these different views and think about how you should look at the process.
The initial, service-request based analysis was being done from the perspective of the measured KPI, which, in fact, may have influenced the behavior of the agents in the call center in the first place: If you are measured based on how few call-backs you get, you are inclined to close those service requests just a little more quickly.
However, from the customer’s perspective this leads to a worse experience, because they have to repeat all their details and describe the problem again. It would be better for them if the agent looked up and re-opened their existing case. So, also from a process management perspective, you often get what you measure. And if the KPIs that are used to evaluate the performance of employees do not encourage the behavior that you want in your process, then you are in trouble.
As a process miner, you need to take contextual factors, such as how people are measured and what their incentives are, into account when you assess a process in your organization. Otherwise you won’t get the full picture.
As a process miner, you need access to the process manager, or another subject matter expert, to ask questions, validate, and prioritize the analysis results that are coming up.
However, the very first step of any analysis is to explore the data and develop a first understanding of the process. Hypotheses are formed based on the questions that were defined together with the process owner in the scoping phase of the project.
This is exactly the step in a process mining project that the annual BPI Challenge allows you to practice:
You receive anonymized but real-life data for a process
You get a description of the process and some questions the process owners have about it
The data set is public and anyone can analyze it. In the end a winner will be chosen by the jury
You get feedback from the reviewers in the jury about your analysis
Even after the BPI Challenge competition is over, you can still use the data sets to practice exactly that initial analysis step in a project, and to compare your approach with the other submissions.1
But of course participating in the actual competition is much more fun. And last week, the winners of this year’s BPI Challenge were announced.
First of all, Irene Teinemaa, Anna Leontjeva and Karl-Oskar Masing from the University of Tartu, Estonia, won the prize for the best student submission. One of the noteworthy aspects of their work was that they used a lot of different tools. They were awarded a certificate.
In the overall competition, Ube van der Ham from Meijer & Van der Ham Management Consultants in the Netherlands won the BPI Challenge trophy.
The jury found that Ube brought many interesting insights to light that will help the municipalities in their process improvement and collaborations.
Like in the past two years, the trophy was developed after an original design by the artist Felix Günther. Hand-crafted from a single piece of wood, this “log” represents the log data to be mined. The shiny rectangle represents the gold that is mined from the data and this year has the shape of the famous roof of Innsbruck, where the award ceremony for the BPI Challenge took place.
The back of the trophy still features the bark of the tree, giving the whole piece a gorgeous feel and a heavy weight.
We thank Felix for this amazing work and know that Ube was very happy about not just receiving the BPI Challenge award but the trophy itself.
What is great about the BPI Challenge is that you can read the different reports of all participants and compare their approaches. This is a great way to learn more about process mining in practice.
Keep in mind that none of the participants had the chance to ask the actual process owners questions during their analysis. So, not every result or assumption that they made was correct. Even the winner, Ube van der Ham, warns that not all observations are necessarily correct, and one of the jury members who knows the process noted some misinterpretations. And inevitably, the participants got stuck at points where they could only hypothesize and not make a definite statement.
However, your role as a process mining analyst in a real project is to collect your assumptions and hypotheses and then validate them with the process experts in the following process mining sessions and workshops. And you can learn a lot by looking at how other people approached this data set.
If you have little time, I recommend reading the winning report by Ube and the work by Liese Blevi and Peter Van den Spiegel from KPMG, a close second place. Liese and Peter take a very careful and systematic approach to understanding the log data and the process behind it.
Take a look at the following example. Instead of one Activity or Status column, you have two columns showing the “old” and the “new” status. For example, in line no. 2 the status is changed from ‘New’ to ‘Opened’ in the first step of case 1.
This is a pattern that you will encounter in some situations, for example, in some database histories or CRM audit trail tables.
The question is how to deal with log data in this format.
Should you use both the ‘Old value’ and the ‘New value’ column as the activity column and join them together?
This would be solution no. 1 and leads to the following process picture.
All combinations of old and new statuses are considered here. This makes sense, but it can quickly lead to inflated process maps with many different activity nodes for all the combinations.
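As a quick sketch of solution no. 1 in code, joining the two status columns into a single activity name could look like this (the rows here are made up for illustration):

```python
# Illustrative 'Old value' / 'New value' rows: (case_id, old, new, timestamp)
rows = [
    ("1", "New", "Opened", "2015-01-01 10:00"),
    ("1", "Opened", "Closed", "2015-01-02 09:00"),
]

# Solution 1: concatenate both status columns into a single activity name
activities = [(case, f"{old} -> {new}", ts) for case, old, new, ts in rows]
for case, activity, ts in activities:
    print(case, activity, ts)
```

Every distinct (old, new) pair becomes its own activity node, which is exactly why the resulting process map can blow up in size.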
Normally, you would like to see the process map as a flow between the different status changes. So, what happens if you just choose the ‘Old value’ as the activity during importing your data set?
You would get the following process map.
The process map shows the process flow through the different status changes as expected, but there is one problem: You miss the very last status in every case (which is recorded in the ‘New value’ column).
For example, for case 2 the process flow goes from ‘Opened’ directly to the end point (omitting the ‘Aborted’ status it changed into in the last event).
You can do the same by importing just the ‘New value’ column as the activity column and get the following picture.
This way, you see all the different end points of the process. For example, some cases end with the status ‘Closed’ while others end as ‘Aborted’. But now you miss the very first status of each case (the ‘New’ status).
In this example, all cases change from ‘New’ to ‘Opened’. So, missing the ‘New’ in the beginning is less of a problem compared to missing the different end statuses. Therefore, solution 3 would be the preferred solution in this case. But in other situations, the opposite might be the case.
Filtering Based on Endpoints
Note that you can still use the values of the column that you did not use as the activity name to filter incomplete cases with the ‘Endpoints’ filter.
For example, if you used Solution 2 (see above) but wanted to remove all cases that ended in the ‘New value’ = ‘Aborted’ you can configure the desired end status based on the ‘New value’ attribute with the Endpoints filter as shown below:
In summary, what you can take away from this is the following:
If you encounter the ‘Old value / New value’ situation, often just using one of the two columns is preferred to get the expected view of status changes in the process map.
If you choose the ‘Old value’ column, you will lose the very last status change in each case.
If you choose the ‘New value’ column, you will miss the very first status in each case.
You can still filter start and end points based on the attribute column that you did not use for the activity name.
In most situations, this is enough and you can use your ‘Old value / New value’ data just as it is. If, however, you really need to see the very first and the very last status in your process flow, then you would need to reformat your source data into the standard process mining format and add the missing start or end status as an extra row.
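A minimal sketch of such a reformatting step, assuming illustrative ‘Old value / New value’ rows: the ‘Old value’ becomes the activity, and one extra row per case is appended so that the final status is not lost:

```python
# Illustrative 'Old value' / 'New value' rows: (case_id, old, new, timestamp)
rows = [
    ("1", "New", "Opened", "2015-01-01 10:00"),
    ("1", "Opened", "Closed", "2015-01-02 09:00"),
    ("2", "New", "Opened", "2015-01-01 11:00"),
    ("2", "Opened", "Aborted", "2015-01-03 12:00"),
]

event_log = []
last_status = {}
for case_id, old, new, ts in rows:
    event_log.append((case_id, old, ts))   # 'Old value' as the activity
    last_status[case_id] = (new, ts)       # remember the final status per case

# Append the missing end status of each case as an extra event
# (it reuses the timestamp of the last status change, so it sorts to the end)
for case_id, (new, ts) in last_status.items():
    event_log.append((case_id, new, ts))
```

With the extra rows, case 1 now reads New, Opened, Closed and case 2 reads New, Opened, Aborted, so both the first and the last status appear in the process map.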
Have you dived into process mining and just started to see the power of bringing the real processes to life based on data? You are enthusiastic about the possibilities and could already impress some colleagues by showing them a “living” process animation. Perhaps you even took the Process Mining MOOC and got some insights into the complex theory behind the process mining algorithms.
You probably realized that there is a lot more to it than you initially thought. After all, process mining is not just a pretty dashboard that you put up once, but it is a serious analysis technique that is so powerful precisely because it allows you to get insights into the things that you don’t know yet. It needs a process analyst to interpret the results and do something with it to get the full benefit. And like the data scientists say, 80% of the work is in preparing and cleaning the data.
So, how do you make the next step? What data quality issues should you pay attention to, and how do you structure your projects to make sure they are successful? How can you make the business case for using process mining on a day-to-day basis?
We are here to help you and have just opened our process mining training schedule for autumn 2015.1 In the past, we held 1-day trainings that gave a good introduction to the practical application of process mining, but there was never enough time to practice. That is why earlier this year we started to give an extended 2-day course, which runs through a complete project in small-step exercises on the second day.
The feedback so far has been great. Here are two quotes from participants of the last 2-day training:
Practical, insightful, and at times amazing.
Very useful. In two days, if one already has a little background on Process Mining, you just become an expert, or at least this is how it feels.
The course is suitable for complete beginners, but if you have already some experience don’t be afraid that it will be boring for you. The introductory part will be quick and we will dive into practical topics and hands-on exercises right away.
The training groups are deliberately kept small and some seats have already been taken, so be quick to make sure you don’t miss your opportunity to become a real process mining expert!
If the dates don’t fit or you prefer an on-site training at your company (also available in Dutch and German), contact Anne to learn more about our corporate training options. ↩
This is a guest post by Nicholas Hartman (see further information about the author at the bottom of the page) and the article is part II of a series of posts highlighting lessons learned from conducting process mining projects within large organizations (read Part I here).
If you have a process mining article or case study that you would like to share as well, please contact us at firstname.lastname@example.org.
Timestamps are core to any process mining effort. However, complex real-world datasets frequently present a range of challenges in analyzing and interpreting timestamp data. Sloppy system implementations often create a real mess for a data scientist looking to analyze timestamps within event logs. Fortunately, a few simple techniques can tackle most of the common challenges one will face when handling such datasets.
In this post I’ll discuss a few key points relating to timestamps and process mining datasets, including:
Reading timestamps with code
Useful time functions (time shifts and timestamp arithmetic)
Understanding the meaning of timestamps in your dataset
Note that in this post all code samples will be in Python, although the concepts and similar functions will apply across just about any programming language, including various flavors of SQL.
Reading timestamps with code
As a data type, timestamps present two distinct challenges:
The same data can appear in many different formats
Concepts like time zones and daylight savings time mean that the same point in real time can be represented by entirely different numbers
To a computer, time is a continuous series. Subdivisions of time like hours, weeks, months, and years are formatted representations of time displayed for human users. Many computers base their understanding of time on so-called Unix time, which is simply the number of seconds elapsed since the 1st of January 1970 (UTC). To a computer using Unix time, the timestamp of 10:34:35pm UTC on April 7, 2015 is 1428446075. While you will occasionally see timestamps recorded in Unix time, it is more common for a more human-readable format to be used.
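For illustration, converting the Unix timestamp from this example back into a human-readable UTC datetime takes only a couple of lines in Python:

```python
import datetime

# Unix time: seconds since 1 January 1970 (UTC)
unix_ts = 1428446075

# Convert the raw counter into a timezone-aware UTC datetime
dt = datetime.datetime.fromtimestamp(unix_ts, tz=datetime.timezone.utc)
print(dt)  # 2015-04-07 22:34:35+00:00  (i.e., 10:34:35pm UTC)
```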
Converting from this human readable format back into something that computers understand is occasionally tricky. Applications like Disco are often quite good at identifying common timestamp formats and accurately ingesting the data. However, if you work with event logs you will soon come across a situation where you’ll need to ingest and/or combine timestamps containing unusual formats. Such situations may include:
Ingesting data into another system (e.g., loading it into a database)
Combining event logs from multiple source systems into one dataset
The following scenario is typical of what a data scientist might find when attempting to complete process mining on a complex dataset. In this example we are assembling a process log by combining logs from multiple systems. One system resides in New York City and the other in Phoenix, Arizona. Both systems record event logs in the local time. Two sample timestamps appear as follows:
System in New York City: 10APR2015 23.12.17:54
System in Phoenix, Arizona: 10APR2015 20.12.18:72
Such a situation presents a few headaches for a data scientist looking to use such timestamps. Particular issues of concern are:
The format of the timestamps is non-standard
Both systems are run in local time rather than UTC
The systems are in different time zones (US-Eastern and US-Mountain)
New York uses daylight savings time whereas Arizona does not
You can see how this can all get quite complicated very quickly. In this example we may want to write a script that ingests both sets of logs and produces a combined event log for analysis (e.g., for import into Disco). Our primary challenge is to handle these timestamp entries.
Ideally, all system admins would be good electronic citizens and run all their systems’ logging functions in UTC. Unfortunately, experience suggests that this is wishful thinking. However, with a bit of code it is easy to quickly standardize this mess onto UTC and then move forward with any datetime analytics from a common and consistent reference point.
First we need to get the timestamps into a form recognized by our programming language. Most languages have some form of a ‘string to datetime’ function. Using such a function you provide a datetime string and format information to parse this string into its relevant datetime parts. In Python, one such function is strptime.
We start by using strptime to ingest these timestamp strings into a Python datetime format:
# WE IMPORT THE REQUIRED PYTHON MODULES (you may need to install these first)
import datetime
import pytz

# WE INPUT THE RAW TEXT FROM EACH TIMESTAMP
ny_date_text = "10APR2015 23.12.17:54"
az_date_text = "10APR2015 20.12.18:72"

# WE CONVERT THE RAW TEXT INTO A NAIVE DATETIME
# e.g., %d = day number and %S = seconds
ny_date = datetime.datetime.strptime(ny_date_text, "%d%b%Y %H.%M.%S:%f")
az_date = datetime.datetime.strptime(az_date_text, "%d%b%Y %H.%M.%S:%f")

# WE CHECK THE OUTPUT; NOTE THAT FOR A NAIVE DATETIME NO TIME ZONE IS SPECIFIED
print(ny_date)
>>> 2015-04-10 23:12:17.540000
At this point we have the timestamp stored as a datetime value in Python; however, we still need to address the time zone issue. Currently our timestamps are stored as ‘naive’ time, meaning that no time zone information is attached. Next we will define a time zone for each timestamp and then convert them both to UTC:
# WE DEFINE THE TWO TIMEZONES FOR OUR TIMESTAMPS
# NOTE: THE ‘ARIZONA’ TIMEZONE IS ESSENTIALLY MOUNTAIN TIME WITHOUT DAYLIGHT SAVINGS TIME
tz_eastern = pytz.timezone('US/Eastern')
tz_mountain = pytz.timezone('US/Arizona')
# WE CONVERT THE LOCAL TIMESTAMPS TO UTC
ny_date_utc = tz_eastern.localize(ny_date, is_dst=True).astimezone(pytz.utc)
az_date_utc = tz_mountain.localize(az_date, is_dst=False).astimezone(pytz.utc)
# WE PRINT THE OUTPUT; NOTE THAT THE UTC OFFSET OF +00:00 IS NOW ALSO RECORDED
print(ny_date_utc)
print(az_date_utc)
>>> 2015-04-11 03:12:17.540000+00:00
>>> 2015-04-11 03:12:26.720000+00:00
Now we have both timestamps recorded in UTC. In this sample code we manually inputted the timestamps as text strings and then simply printed the results to a terminal screen. An example of a real-world application would be to leverage the functions above to read in raw data from a database for both logs, process the timestamps into UTC and then write the corrected log entries into a new table containing a combined event log. This combined log could then be subjected to further analytics.
Useful time functions
With timestamps successfully imported, there are several useful time functions that can be used to further analyze the data. Among the most useful are time arithmetic functions that can be used to measure the difference between two timestamps or add/subtract a defined period of time to a timestamp.
As an example, let’s find the time difference between the two timestamps imported above:
# WE COMPUTE THE DIFFERENCE BETWEEN THE TWO TIMESTAMPS
timeDiff = az_date_utc - ny_date_utc
print(timeDiff)
>>> 0:00:09.180000
The raw output here reads a time difference of 9 seconds and 180 milliseconds. Python can also represent this in rounded integer form for a specified time measurement. For example:
# WE OUTPUT THE ABOVE AS AN INTEGER IN SECONDS
print(int(timeDiff.total_seconds()))
>>> 9
This shows us that the time difference between the two timestamps is 9 seconds. Such functions can be useful for quickly calculating the duration of events in an event log. For example, the total duration of a process could be quickly calculated by comparing the difference between the earliest and latest timestamp for a case within a dataset.
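As a sketch with made-up timestamps, such a case duration calculation could look like this:

```python
import datetime

# Hypothetical event timestamps for one case in the event log
case_events = [
    datetime.datetime(2015, 4, 11, 3, 12, 17),
    datetime.datetime(2015, 4, 11, 3, 12, 26),
    datetime.datetime(2015, 4, 13, 9, 30, 0),
]

# Total case duration = latest timestamp minus earliest timestamp
duration = max(case_events) - min(case_events)
print(duration)
print(int(duration.total_seconds()))
```

The same pattern, applied per case ID over a whole dataset, gives you the throughput time of every case.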
These date arithmetic functions can also be used to add or subtract defined periods of time to a timestamp. Such functions can be useful when manually adding events to an event log. For example, the event log may record the start time of an automated process, but not the end time. We may know that the step in question takes 147 seconds to complete (or this length may be recorded in a separate log). We can generate a timestamp for the end of the step by adding 147 seconds to the timestamp for the start of the step:
# WE ADD 147 SECONDS TO OUR TIMESTAMP AND THEN OUTPUT THE NEW RESULT
az_date_utc_end = az_date_utc + datetime.timedelta(seconds=147)
print(az_date_utc_end)
>>> 2015-04-11 03:14:53.720000+00:00
Understanding the meaning of timestamps in your dataset
Having the data cleaned up and ready for analysis is clearly important, but equally important is understanding what data you have and what it means. Particularly for data sets with a global geographic scope, it is crucial to first determine how timestamps have been represented in the data. Regarding the timestamps in your event logs, some key questions you should be asking are:
How was my dataset generated? (e.g. has the data been pre-processed from multiple systems and what steps were taken to standardize it?)
Are all timestamps standardized to a single time zone?
What are the expected hours of activity for each geography and does the data confirm this?
What triggers the creation of the timestamp in your log? (e.g., is it an automated process or is it triggered by a human pushing a button?)
Does the timestamp represent the beginning, middle or end of a particular step?
If the timestamp is recorded because of a human action, does that action always take place at the same point in the process? (e.g., do some users record data in a system at the beginning of a step while others wait until the end of that same step?)
What does the time between two adjacent timestamps represent? (e.g., does this time represent work occurring, or a pause waiting for work to begin again?)
Are there additional datasets that can be used to add additional detail to these gaps in the timestamps? (e.g., in the example above where a separate log contains the duration of a particular step)
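One simple sanity check for the “expected hours of activity” question above is to count events per geography and hour of day; events far outside office hours hint at time zone mix-ups or automated (non-human) timestamps. The sample data here is hypothetical:

```python
import datetime
from collections import Counter

# Hypothetical UTC event timestamps tagged with their source geography
events = [
    ("NY", datetime.datetime(2015, 4, 10, 14, 5)),
    ("NY", datetime.datetime(2015, 4, 10, 15, 30)),
    ("NY", datetime.datetime(2015, 4, 11, 3, 12)),  # suspicious: ~11pm local time
    ("AZ", datetime.datetime(2015, 4, 10, 16, 45)),
]

# Count events per (geography, hour of day)
hours = Counter((geo, ts.hour) for geo, ts in events)
for (geo, hour), n in sorted(hours.items()):
    print(f"{geo} {hour:02d}:00  {n} event(s)")
```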
While this piece was hardly an exhaustive look at programmatically handling timestamps, hopefully you’ve been able to see how some simple code is able to deal with the more common challenges faced by a data scientist working with timestamp data. By combining the concepts described above with a database it is possible to write an automated script to quickly ingest a range of complex event logs from different systems and output one standardized log in UTC. From there, the process mining opportunities are endless.
Note that in Disco you configure the timestamp pattern to fit the data (rather than having to provide the data in a specific format) and you can actually import merged data sets from different sources with different timestamp patterns: Just make sure they are in different columns, so that you can configure their formats independently. ↩