One of the challenges of applying process mining is that different skills need to come together to make it a success. Sometimes, you will find multiple skills in one person, but often you need to put together a multi-disciplinary team of people complementing each other.
Here is an overview of the most important roles that your team should cover.
While you will define what kind of data you need for your process mining project, you will typically not extract the data from the IT system yourself. Instead, you will work together with the IT department who will extract the data for you.
The IT administrator will also be able to help you clarify questions about the data itself and provide you with a data dictionary about the meaning of the different data fields.
It is a good idea to involve the IT team early in your project, so that they understand what you want to do and what kind of data you need.
Some systems can provide a data extract that can immediately be used for your process mining analysis. However, more often than not you will need to combine different data sources or re-format your data in some way.
While most process analysts will be able to re-work their source data in Excel, for larger data sets you need the skills to merge and process your data via SQL, ETL tools, or scripting languages like Python or R. For such projects, you need to have someone on board who can do these data transformations for you.
Data / Process Analyst
The actual analysis of the data is the home turf of the process mining analyst. Keep in mind that the data analysis not only covers answering your process questions but also includes testing for and fixing data quality problems.
If your project is a process improvement project, it is a very good idea to make sure that you have a Lean Six Sigma practitioner or some other kind of process improvement expert on board. They are trained to suggest and evaluate process improvement alternatives from a business perspective.
If your analysis falls into another process mining use case — for example, you may be using process mining to support your internal audits — then you need someone in your team who is an expert in this profession.
Project and Change Management
Just like with any other project, you need project management skills to scope your project, define realistic milestones, and manage the progress of the project.
Furthermore, actually implementing the process changes is necessary to realize the benefits from your process mining analysis. You need a change manager to help the business unit through the process changes that come out of your process mining project.
In many situations, the process mining team will perform projects for different business units in the company. To ensure that your process mining analysis will have an impact, you need a strong sponsor who is actually interested in the results.
A sponsor who crosses their arms and says “Surprise me” is a red flag. Instead, look for someone who is enthusiastic about the possibilities of process mining and who is willing to provide you with the support and the resources that you need.
One of the resources that you need for a successful process mining project is access to a domain expert. Typically, this is not the process manager themselves but another process expert in their team.
This subject matter expert will help you define the analysis questions for the project, perform the data validation session with you, and review intermediary findings in a series of workshop sessions throughout the project.
A last stakeholder who is not in the picture above but nevertheless very important is the privacy and ethics expert in your company. Read our guidelines on Privacy, Security, and Ethics in Process Mining here and take those lessons aboard in your process mining project.
The date has been set: Process Mining Camp 2017 will take place in Eindhoven1, the birth place of process mining, on 29 & 30 June 2017.
For the sixth time, process mining enthusiasts from all around the world will come together for a unique experience. Last year, more than 210 people from 165 companies and 20 different countries came to camp to listen to inspiring talks, share their ideas and experiences, and make new friends in the global process mining community.
For the first time, this year’s Process Mining Camp will run for two days:
On the second day (30 June), we will have a half day of hands-on workshops. Here, smaller groups of participants will get the chance to dive into various process mining topics in depth, guided by an experienced expert.
Eindhoven is located in the south of the Netherlands. Besides its own local airport, it can also be reached easily from Amsterdam’s Schiphol airport (direct connection from Schiphol every 15 minutes; the journey takes about 1 h 20 min).
When you perform a process mining analysis, the discovered process map and the variants are only the starting point. You then want to dive deeper into the process based on the questions that you have about it.
One of the typical questions is about the performance of the process. For example, you may have a service level agreement (SLA) with respect to the overall throughput time of the process. Within Disco, you can analyze the case duration distribution and you can filter your data to focus on the slow cases to find out where in the process they lose so much time (see also the video at the top for a demonstration of how to do this).
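The SLA check described above can also be sketched in a few lines of plain Python. The snippet below is a minimal illustration, not how Disco works internally: the event log, the field layout, and the 14-day SLA threshold are all made up for this example. Each case's duration is computed as the time between its first and its last event, and cases exceeding the SLA are flagged.

```python
from datetime import datetime

# Hypothetical event log as (case id, activity, timestamp) tuples;
# the layout and values are illustrative only.
events = [
    ("c1", "Register", "2017-03-01 09:00"),
    ("c1", "Close",    "2017-03-03 17:00"),
    ("c2", "Register", "2017-03-01 10:00"),
    ("c2", "Close",    "2017-03-20 12:00"),
]

def case_durations(events):
    """Duration of each case = last event timestamp minus first."""
    bounds = {}
    for case, _, ts in events:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        lo, hi = bounds.get(case, (t, t))
        bounds[case] = (min(lo, t), max(hi, t))
    return {case: hi - lo for case, (lo, hi) in bounds.items()}

SLA_DAYS = 14  # assumed SLA threshold for this example
durations = case_durations(events)
breaches = [c for c, d in durations.items() if d.days > SLA_DAYS]
print(breaches)  # ['c2'], which took 19 days and misses the 14-day SLA
```

In a process mining tool you would then filter down to exactly these slow cases and inspect where in the process they lose their time.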
Once you discover a bottleneck in your process, the animation is a very powerful tool to visualize the bottleneck to your co-workers. Rather than just giving them abstract statistics and charts, they can literally see where a lot of the cases are piling up and where the queuing occurs (see below). This will help you to explain your findings and engage them in discussions about how the process can be improved. As soon as a bottleneck has been resolved, you can focus on the next one to support a continuous improvement of your process.
Once you dig into the performance analysis of your process, there are two things that are helpful to know. In this article, we give you these two tips so that you can perform better bottleneck analyses on your own data.
Tip No. 1: Consider the median instead of the mean
All the performance metrics in Disco (for example, the case durations and the activity durations, but also the performance metrics in the process map) give you both the mean and the median duration.
Often, there is quite a difference between the two. For example, if you look at the case duration below (click on the image to see a larger picture), you will notice that the mean case duration is 21.5 days while the median case duration is just 12 days. That means the median case duration is little more than half of the mean case duration for this process!
The reason that this can happen is that the mean is much more susceptible to outliers. To understand why, let’s take a look at how both the mean and the median are calculated. In the figure below, you can see seven measurements lined up according to their size. For example, these could be seven cases of which we have measured the throughput time: Two cases were measured with 1 day throughput time, one case was measured with 2 days throughput time, three cases were measured with 3 days throughput time, and one case — our outlier — had a throughput time of 30 days.
Now, the median is defined as the value that separates the lower 50% from the higher 50% of the measurements. So, 3 would be the median value in this example, because half of the cases took longer (or equally long) and half of the cases were faster. In contrast, the mean (or average) value is calculated as the sum of all values divided by the number of values. So, the mean yields 6.14 in this example. The mean is more than twice as high as the median, because the mean is much more influenced by the one extreme case with the 30-day throughput time.
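You can verify this arithmetic with a few lines of Python using only the standard library:

```python
from statistics import mean, median

# The seven throughput times (in days) from the example above,
# including the single 30-day outlier.
throughput_days = [1, 1, 2, 3, 3, 3, 30]

print(median(throughput_days))           # 3: the middle value, unaffected by the outlier
print(round(mean(throughput_days), 2))   # 6.14: pulled upwards by the one 30-day case
```

Removing the outlier barely changes the median, while the mean drops sharply. That is exactly the robustness property that makes the median the better summary for skewed process data.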
In practice, many processes have a distribution similar to the picture above. For example, your customer service process may typically take up to two weeks, but there are these few, very complicated cases that took one or two years to resolve. Or, while a typical incident can be closed in 8-10 steps, there is this one extreme case that was ping-ponged between different groups more than 200 times.
In such processes, the median (also known as the 50th percentile) gives you a much better idea of the typical performance characteristics of a process than the arithmetic mean. Therefore, the median can often better point you to the places in the process that typically are quite slow. For example, from the mean durations visualized in the illustration below on the left, you can get the impression that basically the whole area on the left of the process is problematic in terms of performance. The median performance view, shown on the right, makes it clear that the bulk of the problems actually lies with one activity on the lower left.
Of course, there are still situations where you might want to use the mean. One reason can be that it is more easily understood by people who are not statistically minded. Or your KPIs might be defined based on the mean, so you should use the mean for your analysis, too. But keep in mind that if you have a skewed distribution with heavy outliers, the mean can be misleading and the median will be a better metric to get a sense of what a typical value looks like.
Tip No. 2: Combine total duration with the median
The second tip that we want to give you is to keep in mind that neither the mean nor the median takes the frequency into account. This can be a problem, because you want to focus your improvement efforts on those places in the process where they can have the most impact.
For example, let’s take a look at the process map below. We have used the median for the performance visualization, and it looks as if the path that typically takes 5.6 days is the biggest problem.
However, once we switch to the frequency view, we can see that the path right next to it is about 10 times as frequent. So, although the median delay on that path was just 3 days (instead of 5.6 days), the impact of improving this particular bottleneck will be greater.
The best way to take the frequency into account in your bottleneck analysis is to use the total duration (see the screenshot below). The total duration gives you the sum of all the delays in the data set and, therefore, naturally takes both the actual delays and the frequency into account. So, you can clearly see the big, fat, red arrow in the process map pointing to the biggest bottleneck that you should address first.
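As a back-of-the-envelope illustration of why the ranking flips, here is a small Python sketch. The per-case delay values are invented to mirror the numbers above: one path that is slow but rare, and a neighbouring path that is faster but ten times as frequent.

```python
from statistics import median

# Hypothetical per-case delays (in days) on two paths; the values
# are made up to echo the example in the text.
delays = {
    "slow path":     [5.6] * 10,    # rare but slow
    "frequent path": [3.0] * 100,   # faster, but 10x as frequent
}

for path, values in delays.items():
    print(path, "median:", median(values), "total:", round(sum(values)))

# The median ranks the slow path as the bigger problem (5.6 > 3.0),
# but the total duration shows that the frequent path accumulates far
# more overall delay (300 days vs. 56 days).
```

The median answers "how long does a typical case wait here?", while the total duration answers "how much waiting does this path cause overall?". For prioritizing improvement work, the second question is usually the right one.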
The only drawback of the total duration is that the numbers easily add up to months or years. As a result, it is hard to get a sense of what the typical delay of a path or activity is in the process. To address this, you can add the mean or median duration as a secondary metric (see screenshot below). The secondary metric will appear in smaller font below the primary metric in the process map. We can see the 5.6 days median measurement re-appear in the process map, but it is now clear that the path to the left is the bigger problem we should focus on.
Now you have the best of both worlds: The total duration as the primary metric draws your attention to the right places in the process map and helps you to focus on the high-impact areas for your improvement project. At the same time, through the secondary metric you can easily see what the average or typical delay is at this place.
By using process mining, organizations can see how their processes really operate. The result is new insights about these processes that cannot be obtained in any other way. However, there are a few things that can go wrong.
Process mining doesn’t usually begin as a top-down initiative. Typically, there are a few enthusiastic people who want to do something with it. When they start a process mining initiative within their organization, they need to avoid the following classic pitfalls.
First of all, being too fascinated with the technology itself can lead to an inability to show the added value from a business perspective. Secondly, an unrealistic picture of data availability, fueled by the promise of Big Data, can lead to overblown expectations. And thirdly, due to a mistaken understanding of what process mining can do, the first project is often too ambitious in scope. Too much is promised and it takes too long before the first results can be shown. This undermines the belief within the business that process mining produces a good ROI. A failed project then not only leads to a decrease in the entrepreneurial and innovative spirit among the process mining enthusiasts, but there is also the risk that process mining will not be picked up again in a new project for years.
In this article, Frank van Geffen and Anne Rozinat give you tips about the pitfalls and advice that will help you to make your first process mining project as successful as it can be.
So, how can you make sure that your process mining initiative is successful? What makes the difference between success and failure? We provide you with a roadmap (see Figure 1) and discuss four success factors.
Figure 1: Roadmap to making your process mining project successful
Success factor No. 1: Focus on the business value
Do: Define the business value in terms of effectiveness (customer experience and revenue), efficiency (costs) and risk (reliability). Determine which process aspects you want to gain insight into. To which business driver does this insight contribute? Better customer experience, cost reduction, risk mitigation?
Don’t: Don’t be overly fascinated with the possibilities of the technology. There are often multiple ways to get answers for your questions, and sometimes multiple data analysis techniques must be combined to get the full picture. Do not become fixated on ‘only’ using process mining.
Success factor No. 2: Start small, think big
Do: Connect the business driver to a specific business domain. Choose a process where the beginning and the end are clearly defined. Check whether this process is supported by an IT system. For example, call center or service desk processes are very suitable for a first project, because the data can be easily extracted from these systems. Workflow systems are also a good source of data for your process mining project. Each manager of such a process will benefit from insights that help to reduce costs or increase effectiveness. This makes sponsorship at the management level possible. Choose a sponsor who is willing to support you (a sponsor who crosses their arms and says “Surprise me” is a red flag). And while you think about the possible use cases and application possibilities, also make sure to communicate what process mining is not (see Figure 2). By indicating clear boundaries, you can manage expectations about what it is.
Don’t: Do not start with the most important core process of your company. That will come later, once the first results have convinced people of the approach. For example, don’t choose the production and supply process of your beer company for your first process mining project. Instead, start with the purchasing process. You will be amazed at how much value is added to the primary process through an effective and efficient purchasing process.
Figure 2: To fully communicate what process mining is, you need to understand what process mining is not
Success factor No. 3: Work hypothesis-driven and in short cycles
Do: Divide the main business driver into sub-hypotheses that you can confirm or disprove with a process mining analysis. For example: There is a gut feeling that this service process takes too long. How long does the process really take? How much does it deviate from the expectation? Where are the bottlenecks that cause the delays in this process? In practice, measuring and making the actual throughput times visible already provides an insight that the ‘business’ has been losing sleep over. In addition, you can then indicate where exactly the delays are in the process. Take your business stakeholders from insight to insight. Stimulate them to ask questions. Explore, analyze and innovate. Time-box the intermediate results and the project. Eight weeks is usually a good aim for the first project.
Don’t: Do not try to immediately answer all questions. The first insights often raise further questions, which then require further analysis. Avoid the pitfall of wanting to answer all possible questions beforehand (analysis paralysis) and use your initial hypotheses as a guideline to avoid being lost in the data and its possibilities.
Success factor No. 4: Facts don’t lie
Do: Process mining allows you to analyze processes based on facts instead of subjective opinions. Speak openly and transparently about the data that you use and about the facts that come out of this analysis. This can be confrontational and, for some people, even unwelcome. Put a change management team together that has the competency to handle resistance. For example, you can integrate process mining in a project where the Lean philosophy is used. In these types of projects, people are encouraged to tell each other the ‘truth’ and, therefore, are enabled to tackle and solve the real problems. Process mining can be the perfect aid in this truth-finding. Always use experts from the business process domain and the IT domain for a sanity check of the data and the analysis. Use process mining as a constructive starting point to ask the right questions and avoid premature judgments.
Don’t: Never be careless in handling, preparing and analyzing the data. If you skip the data quality checks and present conclusions based on data that turns out to be wrong, you will often lose the trust of the business forever. Do not assume that all the information is in your data (often relevant context information needs to be considered to draw the right conclusions). Do not draw forced conclusions based on incomplete data (if your questions cannot be answered based on the available data, say so) and do not present anything that cannot be supported by facts.
With all these challenges, you can sometimes lose sight of the great possibilities that process mining provides. But don’t despair: an exciting journey lies ahead!
With process mining it is possible to look at your processes at a much more detailed level. You connect to the real processes and you analyze them based on facts. And after each process change, the analysis can be repeated quickly and easily.
But what exactly could be the outcome of such a process mining analysis?
On a high level, there are four main outcomes of a process mining analysis (see also picture above). For any process mining project, a combination of these outcomes can apply.
1. An answer
Sometimes, the outcome is just an answer. For example, imagine you are the manager of a process and have received complaints that this process is taking too long. There is an internal Service Level Agreement (SLA) and you want to know whether the complaints are justified (and if so, how often it happens that the SLA is not met). Getting an answer to this question is the primary goal of the process mining analysis.
Another example would be a data science team that supports a customer journey project, where the customer experience is completely re-designed. To make sure that the new system supports the customers in the best way, the data scientists have been asked to analyze what the most common interaction scenarios are.
Finally, think of an auditor who assesses the compliance of a process. The audit report with the summary of their findings will be the main outcome of the process mining analysis.
2. Process change
In many situations, the outcome will be a process change. For example, a particular process step may be automated. There might be organizational changes to address the high workload and shortage of resources in a certain group. An update to the FAQ or website of the company could be made to prevent unnecessary customer calls. Based on the assessment of the audit team, a new control could be implemented in the IT system to reduce the risk of fraud. Or based on the analysis of an outsourced service process at an electronics manufacturer, the contracts with the outsourcing partners will be renegotiated in the next year.
Typically, the analysis will be repeated after some time to see whether the change was as effective as one had hoped. It is easy to repeat a process mining analysis with fresh data to investigate these effects. The outcome of the follow-up analysis can then again be just an answer or result in more process changes.
3. A new KPI
Sometimes, you can also discover a new KPI that was not known before. For example, imagine you are analyzing a payment process where the company can get 2% discount from their suppliers if they pay within 10 days. You realize that there are two main phases in this process: (1) the posting of the invoice to the system and (2) several approval steps, before the payment can be run on two fixed days in the week. You implement an additional reminder to the approvers in the financial system (a process change), which reminds the managers who need to approve the invoice to do so more quickly. But now the late posting of the invoices is the main problem. You realize that if they are not posted within 3 days, there is almost no chance to get the payment through on time. And you want to monitor this new KPI in an automated way.
Like the process change, this will happen outside of the process mining tool. But after understanding the process and the data (to know where the measurement points for the KPI need to be placed), it is typically easy to add such a new KPI to your existing dashboard or BI system.
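To make this concrete, here is a minimal sketch of how such a posting-lag KPI could be computed. The invoice records, the field names, and the 3-day threshold are all illustrative and not tied to any particular financial system:

```python
from datetime import date

# Hypothetical invoice records; 'received' and 'posted' are the two
# measurement points for the new KPI.
invoices = [
    {"id": "INV-1", "received": date(2017, 5, 1), "posted": date(2017, 5, 2)},
    {"id": "INV-2", "received": date(2017, 5, 1), "posted": date(2017, 5, 8)},
]

POSTING_DEADLINE_DAYS = 3  # beyond this, the 10-day discount window is at risk

def late_postings(invoices):
    """Return the IDs of invoices whose posting lag exceeds the threshold."""
    return [inv["id"] for inv in invoices
            if (inv["posted"] - inv["received"]).days > POSTING_DEADLINE_DAYS]

print(late_postings(invoices))  # ['INV-2'], posted 7 days after receipt
```

A query of this shape (however it is implemented in your BI system) can then run automatically on fresh data, so the new KPI is monitored continuously instead of being re-measured by hand.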
4. Optimization and further analysis
Finally, sometimes further analysis is needed after the process mining analysis has been completed. For example, let’s say you analyze the fall-out from a sales process, which means that you are looking at those customers who were interested in your products but for whichever reason never completed the ordering process (their revenue has been lost). You want to follow up with them and be pro-active offering help before it is too late. However, you only want to follow-up with the customers who are most likely to buy.
This would be a scenario where a data science team sets up and trains a prediction algorithm in one of the available data mining or machine learning frameworks. It will be a custom application that is targeted at one very specific problem (predicting which customers you should call). The prediction algorithm gets better over time, learning from the historical data. But to set it up in the first place, it helps to understand the process and the possible process patterns that might have an influence and, therefore, could be a good parameter in the model.
In addition, there are many scenarios where process miners will perform further analyses in other, complementary tools. For example, a Lean Six Sigma practitioner will want to perform additional statistical analyses in Minitab, data scientists might use data mining tools to discover correlations between the process variants and other attributes in the data, process improvement experts might want to run alternative what-if scenarios in a simulation software, and auditors might take some of the findings from their explorative analysis in Disco to their regular audit tools to include them in the standard check procedures.
All of these tools are specializing in different areas and can be used together. Process mining provides important input for these follow-up analyses by providing a process perspective on the data.
So, what outcomes can you expect from process mining for your own work?
To find out, first start learning more about process mining to fully understand how it works and what it can do. Download the process mining software Disco and contact us for an extended evaluation license to explore some of your own data sets.
Although it is not strictly necessary to understand the algorithms behind process mining to use a process mining tool, it will greatly enhance your view of the process mining field, and we highly recommend signing up for the MOOC and giving it a try. This is a university-level process mining course of excellent quality, given by Prof. Wil van der Aalst himself. You can read an interview with Wil about the MOOC here.
Over 100,000 people have registered for earlier versions of the course in the last two years. If you have not participated yet, don’t wait and register now!
One of the questions when starting out with process mining is “What is the added value for me and my organization?”. To answer this question, you first have to understand your use case. One ingredient of understanding your use case is to understand who will be using process mining and why.
In the picture above, you see some of the most typical places in an organization where process mining is used. Depending on the role, the concrete value will be different. Given your role, you have to think about “How does my job get easier or better with process mining, compared to not using process mining?”.
Let’s take a quick look at the six use cases above.
1. Process Improvement Teams
There are many different terms used for process improvement teams in organizations: Process Excellence, Operational Excellence, Process Performance Management, etc. These teams often use Lean Six Sigma methods in their improvement initiatives and, as a central team, help different business units in the organization. Process mining fits very well into their toolbox and allows them to analyze the true processes based on data, rather than through manual inspections and interviews.
Process mining itself is agnostic to the improvement method that you use. This means that it does not matter whether your organization uses BPM, Theory of Constraints, Lean, Six Sigma, or Lean Six Sigma. Process mining does not replace these methods. Instead, the business analysts will use their improvement framework to interpret the process mining results, drive the change, and verify whether the outcome was effective.
The benefit of using process mining in process improvement projects is that the actual processes can be analyzed much faster, and much deeper, than they could be in any manual way. This does not mean that the workshops with process managers and other stakeholders in the business unit go away: Instead, you will start the conversation with them on another level. You can show them the process and say “This is what we are seeing. Do you know why this is happening?” (instead of wasting hours of their time by letting them explain to you how the process works).
2. Data Science Teams
Many organizations have started to build data science teams, because they have recognized the value of increasing amounts of data and they want to be able to make use of it. Data scientists are typically well-versed in all kinds of technologies. They know how to work with SQL, NoSQL, ETL tools, statistics, scripting languages such as Python, data mining tools, and R. And they know that 80% of the work consists of the processing and cleaning of data.
Data scientists are starting to adopt process mining, because it fills a gap that is not covered by existing data-mining, statistics and visualization tools: It can discover the actual end-to-end processes. Process mining also allows data scientists to work much faster. Even if you could write an SQL query that answers your particular process question, the process mining tool shows you the full process right after importing and allows you to directly filter the data without any programming.
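To make the comparison concrete, here is a minimal sketch in Python of the kind of result a process mining tool computes automatically right after import: the directly-follows relation (which activity is directly followed by which, and how often). The toy event log is made up for this illustration; a real log would come with timestamps and be sorted by time within each case:

```python
from collections import Counter

# Toy event log as (case id, activity) pairs, already ordered by
# time within each case. Values are illustrative only.
log = [
    ("c1", "Register"), ("c1", "Check"), ("c1", "Approve"),
    ("c2", "Register"), ("c2", "Check"), ("c2", "Reject"),
    ("c3", "Register"), ("c3", "Approve"),
]

def directly_follows(log):
    """Count how often one activity directly follows another in the same case."""
    edges = Counter()
    last = {}  # last seen activity per case
    for case, activity in log:
        if case in last:
            edges[(last[case], activity)] += 1
        last[case] = activity
    return edges

edges = directly_follows(log)
print(edges[("Register", "Check")])    # 2
print(edges[("Register", "Approve")])  # 1
```

These edge counts are essentially the raw material of the discovered process map. Writing the equivalent logic as a hand-rolled SQL self-join per question is possible, but a process mining tool gives you this view, plus filtering, without any programming.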
Furthermore, data science teams do not analyze data for its own sake, but to solve problems and issues for the business. Process mining helps them to communicate their analysis results back to the business in a meaningful way. Charts and statistics are often too abstract to summarize a process. So, being able to provide a visual representation of the process makes your explanation much more accessible to the process manager.
3. Process Managers
Process managers are responsible for one particular process in the organization. The methods they use are often similar to the central process improvement teams (see above), but instead of working with different departments at different times they focus on their own processes and repeatedly analyze them for continuous improvement.
When a process manager adopts process mining, they have the advantage that they have all the domain knowledge available to interpret the data and the process correctly. This is a great advantage, because process mining not only requires expertise in how to do the actual analysis; the domain knowledge to interpret what you are seeing is absolutely crucial as well. At the same time, process managers typically need some training in a process improvement method (like Lean).
Process managers focus on operational questions, and process mining brings them an eye-opening transparency about what is actually going on in their process. Once they have completed a process mining analysis, they can easily repeat it to see whether the improvements were as effective as they had hoped.
4. Auditors
The role of internal audit departments is to help organizations ensure effectiveness and efficiency of operations, reliability of financial reporting, and compliance with laws and regulations in an independent and objective manner. External auditors provide assurance from outside the organization.
Both groups can benefit from process mining in many ways. Clearly, processes are not all an auditor looks at. For example, an IT auditor also looks at which system controls are in place to prevent fraud. However, when they do look at processes they typically do it in a very manual way (by looking at the process documentation, interviewing people, and inspecting samples). This is time-consuming and does not guarantee that the actual process problems will be detected.
When auditors use process mining they focus on compliance questions (like segregation of duties and process deviations). The advantage of using process mining is that they can be much faster. Furthermore, they can analyze the full process (not just samples) and, therefore, achieve a higher assurance. They can focus on the deviations (by quickly seeing what goes right) and better identify the true risks for the organization. Finally, the visual representation helps them as well, because in the end they will need to communicate their findings in an audit report.
5. IT Departments
If you look at process mining from the perspective of an IT department, you are mostly concerned about how well the IT systems (or apps, or websites) are working. There can be many different reasons to try to understand how IT systems are actually used. For example, you might want to replace a legacy system. Or you might want to scale back unnecessary customizing to make upgrades easier and save maintenance costs.
More recently, organizations have started to analyze the so-called customer journeys by combining click-stream data from their apps and websites with data from other customer interaction channels. The goal to improve the customer experience is typically at the center of these customer journey process mining analyses.
Customer journey processes are often more complex than, for example, administrative processes. Therefore, it is really important to formulate concrete questions and filter down the data to the subset that relates to your question (see this article for 9 simplification strategies). However, if done right, customer journey analyses can contribute greatly not just to improving the usability of websites and apps, but also to shifting the perspective from ‘How are we doing things?’ to ‘How does the customer experience our service?’ in any process improvement project.
6. Consultants
Process mining fits into many types of consultancy projects. Whether you are helping your client to introduce a new IT system (transformation projects), to build an operational dashboard, or to work more efficiently, in all of these projects you need to understand what the ‘As is’ process looks like.
The most common use case of process mining for consultants is in process improvement projects. As such, the use case is very similar to the one of Process Improvement Teams (see above). But instead of an internal team working with a business unit in the organization, you are coming in as an expert from the outside, bringing with you a fresh perspective and your experience of working with different clients.
Consultants can specialize in many different areas by, for example, focusing on particular industries or IT systems. Furthermore, if you build up your process mining skills, you can help clients to try out or adopt process mining, when they do not have these skills themselves yet.
So, which benefits can process mining bring to you?
To find out, first start learning more about process mining to fully understand how it works and what it can do. Download the process mining software Disco and contact us for an extended evaluation license to explore some of your own data sets.
This is the eleventh article in our series on data quality problems for process mining. You can find an overview of all articles in the series here.
A common and unfortunate process mining scenario goes like this: You present a process problem that you have found in your process mining analysis to a group of process managers. They look at your process map and point out that this can’t be true. You dig into the data and find out that, actually, a data quality problem was the cause for the process pattern that you discovered.
The problem with this scenario is that, even if you then go and fix the data quality problem, the trust that you have lost on the business side can often not be won back. They won’t trust your future results either, because “the data is all wrong”. That’s a pity, because there could have been great opportunities in analyzing and improving this process!
To avoid this, we recommend planning a dedicated data validation session with a process or domain expert before you start the actual analysis phase in your project. To manage expectations, communicate that the purpose of the session is explicitly not yet to analyze the process, but to ensure that the data quality is good before you proceed with the analysis itself.
You can ask both a domain expert and a data expert to participate in the session, but the input of the domain expert is especially needed here: you want to spot problems in the data from the perspective of the process owner for whom you are performing the analysis (you can book a separate meeting with a data expert to walk through your data questions later). Ideally, your domain expert has access to the operational system during the session, so that you can look up individual cases together if needed.
To organize the data validation session with the domain expert, you can do the following:
Start by explaining briefly what process mining is. Show up to a maximum of 5 slides and consider giving a very short demo with a clean and simple example. Unless they have recently participated in a presentation about process mining, you should assume that they either do not know what process mining is at all or only have a vague idea.
Then, restate the purpose of the session and explain that you want to validate the data with them and collect potential issues and questions on the way.
Consider asking them to draw a very simple process map (just boxes and arrows) of the process from their perspective, with up to 7 steps, on a flip chart or whiteboard. This will be useful as a reference point when you are trying to understand the meaning of certain process steps later on in the meeting.
Show them the data in raw format (for example, in Excel) and explain where you got the data and how it was extracted. Point out the Case ID, Activity, and Timestamp columns that you are using.
Then, import the data in front of their eyes and go over the summary information (showing the timeframe of the data, the attributes, etc.). Afterwards, look at the process map and inspect the top variants with them. Look at some example cases and ask them: “Does this make sense to you?”. Write down any issues that they mention.
If you find strange patterns in the process behavior, filter the data to get to some example cases for further context. Simplify the process map if needed (see this article on simplification strategies) and interactively look into the issues that you find together. Try to find answers to questions right in the session if possible and otherwise write them up as an action point.
If you can, look up a few cases in the operational system together (many systems allow you to search by case number, or customer number, and inspect the history of an individual case) and compare them with the case sequences that you find in Disco to see whether they match up as expected.
Of course, you may have already run into questions yourself while going through the data quality checklist before this data validation session. You can go through them with the domain expert to see whether they have some explanations for the problems that you have observed.
You may find that the domain expert brings up questions about the process that are relevant for the analysis itself. This is great and you should write them down, but do not get side-tracked by the analysis and steer the session back to your data quality questions to make sure you achieve the goal of this meeting: To validate the data quality and uncover any issues with the data that might need to be cleaned up.
After the validation session, follow up on all of the discovered data problems and investigate them. Also, keep track of which of your original process questions may be affected by the data quality issues that you found. Document the actions that you have taken, or intend to take, to fix them.
This is the tenth article in our series on data quality problems for process mining. You can find an overview of all articles in the series here.
Last week, we were looking at missing activities and missing timestamps. Today, we will discuss another common data quality problem that most of you will encounter at some point.
Take a look at the following data snippet (you can click on the image to see a larger version). In this data set, you can see three cases (Case ID 1, 2, and 3). If you compare this data set below with a typical process mining data set, you can see the following differences:
There is just one row per case (see case 1 highlighted). Normally, you would have multiple rows: one row for each event in the case.
The activities are in columns (here, activity A, B, C, D and E), with the dates or timestamps recorded in the cell content.
When you encounter such a data set, you will have to re-format it into the process mining format in the following way (see screenshot below):
Add a row for each activity (again, case 1 is highlighted).
Create an activity and a timestamp column to capture the name and the time for each activity.
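The reshaping steps above can be sketched in a few lines of plain Python. The case IDs, activity names, and dates below are made up for illustration; the idea is simply to turn each activity column into its own event row and then sort each case chronologically:

```python
# Hypothetical column-based export: one row per case, one column
# per activity holding that activity's (last) timestamp.
wide_rows = [
    {"Case ID": 1, "A": "2024-01-02", "B": "2024-01-04",
     "C": "2024-01-05", "D": "2024-01-07", "E": "2024-01-09"},
    {"Case ID": 2, "A": "2024-01-03", "B": "2024-01-06",
     "C": "2024-01-08", "D": "2024-01-09", "E": "2024-01-10"},
]

# Turn every non-empty activity cell into its own event row.
events = []
for row in wide_rows:
    for activity in ["A", "B", "C", "D", "E"]:
        timestamp = row.get(activity)
        if timestamp:  # skip activities that never happened
            events.append({"Case ID": row["Case ID"],
                           "Activity": activity,
                           "Timestamp": timestamp})

# Sort each case chronologically, as a process mining tool would
# do on import (ISO date strings sort correctly as text).
events.sort(key=lambda e: (e["Case ID"], e["Timestamp"]))
```

The same reshaping can of course also be done with an ETL tool, SQL, or a spreadsheet; the structure of the result is what matters.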
However, the important thing to realize here is that this is not purely a formatting problem. The column-based format is not suitable to capture event data about your process, because it inherently loses information about activity repetitions.
For example, imagine that after performing process step D the employee realizes that some information is missing. They need to go back to step C to capture the missing information and will only then continue with process step E. The problem with the column-based format as shown in the first data snippet is that there is no place where these two timestamps regarding activity C can be captured. So, what happens in most situations is that the first timestamp of activity C is simply overwritten and only the latest timestamp of activity C is stored.
You might wonder why people store process data in this column-based format in the first place. Typically, you find this kind of data in places where process data has been aggregated. For example, in a data warehouse, BI system, or an Excel report. It’s tempting, because in this format it seems easy to measure process KPIs. For example, do you want to know how long it takes between process step B and E? Simply add a formula in Excel to calculate the difference between the two timestamps.1
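To make this concrete, here is a minimal sketch (with made-up dates) of such a KPI calculation, including the pitfall that arises when the two steps did not occur in the expected order:

```python
from datetime import date

# One row per case, one timestamp per activity (made-up values).
case = {"B": date(2024, 1, 6), "E": date(2024, 1, 10)}

# The "easy" KPI: throughput time from step B to step E.
duration = case["E"] - case["B"]  # 4 days

# But if E was recorded before B (e.g., due to rework), the same
# formula silently yields a negative duration.
odd_case = {"B": date(2024, 1, 12), "E": date(2024, 1, 10)}
negative = odd_case["E"] - odd_case["B"]  # -2 days
```

The subtraction itself never fails; it just produces a misleading number, which is why such column-based KPI sheets can look plausible while hiding real process behavior.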
People often implicitly assume that the process goes through the activities A-E in an orderly fashion. But in reality, processes are complex and messy. As long as the process isn’t fully automated, there is going to be some rework. And by pressing your data into such a column-based format you lose information about the real process.
So what can you do if you encounter your data in such a column-based format?
How to fix:
First of all, you should use the data that you have and transform it into a row-based format as shown above. However, in the analysis you need to be aware of the limitations of the data and know that you may encounter some distortions in the process because of it (see an example below).
If the process is important enough, you might want to go back in the next iteration and find out where the original data that was aggregated in the BI tool or Excel report comes from. For example, it might come from an underlying workflow system. You can then get the full history data from the original system to fully analyze the process with all its repetitions.
To understand what kind of distortions you can encounter, let’s take a look at the following data set, which shows the steps that actually happened in the real process before the data was aggregated into columns. You can see that:
Only case 2 followed the expected path A-E.
In case 1 and in case 3, rework occurred that is lost in the column-based format and, consequently, also in the transformed data set (see blue mark-up).
Now, when you first import the data set that was transformed from the column-based format to the row-based format into Disco, you get the following simplified process map (see below).
The problem is that if a domain expert looked at this process map, they might see some strange and perhaps even impossible process flows due to the distortions from the lost activity repetition timestamps. For example, in the process map above it looks like there was a direct path from activity B to activity D at least once.
However, in reality this never happened. You can see the discovered process map from the real data set (where all the activity repetitions are captured) below. There was never a direct succession of the process steps B and D, because in reality activity C happened in between.
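If you want to verify this kind of distortion yourself, you can compare the directly-follows relations of the two data sets. Here is a minimal sketch in Python; the traces mirror the rework example from above, where activity C is repeated after D:

```python
def directly_follows(log):
    """Return the set of (from, to) activity pairs that directly
    follow each other anywhere in the log."""
    pairs = set()
    for trace in log:
        pairs.update(zip(trace, trace[1:]))
    return pairs

# Real trace with rework: C is repeated after D.
real = [["A", "B", "C", "D", "C", "E"]]

# Column-based storage keeps only the *last* timestamp of C, so
# after sorting by time the same case appears as A, B, D, C, E.
collapsed = [["A", "B", "D", "C", "E"]]

print(("B", "D") in directly_follows(real))       # False
print(("B", "D") in directly_follows(collapsed))  # True
```

The spurious B-to-D path exists only in the collapsed data set, which is exactly the kind of artifact a domain expert would flag as impossible.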
So, use the data that you have but be aware that such distortions can happen and what is causing them.
The process maps above were simplified process maps (see this guide on simplifying complex process models to learn more about the different simplification strategies). If you are curious to see the full details of each map to make sure there was really no path from activity B to activity D, you can find them below:
Left: The full process map that was discovered from the column-based and transformed data set (click on the image to see a larger version)
Right: The full process map for the real process (click on the image to see a larger version).
Another danger of this approach is that if the two steps are not in the expected order, you will actually end up with a negative duration. ↩
This is the ninth article in our series on data quality problems for process mining. You can find an overview of all articles in the series here.
Earlier in this series, we have talked about how missing data can be a problem. We looked at missing events, missing attribute values, and missing case IDs. But what do you do if you have missing activities, or missing timestamps for some activities?
There are two scenarios: missing activities and missing timestamps for some activities.
1. Missing activities
Some activities in your process may not be recorded in the data. For example, there may be manual activities (like a phone call) that people perform at their desk. These activities occur in the process but are not visible in the data.
Of course, the process map that you discover using process mining will not show you these manual activities. What you will see is a path from the activity that happened before the manual activity to the activity that happened after the manual activity.
For example, in the process map below you see the sandbox example in Disco. There is a path from activity Create Request for Quotation to Analyze Request for Quotation. However, it could be that there was actually another activity that took place between these two process steps, which is not visible in the data.
How to fix:
There is not much you can do here. What is important is to be aware that these activities take place although you cannot see them in the data. Process mining cannot be performed without proper domain knowledge about the process you are analyzing. Make sure you talk to the people working in the process to understand what is happening.
You can then take this domain knowledge into account when you interpret your results. For example, in the process above you would know that not all the 21.7 days are actually idle time in the process. Instead, you know that other activities are taking place in between, but you can’t see them in the data. It’s like a blind spot in your process. Typically, with the proper interpretation you are just fine and can complete your analysis based on the data that you have.
However, sometimes the blind spot becomes a problem. For example, you might find that your biggest bottlenecks are in this blind spot and you really need to understand more about what happens there. In this situation, you may choose to go back and collect some manual data about this part of the process either through observation or by asking the employees to document their manual activities for a few weeks. Make sure to record the case ID along with the activities and the timestamps in this endeavor. Afterwards, you can combine the manually collected data with the IT data to analyze the full process, but now with visibility on the blind spot.
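Combining the manually collected data with the IT data can be as simple as concatenating the two event lists and re-sorting by case ID and timestamp. A minimal sketch in Python, where the activity names come from the sandbox example above and the phone call, case IDs, and all timestamps are made up for illustration:

```python
# Events extracted from the IT system: (case ID, activity, timestamp).
system_events = [
    (1, "Create Request for Quotation", "2024-01-02T09:00"),
    (1, "Analyze Request for Quotation", "2024-01-10T14:00"),
]

# Manually collected events, recorded with the same case ID scheme.
manual_events = [
    (1, "Phone call with supplier", "2024-01-05T11:30"),
]

# Concatenate and sort by case ID and timestamp so the manual
# activities slot into the right place within each case
# (ISO timestamp strings sort correctly as text).
full_log = sorted(system_events + manual_events,
                  key=lambda e: (e[0], e[2]))
```

The crucial prerequisite, as mentioned above, is that the case ID is recorded along with the manual activities; otherwise the two sources cannot be matched up.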
2. Missing timestamps for some activities
In a second scenario you actually have information about which activities were performed, but for some of the activities you simply don’t have a timestamp.
For example, in the data snippet from an invoice handling process (see screenshot below – click on image to see a larger version) we can see that in some of the cases an activity Settle dispute with supplier was performed. In contrast to all the other activities, this activity has no timestamp associated with it. It simply might not have been recorded by the system, or the information about this activity comes from a different system.
The problem with a data set where some events have a timestamp and others don’t is that the process mining tool cannot infer the sequence of the activities. Normally, the events are ordered based on the timestamps during the import of the data. So, what can you do?
There are essentially three options.
How to fix:
1. Ignoring the events that have no timestamp. This will allow you to analyze the performance of your process but omit all activities that have no timestamp associated (see example below).
2. Importing your data without a timestamp configuration. This will import all events based on the order of the activities from the original file. You will see all activities in the process map, but you will not be able to analyze the waiting times in the process (see example below).
3. “Borrowing” the timestamps of a neighbouring activity and re-using them for the events that do not have any timestamps (for example, the timestamp of their successor activity). This data pre-processing step will allow you to import all events and include all activities in the process map, while preserving the possibility to analyze the performance of your process as well.
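The third option can be sketched in a few lines of Python. Only the activity ‘Settle dispute with supplier’ comes from the example above; the surrounding activity names and timestamps are made up for illustration:

```python
# Events of one case, in their recorded order; the dispute
# activity has no timestamp (None).
events = [
    {"Activity": "Receive invoice", "Timestamp": "2024-01-02T09:00"},
    {"Activity": "Settle dispute with supplier", "Timestamp": None},
    {"Activity": "Pay invoice", "Timestamp": "2024-01-09T16:00"},
]

# "Borrow" the timestamp of the successor activity so that the
# event can be imported and keeps its position in the sequence.
for i, event in enumerate(events):
    if event["Timestamp"] is None and i + 1 < len(events):
        event["Timestamp"] = events[i + 1]["Timestamp"]
```

Note that the borrowed timestamp is an approximation: the dispute activity will appear to have taken zero time relative to its successor, which you should keep in mind when interpreting waiting times around it.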
Let’s look at what options 1 and 2 look like based on the example above.
First, we can import the data set in the normal way. When the timestamp column is selected, Disco gives you a warning that the timestamp pattern is not matching all rows in the data (see screenshot below). The reason for this mismatch is the empty timestamp fields of the Settle dispute with supplier activity.
When you go ahead and import the data anyway, Disco will import only the events that have a timestamp (and sort them based on the timestamps to determine the event sequence for each case). As a result, you get a process map without the Settle dispute with supplier activity (see screenshot below). You can now fully analyze your process also from the performance perspective, but you have a blind spot (similar to the example scenario discussed at the beginning of the article).
Let’s say we now want to include the Settle dispute with supplier activity in our process map. For example, we would like to visualize how many cases have a dispute in the first place.
To do this, we import the data again but make sure that no column is configured as a Timestamp in the import screen. For example, we can change the configuration of the ‘Complete Timestamp’ column to an Attribute (see screenshot below). As a result, you will see a warning that no timestamp column has been defined, but you can still import the data. Disco will now use the order of the events in the original file to determine the activity sequences for each case. You should only use this option if the activities are already sorted correctly in your data set.
As a result, the Settle dispute with supplier activity is now displayed in the process map (see screenshot below). We can see that 80 out of 412 cases went through a dispute in the process.
We can further analyze the process map along with the variants, the number of steps in the process, etc. However, because we have not imported any timestamps, we will not be able to analyze the performance of the process, for example, the case durations or the waiting times in the process map.
To analyze the process performance, and to keep the activities without timestamps in the process map at the same time, you will have to add timestamps for the events that currently don’t have one in your data preparation.