Deal With Incomplete Cases¶
Before you start with your process mining analysis, you need to assess whether your data is suitable for process mining and check your data for data quality problems (see Detect and Fix Data Quality Problems). Afterwards, one of the next steps is to understand how you can differentiate between complete and incomplete cases in your process.
An ‘incomplete case’ is a case where either the start or the end of the process is missing. There can be different reasons for why a case is incomplete, such as:
- Your data extraction method has retrieved only events in a certain timeframe. For example, let’s say that you have extracted all the process steps that were performed in a particular year. Some cases may have actually started in the previous year (before January). Furthermore, some cases may have started in the year that you are looking at but continued until the next year (after December). In this situation, you will only see the part of these cases that took place in the year that you are analyzing.
- Some cases have not finished yet. Even if you have extracted all the data there is, some of the cases may not have finished yet. This means that, if you are extracting your process mining data today, some of the cases may have started recently and have not yet progressed to the end of the process. They are still “somewhere in the middle”. If you waited a few weeks before extracting your data, then these cases would probably be finished, but then there might be new ones that have just recently started!
- Some cases might never finish. You may have a clear picture of how your process should go. But a customer might not get back to you as you expected, a supplier might never send you the data that was needed to sign them up, or a colleague might close a case in an unexpected phase because an error, a duplicate, or another problem was detected. These cases do not end at any of the expected end points, and they will never reach one no matter how long you wait. The same can be true for the start points.
Looking for incomplete cases is a standard step that you should always take before you dive into your actual process mining analysis. In this chapter, we will give you clear guidelines for how to deal with incomplete cases.
Why Incomplete Cases Can Be Problematic¶
At first, it might not be obvious why incomplete cases are a problem in the first place. This is what the data shows, so my process mining analysis should show what actually happened, right?
Wrong. At least as far as incomplete cases are concerned: If your data has incomplete cases because of Reason No. 1 or Reason No. 2 (see above), then these missing start or end points do not reflect the actual process. They are an artifact of the way that the data was collected.
Take a look at the customer refund process picture below: The dashed lines leading to the endpoint (the square symbol at the bottom of the process map) indicate which activities happened as the very last step in the process. For example, for 333 cases ‘Order completed’ was the very last step that was recorded - See (1) in Figure 1. This seems to be a plausible end point for the process. However, there were also 20 cases for which the activity ‘Invoice modified’ was the very last step that was observed - See (2) in Figure 1. This does not seem like an actual end point of the process, does it?
If we look up an example case that ends with ‘Invoice modified’ (see Figure 2), then we can see that the ‘Invoice modified’ step indeed happened just before the end of the data set. It occurred on 20 January 2012 and the data set ends on 23 January 2012. What if we had data until June 2012? Would there have been any steps after ‘Invoice modified’ then?
So, we can see that not all end points in the data necessarily need to be meaningful endpoints in the process. Some cases can be incomplete, just because we are missing the end or the beginning of what actually happened, either because of how the data was extracted or because we don’t know yet what is going to happen with cases that are still ongoing. When you look at your process map, or the variants, for a data set that includes incomplete cases then the map and the variants do not show you the actual start and end points in your process but the start and end points in your data.
Another problem with incomplete cases is that their case duration can be misleading. The process mining tool does not know which cases are finished and which are incomplete. Therefore, it always calculates the case duration as the time between the very first and the very last event in the case.
As a result, the case durations of incomplete cases appear shorter in the process mining tool than the actual throughput times of these cases. Let’s take a look at another example case in the process to understand what this means (see Figure 3). Case72 seems to be very fast: There were just two steps in the process so far (‘Order created’ and ‘Missing documents requested’) and it took just 3 minutes.
However, when you consider that ‘Missing documents requested’ is not the actual end point of this process (we are just in an intermediate state, waiting for the customer to send us some additional information) and we look at the timeline of where this case sits, then we can see that this case has been open for more than 1 month. So, the true throughput time of this case (so far) should be at least 1 month and 3 minutes!
If you simply leave incomplete cases in your data set, then calculations like the average or median case duration in the statistics view of your process are influenced by these shorter durations. So, incomplete cases distort not only the process map and the variants but also your performance measurements.
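The difference between the recorded case duration and the time a case has actually been open can be illustrated with a small Python sketch. It is not part of Disco. The event tuples and the timestamps of Case72 are assumptions made up for illustration; only the 3-minute gap and the data set end date (23 January 2012) follow the text above.

```python
from datetime import datetime

# Hypothetical event log: (case ID, activity, timestamp).
# The Case72 timestamps are invented for this example.
events = [
    ("Case72", "Order created", datetime(2011, 12, 20, 10, 0)),
    ("Case72", "Missing documents requested", datetime(2011, 12, 20, 10, 3)),
]
data_end = datetime(2012, 1, 23, 0, 0)  # end of the data set


def case_duration(events, case_id):
    """Time between the very first and the very last event of a case."""
    stamps = [ts for cid, _, ts in events if cid == case_id]
    return max(stamps) - min(stamps)


def open_time(events, case_id, data_end):
    """Time between the first event of a case and the end of the data set."""
    stamps = [ts for cid, _, ts in events if cid == case_id]
    return data_end - min(stamps)


print(case_duration(events, "Case72"))        # 0:03:00
print(open_time(events, "Case72", data_end))  # 33 days, 14:00:00
```

The first number is what a duration statistic would report for this case. The second is a lower bound for its true throughput time, because the case is still waiting for the customer.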
Therefore, you need to investigate incomplete cases in your data before you start with your actual analysis. You want to understand what kind of incomplete cases you have and how many there are. Then, you want to remove them from your data set before you analyze your process in more detail. You can do all this right in Disco and in the remainder of this guide we will show you how to do it.
Finally, some data sets may be extracted in such a way that there are no incomplete cases in it. For example, you may have received a data set from your IT department that only contains closed orders. So, any orders that are still open do not show up in your data.
In this situation, you don’t need to remove incomplete cases anymore. However, you should realize that you do not have visibility into how representative your data set is with respect to the whole population of orders. Normally, seeing how many cases remain after removing the incomplete cases is an important step, and here you cannot take it. Be aware of this limitation and consider requesting the set of open cases from the same period in addition to your current data set, so that you can check them and make sure you get the full picture.
How To Determine The Start and End Points For Your Process¶
Once you start analyzing your data set for incomplete cases, you need to determine what the expected start and end points in your process are. Typically, you do this by looking at which activities appear to be the last step in the process (look at the dashed lines in your process map) and by using your domain knowledge about the process.
In the refund process, we have already identified one possible regular endpoint in the activity ‘Order completed’. But are there other regular end points as well? For example, by digging deeper in the data we find that there is another activity ‘Cancelled’ that also appears as the last step in the process. From the name ‘Cancelled’ we can guess what this step means (the processing of the refund order has been stopped). The question is whether we consider ‘Cancelled’ a regular end point in the process, or whether we would rather remove cancelled cases from our process analysis?
The answer to this question depends on the questions that you want to answer in your process mining analysis. Furthermore, you typically need domain knowledge to definitively clarify how the process end points should be interpreted. It is fine for you as the process analyst to take some initial guesses, but it is critical that you document your assumptions along the way and verify them with a domain expert later on (see Data Validation Session).
If you have no idea at all which activities could be candidates for a start or end point in your process, there are two tricks you can try out to see if they help:
- Work from the process map and click on one of the dashed lines leading to the endpoint (see Figure 4). If the case frequency is the same as the end frequency (or very close) then this is a hint that the activity might be an end point in the process, because there is never anything happening afterwards. The same can be done with the start activities by clicking on the dashed lines leading from the start point.
To investigate some example cases with a particular end point in more detail, click on the shortcut ‘Filter for this end activity…’ and apply the pre-configured Endpoints Filter that Disco has added.
- If you should decide that this activity is a regular end point in the process, remove the filter again from the filter stack, apply the updated filter settings, and continue looking at the next dashed line in the process map.
- If you should decide that cases that end with this activity are incomplete, invert the selection of the Endpoints filter and apply it to remove all cases that end there. Then, continue looking at the remaining data set and click on the next dashed line in the process map.
By gradually removing end points that you consider incomplete, more and more end points that are currently hidden due to the low ‘Paths’ slider will appear until you have investigated all endpoints (keep pulling up the ‘Paths’ slider until you have seen them all) and have decided which to keep and which to remove.
- The second trick only works if you have data covering a large enough timeframe compared to the case durations in your data set. But if you do, try to apply a Timeframe Filter before investigating the start and end points as described above in the following way:
To investigate the process endpoints, add a Timeframe Filter and cover the first half of the timeframe (see Figure 5). As a result, only cases where there has been no further activity for the latter half of the time of your data set remain. Therefore, the end activities that are revealed through the dashed lines leading to the end point in the process map are much more likely to be actual endpoints in the process. In a way, you can think of it as having excluded those cases that just performed some kind of intermediary step yesterday, or a few days before the end of the data set.
To investigate the process startpoints, you can do the same but configure the Timeframe filter in such a way that it covers only the latter half of the timeline. This way, start points that emerge only because cases have been started shortly before the start of the data set timeframe will be excluded.
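Both tricks can also be expressed as simple computations on an event log. The following Python sketch works outside of Disco on a made-up mini log; the case IDs, timestamps, and function names are assumptions for illustration. It assumes that the timeframe restriction keeps only cases that lie completely within the first half of the data set timeframe.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical mini event log: (case ID, activity, timestamp).
events = [
    ("A", "Order created", datetime(2012, 1, 2)),
    ("A", "Order completed", datetime(2012, 1, 5)),
    ("B", "Order created", datetime(2012, 1, 3)),
    ("B", "Invoice modified", datetime(2012, 1, 4)),
    ("B", "Order completed", datetime(2012, 1, 9)),
    ("C", "Order created", datetime(2012, 1, 18)),
    ("C", "Invoice modified", datetime(2012, 1, 20)),
]


def group_cases(events):
    """Group events by case ID and sort each case by timestamp."""
    cases = defaultdict(list)
    for case_id, activity, ts in events:
        cases[case_id].append((ts, activity))
    return {cid: sorted(evts) for cid, evts in cases.items()}


def end_ratios(cases):
    """Trick 1: end frequency divided by case frequency per end activity."""
    end_freq = Counter(evts[-1][1] for evts in cases.values())
    case_freq = Counter(
        act for evts in cases.values() for act in {a for _, a in evts}
    )
    return {act: end_freq[act] / case_freq[act] for act in end_freq}


def first_half_only(cases):
    """Trick 2: keep cases with no activity in the latter half of the timeframe."""
    stamps = [ts for evts in cases.values() for ts, _ in evts]
    midpoint = min(stamps) + (max(stamps) - min(stamps)) / 2
    return {cid: evts for cid, evts in cases.items() if evts[-1][0] <= midpoint}


cases = group_cases(events)
print(end_ratios(cases))               # {'Order completed': 1.0, 'Invoice modified': 0.5}
print(sorted(first_half_only(cases)))  # ['A', 'B']
```

A ratio of 1.0 means that nothing ever happens after the activity, which hints at a regular end point. Case C, which ends with ‘Invoice modified’ shortly before the end of the data, drops out once only the first half of the timeframe is considered.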
The Different Meanings of “Finished”¶
Once you have determined what your startpoints and what your endpoints are, you still need to think about what “finished” or “completed” actually means for your process.
Multiple interpretations are possible and the differences can be subtle, but you will need to use different filters depending on the meaning that you want to apply. The results will be different and you need to be clear about which meaning is right for your data set.
Here are four examples for how you can filter incomplete cases. It’s not that any of these are better or more appropriate than others in general. Instead, it depends on your process and on the meaning of “finished” that you want to choose.
Perhaps the most common meaning of “finished” is to look at which activities have occurred as the very last activity (for end points) or as the very first activity (for start points) in a case.
This corresponds to the dashed lines that you see in the process map and you can use the Endpoints Filter in Discard cases mode to filter all cases that start or end with a particular set of activities (see Figure 6).
When you add this filter, only the activities that occurred as the very first event in any of the cases are shown in the ‘Start event values’ on the left and only activities that occurred as the very last event in any of the cases are shown in the ‘End event values’ on the right.
You can then select only the regular start and end activities that you have identified in the previous step to focus on your completed cases. For example, if we only select the ‘Order completed’ activity as a regular end point for our refund process, then the remaining data set will only contain the 333 cases that actually ended with ‘Order completed’. If you use the shortcut ‘Filter for this start/end activity’ after clicking on a dashed line in the process map, Disco will automatically add a pre-configured Endpoints filter to your data set.
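Conceptually, this selection keeps a case only if its first and last recorded activities belong to the chosen sets. The Python sketch below shows that logic on a made-up log. It is an illustration under stated assumptions and not Disco's implementation; the function name, the case IDs, and the integer positions used instead of timestamps are hypothetical.

```python
from collections import defaultdict


def endpoints_filter(events, start_activities=None, end_activities=None):
    """Keep cases whose first/last activity is in the given sets.

    events: iterable of (case ID, activity, position); None means that
    the corresponding side is not restricted.
    """
    cases = defaultdict(list)
    for case_id, activity, position in events:
        cases[case_id].append((position, activity))
    kept = {}
    for case_id, evts in cases.items():
        evts.sort()
        first, last = evts[0][1], evts[-1][1]
        if start_activities is not None and first not in start_activities:
            continue
        if end_activities is not None and last not in end_activities:
            continue
        kept[case_id] = [activity for _, activity in evts]
    return kept


# Hypothetical log: case 1 ends regularly, case 2 ends in an intermediate step.
events = [
    ("1", "Order created", 1),
    ("1", "Order completed", 2),
    ("2", "Order created", 1),
    ("2", "Invoice modified", 2),
]
completed = endpoints_filter(events, end_activities={"Order completed"})
print(completed)  # {'1': ['Order created', 'Order completed']}
```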
To use your filtered data set as the new reference point for your further analysis, you can enable the checkbox ‘Apply filters permanently’ after pressing the ‘Copy and filter’ button (see also Applying Filters). The outcome of applying the filter will be the same (the same 333 cases remain), but the applied filter will be consolidated in a new data set, so that successive analyses use this new baseline as the new 100% of cases.
Sometimes, the very last activity that happened in a case is not the best way to determine whether a case has been completed or not.
For example, after completing an order there might be back-end activities such as archiving or other documentation steps that occur later. In these cases, ‘Order completed’ will not be the very last step in the process (so, the case would not be picked up if you use the Endpoints filter).
If you mainly care whether one or more milestone activities that indicate the completion of your process have occurred, you can use the Attribute Filter in Mandatory mode (see Figure 7). This way, you keep all cases where any of the selected activities has happened, regardless of whether they were the very last step in the process or whether other activities were recorded afterwards.
Instead of manually adding this filter, you can also use the shortcut Filter this activity after clicking on the activity in the process map. Disco will automatically add a pre-configured Attribute Filter in Mandatory mode to your data set with the right activity already selected.
If we apply this meaning of “finished” based on the milestone activity ‘Order completed’ for the refund process, we get a slightly different outcome compared to the Endpoints Filter before. Instead of 333 cases, there now remain 334 cases after applying the filter and we can see that the additional case ended with the activity ‘Warehouse’ (see Figure 8).
If we now click on this dashed line leading from the ‘Warehouse’ activity and use the short-cut to investigate this case in more detail, we can see in the history of the case that the activity ‘Order completed’ did indeed occur. However, it occurred in the middle of the process after the order was initially rejected. Then, the case got picked up again and the refund was actually granted (see Figure 9).
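The difference between the two meanings of “finished” can be reproduced on a few made-up traces. The Python sketch below is an illustration only: the case IDs and the exact activity sequences are assumptions, and the two functions merely mimic the idea of a mandatory-activity selection and an end-activity selection.

```python
# Hypothetical traces (activities in order of occurrence), invented for this example.
traces = {
    "X": ["Order created", "Order completed"],
    "Y": ["Order created", "Order completed", "Refund issued", "Warehouse"],
    "Z": ["Order created", "Invoice modified"],
}
MILESTONES = {"Order completed"}


def mandatory(traces, activities):
    """Keep cases in which any of the given activities occurs anywhere."""
    return {cid: t for cid, t in traces.items() if any(a in activities for a in t)}


def ends_with(traces, activities):
    """Keep cases whose very last activity is one of the given activities."""
    return {cid: t for cid, t in traces.items() if t[-1] in activities}


by_milestone = mandatory(traces, MILESTONES)
by_endpoint = ends_with(traces, MILESTONES)
print(sorted(by_milestone))                          # ['X', 'Y']
print(sorted(by_endpoint))                           # ['X']
print(sorted(set(by_milestone) - set(by_endpoint)))  # ['Y']
```

Case Y plays the role of the additional case: it contains ‘Order completed’ in the middle of the process but ends with ‘Warehouse’.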
In another scenario, you might be analyzing the refund process from a customer perspective: This is a process that the customers of an electronics manufacturer go through after the product that they purchased was broken and they now want to get their money back. So, from the customer’s point of view the process is “finished” as soon as they have received their refund.
To analyze the data from this perspective, we can focus on the three payment activities ‘Payment issued’, ‘Refund issued’ and ‘Special Refund issued’ (see Figure 10).
If we search for these activities in the process map, then we can see that there are several activities that happen afterwards. Sometimes, the delays in the back-end processing can be quite long (for example, 7.5 days on average after the ‘Payment issued’ step), but from the customer’s perspective this delay is not relevant.
So, to focus our analysis on the part of the process that is relevant for the customer, we can use the Endpoints Filter in Trim longest mode (see Figure 11).
When we change the Endpoints Filter mode from Discard cases to Trim longest, then all of the activities become available as ‘Start event values’ on the left and as ‘End event values’ on the right. We can now select only the three payment activities as the customer endpoints in our process.
As a result, everything that happened after any of these three payment activities is cut off. We can see that the customer payments now appear as the endpoints in our process map (see Figure 12).
The cases that remain in the data set after applying the filter are the same ones as if we had used the Attribute Filter in Mandatory mode. But cutting off all activities after the payments enables us to focus our process analysis on the part of the process that is relevant from the customer’s perspective:
- The process map does not show the back-end activities after the payments anymore, so our bottleneck analysis (see also Analyze SLAs and Bottlenecks) will point us to the right places in the map that we should focus on.
- The case durations in the statistics views are only shown for the times from the creation of the refund order until the time that the customer has received their money back.
- The variants now only show us the process scenarios from the time of the order creation until the payment activities, so they are more meaningful for this perspective.
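The trimming idea can be sketched in a few lines of Python. This is a simplified illustration, not Disco's implementation. It assumes that trimming keeps everything from the first selected start activity up to the last selected end activity and drops cases without a matching endpoint; the function name and the example traces are hypothetical.

```python
PAYMENTS = {"Payment issued", "Refund issued", "Special Refund issued"}


def trim_longest(trace, start_activities=None, end_activities=None):
    """Cut a trace to the span from the first start to the last end activity.

    None means that the corresponding side is not restricted. Returns None
    if the trace has no matching start or end activity (the case is dropped).
    """
    starts = [i for i, a in enumerate(trace)
              if start_activities is None or a in start_activities]
    ends = [i for i, a in enumerate(trace)
            if end_activities is None or a in end_activities]
    if not starts or not ends or starts[0] > ends[-1]:
        return None
    return trace[starts[0]:ends[-1] + 1]


# Hypothetical traces: back-end steps after the payment are cut off.
paid = ["Order created", "Payment issued", "Order completed", "Warehouse"]
unpaid = ["Order created", "Cancelled"]

print(trim_longest(paid, end_activities=PAYMENTS))    # ['Order created', 'Payment issued']
print(trim_longest(unpaid, end_activities=PAYMENTS))  # None
```

Because the trimmed trace ends at the payment, the case duration, the variants, and the map derived from it only cover the customer-relevant part of the process.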
Open for longer than X¶
There might be activities in your process that can be considered an endpoint if there has been a certain period of inactivity afterwards (see also Reason No. 3 at the beginning of this chapter). For example, we can request missing information (like the purchase receipt) from a customer to handle their refund order but the customer might not get back to us.
If we want to focus on cases where the activity ‘Missing documents requested’ was the last step in the process but nothing has happened for a month, we can use a combination of filters in the following way.
First, we add an Endpoints filter as shown in Figure 13.
Then, we add a second filter by clicking the ‘click to add filter…’ button again and we add a Timeframe Filter on top of it (see Figure 14).
By adapting the selected timeframe in such a way that the past month is not covered, we will only keep those cases that did end with ‘Missing documents requested’ and where that last step took place more than one month ago.
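The combination of the two filters boils down to two conditions per case: the last activity must be the chosen one, and the last event must be older than the cutoff. The Python sketch below illustrates this outside of Disco. The case IDs and timestamps are made up, and one month is approximated as 30 days, which is an assumption of this example.

```python
from datetime import datetime, timedelta


def stuck_cases(last_events, end_activity, data_end, min_idle):
    """Return case IDs that end with end_activity and have been idle for more than min_idle.

    last_events: dict mapping case ID -> (last activity, timestamp of last event).
    """
    cutoff = data_end - min_idle
    return sorted(
        cid for cid, (activity, ts) in last_events.items()
        if activity == end_activity and ts < cutoff
    )


# Hypothetical last events per case; the data set ends on 23 January 2012.
last_events = {
    "Case72": ("Missing documents requested", datetime(2011, 12, 20, 10, 3)),
    "CaseA": ("Missing documents requested", datetime(2012, 1, 15, 9, 0)),
    "CaseB": ("Order completed", datetime(2011, 11, 1, 12, 0)),
}
data_end = datetime(2012, 1, 23)

result = stuck_cases(
    last_events, "Missing documents requested", data_end, timedelta(days=30)
)
print(result)  # ['Case72']
```

CaseA ends with the same activity but was active within the past month, so it may still be waiting for a regular customer response and is not selected.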
When Incomplete Cases Shouldn’t Be Removed¶
There are also situations in which you should not remove incomplete cases from your data set. Here are two examples:
- Many compliance questions like the check for segregation of duties (see Segregation of Duties Analysis) can be best verified on the full data set. If you have a compliance rule that should be followed in a certain part of the process, then this should also be true for cases that have not reached the end of the process yet. So, by focusing your compliance analysis on only the completed cases you might unnecessarily limit your analysis.
For example, in the refund process customers should only receive their payment after they have returned the broken product to the manufacturer. The refund order does not need to have reached the state ‘Order completed’ for this compliance rule to hold. So, you can best perform the analysis on the full data set to make sure you catch all the deviations.
Be careful, however, to understand what the pre-conditions for the compliance rule are and filter your data set in such a way that the pre-conditions are met. For example, if your purchasing process requires that an order needs to be approved again after a change was made, then you might not have seen the approval step yet but it could still happen if the case is still open. So, you can think about at which milestone activity the process rule definitely should hold (for example, before the invoice is paid) and filter your data set accordingly before starting the compliance analysis.
- For some analysis questions, you actually want to focus on the incomplete cases. For example, you might want to analyze where open cases are currently stuck, how long they have been stuck there, and how long they have been open in total (see also How to Analyze Open Cases).
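The compliance example from the first point can be sketched as a check that also covers open cases. This Python snippet is an illustration under stated assumptions: the activity name ‘Product returned’ is hypothetical (the text does not name the return step), and the traces are made up. The pre-condition is that a payment has occurred; cases without a payment are not evaluated.

```python
PAYMENTS = {"Payment issued", "Refund issued", "Special Refund issued"}
RETURN = "Product returned"  # hypothetical activity name, not taken from the data set


def violates_return_rule(trace):
    """True if a payment occurs before (or without) the product return."""
    for activity in trace:
        if activity == RETURN:
            return False  # return came first: rule satisfied
        if activity in PAYMENTS:
            return True   # payment came first: deviation
    return False          # no payment yet: pre-condition not met


# Hypothetical traces; none of them has reached 'Order completed'.
traces = {
    "open_ok": ["Order created", "Product returned", "Payment issued"],
    "open_violation": ["Order created", "Payment issued"],
    "open_no_payment": ["Order created", "Missing documents requested"],
}
violations = sorted(cid for cid, t in traces.items() if violates_return_rule(t))
print(violations)  # ['open_violation']
```

Restricting the analysis to completed cases would have hidden the deviation in this example, because the violating case is still open.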
Finally, do not forget to assess the representativeness of your data set after you have removed your incomplete cases. For example, if it appears that 80% of your cases are incomplete then it would be very dangerous to base your process analysis on the remaining 20%!
If you do not have enough completed cases in your data set, you may need to go back and request a larger data sample from a longer time period to be able to get representative results.