Are you getting ready for this year’s Process Mining Camp? If you haven’t registered yet, make sure to secure your ticket for 10 June. The early bird tickets were gone within less than five days, so be quick!
To get us all into the proper camp spirit, we will be releasing the videos from last year’s camp over the next weeks. The first speaker at Process Mining Camp 2015 was Léonard Studer from the City of Lausanne. As a process analyst, Léonard helps people at the municipality to better organize their work.
At camp, Léonard told us about a project in which they analyzed a complex construction permit process. Construction permit processes are notoriously complicated because so many parties and rules are involved. For example, the City of Lausanne is regulated by 27 different laws from Swiss federal law, cantonal law, and communal regulation.
In his presentation, Léonard did not only tell us all about the project and what came out of it, but he also did a deep dive into the overall approach, the enormous data challenges they were facing, and the tools that he used to resolve them. He gave an honest talk with lots of practical details. In his introduction, he puts it best:
Process mining itself is not a problem anymore. When I do process mining live in front of people they believe that they can do it themselves. What is difficult now is to get all these little things around your process mining project arranged correctly. This is a talk I will give you without any shame, I will not be blowing any smoke, I will not be bragging. I just want to tell you what I really did. I will also give you some tricks around process mining that may be useful to you.
In the previous article on wrong timestamp configurations we have seen how timestamp problems can influence the process flows and the process variants. One reason why timestamps can cause problems is that they are not sufficiently different. For example, if you only have a date (and no time) then it may easily happen that two activities within the same case happen on the same day. As a result you don’t know in which order they actually happened!
Take a look at the following example: We can see a simple document signing process with four activities and three cases.
The order of the rows in each case is arbitrary. When importing this data set, the sequence of events is determined based on the timestamps. For example, the sequence of the steps ‘Created’ and ‘Sent to Customer’ for case 1 is reversed (compared to the original file), because the dates reflect that the two steps have happened in the opposite order (see screenshot below).
However, if two activities happen at the same time (on the same day in this example), then Disco does not know in which order they actually occurred. So, it keeps the order in which they appear in the original file. Because the order of the activities in the example file is random, this creates some additional variation in the process map (and in the variants) that should not be there.
For example, the three cases in the above example come from a purely sequential process. However, because sometimes multiple steps happen on the same day, and the order between them is arbitrary, you can see some additional interleavings in the process map. They reflect the different orderings of the same timestamp activities in the file (see screenshot below).
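To make the effect concrete, here is a minimal Python sketch of ordering events by a date-only timestamp. The rows and dates are made up for illustration, and the code only mimics the behavior described above (ties keep their file order). It is not how Disco is implemented.

```python
from datetime import date

# Hypothetical rows (case, activity, date) in arbitrary file order.
rows = [
    ("1", "Sent to Customer", date(2016, 3, 2)),
    ("1", "Created", date(2016, 3, 1)),
    ("1", "Document Signed", date(2016, 3, 3)),
    ("1", "Response Received", date(2016, 3, 3)),
]

def order_events(rows):
    # sorted() is stable: rows with the same (case, date) keep their file order.
    return sorted(rows, key=lambda r: (r[0], r[2]))

trace = [activity for _, activity, _ in order_events(rows)]
print(trace)
```

The first two steps are put into the right order by their dates. The last two share a date, so they stay in whatever order the file happened to have, which here puts ‘Document Signed’ before ‘Response Received’.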
So, if you don’t have sufficiently fine-granular timestamps to determine the order of all activities, or if you have many steps in your process occurring exactly at the same time, it often creates more complexity than is already there. What can you do to distinguish the real process complexity from the one just caused by the same timestamp problem?
How to fix: You can either leave out events that have the same timestamps by choosing a “representative” event (see strategy 1 below), or you can try pre-sorting the data (see strategies 2-4 below) to reduce the variation that is caused by the same timestamp activities.
Strategy 1: “Representative” (Leaving out events)
The reason for ‘Same Timestamp’ activities is not always an insufficient level of granularity in the timestamp pattern. Sometimes, it is simply a fact that many events are logged at the same time.
Imagine, for example, a workflow system in a municipality, where the service employee types in the new address of a citizen who moved to a new apartment. After the street, street number, postal code, city, etc., fields in the screen have been filled, they press ‘Next’ to finalize the registration change and print the receipt.
In the history log of the workflow system, you will most likely see individual records of the changes to each of these fields (for example, a record of the ‘Old value’ and the ‘New value’ of the ‘Street’ attribute). However, all of them may have the same timestamp, which is the time when the employee pressed the ‘Next’ button and the data field changes were all finalized (at once).
Below, you can see another example of a highly automated process. Many steps happen at the same time.
However, you may not need all of these detailed events and can choose one of them to represent the whole subsequence. For example, in the case below the first of the four highlighted events could stand for the sequence of four. You can deselect the other steps via the ‘Keep selected’ option in the Attribute filter.
In general, focusing on just a few – the most relevant – milestone activities is one of the most effective methods to trim down the data set to more meaningful variants if you have too many. See also Strategy No. 9 in this article about How to Manage Complexity in Process Mining.
Strategy 2: Sorting based on sequence number
Sometimes you actually have information about in which sequence the activities occurred in some kind of sequence number attribute. This is great, because you can now sort your data set based on the sequence number (see below) and avoid the whole Same Timestamp Activities problem altogether.
Because Disco uses the sequence of the activities in your original file for events that have the same timestamp, this pre-sorting step will influence the order in which the variants and the process flows are formed and, therefore, fix the random order of the Same Timestamp Activities.
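If you prepare your file with a script rather than in a spreadsheet, the pre-sorting could look like this sketch. The column names (`case`, `activity`, `date`, `seq`) and the sample rows are assumptions for illustration. Your data will use different ones.

```python
import csv
import io

# Hypothetical extract with a sequence number column.
raw = """case,activity,date,seq
1,Response Received,2016-03-03,3
1,Created,2016-03-01,1
1,Document Signed,2016-03-03,4
1,Sent to Customer,2016-03-01,2
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Sort by case and by the numeric value of the sequence number.
rows.sort(key=lambda r: (r["case"], int(r["seq"])))

# Write the pre-sorted rows back out as CSV text, ready for import.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["case", "activity", "date", "seq"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Converting the sequence number with `int()` matters: sorted as text, "10" would come before "2".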
Strategy 3: Sorting based on activity name
Of course you don’t always have a sequence number that you can rely on for sorting the data. So what else can you do?
Another approach that often helps is to pre-sort the data simply based on the activity name. The idea is that at least the activities that have the same timestamp (and are sometimes in this and sometimes in that order) are now always in the same order, even if the order itself does not make much sense.
This is easy to do: Simply sort the data based on your activity column before importing it. However, sometimes this strategy can also backfire, because you may – accidentally – introduce wrong orders in same timestamp activities that by coincidence were fine before.
For example, consider the outcome of sorting the data based on activity name for the document signing process above:
It has helped to reduce the variation in the beginning of the process, but at the same time it has introduced a reverse order for the activities ‘Document Signed’ and ‘Response Received’ for case 3 (which have the same timestamps but were in the right order by coincidence in the original file).
Strategy 4: Sorting based on ideal sequence
To influence the order of the Same Timestamp Activities in the “right” way, you can analyze those process sequences in your data that are formed by actual differences in the timestamp. You can also talk to a domain expert to help you understand what the ideal sequence of the process would be.
For example, if you look at case 2 in the document signing process, then you can see that the sequence is fully determined by different timestamps (see screenshot below).
We are now going to use this ideal sequence to influence the sorting of the original data. One simple way to do it is to prefix the activity names with a sequence number reflecting their place in the ideal sequence (i.e., ‘1 – Created’, ‘2 – Sent to Customer’, ‘3 – Response Received’, and ‘4 – Document Signed’) by using Find and Replace.
After adding the sequence numbers, you can simply sort the original data by the activity column (see below).
This will bring the activities in the ideal sequence. When you now import the data in Disco, you should only see deviations from the ideal sequence if the timestamps actually reflect that.
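The same prefix-and-sort step can also be scripted instead of using Find and Replace. The following is a minimal sketch, assuming the four activity names from the document signing example. The rows and dates are made up.

```python
# Ideal sequence as determined from the data or from a domain expert.
IDEAL = ["Created", "Sent to Customer", "Response Received", "Document Signed"]
RANK = {name: position for position, name in enumerate(IDEAL, start=1)}

def prefix_and_sort(rows):
    """rows: list of (case, activity, date) tuples in arbitrary order."""
    prefixed = [(case, f"{RANK[activity]} – {activity}", day)
                for case, activity, day in rows]
    # Sort by the prefixed activity name only.
    return sorted(prefixed, key=lambda r: r[1])

# Hypothetical case in which two pairs of steps share a date.
rows = [
    ("3", "Document Signed", "2016-03-05"),
    ("3", "Response Received", "2016-03-05"),
    ("3", "Sent to Customer", "2016-03-04"),
    ("3", "Created", "2016-03-04"),
]
ordered = [activity for _, activity, _ in prefix_and_sort(rows)]
print(ordered)
```

With ten or more activities, a plain text sort would place ‘10’ before ‘2’, so in that case you would zero-pad the prefix (‘01’, ‘02’, and so on).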
In less than two months we all come together for this year’s Process Mining Camp in Eindhoven. Right now we are busy working with a number of speakers to bring you the most interesting and inspiring talks at this year’s camp.
Like our camp audience, our speakers come to camp from all over the world. They come from the biggest companies out there, and from smaller organizations, and they apply process mining in a wide spectrum of use cases and roles. What they have in common is, they have a great story to tell. This year’s program is almost finalized, and we can’t wait to share it with you very soon.
In the article on Zero timestamps we have seen how timestamp problems can lead to faulty case durations. But faulty timestamps do not only influence the case durations. They also impact the variants and the process maps themselves, because the order of the activities is derived based on the timestamps.
For example, take a look at the following data set with just one faulty timestamp. There is one case with a 1970 timestamp (see screenshot below). As a result, the ‘Create case’ activity is positioned before the ‘Import forms’ activity.
If we look at the process map, then you see that in all other 456 cases the process flows the other way. Clearly, the reverse sequence is caused by the 1970 timestamp.
And if we look at the average waiting times in the process map, then this one faulty timestamp creates further problems and shows a huge delay of 43 years.
As you can see, data quality problems due to timestamp issues can distort your process mining analysis in many different places. Therefore, it is important to carefully assess the process map and the variants, if possible together with a domain expert, to spot any suspicious orderings of activities.
If you have found a problem with the timestamps, then there can be different reasons for why this is happening. Zero timestamps are just one possible reason. Here is the next one: Wrong timestamp configuration during import.
Wrong Timestamp Pattern Configuration
When you import a CSV or Excel file into Disco, the timestamp pattern is normally detected automatically. You don’t have to do anything. If it is not automatically detected, Disco lets you specify how the timestamp pattern should be interpreted rather than forcing you to convert your source data into a fixed timestamp format. And you can even work with different timestamp patterns in your data set.
However, if you have found that activities show up in the wrong order, or if you find that your process map looks weird and does not really show the expected process, then it is worth verifying that the timestamps are correctly configured during import.
You can do that by going back to the import screen: Either click on the ‘Reload’ button from the project view or import your data again. Then, select the timestamp column and press the ‘Pattern…’ button in the top-right corner. You will see a few original timestamps as they are in your file (on the left side) and a preview of how Disco interprets them (in green, on the right side).
Check in the green column whether the timestamps are interpreted correctly. Pay attention to the lower and upper case of the letters in the pattern, because it makes a difference. For example, the lower case ‘m’ stands for minutes while the upper case ‘M’ stands for months.
How to fix: If you find that the preview does not pick up the timestamps correctly, configure the correct pattern for your timestamp column in the import screen. You can empty the ‘Pattern’ field and start typing the pattern that matches the timestamps in your data set (use the legend on the right, and for more advanced patterns see the Java date pattern reference for the precise notation and further examples). The green preview will be updated while you type, so that you can check whether the timestamps are now interpreted correctly. Then, press the ‘Use Pattern’ button.
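To see why a wrong pattern matters, here is a small Python illustration with a made-up timestamp. Note that Python’s `strptime` directives are not the Java-style pattern letters that Disco uses (in Python, `%m` is the month and `%M` the minute), so this is only an analogy for the effect.

```python
from datetime import datetime

raw = "03/04/2016 10:05"  # Is this 3 April or 4 March?

# The same text parsed with two different patterns.
as_day_first = datetime.strptime(raw, "%d/%m/%Y %H:%M")
as_month_first = datetime.strptime(raw, "%m/%d/%Y %H:%M")

print(as_day_first.isoformat())    # 2016-04-03T10:05:00
print(as_month_first.isoformat())  # 2016-03-04T10:05:00
```

Both patterns parse without an error, but the resulting dates are a month apart. This is why it is worth checking the preview against a few timestamps whose meaning you know.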
Wrong Timestamp Column Configuration
Another timestamp problem that can result from mistakes during the import step is that you may have accidentally configured some columns as a timestamp that are not actually a timestamp column in the sense of a process mining timestamp (but, for example, indicate the birthday of the customer).
In the customer service refund example below, the purchase date in the data has the form of a timestamp. However, this is a date that does not change over time and should actually be treated as an attribute. You can see that both the ‘Complete Timestamp’ and the ‘Purchase Date’ column have the little clock symbol in the header, which indicates that currently both are configured as a timestamp.
If columns are wrongly configured as a timestamp, Disco will use them to calculate the duration of the activity. As a consequence, activities can show up in parallel although they are in reality not happening at the same time.
How to fix: Make sure that only the right columns are configured as a timestamp: For each column, the current configuration is shown in the header. Look through all your columns and make sure only your actual timestamp columns are showing the little clock symbol that indicates the timestamp configuration. Then, press the ‘Start import’ button again.
For example, in the customer service data set, we would change the configuration of the ‘Purchase Date’ column to a normal attribute as shown below.
Data scientists spend a large part of their day on exploratory analysis. In the 2015 Data Science Salary Survey, 46% of respondents said that they spend one to three hours per day on summarizing, visualizing, and understanding data, even more than on data cleansing and data preparation.
Process mining is focused on the analysis of processes, and it is an excellent tool in particular for the exploratory analysis of process-related data. If your data science project concerns business or IT processes, then you need to explore these processes and understand them first before you can train machine learning algorithms or run statistical analyses in any meaningful way.
With process mining you can get a process view of the data. The specific process view results from the following three parameters:
Case ID: The selected case ID determines the scope of the process and connects the individual steps of a process instance from the beginning to the end (for example, a customer number, order number or patient ID)
Activity: The activity name determines the steps that are shown in the process view (such as “order received” or “X-ray examination completed”).
Timestamp: One or more timestamps per step (for example for the beginning and the end of an X-ray examination) are used to calculate the process sequence and to derive parallel process steps.
When you analyze a data set with process mining, then you determine at the beginning of the analysis, which columns in the data correspond to the Case ID, activity name, and timestamps. You can set these parameters in the configuration when importing the data into the process mining tool.
When importing a CSV file into the process mining software Disco, you can specify for each column in your data set how it should be interpreted. [1]
In the following example of a purchasing process, the Case ID column (the purchase order number) is configured as Case ID, the start and complete timestamps as Timestamp, and the Activity column as Activity. As a result, the process mining software automatically produces a graphical representation of the actual purchasing process based on historical data. The process can now be further analyzed based on facts.
Usually, the first process view (and the import configuration derived from it) follows from the process understanding and the task at hand.
However, many process mining newcomers are not yet aware of the fact that a major strength of process mining, as an exploratory analysis tool, is that you can rapidly and flexibly take different perspectives on your process. The above parameters function as a lens with which you can adjust process views from different angles.
Here are three examples:
1. Focus on Another Activity
For the above purchasing process, we can shift the focus to the organizational process flow by setting the Role column (the function or department of the employee) as Activity.
This way, the same process (and even the same data set) can now be analyzed from an organizational perspective. Ping-pong behavior and increased transfer times when passing on operations between organizational units can be made visible and addressed.
2. Combined Activity
Instead of changing the focus, you can also combine different dimensions in order to get a more detailed picture of the process.
If you look at the following call center process, you would probably first set the column “Operation” as activity name. As a result, the process mining tool derives a process map with six different process steps, which represent the accepting of incoming customer calls (“Inbound Call”), the handling of emails, and internal activities (“Handle Case”).
Now, imagine that you would like to analyze the process in more detail. You would like to see how many first-level support calls are passed on to the specialists in the back office of the call center. This information is actually present in the data. The attribute “Agent Position” indicates whether the activity was handled in the first-level support (marked as FL) or in the back office (marked as BL).
To include the “Agent Position” in the activity view, you can set both the column “Operation” and the column “Agent Position” as activity name during the data import step. The contents of the two columns are now grouped together (concatenated).
As a result, we get a more detailed view of the process. We see for example that calls accepted at the first-level support were transferred 152 times to the back office specialists for further processing. Furthermore, no email-related activities took place in the back office.
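The idea of changing or combining the activity columns can be sketched in a few lines of Python. The column names and rows below are hypothetical, and the `variants` helper is an illustration of the concept rather than what happens inside the process mining tool.

```python
from collections import Counter, defaultdict

def variants(rows, case_col, activity_cols, time_col="timestamp"):
    """Count trace variants for a chosen case ID and one or more activity columns."""
    traces = defaultdict(list)
    for row in sorted(rows, key=lambda r: r[time_col]):
        # Several activity columns are concatenated into one label.
        label = " - ".join(row[col] for col in activity_cols)
        traces[row[case_col]].append(label)
    return Counter(tuple(trace) for trace in traces.values())

# Hypothetical call center rows (ISO timestamps sort correctly as text).
rows = [
    {"service_id": "Case 1", "operation": "Inbound Call",
     "agent_position": "FL", "timestamp": "2016-01-04T09:00"},
    {"service_id": "Case 1", "operation": "Handle Case",
     "agent_position": "BL", "timestamp": "2016-01-04T09:30"},
    {"service_id": "Case 2", "operation": "Inbound Call",
     "agent_position": "FL", "timestamp": "2016-01-04T10:00"},
]

simple = variants(rows, "service_id", ["operation"])
detailed = variants(rows, "service_id", ["operation", "agent_position"])
print(simple)
print(detailed)
```

Passing a different column as `activity_cols` (for example a role column) gives the organizational perspective from the first example, using the same data.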
3. Alternative Case Focus
Finally, we could question whether the service request ID of the CRM system, which was selected as the case ID, provides the desired process view for the call center process. After all, there is also a customer ID column and there are at least three different service requests noted for “Customer 3” (Case 3, Case 12 and Case 14).
What if these three requests are related and the call center agents just have not bothered to find the existing case in the system and re-open it? The result would be a reduced customer satisfaction because “Customer 3” has had to repeatedly explain the problem with every call.
The result would also be an embellished “First Call Resolution Rate.” The “First Call Resolution Rate” is a typical performance metric for call centers, which measures the number of times a customer problem could be solved with the first call.
That is exactly what happened in the customer service process of an Internet company. In a process mining project, initially the customer contact process (via telephone, Internet, e-mail or chat) was analyzed with the Service ID column chosen as the case ID. This view produced an impressive “First Contact Resolution Rate” of 98%. Of 21,304 incoming calls, apparently only 540 were repeat calls.
Then the analysts noticed that all service requests were closed fairly quickly and almost never re-opened again. To analyze the process from the customer’s perspective, the Customer ID column was chosen as a case ID. This way, all calls of a specific customer in the analyzed time period were summarized into one process instance and repeating calls became visible.
The “First Contact Resolution Rate” in reality amounted to only 82%. Only 17,065 cases were actually started by an incoming call. More than 3,000 were repeat calls, but were counted as new service requests in the system (and on the performance report!).
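The effect of switching the case ID can be illustrated with a toy example. The data is made up, and the metric below (the share of cases that consist of exactly one call) is just one possible reading of a first contact resolution rate. It is not necessarily the definition used in the project described above.

```python
from collections import Counter

def first_contact_resolution(calls, case_index):
    """Share of cases with exactly one call, for the chosen case ID column."""
    calls_per_case = Counter(call[case_index] for call in calls)
    resolved_first = sum(1 for n in calls_per_case.values() if n == 1)
    return resolved_first / len(calls_per_case)

# Hypothetical calls as (service request ID, customer ID).
calls = [
    ("S1", "C1"),
    ("S2", "C2"),
    ("S3", "C3"),
    ("S4", "C3"),  # same customer, but logged as a new service request
    ("S5", "C3"),
]

by_service = first_contact_resolution(calls, case_index=0)
by_customer = first_contact_resolution(calls, case_index=1)
print(by_service, by_customer)
```

With the service request as the case, every case looks resolved on first contact. With the customer as the case, the repeat calls of the third customer become visible and the rate drops.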
Process mining allows you to get a process perspective on your data. Moreover, it is worthwhile to consider different views on the process. Look out for other activity perspectives, possible combinations of fields, and new perspectives on what constitutes the case in the process.
You can take different views to answer different questions. Often, multiple views are necessary to obtain an overall picture of the process.
Do you want to explore the perspective changes presented in this article yourself in more detail? You can download the used example files here and analyze them directly with the freely available demo version of our process mining software Disco.
[1] Note: For the open-source software ProM (http://www.promtools.org/) you often use XML formats such as XES or MXML, which contain this configuration.
Have you completed a successful process mining project in the past months that you are really proud of? A project that went so well, or produced such amazing results, that you cannot stop telling anyone around you about it? You know, the one that propelled process mining to a whole new level in your organization? We are pretty sure that a lot of you are thinking of your favorite project right now, and that you can’t wait to share it.
We want to help you showcase your best work and share it with the process mining community. This is why we are introducing the Process Miner of the Year awards. The best submission will receive this award at this year’s Process Mining Camp, on 10 June in Eindhoven.
What we are looking for
We want to highlight process mining initiatives that are inspiring, captivating, and interesting. Projects that demonstrate the power of process mining, and the transformative impact it can have on the way organizations go about their work and get things done.
There are a lot of ways in which a process mining project can tell an inspiring story. To name just a few:
Process mining has transformed your organization, and the way you work, in an essential way.
There has been a huge impact with a big ROI, for example through cost savings or efficiency gains.
You found an unexpected way to apply process mining, for example in a domain that nobody approached before you.
You were faced with enormous challenges in your project, but you found creative ways to overcome them.
You developed a new methodology to make process mining work in your organization, or you successfully integrated process mining into your existing way of working.
Of course, maybe your favorite project is inspiring and amazing in ways that can’t be captured by the above examples. That’s perfectly fine! If you are convinced that you have done some great work, don’t hesitate: Write it up, and submit it, and take your chance to be the Process Miner of the Year 2016!
How to enter the contest
You can either send us an existing write-up of your project, or you can write about your project from scratch. It is probably better to start from a white page, since we are not looking for a white paper, but rather an inspiring story, in your own words.
In any case, you should download this Word document, which contains some more information on how to get started. You can use it either as a guide, or as a template for writing down your story.
When you are finished, send your submission to email@example.com no later than 30 April 2016.
We can’t wait to read about your amazing projects!
Eindhoven can be reached conveniently through a direct train connection from Amsterdam’s Schiphol airport. Mark the day in your calendar, and start making plans for your trip to the birthplace of process mining! You should also sign up for the camp mailing list to receive updates about this year’s camp, and to be the first to know when ticket sales open.
Share your story
We are currently busy putting together the program of this year’s camp, and we have already secured a number of speakers with great stories to tell. A lot of you have been doing great work lately, and some of the best process mining stories that we are aware of have already made their way onto this year’s camp program.
Before we finalize the program, we wanted to give all of you the opportunity to help us shape this year’s camp. Would you like to point us to interesting stories or topics that may not be on our radar yet? Do you have a great process mining story you would like to share at this year’s camp, or do you know someone who might? Send Christian an email at firstname.lastname@example.org and let us know!
See you on 10 June!
Process mining camp is our annual practitioner conference for process miners all over the world. It is not only a place to hear interesting and inspiring talks from other process miners, but also the annual family meeting of the global process mining community. Over the past four years, process mining enthusiasts from more than 17 different countries (including Australia, Korea, Brazil, South Africa and the United States) have come together to exchange their experiences and meet their peers.
In 2012, more than 70 smart and driven people joined us for the first Process Mining Camp. In 2013, we moved Process Mining Camp to the Zwarte Doos and added workshops, and we had a great day with more than 100 process mining enthusiasts from all over the world. In 2014, camp tickets sold out very quickly, and process mining enthusiasts from more than 16 countries came for a varied program including workshops, keynotes, and a panel discussion. In 2015, we moved to the auditorium to make more room, and 173 people from 17 different countries joined us at camp.
This year will be the greatest camp ever, and we cannot wait to meet you in Eindhoven!
This week, we are moving to the timestamp problems. Timestamps are really the Achilles heel of data quality in process mining. Everything is based on the timestamps: Not just the performance measurements but also the process flows and variant sequences themselves. So, over the next weeks we will look at the most typical timestamp-related issues.
Zero timestamps (or future timestamps)
One data problem that you will almost certainly encounter at some point is so-called zero timestamps, or other kinds of default timestamps that are given by the system. Often, zero timestamps were initially set as an empty value by the programmer of the information system. They can either be a mistake or indicate that the real timestamp has not yet been provided (for example, because an expected process step has not happened yet). Another reason can be typos in manually entered data.
These Zero timestamps typically take the form of 1 January 1900, the Unix epoch timestamp 1 January 1970, or some future timestamp (like 2100).
To find out whether you have Zero timestamps in your data, you can best go to the Overview statistics and take a look at the earliest and the latest timestamps in the data set. For example, in the screenshot below we can see that there is at least one 1900 timestamp in the imported data.
You should know what timeframe you are expecting for your data set and then verify that the earliest and latest timestamp confirm the expected time period. Be aware that if you do not address a problem like the 1900 timestamp in the picture above, you may end up with case durations of more than 100 years!
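If you want to run the same sanity check on the raw data before importing it, a minimal sketch could look like this. The timestamps and the expected timeframe (the year 2015) are assumptions for illustration.

```python
from datetime import datetime

def outside_expected(timestamps, start, end):
    """Return all timestamps that fall outside the expected timeframe, sorted."""
    return sorted(t for t in timestamps if not (start <= t <= end))

# Hypothetical event timestamps with one 1900 and one 2100 default value.
timestamps = [
    datetime(2015, 3, 1, 9, 0),
    datetime(1900, 1, 1),
    datetime(2015, 6, 12, 14, 30),
    datetime(2100, 1, 1),
]

print(min(timestamps), max(timestamps))  # earliest and latest timestamp
suspicious = outside_expected(
    timestamps, datetime(2015, 1, 1), datetime(2015, 12, 31, 23, 59, 59)
)
print(suspicious)
```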
How to fix: You can remove Zero timestamps using the Timeframe filter in Disco (see instructions below).
You may also want to communicate your findings back to the system administrator to find out how these Zero timestamps can be avoided in the future.
To understand the impact of the Zero timestamps, you first need to investigate in more detail what is going on.
You want to find out whether just a few cases are affected by the Zero timestamps, or whether this is a wide-spread problem. For example, if Zero timestamps are recorded in the system for all activities that have not happened yet, you will see them in all open cases.
To investigate the cases that have Zero timestamps, add a Timeframe filter and use the ‘Intersecting timeframe’ mode while focusing on the problematic time period. This will keep all those cases that contain at least one Zero timestamp. Then use the ‘Copy and filter’ button to create a new data set focusing on the Zero timestamp cases (see screenshot below).
As a result, you will see just the cases that have Zero timestamps in them. You can see how many there are. Furthermore, you can inspect a few example cases to see whether the problem is always in the same place or whether multiple activities are affected. In our example, just two cases contain Zero timestamps (see below).
Now, let’s move on to fix the Zero timestamp problem in the data set.
Then: Remove cases or Zero timestamps only
Depending on whether Zero timestamps are a wide-spread problem or not you can take two different actions:
If only a few cases are affected, you can best remove these cases altogether. This way, they will not disturb your analysis. At the same time you will not be left with partial cases that miss some activities because of data issues.
If many cases are affected, like in the situation that Zero timestamps were recorded for activities that have not happened yet, you can better remove just the events that have Zero timestamps and keep the rest of these cases for your analysis.
In our example, just two cases are affected and we will remove these cases altogether. To do this, add a Timeframe filter and choose the ‘Contained in timeframe’ option while focusing your selection on the expected timeframe. This will remove all cases that have any events outside the chosen timeframe (see screenshot below).
If you just want to remove the activities that have Zero timestamps, choose the ‘Trim to timeframe’ option instead. This will “cut off” all events outside of the chosen timeframe and keep the rest of these cases in your data (see below).
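The difference between the three Timeframe filter modes can be sketched in Python. The functions below follow the descriptions in this article (keep cases with at least one event in the window, keep cases that lie fully in the window, and cut off events outside the window). They are a simplified illustration with made-up data, not Disco’s implementation.

```python
from collections import defaultdict
from datetime import datetime

def group_by_case(events):
    cases = defaultdict(list)
    for event in events:
        cases[event[0]].append(event)
    return cases

def intersecting(events, start, end):
    # Keep complete cases that have at least one event inside the window.
    kept = []
    for case_events in group_by_case(events).values():
        if any(start <= ts <= end for _, _, ts in case_events):
            kept.extend(case_events)
    return kept

def contained(events, start, end):
    # Keep only cases whose events all fall inside the window.
    kept = []
    for case_events in group_by_case(events).values():
        if all(start <= ts <= end for _, _, ts in case_events):
            kept.extend(case_events)
    return kept

def trim(events, start, end):
    # Drop individual events outside the window, keep the rest of each case.
    return [e for e in events if start <= e[2] <= end]

# Hypothetical events as (case, activity, timestamp).
events = [
    ("A", "Create case", datetime(2015, 3, 1)),
    ("A", "Close case", datetime(2015, 3, 2)),
    ("B", "Create case", datetime(1900, 1, 1)),
    ("B", "Close case", datetime(2015, 3, 5)),
]
expected = (datetime(2015, 1, 1), datetime(2015, 12, 31))
zero_period = (datetime(1899, 1, 1), datetime(1901, 1, 1))

affected = intersecting(events, *zero_period)  # cases to investigate
clean_cases = contained(events, *expected)     # affected cases removed
trimmed = trim(events, *expected)              # only Zero events removed
print(len(affected), len(clean_cases), len(trimmed))
```

In this toy data, case B is the affected case. Removing affected cases leaves only case A, while trimming keeps case B without its 1900 event.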
Note that if your Zero timestamps indicate that certain activities have not happened yet, it would be better to keep the timestamp cells in the source data empty, rather than filling in a 1900 or 1970 timestamp value (see example below).
Events with empty timestamps will not be imported in Disco, because they cannot be placed in the sequence of activities for the case. So, keeping the timestamp cell empty for activities that have not occurred yet will save you this extra clean-up step in the future.
Finally: Make a clean copy
Once you have cleaned up the Zero timestamps from your data, you can best make a new copy using the ‘Apply filters permanently’ option to get a fresh start (see screenshot below). The result will be a new (cleaned) data set, which can now serve as the starting point for your analysis.
That’s it! You have successfully removed your Zero timestamps, and any new filters that you add from now on will be based on your cleaned data.
Even if your data imported without any errors, there may still be problems with the data. For example, one typical problem is missing data. Keep reading to learn more about some of the most common types of missing data in process mining.
Gaps in the timeline
Check the timeline in the ‘Events over time’ statistics to see whether there are any unusual gaps in the amount of data over your log timeframe.
The picture above shows an example, where I had concatenated three separate files into one file before importing it in Disco. Clearly, something went wrong and apparently the whole data from the second file is missing.
How to fix:
If you made a mistake in the data pre-processing step, you can go back and make sure you include all the data there.
If you have received the data from someone else, you need to go back to that person and ask them to fix it.
If you have no way of obtaining new data, it is best to focus on an uninterrupted part of the data set (in the example above, that would be just the first or just the third part of the data). You can do that using the Timeframe filter in Disco.
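If you want to check for such gaps before the import, for example in a pre-processing script, a rough sketch could look like this. The function name and the one-week threshold are assumptions for illustration; what counts as an unusual gap depends on your process.

```python
from datetime import date, timedelta

def find_gaps(event_dates, min_gap_days=7):
    """Return (last_day_before, first_day_after) pairs for every stretch of
    more than min_gap_days between two consecutive days that have events."""
    days = sorted(set(event_dates))
    gaps = []
    for prev, nxt in zip(days, days[1:]):
        if (nxt - prev).days > min_gap_days:
            gaps.append((prev, nxt))
    return gaps

# Events on every day in January and March 2015, nothing in February
dates = [date(2015, 1, 1) + timedelta(days=i) for i in range(31)]
dates += [date(2015, 3, 1) + timedelta(days=i) for i in range(31)]

for before, after in find_gaps(dates):
    print(f"No events between {before} and {after}")
```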
Unexpected amount of data
You should have an idea about (roughly) how many rows or cases of data you are importing. Take a look at the overview statistics to see whether they match up.
For example, the picture below shows a screenshot of the overview statistics of the BPI Challenge 2013 data set. Can you see anything wrong with it?
In fact, the total number of events is suspiciously close to the old Excel limit of 65,536 rows. And this is what happened: In one of the data preparation steps, the data (which had several hundred thousand rows) was opened with an old Excel version and saved again.
Of course, this is a bit more subtle than an obvious gap in the timeline, but missing data can have all kinds of reasons. For some systems or databases, a large data extract is aborted half-way without anyone noticing. That’s why it is a very good idea to have a sense of how much data you are expecting before you start with the import (ask the person who gives you the data how they structured their query).
How to fix:
If you are missing data, you must find out whether you lost it in a data pre-processing step or in the data extraction phase.
If you have received the data from someone else, you need to go back to that person and ask them to fix it.
If you have no way of obtaining new data, try to get a good overview of which part of the data you got. Is it random? Was the data sorted and you got the first X rows? How does this impact your analysis possibilities? Some of the BPI Challenge submissions noticed that something was strange and analyzed the data pattern to better understand what was missing.
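A simple sanity check along these lines is to compare the number of rows you received with well-known tool limits. The sketch below uses the worksheet row limits of Excel (65,536 rows for the old .xls format, 1,048,576 rows for .xlsx); the function name and the tolerance of a few rows (for header lines and similar) are assumptions for this example.

```python
KNOWN_ROW_LIMITS = {
    65_536: "Excel 97-2003 (.xls) worksheet row limit",
    1_048_576: "Excel 2007 and later (.xlsx) worksheet row limit",
}

def suspicious_row_limit(row_count, tolerance=5):
    """Return a description if row_count sits at or just below a known limit."""
    for limit, label in KNOWN_ROW_LIMITS.items():
        if 0 <= limit - row_count <= tolerance:
            return label
    return None

for count in (65_535, 200_000):
    hint = suspicious_row_limit(count)
    print(count, "->", hint or "no known limit nearby")
```

A hit does not prove that data was cut off, but it is a good reason to ask how the file was produced.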
Unexpected distribution or empty attribute values
Similarly, you should have an idea of the kind of attributes that you expect in your data. Did you request the data for all call center service requests for the Netherlands, Germany, and France from one month, but the volumes suggest that the data you got is mostly from the Netherlands?
Another thing to watch out for is empty values in your attributes. For example, the resource attribute statistics in the screenshot below show that 23% of the steps have no resource attached at all.
Empty values can also be normal. Talk to a process domain expert and someone who knows the information system to understand the meaning of the missing values in your situation.
How to fix:
If you have unexpected distributions, this could be a hint that you are missing data and you should go back to the pre-processing and extraction steps to find out why.
If you have empty attribute values, often these values are really missing and were never recorded in the first place. Make sure you understand how these missing (or unexpectedly distributed) attribute values impact your analysis possibilities. You may come to the conclusion that you cannot use a particular attribute for your analysis because of these quality problems.
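Both checks (the share of empty values and the distribution of the values) can also be done on the raw data before the import. Here is a minimal sketch, assuming the rows are available as dictionaries (for example, from `csv.DictReader`); the column names, sample values, and function names are made up for illustration.

```python
from collections import Counter

def empty_share(rows, column):
    """Fraction of rows whose value in `column` is missing or blank."""
    if not rows:
        return 0.0
    empty = sum(1 for row in rows if not (row.get(column) or "").strip())
    return empty / len(rows)

def value_distribution(rows, column):
    """Count how often each non-empty value of `column` occurs."""
    return Counter(
        row[column] for row in rows if (row.get(column) or "").strip()
    )

rows = [
    {"country": "NL", "resource": "Anna"},
    {"country": "NL", "resource": ""},
    {"country": "NL", "resource": "Ben"},
    {"country": "DE", "resource": "Carla"},
]

print(f"{empty_share(rows, 'resource'):.0%} of the steps have no resource")
print(value_distribution(rows, "country").most_common())
```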
It is not uncommon to discover data quality issues in your original data source during the process mining analysis, because nobody may have looked at that data the way you do. By showing the potential benefits of analyzing the data, you are creating an incentive for improving the data quality (and, therefore, increasing the analysis possibilities) over time.
Cases with unexpected number of steps
As a next check, you should look out for cases with a very high number of steps (see below). In the shown example, the callcenter data from the Disco demo logs was imported with the Customer ID configured as the case ID.
What you find is that, while a total of 3231 customer cases had up to a maximum of 30 steps, there is this one case (Customer 3) that had a total of 583 steps over a timeframe of two months. That cannot be quite right, can it?
To investigate this further, you can right-click the case ID in the table and select the “Show case details” option (see below).
This will bring up the Cases view with that particular case shown (see below). It turns out that there were a lot of short inbound calls coming in rapid succession. The consultation with a domain expert confirms that this is not a real customer, but some kind of default customer ID that is assigned by the Siebel CRM system if no customer was created or associated by the callcenter agent (for example, because it was not necessary, or because the customer hung up before the agent could capture their contact information).
Although in this data set there is technically a case ID associated, this is really an example of missing data. The real cases (the actual customers that called) are not captured. This will have an impact on your analysis. For example, analyzing the average number of steps per customer with this dummy customer in it will give you wrong results. You will encounter similar problems if the case ID field is empty for some of your events (they will all be grouped into one case with the ID “empty”).
How to fix:
You can simply remove the cases with such a large number of steps in Disco (see below). Make sure you keep track of how many events you are removing from the data and how representative your remaining dataset still is after doing that.
To remove the “Customer 3” case from the callcenter data above, you can right-click the case in the overview statistics and select the Filter for case ‘Customer 3’ option [1].
In the filter, you can then invert the selection (see the little Yin Yang button in the upper right corner) to exclude Customer 3. To create a new reference point for your cleaned data, you can tick the ‘Apply filters permanently’ option after pressing the ‘Copy and filter’ button:
The result will be a new log with the very long case removed and the filter permanently applied (you have a clean start).
[1] Alternatively, you could also use a Performance filter with the ‘Number of events’ metric to remove cases that are overly long.
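Outside of Disco, the same idea (counting the events per case and keeping track of how much data you would remove) can be sketched as follows. The function name, the sample case IDs, and the threshold of 30 events are assumptions for this example.

```python
from collections import Counter

def long_cases(case_ids, max_events):
    """Return {case_id: event_count} for all cases with more than max_events events."""
    counts = Counter(case_ids)
    return {case: n for case, n in counts.items() if n > max_events}

# One case ID per event, as it would appear in the case ID column of the log
case_ids = ["Customer 1"] * 3 + ["Customer 2"] * 5 + ["Customer 3"] * 583

outliers = long_cases(case_ids, max_events=30)
removed = sum(outliers.values())
print(outliers)  # {'Customer 3': 583}
print(f"Removing these cases drops {removed} of {len(case_ids)} events")
```

The printed share of removed events is the number to keep an eye on when judging how representative the remaining data still is.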
However, most of that data was not originally collected for process mining purposes. And especially data that has been manually entered can always contain errors. How do you make sure that errors in the data will not jeopardize your analysis results?
Data quality is an important topic for any data analysis technique: If you base your analysis results on data, then you have to make sure that the data is sound and correct. Otherwise, your results will be wrong! If you show your analysis results to a business user and they turn out to be incorrect due to some data problems, then you can lose their trust in process mining forever.
For example, the delimiting character (“,” “;” “|” etc.) cannot be used in the content of a field without proper escaping. If you look at the example snippet below, you can see that the “,” delimiter has been used to separate the columns. However, in the last row the activity name itself contains a comma:
Another problem might be that your file has fewer columns in some rows compared to others (see example below).
Other typical problems include invalid characters and quotes that open but do not close, among many others.
If Disco encounters a formatting problem, it gives you the following error message with the sad triangle and also tries to indicate in which line the problem occurs (see below).
In most cases, Disco will still import your data and you can take a first look at it, but make sure to go back and investigate the problem before you continue with any serious analysis.
We recommend opening the file in a text editor and looking around the indicated line number (a bit before and afterwards, too) to see whether you can identify the root cause.
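For larger files, a small script can help to locate candidate lines. The sketch below reports every row whose number of fields differs from the header row. It is only an illustration: the function name and sample data are made up, and the check says nothing about how Disco itself detects formatting problems.

```python
import csv
import io

def inconsistent_rows(text, delimiter=","):
    """Return (number of header columns, [(line number, field count), ...])
    for all rows whose field count differs from the header row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    problems = []
    for row in reader:
        if len(row) != len(header):
            problems.append((reader.line_num, len(row)))
    return len(header), problems

sample = (
    "case,activity,timestamp\n"
    "1,Create,2015-03-01\n"
    "2,Check, approve,2015-03-02\n"  # unescaped comma in the activity name
    "3,Close\n"                      # missing column
)

expected, problems = inconsistent_rows(sample)
for line_number, found in problems:
    print(f"Line {line_number}: expected {expected} fields, found {found}")
```

For a real file, you would read the text from disk and pass the delimiter that your file actually uses.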
How to fix: Occasionally, the formatting problems have no impact on your data (for example, an extra comma at the end of some of the lines in your file). Or the number of lines impacted is so small that you choose to ignore it. But in most cases you do need to fix it.
Sometimes, it is enough to use “Find and Replace” in Excel to remove or replace the delimiting character in the content of your cells and export a new, cleaned CSV that you then import.
However, in most cases it will be easiest to point out the problem that you found to the person who extracted the data for you and ask them to give you a new file that avoids the problem.
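If the extract is produced by a script, one way to avoid delimiter problems in the first place is to let a CSV library do the quoting instead of joining the fields by hand. Here is a minimal sketch using Python’s standard `csv` module; the sample rows are made up for illustration.

```python
import csv
import io

rows = [
    ["case", "activity", "timestamp"],
    ["1", "Check, approve", "2015-03-02"],  # activity name contains the delimiter
]

# csv.writer puts quotes around any field that contains the delimiter
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
exported = buffer.getvalue()
print(exported)

# Reading the text back yields the original three columns per row
parsed = list(csv.reader(io.StringIO(exported)))
print(parsed == rows)  # True
```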