This is Flux Capacitor, the company weblog of Fluxicon.

Top 5 Data Quality Problems for Process Mining

“Garbage in, garbage out” — Most of you will know this phrase. For any data analysis technique the quality of the underlying data is important. Otherwise you run the risk of drawing the wrong conclusions.

In this post, I want to go over the five biggest data problems that you might encounter in a process mining project.

1. Incorrect logging

In the process mining world most people use the term “Noise” for exceptional behavior – not for incorrect logging. This means that if a process discovery algorithm is said to be able to deal with noise, then it can abstract from low-frequency behavior by only showing the main process flow. The reason is simple: It is impossible for discovery algorithms to distinguish incorrect logging from exceptional events.

What incorrect logging means is that the recorded data is wrong. The problem is that in such a situation the data does not reflect “the Truth” but instead provides wrong information about reality.

Here are two true stories of incorrect data:

The message here is to be careful with manually created data because it is usually less reliable than automatically registered data. If there are doubts about the trustworthiness of the data, then the data quality should be examined first before proceeding with the analysis.

Another example is inconsistencies in logging due to human differences: For example, one person may hit the “completed” button in a workflow system at the beginning of a task and another person at the end. Only when you are aware of such inconsistencies can they be factored in during the analysis.

2. Insufficient logging

While incorrect logging is about wrong data, insufficient logging is about missing data. The minimum requirements for process mining are a case ID, an activity name, and a timestamp per event to reconstruct the history of each process instance.
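
As an illustration, here is a minimal sketch in Python/pandas of how the history of each case can be reconstructed from just these three columns. The file and column names are invented for this example and do not refer to any particular tool.

    # Minimal sketch: an event log needs at least a case ID, an activity
    # name, and a timestamp per event. File and column names are invented.
    import pandas as pd

    log = pd.read_csv("eventlog.csv", parse_dates=["timestamp"])
    # expected columns: case_id, activity, timestamp

    # Order the events within each case by time to reconstruct its history
    log = log.sort_values(["case_id", "timestamp"])
    histories = log.groupby("case_id")["activity"].apply(list)

    print(histories.head())  # e.g. case 1 -> ['Register', 'Check', 'Approve']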

Typical problems with missing data are:

Typical OLAP and data mining techniques do not require the whole history of a process, and therefore data warehouses often do not contain all the data that is needed for process mining.

Another problem is that, ironically, by logging too much data there is sometimes not enough data. I have heard of more than one SAP or enterprise service bus system that does not keep logs longer than one month because of the sheer amount of data that would accumulate otherwise. But processes often run longer than one month and, therefore, logs from a larger timeframe would be needed.

Finally, for specific types of analysis additional data is required. For example, to calculate execution times for activities both start and completion timestamps must be available in the data. For an organizational analysis, the person or the department that performed an activity should be included in the log extract, and so forth.
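
For example, a rough sketch (with assumed column names and lifecycle values, not a fixed standard) of how activity execution times could be derived once both start and completion timestamps are available:

    # Rough sketch with invented column names: each event has a case_id, an
    # activity, a lifecycle value ('start' or 'complete') and a timestamp.
    import pandas as pd

    log = pd.read_csv("eventlog.csv", parse_dates=["timestamp"])

    # Take the first start and first completion per case and activity
    # (a simplification: repeated activities within a case need more care)
    times = (log.groupby(["case_id", "activity", "lifecycle"])["timestamp"]
                .min()
                .unstack("lifecycle"))
    times["execution_time"] = times["complete"] - times["start"]

    print(times["execution_time"].describe())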

3. Semantics

One of the biggest challenges can be to find the right information and to understand what it means.

In fact, figuring out the semantics of existing IT logs can be anything from really easy to incredibly complicated. It largely depends on how distant the logs are from the actual business logic. For example, the performed business process steps may be recorded directly with their activity name, or you might need a mapping between some kind of cryptic action code and the actual business activity.
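
As an illustration of the second case, such a mapping can be as simple as a lookup table from action codes to business activity names. The codes and names below are invented for this sketch:

    # Purely illustrative: the action codes and activity names are invented.
    ACTION_CODE_TO_ACTIVITY = {
        "A07": "Register claim",
        "B12": "Check coverage",
        "C03": "Approve payment",
    }

    def to_activity(code: str) -> str:
        # Keep unmapped codes visible instead of silently dropping them
        return ACTION_CODE_TO_ACTIVITY.get(code, f"Unknown ({code})")

    print(to_activity("B12"))  # Check coverage
    print(to_activity("Z99"))  # Unknown (Z99)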

It is best to work together with an IT specialist who can help you extract the right data and explain the meaning of the different fields. In terms of process mining, it helps not to try to understand everything at once. Instead, focus first on the three essential elements:

  1. How to differentiate process instances,
  2. Where to find the activity logs, and
  3. Where to find the start and/or completion timestamps for activities.

In the next phase, one can look further for additional data that would enhance the analysis from a business perspective.

4. Correlation

Because process mining is based on the history of a process, the individual process instances need to be reconstructed from the log data. Correlation is about stitching everything together in the correct way:

Overall, it is best to start simple (and ideally with one system) to pick the low-hanging fruits first and demonstrate the value of process mining.
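
As a sketch of what such stitching can look like in the simple case where two systems share a usable key, consider the following (all file, column, and system names are invented):

    # Invented example: a CRM system and a billing system both record events
    # for the same order, but under different column names for the order number.
    import pandas as pd

    crm = pd.read_csv("crm_events.csv", parse_dates=["timestamp"])
    billing = pd.read_csv("billing_events.csv", parse_dates=["timestamp"])

    # Harmonize the correlation key so the order number becomes the case ID
    crm = crm.rename(columns={"order_no": "case_id"})
    billing = billing.rename(columns={"order_reference": "case_id"})
    crm["source"] = "crm"
    billing["source"] = "billing"

    # Stack both logs into one event log and order the events per case
    merged = pd.concat([crm, billing], ignore_index=True)
    merged = merged.sort_values(["case_id", "timestamp"])
    merged.to_csv("merged_eventlog.csv", index=False)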

5. Timing

Precisely because process mining evaluates the history of performed process instances, the timing is very important for ordering the events within each sequence. If the timestamps are wrong or not precise enough, then it is difficult to create the correct order of events in the history.

Some of the problems I have seen with timestamps are:

Ideally, timestamps should be precise, should not be rounded up or down, and should be synchronized (if there are multiple systems). If there are differences, it may help to work with offsets. If too many events have the same timestamp, one can try to use the original sequence of events.
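
A small sketch of these two workarounds, reusing the invented merged log from the correlation sketch above (the offset value and column names are also invented): shift the timestamps of one source system by a known offset and fall back to the original record order when timestamps are identical.

    # Invented example: the 'billing' system clock runs 42 seconds behind,
    # and ties are broken by the original record order in the extract.
    import pandas as pd

    log = pd.read_csv("merged_eventlog.csv", parse_dates=["timestamp"])

    # Apply a per-system clock offset (in seconds)
    OFFSETS_SECONDS = {"crm": 0, "billing": 42}
    log["timestamp"] = log["timestamp"] + pd.to_timedelta(
        log["source"].map(OFFSETS_SECONDS).fillna(0), unit="s"
    )

    # Keep the original row order as a tie-breaker for equal timestamps
    log["original_order"] = range(len(log))
    log = log.sort_values(["case_id", "timestamp", "original_order"])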

Too many problems?

If all this sounds terrible, do not despair. Not all data are bad, and starting simple helps. Furthermore, it is surprising how many valuable results can be obtained from existing log data that were not even created with analysis purposes in mind.

Insight into data quality problems and bad data is often one of the first good results. Improving data is important as analyzability becomes more and more relevant. I liked what Mark Norton wrote in his comment on a recent blog post about the monetary value of data by Forrester Analyst Rob Karel:

If you don’t have the data, decisions can’t be made (by definition), and if decisions can’t be made, the organization cannot create value. So there is also an ‘opportunity cost’ associated with non-existent or bad data.

What are your experiences with bad data?

Comments (11)

These are important points. Some of my own –

Incorrect logging – another issue is misinterpreting the data. On many systems there are fields to enter the date and time the goods and documentation have arrived. Typically the clerk or manager will enter these into the ERP system some time after the work has been done (in a batch). For example, in many systems the ‘goods received’ time-stamp is often the system time the data was entered (not the time it arrived). Beware the system-generated time-stamp – it may not be what you want!

Timestamp resolution – SAP has this in certain PO fields – only a date and no time – so accuracy is compromised.

Data cleansing is often overlooked. You can look at a log and apply ‘intelligence’ – your own!
Does the data look right? Do the headings mean what you think they mean? Is the spelling correct? Are the time-stamps consistent? Can you fill in any of the blank fields yourself based on intelligent guesswork? Are there special causes of variation that should be removed? Get an SME to do the same. Improve the quality of the log and the process mining will be more accurate and meaningful.

Great comment. Thank you George!

Although I would be careful with filling in blank fields myself. Do you have an example where this makes sense?

Yes, care must be taken – only where there is an obvious pattern in a long list. For example, if the phone code is ’61’ and the country code is blank, “AUS” would be a good guess. I would only go with the straightforward ones.

Another thing I just thought of would involve a situation where a code is out of date or no longer used. The history logs don’t ever get amended, so there might be non-unique codes for the same thing on file – this is where creativity is important.

This could be important if the country code or phone code was driving a decision. I suppose you have to weigh up the risk of being wrong against the importance of a complete log.

Yes, I agree. Thanks for the additions.

Great topic, which is really important. I have worked with a couple of datasets, both from the healthcare domain and the dentistry domain, and I recognize many of the issues raised by Anne.

Let me mention the ones that immediately come to mind:
– Incorrect logging – Once I analyzed a dataset coming from an IC department of a certain hospital. In the system, the doctors and nurses registered every action that they had taken for a patient. In many cases, starts or completes were missing. However, most importantly, I often saw that several actions had been completed a couple of milliseconds after each other. This just showed that at a certain point the people working with the system registered that they had finished several actions. So, the data rather showed how the people were working with the system instead of how they performed their work. Here, I really experienced that it makes sense to check a couple of process instances before starting mining…

Semantics – I had a dataset from a dental practice consisting of a group of over 50 patients. For each patient, all the appointments that had taken place were stored. Here, the issue was that the ‘remarks’ field of the appointment had to be used as the name of each event. The ‘remarks’ field of an appointment is actually a free text field. As a consequence, many events refer to the same subject although they have different names! For example, for the event names ‘impl cons: 15 min eerder!!’, ‘kaart! impl cons: 15 min eerder!!’ and ‘kaart !! impl cons: 15 min eerder!!’ only some characters are different, but they all refer to the subject ‘impl cons’, which represents an implant consultation. Additionally, it may also be the case that event names are completely different but still refer to the same subject. For example, the event names ‘pijn na impl vrijdag’ and ‘mw is bang voor ontst’ both refer to the fact that there is a problem regarding the placement of a dental implant.

Here, I developed a simple ProM5 plugin which helped me to map the different event names to the correct (higher-level) event name. If this mapping had not been done, I surely would have obtained a spaghetti model which would certainly not be correct, because many event names refer to the same subject.
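
Not the actual ProM5 plugin, but as a minimal illustration of the idea, the mapping boils down to a lookup from the raw event names to one canonical higher-level name (sketched here in Python with the event names mentioned above):

    # Illustration only: map the different raw event names onto one
    # canonical (higher-level) event name.
    EVENT_NAME_MAP = {
        "impl cons: 15 min eerder!!": "implant consultation",
        "kaart! impl cons: 15 min eerder!!": "implant consultation",
        "kaart !! impl cons: 15 min eerder!!": "implant consultation",
        "pijn na impl vrijdag": "implant placement problem",
        "mw is bang voor ontst": "implant placement problem",
    }

    def canonical_name(raw_name: str) -> str:
        # Unmapped names are kept as-is so they can be reviewed and added later
        return EVENT_NAME_MAP.get(raw_name.strip(), raw_name)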

  • Filling in blank fields as mentioned by George – For the dental dataset mentioned above I also had the issue of blank fields, such that the name of the event was not known. Based on the data that was associated with the event, it was possible to identify the correct name of some of the events. However, in my opinion things get dangerous if you fill in a blank field in the sense of ‘I think it is this event name’. Basically, you are then influencing the result of the process mining. So, I would rather choose to throw out the entire process instance.

Correlation – Again a nice example in the context of the dental dataset. For this dataset I also had data from a corresponding dental lab which was responsible for the production of the final restoration for each patient (e.g. a crown on an implant). Here, the challenge was to stitch each patient and the corresponding product(s) in the lab together. The end of the story was that it needed to be done manually. For example, take the name W.M.P. van der Aalst. In the dental practice they use ‘W.M.P. van der Aalst’ as the patient name and ’15’ as the identifier for the patient. In the lab this could mean that they use as identifier ‘W.M.P. van der Aalst’, ‘W.M.P. van der Aalst 15’, ‘Aalst, W.M.P. van der’, etc.

As the final goal was to arrive at a simulation model, it was clear that the stitching together needed to be done manually, such that for each patient we could follow the things done in the dental practice and in the dental lab.

Obviously, in my opinion, data quality is certainly a very important issue that needs to be well addressed before you start mining and is something that should not be overlooked. Even more, it could take more time to perform the preprocessing than to do the mining itself!!!

Thank you, Ronny. These are fantastic examples.

Your comments actually bring me to the conclusion that there should be two additional problem categories: 6. Structuredness and 7. Sanity of the data.

With ‘Structuredness’ I mean that the data needs to be available in a structured form (so, no free text that first needs to be processed into structured data).

‘Sanity’ then relates more to the syntactic requirements, such as that there should be no illegal characters, or that a CSV file must not use the delimiting character (such as the quote or the semicolon) in the body of the data without escaping it (both are things I had to clean up in the past).
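
As a small sketch of that second point (file and field values are invented): when writing the export, quoting the fields prevents a delimiter inside the data from breaking the CSV.

    # Invented example: activity descriptions may contain the semicolon that
    # is also used as the CSV delimiter, so every field is quoted when writing.
    import csv

    rows = [
        ["case-1", "Check document; send reminder", "2011-05-12 10:30"],
        ["case-2", "Call customer about 'urgent' request", "2011-05-12 11:05"],
    ]

    with open("clean_export.csv", "w", newline="") as f:
        writer = csv.writer(f, delimiter=";", quoting=csv.QUOTE_ALL)
        writer.writerow(["case_id", "activity", "timestamp"])
        writer.writerows(rows)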

Hi Anne,

I completely agree with you.

Sanity tests have always been around – not just in IT. It goes to the definition of ‘garbage’. I was not suggesting you throw out data you don’t like, but you should remove data that has slipped through edit checks or bad programming. Special causes of variation need to be removed for the same reason. This is what SMEs can help with too!


Great article on the problems of process mining. One of the great challenges I face during process mining is mining loops. For example, in the thesis I am doing, the user is allowed to upload multiple images. Problems could arise at any stage of uploading, but these occurrences are mapped as single tasks.

Hi Suresh,

Right, I can imagine that it is not easy to map this kind of data for process mining. Feel free to reach out via anne@fluxicon.com if you want to discuss your case in more detail!

Best,
Anne

