You are reading Flux Capacitor, the company weblog of Fluxicon.
Here, we write about process intelligence, development, design, and everything that scratches our itch. Hope you like it!


Privacy, Security and Ethics in Process Mining — Part 4: Establish a Collaborative Culture

This is the 4th and last article in our series on privacy, security and ethics in process mining. You can find an overview of all articles in the series here.

Perhaps the most important ingredient in creating a responsible process mining environment is to establish a collaborative culture within your organization. Process mining can make the flaws in your processes very transparent, much more transparent than some people may be comfortable with. Therefore, you should include change management professionals in your team, for example, Lean practitioners who know how to encourage people to tell each other “the truth” (see also our article on Success Criteria for Process Mining).

Furthermore, be careful how you communicate the goals of your process mining project and involve relevant stakeholders in a way that ensures their perspective is heard. The goal is to create an atmosphere where people are not blamed for their mistakes (which only leads to them hiding what they do and working against you), but where everyone is on board with the goals of the project and where the analysis and the process improvement are a joint effort.


Do: 


Don’t:


Privacy, Security and Ethics in Process Mining — Part 3: Anonymization

This is the 3rd article in our series on privacy, security and ethics in process mining. You can find an overview of all articles in the series here.

If you have sensitive information in your data set, instead of removing it you can also consider the use of anonymization techniques. When you anonymize a set of values, then the actual values (for example, the employee names “Mary Jones”, “Fred Smith”, etc.) will be replaced by another value (for example, “Resource 1”, “Resource 2”, etc.).

If the same original value appears multiple times in the data set, then it will be replaced with the same replacement value (“Mary Jones” will always be replaced by “Resource 1”). This way, anonymization allows you to obfuscate the original data but it preserves the patterns in the data set for your analysis. For example, you will still be able to analyze the workload distribution across all employees without seeing the actual names.
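If you prepare your data yourself before importing it into a process mining tool, such a consistent replacement can also be done in a simple pre-processing step. Here is a minimal sketch in Python; the file and column names are made up for illustration:

```python
import pandas as pd

# Load the event log (hypothetical file and column names)
log = pd.read_csv("event_log.csv")

# Build a consistent mapping: each distinct resource name gets exactly one pseudonym
resources = log["Resource"].dropna().unique()
mapping = {name: f"Resource {i + 1}" for i, name in enumerate(resources)}

# Replace every occurrence with its pseudonym; the patterns in the data are preserved
log["Resource"] = log["Resource"].map(mapping)

log.to_csv("event_log_anonymized.csv", index=False)
```

Because the mapping is built once and reused for every row, “Mary Jones” is always replaced by the same “Resource n”, so workload distributions and handover patterns stay intact.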

Some process mining tools (Disco and ProM) include anonymization functionality. This means that you can import your data into the process mining tool and select which data fields should be anonymized. For example, you can choose to anonymize just the Case IDs, the resource name, attribute values, or the timestamps. Then you export the anonymized data set and you can distribute it among your team for further analysis. 


Do:

Don’t: 


Anonymization of Common Process Mining Fields

Here is an overview of the typical process mining attributes and why you might want (or might not want) to anonymize them: 


Resource name

Removing the names of the employees working in the process is one of the more common anonymization steps. It can help to decrease friction and put employees more at ease when you involve them in a joint analysis workshop. Anonymizing employee names is certainly a must if you make your data publicly available in some form.

Be aware that it may still be possible to trace back individual employees. For example, if you look up a concrete case based on the case ID in the operational system, you will see the actual resource names there.

Finally, keep in mind that anonymizing employee names for an internal process mining analysis also removes valuable information. For example, if you identify process deviations or an interesting process pattern, normally the first step is to speak with the employees who were involved in this case to understand what happened and learn from them. 


Case ID

Anonymizing the case ID is a must if it contains sensitive information. For example, if you analyze the income tax return process at the tax office, then the case ID will be a combination of the social security number of the citizen and the year of the tax declaration. You will have to replace the social security information for obvious reasons.

However, for data sets where the case ID is less sensitive it is a good idea to keep it in place as it is. The benefit will be that you can look up individual cases in the operational system to verify your analysis or obtain additional information. Losing this link will limit your ability to perform root cause analyses and take action on the process problems that you discover. 
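If you do need to anonymize such a case ID, one option is to replace only the sensitive part and keep the rest. Here is a minimal sketch for the tax example above, assuming a hypothetical “SSN-YEAR” format and made-up file and column names:

```python
import itertools
import pandas as pd

log = pd.read_csv("tax_returns.csv")  # hypothetical file and column names

counter = itertools.count(1)
ssn_mapping = {}

def pseudonymize_case_id(case_id):
    # Assumed case ID format "SSN-YEAR": replace the SSN part, keep the year
    ssn, year = case_id.rsplit("-", 1)
    if ssn not in ssn_mapping:
        ssn_mapping[ssn] = f"Citizen {next(counter)}"
    return f"{ssn_mapping[ssn]}-{year}"

log["Case ID"] = log["Case ID"].astype(str).map(pseudonymize_case_id)
```

The year stays visible, and all tax returns from the same citizen still map to the same “Citizen n”, so you can analyze them across years without exposing the social security number.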


Activity name

Normally, you would not anonymize the activity name itself. The activities are the process steps that appear in the process map and in the variant sequences in the Process Mining tool. The reason why you do not want to replace the activity names by, for example, “Activity 1”, “Activity 2”, “Activity 3”, etc., is that most processes become very complex very quickly and without the activity names you have no chance to build a mental model and understand the process flows you are analyzing. Your analysis becomes useless.

Keeping the activity names in full is usually not a problem, because they describe a generic process step (like “Email sent”). However, especially if you have many different activity names in your data, you should review them to ensure they contain no confidential information (e.g., “Email sent by lawyer X”).

Other Attributes

Sensitive information is often contained in additional attribute columns. For example, even if you are analyzing an internal ordering process, there might be additional data fields revealing information about the customer.

You can either completely remove data columns that you don’t need, or you can anonymize their values. Keep the attribute columns that are not sensitive in their original form, because they can contain important context information when you inspect individual cases during your Process Mining analysis.

Finally, be aware that sensitive information can also be hidden in a ‘Notes’ attribute or some other kind of free-text field, where the employees write down additional information about the case or the process step. Simply anonymizing such a free-text field would be useless, because the whole text would be replaced by “Value 1”, “Value 2”, etc. To preserve the usefulness of the free-text field while removing sensitive information requires more work in the data pre-processing step and is not something that process mining tools can do for you automatically. 
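As a starting point for such a pre-processing step, you can redact known sensitive patterns in the free-text column while keeping the rest of the text. A minimal sketch, assuming a hypothetical ‘Notes’ column and an illustrative list of names to remove:

```python
import re
import pandas as pd

log = pd.read_csv("event_log.csv")  # hypothetical file and column names

# Names that must not appear in the free-text field (illustrative list)
sensitive_names = ["Mary Jones", "Fred Smith"]

def redact(text):
    if pd.isna(text):
        return text
    # Remove email addresses
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[email]", text)
    # Remove the known sensitive names
    for name in sensitive_names:
        text = text.replace(name, "[name]")
    return text

log["Notes"] = log["Notes"].apply(redact)
```

In practice, you will need to tailor the patterns to your data (customer numbers, phone numbers, etc.), and a manual review of a sample of the redacted texts is well worth the effort.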


Timestamps

Sometimes, the time at which a particular activity happened already reveals too much information and would make it possible to identify one of your business entities in an unwanted way. In such situations, you can anonymize the timestamps by applying an offset. This means that a certain number of days, hours, and minutes will be added to the actual timestamps to create new (now anonymized) timestamps.
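Here is a minimal sketch of such an offset in a pre-processing step. The file and column names are made up, and the offset is arbitrary, but it must be the same for all rows:

```python
import pandas as pd

log = pd.read_csv("event_log.csv")  # hypothetical file and column names

# Shift all timestamps by the same fixed offset, so that durations,
# waiting times, and the order of events are preserved
offset = pd.Timedelta(days=37, hours=5, minutes=13)

for col in ["Start Timestamp", "Complete Timestamp"]:
    log[col] = pd.to_datetime(log[col]) + offset

log.to_csv("event_log_shifted.csv", index=False)
```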

Keep in mind that some of the process patterns may change when you analyze data sets with anonymized timestamps. For example, you might see activities appear at other times of the day than in the original data set. For this reason, timestamp anonymization is mostly used when data sets are prepared for public release, not when you analyze a process within your own company.

Privacy, Security and Ethics in Process Mining — Part 2: Responsible Handling of Data

This is the 2nd article in our series on privacy, security and ethics in process mining. You can find an overview of all articles in the series here.

As with any other data analysis technique, you must be careful with the data once you have obtained it. In many projects, nobody thinks about data handling until it is brought up by the security department. Be the person who thinks about the appropriate level of protection and has a clear plan in place before the data is even collected.

Do:

Don’t:

Privacy, Security and Ethics in Process Mining — Part 1: Clarify Your Goal

[This article previously appeared in the Process Mining News – Sign up now to receive regular articles about the practical application of process mining.]

When I moved to the Netherlands 12 years ago and started grocery shopping at one of the local supermarket chains, Albert Heijn, I initially resisted getting their Bonus card (a loyalty card for discounts), because I did not want the company to track my purchases. I felt that using this information would help them to manipulate me by arranging or advertising products in a way that would make me buy more than I wanted to. It simply felt wrong.

The truth is that no data analysis technique is intrinsically good or bad. It is always in the hands of the people using the technology to make it productive and constructive. For example, while supermarkets could use the information tracked through the loyalty cards of their customers to make sure that we have to take the longest route through the store to get our typical items (passing by as many other products as possible), they can also use this information to make the shopping experience more pleasant, and to offer more products that we like.

Most companies have started to use data analysis techniques to analyze their data in one way or the other. These data analyses can bring enormous opportunities for the companies and for their customers, but with the increased use of data science the question of ethics and responsible use also grows more dominant. Initiatives like the Responsible Data Science seminar series1 take on this topic by raising awareness and encouraging researchers to develop algorithms that have concepts like fairness, accuracy, confidentiality, and transparency built in2.

Process Mining can provide you with amazing insights about your processes, and fuel your improvement initiatives with inspiration and enthusiasm, if you approach it in the right way. But how can you ensure that you use process mining responsibly? What should you pay attention to when you introduce process mining in your own organization?

In this article series, we provide you with four guidelines that you can follow to prepare your process mining analysis in a responsible way.

1. Clarify Goal of the Analysis (this article)
2. Responsible Handling of Data
3. Consider Anonymization
4. Establish a Collaborative Culture

1. Clarify Goal of the Analysis

The good news is that in most situations Process Mining does not need to evaluate personal information, because it usually focuses on the internal organizational processes rather than, for example, on customer profiles. Furthermore, you are investigating the overall process patterns. For example, a process miner is typically looking for ways to organize the process in a smarter way to avoid unnecessary idle times rather than trying to make people work faster.

However, as soon as you would like to better understand the performance of a particular process, you often need to know more about other case attributes that could explain variations in process behaviours or performance. And people might become worried about where this will leave them.

Therefore, already at the very beginning of the process mining project, you should think about the goal of the analysis. Be clear about how the results will be used. Think about what problem you are trying to solve and what data you need to solve this problem.

Do:

Don’t:


  1. Responsible Data Science (RDS) initiative: http://www.responsibledatascience.org  
  2. Watch Wil van der Aalst’s presentation on Responsible Data Science at Process Mining Camp 2016: https://www.youtube.com/watch?v=ewQbmINuXeU  
Meet The Process Miners of the Year 2017!

At the end of Process Mining Camp this year, we had the pleasure to hand out the annual Process Miner of the Year award for the second time. Carmen Lasa Gómez (left on the photo at the top) from Telefónica received the award on behalf of her co-author Javier García Algarra (middle on the photo at the top) and the whole team.

Congratulations to the team at Telefónica!

The winning contribution from the Telefónica team was a case study about how they discovered operational drifts in their IT service management processes with process mining. Operational drifts are slow changes in the informal culture of groups that are not dramatic enough to produce a sharp impact on quality of service. They are not easy to detect, even for experienced analysts, because they do not change the overall process map.

Learn more about how Carmen and Javier managed to discover these operational drifts in the case study here.

To signify the achievement of winning the Process Miner of the Year award, we commissioned a one-of-a-kind trophy. The Process Miner of the Year 2017 trophy is sculpted from two joined, solid blocks of plum and robinia wood, signifying the raw log data used for process mining. A vertical copper inlay points to the value that process mining can extract from that log data, like a lode of ore embedded in the rocks of a mine.

It’s a unique piece of art, and we could not think of a better reminder of the wonderful possibilities that process mining opens up for all of us every day.

Become the Process Miner of the Year 2018!

There are now so many more applications of process mining than there were just a few years ago. With the Process Miner of the Year competition, we want to stimulate companies to showcase their greatest projects and get recognized for their success.

Will you be the Process Miner of the Year 2018? Learn more about how to submit your case study here!

Data Quality Problems In Process Mining And What To Do About Them — Missing Complete Timestamps for Ongoing Activities

This is the 13th article in our series on data quality problems for process mining. You can find an overview of all articles in the series here.

If you have ‘start’ and ‘complete’ timestamps in your data set, then you can sometimes encounter situations where the ‘complete’ timestamp is missing for those activities that are currently still running.

For example, take a look at the data snippet below. Two process steps were performed for case ID 1938. The second activity that was recorded for this case is ‘Analyze Purchase Requisition’. It has a ‘start’ timestamp, but the ‘complete’ timestamp is empty because the activity has not yet completed (it is ongoing).

Missing Complete Timestamp

In principle, this is not a problem. After importing the data set, you can simply analyze the process map and the variants, etc., as you would usually do. When you look at a concrete case, the activity duration for activities that have not yet completed is shown as “instant” (see the history for case ID 1938 in the screenshot below).

Activity duration is instant

However, this does become a problem when you analyze the activity duration statistics (see screenshot below). The “instant” activity durations influence the mean and the median duration of the activity. So, you want to remove the activities that are still ongoing from the calculation of the activity duration statistics.

The activity duration statistics are affected by this

How to fix:

  1. Import your data set again and only configure the complete timestamp as a ‘Timestamp’ column (keep the start timestamp column as an attribute via the ‘Other’ configuration). This will remove all events where the complete timestamp is missing.
  2. Export your data set as a CSV file and import it again into Disco, now with both the start and the complete timestamp columns configured as ‘Timestamp’ column.

Your activity duration statistics will now only be based on those activities that actually have both a start and a complete timestamp.
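If you prepare your data outside of Disco anyway, an alternative to the re-import steps above is to drop the ongoing events before importing. A minimal sketch, with made-up file and column names:

```python
import pandas as pd

log = pd.read_csv("event_log.csv")  # hypothetical file and column names

# Keep only events that have both a start and a complete timestamp,
# so that ongoing activities do not distort the duration statistics
finished = log.dropna(subset=["Start Timestamp", "Complete Timestamp"])

finished.to_csv("event_log_finished_only.csv", index=False)
```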

Dealing With Parallelism in Your Process Maps

Last week, we saw how you can differentiate between active time and passive time if you have a start and end timestamp in your data set.

If you do have a start and end timestamp in your data, it can also happen that some of the activities are running at the same time. Disco detects parallelism if two activities overlap in time (see illustration below).

In the example above, you can see that activity C starts two hours before activity B has ended. Therefore, both activities are shown in parallel in the process map (see left at the top). You can also see that, for processes with parallel activities, the frequencies do not add up to 100% anymore. For example, after activity A both the path to activity B and the path to activity C are followed, and their frequencies (1 + 1) do not add up to the frequency of the previous activity as they would if there was a choice between them.1
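The overlap rule itself is easy to express. The following sketch illustrates the check for the example above (this is just an illustration of the rule, not how Disco implements it internally):

```python
from datetime import datetime

def overlaps(start_a, end_a, start_b, end_b):
    # Two activities run in parallel if each starts before the other has ended
    return start_a < end_b and start_b < end_a

b_start = datetime(2017, 6, 1, 9, 0)
b_end   = datetime(2017, 6, 1, 14, 0)
c_start = datetime(2017, 6, 1, 12, 0)  # C starts two hours before B ends
c_end   = datetime(2017, 6, 1, 16, 0)

print(overlaps(b_start, b_end, c_start, c_end))  # True -> shown in parallel
```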

Furthermore, the waiting times in the process are now calculated with respect to the previous activities — not the ones that are running in parallel (see top right).

If you have a parallel process, then this is typically what you want. For example, the screenshot below shows a project management process.

You can see that there are several milestones in the process, such as ‘Install in test environment’. To reach a milestone in this process, several activities need to be completed beforehand, but they can be completed in parallel. In the example below we can see that not all the parallel activities are always performed. For example, a ‘Project risk review’ has only been done for 11 out of the 120 cases.

When you switch to the performance view for this process, you can analyze the times of the different parallel paths to perform a Critical Path Analysis. A critical path analysis is only applicable for parallel processes and allows you to see which of the parallel branches, if delayed, would also delay the next milestone.

Challenges with Parallel Processes

In most situations, if you have parallelism in your process, this is exactly what you want to see. However, there can be some problems related to parallelism as well. For example:

Fortunately, if you find yourself in one of these situations, there is a simple way to get around the parallelism problem: You can import your data set again and configure only one of your timestamps as a ‘Timestamp’ column in Disco (you can keep the other one as an attribute). If you have only one timestamp configured, Disco always shows you a sequential view of your process. Even if two activities have the same timestamp they are shown in sequence with ‘instant’ time between them.

Looking at a sequential view of your process is a great way to investigate the process map and the process variants without being distracted by parallel process parts. You can then always go back and import the data with two timestamps again if you want to analyze the activity durations and the parallel flows.


  1. If you run the animation for this process, you will also see that one token splits into two tokens for the parallel part of the process and then they merge again.  
Understanding the Meaning of Your Timestamps

In earlier articles of this series, we already discussed how you can change your perspective on the process by configuring your case ID and activity columns differently during the import step, by combining multiple case ID fields, and by bringing additional attribute dimensions into your process view.

All of these articles were about changing how you interpret your case and your activity fields. But you can also create different perspectives with respect to the third data requirement for process mining — your timestamps.

There are two things that you need to keep in mind when you look at the timestamps in your data set:

1. The Meaning of Your Timestamps

Even if you have just one timestamp column in your data set, you need to be really clear about what exactly the meaning of these timestamps is. Does the timestamp indicate that the activity was started, scheduled or completed?

For example, if you look at the following HR process snippet then it looks like the ‘Process automated’ step is a bottleneck: 4.8 days median delay are shown at the big red arrow (see screenshot below).1

However, the timestamps in this data set actually indicate that an activity has become available in the HR workflow tool. This means that, at the moment one activity is completed, the next activity is automatically scheduled (and the timestamp is recorded for the newly scheduled activity).

This shifts the interpretation of the bottleneck back to the activity ‘Control request’, which is a step that is performed by the HR department: At the moment that the ‘Control request’ activity was completed, the ‘Process automated’ step was scheduled. So, the big red path shows us the time from when the step ‘Control request’ became available until it was completed.

You can see how knowing that the timestamp in the data set has the meaning of ‘scheduled’ rather than ‘completed’ shifts the interpretation of which activity is causing the delay from the target activity (the activity that the path is going to) to the source activity (the activity from which the path is starting out).

2. Multiple Timestamp Columns

If you have a start and a complete timestamp column in your data set, then you can include both timestamps during your data import and distinguish active and passive time in your process analysis (see below).

However, sometimes you have even more than two timestamp columns. For example, let’s say that you have a ‘schedule’, a ‘start’ and a ‘complete’ timestamp for each activity. In this case you can choose different combinations of these timestamps to take different perspectives on the performance of your process.

For the example above, you have three options:

Option a: Start and Complete timestamps

If you choose the ‘start’ and ‘complete’ timestamps as Timestamp columns during the import step, you will see the time between ‘start’ and ‘complete’ as the activity duration and the times between ‘complete’ and ‘start’ as the waiting times in the performance view (see above).

Option b: Schedule and Complete timestamps

If you choose the ‘schedule’ and ‘complete’ timestamps as Timestamp columns during the import step, you will see the time between ‘schedule’ and ‘complete’ as the activity duration and the times between ‘complete’ and ‘schedule’ as the waiting times in the performance view (see above). So, it shows the time between when an activity became available until it was completed rather than focusing on the time that somebody was actively working on a particular process step.

Option c: Schedule and Start timestamps

If you choose the ‘schedule’ and ‘start’ timestamps as Timestamp columns during the import step, you will see the time between ‘schedule’ and ‘start’ as the activity duration and the times between ‘start’ and ‘schedule’ as the waiting times in the performance view (see above). Here, the activity durations show the time between when an activity became available until it was started.

All of these views can be useful and you can import your data set in different ways to take these different views and answer your analysis questions.
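To make the difference between the three options concrete, here is a small worked example with made-up timestamps for a single activity instance:

```python
from datetime import datetime

# Assumed timestamps for one activity instance
scheduled = datetime(2017, 6, 1, 9, 0)    # activity became available
started   = datetime(2017, 6, 1, 11, 30)  # somebody started working on it
completed = datetime(2017, 6, 1, 15, 0)   # activity was completed

# Option a: 'start' + 'complete' -> time actively worked on the step
print(completed - started)    # 3:30:00

# Option b: 'schedule' + 'complete' -> time from available until completed
print(completed - scheduled)  # 6:00:00

# Option c: 'schedule' + 'start' -> time from available until picked up
print(started - scheduled)    # 2:30:00
```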

Conclusion

Timestamps are really important in process mining, because they determine the order of the event sequences on which the process maps and variants are based. And they can bring all kinds of problems (see also our series on data quality problems for process mining here).

But the meaning of your timestamps also influences how you should interpret the durations and waiting times in your process map. So, in summary:


  1. Learn more about how to perform a bottleneck analysis with process mining here.  
Combining Attributes into Your Process View

Previously, we discussed how you can take different perspectives on your data by choosing what you want to see as your activity name, case ID, and timestamps.

One of the ways in which you can take different perspectives is to bring an additional dimension into your process map by combining more than one column into the activity name. You can do this in Disco by simply configuring more than one column as ‘Activity’ (learn how to do this in the Disco user guide here).

By bringing in an additional dimension, you can “unfold” your process map in a way that does not only show which activities took place in the process, but also in which department, for which problem category, or in which location the activity took place. For example, by bringing in the agent position from your callcenter data set you can see which activities took place in the first level support team and differentiate them from the steps that were performed by the backoffice workers, even if the activity labels for their tasks are the same.
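If you prefer to prepare this in the data itself rather than configuring two ‘Activity’ columns in Disco, you can achieve the same effect by concatenating the columns before import. A minimal sketch, with made-up file and column names based on the callcenter example:

```python
import pandas as pd

log = pd.read_csv("callcenter_log.csv")  # hypothetical file and column names

# Combine the process step with the agent position into one activity label,
# e.g. "Inbound Call - 1st Level" vs. "Inbound Call - Backoffice"
log["Activity"] = log["Activity"] + " - " + log["Agent Position"]

log.to_csv("callcenter_log_unfolded.csv", index=False)
```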

You can experiment with bringing in all kinds of attributes into your process view. When you do this, you can observe two different effects.

1. Comparing Processes

When you bring in a case-level attribute that does not change over the course of the case, you will effectively see the processes for all values of your case-level attribute next to each other — in the same process map. For example, the screenshot below shows a customer refund process for both the Internet and the Callcenter channel next to each other.

Seeing two or more processes side by side in one picture can be an alternative to filtering the process in this dimension. Of course, you can still apply filters to only compare a few of the processes at once.

2. Unfolding Single Activities

When you have an attribute that is only filled for certain events, then bringing this attribute into your activity name will only unfold the activities for which it is filled.

For example, a document authoring process may consist of the steps ‘Create’, ‘Update’, ‘Submit’, ‘Approve’, ‘Request rework’, ‘Revise’, ‘Publish’, and ‘Discard’ (performed by different people such as authors and editors). Imagine that in this document authoring process, you have additional information in an extra column about the level of required rework (major vs. minor) in the ‘Request rework’ step.

If you just use the regular process step column as your activity, then ‘Request rework’ will show up as one activity node in your process map (see image below).

However, if you include the ‘Rework type’ attribute in the activity name, then two different process steps ‘Request rework – major’ and ‘Request rework – minor’ will appear in the process map (see below).

This can be handy in many other processes. For example, think of a credit application process that has a ‘Reject reason’ attribute that provides more information about why the application was rejected. Unfolding the ‘Reject’ activity in the ‘Reject reason’ dimension will enable you to visualize the different types of rejections right in the process map in a powerful way.

Conclusion

So, it is worth thinking about how you can best structure your attribute data already while you are preparing your data set.

As a rule of thumb:

Combining Multiple Columns as Case ID

In a previous article, we discussed how you can take different perspectives on your data by choosing what you want to see as your activity name, case ID, and timestamps.

One of the examples was about changing the perspective of what we see as a case. The case determines the scope of the process: Where does the process start and where does it end?

You can think of a case as the streaming object that is moving through the process. For example, the travel ticket in the picture above might go through the steps ‘Purchased’, ‘Printed’, ‘Scanned’ and ‘Validated’. If you want to look at the process flow of travel tickets, you would choose the travel ticket number as your case ID.

In the previous article we saw how you can change the focus from one case ID to another. For example, in a call center process you can look at the process from the perspective of a service request or from the perspective of a customer. Both are valid views and offer different perspectives on the same process.

Another option you should keep in mind is that, sometimes, you might also want to combine multiple columns into the case ID for your process mining analysis.

For example, if you look at the callcenter data snippet below then you can see that the same customer contacts the helpdesk about different products. So, while we want to analyze the process from a customer perspective, perhaps it would be good to distinguish those cases for the same customer?

Let’s look at the effect of this choice based on the example. First, we only use the ‘Customer ID’ as our case ID during the import step. As a result, we can see that all activities that relate to the same customer will be combined in the same case (‘Customer 3’).

If we now want to distinguish cases, where the same customer got support on different products, we can simply configure both the ‘Customer ID’ and the ‘Product’ column as case ID columns in Disco (you can see the case ID symbol in the header of both columns in the screenshot below):

The effect of this choice is that both fields’ values are concatenated (combined) in the case ID value. So, instead of one case ‘Customer 3’ we now get two cases: ‘Customer 3 – MacBook Pro’ and ‘Customer 3 – iPhone’ (see below).

There are many other situations, where combining two or more fields into the case ID can be necessary. For example, imagine that you are analyzing the processing of the tax returns at the tax office. Each citizen is identified by a unique social security number. This could be the case ID for your process, but if you have data from multiple years then you also need the year to separate the returns from the same citizen across the years.

To create a unique case identifier, you can simply configure all the columns that should be included in the case ID as a ‘Case’ column like shown above, and Disco will automatically concatenate them for the case ID.
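The same concatenation can also be prepared in the data itself before the import. A minimal sketch for the tax office example, with made-up file and column names:

```python
import pandas as pd

log = pd.read_csv("tax_returns.csv")  # hypothetical file and column names

# Combine the social security number and the tax year into one case ID,
# so that returns from the same citizen are kept apart per year
log["Case ID"] = (
    log["Social Security Number"].astype(str) + "-" + log["Year"].astype(str)
)

log.to_csv("tax_returns_with_case_id.csv", index=False)
```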

As before, there is not one right and one wrong answer about how you should configure your data import but it depends on how you want to look at your process and which questions you want to answer. Often, you will end up creating multiple views and all of them are needed to get the full picture.
