Process Mining Transformations — Part 7: Domain Specific Transformations


This is the 7th article in our series on typical process mining data preparation tasks. You can find an overview of all articles in the series here.

In earlier editions of this series, we have shown some common data transformations that come up again and again.

However, most data transformation tasks are domain-specific. The data comes in a certain form and to truly answer your analysis question you need to re-shape the data a little bit. Fran Batchelor from the UW Health Nursing Informatics Department allowed me to share one of her data transformation examples from a recent analysis to illustrate this point.

In this example, one department at the hospital has a dedicated block of time in a surgery room that they can use for their surgeries. If no surgeries are planned by them, then the block opens up to other departments at the hospital as well.

If, on the other hand, they want to schedule a surgery and no room is available, then they have to wait for a room to open up. The perception among the staff of the department with the dedicated block of time was that this happened frequently. The general feeling was that they “can’t get their cases on the schedule” and that they “need more block time”.

Fran’s analysis was aimed at confronting this gut feeling with actual data. She wanted to look at how often it actually happened that surgeries had to wait and why.

The data came from three different data sources that she had already combined into an event log. She shared an anonymized sample of 15 cases for the purpose of illustration in this article. You can see the events from one single case in the screenshot below.

Figure 1: Data snippet of one case (Log ID 1)

As we can see in the data snippet above, a surgery date was requested four times (see ‘Surgery Date Requested’ activities). In between, several re-planning activities happened (see ‘Sched Into Room’ activities). Ultimately, the surgery took place on 4 August 2020. We can see that it took longer than planned: The ‘Sched End’ activity indicates that the surgery was scheduled until 13:15. But the ‘Out Room’ activity shows that the surgery actually ended at 15:44.

This data would be suitable to analyze how many surgeries take longer than scheduled. However, for the purpose of analyzing the availability of the blocked rooms in the operational flow of the department, this process view is too high-level. For example, the ‘Surgery Date Requested’ activity does not show whether a room was immediately available nor into which room it was scheduled (see screenshot below - click on the image to see a larger version).

Figure 2: Initial process view with the main activities

The data set contains the availability information in an additional column called ‘Z Rm & Block Status’ (see screenshot below). When the value is ‘Block’, there was availability in the dedicated block of the department. This is the ideal situation. When the value is ‘Unblocked OR Room’, there was no space in the block but another operating room was available. When the value is ‘Z Room’, no room was available at all. In that case, the surgery was scheduled into a virtual holding room, which does not physically exist and only serves as a waiting position until a real room can be assigned.

However, Fran wants to analyze the availability at the moment that the surgery date is requested. The data is not in a form that is immediately usable to answer this question, because the ‘Z Rm & Block Status’ information is only attached to the ‘Sched Into Room’ events — not the ‘Surgery Date Requested’ events (see red highlighting). What she needs for her analysis is the ‘Z Rm & Block Status’ information of the first ‘Sched Into Room’ activity after the ‘Surgery Date Requested’ activity (highlighted in green below).

Figure 3: The 'Z Rm & Block Status' field shows whether a room was available or not (but not for the 'Surgery Date Requested' activity)

To make this information available, she adds two new columns to the data set. In the first additional column, the ‘Z Rm & Block Status’ information is combined with the location information (see the Excel formula and the resulting values in the yellow fields in the screenshot below).

Figure 4: The first additional column combines the 'Z Rm & Block Status' information with further location information

The second additional column then takes this newly combined value from the previous column and makes it available for the ‘Surgery Date Requested’ events, where it is needed (see the Excel formula and the resulting values in the orange fields below).

Figure 5: The second additional column attaches the newly combined field to each previous 'Surgery Date Requested' event
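The two Excel formulas above can also be sketched in code. The following Python sketch (column names, values, and the helper name are illustrative, not the exact ones from the original data set) first combines the status with the location information and then carries the combined value of the first following ‘Sched Into Room’ event back to the preceding ‘Surgery Date Requested’ event within the same case.

```python
# Sketch of Fran's two-step transformation. Column names and sample
# values are assumptions for illustration purposes only.
def transform(rows):
    # Step 1: combine 'Z Rm & Block Status' with the location information.
    for row in rows:
        if row["Activity"] == "Sched Into Room":
            row["Status+Location"] = f'{row["Z Rm & Block Status"]} ({row["Location"]})'
        else:
            row["Status+Location"] = ""
    # Step 2: attach the combined value of the first following
    # 'Sched Into Room' event to each 'Surgery Date Requested' event
    # (within the same case).
    for i, row in enumerate(rows):
        if row["Activity"] == "Surgery Date Requested":
            for later in rows[i + 1:]:
                if later["Case ID"] != row["Case ID"]:
                    break
                if later["Activity"] == "Sched Into Room":
                    row["Status+Location"] = later["Status+Location"]
                    break
    return rows

rows = [
    {"Case ID": "1", "Activity": "Surgery Date Requested",
     "Z Rm & Block Status": "", "Location": ""},
    {"Case ID": "1", "Activity": "Sched Into Room",
     "Z Rm & Block Status": "Z Room", "Location": "OR 12"},
]
transform(rows)
print(rows[0]["Status+Location"])  # Z Room (OR 12)
```

The important part is the second step: the availability status lives on the ‘Sched Into Room’ events, so it has to be looked up from the next such event and attached to the ‘Surgery Date Requested’ event, where the analysis needs it.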

When you create new data fields via formulas in Excel as shown above, then keep in mind that you need to save the file as a CSV file before you import the extended data set into Disco again. Otherwise the values that have been created by the formulas are not visible.1

During the import step, Fran can now select the two new columns together with the original ‘Activity’ column as the activity name (see screenshot below).

Figure 6: The high-level activity name can be combined with the detailed room status information

As a result, both the ‘Sched Into Room’ and the ‘Surgery Date Requested’ activities are now “unfolded” based on their availability status. This provides a much more detailed view of the scheduling flow in the process (see screenshot below).
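Conceptually, selecting multiple columns as the activity name amounts to concatenating them per event. A minimal sketch (separator and names are illustrative, not necessarily what Disco does internally):

```python
# Sketch: "unfolding" an activity by concatenating the activity name
# with the room status attribute, when one is present.
def unfold(activity, status):
    return f"{activity} - {status}" if status else activity

print(unfold("Surgery Date Requested", "Block"))
# Surgery Date Requested - Block
```

Because the status is kept in its own column, the same data set can still be imported with the plain activity name whenever the high-level view is the better fit.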

Figure 7: The scheduling and re-planning activities now show whether a room was available or not

As one of the analyses, Fran can now focus on the ‘Surgery Date Requested’ activities to see how often rooms were available at the moment of the initial surgery request. An Attribute filter can be used to filter only ‘Surgery Date Requested’ activities for an even more focused view.

Figure 8: Focusing on 'Surgery Date Requested' activities

This view shows that new surgery dates were requested frequently even if the first request could be scheduled in the block time of the department right away (see red mark-up in screenshot below). These re-scheduling requests were often initiated by the staff or patients themselves and not due to the non-availability of the surgery rooms.

The resulting picture that emerged from the analysis showed that the reality was more complicated than the staff of the department with the dedicated block time initially thought. Based on Fran’s analysis, they could align their perception with the reality of the process. It became clear that being able to secure more block time would not necessarily solve the problem of the frequent re-scheduling and re-planning of the surgeries.

Figure 9: Surgeries are re-scheduled frequently even if the initial request could be accommodated by the block of the department

There are many different ways in which the data could have been transformed to get to the same result. The scenario above is just an example.

What is important to realize is that the data is not fixed. You as the process mining analyst need to think about how exactly you need your data set to be to answer the questions for your analysis (and to communicate with the people who work in this process).

Furthermore, there is almost never just one view that can be used to answer all your questions. Instead, different views are needed to answer different questions, and sometimes the data needs to be transformed in different ways as well.2

  1. Copying the columns with the Paste as Values option is another alternative. ↩︎

  2. While multiple domain-specific transformations may be needed, general concepts still apply. For example, it is generally a good idea to put additional attributes into separate columns, so that you can leverage them for filtering or for unfolding individual activities in the most flexible way. ↩︎

New Knowledge About Old Systems

Last week’s Process Mining Café was all about legacy systems. Two veterans of the trade, Steve Kilner and Derek Russell, joined us for a discussion on the role process mining can play in better understanding, and improving, legacy systems. You can now watch the recording here.

Our session started with a primer on what legacy systems are in the first place: Old systems that are often poorly understood and critical at the same time. Process mining can help to understand both how these systems are actually used, as well as how the processes that are run on them can be improved.

We also discussed the different approaches to get process mining data from these old systems: Some may have existing logs that can be used. Other setups need to be instrumented or otherwise observed.

Here are the links that we mentioned during the session:

Thanks again, Derek and Steve, for joining us!

Process Mining Café 4: Mining Legacy Systems


Our Process Mining Café sessions already start to feel like a tradition. Join us again next Wednesday 24 February, at 16:00 CET! (Check your own timezone here).

There is no registration required. Simply point your browser to when it is time. You can watch the café, and share your thoughts and questions while we are on the air, right there on the site.

This time, we are all about process mining in legacy systems.

Legacy systems are old, often mission-critical systems that can cause quite some headaches for their owners. Replacing these old systems is not easy, precisely because so much knowledge has been poured into them. And because the developers who built them are often long gone.

Process mining can help to understand how these systems are used. We have invited Derek Russell, who wrote about legacy system mining on our blog last week, and Steve Kilner, who dove into the topic already many years ago. Derek and Steve know all about legacy systems, and we will be talking about the different approaches to legacy system mining.

Tune in live for the Process Mining Café next week! Add the time to your calendar to make sure you don’t miss it. Or sign up for the café mailing list here if you want to be reminded one hour before the session starts.

Disco 2.11

Software Update

We are happy to announce that we have just released Disco 2.11.

We recommend that you update at your earliest convenience. Like every release of Disco, this update fixes a number of bugs and improves the general performance and stability.

This release marks a big step forward for the Airlift integration in Disco, with better performance, improved reliability, and a smoother user experience all around. If you pull your log data into Disco via Airlift, this will be a solid upgrade – and if you don’t, this may be a great time to start thinking about it.

A long-lost, dearly beloved, and oft-requested crowd favorite makes a triumphant return: The process map will now remain centered around the mouse pointer when you zoom via the mouse wheel. Just a little goodie that got lost in our transition to multi-touch gestures way back when, and we’re all happy to have it back.

Even if you’re not airlifting, and barely zooming, this update, as always, brings increased performance, the demise of many bugs, and lots of small improvements and fixes all over the place.

Thank you for using Disco! We love hearing about what you do with it, and what you like and don’t like about it, so keep your feedback coming!

How to update

Disco will automatically download and install this update the next time you run it, if you are connected to the internet.

If you prefer to install this update of Disco manually, you can download and run the updated installer packages from


  • Airlift:
    • UI fixes.
    • Improved import performance.
    • Smoother and more resilient client experience.
    • Keep bookmarks of recent connections (optional).
    • More consistent experience for connections with self-signed certificates.
  • Process Map: Keep the mouse pointer centered when zooming via mouse wheel.
  • CSV Import: Improved performance and stability.
  • Workspace: Ensure safe recovery from a corrupted workspace.
  • Octane: Fixed an issue where some case IDs could be truncated.
  • UI: Refined graph transitions.
  • Platform: Java update (Requires manual install).
Analyzing Legacy Systems with Process Mining

IBM 360

This is a guest article by Derek Russell from Objektum Modernization Ltd. You can find an extended version of this article here. If you have a process mining case study that you would like to share as well, please get in touch with us at

Legacy systems are old systems that often support particularly important processes in an organization. At the same time, precisely because they are so old, the inner workings of these systems are typically poorly understood. This makes them hard to adapt or replace altogether.

There have been previous examples where process mining was used to understand the behavior of a legacy system. However, in these examples there was existing log data that could be analyzed. What do you do if your legacy system does not provide any suitable event log data at all?

This is where the following approach can help: We can create a new logging capability in the legacy system by combining model generation and instrumentation of software code. Here is how it works.

Example: Hotel management system

Let us look at the example of a hotel management system. The system is used by the hotel reception to create new reservations, check in and check out guests, and to keep records of the food and beverages for billing. Figure 1 shows a screenshot of the current desktop application.

Figure 1: Screenshot of the hotel management application

The hotel management wants to extend or replace the system with the goal to let guests make online reservations in the future. When we set out to modernize a system, we need to first fully understand how the existing system is used to make sure that all the important functionalities are covered in our redesign. Unfortunately, there is limited knowledge and documentation available for the hotel management system.

Therefore, we want to use process mining to understand the different scenarios of the current reservation and billing processes. However, the system creates no usage logs at the moment. All we have is the C# source code and the data model in the SQL database.

Step 1: Generate the static model

To create the logging that is required for process mining, we start with the SQL database that stores all the records. Its so-called data model describes the tables, relations, fields, and field types. This description can be extracted from the database as a SQL schema, which is then translated into objects with attributes and relationships. For example, a customer has a first name, a last name, and a reservation from entry day to departure day (see Figure 2 below).

Figure 2: Generate static model from data model and source code

This model is then extended by parsing the source code (this can be done for virtually any programming language) to add the classes, attributes, and methods. The result is the so-called ‘static model’, which gives an overview of all the components in the system.
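To make the idea of the static model concrete, here is a small sketch of the kind of objects that could be derived from the hotel system’s schema. The class and field names are assumptions based on the customer example above, not the actual generated model:

```python
from dataclasses import dataclass
from datetime import date

# Illustrative objects as a static model might derive them from the
# SQL schema: tables become classes, columns become attributes, and
# foreign keys become relationships.
@dataclass
class Customer:
    first_name: str
    last_name: str

@dataclass
class Reservation:
    customer: Customer       # relationship to the Customer table
    entry_day: date
    departure_day: date

r = Reservation(Customer("Ada", "Lovelace"), date(2021, 2, 1), date(2021, 2, 5))
print((r.departure_day - r.entry_day).days)  # 4
```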

Step 2: Generate the dynamic model

The static model shows the information that is processed, but not the order in which this is done. Software code is composed of classes, which represent objects and their properties, and methods, which provide the behavior of the system. However, the static model does not describe the order in which the methods are invoked.

Figure 3: Generate dynamic model by simulating source code

To gain an understanding of the dependencies between the methods, it is necessary to record and analyze the dynamic execution of the software.

To achieve this, we instrument the source code to enable the logging of program flow during normal usage of the application. This results in a log from which UML sequence diagrams are generated. These sequence diagrams now describe the flow of the methods that are invoked at each object. This ‘dynamic model’ is not a business process but the sequence of methods related to one use case.

Step 3: Extend the dynamic model

For the process mining data we need information about what the case ID and the activity names are. In the dynamic model, we can define the activities by selecting which methods define the start or end of an activity. The model is extended by tagging the methods in the sequence diagram to define when to log what. Note that no code is changed, only properties in the model are set.

Figure 4: Extend the dynamic model with logging by tagging methods

At this point, the case ID and further attributes from the static model can also be selected to be included in the logging. For example, the reservation number or customer number can be added as the case ID. One of the advantages is that you can start small, with minimal impact on the application, and add more information by repeating steps 3, 4, and 5.

Step 4: Instrumentation, build, deploy and run

In the next step, we automatically re-generate the application by combining the original source code with the logging directives from the sequence diagrams. This is referred to as instrumentation. The original source code itself is not touched: only the logging behavior described by the tags in the sequence diagrams is added. This is important, because we do not want to change anything else in the system’s behavior.
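The effect of instrumenting a tagged method can be illustrated with a small sketch. The real approach works on the generated model and the C# source, but conceptually each tagged method is wrapped so that it emits an event record while its own logic stays untouched (all names below are illustrative):

```python
import functools
from datetime import datetime, timezone

event_log = []  # collected events, one row per tagged method execution

def log_event(activity):
    """Wrap a tagged method so that it appends an event to the log
    when it runs. The wrapped method's own behavior is unchanged."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            event_log.append({
                "case_id": self.reservation_id,   # the chosen case ID attribute
                "activity": activity,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            })
            return func(self, *args, **kwargs)
        return wrapper
    return decorator

class Reservation:
    def __init__(self, reservation_id):
        self.reservation_id = reservation_id

    @log_event("Check In")
    def check_in(self):
        pass  # the original method body stays untouched

Reservation("R-1001").check_in()
print(event_log[0]["case_id"], event_log[0]["activity"])  # R-1001 Check In
```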

Figure 5: Instrument code, build and deploy new version of the system

The instrumented code can be built into a new version of the hotel management application. This instrumented application behaves identically to the original one, with the additional capability of event logging. The logging starts at the moment that the instrumented version is deployed. So, from that moment on it is possible to analyze new reservations and the execution of any other system use case.

Step 5: Analyze logging

The output of step 4 is the ‘runtime logging’ event log that we can now analyze with process mining. We have to wait until enough events have been collected to perform a representative analysis. For each executed method that was tagged in the sequence diagram, an event is added to the log. A snippet of the resulting log is shown in Figure 6 below.

Figure 6: The event log captured by the instrumented version of the hotel management system

When you import this event log into Disco, the process map shown in Figure 7 below is discovered. In the process mining tool, we can further analyze the system behavior based on the actual usage of the instrumented system.

As soon as we understand the current behavior in detail, we can start working on the new system that supports online reservations for future customers without losing track of all the other scenarios from the current system that still need to be supported.

Figure 7: The discovered process map

This is a small and simple example, but imagine a large legacy system that has many different functionalities. Without process mining we would have to manually look at the source code to understand how the system works. For a large system, going through the entire source code can be a very time-consuming and daunting task.

Furthermore, looking at the source code does not give us any indication about how the system is actually used. So, we might end up transferring pieces of functionality to a replacement system that are no longer necessary, thereby making the new system more complicated than it needs to be.

Process mining is a great way to understand processes of any kind. Leveraging process mining to understand the inner workings of legacy systems is an application area, where this insight is especially valuable.