Become the Process Miner of the Year 2019!

Three years ago, we introduced the Process Miner of the Year awards to help you showcase your best work and share it with the process mining community. After Veco won the award in 2016 and Telefonica took the trophy home in 2017, the university hospital Universitario Lucus Augusti (HULA) became the Process Miner of the Year 2018.

This year, we will continue the tradition, and the best submission will receive the Process Miner of the Year award at this year's Process Mining Camp, on 20 June in Eindhoven.

Have you completed a successful process mining project in the past months that you are really proud of? A project that went so well, or produced such amazing results, that you cannot stop telling everyone around you about it? You know, the one that propelled process mining to a whole new level in your organization? We are pretty sure that a lot of you are thinking of your favorite project right now, and that you can't wait to share it.

What we are looking for

We want to highlight process mining initiatives that are inspiring, captivating, and interesting. Projects that demonstrate the power of process mining, and the transformative impact it can have on the way organizations go about their work and get things done.

There are a lot of ways in which a process mining project can tell an inspiring story. To name just a few:

  • Process mining has transformed your organization, and the way you work, in an essential way.
  • There has been a huge impact with a big ROI, for example through cost savings or efficiency gains.
  • You found an unexpected way to apply process mining, for example in a domain that nobody approached before you.
  • You were faced with enormous challenges in your project, but you found creative ways to overcome them.
  • You developed a new methodology to make process mining work in your organization, or you successfully integrated process mining into your existing way of working.

Of course, maybe your favorite project is inspiring and amazing in ways that can't be captured by the above examples. That's perfectly fine! If you are convinced that you have done some great work, don't hesitate: Write it up, submit it, and take your chance to become the Process Miner of the Year 2019!

How to enter the contest

You can either send us an existing write-up of your project, or you can write about your project from scratch. It is probably better to start from scratch, since we are not looking for a white paper, but rather an inspiring story, in your own words.

In any case, you should download this Word document, which contains some more information on how to get started. You can use it either as a guide, or as a template for writing down your story.

When you are finished, send your submission to info@fluxicon.com no later than 30 April 2019.

We can't wait to read about your process mining projects!

Process Mining Transformations — Part 4: Transpose Data

This is the 4th article in our series on typical process mining data preparation tasks. You can find an overview of all articles in the series here.

When you check whether your data set is suitable for process mining, you look for changing activity names and for changing timestamps to make sure that you have activity and timestamp history information. However, when looking for the case ID, you will be searching for multiple rows with the same case ID, because the case ID serves as the linking pin for all the events that were performed for the same process instance.

If you have different case IDs in each row, then this could mean that what you thought was your case ID is just an event ID, or that you don't actually have multiple events per case in your data set. But more often than not, your data set is simply structured in columns rather than in rows: The activity information is spread out over different columns for each case (with just one row per case).

The good news is that you can use such a data set for process mining. All you have to do is to transform it a little bit!

The screenshot below (click on the image to see a larger version of it) shows a data set from a hospital. Patients who are undergoing surgery in the Emergency Room (ER) are first admitted before the surgery (column C), ordered from the department before surgery (column D), enter the ER (column E), leave the ER (column F), and are submitted again to a department after the surgery (column G).

The data in this format is not suitable to be used for process mining yet, because the activity name is contained in the heading of the columns C, D, E, F and G, and the timestamps are in the cells of these columns. Nevertheless, the ingredients are there and all we need to do is to transpose the activity columns into rows.

For this example, the case Surgery_1 for Patient_1 needs to be structured into the following format (see below).

In this article we show you step by step how you can transpose your column-structured activity data into rows. We will first demonstrate how you can do this manually in Excel but then also show how you can scale this transformation outside of Excel for large data sets.

Furthermore, there are choices that need to be made with respect to the timestamps and about how additional data attributes should be represented in the new data set. We will discuss these choices and their consequences for your analysis.

Option 1: Columns to rows with one timestamp per activity

In most situations, you will want to create an event log with one activity per timestamp column (similar to the example above).

To do this in Excel, you can first create a new tab (or a new file) and add a column header for the caseID, timestamp and activity fields. In the hospital process above, both the SurgeryNr and the PatientID field can be used as a caseID, so we have included them both.

Then, we copy and paste the cells of both the SurgeryNr and PatientID fields from our source data into the corresponding case ID columns of the new data set (see below).

Now it is time to add the first activity. So, we first copy the timestamps for the first activity from the dtAdmission_before_surgery_timestamp column into the Timestamp column. We could simply use the ‘dtAdmission_before_surgery_timestamp’ column header as the activity name, but while we are at it, we have the chance to give this activity a nicer, more readable name. Let's call it ‘Admission’, because this is the admission step of the surgery process. We simply copy and paste this activity name into the Activity column for each cell (see below).

We repeat this for each of the timestamp columns in our source file. So, for the second activity we again add all the SurgeryNr and PatientID values below the previous rows, thereby doubling the number of rows (see below).

Now, we copy the timestamps from the dtPatient_ordered_before_surgery_timestamp column to the Timestamp column and fill in ‘Ordered’ as the simplified activity name for these timestamps in the Activity column (see below).

These steps are repeated for each of the activity columns in the original file. Make sure to add the activities in the expected process sequence to avoid the data quality problem of same timestamp activities (especially if you have just dates and no time in your timestamps).

After adding all five activities, the resulting event log has indeed grown five times in the number of rows compared to the initial, row-based data set. For more activities, it will grow even more. This is the reason that even for moderately sized data sets the Excel limit of 1 million rows can be exceeded quickly and more scalable methods are needed (see more on that at the end of this article).

The fully transposed surgery process data set still fits into Excel and can now be exported as a CSV file using the ‘File -> Save As’ menu in Excel. After importing the CSV file into Disco (using both the SurgeryNr and the PatientID as the combined case ID), we can see the process map shown below.

In case you are wondering: The process map does indeed have some weird start and end points and some strange connections (see, for example, the path from ‘Admission’ to ‘Leave ER’). Most likely, these are data quality problems due to the manually collected timestamps. Before we analyze the process, we would need to investigate the start and end points as well as validate and clean the data. However, the focus of this article is on the data transformation itself, and on the choices in the structuring of the data, before we even get to these two steps.
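
If your source data already lives in a database, the same column-to-row transposition from option 1 can also be expressed in SQL. The sketch below is only an illustration: it assumes a hypothetical table named `Surgery`, and while the first two timestamp column names are taken from the screenshots above, the remaining three are made-up placeholders that you would replace with the actual column names of your own data set.

-- One SELECT per activity column; UNION ALL stacks the results into rows.
-- Keeping the SELECTs in the expected process order helps if you later rely
-- on the row order to break ties between same-date timestamps.
SELECT `SurgeryNr`, `PatientID`, 'Admission' AS `Activity`, `dtAdmission_before_surgery_timestamp` AS `Timestamp` FROM `Surgery`
UNION ALL
SELECT `SurgeryNr`, `PatientID`, 'Ordered', `dtPatient_ordered_before_surgery_timestamp` FROM `Surgery`
UNION ALL
SELECT `SurgeryNr`, `PatientID`, 'Enter ER', `dtEnter_ER_timestamp` FROM `Surgery`
UNION ALL
SELECT `SurgeryNr`, `PatientID`, 'Leave ER', `dtLeave_ER_timestamp` FROM `Surgery`
UNION ALL
SELECT `SurgeryNr`, `PatientID`, 'Submission', `dtSubmission_after_surgery_timestamp` FROM `Surgery`

The result of such a query can then be saved as a CSV file and imported into Disco just like the manually created Excel sheet.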

Option 2: Columns to rows with start and completion timestamp

When we look at the process map from a performance perspective, we can see that the points where the patients enter and leave the ER are represented as independent activities. The duration that the patient spends in the ER is shown on the path between the ‘Enter ER’ and ‘Leave ER’ activities (see below).

We might prefer to show the process part where the patient is in the ER as one activity (using the entering as the start timestamp and the leaving as the end timestamp for the activity). In this way, the duration of the patient being in the ER will be shown within the activity in the process map.

To achieve this, you can follow the same approach as before but copy and paste the ‘Enter ER’ and ‘Leave ER’ timestamps into a start and complete timestamp column for the same ‘ER’ activity (see below).1

The resulting event log is ready to be imported and results in a process map with a single ‘ER’ activity, as shown below.
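
The same idea can be sketched in SQL for option 2, again using the hypothetical `Surgery` table and placeholder column names from the sketch above: the two ER columns feed the start and complete timestamps of a single ‘ER’ activity, while the point activities simply repeat their one timestamp in both columns (one possible convention for such events).

-- The ER stay becomes one activity with a start and a complete timestamp
SELECT `SurgeryNr`, `PatientID`, 'ER' AS `Activity`, `dtEnter_ER_timestamp` AS `Start Timestamp`, `dtLeave_ER_timestamp` AS `Complete Timestamp` FROM `Surgery`
UNION ALL
-- Point activities repeat their single timestamp in both columns
SELECT `SurgeryNr`, `PatientID`, 'Admission', `dtAdmission_before_surgery_timestamp`, `dtAdmission_before_surgery_timestamp` FROM `Surgery`
UNION ALL
SELECT `SurgeryNr`, `PatientID`, 'Ordered', `dtPatient_ordered_before_surgery_timestamp`, `dtPatient_ordered_before_surgery_timestamp` FROM `Surgery`
UNION ALL
SELECT `SurgeryNr`, `PatientID`, 'Submission', `dtSubmission_after_surgery_timestamp`, `dtSubmission_after_surgery_timestamp` FROM `Surgery`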

Adding case attributes and event attributes

When transposing your data, you typically want to include all additional attributes (columns that were not yet converted into a caseID, activity or timestamp column) to be able to answer certain questions using the filters in Disco or to take different perspectives on your data. When you include an attribute, you need to decide whether you include it as a case attribute or as an event attribute.

A case attribute is constant (it does not change) for the whole case. In the surgery process, the diagnosis treatment code is established even before the admission for the surgery and will not change in the course of the process. For example, for Surgery_1 the ‘Treatmentcode’ attribute value is ‘Code_20’ (see below). In our process mining analysis, we can then later filter for patients with a particular treatment code.

In contrast, an event attribute can change in the course of the process and is related to a particular event. For example, the department from which the patient was admitted and ordered can be different from the department to which they were submitted after the surgery. Furthermore, the ER room that was used for the actual surgery is linked only to the ER activity (see an example for Surgery_16 below).

When structuring your attributes, we recommend that, if in doubt, you place them in separate columns. This way, you retain the maximum flexibility for your analysis. For example, while the ‘Admission Department’ attribute value and the ‘Submission Department’ attribute value2 can both be placed in the same ‘Department’ event attribute column, the ‘Room’ event attribute should be kept as a separate column.

We can then analyze different perspectives of the patient logistics. For example, in the following screenshot we have configured the ‘Treatmentcode’ column as an attribute and included both the ‘Department’ and the ‘Room’ attributes as part of the activity name during the import step (see below).

This way, after filtering for the top 15 treatment codes, we can see the flow of Surgery_16 above (from AC department via room 9 to AL department) back in the process map. But we could have also chosen to just unfold the room, or to just unfold the department, or none of them, to take a different view on the process.
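
As a side note, if you build the transposed data set with a query like the sketches above, the case and event attributes can be carried along in the same SELECT statements: a case attribute such as the treatment code is repeated in every row of its case, while an event attribute such as the department or the ER room is only filled in for the rows of the event it belongs to and left empty (NULL) everywhere else. A minimal sketch, again with hypothetical source column names and showing only three of the five activities for brevity:

-- 'Treatmentcode' is a case attribute: repeated for every event of the case.
-- 'Department' and 'Room' are event attributes: filled only where they apply.
SELECT `SurgeryNr`, `PatientID`, 'Admission' AS `Activity`, `dtAdmission_before_surgery_timestamp` AS `Timestamp`, `Treatmentcode`, `Admission_department` AS `Department`, NULL AS `Room` FROM `Surgery`
UNION ALL
SELECT `SurgeryNr`, `PatientID`, 'ER', `dtEnter_ER_timestamp`, `Treatmentcode`, NULL, `ER_room` FROM `Surgery`
UNION ALL
SELECT `SurgeryNr`, `PatientID`, 'Submission', `dtSubmission_after_surgery_timestamp`, `Treatmentcode`, `Submission_department`, NULL FROM `Surgery`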

Beware of missing repetitions!

So, when you receive your data in a column-shaped format, you should take the data and transform it as described above. But, as we have discussed in this previous article about missing repetitions for activities, seeing the activities in columns rather than in rows should immediately bring up a warning flag in your mind: Most likely you will not be able to see loops in this process.

The reason is that there is no place to put a second timestamp for the same activity, so typically the first timestamp is overwritten and only the last one will be kept. For example, in case 1 in the following data set the first occurrence of activity C is lost, because only the timestamp of the second occurrence of C is stored in the ‘Activity C’ column (see below).

As a result, it looks as if activity B was followed directly by activity D at least once, while in reality this never happened (see below).

There is typically nothing you can do about this data quality problem at that point (you would need to go deeper to recover the activity repetition timestamps from the original data source).

What is important now is that you are aware of the issue and keep it in mind during the analysis to interpret the discovered process maps correctly. By knowing that distortions like the B -> D flow above can be due to the missing loops in your data, you know that you are not seeing the complete picture of the process.

Transpose large data sets in an ETL tool

Finally, transposing your data in Excel can be a good option if you have to do it just once and the data set is not that big. However, as with any manual data transformation, you run the risk of accidentally making a mistake such as copying and pasting the wrong column. Furthermore, especially if you want to repeat this analysis more often, or if your data set gets too big for Excel in the process, an ETL tool can save you a lot of time.

For example, by building an ETL workflow in the open source tool KNIME you can transpose your data with just a few mouse clicks. To transform the data as we have shown manually in option 1 above, we just need three steps in a simple reader -> unpivot -> writer workflow as shown below.

In the first step (here ‘File Reader’) the data is loaded. The second step (‘Unpivoting’) automatically transposes the timestamps from columns to rows. The last block (‘CSV Writer’) saves the result into a new CSV file. You can download this KNIME workflow file here.

The nice thing about building an ETL workflow like the one shown above is that you can use it on really large data sets. And you can re-run it on fresh data as often as you want.


  1. Note that in this case you actually first need to clean the data set of any instances where the ‘Enter ER’ timestamp is later than the ‘Leave ER’ timestamp, because, similar to the case of missing complete timestamps, activities with this data quality problem can no longer be detected after the data has been imported.

  2. Yes, any event attribute values that should end up in the same attribute column will need to come from separate columns in the column-shaped source data. Otherwise, you will have lost the history of those changing attribute values and most likely only see the last one (e.g., the department where the patient ended up after the surgery).

Process Mining Camp on 20 & 21 June — Save the Date!

Open up your agenda and mark the date: Process Mining Camp takes place again on 20 & 21 June in Eindhoven1 this year!

For the eighth time, process mining enthusiasts from all around the world will come together in the birthplace of process mining. We are already super excited to meet you all, and we are very proud that Process Mining Camp is just as international as the process mining community itself. Over the past years, people from 34 different countries have come to camp to listen to their peers, share their ideas and experiences, and make new friends in the community.

Process Mining Camp is not your run-of-the-mill corporate conference but a community meet-up with a unique flair. Our campers are really nice people who do not just brag about their successes but also share their pitfalls and failures, from which you can learn even more than from the stories that went well. In addition, you will get lots of ideas about new approaches and use cases that you have not considered before.

Like last year, this year's Process Mining Camp will run for two days:

  • The first day (20 June) will be a day full of inspiring practice talks from different companies, as you have seen from previous camps.
  • On the second day (21 June), we will have a hands-on workshop day. Here, smaller groups of participants will get the chance to dive into various process mining topics in depth, guided by an experienced expert.

Mark these dates in your calendar and sign up for the camp mailing list here to be notified when tickets go on sale! Even if you can't make it this year, you should sign up to receive the presentations and video recordings as soon as they become available.

We can't wait to see you in Eindhoven on 20 June!

Anne, Rudi and Christian


  1. Eindhoven is located in the south of the Netherlands. Besides its own local airport, it can also be reached easily from Amsterdam's Schiphol airport (direct connection from Schiphol every 15 minutes; the journey takes about 1h 20 min).

Case Study: Government Process Mining in the Brazilian Executive Branch

This is a guest article by Henrique Pais da Costa from the Brazilian government. If you have a guest article or process mining case study that you would like to share as well, please contact us via anne@fluxicon.com.

The Federative Republic of Brazil is the fifth largest country in the world by land area [1] and the sixth by population, with more than 200 million inhabitants [2], and it is one of the ten major world economies [3]. Due to the nature of its legal system, Brazil has several formal processes for the preparation of legal norms, far removed from the idea of common law.

Between the promulgation of the current Brazilian Constitution in 1988 and September 2016, more than 163,000 federal rules have been issued [4], including 99 constitutional amendments [5]. This number becomes very significant when compared to other countries. The American Constitution, for example, has only 27 amendments [6] in over 230 years of existence. All of this legal framework governs the lives of millions of citizens, which makes the task of diagnosing imperfections in the federal regulatory process all the more relevant, since small improvements can have a profound positive impact on the lives of the Brazilian people. According to Davi Lago [7], “the degree of delay in Brazilian public bureaucracy is simply absurd. In spite of its economic wealth, Brazil has pitiful administrative efficiency indices that deviate from the advanced nations”.

The purpose of the study described in this article was to identify gaps in the regulatory processes proposed by the Federal Executive Branch, such as regulations overlapping in several layers, bottlenecks, and rework. This challenge provided a unique opportunity to apply process mining, a methodology that had never been used to diagnose imperfections in one of the main activities of the Federal Government: to legislate.

Organization

The Brazilian State is structured in three Branches with distinct and complementary attributions. The Legislative Branch has the competence to propose and produce laws. The Judiciary has the task of resolving doubts when divergences arise. The Executive Branch has the function of administering the State, applying what the normative apparatus orders (see Figure 1).

Figure 1: Separation of powers [1]

However, each Branch often exercises, in a secondary way, attributions that essentially belong to the others. The Federal Constitution specifies which laws must be initiated by the President of the Republic, as well as the President's competence to issue decrees and provisional measures, which gives the legislative process in the Federal Executive Branch its relevance. It is in this context that the modernization team of the Civil House of the Presidency of the Republic has worked to improve the Government's performance in the normative process.

Process

The normative process in the Executive Branch comprises the activities associated with the production of administrative acts (proposals for constitutional amendments, laws, provisional measures, decrees, among others) from its initial conception until submission to the Legislative Branch, represented by the National Congress, or until its publication.

The present study focuses on the set of activities performed by the different public organizations, the interaction between the Ministries, and their relationship with the Presidency. Acts leave the Executive Branch in two ways: decrees and provisional measures must be published, while proposals for constitutional amendments and bills must be sent to the National Congress, whose own legislative procedure was not the subject of this analysis (Figure 2).

Figure 2: The normative process

Conception of the act: The proposal of normative acts is the responsibility of the Ministers of State, according to their respective areas of competence. As a rule, these acts are designed by the technical areas, which make a diagnosis and evaluate alternatives, costs and possible practical results for society. The project to modernize the normative process in the Executive Branch involves improvements in the intra-ministerial process, but the process mining software Disco was used with a focus on the relationship between the Ministries, their relationship with the Presidency of the Republic, and the internal process within the Presidency.

Discussion with stakeholders: Citizens, companies, parliamentarians, foreigners and other government agencies are examples of the various stakeholders in the standards produced by the Federal Government. Process mining is part of a robust modernization project, which aims, among other actions, to allow the proposer to identify which Ministries are competent to deal with a particular standard, and to implement text mining technologies that identify similar regulatory initiatives in other government agencies, avoiding duplicate initiatives and minimizing rework.

Consolidation of the act: Through interviews with actors from various Ministries, it was possible to verify that, once at this stage, there is already a consensus regarding the content of the proposal. The consolidation of the act can be divided into the stage prior to its arrival in the Civil House of the Presidency of the Republic, when the matter is inserted into the System of Generation and Processing of Official Documents (‘Sidof’), and the later stage, already in the Civil House, when it is processed through the Electronic Information System (‘Sei!’) until its finalization and preparation for the presidential signature.

Signature of the act: After the technical and legal analysis (internal procedure in the Civil House) the act is finally ready for presidential signature and referendum by the Ministers of State, in their respective areas of competence. Having diagnosed all this procedural context, it was possible to identify multiple opportunities for improvement to bring greater productivity, safety, control and reliability to the relevant activities performed.

Data

The complexity of the process, due to the heterogeneous databases and the interplay between formal and informal flows, forced us to find creative ways to systematize ideas and define the scope of the mining. The first step was to disregard the so-called informal flow, i.e. the internal process represented by the exchange of e-mails during the conception of the act and the discussion with stakeholders (Figure 3).

The solution to simplify the extensive general flow of the normative process was to make cuts that allowed two different analyses:

  • the information exchange between the Ministries and the sending of standards to the Presidency of the Republic (‘Sidof’); and

  • the internal process in the Presidency in another system (‘Sei!').

These systems are administered by different areas and have different characteristics. Despite the lack of uniformity, both systems capture the essential logs needed for the Disco tool. The process IDs, timestamps, activities, areas and other attributes were extracted and imported into Disco to arrive at the results below.

Figure 3: Information systems involved in the process

Results

The first results provided by our process mining analysis were quantitative, but no less relevant, allowing a sui generis study of the efficiency of the normative process in the Federal Executive Branch. This initial analysis also enabled the diagnosis of the most influential Ministries in this process: Foreign Relations (MRE); Planning, Development and Management (MP); and Finance (MF) are examples of Ministries that propose most of the standards that the Executive Branch publishes or sends to the National Congress (Figure 4).

This is explained by the technical nature of the Ministry or by its competence to initiate specific rules, such as international agreements. The relevance of the different Ministries in the process, exposed by the mining, defined the Civil House's priorities for expanding the project to the Ministries.

Figure 4: Main Proposing Ministries

The ‘Sidof’ database contained 9,906 normative projects between October 1, 2010 and March 12, 2018. After applying some attribute and endpoint filters to remove the non-normative decrees (28%) and the incomplete cases, it was possible to reach the following conclusions: Only 2,964 decrees and provisional measures were published. For the remaining cases, it was not possible to distinguish those sent to the National Congress (bills and amendments) from those that were shelved.

The mean duration of these processes was 30 weeks (Figure 5), spread over 2,739 different paths. The most common path (variant no. 1) covers only 21 cases, which is hard to understand, since several projects have the same nature and should, at least in theory, travel the same course. It was found that 2,637 decrees and provisional measures followed unique paths until their arrival in the Presidency: almost a different trajectory for each published standard.

Figure 5: ‘Sidof’ Statistics in Disco

The findings from the ‘Sidof’ process mining analysis were already very helpful for the modernization team, but it was the analysis of the ‘Sei!’ database that generated immediate impact on the normative process.

Because ‘Sei!’ is a more modern and more recently implemented system, the study involved a database of 2,470 normative projects evaluated by the Presidency between November 23, 2016 and November 28, 2017, this time including the non-normative acts, which were not segregated because they could not be distinguished in the system.

The study of variants (reflecting the different paths that the normative projects take in ‘Sei!’) enabled the following discovery: The variety of procedural alternatives found “from the door out” of the Civil House also occurs internally. This means that, in theory, the process flow of preparing a norm is known by all stakeholders, but practice shows that there is no standard. There is a great information deficit, since the Ministries do not have access to the Presidency's electronic system (they process the normative projects through another system) and cannot clearly identify which path their processes take until the presidential signature (as evidenced by the study of variants in Disco). The result is that the process is seen as a black box by the proponent, one of the most relevant actors in this process and the one who truly knows the impact that the norm will have on society.

In the case of the ‘Sei!’ process mining analysis, it was especially the animation that made the relevance of certain areas and the existence of possible bottlenecks visible. Generating the dynamic replay of the process data helped to discover and illustrate the importance of two major players in the internal process of the Presidency: the legal unit (called SAJ) and the government policies unit (called SAG), which carry out, respectively, the legal and the merit analysis of the normative projects upon their arrival in the Presidency.

The image in Figure 6 is a clipping of the dynamic process map (animation). All the indicated sectors are areas of SAJ. The activities indicated by arrows are the technical areas, and the one indicated by a circle is the area of the administrative protocol.

Figure 6: Processes in SAJ

The image in Figure 7 shows the participation of SAG in the process. Again, the arrows represent the technical units (economic policy, social policies, infrastructure, public finance and public management) and the circles the areas of administrative protocol (located at the top of Figure 7) and of the dispatch of documents (located at the bottom of Figure 7).

Figure 7: Processes in SAG

The first qualitative result of the process mining analysis shows that one of the bottlenecks is SAG's document dispatch area. As one can see in the animation, this area receives all the cases (yellow dots), regardless of their topic (economic policy, social policies, infrastructure, public finances or public management), for later processing. The accumulation of cases before this area indicates a possible administrative problem to be solved, since there are at least five queues in front of an activity that usually does not take much time.

Initially, the proposal was made to eliminate this activity. However, the decision was made to keep it as a means of control for the area, given its central position. Nevertheless, our team found a way to improve the process for some cases, which no longer need to pass through this activity, because there is no reason for standards to queue up in an administrative unit when no technical analysis (which takes more time) is involved.

The second qualitative result made possible by our process mining analysis was the discovery of the relevance of SAJ and SAG during the normative process. The legal and the merit analyses are the basis of the presidential signature and are the main activities performed by the Civil House in this process.

As a result of the analyses of the modernization team, it was agreed to focus on automation and on reducing information deficits, specifically in the activities carried out by the SAG and SAJ areas. In the Research and Development department, a project named LeXXIs was started to model the “normative process in the 21st century”.

Impact

Several actions that were derived from the visualizations of the normative process map in the process mining software Disco are already being adopted. Once the most critical areas and points were identified, the improvement initiatives were divided into three major strategies (see also Figure 8).

Figure 8: Link between proposed solutions and process mining results

1. Project expansion to the Ministries (through the prototyping of a new system)

The first action was the prototyping of a new system called Seidof, which combines the qualities of both the ‘Sei!’ and the ‘Sidof’ systems and minimizes their defects. In this new environment, the modernization team specified types of processes by theme of the standard and defined patterns. Real normative processes from the Ministry of Planning, Development and Management (one of the main proponents as shown in Figure 4) were included in the prototype to test the process flow between author, coauthors and the Presidency.

In this way, our team has delivered the new system Seidof ready to begin the replacement of the old system by the end of 2018. The main goal is to make the process more transparent for the Ministries (one single system) and to establish more streamlined process patterns, thereby reducing the huge number of paths diagnosed by Disco.

2. Improvement of working conditions (workers from SAJ and SAG)

The second action was a proof of concept (PoC) in partnership with Microsoft's business area to use Office365 to test collaborative editing tools, such as SharePoint and Teams, in the preparation phase of standards. The goal is to provide collaborative editing (in real time) to the merit and legal analyses, facilitating the interaction between the two largest actors of the normative process in the Civil House.

Furthermore, we started to create a means for the automated cleaning and formatting of normative texts. This activity currently requires valuable time from several technicians in the process. The greatest difficulty for the tool will be to ensure that the rules for drafting, structuring and amending normative acts are fulfilled. This solution is the editing function of the virtual assistant that was created, named Doctor Norma.

Doctor Norma's artificial intelligence was developed using tools and techniques from data science and text mining2. Some of the SAJ technicians who experimented with the prototype saw great potential in this solution. For them, it allows checking the latest recommendations of the public compliance bodies on a normative subject, and finding related law projects that are being processed in the Legislative Branch.

3. Redesign of internal administrative routines

The third action was a redesign of internal administrative routines. The initial idea was to adapt the schedules and timetables of the SAG document dispatch area to those of the technical areas, so that the process could flow more naturally.

The relevance of the protocol and expedition area as a separate entity to increase administrative control is understood, but it makes no sense that this administrative step is the bottleneck of such a relevant process. The internal division of procedures may seem efficient, but it has been disrupting the process flow, as shown in the animation in Figure 7.

The modernization team suggested a modification of the working hours of the bottleneck area, with the intention of adapting the area to the rest of the process rather than the other way around, as happens today. In addition, we recommended that the normative projects be dispatched daily rather than on a certain day of the week, in order to give fluidity to the process. This redesign of administrative routines faced great resistance and was interrupted, forcing the modernization team to focus its work on the automation of the process first.

The implementation of all these improvements, the expansion of the project and the follow-up of the gains obtained with process mining are the focus of the modernization team of the Civil House from now on.


You can download this case study as a PDF here for easier printing or sharing with others.


  1. Source: Viva La France! Support Our Revolution! (2013) [8]

  2. The prototype was developed by a postdoctoral researcher in computer science, a specialist in text mining, using knowledge of textual similarity. The PoC (using training data from the 30-year period from 1988 to 2018) was made available for viewing at romualdoalves.com/decretos-2015-a-2018-poc-text-mining.

Process Mining Transformations Part 3: Combine Data Sets of the Same Shape

This is the 3rd article in our series on typical process mining data preparation tasks. You can find an overview of all articles in the series here.

In the previous articles, we have shown how loops can be split up into individual cases or unfolded activities. Another typical category of data transformations is that multiple data sets need to be combined into one data set.

For example, you might receive a separate report with all the status changes every month. These files then need to be combined into one file to analyze the full timeframe for this process. Another example would be a situation where different process steps are recorded in different IT systems. After extracting the data from the individual systems, the files need to be combined to analyze the full end-to-end process.

When you combine multiple data sets into one event log, you need to look at the structure of these data sets to understand how exactly they can be combined. For example, the per-month data snippets need to be concatenated in a vertical manner (copied below each other in the same file).

The same is true if you want to combine different process steps across multiple systems. The assumption is that the activities in the different systems have a common case ID if they refer to the same case in the process. If different IDs are used in different systems, you first need to create a common case ID. Note also that if the timestamp patterns are recorded differently in the different systems, then you need to put them into separate columns when preparing the data.
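
As a small illustration of this case ID mapping, here is a hedged SQL sketch along the lines of the database approach shown later in this article. All table and column names (`systemA`, `systemB`, `mapping`, `OrderNr`, `TicketID`) are hypothetical: system A logs events under an order number, system B under its own ticket ID, a mapping table translates the ticket IDs into order numbers, and the differently formatted timestamps are kept in separate columns.

-- Events from system A: the order number is already the common case ID
SELECT a.`OrderNr` AS `Case ID`, a.`Activity`, a.`Timestamp` AS `Timestamp System A`, NULL AS `Timestamp System B` FROM `systemA` a
UNION ALL
-- Events from system B: translate the ticket ID into the common case ID via the mapping table
SELECT m.`OrderNr`, b.`Activity`, NULL, b.`Timestamp` FROM `systemB` b JOIN `mapping` m ON m.`TicketID` = b.`TicketID`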

In this article, we show you three approaches that you can take to combine data from multiple files below each other into a single data set for your process mining analysis.

We use the example of four months of data that has been collected in four individual files: November.csv, December.csv, January.csv, and February.csv. It is possible to import one file at a time into Disco and analyze each month separately. For example, after importing the November.csv file you would be able to see that the dataset covers the timeframe from 1 November 2016 until 30 November 2016 (see screenshot below - Click on the image to see a larger version of it).

However, we may want to answer questions about a larger timeframe. For example, we might want to look for cases that start in one month and are completed in the next month. For this, we need to combine these files into a single data set.

Note that the format of all four files in this example is identical: They all contain the same headings (a Case ID, Activity, and Completed Timestamp column) in the same order.

1. Combining the data in Excel

If your data is not that big, copying and pasting the data in Excel may be the easiest option.

The first step is to just open the November.csv file in Excel and scroll to the last row (208851) or use a shortcut1 and select the first empty cell (see screenshot below).

You can now simply add data from the December.csv file by choosing File -> Import and select the December.csv file. Note that you need to import from the 2nd row forward, otherwise the heading will be included again. We can see that 201135 rows are added to the Excel sheet (see below).

We can now save the data set as a CSV file and give it a new name, for example, November_and_December_Excel.csv. After importing the data into Disco we can check in the statistics that the dataset now covers two months of data (see below).

Using Excel is easy, but you need to be aware that current versions of Excel are limited to 1,048,576 rows and older versions are even restricted to 65,536 rows. In this example, we are able to combine all four files without exceeding the Excel limit. However, the closer you get to these limits, the slower Excel may become.

2. Combining the data in an ETL tool

Once the data becomes too big for Excel, you need a different approach. If you are not used to working with databases and are looking for a simpler way to combine large datasets, then we recommend using an ETL tool. ETL tools provide a graphical interface in which you drag and drop workflow steps to transform your data. They are therefore much more accessible for non-technical users.

In this example we use KNIME, which is open source and freely available at: www.knime.org.

Once you have installed KNIME, you can create a new workflow that starts with importing the individual CSV files. Each file can be imported by dragging a File Reader onto the canvas and configuring it to read the right file (see below).

With a Concatenate node, two File Readers can be combined into a single dataset (see below).

Finally, the result can be saved as a CSV using a CSV Writer (see below). In the “CSV Writer” block you can configure the location to which the resulting file will be written. Then simply execute the workflow to save the combined dataset at the specified location.

3. Combining the data in an SQL database

Of course, you can also do this data preparation in a good old database. This requires some technical skills to set up a database server and the ability to write SQL queries.

There are many databases available. For this example, I downloaded and installed the open source MySQL Community Server and MySQL workbench from https://dev.mysql.com.

The simplest way to add data is to use the Table Data Import Wizard2 to import the CSV files. For each file, a table will be created in the database and the data will be inserted into this table (see (1) in the screenshot below).

Now you can access the data, for example the November data, in the database using the following SQL query:

SELECT `Case ID`, `Activity`,`Complete Timestamp` FROM `eventlog`.`November`

Data from multiple tables can be combined using a UNION between the SELECT statements of the individual tables (see (2) in the screenshot above):

SELECT `Case ID`, `Activity`,`Complete Timestamp` FROM `eventlog`.`November`
UNION
SELECT `Case ID`, `Activity`,`Complete Timestamp` FROM `eventlog`.`December`
UNION
SELECT `Case ID`, `Activity`,`Complete Timestamp` FROM `eventlog`.`January`
UNION
SELECT `Case ID`, `Activity`,`Complete Timestamp` FROM `eventlog`.`February`
UNION
SELECT `Case ID`, `Activity`,`Complete Timestamp` FROM `eventlog`.`March`

Finally, you can export the data and save it as a CSV file using the export function (see (3) in the screenshot above).

After importing this CSV file into Disco, we can see that now the dataset contains a total of 843,805 events and covers the timeframe from 1 November until 5 March (see below).

Whichever method you use, make sure to verify not only that the start and the end timestamps of the new data set are as expected, but also check that there are no gaps in the timeline.

A gap in the timeline would most likely indicate that something went wrong in your data preparation. For example, you could have forgotten to include one of the files (see the screenshot below).
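
If your combined data is in a database anyway, one simple sanity check is to count the events per month and verify that no month is missing or suspiciously small. A minimal sketch in MySQL, assuming the combined result was stored in a hypothetical table `eventlog`.`combined` and that the `Complete Timestamp` column was imported as a DATETIME:

-- Events per month; a missing or unusually small month points to a gap
SELECT DATE_FORMAT(`Complete Timestamp`, '%Y-%m') AS `Month`, COUNT(*) AS `Events`
FROM `eventlog`.`combined`
GROUP BY `Month`
ORDER BY `Month`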


  1. Ctrl+Shift+Down on Windows or Command+Shift+Down on macOS

  2. Note that the Table Data Import Wizard (see https://dev.mysql.com/doc/workbench/en/wb-admin-export-import-table.html) is slow because each row requires an insert statement to be executed. A faster approach would be to use the LOAD DATA INFILE import function. However, this requires writing a data import script.