This is Flux Capacitor, the company weblog of Fluxicon.

How Much Data Do You Need For Your Process Mining Project?

After our initial post on the mental model that underlies process mining, we started a data requirements FAQ series with two earlier questions.

Here is another question that I frequently get once people are eager to start the data extraction phase of their process mining project.

FAQ #3: Which timeframe should my log cover?

As a rule of thumb, I usually recommend trying to get data for at least 3 months. Depending on the run time of a single process instance, it may be better to get data for up to a year. For example, if your process usually needs 5–6 months to complete (think of a public building permit process), a 3-month sample will not give you even one complete process instance.

How long are your cases?

So, it really depends on how long a case in your process typically runs. You want to get a representative set of cases, and you need to leave some room to catch the usual few long-running instances as well.

If you are still unsure how much data you need to extract, use the following formula based on the expected throughput time for your process:

timeframe = expected case completion time * 4 * 5

The baseline is the expected process completion time for a typical case. The 4 ensures that you have enough data to see four cases that started and completed one after another (of course there will be others in between). The 5 accounts for the occasional long-running cases (80/20 rule) and makes sure you see cases that take up to five times longer within the extracted time window.

For example, if the expected completion time of a typical case in your process is 5 days, then the formula yields 100 days = 5 days * 4 * 5, which is approximately 3 months of data. If, however, a typical process is completed in just a few minutes, then extracting a couple of hours of data may be enough.
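As a quick sanity check, the rule of thumb can be written as a small helper function (a sketch; the function name and the choice of days as the unit are our own):

```python
def extraction_timeframe(expected_completion_days):
    """Rule-of-thumb amount of data (in days) to extract.

    timeframe = expected case completion time * 4 * 5
    The 4 leaves room for four cases completed one after another;
    the 5 covers the occasional long-running case (80/20 rule).
    """
    return expected_completion_days * 4 * 5

# A typical case takes 5 days -> extract about 100 days (~3 months) of data.
print(extraction_timeframe(5))  # 100
```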

Please take the formula with a grain of salt. It has worked well for me, but the more you know about your process the better you will be able to judge the amount of data you should extract.

Two ways to extract data

Another way to make sure you get a good data sample is to choose a timeframe that you want to analyze (say, April of this year) and then extract all events for the cases that were started in that month. This way, you can catch long-running instances even though you are focusing on a shorter timeframe for your analysis.
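In code, this case-based extraction could look like the following minimal sketch (the event-log structure and field names are assumptions, not from any particular system):

```python
from datetime import date

# Hypothetical event log: (case_id, activity, timestamp) per event.
events = [
    ("A", "start", date(2023, 3, 28)),  # case A starts before April
    ("A", "end",   date(2023, 5, 10)),
    ("B", "start", date(2023, 4, 5)),   # case B starts in April...
    ("B", "end",   date(2023, 5, 20)),  # ...and ends after the window
    ("C", "start", date(2023, 5, 1)),   # case C starts after April
    ("C", "end",   date(2023, 5, 3)),
]

window_start, window_end = date(2023, 4, 1), date(2023, 5, 1)

# Select the cases whose FIRST event falls inside the window...
first_event = {}
for case_id, _, ts in events:
    first_event[case_id] = min(first_event.get(case_id, ts), ts)
selected = {c for c, ts in first_event.items()
            if window_start <= ts < window_end}

# ...then keep ALL events of those cases, even ones outside the window.
extracted = [e for e in events if e[0] in selected]
```

Here only case B is extracted, but it is extracted completely, including its final event from after the selected month.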

The picture below illustrates the difference. Every horizontal bar represents one case over time. The highlighted area stands for the selected timeframe, and the dark blue areas are the events that are covered by the data extraction method.

If the end date of your timeframe is today, then there is no difference between (a) and (b): Cases may always be incomplete because they are still running.

It also depends on your questions

The amount of data you should extract also depends on the questions that you want to answer. For example, if you want to understand the regular process, then beyond a certain point adding more data won't give you any more insights.

However, if you are looking for exceptions or irregularities that are important from a compliance angle, you probably want to check the data of the whole audit year to catch everything that went wrong in the audited period.

What is your experience with the amount of data that needs to be extracted? Let us know in the comments.


  1. Be aware, however, that any activity from earlier cases (started before the selected time period) will not be visible with this extraction method.  
