This is Flux Capacitor, the company weblog of Fluxicon.
You can find more articles here.

You should follow us on Twitter here.

Data Preparation for Process Mining — Part II: Timestamp Headaches and Cures 4

Did you know that 'Back to the future' contains an hommage to the classic 1928 silent comedy ‘Safety Last’?

This is a guest post by Nicholas Hartman (see further information about the author at the bottom of the page) and the article is part II of a series of posts highlighting lessons learned from conducting process mining projects within large organizations (read Part I here).

If you have a process mining article or case study that you would like to share as well, please contact us at anne@fluxicon.com.

Timestamps are core to any process mining effort. However, complex real-world datasets frequently present a range of challenges in analyzing and interpreting timestamp data. Sloppy system implementations often create a real mess for a data scientist looking to analyze timestamps within event logs. Fortunately, a few simple techniques can tackle most of the common challenges one will face when handling such datasets.

In this post I’ll discuss a few key points relating to timestamps and process mining datasets, including:

  1. Reading timestamps with code
  2. Useful time functions (time shifts and timestamp arithmetic)
  3. Understanding the meaning of timestamps in your dataset

Note that in this post all code samples will be in Python, although the concepts and similar functions will apply across just about any programming language, including various flavors of SQL.

Reading timestamps with code

As a data type, timestamps present two distinct challenges:

  1. The same data can appear in many different formats
  2. Concepts like time zones and daylight savings time mean that the same point in real time can be represented by entirely different numbers

To a computer time is a continuous series. Subdivisions of time like hours, weeks, months and years are formatted representations of time displayed for human users. Many computers base their understanding of time on so called Unix time, which is simply the number of seconds elapsed since the 1st of January 1970. To a computer using Unix time, the timestamp of 10:34:35pm UTC April 7, 2015 is 1428446075. While you will occasionally see timestamps recorded in Unix time, it’s more common for a more human-readable format to be used.

Converting from this human readable format back into something that computers understand is occasionally tricky. Applications like Disco are often quite good at identifying common timestamp formats and accurately ingesting the data. However, if you work with event logs you will soon come across a situation where you’ll need to ingest and/or combine timestamps containing unusual formats. Such situations may include:

The following scenario is typical of what a data scientist might find when attempting to complete process mining on a complex dataset. In this example we are assembling a process log by combining logs from multiple systems. One system resides in New York City and the other in Phoenix, Arizona. Both systems record event logs in the local time. Two sample timestamps appear as follows:

System in New York City: 10APR2015 23.12.17:54
System in Phoenix Arizona: 10APR2015 20.12.18:72

Such a situation presents a few headaches for a data scientist looking to use such timestamps. Particular issues of concern are:

You can see how this can all get quite complicated very quickly. In this example we may want to write a script that ingests both sets of logs and produces a combined event log for analysis (e.g., for import into Disco). Our primary challenge is to handle these timestamp entries.

Ideally all system admins would be good electronic citizens and run all their systems logging functions in UTC. Unfortunately, experience suggests that this is wishful thinking. However, with a bit of code it’s easy to quickly standardize this mess onto UTC and then move forward with any datetime analytics from a common and consistent reference point.

First we need to get the timestamps into a form recognized by our programming language. Most languages have some form of a ‘string to datetime’ function. Using such a function you provide a datetime string and format information to parse this string into its relevant datetime parts. In Python, one such function is strptime.

We start by using strptime to ingest these timestamp strings into a Python datetime format:

# WE IMPORT REQUIRED PYTHON MODULES (you may need to install these first)
import pytz
import datetime

# WE IMPUT THE RAW TEXT FROM EACH TIMESTAMP
ny_date_text="10APR2015 23.12.17:54"
az_date_text="10APR2015 20.12.26:72"

# WE CONVERT THE RAW TEXT INTO A NATIVE DATETIME
# e.g., %d = day number and %S = seconds
ny_date = datetime.datetime.strptime(ny_date_text, "%d%b%Y %H.%M.%S:%f")
az_date = datetime.datetime.strptime(az_date_text, "%d%b%Y %H.%M.%S:%f")

# WE CHECK THE OUTPUT, NOTE THAT FOR A NATIVE DATETIME NO TIMEZONE IS SPECIFIED
print(ny_date)
>>> 2015-04-10 23:12:17.540000

At this point we have the timestamp stored as a datetime value in Python; however, we still need to address the time zone issue. Currently our timestamps are stored as ‘native’ time, meaning that there is no time zone information stored. Next we will define a timezone for each timestamp and then convert them both to UTC:

# WE DEFINE THE TWO TIMZEONES FOR OUR DATATYPES
# NOTE: ‘ARIZONA’ TIMEZONE IS ESSENTIALLY MOUNTAIN TIME WITHOUT DAYLIGHT SAVINGS TIME
tz_eastern = pytz.timezone('US/Eastern')
tz_mountain = pytz.timezone('US/Arizona')

# WE CONVERT THE LOCAL TIMESTAMPS TO UTC
ny_date_utc = tz_eastern.localize(ny_date, is_dst=True).astimezone(pytz.utc)
az_date_utc = tz_mountain.localize(az_date, is_dst=False).astimezone(pytz.utc)

# WE PRINT CHECK THE OUTPUT, NOTE THAT THE TIMEZONE OF +0 IS ALSO NOW RECORDED
print(ny_date_utc)
>>> 2015-04-11 03:12:17.540000+00:00
print(az_date_utc)
>>> 2015-04-11 03:12:26.720000+00:00

Now we have both timestamps recorded in UTC. In this sample code we manually inputted the timestamps as text strings and then simply printed the results to a terminal screen. An example of a real-world application would be to leverage the functions above to read in raw data from a database for both logs, process the timestamps into UTC and then write the corrected log entries into a new table containing a combined event log. This combined log could then be subjected to further analytics.

Useful time functions

With timestamps successfully imported, there are several useful time functions that can be used to further analyze the data. Among the most useful are time arithmetic functions that can be used to measure the difference between two timestamps or add/subtract a defined period of time to a timestamp.

As an example, let’s find the time difference between the two timestamps imported above:

# WE COMPARE THE DIFFERENCE IN TIME BETWEEN THE TWO TIMESTAMPS
timeDiff = (az_date_utc - ny_date_utc)
print(timeDiff)
>>> 0:00:09.180000

The raw output here reads a time difference of 9 seconds and 18 milliseconds. Python can also represent this in rounded integer form for a specified time measurement. For example:

# WE OUTPUT THE ABOVE AS AN INTEGER IN SECONDS
print(timeDiff.seconds)
>>> 9

This shows us that the time difference between the two timestamps is 9 seconds. Such functions can be useful for quickly calculating the duration of events in an event log. For example, the total duration of a process could be quickly calculated by comparing the difference between the earliest and latest timestamp for a case within a dataset.

These date arithmetic functions can also be used to add or subtract defined periods of time to a timestamp. Such functions can be useful when manually adding events to an event log. For example, the event log may record the start time of an automated process, but not the end time. We may know that the step in question takes 147 seconds to complete (or this length may be recorded in a separate log). We can generate a timestamp for the end of the step by adding 147 seconds to the timestamp for the start of the step:

# WE ADD 147 SECONDS TO OUR TIMESTAMP AND THEN OUTPUT THE NEW RESULT
az_date_utc_end = az_date_utc + datetime.timedelta(seconds=147)
print(az_date_utc_end)
>>> 2015-04-11 03:14:53.720000+00:00

Understanding the meaning of timestamps in your dataset

Having the data cleaned up and ready for analysis is clearly important, but equally important is understanding what data you have and what it means. Particularly for data sets that have a global geographic scope, it is crucial to first determine how timestamps have been represented in the data. Relative to timestamps in your event logs some key questions you should be asking are:

Conclusion

While this piece was hardly an exhaustive look at programmatically handling timestamps, hopefully you’ve been able to see how some simple code is able to deal with the more common challenges faced by a data scientist working with timestamp data. By combining the concepts described above with a database it is possible to write an automated script to quickly ingest a range of complex event logs from different systems and output one standardized log in UTC. From there, the process mining opportunities are endless.


Nicholas Hartman

Nicholas Hartman is a data scientist and director at CKM Advisors in New York City. He was also a speaker at Process Mining Camp 2014 and his team won the BPI Challenge last year.

More information is available at www.ckmadvisors.com



  1. Note that in Disco you configure the timestamp pattern to fit the data (rather than having to provide the data in a specific format) and you can actually import merged data sets from different sources with different timestamp patterns: Just make sure they are in different columns, so that you can configure their formats independently.  

Comments (4)

[…] Are we on time? – Working with timestamps in event logs (spoiler alert: UTC is your friend) […]

Hi Anne,
how it is possible to import the unix time in your time-stamp pattern?
cheers
Lucy

Hi Lucy,

The unix time can be matched by the following pattern: SSSSSSSSSS.

Best,
Anne

[…] a guest article earlier this year, Nick talked about what a pain timestamps are in the data preparation […]


Leave a reply