Unix: Summing the total time from a log file
As I mentioned in my last post we’ve been doing some profiling of a data ingestion job and as a result have been putting some logging into our code to try and work out where we need to work on.
We end up with a log file peppered with different statements which looks a bit like the following:
18:50:08.086 [akka:event-driven:dispatcher:global-5] DEBUG - Imported document. /Users/mneedham/foo.xml in: 1298
18:50:09.064 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo2.xml in: 798
18:50:09.712 [akka:event-driven:dispatcher:global-4] DEBUG - Imported document. /Users/mneedham/foo3.xml in: 298
18:50:10.336 [akka:event-driven:dispatcher:global-3] DEBUG - Imported document. /Users/mneedham/foo4.xml in: 898
18:50:10.982 [akka:event-driven:dispatcher:global-1] DEBUG - Imported document. /Users/mneedham/foo5.xml in: 12298
I can never quite tell which column I need to get so end up doing some exploration with awk like this to find out:
$ cat foo.log | awk ' { print $9 }'
1298
798
298
898
12298
Once we’ve worked out the column then we can add them together like this:
$ cat foo.log | awk ' { total+=$9 } END { print total }'
15590
I think that’s much better than trying to determine the total run time in the application and printing it out to the log file.
We can also calculate other stats if we record a log entry for each record:
$ cat foo.log | awk ' { total+=$9; number+=1 } END { print total/number }'
3118
$ cat foo.log | awk 'min=="" || $9 < min {min=$9; minline=$0}; END{ print min}'
298
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.