12 Sep 2011

gawk: Getting story numbers from git commit messages

As I mentioned in my previous post, I’ve been writing a little application to create graphs based on our git repository history. For one of them we wanted to show which people had been working on which stories.

I needed a way to extract the story number from each git commit message and then store them all in a text file.

A typical commit message containing a story number might look like this:

Mark/Uday #689 some awesome scala refactoring

I couldn’t think of an easy way to do this with my current knowledge of sed or the Mac version of awk, but the match function (http://www.gnu.org/software/gawk/manual/gawk.html#index-g_t_0040code_007bmatch_0028_0029_007d-function-1373) of gawk (GNU awk) makes this really easy.

match(string, regexp [, array])

Search string for the longest, leftmost substring matched by the regular expression, regexp and return the character position, or index, at which that substring begins (one, if it starts at the beginning of string). If no match is found, return zero. ... If array is present, it is cleared, and then the zeroth element of array is set to the entire portion of string matched by regexp.
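To see what the return value and the array hold in practice, here is a small sketch that runs match on the example commit message from earlier. It assumes gawk is on the PATH, and describe_match is just a hypothetical helper name. It also relies on gawk filling the integer-indexed elements of the array with the text matched by each parenthesised group, so arr[1] should be the digits on their own.

```shell
# Hypothetical helper: print the match position, the whole match (arr[0])
# and the first parenthesised group (arr[1]) for one commit subject.
describe_match() {
  printf '%s\n' "$1" |
    gawk '{ pos = match($0, /#([0-9]+)/, arr); print pos, arr[0], arr[1] }'
}

describe_match 'Mark/Uday #689 some awesome scala refactoring'
# prints: 11 #689 689
```

The 11 is the position of the # character in the message, counting from one.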

The array argument is what I needed and it’s only available as a gawk extension according to the documentation.

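I ended up with the following command to extract the story numbers. Because the regular expression contains a parenthesised group, arr[1] holds just the digits that follow the #: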

git log --no-merges --pretty="format:%s" |
gawk '{ match($0, /#([0-9]+)/, arr); if(arr[1] != "") print arr[1] }'

I had to install gawk using ports on my Mac, but on Fedora the default installation of awk is already gawk.
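If installing gawk isn't an option, something similar looks possible with the standard two-argument form of match, which sets the RSTART and RLENGTH variables instead of filling an array. This is only a sketch under that assumption, not the command used above: extract_story_numbers is a made-up name, and the printf lines stand in for the git log output.

```shell
# Hypothetical helper: read commit subjects on stdin and print the
# story number from each line that contains one.
extract_story_numbers() {
  awk '{
    # Two-argument match() is standard awk: it sets RSTART and RLENGTH.
    if (match($0, /#[0-9]+/)) {
      # Skip the leading "#" and keep only the digits.
      print substr($0, RSTART + 1, RLENGTH - 1)
    }
  }'
}

# Sample subjects standing in for: git log --no-merges --pretty="format:%s"
printf '%s\n' \
  'Mark/Uday #689 some awesome scala refactoring' \
  'tidy up the build script' |
  extract_story_numbers
# prints: 689
```

In a real repository the printf would be replaced by the git log command, and the output could be redirected to a text file such as stories.txt (a hypothetical file name).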
