26 Jun 2013

Unix/awk: Extracting substring using a regular expression with capture groups

A couple of years ago I wrote a blog post explaining how I’d used GNU awk to extract story numbers from git commit messages and I wanted to do a similar thing today to extract some node ids from a file.

My eventual solution looked like this:

$ echo "mark #1000" | gawk '{ match($0, /#([0-9]+)/, arr); if(arr[1] != "") print arr[1] }'
1000

But in the comments an alternative approach was suggested which used the Mac version of awk and the RSTART and RLENGTH global variables which get set when a match is found:

$ echo "mark #1000" | awk 'match($0, /#[0-9]+/) { print substr( $0, RSTART, RLENGTH )}'
#1000

Unfortunately Mac awk doesn’t seem to capture groups so as you can see it includes the # character which we don’t actually want.

In this instance it wasn’t such a big deal but it was more annoying for the node id extraction that I was trying to do:

$ head -n 5 log.txt
Command[27716, Node[7825340,used=true,rel=14547348,prop=31734662]]
Command[27716, Node[7825341,used=true,rel=14547349,prop=31734665]]
Command[27716, Node[7825342,used=true,rel=14547350,prop=31734668]]
Command[27716, Node[7825343,used=true,rel=14547351,prop=31734671]]

$ head -n 5 log.txt | awk 'match($0, /Node\[([^,]+)/) { print substr( $0, RSTART, RLENGTH )}'
Node[7825340
Node[7825341
Node[7825342
Node[7825343
Node[7825336

I ended up having to brew install gawk and using a variation of the gawk command I mentioned at the beginning of this post:

$ head -n 5 log.txt | gawk 'match($0, /Node\[([^,]+)/, arr) { print arr[1]}'
7825340
7825341
7825342
7825343
7825336

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.