11 Feb 2011

Sed: Extended regular expressions

Irfan and I were looking at how to do some text substitution in a text file this afternoon and turned to sed to help us in our quest.

He had originally used grep to find what he wanted to replace on each line, using a regular expression to match one or more digits:

cat the_file.txt | grep "[0-9]\+"

That works pretty well, but since I knew how to do the substitution in sed, we needed to convert the regular expression to work with sed.

We started off with just trying to print the lines which matched the regular expression:

cat the_file.txt | sed -n '/[0-9]\+/p'

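This prints nothing: sed uses basic regular expressions by default, and in the sed we were using '+' is not available in that mode as a way to match one or more digits. (GNU sed accepts '\+' in basic mode as an extension, so the result depends on which sed you have installed.)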
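If you would rather stay in basic mode, POSIX basic regular expressions can still express "one or more" without '+'. The sketch below is only an illustration: the sample lines are made up and stand in for the_file.txt, and the variable names are hypothetical.

```shell
# Hypothetical sample data standing in for the_file.txt
sample=$(printf 'order 42\nno digits here\nitem 7\n')

# \{1,\} is the basic-regex spelling of "one or more of the preceding item"
bre_interval=$(printf '%s\n' "$sample" | sed -n '/[0-9]\{1,\}/p')

# Equivalent: one digit followed by zero or more further digits
bre_star=$(printf '%s\n' "$sample" | sed -n '/[0-9][0-9]*/p')

# Both print the two lines that contain digits
printf '%s\n' "$bre_interval"
printf '%s\n' "$bre_star"
```

Both forms should behave the same in any POSIX sed, which makes them a reasonable fallback when you cannot rely on '-E'.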

grep on the other hand…

Grep understands two different versions of regular expression syntax: "basic" and "extended." In GNU grep, there is no difference in available functionality using either syntax. In other implementations, basic regular expressions are less powerful.

To get sed to accept extended metacharacters we need to pass it the '-E' flag, which also means that we no longer need to escape the '+':

cat the_file.txt | sed -nE '/[0-9]+/p'
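With the pattern matching, the substitution itself is a small step further. This is a minimal sketch rather than what we actually ran: the sample lines and the replacement text 'NUM' are assumptions made for illustration.

```shell
# Hypothetical sample data standing in for the_file.txt
sample=$(printf 'order 42\nno digits here\nitem 7\n')

# s/pattern/replacement/g replaces every run of digits on each line;
# -E lets us write + without a backslash
replaced=$(printf '%s\n' "$sample" | sed -E 's/[0-9]+/NUM/g')

printf '%s\n' "$replaced"
# order NUM
# no digits here
# item NUM
```

Lines without a match pass through unchanged because there is no '-n' here, so sed prints every line after attempting the substitution.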

From what I understand, the following metacharacters can also only be used unescaped in extended mode:

? - for matching zero or one occurrence of a regular expression
| - for matching either the preceding or following regular expression
() - grouping regular expressions
{n,m} - for matching between n and m occurrences of the preceding regular expression
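The metacharacters listed above can each be tried out with a one-liner. The inputs below are invented purely for illustration, and the variable names are hypothetical.

```shell
# ? : the "u" is optional, so both spellings match but "colr" does not
optional=$(printf 'color\ncolour\ncolr\n' | sed -nE '/colou?r/p')

# | : match either alternative
either=$(printf 'cat\ndog\nbird\n' | sed -nE '/cat|dog/p')

# () : group "ab" so that + applies to the whole group
grouped=$(printf 'abab\n' | sed -E 's/(ab)+/X/')

# {n,m} : lines made up of exactly two or three digits
ranged=$(printf '1\n12\n1234\n' | sed -nE '/^[0-9]{2,3}$/p')

printf '%s\n' "$optional" "$either" "$grouped" "$ranged"
```

In basic mode, sed would treat these characters as literals unless they are escaped, so the same patterns without '-E' would not match in the same way.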

I’m told that you can use grep to do substitution as well but I haven’t figured out how exactly you do that yet.
