8 Jun 2014

Ruby: Regex - Matching the Trademark ™ character

I’ve been playing around with some World Cup data, and while cleaning it up I wanted to extract the year and host country of each World Cup.

I started with a string like this which I was reading from a file:

1930 FIFA World Cup Uruguay ™

And I wanted to be able to extract just the 'Uruguay' bit without getting the trademark symbol or the space preceding it. I initially tried the following to match the whole line and capture my bit:

p text.match(/\d{4} FIFA World Cup (.*?) ™/)[1]

Unfortunately that doesn’t actually parse:

tm.rb:4: syntax error, unexpected $end, expecting ')'
p text.match(/\d{4} FIFA World Cup (.*?) ™/)[1]
                                           ^

I was initially able to work around the problem by matching the Unicode code point instead:

p text.match(/\d{4} FIFA World Cup (.*?) \u2122/)[1]
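
For reference, here is a minimal self-contained sketch of that workaround. The sample string is hard-coded as an assumption (in my case it came from a file), and the nil check is an extra I've added so that a line that doesn't match won't raise an error:

```ruby
# Hypothetical sample line; in the post this is read from a file.
# "\u2122" is the escape for the trademark sign, so the source stays pure ASCII.
text = "1930 FIFA World Cup Uruguay \u2122"

# The lazy group captures the host; the literal space and \u2122 that follow
# keep the trailing " ™" out of the capture.
match = text.match(/\d{4} FIFA World Cup (.*?) \u2122/)

# match is nil when the line has a different shape, so guard before indexing.
host = match && match[1]
p host
```

Because the ™ character never appears literally in the file, this version should parse whatever the file's source encoding is.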

While working on this blog post I also remembered that you can specify the source encoding of your Ruby file. In Ruby 1.9 the default is US-ASCII, which would explain why it doesn’t like the ™ character (from Ruby 2.0 onwards the default is UTF-8).

If we add the following line at the top of the file then we can happily use the ™ character in our regex:

# encoding: utf-8
# ...
p text.match(/\d{4} FIFA World Cup (.*?) ™/)[1]
# returns "Uruguay"
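
Since I was after the year as well as the host country, a slightly fuller sketch might capture both in one go. The hard-coded sample string and the variable names `year` and `host` are assumptions for illustration only:

```ruby
# encoding: utf-8
# The magic comment above has to be the first line of the file
# (or the second, directly after a shebang line) to take effect.

# Hypothetical sample line; in the post this is read from a file.
text = "1930 FIFA World Cup Uruguay ™"

# Two capture groups: the four-digit year and the host country.
# The literal " ™" at the end keeps the symbol out of the second group.
match = text.match(/(\d{4}) FIFA World Cup (.*?) ™/)

# captures returns the groups in order; skip the assignment if nothing matched.
year, host = match.captures if match

p year
p host
```

On a Ruby version where UTF-8 is already the default source encoding the magic comment is redundant but harmless.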

This post therefore ends up being more of a reminder for future Mark when he comes across this problem again having forgotten about Ruby character sets!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.