Ruby: Regex - Matching the Trademark ™ character
I’ve been playing around with some World Cup data and while cleaning it up I wanted to pull out the year and host country for each World Cup.
I started with a string like this which I was reading from a file:
1930 FIFA World Cup Uruguay ™
And I wanted to be able to extract just the 'Uruguay' bit without getting the trademark or the space preceding it. I initially tried the following to match all parts of the line and extract my bit:
p text.match(/\d{4} FIFA World Cup (.*?) ™/)[1]
Unfortunately Ruby rejects that with a syntax error:
tm.rb:4: syntax error, unexpected $end, expecting ')'
p text.match(/\d{4} FIFA World Cup (.*?) ™/)[1]
^
I was initially able to work around the problem by matching the Unicode code point (U+2122) instead:
p text.match(/\d{4} FIFA World Cup (.*?) \u2122/)[1]
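For reference, here is a minimal self-contained sketch of that workaround. The sample string is hard-coded here rather than read from a file (that is my assumption for illustration), and the trademark sign is written as an escape in both the string and the regex so the source file itself stays pure ASCII:

```ruby
# Sample line, hard-coded for illustration; \u2122 is the trademark sign.
text = "1930 FIFA World Cup Uruguay \u2122"

# The lazy group (.*?) captures the host country; the trailing
# space and \u2122 keep the trademark sign out of the capture.
match = text.match(/\d{4} FIFA World Cup (.*?) \u2122/)
host = match[1]

p host
# prints "Uruguay"
```

Because no literal non-ASCII character appears in the file, this version should not depend on the file's source encoding.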
While working on this blog post I also remembered that you can specify the source encoding of your Ruby file. In Ruby 1.9 the default is US-ASCII (Ruby 2.0 and later default to UTF-8), which would explain why it doesn’t like the ™ character.
If we add the following line at the top of the file then we can happily use the ™ character in our regex:
# encoding: utf-8
# ...
p text.match(/\d{4} FIFA World Cup (.*?) ™/)[1]
# returns "Uruguay"
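If you want to check which source encoding Ruby is actually using for a file, `__ENCODING__` will tell you. The sketch below combines that with the magic comment; the hard-coded sample string and the named groups `year` and `host` are my own additions for illustration, not part of the original script:

```ruby
# encoding: utf-8
# The magic comment only takes effect on the first line of the file
# (or the second, if the first is a shebang).

# Report the source encoding Ruby is using for this file.
puts __ENCODING__

# Sample line, hard-coded for illustration.
text = "1930 FIFA World Cup Uruguay ™"

# Named groups (hypothetical names) pull out the year and the host together.
m = text.match(/(?<year>\d{4}) FIFA World Cup (?<host>.*?) ™/)
year = m[:year]
host = m[:host]

p [year, host]
# prints ["1930", "Uruguay"]
```

With the magic comment in place, the first line of output should be UTF-8 even on Ruby 1.9.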
This post therefore ends up being more of a reminder for future Mark when he comes across this problem again having forgotten about Ruby character sets!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.