24 Nov 2012

A first failed attempt at Natural Language Processing

One of the things I find fascinating about dating websites is that people’s profiles are almost identical, so I thought it would be an interesting exercise to grab some of the free text that people write about themselves and prove the similarity.

I’d been talking to Matt Biddulph about some Natural Language Processing (NLP) stuff he’d been working on and he wrote up a bunch of libraries, articles and books that he’d found useful.

I started out by plugging the text into one of the many NLP libraries that Matt listed with the vague idea that it would come back with something useful.

I’m not sure exactly what I was expecting the result to be, but after five or six hours of playing around with different libraries I’d got nowhere, so I parked the problem without really knowing where I’d gone wrong.

Last week I came across a paper titled "That’s What She Said: Double Entendre Identification" whose authors wanted to work out when a sentence could legitimately be followed by the phrase "that’s what she said".

While the subject matter is a bit risqué, I found the way the authors went about solving their problem very interesting, and reading about it helped me see some of the mistakes I’d made.

Vague problem statement

Unfortunately I didn’t do a good job of working out exactly what problem I wanted to solve - my problem statement was too general.

In the paper the authors narrowed down their problem space by focusing on a specific set of words which are typically used as double entendres and then worked out the sentence structure that the targeted sentences were likely to have.

Instead of defining my problem more specifically, I plugged the text into MALLET, morpha-stemmer and Stanford CoreNLP and tried to cluster the most popular words.

That didn’t really work because people use slightly different words to describe the same thing, so I ended up looking at Yawni, a wrapper around WordNet, which groups words into sets of cognitive synonyms (synsets).
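To illustrate why synonym sets help here, the sketch below collapses different words for the same thing into one label before counting. The synonym sets are hand-written and hypothetical, standing in for what WordNet would provide; none of the names come from Yawni or any other library.

```python
from collections import Counter

# Hypothetical synonym sets, hand-written for illustration.
# WordNet provides real ones; these are NOT taken from it.
SYNONYM_SETS = [
    {"film", "movie", "cinema"},
    {"holiday", "vacation", "trip"},
    {"funny", "humorous", "witty"},
]

# Map every word to one label: the alphabetically first member of its set.
CANONICAL = {word: min(group) for group in SYNONYM_SETS for word in group}

def normalize(words):
    """Replace each word with its group's label; unknown words pass through."""
    return [CANONICAL.get(w, w) for w in words]

# Hypothetical words pulled from profiles.
words = ["movie", "film", "witty", "vacation", "trip", "cooking"]
counts = Counter(normalize(words))
print(counts.most_common())
```

Without the normalisation step "movie" and "film" would each be counted once; with it they are counted together under a single label.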

In hindsight a more successful approach might have been to find the common words that people tend to use in these types of profiles and then work from there.
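A minimal sketch of that starting point might look something like this: count how many profiles each word appears in, ignoring a few filler words. The sample profiles and the stop word list are invented for illustration, not real data.

```python
import re
from collections import Counter

# Hypothetical sample profiles, invented for illustration only.
profiles = [
    "I love travelling, good food and going to the cinema.",
    "Love to travel and cook good food for friends.",
    "I enjoy the cinema, travelling and meeting friends.",
]

# A tiny hand-picked stop word list; a real one would be much longer.
STOP_WORDS = {"i", "and", "to", "the", "for", "a"}

def tokenize(text):
    """Lowercase the text and return its words, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def document_frequency(texts):
    """Count how many profiles each word appears in (not total occurrences)."""
    counts = Counter()
    for text in texts:
        counts.update(set(tokenize(text)))
    return counts

common = document_frequency(profiles)
print(common.most_common(5))
```

Note that "travel" and "travelling" are still counted separately here, which is the same slightly-different-words problem described above; stemming or synonym sets would be the next step.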

No theory

I recently wrote about how I’ve been learning about neural networks by switching between theory and practice, but with NLP I didn’t bother reading any of the theory and thought I could get away with plugging some data into one of the libraries.

I now realise that was a mistake as I didn’t know what to do when the libraries didn’t work as I’d hoped because I wasn’t sure what they were supposed to be doing in the first place!

My next step should probably be to understand how text gets converted into vectors, then move on to tf-idf and see if I have a better idea of how to solve my problem.
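As a rough idea of what those two steps involve, here is a small sketch that turns tokenized text into tf-idf vectors using only the standard library. It uses one common textbook variant (term frequency as count divided by document length, inverse document frequency as log(N / document frequency)); libraries often apply smoothing, so their numbers will differ. The documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized profiles, invented for illustration.
docs = [
    ["love", "travel", "food"],
    ["love", "food", "cinema"],
    ["love", "running", "running"],
]

def tf_idf_vectors(documents):
    """Return (vocabulary, vectors), one vector per document.

    tf  = occurrences of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    vocab = sorted({term for doc in documents for term in doc})
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, vectors

vocab, vectors = tf_idf_vectors(docs)
print(vocab)
for vec in vectors:
    print([round(x, 3) for x in vec])
```

In this toy data "love" appears in every document, so its weight is zero everywhere, while "running" gets the highest weight because it is both frequent within its document and absent from the others.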

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.