Context searching using Clojure-OpenNLP

March 9, 2010

This is an add-on to my previous post, “Natural Language Processing in Clojure with clojure-opennlp”. If you’re unfamiliar with NLP or the clojure-opennlp library, please read the previous post first.

In that post, I told you I was going to use clojure-opennlp to enhance a search engine, so let’s start with the problem!

The Problem

(Ideas first, code coming later. Be patient.)

So, let’s say you have a sentence:

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

I took this sentence out of a NYTimes article about the recent Toyota recalls. Just pretend it’s an interesting sentence.

Let’s say that you wanted to search for things containing the word “brake”. Well, that’s easy for a computer to do, right? But let’s say you want to search for things around the word “brake”; wouldn’t that make the search better? You would be able to find a lot more articles/pages/books/whatever you might be interested in, instead of only the ones that contain the key word.

So, in a naïve (for a computer) pass for words around the key word, taking the two words on either side of “brake” (“when the” and “pedal is”), we come up with this:

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

Right. Well. That really isn’t helpful; it’s not like the words “when”, “the” or “is” are going to help us find other things related to this topic. We got lucky with “pedal”; that will definitely help find things that are interesting but might not actually have the word “brake” in them.

What we really need is something that can pick words out of the sentence that we can also search for. Almost any human could do this trivially, so here’s what I’d probably pick:
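
As a rough illustration of that naive pass, here is a minimal sketch in plain Clojure. The function name naive-context and the simple \w+ tokenization are assumptions made for this example; they are not part of clojure-opennlp.

```clojure
(def sentence
  "The override system is meant to deactivate the accelerator when the brake pedal is pressed.")

(defn naive-context
  "Returns up to n words on each side of the first occurrence of term,
  or nil when term does not occur. Hypothetical helper for illustration."
  [text term n]
  (let [words (vec (re-seq #"\w+" text))
        idx   (.indexOf words term)]
    (when-not (neg? idx)
      (concat (subvec words (max 0 (- idx n)) idx)
              (subvec words (inc idx) (min (count words) (+ idx n 1)))))))

(naive-context sentence "brake" 2)
;; => ("when" "the" "pedal" "is")
```

Three of the four words it returns are filler, which is exactly the weakness of picking neighbours by position alone.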

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

See the words in red? Those are the important terms I’d probably search for if I were trying to find more information about this topic. Notice anything special about them? Turns out, they’re all nouns or verbs. This is where NLP (Natural Language Processing) comes into play: given a sentence like the one above, we can break it into its parts by doing what’s called POS tagging (POS = part of speech):

["The" "DT"] ["override" "NN"] ["system" "NN"] ["is" "VBZ"] ["meant" "VBN"] ["to" "TO"] ["deactivate" "VB"] ["the" "DT"] ["accelerator" "NN"] ["when" "WRB"] ["the" "DT"] ["brake" "NN"] ["pedal" "NN"] ["is" "VBZ"] ["pressed" "VBN"] ["." "."]

So next to each word, we can see what part of speech that word belongs to. Tags starting with “NN” are nouns; tags starting with “VB” are verbs. (A full list of tags can be found here). Doing this is non-trivial. That’s why there are software libraries written by people smarter than me for doing this sort of thing (*cough* opennlp *cough*). I’m just writing the Clojure wrappers for the library.

Anyway, back to the problem. What I need to do is strip out all the non-noun and non-verb words in the sentence, which leaves us with this:

["override" "NN"] ["system" "NN"] ["is" "VBZ"] ["meant" "VBN"] ["deactivate" "VB"] ["accelerator" "NN"] ["brake" "NN"] ["pedal" "NN"] ["is" "VBZ"] ["pressed" "VBN"]

We’re getting closer, right? Now, searching for things like “is” probably won’t help, so let’s strip out all the words with fewer than 3 characters:

["override" "NN"] ["system" "NN"] ["meant" "VBN"] ["deactivate" "VB"] ["accelerator" "NN"] ["brake" "NN"] ["pedal" "NN"] ["pressed" "VBN"]
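
The two filtering steps can be sketched in plain Clojure, starting from the tagged pairs shown earlier. The helper names noun-or-verb? and long-enough? are made up for this example, and the tagged data is typed in by hand rather than produced by the tagger.

```clojure
;; POS-tagged sentence, copied from the tagger output shown above.
(def tagged
  [["The" "DT"] ["override" "NN"] ["system" "NN"] ["is" "VBZ"]
   ["meant" "VBN"] ["to" "TO"] ["deactivate" "VB"] ["the" "DT"]
   ["accelerator" "NN"] ["when" "WRB"] ["the" "DT"] ["brake" "NN"]
   ["pedal" "NN"] ["is" "VBZ"] ["pressed" "VBN"] ["." "."]])

(defn noun-or-verb?
  "True when the tag of a [word tag] pair starts with NN or VB."
  [[_ tag]]
  (boolean (re-find #"^(NN|VB)" tag)))

(defn long-enough?
  "True when the word of a [word tag] pair has at least 3 characters."
  [[word _]]
  (>= (count word) 3))

(def keywords
  (->> tagged
       (filter noun-or-verb?)
       (filter long-enough?)
       (map first)))

keywords
;; => ("override" "system" "meant" "deactivate" "accelerator" "brake" "pedal" "pressed")
```

Matching on the tag prefix rather than the exact tag keeps plural nouns (NNS) and inflected verbs (VBZ, VBN and so on) in the list.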

Now we get to a point where we have to make some kind of decision about what terms to use. For this project, I decided to weight terms by how near they are to the original search term, only taking the two closest words in each direction. After scoring this sentence you get:

["override" 0] ["system" 0] ["meant" 0] ["deactivate" 0.25] ["accelerator" 0.5] ["brake" 1] ["pedal" 0.5] ["pressed" 0.25]

The number next to each word indicates how heavily we’ll weight that word when we use it for subsequent searches. The score is halved for each unit of distance away from the original key term, and words more than two positions away score 0. Not perfect, but it works pretty well.
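
Here is a minimal sketch of that scoring rule in plain Clojure. The name score-by-distance and its max-distance parameter are hypothetical, chosen for this example. It returns exact ratios (1/4, 1/2) instead of the decimals shown above.

```clojure
(def keywords
  ["override" "system" "meant" "deactivate" "accelerator" "brake" "pedal" "pressed"])

(defn score-by-distance
  "Pairs each word with 1/2^d, where d is its distance from term in words.
  Words further than max-distance away, or any word when term is absent,
  score 0. Hypothetical helper for illustration."
  [term words max-distance]
  (let [words (vec words)
        idx   (.indexOf words term)]
    (map-indexed
      (fn [i word]
        (let [d (when-not (neg? idx) (max (- i idx) (- idx i)))]
          [word (if (and d (<= d max-distance))
                  (/ 1 (bit-shift-left 1 d))
                  0)]))
      words)))

(score-by-distance "brake" keywords 2)
;; => (["override" 0] ["system" 0] ["meant" 0] ["deactivate" 1/4]
;;     ["accelerator" 1/2] ["brake" 1] ["pedal" 1/2] ["pressed" 1/4])
```

Note that distance is counted over the filtered keyword list, not the raw sentence, which is why “deactivate” still gets a score even though several filler words separate it from “brake” in the original text.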

That was pretty easy to follow, right? Now imagine having a large body of text, and a term. First you’d generate a list of key words from sentences directly containing the term, score them each and store them in a big map. On a second pass, you can start scoring each sentence by using the map of words => scores you’ve already built. In this way, you can then rank sentences, including those that don’t even contain the actual word you’ve searched for, but are still relevant to the original term. Now, my friend, you have context searching.
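
The two passes can be sketched as follows. This is only an illustration under stated assumptions: the keyword lists are typed in by hand instead of coming from the tagger, all function names are hypothetical, and keeping the highest score when a word appears in several sentences is a choice made for this sketch.

```clojure
(require '[clojure.string :as str])

;; Hand-picked keywords (nouns and verbs of 3+ letters) from two article
;; sentences that contain "brake". The second list is an excerpt from
;; "...all automobiles to contain a brake override system intended to prevent..."
(def keyword-lists
  [["override" "system" "meant" "deactivate" "accelerator" "brake" "pedal" "pressed"]
   ["automobiles" "contain" "brake" "override" "system" "intended" "prevent"]])

(defn score-keywords
  "Maps each keyword to 1/2^d (d = distance from term), or 0 beyond 2.
  Assumes term occurs in words."
  [term words]
  (let [idx (.indexOf words term)]
    (into {}
          (map-indexed
            (fn [i word]
              (let [d (max (- i idx) (- idx i))]
                [word (if (<= d 2) (/ 1 (bit-shift-left 1 d)) 0)]))
            words))))

(defn build-score-map
  "First pass: merge the per-sentence scores, keeping the highest per word."
  [term lists]
  (apply merge-with max (map #(score-keywords term %) lists)))

(defn score-sentence
  "Second pass: sum the scores of every known word in the sentence."
  [score-map sentence]
  (reduce + (map #(get score-map % 0)
                 (re-seq #"\w+" (str/lower-case sentence)))))

(def score-map (build-score-map "brake" keyword-lists))

(def sentences
  ["The override system is meant to deactivate the accelerator when the brake pedal is pressed."
   "Often called a \"smart pedal,\" the feature is already found on many automobiles sold worldwide, including models from BMW, Chrysler, Mercedes-Benz, Nissan and Volkswagen."
   "That will let the driver stop safely even if the cars throttle sticks open."])

(def ranked
  (reverse (sort-by second
                    (map (fn [s] [s (score-sentence score-map s)]) sentences))))

(map second ranked)
;; => (13/4 3/4 0)
```

The second sentence never mentions “brake”, yet it still scores 3/4 because “pedal” and “automobiles” picked up weight during the first pass.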

Congratulations, that was a really long explanation. It ended up being a bit more like pseudocode, right? (Or at least an idea of how to construct a program.) Hopefully, after understanding the previous explanation, the code should be easy to read.

The Code

(note: this is very rough code, I know it’s not entirely idiomatic and has a ton of room for enhancement and parallelization, feel free to suggest improvements in the comments however!)

The most important functions to check out are (score-words [term words]), which returns a list of vectors of words and their scores, and (get-scored-terms ), which returns a map of words as keys and scores as values for the entire text, given the initial term.

Here’s the output from the last few lines:

contextfinder=> (pprint (reverse (sort-by second (score-sentences mytext scorewords))))
(["The override system is meant to deactivate the accelerator when the brake pedal is pressed. " 13/4]
["The Obama administration is considering requiring all automobiles to contain a brake override system intended to prevent sudden acceleration episodes like those that have led to the recall of millions of Toyotas, the Transportation secretary, Ray LaHood, said Tuesday. " 5/2]
["Often called a \"smart pedal,\" the feature is already found on many automobiles sold worldwide, including models from BMW, Chrysler, Mercedes-Benz, Nissan and Volkswagen." 3/4]
["That will let the driver stop safely even if the cars throttle sticks open. " 0])

The score for each sentence follows its text. See how sentences such as the third in the list have a score, but don’t contain the word “brake”? Those would have been missed entirely without this kind of searching.

contextfinder=> (println (score-text mytext scorewords))
13/2

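This scores a whole block of text; pretty self-explanatory. Note that 13/2 equals the sum of the four sentence scores above (13/4 + 5/2 + 3/4 + 0).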

Anyway, I’ve already spent hours writing the explanation for why you’d want to do this sort of searching, so I think I’ll save the explanation of the code for another post. Hopefully it’s clear enough to stand on its own. If you find this interesting, I encourage you to check out the clojure-opennlp bindings and start building other cool linguistic tools!

posted in clojure, nlp, opennlp, search by Lee

8 Comments to "Context searching using Clojure-OpenNLP"

  1. Natural Language Processing in Clojure with clojure-opennlp : :wq – blog wrote:

    […] This library is available on clojars for inclusion in leiningen projects, or on github if you’re interested in the source. This is a fairly new project, and not all OpenNLP features are exposed at the moment so feedback is definitely encouraged. In the next post I’ll explain an in-depth example of how these functions can be used to enhance a searching engine. EDIT: It’s up! Check out “Context searching using clojure-opennlp.” […]

  2. Context searching using Clojure-OpenNLP : :wq – blog « The other side of the firewall wrote:

    […] March 9, 2010 at 8:19 pm · Filed under Open source, Programming [From Context searching using Clojure-OpenNLP : :wq – blog] […]

  3. Today in the Intertweets (March 9th Ed) | disclojure: all things clojure wrote:

    […] searching with clojure-opennlp (here, via @thnetos) — A follow-up article to this one in which the author introduces […]

  4. Clojure – Destillat #5 | duetsch.info - Open Source, Wet-, Web-, Software wrote:

    […] Context searching using Clojure-OpenNLP […]

  5. ogrisel wrote:

    Very interesting post. You should combine your approach with the TF-IDF weightings of a Lucene index. Here is a new project I am working on that is very complementary to your work:

    http://code.google.com/p/iks-project/source/browse/sandbox/iks-autotagging/trunk/README.txt

    This is based on the MoreLikeThis similarity of Lucene:

    http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/search/similar/MoreLikeThis.html

    I was planning to use this to identify known named entities detected by the OpenNLP name finder by using a context of a few sentences around the span that contains the detected name.

  6. Jonathan Siegel wrote:

    Very cool project. Quick question… I’m using the EnglishSD.bin.gz from:
    http://opennlp.sourceforge.net/models/english/sentdetect/

    but get an error loading it:
    contextfinder=> (def get-sentences (make-sentence-detector "models/EnglishSD.bin.gz"))
    opennlp.tools.util.InvalidFormatException: Missing the manifest.properties! (NO_SOURCE_FILE:52)

    Is this a sign of deeper config issues or a missing library showing up? I’m in a lein repl that otherwise runs the clojure-nlp features. Any pointers appreciated–I’ve a strong LISP background, but debugging where clojure meets java is opaque ATM.

  7. Lee wrote:

    Jonathan: yeah, this article is a little bit out of date. Since I wrote it, the OpenNLP project has updated their models with the 1.5 release; for the 1.5 release you’ll need the models here: http://opennlp.sourceforge.net/models-1.5/

    Also, the latest documentation for the project will always be on the github page here: https://github.com/dakrone/clojure-opennlp which should always be up to date (even if my posts get a little stale). The updated version of this file can be found here: https://github.com/dakrone/clojure-opennlp/blob/master/examples/contextfinder.clj

    Thanks for the heads up. I’ll work on updating the post for changes in the new versions.

  8. Kiran Umadi wrote:

    Good explanation, helps a lot for beginners.

 