Context searching using Clojure-OpenNLP
http://writequit.org/blog/2010/03/09/context-searching-using-clojure-opennlp/
Tue, 09 Mar 2010

This is an add-on to my previous post, “Natural Language Processing in Clojure with clojure-opennlp”. If you’re unfamiliar with NLP or the clojure-opennlp library, please read the previous post first.

In that post, I told you I was going to use clojure-opennlp to enhance a searching engine, so, let’s start with the problem!

The Problem

(Ideas first, code coming later. Be patient.)

So, let’s say you have a sentence:

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

I took this sentence out of an NYTimes article about the recent Toyota recalls; just pretend it’s an interesting sentence.

Let’s say that you wanted to search for things containing the word “brake”; well, that’s easy for a computer to do, right? But let’s say you want to search for things around the word “brake”; wouldn’t that make the search better? You would be able to find a lot more articles/pages/books/whatever you might be interested in, instead of only ones that contain the key word itself.

So, in a naïve (for a computer) pass over the words around the key word, we can come up with this:

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

Right. Well. That really isn’t helpful; it’s not like the words “when”, “the” or “is” are going to help us find other things related to this topic. We got lucky with “pedal”: that will definitely help find interesting things that might not actually contain the word “brake”.

What we really need is something that can pick out words from the sentence that we can also search for. Almost any human could do this trivially, so here’s what I’d probably pick:

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

See the words in red? Those are the important terms I’d probably search for if I were trying to find more information about this topic. Notice anything special about them? It turns out they’re all nouns or verbs. This is where NLP (Natural Language Processing) comes into play: given a sentence like the one above, we can categorize each word into its part of speech by doing what’s called POS-tagging (POS == Parts Of Speech):

[“The” “DT”] [“override” “NN”] [“system” “NN”] [“is” “VBZ”] [“meant” “VBN”] [“to” “TO”] [“deactivate” “VB”] [“the” “DT”] [“accelerator” “NN”] [“when” “WRB”] [“the” “DT”] [“brake” “NN”] [“pedal” “NN”] [“is” “VBZ”] [“pressed” “VBN”] [“.” “.”]

So next to each word, we can see what part of speech that word belongs to. Things starting with “NN” are nouns, things starting with “VB” are verbs. Doing this is non-trivial. (A full list of tags can be found here). That’s why there are software libraries written by people smarter than me for doing this sort of thing (*cough* opennlp *cough*). I’m just writing the Clojure wrappers for the library.
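For the curious, here’s a minimal sketch of how that tagged output can be produced with clojure-opennlp (the model file paths assume the models bundled with the library’s git repository, as described in the previous post):

(use 'opennlp.nlp)

(def tokenize (make-tokenizer "models/EnglishTok.bin.gz"))
(def pos-tag (make-pos-tagger "models/tag.bin.gz"))

;; tokenize the sentence, then tag each token with its part of speech
(pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))
;; => (["The" "DT"] ["override" "NN"] ["system" "NN"] ...)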

Anyway, back to the problem. What I need to do is strip out all the non-noun and non-verb words in the sentence, which leaves us with this:

[“override” “NN”] [“system” “NN”] [“is” “VBZ”] [“meant” “VBN”] [“deactivate” “VB”] [“accelerator” “NN”] [“brake” “NN”] [“pedal” “NN”] [“is” “VBZ”] [“pressed” “VBN”]

We’re getting closer, right? Now, searching for things like “is” probably won’t help, so let’s strip out all the words with fewer than 3 characters:

[“override” “NN”] [“system” “NN”] [“meant” “VBN”] [“deactivate” “VB”] [“accelerator” “NN”] [“brake” “NN”] [“pedal” “NN”] [“pressed” “VBN”]
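Here’s a quick sketch of those two filtering steps (this isn’t necessarily how the project’s code does it, just one straightforward way of working on the tagged output from above):

;; keep only nouns and verbs (tags starting with NN or VB)
(defn nouns-and-verbs [tagged-tokens]
  (filter (fn [[_ tag]] (re-find #"^(NN|VB)" tag)) tagged-tokens))

;; drop words with fewer than 3 characters
(defn long-enough [tagged-tokens]
  (filter (fn [[word _]] (>= (count word) 3)) tagged-tokens))

(-> (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))
    nouns-and-verbs
    long-enough)
;; => (["override" "NN"] ["system" "NN"] ["meant" "VBN"] ["deactivate" "VB"]
;;     ["accelerator" "NN"] ["brake" "NN"] ["pedal" "NN"] ["pressed" "VBN"])

(The library also ships preset noun and verb filters in opennlp.tools.filters, covered in the previous post.)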

Now we get to the point where we have to make some kind of decision about which terms to use. For this project, I decided to weight the terms nearer to the original search term more heavily, taking only the two closest words in each direction. After scoring this sentence, you get:

[“override” 0] [“system” 0] [“meant” 0] [“deactivate” 0.25] [“accelerator” 0.5] [“brake” 1] [“pedal” 0.5] [“pressed” 0.25]

The number next to each word indicates how heavily we’ll weight it when we use it for subsequent searches. The score is divided by 2 for each unit of distance away from the original key term. Not perfect, but it works pretty well.
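Here’s a rough sketch of that scoring idea. I’ve used the score-words name mentioned in the code section below, but this is my reconstruction of the approach just described, not necessarily the project’s exact code; it uses exact ratios, which matches the fractional sentence scores you’ll see further down:

(defn score-word
  "Score a single word by its distance (in positions) from the key term:
  1 at the term itself, halved for each step away, 0 beyond two steps."
  [idx term-idx]
  (let [dist (Math/abs (- idx term-idx))]
    (if (> dist 2)
      0
      (/ 1 (bit-shift-left 1 dist)))))

(defn score-words
  "Given the key term and the filtered words of one sentence (in order),
  return [word score] pairs weighted by distance from the term."
  [term words]
  (let [term-idx (.indexOf (vec words) term)]
    (map-indexed (fn [idx word] [word (score-word idx term-idx)]) words)))

(score-words "brake"
             ["override" "system" "meant" "deactivate" "accelerator" "brake" "pedal" "pressed"])
;; => (["override" 0] ["system" 0] ["meant" 0] ["deactivate" 1/4]
;;     ["accelerator" 1/2] ["brake" 1] ["pedal" 1/2] ["pressed" 1/4])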

That was pretty easy to follow, right? Now imagine having a large body of text, and a term. First you’d generate a list of key words from sentences directly containing the term, score them each and store them in a big map. On a second pass, you can start scoring each sentence by using the map of words => scores you’ve already built. In this way, you can then rank sentences, including those that don’t even contain the actual word you’ve searched for, but are still relevant to the original term. Now, my friend, you have context searching.

Congratulations, that was a really long explanation. It ended up being a bit more like pseudocode, right? (Or at least an idea of how to construct a program.) Hopefully, after understanding the previous explanation, the code should be easy to read.

The Code

(Note: this is very rough code; I know it’s not entirely idiomatic and has a ton of room for enhancement and parallelization. Feel free to suggest improvements in the comments, however!)

The most important functions to check out are the (score-words [term words]) function, which returns a list of vectors of words and their scores, and the (get-scored-terms) function, which returns a map with words as keys and scores as values for the entire text, given the initial term.
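Here’s a rough sketch of how those pieces could fit together, building on the earlier sketches (again, a reconstruction of the approach rather than the actual contextfinder code; tokenize and pos-tag are the model-backed functions from the previous post):

;; sentence detector from the previous post
(def get-sentences (make-sentence-detector "models/EnglishSD.bin.gz"))

;; candidate words for one sentence: nouns and verbs of 3+ characters, in order
(defn candidate-words [sentence]
  (->> (pos-tag (tokenize sentence))
       nouns-and-verbs
       long-enough
       (map first)))

(defn get-scored-terms
  "Build a map of word => score from every sentence containing the term."
  [text term]
  (->> (get-sentences text)
       (filter #(re-find (re-pattern (str "(?i)\\b" term "\\b")) %))
       (mapcat #(score-words term (candidate-words %)))
       (filter (fn [[_ score]] (pos? score)))
       (reduce (fn [m [word score]] (merge-with max m {word score})) {})))

(defn score-sentences
  "Score every sentence in the text by summing the scores of its tokens."
  [text scorewords]
  (for [sentence (get-sentences text)]
    [sentence (reduce + 0 (map #(get scorewords % 0) (tokenize sentence)))]))

;; scorewords would be built with something like:
;; (def scorewords (get-scored-terms mytext "brake"))

The exact numbers in the output below won’t match this sketch, but the shape of the data (sentence/score pairs with fractional scores) is the same.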

Here’s the output from the last few lines, calling score-sentences on the example text:

contextfinder=> (pprint (reverse (sort-by second (score-sentences mytext scorewords))))
(["The override system is meant to deactivate the accelerator when the brake pedal is pressed. " 13/4]
["The Obama administration is considering requiring all automobiles to contain a brake override system intended to prevent sudden acceleration episodes like those that have led to the recall of millions of Toyotas, the Transportation secretary, Ray LaHood, said Tuesday. " 5/2]
["Often called a \"smart pedal,\" the feature is already found on many automobiles sold worldwide, including models from BMW, Chrysler, Mercedes-Benz, Nissan and Volkswagen." 3/4]
["That will let the driver stop safely even if the cars throttle sticks open. " 0])

Note the score at the end of each sentence; see how sentences such as the third in the list have a score but don’t contain the word “brake”? Those would have been missed entirely without this kind of searching.

contextfinder=> (println (score-text mytext scorewords))
13/2

Scoring a whole block of text is self-explanatory.
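Presumably score-text just sums the per-sentence scores; here’s a minimal sketch along those lines (note that the sentence scores above, 13/4 + 5/2 + 3/4 + 0, do add up to the 13/2 shown here):

(defn score-text
  "Total score for a block of text: the sum of its sentence scores."
  [text scorewords]
  (reduce + 0 (map second (score-sentences text scorewords))))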

Anyway, I’ve already spent hours writing the explanation for why you’d want to do this sort of searching, so I think I’ll save the explanation of the code for another post. Hopefully it’s clear enough to stand on its own. If you find this interesting, I encourage you to check out the clojure-opennlp bindings and start building other cool linguistic tools!

Natural Language Processing in Clojure with clojure-opennlp
http://writequit.org/blog/2010/03/08/natural-language-processing-in-clojure-with-clojure-opennlp/
Mon, 08 Mar 2010

NOTE: I am not a linguist, so please feel free to correct me in the comments if I use the wrong term!

From Wikipedia:

Natural Language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. Natural language generation systems convert information from computer databases into readable human language. Natural language understanding systems convert samples of human language into more formal representations such as parse trees or first-order logic structures that are easier for computer programs to manipulate. Many problems within NLP apply to both generation and understanding; for example, a computer must be able to model morphology (the structure of words) in order to understand an English sentence, and a model of morphology is also needed for producing a grammatically correct English sentence.

Clojure-opennlp is a library that interfaces with the OpenNLP (Open Natural Language Processing) library, which provides linguistic tools for operating on blocks of text. Once a linguistic interpretation of text is possible, a lot of really interesting applications present themselves. Let’s jump right in!

Basic Example usage (from a REPL)

(use 'clojure.contrib.pprint) ; just for this example
(use 'opennlp.nlp) ; make sure opennlp.jar is in your classpath

You will need to make the processing functions using the model files. These assume you’re running from the root project directory of the git repository (where some models are included). You can also download the model files from the opennlp project at http://opennlp.sourceforge.net/models/

user=> (def get-sentences (make-sentence-detector "models/EnglishSD.bin.gz"))
user=> (def tokenize (make-tokenizer "models/EnglishTok.bin.gz"))
user=> (def pos-tag (make-pos-tagger "models/tag.bin.gz"))

For name-finders in particular, it’s possible to have multiple model files:

user=> (def name-find (make-name-finder "models/namefind/person.bin.gz" "models/namefind/organization.bin.gz"))

The (make-<whateverizer> "modelfile.bin.gz") functions return functions that perform the corresponding linguistic operation. I decided to have them return functions so that multiple functions doing the same sort of action could be created with different model files (perhaps for different languages and such) without having to pass the model file every time you want to process some text.
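For example, you could build two sentence detectors backed by different models and use them side by side; the second model file below is hypothetical, since only the English models ship with the repository:

;; the English sentence-detector model ships with the git repository
(def get-sentences-en (make-sentence-detector "models/EnglishSD.bin.gz"))

;; a hypothetical second model, downloaded separately for another language
(def get-sentences-de (make-sentence-detector "models/GermanSD.bin.gz"))

;; each detector carries its own model, so there is no need to pass
;; the model file around when processing text
(get-sentences-en "First sentence. Second sentence.")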

After creating the utility functions, we can use them to perform operations on text. For instance, since we defined the sentence detector as ‘get-sentences’, we can use that function to split text into sentences:

user=> (pprint (get-sentences "First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea..."))
["First sentence. ", "Second sentence? ", "Here is another one. ",
"And so on and so forth - you get the idea..."]
nil

Or split a sentence into tokens using the tokenize function:

user=> (pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on",
"Friday"]
nil

Once we have a sequence of tokens, we can do what’s called POS Tagging. POS Tagging takes a list of words from a single sentence and applies an algorithm (using the morphology model) to determine what kind of tag to apply to each word:

user=> (pprint (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday.")))
(["Mr." "NNP"]
["Smith" "NNP"]
["gave" "VBD"]
["a" "DT"]
["car" "NN"]
["to" "TO"]
["his" "PRP$"]
["son" "NN"]
["on" "IN"]
["Friday." "NNP"])
nil

You can check out a list of all the tags if you want to know what they stand for.

The clojure-opennlp library also features a name finder; however, it is extremely rudimentary at this point and won’t detect all names:

user=> (name-find (tokenize "My name is Lee, not John."))
("Lee" "John")

Filters

In the library, I also provide some simple filters that can be used to pare down a list of pos-tagged tokens using regular expressions. There are some preset filters available, as well as a macro for generating your own filters:

(use 'opennlp.tools.filters)

user=> (pprint (nouns (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["Mr." "NNP"]
["Smith" "NNP"]
["car" "NN"]
["son" "NN"]
["Friday" "NNP"])
nil
user=> (pprint (verbs (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["gave" "VBD"])
nil

Creating your own filter:

user=> (pos-filter determiners #"^DT")
#'user/determiners
user=> (doc determiners)
-------------------------
user/determiners
([elements__52__auto__])
Given a list of pos-tagged elements, return only the determiners in a list.
nil
user=> (pprint (determiners (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["a" "DT"])
nil

Check out the filters.clj file for a full list of out-of-the-box filters.

That’s about all there is in the library at the moment, so I hope that made sense. Unfortunately, clojars.org does not provide a nice way to publish documentation for a library, so the documentation in this post and on the github page will have to do for now.

This library is available on clojars for inclusion in leiningen projects, or on github if you’re interested in the source. This is a fairly new project, and not all OpenNLP features are exposed at the moment, so feedback is definitely encouraged. In the next post I’ll explain an in-depth example of how these functions can be used to enhance a searching engine. EDIT: It’s up! Check out “Context searching using clojure-opennlp.”

UPDATE: Hiredman has let me know that the jar on clojars is missing the 3 dependencies used for the library. I’m busy working on writing pom.xml’s for the jars so I can upload them to clojars as dependencies. In the meantime, make sure you have the 3 jars in the lib directory (of the github project) in your classpath. Feel free to report any other issues on the github tracker or in the comments.

UPDATE 2: I fixed the project.clj file and pushed new versions of opennlp.jar and the dependency jars. A regular ‘lein deps’ should work now.
