:wq - blog » search
http://writequit.org/blog
Tu fui, ego eris

Context searching using Clojure-OpenNLP
http://writequit.org/blog/2010/03/09/context-searching-using-clojure-opennlp/
Tue, 09 Mar 2010 18:31:00 +0000

This is an addon to my previous post, “Natural Language Processing in Clojure with clojure-opennlp”. If you’re unfamiliar with NLP or the clojure-opennlp library, please read the previous post first.

In that post, I told you I was going to use clojure-opennlp to enhance a search engine, so let’s start with the problem!

The Problem

(Ideas first, code coming later. Be patient.)

So, let’s say you have a sentence:

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

I took this sentence out of a NYTimes article about the recent Toyota recalls. Just pretend it’s an interesting sentence.

Let’s say that you wanted to search for things containing the word “brake”. Well, that’s easy for a computer to do, right? But let’s say you want to search for things around the word “brake”. Wouldn’t that make the search better? You would be able to find a lot more articles/pages/books/whatever you might be interested in, instead of only the ones that contain the key word itself.

So, in a naïve (for a computer) pass for words around the key word, we come up with the words right next to it (“when”, “the”, “pedal”, “is”):

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

Right. Well. That really isn’t helpful. It’s not like the words “when”, “the” or “is” are going to help us find other things related to this topic. We got lucky with “pedal”: that will definitely help find things that are interesting but might not actually have the word “brake” in them.

What we really need is something that can pick words out of the sentence that we can also search for. Almost any human could do this trivially, so here’s what I’d probably pick:
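A minimal sketch of that naive pass, written in Python to keep it free of dependencies. The function name `naive_context` and the window of two words on each side are assumptions made for illustration, not code from this post:

```python
import re

def naive_context(sentence, term, window=2):
    """Return the words within `window` positions of the first occurrence of `term`."""
    words = re.findall(r"[A-Za-z']+", sentence)
    lowered = [w.lower() for w in words]
    if term.lower() not in lowered:
        return []
    i = lowered.index(term.lower())
    before = words[max(0, i - window):i]   # up to `window` words to the left
    after = words[i + 1:i + 1 + window]    # up to `window` words to the right
    return before + after

sentence = ("The override system is meant to deactivate the accelerator "
            "when the brake pedal is pressed.")
print(naive_context(sentence, "brake"))
# prints ['when', 'the', 'pedal', 'is']
```

Three of the four neighbors are filler words, which is exactly the weakness described above.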

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

See the words in red? Those are the important terms I’d probably search for if I were trying to find more information about this topic. Notice anything special about them? Turns out, they’re all nouns or verbs. This is where NLP (Natural Language Processing) comes into play. Given a sentence like the above, we can break it into its parts by doing what’s called POS-Tagging (POS == Part Of Speech):

["The" "DT"] ["override" "NN"] ["system" "NN"] ["is" "VBZ"] ["meant" "VBN"] ["to" "TO"] ["deactivate" "VB"] ["the" "DT"] ["accelerator" "NN"] ["when" "WRB"] ["the" "DT"] ["brake" "NN"] ["pedal" "NN"] ["is" "VBZ"] ["pressed" "VBN"] ["." "."]

So next to each word, we can see what part of speech that word belongs to. Things starting with “NN” are nouns, things starting with “VB” are verbs. Doing this is non-trivial. (A full list of tags can be found here). That’s why there are software libraries written by people smarter than me for doing this sort of thing (*cough* opennlp *cough*). I’m just writing the Clojure wrappers for the library.

Anyway, back to the problem. What I need to do is strip out all the non-noun and non-verb words in the sentence, which leaves us with this:

["override" "NN"] ["system" "NN"] ["is" "VBZ"] ["meant" "VBN"] ["deactivate" "VB"] ["accelerator" "NN"] ["brake" "NN"] ["pedal" "NN"] ["is" "VBZ"] ["pressed" "VBN"]

We’re getting closer, right? Now, searching for things like “is” probably won’t help, so let’s strip out all the words with fewer than 3 characters:

["override" "NN"] ["system" "NN"] ["meant" "VBN"] ["deactivate" "VB"] ["accelerator" "NN"] ["brake" "NN"] ["pedal" "NN"] ["pressed" "VBN"]

Now we get to a point where we have to make some kind of decision about what terms to use. For this project, I decided to weight the terms by how near they are to the original search term, only taking the two closest words in each direction. After scoring this sentence you get:

["override" 0] ["system" 0] ["meant" 0] ["deactivate" 0.25] ["accelerator" 0.5] ["brake" 1] ["pedal" 0.5] ["pressed" 0.25]

The number next to each word indicates how heavily we’ll weight that word when we use it for subsequent searches. The score is divided by 2 for each unit of distance away from the original key term, and words more than two positions away score 0. Not perfect, but it works pretty well.
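The filtering and scoring steps above can be sketched in a few lines. This is a Python illustration, not the Clojure code from this post: the helper names `keep_candidates` and `score_words`, and the choice to score only against the first occurrence of the term, are assumptions. The tagged input is the POS output shown earlier.

```python
def keep_candidates(tagged, min_len=3):
    """Keep nouns (NN*) and verbs (VB*) with at least `min_len` characters."""
    return [word for word, tag in tagged
            if tag.startswith(("NN", "VB")) and len(word) >= min_len]

def score_words(term, words, window=2):
    """Pair each word with 1 / 2**distance from `term`, or 0 beyond `window`."""
    if term not in words:
        return [(word, 0.0) for word in words]
    center = words.index(term)  # assumption: only the first occurrence counts
    scored = []
    for i, word in enumerate(words):
        distance = abs(i - center)
        scored.append((word, 0.5 ** distance if distance <= window else 0.0))
    return scored

tagged = [("The", "DT"), ("override", "NN"), ("system", "NN"), ("is", "VBZ"),
          ("meant", "VBN"), ("to", "TO"), ("deactivate", "VB"), ("the", "DT"),
          ("accelerator", "NN"), ("when", "WRB"), ("the", "DT"), ("brake", "NN"),
          ("pedal", "NN"), ("is", "VBZ"), ("pressed", "VBN"), (".", ".")]

words = keep_candidates(tagged)
for word, score in score_words("brake", words):
    print(word, score)
# deactivate 0.25, accelerator 0.5, brake 1.0, pedal 0.5, pressed 0.25; the rest 0.0
```

Note that distance is counted over the filtered list, not the original sentence, which is why “accelerator” ends up right next to “brake”.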

That was pretty easy to follow, right? Now imagine having a large body of text, and a term. First you’d generate a list of key words from sentences directly containing the term, score them each and store them in a big map. On a second pass, you can start scoring each sentence by using the map of words => scores you’ve already built. In this way, you can then rank sentences, including those that don’t even contain the actual word you’ve searched for, but are still relevant to the original term. Now, my friend, you have context searching.
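Here is a rough Python sketch of those two passes. Everything in it is illustrative: the function names are made up, keeping the highest score when a word shows up in more than one sentence is an assumption about how the map is merged, and the second list of scored words is a guess at what the brake sentence about the Obama administration would yield under the rules above. With those assumptions, the sentence scores agree with the output shown further down.

```python
import re

def merge_scores(score_lists):
    """First pass: fold per-sentence (word, score) lists into one word => score map."""
    merged = {}
    for scored in score_lists:
        for word, score in scored:
            if score > 0:
                # assumption: keep the highest score seen for a word
                merged[word] = max(score, merged.get(word, 0.0))
    return merged

def score_sentence(sentence, scores):
    """Second pass: sum the scores of every known word in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sum(scores.get(token, 0.0) for token in tokens)

def rank_sentences(sentences, scores):
    """Return (sentence, score) pairs, highest score first."""
    pairs = [(s, score_sentence(s, scores)) for s in sentences]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

scores = merge_scores([
    [("deactivate", 0.25), ("accelerator", 0.5), ("brake", 1.0),
     ("pedal", 0.5), ("pressed", 0.25)],
    # hypothetical scores for a second sentence containing "brake"
    [("automobiles", 0.25), ("contain", 0.5), ("brake", 1.0),
     ("override", 0.5), ("system", 0.25)],
])

sentences = [
    "That will let the driver stop safely even if the cars throttle sticks open.",
    'Often called a "smart pedal," the feature is already found on many automobiles.',
    "The override system is meant to deactivate the accelerator when the brake pedal is pressed.",
]
for sentence, score in rank_sentences(sentences, scores):
    print(score, sentence)
```

The “smart pedal” sentence never mentions “brake”, yet it still scores 0.75 because “pedal” and “automobiles” are in the map.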

Congratulations, that was a really long explanation. It ended up being a bit more like pseudocode, right? (Or at least an idea of how to construct a program.) Hopefully, after understanding the previous explanation, the code should be easy to read.

The Code

(note: this is very rough code, I know it’s not entirely idiomatic and has a ton of room for enhancement and parallelization, feel free to suggest improvements in the comments however!)

The most important functions to check out are the (score-words [term words]) function, which returns a list of vectors of words and their scores, and the (get-scored-terms ) function, which returns a map of words as keys and scores as values for the entire text, given the initial term.

Here’s the output from the last few lines:

contextfinder=> (pprint (reverse (sort-by second (score-sentences mytext scorewords))))
(["The override system is meant to deactivate the accelerator when the brake pedal is pressed. " 13/4]
["The Obama administration is considering requiring all automobiles to contain a brake override system intended to prevent sudden acceleration episodes like those that have led to the recall of millions of Toyotas, the Transportation secretary, Ray LaHood, said Tuesday. " 5/2]
["Often called a \"smart pedal,\" the feature is already found on many automobiles sold worldwide, including models from BMW, Chrysler, Mercedes-Benz, Nissan and Volkswagen." 3/4]
["That will let the driver stop safely even if the cars throttle sticks open. " 0])

The score for each sentence follows its text. See how sentences such as the third in the list have a score, but don’t contain the word “brake”? Those would have been missed entirely without this kind of searching.

contextfinder=> (println (score-text mytext scorewords))
13/2

Score a whole block of text, self-explanatory.

Anyway, I’ve already spent hours writing the explanation for why you’d want to do this sort of searching, so I think I’ll save the explanation of the code for another post. Hopefully it’s clear enough to stand on its own. If you find this interesting, I encourage you to check out the clojure-opennlp bindings and start building other cool linguistic tools!

Installing HyperEstraier Ruby bindings with Ruby 1.9.1 (RC1)
http://writequit.org/blog/2009/01/09/installing-hyperestraier-ruby-bindings-with-ruby-191-rc1/
Fri, 09 Jan 2009 19:13:40 +0000

Lately I’ve been doing a lot of coding around an awesome database called Hyper Estraier, which allows me to create large inverted-index databases to search over very quickly.

I have also been playing around with the new Ruby RC release for version 1.9.1. In an effort to get some current code running, I found that Hyper Estraier’s native Ruby bindings don’t compile correctly under 1.9.1. You’ll see this error when attempting to compile:

~/hyperestraier-1.4.13/rubynative$ make
( cd src && if ! [ -f Makefile ] ; then /usr/local/bin/ruby extconf.rb ; fi )
checking for estraier.h... yes
checking for main() in -lestraier... yes
creating Makefile
( cd src && make )
make[1]: Entering directory `/home/hinmanm/hyperestraier-1.4.13/rubynative/src'
gcc -I. -I/usr/local/include/ruby-1.9.1/i686-linux -I/usr/local/include/ruby-1.9.1/ruby/backward -I/usr/local/include/ruby-1.9.1 -I. -DHAVE_ESTRAIER_H  -D_FILE_OFFSET_BITS=64  -fPIC -I. -I.. -I../.. -I/usr/include/estraier  -I/usr/include/qdbm -Wall  -O2 -g -Wall -Wno-parentheses -O3 -fomit-frame-pointer -fforce-addr  -o estraier.o -c estraier.c
estraier.c: In function ‘doc_make_snippet’:
estraier.c:354: error: ‘struct RArray’ has no member named ‘len’
estraier.c: In function ‘db_search_meta’:
estraier.c:767: error: ‘struct RArray’ has no member named ‘len’
estraier.c: In function ‘objtocblist’:
estraier.c:1192: error: ‘struct RArray’ has no member named ‘len’
estraier.c:1195: error: ‘struct RString’ has no member named ‘ptr’
estraier.c:1195: error: ‘struct RString’ has no member named ‘len’
estraier.c: In function ‘objtocbmap’:
estraier.c:1221: error: ‘struct RArray’ has no member named ‘len’
estraier.c:1227: error: ‘struct RString’ has no member named ‘ptr’
estraier.c:1227: error: ‘struct RString’ has no member named ‘len’
estraier.c:1228: error: ‘struct RString’ has no member named ‘ptr’
estraier.c:1228: error: ‘struct RString’ has no member named ‘len’
make[1]: *** [estraier.o] Error 1
make[1]: Leaving directory `/home/hinmanm/hyperestraier-1.4.13/rubynative/src'
make: *** [all] Error 2

This error comes from the not-so-great practice of reaching into the structs directly with RSTRING(foo)->len and RARRAY(bar)->ptr, which no longer works in Ruby 1.9.1. The correct way to fix this is to make these changes:

RSTRING(foo)->len and RSTRING(foo)->ptr
becomes:
RSTRING_LEN(foo) and RSTRING_PTR(foo)

RARRAY(bar)->len and RARRAY(bar)->ptr
becomes:
RARRAY_LEN(bar) and RARRAY_PTR(bar)
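For code with many such accesses, the substitution can be scripted. The following is a hedged Python sketch, not the patch linked from this post: the function name `use_accessor_macros` and the sample C line are made up, and the pattern only handles macro arguments that contain no parentheses of their own, so nested calls would still need fixing by hand.

```python
import re

# Matches RSTRING(x)->len, RSTRING(x)->ptr, RARRAY(x)->len and RARRAY(x)->ptr,
# where x contains no parentheses of its own.
MEMBER_ACCESS = re.compile(r"\b(RSTRING|RARRAY)\(([^()]*)\)\s*->\s*(len|ptr)\b")

def use_accessor_macros(c_source):
    """Rewrite direct struct member access into the *_LEN / *_PTR macros."""
    def replace(match):
        kind, arg, field = match.groups()
        return f"{kind}_{field.upper()}({arg})"
    return MEMBER_ACCESS.sub(replace, c_source)

# hypothetical line of C, only for demonstration
line = "for(i = 0; i < RARRAY(list)->len; i++) puts(RSTRING(str)->ptr);"
print(use_accessor_macros(line))
# prints for(i = 0; i < RARRAY_LEN(list); i++) puts(RSTRING_PTR(str));
```

Review the result with a diff before compiling, since a regex cannot see every way the macros might be used.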

For HyperEstraier, this is a pretty quick fix. There’s a link to the patch at the bottom of this post. After downloading it, simply do:

[4:hinmanm@dagger:~]% cd hyperestraier-1.4.13/rubynative/src
[4:hinmanm@dagger:~/hyperestraier-1.4.13/rubynative/src]% patch < ~/hyperestraier-ruby191.patch
patching file estraier.c
[4:hinmanm@dagger:~/hyperestraier-1.4.13/rubynative/src]% cd ../
[4:hinmanm@dagger:~/hyperestraier-1.4.13/rubynative]% ./configure && make && sudo make install

And Hyper Estraier should be installed and ready to use in ruby 1.9.1 RC1! Enjoy!

Download the patch: [hyperestraier-ruby191.patch]
