NOTE: I am not a linguist, please feel free to correct me in the comments if I use the wrong term!
From Wikipedia:
Natural Language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. Natural language generation systems convert information from computer databases into readable human language. Natural language understanding systems convert samples of human language into more formal representations such as parse trees or first-order logic structures that are easier for computer programs to manipulate. Many problems within NLP apply to both generation and understanding; for example, a computer must be able to model morphology (the structure of words) in order to understand an English sentence, and a model of morphology is also needed for producing a grammatically correct English sentence.
Clojure-opennlp is a library to interface with the OpenNLP (Open Natural Language Processing) library of functions, which provide linguistic tools to perform on various blocks of text. Once a linguistic interpretation of text is possible, a lot of really interesting applications present themselves. Let’s jump right in!
Basic Example usage (from a REPL)
(use 'clojure.contrib.pprint) ; just for this example
(use 'opennlp.nlp) ; make sure opennlp.jar is in your classpath
You will need to make the processing functions using the model files. These assume you’re running from the root project directory of the git repository (where some models are included). You can also download the model files from the opennlp project at http://opennlp.sourceforge.net/models/
user=> (def get-sentences (make-sentence-detector "models/EnglishSD.bin.gz"))
user=> (def tokenize (make-tokenizer "models/EnglishTok.bin.gz"))
user=> (def pos-tag (make-pos-tagger "models/tag.bin.gz"))
For name-finders in particular, it’s possible to have multiple model files:
user=> (def name-find (make-name-finder "models/namefind/person.bin.gz" "models/namefind/organization.bin.gz"))
The (make-<whateverizer> "modelfile.bin.gz")
functions return functions that perform the linguistic offering. I decided to have them return functions so multiple methods doing the same sort of action could be created with different model files (perhaps different language models and such) without having the pass the model file every time you wanted to process some text.
After creating the utility methods, we can use the functions to perform operations on text. For instance, since we defined the sentence-detector as ‘get-sentences’, we can us that method to split text by sentences:
user=> (pprint (get-sentences "First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea..."))
["First sentence. ", "Second sentence? ", "Here is another one. ",
"And so on and so forth - you get the idea..."]
nil
Or split a sentence into tokens using the tokenize function:
user=> (pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on",
"Friday"]
nil
Once we have a sequence of tokens, we can do what’s called POS Tagging. POS Tagging takes a list of words from only one sentence and applies an algorithms (using the morphology model) to determine what kind of tag to apply to each word:
user=> (pprint (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday.")))
(["Mr." "NNP"]
["Smith" "NNP"]
["gave" "VBD"]
["a" "DT"]
["car" "NN"]
["to" "TO"]
["his" "PRP$"]
["son" "NN"]
["on" "IN"]
["Friday." "NNP"])
nil
You can check out a list of all the tags if you want to know what they stand for.
The clojure-opennlp library also features a name finder, however it is extremely rudimentary at this point and won’t detect all names:
user=> (name-find (tokenize "My name is Lee, not John."))
("Lee" "John")
Filters
In the library, I also provide some simple filters that can be used to pare down a list of pos-tagged tokens using regular expressions. There are some preset filters available, as well as a macro for generating your own filters:
(use 'opennlp.tools.filters)
user=> (pprint (nouns (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["Mr." "NNP"]
["Smith" "NNP"]
["car" "NN"]
["son" "NN"]
["Friday" "NNP"])
nil
user=> (pprint (verbs (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["gave" "VBD"])
nil
Creating your own filter:
user=> (pos-filter determiners #"^DT")
#'user/determiners
user=> (doc determiners)
-------------------------
user/determiners
([elements__52__auto__])
Given a list of pos-tagged elements, return only the determiners in a list.
nil
user=> (pprint (determiners (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["a" "DT"])
nil
Check out the filters.clj file for a full list of out-of-the-box filters.
That’s about all there is in the library at the moment, so I hope that made sense. Unfortunately clojars.org does not provide a nice way to public documentation for a library, so the documentation in this post and on the github page will have to do for now.
This library is available on clojars for inclusion in leiningen projects, or on github if you’re interested in the source. This is a fairly new project, and not all OpenNLP features are exposed at the moment so feedback is definitely encouraged. In the next post I’ll explain an in-depth example of how these functions can be used to enhance a searching engine. EDIT: It’s up! Check out “Context searching using clojure-opennlp.”
UPDATE: Hiredman has let me know that the jar on clojars is missing the 3 dependencies used for the library. I’m busy working on writing pom.xml’s for the jars so I can upload them to clojars as dependencies. In the meantime, make sure you have the 3 jars in the lib directory (of the github project) in your classpath. Feel free to report any other issues on the github tracker or in the comments.
UPDATE 2: I fixed the project.clj file and pushed new versions of opennlp.jar and the dependency jars. A regular ‘lein deps’ should work now.
Today in the Intertweets (March 8th Ed) | disclojure: all things clojure wrote:
[…] new blog post about my library: Natural Language Processing in #Clojure with clojure-opennlp (here, via @thnetos) — Grab a text, break it into sentences, parse the words and tag them. […]
Link | March 9th, 2010 at 12:37 am
Context searching using Clojure-OpenNLP : :wq – blog wrote:
[…] is an addon to my previous post, “Natural Language Processing in Clojure with clojure-opennlp“. If you’re unfamiliar with NLP or the clojure-opennlp library, please read the […]
Link | March 9th, 2010 at 11:31 am
Clojure – Destillat #5 | duetsch.info - Open Source, Wet-, Web-, Software wrote:
[…] Natural Language Processing in Clojure with clojure-opennlp […]
Link | March 16th, 2010 at 3:13 am
Jörn Kottmann wrote:
Hi,
nice to see that you use OpenNLP. I am one of the maintainers.
There is actually a maven OpenNLP repository:
OpenNLPRepository http://opennlp.sourceforge.net/maven2
All the dependencies are also in the repo.
You can add the OpenNLP dependency with:
opennlp
opennlp-tools
1.4.3
Jörn
Link | March 24th, 2010 at 3:01 pm