NOTE: I am not a linguist, so please feel free to correct me in the comments if I use the wrong term!
Natural Language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. Natural language generation systems convert information from computer databases into readable human language. Natural language understanding systems convert samples of human language into more formal representations such as parse trees or first-order logic structures that are easier for computer programs to manipulate. Many problems within NLP apply to both generation and understanding; for example, a computer must be able to model morphology (the structure of words) in order to understand an English sentence, and a model of morphology is also needed for producing a grammatically correct English sentence.
Clojure-opennlp is a library that wraps the OpenNLP (Open Natural Language Processing) library, which provides linguistic tools for processing blocks of text. Once a linguistic interpretation of text is possible, a lot of really interesting applications present themselves. Let’s jump right in!
Basic Example usage (from a REPL)
(use 'clojure.contrib.pprint) ; just for this example
(use 'opennlp.nlp) ; make sure opennlp.jar is in your classpath
You will need to create the processing functions from model files. The examples below assume you’re running from the root project directory of the git repository (where some models are included). You can also download the model files from the OpenNLP project at http://opennlp.sourceforge.net/models/
user=> (def get-sentences (make-sentence-detector "models/EnglishSD.bin.gz"))
user=> (def tokenize (make-tokenizer "models/EnglishTok.bin.gz"))
user=> (def pos-tag (make-pos-tagger "models/tag.bin.gz"))
For name-finders in particular, it’s possible to have multiple model files:
user=> (def name-find (make-name-finder "models/namefind/person.bin.gz" "models/namefind/organization.bin.gz"))
The (make-<whateverizer> "modelfile.bin.gz") functions return functions that perform the corresponding linguistic operation. I decided to have them return functions so that multiple functions doing the same sort of action could be created with different model files (perhaps different language models and such) without having to pass the model file every time you wanted to process some text.
After creating the utility functions, we can use them to perform operations on text. For instance, since we defined the sentence detector as ‘get-sentences’, we can use that function to split text into sentences:
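To make the "function that returns a function" idea concrete, here is a minimal sketch in plain Clojure. It does not use OpenNLP at all: make-splitter is a hypothetical stand-in for a make-<whateverizer> function, and its "model" is just a regular expression rather than a model file. The point is only that the configuration is captured once and the returned function then takes nothing but text.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for a make-<whateverizer> function. The "model" is
;; only a regex here, not an OpenNLP model file.
(defn make-splitter
  "Return a function that splits text on the given delimiter regex."
  [delimiter]
  (fn [text]
    (str/split text delimiter)))

;; Two functions doing the same kind of work, each bound to its own "model".
(def split-on-commas (make-splitter #",\s*"))
(def split-on-whitespace (make-splitter #"\s+"))

(split-on-commas "one, two,three")
(split-on-whitespace "one two  three")
```

Both calls return the same three strings, but neither caller has to know which delimiter was chosen when the function was created.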
user=> (pprint (get-sentences "First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea..."))
["First sentence. ", "Second sentence? ", "Here is another one. ",
"And so on and so forth - you get the idea..."]
Or split a sentence into tokens using the tokenize function:
user=> (pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on",
 "Friday"]
Once we have a sequence of tokens, we can do what’s called POS (part-of-speech) tagging. POS tagging takes a list of words from only one sentence and applies an algorithm (using the morphology model) to determine what kind of tag to apply to each word:
user=> (pprint (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday.")))
You can check out a list of all the tags if you want to know what they stand for.
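Because the tagger expects the words of a single sentence, longer text should be split into sentences first and each sentence tokenized and tagged on its own. The sketch below shows one way to wire that up. The helper tag-sentences is hypothetical (it is not part of the library), and the toy-* functions are crude regex stand-ins so the example runs without model files. I am also assuming that tagged output is a sequence of [token tag] pairs.

```clojure
;; Hypothetical helper: takes the three processing functions as arguments so
;; the sketch does not depend on model files being present.
(defn tag-sentences
  "Split text into sentences, then tokenize and tag each sentence separately."
  [get-sentences tokenize pos-tag text]
  (map (comp pos-tag tokenize) (get-sentences text)))

;; Toy stand-ins so the sketch runs on its own. They are NOT OpenNLP models.
(defn toy-sentences [text] (re-seq #"[^.?!]+[.?!]+" text))
(defn toy-tokenize [sentence] (re-seq #"\w+|[^\w\s]" sentence))
;; Assumed output shape: one [token tag] pair per token, with a dummy tag.
(defn toy-tag [tokens] (map (fn [token] [token "UNK"]) tokens))

(tag-sentences toy-sentences toy-tokenize toy-tag "First one. Second one?")
```

With the real functions defined earlier, the call would presumably look like (tag-sentences get-sentences tokenize pos-tag some-text).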
The clojure-opennlp library also features a name finder, but it is extremely rudimentary at this point and won’t detect all names:
user=> (name-find (tokenize "My name is Lee, not John."))
In the library, I also provide some simple filters that can be used to pare down a list of pos-tagged tokens using regular expressions. There are some preset filters available, as well as a macro for generating your own filters:
user=> (pprint (nouns (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
user=> (pprint (verbs (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
Creating your own filter:
user=> (pos-filter determiners #"^DT")
user=> (doc determiners)
Given a list of pos-tagged elements, return only the determiners in a list.
user=> (pprint (determiners (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
Check out the filters.clj file for a full list of out-of-the-box filters.
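If you are curious how a filter-generating macro like this can work, here is a rough sketch. It is not the library's code: def-tag-filter is a hypothetical name, the sample data is hand-written rather than real tagger output, and I am assuming each tagged element is a [token tag] pair. The macro simply defines a function that keeps the pairs whose tag matches the regular expression.

```clojure
;; Hypothetical re-implementation for illustration only; see filters.clj for
;; the library's real pos-filter macro.
(defmacro def-tag-filter
  "Define a function named fname that keeps only the [token tag] pairs
  whose tag matches tag-regex."
  [fname tag-regex]
  `(defn ~fname
     ~(str "Given a list of pos-tagged elements, return only the "
           fname " in a list.")
     [elements#]
     (filter (fn [pair#] (re-find ~tag-regex (second pair#))) elements#)))

(def-tag-filter determiners #"^DT")

;; Hand-written sample in the assumed [token tag] shape, not tagger output.
(def sample-tagged [["a" "DT"] ["car" "NN"] ["his" "PRP$"] ["the" "DT"]])

(determiners sample-tagged)
```

Only the two pairs tagged DT survive the filter. Using a macro rather than a plain function means the generated function gets its own name and docstring, which is why (doc determiners) has something useful to print.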
That’s about all there is in the library at the moment, so I hope that made sense. Unfortunately clojars.org does not provide a nice way to publish documentation for a library, so the documentation in this post and on the github page will have to do for now.
This library is available on clojars for inclusion in leiningen projects, or on github if you’re interested in the source. This is a fairly new project, and not all OpenNLP features are exposed at the moment, so feedback is definitely encouraged. In the next post I’ll walk through an in-depth example of how these functions can be used to enhance a search engine. EDIT: It’s up! Check out “Context searching using clojure-opennlp.”
UPDATE: Hiredman has let me know that the jar on clojars is missing the 3 dependencies used for the library. I’m busy working on writing pom.xml’s for the jars so I can upload them to clojars as dependencies. In the meantime, make sure you have the 3 jars in the lib directory (of the github project) in your classpath. Feel free to report any other issues on the github tracker or in the comments.
UPDATE 2: I fixed the project.clj file and pushed new versions of opennlp.jar and the dependency jars. A regular ‘lein deps’ should work now.