writequit (:wq)
Tu fui, ego eris
http://writequit.org/blog

Getting started with git-annex
Mon, 22 Dec 2014

I’ve written an article on getting started using git-annex; it’s a static page rather than a blog post, so you can check it out here:

http://writequit.org/articles/getting-started-with-git-annex.html

Or, if you’d like to check out the original .org file:

http://writequit.org/articles/getting-started-with-git-annex.org

Check it out!

Writing literate-programming style Elasticsearch shell scripts with Emacs
Wed, 30 Oct 2013

This is kind of a special article and, as such, has its own page, so here’s a link:

http://writequit.org/articles/literate-es-scripts.html

It’s an article about literate programming, Emacs, and Elasticsearch, if you couldn’t tell from the title. Check it out!

Subrosa: An IRC server written in Clojure
Tue, 22 Mar 2011

For the next couple of blog posts, I’ve decided to describe some of the software we use (and have developed) at Sonian. We have a few Clojure repos pushed to github whose documentation ranges from poor to nonexistent. I wanted to start with our main communication tool these days: our IRC server, Subrosa. We were originally using Skype to communicate, but we finally got fed up with its stability and usability problems and wanted to switch to something that gave us a little more control. Dan Larkin had been working on an IRC server we could use for hosting our main topic chats, and we switched from Skype in part for these reasons:

  • Logging (Skype’s logs are all in-process, so there was no easy way to aggregate and share logs if a person had never been in a room)
  • Topic chats – Since we’re a decentralized team, we split a lot of our work into discussion rooms specific to the story being worked on so we can chat about it while the work goes on. With Skype, you cannot join a room that you haven’t been invited to (if you took over a story for someone else, for example); with IRC we are free to drop in and out of rooms however we’d like (we trade off a little by having no private rooms, but it is a work server, not really for private conversations)
  • Full control over the server
  • A culture of many clients and configuration options – Each of us can connect to IRC however we’d like, set up notifications exactly as we’d like, and use the variety of tools already built around IRC (bouncers, etc.)

Switching has been very nice: we run an SSL-enabled IRC server where we do all of our development chat, and we haven’t looked back at Skype (which we still use, but no longer for the majority of our development chatter).

Below the covers

Let’s go a little more in depth into the Subrosa code. Subrosa is built on top of Netty to make use of asynchronous communication between the clients (called channels in the code) and the server. Netty uses a pipeline of handlers for each message (either sent or received). Each handler processes the message, then passes it up the pipeline until there are no more handlers. Received and sent messages are handled by ChannelUpstreamHandlers and ChannelDownstreamHandlers respectively, which are added to the pipeline in the add-irc-pipeline! function:
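To make that concrete, here is a minimal, hypothetical sketch of wiring up a Netty 3 pipeline from Clojure. This is not Subrosa’s actual add-irc-pipeline! (which wires up more stages), and the message-stage stand-in below is a simplified placeholder for the real dispatching handler:

(import '(org.jboss.netty.channel Channels SimpleChannelUpstreamHandler)
        '(org.jboss.netty.handler.codec.frame DelimiterBasedFrameDecoder Delimiters)
        '(org.jboss.netty.handler.codec.string StringDecoder StringEncoder))

;; stand-in for Subrosa's (message-stage #'message-handler): wrap a fn in an
;; upstream handler that gets called with the context and each decoded line
(defn message-stage [handler-fn]
  (proxy [SimpleChannelUpstreamHandler] []
    (messageReceived [ctx evt]
      (handler-fn ctx (.getMessage evt)))))

(defn add-irc-pipeline! []
  (doto (Channels/pipeline)
    ;; IRC is line-oriented, so frame the inbound byte stream on line delimiters
    (.addLast "framer" (DelimiterBasedFrameDecoder. 8192 (Delimiters/lineDelimiter)))
    ;; convert between bytes and strings in both directions
    (.addLast "decoder" (StringDecoder.))
    (.addLast "encoder" (StringEncoder.))
    ;; final stage: hand each decoded IRC line to the command dispatcher
    (.addLast "message" (message-stage
                          (fn [ctx line] (println "received:" line))))))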


The most interesting handler is (message-stage #'message-handler), which handles the commands sent once a user is connected. It calls the dispatch-message function in subrosa.commands, which handles any commands that come across the channel. The commands themselves live in commands.clj. Let’s look at probably the most common command, privmsg.

There are two types of commands, authorized and unauthorized (defcommand and defcommand* respectively). Authorized commands require the user to have sent the USER and NICK commands, as well as PASS (if required by the server). The bulk of the work in privmsg comes from helper functions that determine whether the message was sent to a room (room-for-name recipient) or a user (channel-for-nick recipient) and then send the command to the room or the specified user. Also of note are the hooks that are called manually (privmsg-room-hook and privmsg-nick-hook); these allow Subrosa to support plugins for additional features.

The database

Subrosa uses a custom in-memory database built around a ref to keep track of its users, channels and rooms. It acts like a key-value store for Clojure maps that allows for some interesting automatic indexing between fields. The database was previously built around clojure.contrib.datalog, but switched to a custom DB with the 0.7 version. Every time a user joins the server, a map is put in the DB mapping the Netty channel to the user’s nick (so that messages can be sent to that user’s nick). Rooms are also kept in the database to allow a message to be sent to each nick (therefore each channel) in a given room. Check out the tests for a quick look at how it’s used.
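To give a feel for the approach, here is a toy ref-backed store in the same spirit. This is only a sketch, not Subrosa’s actual database, which also maintains automatic indexes between fields:

;; a toy in-memory store: one ref holding a map of table -> key -> record
(def db (ref {}))

(defn put-record! [table key record]
  (dosync (alter db assoc-in [table key] record)))

(defn get-record [table key]
  (get-in @db [table key]))

(defn delete-record! [table key]
  (dosync (alter db update-in [table] dissoc key)))

;; e.g. remember which nick is connected on a given Netty channel:
;; (put-record! :users netty-channel {:nick "dakrone"})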

Hooks and Plugins

Subrosa uses a hooks system (manually invoked from each defcommand) to allow for supporting plugins. Hooks are actually how commands are implemented: the defcommand and defcommand* forms are macros that add a hook with the “cmd” tag. There is no limit to the number of hooks; at the time of this writing, there are 9 hooks:

  • privmsg-room-hook
  • privmsg-nick-hook
  • join-hook
  • part-hook
  • quit-hook
  • nick-hook
  • topic-hook
  • notice-room-hook
  • notice-nick-hook

Hooks can easily be added (there are no rules about them) and invoked by giving a tag, a hook name and a function to be called when the hook is run:

(add-hook ::myhooktag 'my-awesome-hook (fn [& args] (println "I got called!")))

And called using run-hook:

(run-hook 'my-awesome-hook "Subrosa" "is" "awesome")
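For reference, a hook registry consistent with those two calls can be as small as the following sketch (not Subrosa’s actual implementation):

(defonce hooks (atom {})) ; hook name -> {tag -> fn}

(defn add-hook [tag hook-name f]
  (swap! hooks assoc-in [hook-name tag] f))

(defn run-hook [hook-name & args]
  (doseq [[_tag f] (get @hooks hook-name)]
    (apply f args)))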

This hook mechanism is exactly how the logging plugin works (and the /catchup command added in my fork of Subrosa). Automatic logging solves our first main problem with Skype, which is that it is difficult to get conversation logs if you aren’t involved in the chat from the beginning.

That about wraps up a basic overview of Subrosa and how (some of) it works. I encourage you to check out the Subrosa github repo and peruse the code. Hopefully this inspires you to run it as your own server, or to fork the project on github and add more commands (it’s still missing a lot of RFC commands). Thanks (and pull requests) go to Dan Larkin for getting this project up and awesome (and for letting me write about it).

Dvorak for two weeks, a retrospective
Thu, 20 Jan 2011

Two weeks ago I switched to the Dvorak keyboard layout. The switch was motivated by a few different things:

Knowing real people that are using the Dvorak layout every day

I found out that at least 3 developers out of our group of about 11 use Dvorak. I had heard of many people on the internet who use it, but I had never met anyone in real life who did. Having people you actually know using it did help legitimize the switch in my brain.

Emacs usage and RSI

Having switched to Emacs for Clojure development in order to pair with people at the new job, I started developing what is known as “Emacs Pinky”. The pinky syndrome faded over time as I became more familiar with Emacs, but it made me think about RSI impacting me in the future. I decided that if I was going to change, I would need to change while I was under 30. So far I feel like there has been less stress on my hands than there was with Qwerty.

I also decided that if I was going to learn Emacs, I might as well gain the muscle memory with Dvorak the first time. This has been great: since I still have to think about Emacs commands, knowing them only as Dvorak muscle memory keeps my mind free of Qwerty/Dvorak contention.

Maybe typing faster in the future

Anyone that’s seen Phil type on his awesome keyboard pants knows how fast he types with Dvorak. It’s inspiring.

So, how has it been?

So far, Dvorak has been really cool. I’m 2 weeks (and a day) into the switch. Right now my Qwerty is completely shot, so I try very hard not to use it any more. I did a completely cold-turkey switch, which was hard at the beginning, but I’ve had some understanding coworkers who helped make it less painful (they would drive the pairing while I was just starting out). My speed with Dvorak right now is about 1/2 of my speed with Qwerty, but it gets better every day.

The hardest letters to type so far are ‘q’, ‘l’ and ‘g’. I’ve also been having trouble with [] being switched; I end up typing / every time I mean to type [ (like I said, getting better every day). Overall, I would still recommend the switch to anyone who types a lot: if you can take the initial hit, it’s worth it.

I recommend that if you do switch, switch like this:

  • Switch cold-turkey – switching half the time will leave you crazy.
  • Print out a Dvorak keyboard (or stick it on your screen persistently) until you memorize the layout.
  • Try not to be in a position where people have to watch you type – if you think it’s painful to type, imagine having to watch you type.
  • Don’t change any Emacs/Vim/whatever defaults (keybindings) – I couldn’t change them since we pair on each other’s computers, and the defaults are always the best to learn.

Swapping words with their synonyms for fun
Tue, 21 Sep 2010

I was explaining some of the NLP stuff I’ve been working on to a teacher-friend of mine, and he suggested the idea of taking an essay a student had written, running it through a tokenizer and swapping the verbs and nouns with their synonyms. Would the essay still mean essentially the same thing? Would it still even be recognizable? It sounded reasonably easy to do in an afternoon, so I figured I’d give it a shot.

I already had the tokenizing part, so I looked around and decided on Big Huge Thesaurus for their thesaurus API.

About an hour of fooling around later, and Syndicate was born. It’s dead simple to use and provides hours (well, minutes) of amusement.

To play with Syndicate, you’ll need to get an API key from Big Huge Thesaurus (they’re free).

$ git clone git://github.com/dakrone/syndicate.git
$ cd syndicate
$ (vim|emacs|mate) src/syndicate/core.clj
change "YOUR-API-KEY" on line 8 to be your API key from Big Huge Thesaurus
$ lein deps
$ lein repl
user=> (use 'syndicate.core)

From here, there are really only a few methods:

(replace-text "Google has lately found itself on the receiving end of criticism from privacy and transparency advocates. But with two new tools, Google is trying to convince them that the company is on their side. On Tuesday, Google will introduce a new tool called the Transparency Report, at google.com/transparencyreport/. It publishes where and when Internet traffic to Google sites is blocked, and the blockages are annotated with details when possible. For instance, the tool shows that YouTube has been blocked in Iran since the disputed presidential election in June 2009. The Transparency Report will also be the home for Google's government requests tool, a map that shows every time a government has asked Google to take down or hand over information, and what percentage of the time Google has complied. Google introduced it in April and updates it every six months. Government requests could be court orders to remove hateful content or a subpoena to pass along information about a Google user. The idea is to provide transparency, and we're hoping that transparency is a deterrent to censorship, said Niki Fenwick, a Google spokeswoman.")

Syndicate will return the text with the nouns and verbs replaced by random synonyms. The choice of synonyms can make the result somewhat humorous:

Google has lately set up itself on the receiving point in time of disapproval from seclusion and image somebody . But with two new creature , Google is trying to win over them that the social affair is on their opinion . On Tues , Google will set a new way called the transparentness news , at google .com/transparencyreport/ . It publishes where and when cyberspace give-and-take to Google parcel is blocked , and the block are annotated with info when possible . For natural event , the agency render that YouTube has been blocked in Iran since the disputed presidential selection in Gregorian calendar month 2009 . The ikon account will also be the location for Google 's politics subject matter creature , a representation that demonstrate every reading a politics has communicate Google to admit down or transfer over cognition , and what part of the moment search engine has complied . search engine introduced it in Gregorian calendar month and word it every six period of time . polity petition could be tribunal taxonomic category to vanish hateful contentedness or a judicial writ to go along along content about a Google mortal . The melodic theme is to ready transparence , and we 're cross that foil is a balk to security review , said Niki Fenwick , a search engine interpreter .

You can also play around with just replacing a single word:

user=> (replace-noun "car")
"automobile"
user=> (replace-verb "shout")
"outcry"

That’s all there is to it: just an amusing library for playing with text in < 100 lines. In the future I’ll add a method to specify an API key without changing the source.

Automatically fold Clojure methods in Vim
Wed, 11 Aug 2010

I’ve been trying to get this working for quite some time, and I finally have a semi-working function to fold Clojure methods automatically.

First, here’s a screenshot of what you can expect to see when using this:

[screenshot: folded Clojure code in Vim]

And here’s the code (stick this in your .vimrc):
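A fold expression along these lines does the job; treat it as a sketch of the approach (start a fold at each top-level def* form) rather than the exact function from this post:

" fold Clojure top-level definitions using a fold expression
function! GetClojureFold()
    if getline(v:lnum) =~ '^\s*(def'
        return ">1"
    elseif getline(v:lnum) =~ '^\s*$'
        return "-1"
    else
        return "="
    endif
endfunction

function! TurnOnClojureFolding()
    setlocal foldexpr=GetClojureFold()
    setlocal foldmethod=expr
endfunction

autocmd FileType clojure call TurnOnClojureFolding()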

That’s it, easy as pie; enjoy the folding for Clojure files.

There are a few bugs with this because I’ve never actually written a Vim plugin, so I kind of hacked my way through it until I got it working. I welcome any feedback from someone with vimscript development experience (I’m looking at you, Brian Carper ;))

Does Clojure really need another set of introduction slides?
Fri, 30 Apr 2010

Well, I guess it couldn’t hurt. I looked at quite a few when coming up with mine (I recently gave an ‘Introduction to Clojure’ tech talk to co-workers), so I figure they’ll be useful to someone in the future.

You can download the slides here, or check them out on slideshare.

I’ve also been testing the new 2.2 pre-release of VimClojure (actually using nailgun this time), and it is sweet so far.

How I develop Clojure with Vim
Mon, 15 Mar 2010

Recently Lau Jensen wrote a post talking about the features of Emacs and why it increases the productivity of Clojure programmers. While I don’t disagree that lisp programming in general benefits greatly from using Emacs as an editor, there are simply people who are too heavily invested in Vim (like myself) for things like viper-mode to work for them. So I thought I’d share how I do Clojure development with Vim and throw in my 2 cents.

The key (for me) to editing Clojure code in Vim is a combination of two plugins, VimClojure and slime.vim (see the associated blog post). One of the difficult things is that slime.vim doesn’t actually exist anywhere on vim.org’s list of scripts, so it has to be downloaded from the aforementioned blog post. Stick it in the ~/.vim/plugin directory to install it.

First, VimClojure. I tend not to use Nailgun at all; some people like it, I don’t. So instead of the regular install for VimClojure, I copy the files from the autoload, doc, ftdetect, ftplugin, indent and syntax folders over to their respective Vim folders, as shown below. If you think you’ll want the Nailgun functionality, you should use the installation instructions provided by Kotarak.
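For example, assuming you have unpacked the VimClojure archive into ~/Downloads/vimclojure (adjust the path to wherever you extracted it), the copy amounts to something like:

‹ ~ › : cd ~/Downloads/vimclojure
‹ vimclojure › : for d in autoload doc ftdetect ftplugin indent syntax; do mkdir -p ~/.vim/$d && cp -r $d/* ~/.vim/$d/; done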

Now, add the settings you need for VimClojure to your .vimrc:

" Settings for VimClojure
let g:clj_highlight_builtins=1      " Highlight Clojure's builtins
let g:clj_paren_rainbow=1           " Rainbow parentheses'!

I have to say, rainbow parentheses are one of the best features of VimClojure, making it easy to see exactly which parenthesis closes which form.

Now that VimClojure is set up, it’s time to set up the integration with Clojure’s REPL; to do that I use slime.vim. Slime.vim uses screen to send input from your editor to any window in a running screen session, so to get started we’ll have to start up a screen session. To make it easier, you can name the session so you don’t have to look up the pid; I’ll call this session “clojure”:

‹ ~ › : screen -S clojure

If you didn’t name your session, or forgot what you named it, you can use screen -ls to look up all the screen sessions you’ve started:

‹ ~ › : screen -ls
There are screens on:
41837.clojure (Attached)
8970.ttys000.Xanadu (Attached)
8990.ttys001.Xanadu (Attached)
9010.ttys002.Xanadu (Attached)
4 Sockets in /tmp/screens/S-hinmanm.

Now, start a REPL in the screen terminal window (use ‘clj’ or ‘lein repl’ or however you like to start a Clojure REPL). Next, open a Clojure file with Vim and highlight a block of code (slime.vim will automatically select a paragraph if your cursor is in the middle of something like a defn), then press Control-c + Control-c (Ctrl+c twice in a row). You should be prompted by Vim like this:

Enter the name of the screen session: if you named your session “clojure”, you’d enter “clojure”; if you didn’t name it, use the pid number from the output of ‘screen -ls’. Next, it will ask which window to send the output to:

If you’ve used screen before (and I’m assuming you have), this is the window number your REPL is running on. After you enter this information the plugin will send the paragraph/line of text to the REPL. From here on the session id and window will be cached, so hitting ctrl+c,ctrl+c again will immediately send whatever function the cursor is on to the REPL. You can also select a block of code using visual mode and use ctrl+c,ctrl+c to send everything selected to the REPL. If you used the wrong numbers, use ctrl+c,v (Control+c, then v) to have slime.vim ask you for the numbers again.

There you go: you now have a one-way pipe from your Vim editor to any kind of REPL (be it Clojure, Ruby or Python). Here are a couple of screenshots of the plugin in action:

I know this doesn’t even come close to the amount of integration Emacs has using SLIME, but for me this is exactly what I want out of a Clojure development environment: write some code and be able to easily send it to a REPL. Hopefully a Vim user or two out there will find this setup useful.

UPDATE: If you’re interested in my full Vim setup for some reason, you can check it out here.

UPDATE 2: Want to automagically fold Clojure methods when using Vim? Check out this post.

Context searching using Clojure-OpenNLP
Tue, 09 Mar 2010

This is an add-on to my previous post, “Natural Language Processing in Clojure with clojure-opennlp”. If you’re unfamiliar with NLP or the clojure-opennlp library, please read the previous post first.

In that post, I told you I was going to use clojure-opennlp to enhance a search engine, so let’s start with the problem!

The Problem

(Ideas first, code coming later. Be patient.)

So, let’s say you have a sentence:

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

I took this sentence out of an NYTimes article talking about the recent Toyota recalls; just pretend it’s an interesting sentence.

Let’s say that you wanted to search for things containing the word “brake”; well, that’s easy for a computer to do, right? But let’s say you want to search for things around the word “brake”; wouldn’t that make the search better? You would be able to find a lot more articles/pages/books/whatever you might be interested in, instead of only ones that contained the exact key word.

So, in a naïve (for a computer) pass for words around the key word, we can come up with this:

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

Right. Well. That really isn’t helpful; it’s not like the words “when”, “the” or “is” are going to help us find other things related to this topic. We got lucky with “pedal”: that will definitely help find things that are interesting but might not actually contain the word brake.

What we really need is something that can pick words out of the sentence that we can also search for. Almost any human could do this trivially, so here’s what I’d probably pick:

“The override system is meant to deactivate the accelerator when the brake pedal is pressed.”

See the words in red? Those are the important terms I’d probably search for if I were trying to find more information about this topic. Notice anything special about them? It turns out they’re all nouns or verbs. This is where NLP (Natural Language Processing) comes into play: given a sentence like the above, we can categorize it into its parts by doing what’s called POS-tagging (POS == Parts Of Speech):

[“The” “DT”] [“override” “NN”] [“system” “NN”] [“is” “VBZ”] [“meant” “VBN”] [“to” “TO”] [“deactivate” “VB”] [“the” “DT”] [“accelerator” “NN”] [“when” “WRB”] [“the” “DT”] [“brake” “NN”] [“pedal” “NN”] [“is” “VBZ”] [“pressed” “VBN”] [“.” “.”]

So next to each word, we can see what part of speech that word belongs to. Things starting with “NN” are nouns, things starting with “VB” are verbs. Doing this is non-trivial. (A full list of tags can be found here). That’s why there are software libraries written by people smarter than me for doing this sort of thing (*cough* opennlp *cough*). I’m just writing the Clojure wrappers for the library.

Anyway, back to the problem. What I need to do is strip out all the non-noun and non-verb words in the sentence, which leaves us with this:

[“override” “NN”] [“system” “NN”] [“is” “VBZ”] [“meant” “VBN”] [“deactivate” “VB”] [“accelerator” “NN”] [“brake” “NN”] [“pedal” “NN”] [“is” “VBZ”] [“pressed” “VBN”]

We’re getting closer, right? Now, searching for things like “is” probably won’t help, so let’s strip out all the words with fewer than 3 characters:

[“override” “NN”] [“system” “NN”] [“meant” “VBN”] [“deactivate” “VB”] [“accelerator” “NN”] [“brake” “NN”] [“pedal” “NN”] [“pressed” “VBN”]

Now we get to a point where we have to make some kind of decision about which terms to use. For this project, I decided to weight the terms nearest to the original search term, taking only the two closest words in each direction. After scoring this sentence you get:

[“override” 0] [“system” 0] [“meant” 0] [“deactivate” 0.25] [“accelerator” 0.5] [“brake” 1] [“pedal” 0.5] [“pressed” 0.25]

The number next to each word indicates how heavily we’ll weight that word when we use it for subsequent searches. The score is divided by 2 for each unit of distance away from the original key term. Not perfect, but it works pretty well.

That was pretty easy to follow, right? Now imagine having a large body of text, and a term. First you’d generate a list of key words from sentences directly containing the term, score them each and store them in a big map. On a second pass, you can start scoring each sentence by using the map of words => scores you’ve already built. In this way, you can then rank sentences, including those that don’t even contain the actual word you’ve searched for, but are still relevant to the original term. Now, my friend, you have context searching.
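In code, the second pass can be as simple as summing the scores of the terms that appear in each sentence. Here is a sketch (assuming a scored-terms map like the one described above and a seq of tokens per sentence; the names are illustrative, not the post’s actual functions):

;; sketch: rank sentences by the summed scores of their known terms
(defn score-sentence [scored-terms tokens]
  (reduce + (map #(get scored-terms % 0) tokens)))

(defn rank-sentences [scored-terms tokenized-sentences]
  (reverse (sort-by (partial score-sentence scored-terms) tokenized-sentences)))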

Congratulations, that was a really long explanation. It ended up being a bit more like pseudocode, right? (Or at least an idea of how to construct a program.) Hopefully, after understanding the previous explanation, the code will be easy to read.

The Code

(Note: this is very rough code; I know it’s not entirely idiomatic and has a ton of room for enhancement and parallelization. Feel free to suggest improvements in the comments, however!)

The most important functions to check out are (score-words [term words]) method, which returns a list of vectors of words and their score, and the (get-scored-terms ) method, which returns a map of words as keys and scores as values for the entire text, given the initial term.

Here’s the output from the last few lines:

contextfinder=> (pprint (reverse (sort-by second (score-sentences mytext scorewords))))
(["The override system is meant to deactivate the accelerator when the brake pedal is pressed. " 13/4]
["The Obama administration is considering requiring all automobiles to contain a brake override system intended to prevent sudden acceleration episodes like those that have led to the recall of millions of Toyotas, the Transportation secretary, Ray LaHood, said Tuesday. " 5/2]
["Often called a \"smart pedal,\" the feature is already found on many automobiles sold worldwide, including models from BMW, Chrysler, Mercedes-Benz, Nissan and Volkswagen." 3/4]
["That will let the driver stop safely even if the cars throttle sticks open. " 0])

The score for each sentence is the second element of its vector; see how sentences such as the third in the list have a score but don’t contain the word “brake”? Those would have been missed entirely without this kind of searching.

contextfinder=> (println (score-text mytext scorewords))
13/2

Score a whole block of text, self-explanatory.

Anyway, I’ve already spent hours writing the explanation for why you’d want to do this sort of searching, so I think I’ll save the explanation of the code for another post. Hopefully it’s clear enough to stand on its own. If you find this interesting, I encourage you to check out the clojure-opennlp bindings and start building other cool linguistic tools!

Natural Language Processing in Clojure with clojure-opennlp
Mon, 08 Mar 2010

NOTE: I am not a linguist, so please feel free to correct me in the comments if I use the wrong term!

From Wikipedia:

Natural Language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. Natural language generation systems convert information from computer databases into readable human language. Natural language understanding systems convert samples of human language into more formal representations such as parse trees or first-order logic structures that are easier for computer programs to manipulate. Many problems within NLP apply to both generation and understanding; for example, a computer must be able to model morphology (the structure of words) in order to understand an English sentence, and a model of morphology is also needed for producing a grammatically correct English sentence.

Clojure-opennlp is a library for interfacing with the OpenNLP (Open Natural Language Processing) library, which provides linguistic tools to run over various blocks of text. Once a linguistic interpretation of text is possible, a lot of really interesting applications present themselves. Let’s jump right in!

Basic Example usage (from a REPL)

(use 'clojure.contrib.pprint) ; just for this example
(use 'opennlp.nlp) ; make sure opennlp.jar is in your classpath

You will need to make the processing functions using the model files. These assume you’re running from the root project directory of the git repository (where some models are included). You can also download the model files from the opennlp project at http://opennlp.sourceforge.net/models/

user=> (def get-sentences (make-sentence-detector "models/EnglishSD.bin.gz"))
user=> (def tokenize (make-tokenizer "models/EnglishTok.bin.gz"))
user=> (def pos-tag (make-pos-tagger "models/tag.bin.gz"))

For name-finders in particular, it’s possible to have multiple model files:

user=> (def name-find (make-name-finder "models/namefind/person.bin.gz" "models/namefind/organization.bin.gz"))

The (make-<whateverizer> "modelfile.bin.gz") functions return functions that perform the linguistic operation. I decided to have them return functions so multiple processors doing the same sort of action can be created with different model files (perhaps different language models and such) without having to pass the model file every time you want to process some text.
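For example, nothing stops you from building two sentence detectors side by side; the German model file name here is just a hypothetical illustration:

user=> (def get-sentences-en (make-sentence-detector "models/EnglishSD.bin.gz"))
user=> (def get-sentences-de (make-sentence-detector "models/GermanSD.bin.gz")) ; hypothetical model file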

After creating the utility methods, we can use the functions to perform operations on text. For instance, since we defined the sentence detector as ‘get-sentences’, we can use that function to split text into sentences:

user=> (pprint (get-sentences "First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea..."))
["First sentence. ", "Second sentence? ", "Here is another one. ",
"And so on and so forth - you get the idea..."]
nil

Or split a sentence into tokens using the tokenize function:

user=> (pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on",
"Friday"]
nil

Once we have a sequence of tokens, we can do what’s called POS tagging. POS tagging takes a list of words from a single sentence and applies an algorithm (using the morphology model) to determine what kind of tag to apply to each word:

user=> (pprint (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday.")))
(["Mr." "NNP"]
["Smith" "NNP"]
["gave" "VBD"]
["a" "DT"]
["car" "NN"]
["to" "TO"]
["his" "PRP$"]
["son" "NN"]
["on" "IN"]
["Friday." "NNP"])
nil

You can check out a list of all the tags if you want to know what they stand for.

The clojure-opennlp library also features a name finder; however, it is extremely rudimentary at this point and won’t detect all names:

user=> (name-find (tokenize "My name is Lee, not John."))
("Lee" "John")

Filters

In the library, I also provide some simple filters that can be used to pare down a list of pos-tagged tokens using regular expressions. There are some preset filters available, as well as a macro for generating your own filters:

(use 'opennlp.tools.filters)

user=> (pprint (nouns (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["Mr." "NNP"]
["Smith" "NNP"]
["car" "NN"]
["son" "NN"]
["Friday" "NNP"])
nil
user=> (pprint (verbs (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["gave" "VBD"])
nil

Creating your own filter:

user=> (pos-filter determiners #"^DT")
#'user/determiners
user=> (doc determiners)
-------------------------
user/determiners
([elements__52__auto__])
Given a list of pos-tagged elements, return only the determiners in a list.
nil
user=> (pprint (determiners (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["a" "DT"])
nil
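Under the hood, a filter-generating macro like pos-filter only needs to wrap a regular-expression match against the tag. Here is a minimal sketch of the idea (the real macro lives in filters.clj and may differ in detail):

(defmacro pos-filter
  "Define a filter named n that keeps pos-tagged [word tag] pairs whose
  tag matches the regex r."
  [n r]
  `(defn ~n
     ~(str "Given a list of pos-tagged elements, return only the " n " in a list.")
     [elements#]
     (filter (fn [t#] (re-find ~r (second t#))) elements#)))

;; e.g. (pos-filter adjectives #"^JJ")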

Check out the filters.clj file for a full list of out-of-the-box filters.

That’s about all there is in the library at the moment, so I hope that made sense. Unfortunately, clojars.org does not provide a nice way to publish documentation for a library, so the documentation in this post and on the github page will have to do for now.

This library is available on clojars for inclusion in leiningen projects, or on github if you’re interested in the source. This is a fairly new project, and not all OpenNLP features are exposed at the moment, so feedback is definitely encouraged. In the next post I’ll walk through an in-depth example of how these functions can be used to enhance a search engine. EDIT: It’s up! Check out “Context searching using clojure-opennlp.”

UPDATE: Hiredman has let me know that the jar on clojars is missing the 3 dependencies used for the library. I’m busy working on writing pom.xml’s for the jars so I can upload them to clojars as dependencies. In the meantime, make sure you have the 3 jars in the lib directory (of the github project) in your classpath. Feel free to report any other issues on the github tracker or in the comments.

UPDATE 2: I fixed the project.clj file and pushed new versions of opennlp.jar and the dependency jars. A regular ‘lein deps’ should work now.
