Using leave-one-out cross-validation, I currently get a 2% error rate. This is with only 100 data points, something I will correct shortly.
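For context, leave-one-out cross-validation with a 1-nearest-neighbor classifier can be computed in a few lines with SciPy; this is a sketch rather than my exact code, and the toy arrays `X` and `y` are illustrative stand-ins for the real feature vectors and labels:

```python
import numpy as np
from scipy.spatial import cKDTree

def loocv_error(X, y):
    """Leave-one-out CV error rate for a 1-nearest-neighbor classifier."""
    X = np.asarray(X)
    y = np.asarray(y)
    tree = cKDTree(X)
    # Query the two nearest points for each sample: the closest is the
    # sample itself, so the second is its nearest actual neighbor.
    _, idx = tree.query(X, k=2)
    predictions = y[idx[:, 1]]
    return float(np.mean(predictions != y))

# Toy example: two well-separated clusters, so LOOCV error is 0.
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
y = [0, 0, 1, 1]
print(loocv_error(X, y))  # 0.0 on this separable toy data
```

Querying `k=2` and discarding the first hit is the standard trick for leaving each point out without rebuilding the tree n times.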
Notice the one human paper almost exactly in the middle of the computer-generated papers. My first thought was that the paper must be computer generated, but that doesn't appear to be the case. Instead, it looks like the paper just has a very short abstract and is extremely formula-intensive. The formulas introduce a lot of three-letter nonsense words that are repeated frequently, which throws off the 'top ten words' heuristic. I may experiment with ignoring three-letter words, and I still plan on adding a third feature to the classifier. Concentrating on this one point probably isn't the best idea, however.
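A filtered version of the heuristic might look something like the following sketch; the function name, the exact tokenization, and the `min_len` cutoff are my illustrative choices, not necessarily what the final feature will use:

```python
import re
from collections import Counter

def top_words_fraction(text, n=10, min_len=4):
    """Fraction of the text accounted for by its n most frequent words,
    skipping tokens shorter than min_len (which would drop the
    three-letter formula fragments that skew the statistic)."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if len(w) >= min_len]
    if not words:
        return 0.0
    counts = Counter(words)
    top = sum(c for _, c in counts.most_common(n))
    return top / len(words)

sample = "the quick brown fox jumps over the lazy dog the fox"
print(top_words_fraction(sample, n=2))
```

The idea is that the short symbol-like tokens from the formulas never enter the count, so they can no longer dominate the top-ten list.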
I've implemented the nearest neighbor algorithm using a nifty kd-tree implementation from SciPy. Before finding it, I experimented with some Python bindings for ANN, but they hadn't been updated in a year and wouldn't compile on a 64-bit system. The kd-tree does almost all the work involved in a nearest neighbor algorithm (efficiently, too!), so implementing the classifier was a breeze once I found a working kd-tree implementation.
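To give a sense of how little glue code is needed, here is a minimal sketch of a 1-NN classifier on top of SciPy's kd-tree; the class name and the two-point training set are illustrative, not my actual data:

```python
import numpy as np
from scipy.spatial import cKDTree

class NearestNeighborClassifier:
    """1-NN classifier: the kd-tree handles the hard part (the search)."""

    def fit(self, X, y):
        self.tree = cKDTree(np.asarray(X))
        self.y = np.asarray(y)
        return self

    def predict(self, X):
        # query() with the default k=1 returns one index per query point;
        # the prediction is simply that neighbor's label.
        _, idx = self.tree.query(np.asarray(X))
        return self.y[idx]

clf = NearestNeighborClassifier().fit(
    [[0.0, 0.0], [1.0, 1.0]], ["human", "generated"])
print(clf.predict([[0.1, 0.2]]))  # nearest training point is [0.0, 0.0]
```

All the real work (building the tree, searching it efficiently) lives inside `cKDTree`, which is why the classifier itself stays so short.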
From here I'll be collecting more data, looking around for another feature (maybe something to do with references), investigating the possibility of turning this into a web service, and starting a writeup for the Undergraduate Research Symposium at RPI. As always, the Python code used in the making of this post (and the data it worked with) is available in my Google Code Subversion repository.