Friday, March 26, 2010

Third Feature

The long-awaited third feature has finally arrived.  I'm measuring how often keywords from a paper's reference section occur in its body, title, and abstract.  Implementing the feature itself was fairly trivial, but 3D plotting with Matplotlib turned out to be somewhat tricky.

You can see that the dots are tiny.  Unfortunately, a bug in the latest version of Matplotlib prevents adjusting their size.  I may eventually grab Matplotlib from the project's Subversion repository, where the bug is already fixed.
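
For reference, here is a minimal sketch of how a keyword-occurrence feature like this one might be computed. Everything in it is an assumption made for illustration: the function names, the simple lowercase word tokenizer, the choice to pool title, abstract and body into one text, and the normalization by word count are hypothetical and not necessarily what my actual code does.

```python
import re


def keyword_occurrence(keywords, text):
    """Return the fraction of words in `text` that appear in `keywords`.

    Hypothetical helper: matching is case-insensitive and works on
    single alphanumeric tokens only.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    wanted = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in wanted)
    return hits / len(words)


def third_feature(keywords, title, abstract, body):
    """Assumed design: pool the three sections and score them together."""
    return keyword_occurrence(keywords, " ".join([title, abstract, body]))
```

Normalizing by the number of words keeps long and short papers comparable; a raw count would be an equally plausible choice.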

Regardless, all 100 papers are correctly classified with this new feature (still working with k-nearest neighbor, now in 3 dimensions).  My next step will be to get a bunch more data and evaluate various classifiers, since I'm using a fairly arbitrary k=11 right now (as Professor Magdon pointed out at the CS poster session).
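
As a rough illustration of the classification step, the sketch below does a plain k-nearest-neighbor majority vote over 3-dimensional feature vectors. It assumes Euclidean distance and a training set stored as `(point, label)` pairs; the function name and data layout are hypothetical, and it needs Python 3.8 or later for `math.dist`.

```python
import math
from collections import Counter


def knn_classify(train, query, k=11):
    """Classify `query` by majority vote among its k nearest neighbors.

    train: list of ((x, y, z), label) pairs (assumed layout)
    query: an (x, y, z) tuple
    """
    # Sort the training points by Euclidean distance and keep the closest k.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Sorting the whole training set on every query is wasteful, but with only 100 papers it hardly matters.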


  1. One possibility is to plot how the classification changes as k varies from 2 to 20 (just a thought).

  2. Yes, that should work nicely. k-nearest-neighbor with an even k can lead to ties in voting, though, so I've skipped even numbers.
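
The sweep over k discussed in these comments could be sketched as follows. This is only one way to do it: it assumes leave-one-out accuracy as the quantity to plot, Euclidean distance, and the odd values 3 through 19. All function names are hypothetical.

```python
import math
from collections import Counter


def predict(train, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: classify each point using all the others."""
    correct = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if predict(rest, point, k) == label:
            correct += 1
    return correct / len(data)


def sweep_odd_k(data, ks=range(3, 20, 2)):
    """Map each odd k (3, 5, ..., 19 by default) to its accuracy."""
    return {k: loo_accuracy(data, k) for k in ks}
```

The resulting dictionary can be handed straight to a line plot, with k on the horizontal axis and accuracy on the vertical one.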