Tuesday, April 27, 2010

Undergraduate Research Symposium

Photo courtesy of Professor Moorthy.

Thursday, April 22, 2010

Presentations and Posters

I'll be presenting to RCOS tomorrow.  My slides are here.  I haven't had as much time to work on this project as I'd like over the past week or so, but I'm hopeful for a little free time at the end of next week.

Saturday, April 10, 2010


So it turns out that pruning does not produce a U-shaped leave-one-out error curve as I expected:
For reference, a 3D plot of the pruned data with k=3:
The densities of computer-generated and human papers do look much more comparable, though. The error looks to be uniformly at least slightly higher than without pruning, which is expected. I pruned by removing only points that were both classified correctly under leave-one-out and whose removal did not cause any previously removed point to be classified incorrectly. There are certainly other (and probably better) pruning algorithms, but I would expect at least comparable results from them.
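For reference, that pruning rule can be sketched in a few lines of plain NumPy. The Euclidean distance, majority vote, and single greedy pass over the points are my assumptions; the actual implementation may differ:

```python
import numpy as np

def knn_predict(X, y, query, k):
    """Majority-vote k-NN prediction for one query point (Euclidean)."""
    d = np.linalg.norm(X - query, axis=1)
    return np.bincount(y[np.argsort(d)[:k]]).argmax()

def prune(X, y, k):
    """One greedy pass of the rule above: drop a point only if it is
    classified correctly without itself AND dropping it leaves every
    previously removed point still correctly classified."""
    keep = np.ones(len(X), dtype=bool)
    removed = []
    for i in range(len(X)):
        keep[i] = False                      # tentatively remove point i
        if keep.sum() < k:                   # not enough neighbors left
            keep[i] = True
            continue
        Xr, yr = X[keep], y[keep]
        ok = knn_predict(Xr, yr, X[i], k) == y[i]
        ok = ok and all(knn_predict(Xr, yr, X[j], k) == y[j]
                        for j in removed)
        if ok:
            removed.append(i)
        else:
            keep[i] = True                   # keep it after all
    return keep
```

By construction, every removed point is still classified correctly by the surviving set, so the dense computer-generated clusters thin out without giving up their decision boundary.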

I will try using a validation set instead of leave-one-out cross-validation just for kicks, but it's looking increasingly like k=3 is the way to go.
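A hold-out estimate along those lines might look like this. It's only a sketch: the 75/25 split, Euclidean distance, and toy two-cluster data are all my assumptions, standing in for the real paper features:

```python
import numpy as np

def knn_error(train_X, train_y, test_X, test_y, k):
    """Error of a k-NN classifier (Euclidean, majority vote) on a
    held-out validation set."""
    wrong = 0
    for x, label in zip(test_X, test_y):
        d = np.linalg.norm(train_X - x, axis=1)
        pred = np.bincount(train_y[np.argsort(d)[:k]]).argmax()
        wrong += pred != label
    return wrong / len(test_X)

# toy stand-in data: two well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(5, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

# shuffle, then hold out 25% for validation (split fraction is a guess)
order = rng.permutation(len(X))
X, y = X[order], y[order]
split = int(0.75 * len(X))
err = knn_error(X[:split], y[:split], X[split:], y[split:], k=3)
```

With only 200 real points, the hold-out estimate will be noisier than leave-one-out, which is worth keeping in mind when comparing the two.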

Wednesday, April 7, 2010

Error Analysis for Nearest Neighbor Classifiers

So I finally got around to gathering more data and testing different nearest neighbor classifiers. Interestingly, k=3 (as in k-nearest-neighbor) gave the best results with 200 data points using leave-one-out cross-validation (2 incorrectly classified points, 1% error):
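For anyone curious, the leave-one-out estimate can be computed directly. This is a sketch in plain NumPy; Euclidean distance and majority voting are assumptions on my part:

```python
import numpy as np

def loo_error(X, y, k):
    """Leave-one-out error of a k-NN classifier (Euclidean distance,
    majority vote).  X: (n, d) feature array, y: (n,) integer labels."""
    wrong = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # hold out point i
        pred = np.bincount(y[np.argsort(d)[:k]]).argmax()
        wrong += pred != y[i]
    return wrong / len(X)
```

At 200 points the O(n^2) distance computation is instantaneous, and the same loop works for sweeping over candidate values of k.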

Although 1% is great, it reminds me that Professor Magdon mentioned during the poster session that the density of points can be a problem for nearest neighbor classifiers. A higher k provides regularization and would typically decrease the leave-one-out error estimate (at least up to a point), but the high density of the computer-generated papers creates an artificially large region that gets classified as computer generated when k is large. Unfortunately, this problem will only get worse with more data points if I stick with a nearest neighbor classifier and don't address the issue. A plot of the 200 data points:
The two computer-generated groups can be seen in the lower left of the image.

Fortunately, I can use a pruning algorithm to rid the dataset of some 'unneeded' points, which should decrease the density of the computer-generated papers. The idea is to prune for each k value, run cross-validation on each resulting dataset with that k value, and finally plot the error again.
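That plan — prune for each k, cross-validate the pruned set with the same k, collect the errors — could be sketched like this. The toy two-cluster data, Euclidean distance, and greedy pruning pass are my assumptions; the real run would use the 200-paper dataset and plot the errors against k:

```python
import numpy as np

def knn_vote(X, y, q, k):
    """Majority-vote k-NN label for query q (Euclidean distance)."""
    d = np.linalg.norm(X - q, axis=1)
    return np.bincount(y[np.argsort(d)[:k]]).argmax()

def prune(X, y, k):
    """Greedy pruning pass: drop a point if every removed point (itself
    included) is still classified correctly by the remaining set."""
    keep = np.ones(len(X), dtype=bool)
    removed = []
    for i in range(len(X)):
        keep[i] = False
        if keep.sum() < k or not all(
                knn_vote(X[keep], y[keep], X[j], k) == y[j]
                for j in removed + [i]):
            keep[i] = True                   # restore the point
        else:
            removed.append(i)
    return keep

def loo_error(X, y, k):
    """Leave-one-out error of k-NN on (X, y)."""
    mask = np.ones(len(X), dtype=bool)
    wrong = 0
    for i in range(len(X)):
        mask[i] = False
        wrong += knn_vote(X[mask], y[mask], X[i], k) != y[i]
        mask[i] = True
    return wrong / len(X)

# toy stand-in for the 200-paper dataset
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

errors = {}
for k in (1, 3, 5, 7, 9):
    keep = prune(X, y, k)                        # prune for this k...
    errors[k] = loo_error(X[keep], y[keep], k)   # ...then cross-validate
```

The resulting dictionary maps each k to its post-pruning leave-one-out error, which is exactly the curve I want to plot.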