Wednesday, April 7, 2010

Error Analysis for Nearest Neighbor Classifiers

So I finally got around to gathering more data and testing different nearest neighbor classifiers.  Interestingly, k=3 (as in k-nearest-neighbor) gave the best results with 200 data points using leave-one-out cross validation (2 incorrectly classified points, 1% error).
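For reference, the leave-one-out comparison can be run with a few lines of scikit-learn.  This is only a sketch (not the exact script behind the numbers above); X and y are random placeholders standing in for the real 200 feature vectors and their human/computer labels.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    def loo_error(X, y, k):
        """Leave-one-out error rate for a k-nearest-neighbor classifier."""
        clf = KNeighborsClassifier(n_neighbors=k)
        # one accuracy (0 or 1) per held-out point
        scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
        return 1.0 - scores.mean()

    # Placeholder data: the real run used 200 extracted feature vectors.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = rng.integers(0, 2, size=200)
    for k in (1, 3, 5, 7):
        print(f"k={k}: LOO error = {loo_error(X, y, k):.3f}")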

Although 1% is great, it reminds me that Professor Magdon mentioned during the poster session that the density of points can be a problem for nearest neighbor classifiers.  A higher k provides regularization and would typically decrease the leave-one-out error estimate (at least up to a point), but the high density of the computer-generated papers creates an artificially large region that gets classified as computer generated when k is high.  Unfortunately, this problem will only get worse with more data points if I stick with a nearest neighbor classifier and don't address the issue.  A plot of the 200 data points:
The two computer-generated groups can be seen in the lower left of the image.  
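To make the density issue concrete, here is a small made-up example (not the real data): one class is packed into a tight grid near the origin, the other is spread out.  The query point sits only 0.14 away from a spread-out point, yet a large enough k hands it to the dense class because the entire grid fits inside the neighborhood.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    g = np.linspace(-0.1, 0.1, 5)
    dense = np.array([(x, y) for x in g for y in g])      # 25 tightly packed "computer" points
    sparse = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 1.8], [1.8, 1.6],
                       [2.2, 2.0], [1.0, 2.2], [2.5, 1.4], [1.6, 2.4], [2.4, 2.4]])
    X = np.vstack([dense, sparse])
    y = np.array(['computer'] * len(dense) + ['human'] * len(sparse))

    query = np.array([[0.9, 0.9]])   # nearest training point is human-written
    for k in (1, 3, 5, 15):
        pred = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(query)[0]
        print(f"k={k}: {pred}")      # human for small k, computer once k is large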

Fortunately, I can use a pruning algorithm to rid the dataset of some 'unneeded' points, which should decrease the density of the computer-generated papers.  The plan is to prune for each k value, run cross validation on each resulting dataset with that k, and plot the error again.  
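One candidate pruning method is Hart's condensed nearest neighbor rule, which keeps only the points needed to classify the rest of the training set correctly.  A rough sketch (again assuming scikit-learn; X and y are the feature matrix and labels):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def condense(X, y, k=1, max_passes=10):
        """Indices of a reduced training set via Hart's condensed nearest neighbor rule."""
        keep = [0]                                    # seed the retained set with one point
        for _ in range(max_passes):
            changed = False
            clf = KNeighborsClassifier(n_neighbors=min(k, len(keep))).fit(X[keep], y[keep])
            for i in range(len(X)):
                if i in keep:
                    continue
                # Retain any point the current reduced set misclassifies.
                if clf.predict(X[i:i+1])[0] != y[i]:
                    keep.append(i)
                    clf = KNeighborsClassifier(n_neighbors=min(k, len(keep))).fit(X[keep], y[keep])
                    changed = True
            if not changed:                           # a full pass with no additions: done
                break
        return np.array(sorted(keep))

    # The pruned set X[idx], y[idx] would then go back through the
    # leave-one-out comparison above, once per k, e.g.:
    # idx = condense(X, y, k=3)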
