Although 1% is great, it reminds me that during the poster session Professor Magdon mentioned that the density of points can be a problem for nearest neighbor classifiers. A higher k provides regularization and would typically decrease the leave-one-out error estimate (at least up to a point), but the high density of the computer-generated papers creates an artificially large region that gets classified as computer-generated when k is high. Unfortunately this problem will only get worse with more data points if I stick with a nearest neighbor classifier and don't address the issue. A plot of the 200 data points:
The two computer-generated groups can be seen in the lower left of the image.
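For concreteness, here is a minimal sketch of how the leave-one-out error can be estimated for a range of k values with scikit-learn. The array names and file paths (X, y, features.npy, labels.npy) are placeholders standing in for the actual 200-point feature matrix and its human/computer-generated labels, not my real code:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical inputs: 200 x d feature matrix and 0/1 labels
X = np.load("features.npy")
y = np.load("labels.npy")

# Leave-one-out error for a few odd values of k
for k in range(1, 16, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=LeaveOneOut())
    print(f"k={k:2d}  leave-one-out error = {1.0 - scores.mean():.3f}")
```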
Fortunately I can use a pruning algorithm to rid the dataset of some 'unneeded' points, which should decrease the density of the computer-generated papers. The idea is: for each k value, prune the dataset, run cross-validation on the resulting dataset with that k, and finally plot the error again.
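A rough sketch of that prune-then-cross-validate loop is below. It assumes the pruning step is something like Hart's condensed nearest neighbor rule (grow a subset until it classifies the remaining points correctly, discarding the redundant interior of dense clusters); the actual pruning algorithm is still to be decided, and the condense helper is just an illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Same placeholder inputs as in the previous sketch
X = np.load("features.npy")
y = np.load("labels.npy")

def condense(X, y, k, rng):
    """Hart-style condensing: keep a subset that still classifies
    every remaining point correctly with a k-NN rule."""
    keep = [int(rng.integers(len(X)))]            # seed with one random point
    changed = True
    while changed:
        changed = False
        for i in rng.permutation(len(X)):
            if i in keep:
                continue
            n = min(k, len(keep))
            clf = KNeighborsClassifier(n_neighbors=n).fit(X[keep], y[keep])
            if clf.predict(X[i:i + 1])[0] != y[i]:
                keep.append(int(i))               # misclassified: must keep it
                changed = True
    return np.array(keep)

rng = np.random.default_rng(0)
loo_error = {}
for k in range(1, 16, 2):
    keep = condense(X, y, k, rng)
    if len(keep) <= k:                            # too few points left for this k
        continue
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X[keep], y[keep], cv=LeaveOneOut())
    loo_error[k] = 1.0 - scores.mean()
print(loo_error)
```

Plotting loo_error against k should then show whether thinning out the dense computer-generated cluster changes where the error bottoms out.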