Although 1% is great, it reminds me that during the poster session Professor Magdon mentioned that the density of points can be a problem for nearest neighbor classifiers. A higher k provides regularization and would typically decrease the leave-one-out error estimate (at least up to a point), but the high density of the computer-generated papers creates an artificially large region that gets classified as computer-generated when k is high. Unfortunately this problem will only get worse with more data points if I stick with a nearest neighbor classifier and don't address the issue. A plot of the 200 data points:
The two computer-generated groups can be seen in the lower left of the image.
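For concreteness, here is a minimal sketch of how the leave-one-out error can be estimated for a range of k values with scikit-learn. The array names and file paths (X, y, features.npy, labels.npy) are placeholders standing in for the actual 200-point feature matrix and its human/computer-generated labels, not my real code:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical inputs: 200 x d feature matrix and 0/1 labels
X = np.load("features.npy")
y = np.load("labels.npy")

# Leave-one-out error for a few odd values of k
for k in range(1, 16, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=LeaveOneOut())
    print(f"k={k:2d}  leave-one-out error = {1.0 - scores.mean():.3f}")
```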
Fortunately I can use a pruning algorithm to rid the dataset of some 'unneeded' points, which should decrease the density of the computer-generated papers. The idea is: for each k value, prune the dataset, run cross-validation on the resulting dataset with that k, and finally plot the error again.
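A rough sketch of that prune-then-cross-validate loop is below. It assumes the pruning step is something like Hart's condensed nearest neighbor rule (grow a subset until it classifies the remaining points correctly, discarding the redundant interior of dense clusters); the actual pruning algorithm is still to be decided, and the condense helper is just an illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Same placeholder inputs as in the previous sketch
X = np.load("features.npy")
y = np.load("labels.npy")

def condense(X, y, k, rng):
    """Hart-style condensing: keep a subset that still classifies
    every remaining point correctly with a k-NN rule."""
    keep = [int(rng.integers(len(X)))]            # seed with one random point
    changed = True
    while changed:
        changed = False
        for i in rng.permutation(len(X)):
            if i in keep:
                continue
            n = min(k, len(keep))
            clf = KNeighborsClassifier(n_neighbors=n).fit(X[keep], y[keep])
            if clf.predict(X[i:i + 1])[0] != y[i]:
                keep.append(int(i))               # misclassified: must keep it
                changed = True
    return np.array(keep)

rng = np.random.default_rng(0)
loo_error = {}
for k in range(1, 16, 2):
    keep = condense(X, y, k, rng)
    if len(keep) <= k:                            # too few points left for this k
        continue
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X[keep], y[keep], cv=LeaveOneOut())
    loo_error[k] = 1.0 - scores.mean()
print(loo_error)
```

Plotting loo_error against k should then show whether thinning out the dense computer-generated cluster changes where the error bottoms out.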