Thursday, April 22, 2010
Saturday, April 10, 2010
So it turns out that pruning does not produce a U-shaped leave-one-out error curve as I expected:
For reference, a 3d plot of the pruned data with k=3:
The densities of the computer-generated and human papers do look much more comparable, though. The error appears to be at least slightly higher than without pruning for every k, which is expected. I removed a point only if it was classified correctly (leave-one-out) and its removal did not cause any previously removed point to be misclassified. There are certainly other (probably better) pruning algorithms, but I would expect at least comparable results from them.
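For concreteness, here is a rough Python sketch of that pruning rule. It is my reading of the description above rather than the exact code behind the plots, and it assumes 0/1 labels, Euclidean distance, and a single pass over the points in index order; the names `knn_predict` and `prune` are made up for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Majority vote of the k nearest training points (labels are 0 or 1)."""
    dist = np.sum((X_train - x) ** 2, axis=1)  # squared Euclidean distance
    nearest = np.argsort(dist)[:k]
    return int(y_train[nearest].mean() > 0.5)

def prune(X, y, k):
    """Return a boolean mask of the points that survive pruning.

    A point is dropped only if the remaining points still classify it
    correctly and still classify every previously dropped point correctly.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = np.ones(len(X), dtype=bool)
    removed = []
    for i in range(len(X)):
        if keep.sum() <= k:  # always leave at least k points to vote
            break
        keep[i] = False  # tentatively remove point i
        X_kept, y_kept = X[keep], y[keep]
        still_ok = knn_predict(X_kept, y_kept, X[i], k) == y[i] and all(
            knn_predict(X_kept, y_kept, X[j], k) == y[j] for j in removed
        )
        if still_ok:
            removed.append(i)
        else:
            keep[i] = True  # removal would break something, so put it back
    return keep
```

The result depends on the order in which points are visited, so shuffling the data first would give a different (but equally valid) pruned set.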
I will try using a validation set instead of leave-one-out cross-validation just for kicks, but it's looking increasingly like k=3 is the way to go.
Posted by Allen at 9:15 PM
Wednesday, April 7, 2010
So I finally got around to gathering more data and testing different nearest neighbor classifiers. Interestingly, k=3 (as in k-nearest-neighbor) gave the best results with 200 data points using leave-one-out cross-validation (2 incorrectly classified points, 1% error):
Although 1% is great, this reminds me that Professor Magdon mentioned during the poster session that the density of points can be a problem for nearest neighbor classifiers. A higher k provides regularization and would typically decrease the leave-one-out error estimate (up to a point), but with a high k the high density of the computer-generated papers creates an artificially large region that gets classified as computer generated. Unfortunately, this problem will only get worse with more data points if I stick with a nearest neighbor classifier and don't address the issue. A plot of the 200 data points:
The two computer-generated groups can be seen in the lower left of the image.
Fortunately, I can use a pruning algorithm to rid the dataset of some 'unneeded' points, which should decrease the density of the computer-generated papers. The idea is to prune separately for each k value, run leave-one-out cross-validation on each pruned data set with the same k, and finally plot the error against k again.
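The leave-one-out error estimate used here can be sketched in a few lines of Python. This is only an illustration under my own assumptions (0/1 labels, Euclidean distance, odd k so votes cannot tie); `loo_knn_error` is a made-up name, not a library function.

```python
import numpy as np

def loo_knn_error(X, y, k):
    """Leave-one-out error rate of a k-nearest-neighbor classifier."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    # Leave-one-out: a point is never allowed to vote for itself.
    np.fill_diagonal(dist, np.inf)
    # For each point, take the k closest other points and vote.
    nearest = np.argsort(dist, axis=1)[:, :k]
    predicted = (y[nearest].mean(axis=1) > 0.5).astype(int)
    return float(np.mean(predicted != y))
```

Calling this for each candidate k (on the data set pruned for that k) and plotting the returned values gives the error curve. Note that with a small class, a large k can outvote it entirely, which is the same density effect described above.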
Posted by Allen at 3:44 PM