Algorithmic Detection of Computer Generated Papers: April 2010

Tuesday, April 27, 2010

Undergraduate Research Symposium

Photo courtesy of Professor Moorthy.

Thursday, April 22, 2010

Presentations and Posters

I'll be presenting to RCOS tomorrow. My slides are here. I haven't had as much time to work on this project as I'd like over the past week or so, but I'm hopeful for a little free time at the end of next week.

Saturday, April 10, 2010

Pruning

So it turns out that pruning does not produce the U-shaped leave-one-out error curve I expected:

For reference, a 3D plot of the pruned data with k=3:

The densities of computer-generated and human papers do look much more comparable, though. The error appears to be uniformly at least slightly higher than without pruning, which is expected. I pruned by removing a point only if it was classified correctly (leave-one-out) and its removal did not cause any previously removed point to be classified incorrectly. There are certainly other (probably better) pruning algorithms, but I would expect at least comparable results from them.
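The pruning rule described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code actually used for the project: the names `knn_predict` and `prune` are made up for the example, points are plain tuples with Euclidean distance, and candidates are visited in dataset order (the result depends on that order).

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def prune(data, k):
    """Greedy pruning sketch (hypothetical helper).

    A point is removed only if (a) the remaining points classify it
    correctly, and (b) every previously removed point is still classified
    correctly once it is gone.
    Returns (kept_points, removed_points).
    """
    kept = list(range(len(data)))   # indices still in the dataset
    removed = []                    # indices pruned so far
    for i in range(len(data)):
        trial = [j for j in kept if j != i]
        if len(trial) < k:
            break                   # too few points left to vote with
        train = [data[j] for j in trial]
        point, label = data[i]
        if knn_predict(train, point, k) != label:
            continue                # misclassified without itself: keep it
        if all(knn_predict(train, data[r][0], k) == data[r][1] for r in removed):
            kept = trial
            removed.append(i)
    return [data[j] for j in kept], [data[r] for r in removed]
```

Calling `prune(data, 3)` on a list of `(point, label)` pairs returns the thinned dataset along with the points that were dropped. Each call to `knn_predict` sorts the whole training set, so this is only practical for a few hundred points.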

I will try using a validation set instead of leave-one-out cross-validation just for kicks, but it's looking increasingly like k=3 is the way to go.

Wednesday, April 7, 2010

Error Analysis for Nearest Neighbor Classifiers

So I finally got around to gathering more data and testing different nearest neighbor classifiers. Interestingly, k=3 (as in k-nearest-neighbor) gave the best results with 200 data points using leave-one-out cross-validation (2 incorrectly classified points, 1% error):

Although 1% is great, this reminds me that Professor Magdon mentioned the density of points being a problem for nearest neighbor classifiers during the poster session. Although a higher k provides regularization and would typically decrease the leave-one-out error estimate (at least up to a point), the high density of the computer-generated papers creates an artificially large region that gets classified as computer-generated with a high k. Unfortunately, this problem will only get worse with more data points if I stick with a nearest neighbor classifier and don't address the issue. A plot of the 200 data points:

The two computer-generated groups can be seen in the lower left of the image.
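For reference, the leave-one-out error estimate for a k-nearest-neighbor classifier can be sketched in a few lines. This is only an illustration under assumptions: the function names are invented for the example, the distance is Euclidean, and `make_toy_data` produces made-up 3D points (a tight "generated" cluster and a looser "human" cluster), not the real paper features.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_error(data, k):
    """Fraction of points misclassified when each is held out in turn."""
    mistakes = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]   # everything except point i
        if knn_predict(rest, point, k) != label:
            mistakes += 1
    return mistakes / len(data)


def make_toy_data(seed=0):
    """Hypothetical stand-in data: 100 dense and 100 spread-out points."""
    rng = random.Random(seed)
    generated = [(tuple(rng.gauss(0, 0.2) for _ in range(3)), "generated")
                 for _ in range(100)]
    human = [(tuple(rng.gauss(3, 1.0) for _ in range(3)), "human")
             for _ in range(100)]
    return generated + human


if __name__ == "__main__":
    data = make_toy_data()
    for k in (1, 3, 5, 7, 9):        # odd k avoids ties with two classes
        print(k, loo_error(data, k))
```

The numbers this prints come from the synthetic clusters, so they say nothing about the real 1% figure. The sketch only shows how the error for each k is computed.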

Fortunately, I can use a pruning algorithm to rid the dataset of some 'unneeded' points, which should decrease the density of the computer-generated papers. The idea will be to prune for each k value, run cross-validation on each resulting data set with the corresponding k value, and finally plot the error again.
