Tuesday, August 31, 2010

Fake h-Index

Cyril Labbé has done some interesting work in creating a fake researcher named Ike Antkare and getting him an h-index of 98 using a self-referencing network of Scigen papers.

Attempting to classify these papers, I ran into a bit of a surprise (green dots are "Ike"'s papers):
As you can see, the classifier is foiled by the extra references added to the Scigen papers. It is thrown off because the author is mentioned so many times in the references section: this doesn't just increase the references score, it also increases the word repetition score (the citations share several words) and the title/abstract score (since the author's name appears in that block of text).

The easiest fix would be to modify the features so that repetition within the references section, and mentions of the paper's own authors there, are not counted.  It also looks like we could learn to classify reference-rich papers by training on some examples.  However, Professor Moorthy had a much better suggestion: look for clusterings of papers rather than attempting to build a traditional classifier.

Keeping the existing features, a clustering approach would identify both current sets of papers (the traditional Scigen papers and "Ike"'s modified papers) and would be much more future-proof when looking for generated text.  The clustering approach would not detect a single example, or even several, without similar training data, but it could automatically detect medium or large groupings of novel types of generated text.
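As a rough illustration of the clustering idea (not an actual implementation), here is a minimal density-based grouping sketch over whatever feature vectors we already compute; the radius and neighbor-count thresholds are made up and would need tuning:

```python
import numpy as np

def dense_groups(features, radius=0.1, min_neighbors=10):
    """Flag papers that sit in unusually dense regions of feature space.

    features: (n_papers, n_features) array of the existing feature values.
    radius and min_neighbors are hypothetical thresholds.
    """
    # Pairwise Euclidean distances between all feature vectors.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Papers from a single generator tend to pile up in one small region,
    # so a paper with many close neighbors is suspicious.
    neighbor_counts = (dists < radius).sum(axis=1) - 1  # exclude the point itself
    return neighbor_counts >= min_neighbors
```

A medium or large batch of generated papers would then show up as a connected set of flagged points, even for a generator we have never seen before.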

Thursday, August 5, 2010

Paper

Professor Moorthy and I have a writeup now live on ArXiv.  It's not computer generated, I swear!

Tuesday, April 27, 2010

Undergraduate Research Symposium

Photo courtesy of Professor Moorthy.

Thursday, April 22, 2010

Presentations and Posters

I'll be presenting to RCOS tomorrow.  My slides are here.  I haven't had as much time to work on this project as I'd like over the past week or so, but I'm hopeful for a little free time at the end of next week.

Saturday, April 10, 2010

Pruning

So it turns out that pruning does not produce a U-shaped leave-one-out error curve as I expected:
For reference, a 3d plot of the pruned data with k=3:
The densities of computer-generated and human papers do look much more comparable, though.  The error looks to be uniformly at least slightly higher than without pruning, which is expected.  I pruned by removing only points that were classified correctly (leave-one-out) and whose removal did not cause any previously removed point to be classified incorrectly.  There are certainly other (probably better) pruning algorithms, but I would expect at least comparable results from them.
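For concreteness, here is roughly what that pruning rule looks like in code; `classify` stands in for the existing k-nearest-neighbor routine and is not a real function name from the project, and `points`/`labels` are assumed to be NumPy arrays:

```python
def prune(points, labels, k, classify):
    """Greedily drop points that (a) the remaining set still classifies
    correctly and (b) whose removal does not break any point dropped earlier.

    classify(train_x, train_y, query, k) is a stand-in for the existing
    k-nearest-neighbor classifier; points and labels are NumPy arrays.
    """
    keep = list(range(len(points)))
    removed = []
    for i in range(len(points)):
        candidate = [j for j in keep if j != i]
        train_x, train_y = points[candidate], labels[candidate]
        # (a) the reduced set must still classify point i correctly...
        ok = classify(train_x, train_y, points[i], k) == labels[i]
        # (b) ...and must still classify every previously removed point correctly.
        ok = ok and all(classify(train_x, train_y, points[j], k) == labels[j]
                        for j in removed)
        if ok:
            keep = candidate
            removed.append(i)
    return keep
```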

I will try using a validation set instead of leave-one-out cross-validation just for kicks, but it's looking increasingly like k=3 is the way to go.

Wednesday, April 7, 2010

Error Analysis for Nearest Neighbor Classifiers

So I finally got around to gathering more data and testing different nearest neighbor classifiers.  Interestingly, k=3 (as in k-nearest-neighbor) gave the best results with 200 data points using leave-one-out cross-validation (2 incorrectly classified points, 1% error):

Although 1% is great, this reminds me that Professor Magdon mentioned the density of points being a problem for nearest neighbor classifiers during the poster session.  Although a higher k provides regularization and would typically decrease the leave-one-out error estimate (at least up to a point), the high density of the computer-generated papers creates an artificially large region that gets classified as computer generated with a high k.  Unfortunately this problem will only get worse with more data points if I stick with a nearest neighbor classifier and don't address the issue.  A plot of the 200 data points:
The two computer generated groups can be seen in the lower left of the image.  

Fortunately I can use a pruning algorithm to rid the dataset of some 'unneeded' points, which should decrease the density of the computer generated papers.  The idea will be to prune for each k value, run cross validation on each resulting data set with the chosen k value, and finally plot the error again.  
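Roughly, the evaluation loop I have in mind looks like the sketch below; `classify` again stands in for the existing k-nearest-neighbor routine and `prune` for whichever pruning algorithm I end up using, so the names are placeholders:

```python
def loo_error(points, labels, k, classify):
    """Leave-one-out error estimate for a k-nearest-neighbor rule."""
    mistakes = 0
    for i in range(len(points)):
        rest = [j for j in range(len(points)) if j != i]
        if classify(points[rest], labels[rest], points[i], k) != labels[i]:
            mistakes += 1
    return mistakes / float(len(points))

# Prune separately for each k, then estimate that pruned set's error with the same k.
errors = {}
for k in range(1, 16, 2):
    kept = prune(points, labels, k, classify)  # hypothetical pruning step
    errors[k] = loo_error(points[kept], labels[kept], k, classify)
```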

Friday, March 26, 2010

Third Feature

The long-awaited third feature has finally arrived.  I'm measuring how often keywords from a paper's reference section occur in its body, title, and abstract.  Implementing the feature itself was fairly trivial, but 3D plotting with Matplotlib turns out to be somewhat tricky.
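A rough sketch of how such a feature can be computed; the tokenization, minimum word length, and stopword list here are simplified placeholders rather than the project's actual choices:

```python
import re

def reference_keyword_score(references_text, rest_of_paper_text):
    """Fraction of distinct keywords from the references section that also
    appear in the paper's title, abstract, or body."""
    stopwords = {"the", "and", "for", "with", "from", "this", "that"}
    ref_words = {w.lower() for w in re.findall(r"[A-Za-z]{4,}", references_text)}
    ref_words -= stopwords
    rest_words = {w.lower() for w in re.findall(r"[A-Za-z]{4,}", rest_of_paper_text)}
    if not ref_words:
        return 0.0
    return len(ref_words & rest_words) / float(len(ref_words))
```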

You can see in the plot that the dots are tiny.  Unfortunately, a bug in the latest version of Matplotlib prevents adjusting their size.  I may eventually grab Matplotlib from the project's Subversion repository, where the bug is already fixed.
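For reference, the plot itself is just a 3D scatter along these lines; the `human` and `generated` arrays and the axis labels are placeholders, and the `s=` marker-size argument is exactly what the buggy Matplotlib version ignores:

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

# human and generated are assumed to be (n, 3) arrays of the three feature values.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(human[:, 0], human[:, 1], human[:, 2], c='b', s=40, label='human')
ax.scatter(generated[:, 0], generated[:, 1], generated[:, 2], c='r', s=40, label='generated')
ax.set_xlabel('feature 1')
ax.set_ylabel('feature 2')
ax.set_zlabel('reference keyword score')
ax.legend()
plt.show()
```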

Regardless, all 100 papers are correctly classified with this new feature (still working with k-nearest neighbor, now in 3 dimensions).  My next step will be to get a bunch more data and evaluate various classifiers, since the k=11 I'm using right now is quite arbitrary (as Professor Magdon pointed out at the CS poster session).