Tuesday, August 31, 2010

Fake h-Index

Cyril Labbé has done some interesting work in creating a fake researcher named Ike Antkare and getting him an h-index of 98 using a self-referencing network of Scigen papers.

Attempting to classify these papers, I ran into a bit of a surprise (green dots are "Ike"'s papers):
As you can see, the classifier is foiled by the extra references added to the Scigen papers. Because the author is mentioned so many times in the references section, this doesn't just increase the references score; it also increases the word repetition score (the citations have several words in common) and the title/abstract score (since the author appears in that block of text).

The easiest fix for this would be to modify the features so that repetition in the references section and mentions of the authors of the paper in the references section are not counted.  It also looks like we could learn to classify reference-rich papers by training on some examples.  However, Professor Moorthy had a much better suggestion: Look for clusterings of papers rather than attempting to build a traditional classifier.
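To make the "easiest fix" concrete, here is a minimal sketch of leaving the references section out of a repetition feature. This is not the classifier's actual code: the function names `strip_references` and `repetition_score`, the heading pattern, and the scoring formula are all assumptions made for illustration.

```python
import re

# Assumed heading pattern; real papers mark this section in many different ways.
_REFERENCES_HEADING = re.compile(
    r"^[ \t]*(references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_references(text: str) -> str:
    """Return the text before the last references heading, or all of it if none is found."""
    matches = list(_REFERENCES_HEADING.finditer(text))
    if not matches:
        return text
    return text[: matches[-1].start()]


def repetition_score(text: str) -> float:
    """Hypothetical feature: fraction of word tokens that repeat an earlier token."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


# Toy paper: the repeated author name only appears in the references.
paper = "We study caches.\nReferences\nAntkare, I. Caches. Antkare, I. More caches.\n"
body = strip_references(paper)
print(repetition_score(paper), repetition_score(body))
```

Scoring only `body` keeps the self-citations from inflating the feature, though a generator that moved its repeated citations into the main text would get around it.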

Keeping the existing features, a clustering approach would identify both current sets of papers (the traditional Scigen papers and "Ike"'s modified papers) and would be much more future-proof when looking for generated text. The clustering approach would not detect a single example, or even several, without similar training data, but it could automatically detect medium or large groupings of novel types of generated text.
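As a rough illustration of the clustering idea, the sketch below groups papers whose feature vectors chain together within a fixed distance and reports only groups above a minimum size. It is a simple threshold-based single-linkage grouping, not necessarily what Professor Moorthy had in mind. The function name `find_groups`, the `radius` and `min_size` parameters, and the two-dimensional feature values are all made up for the example.

```python
from math import dist


def find_groups(points, radius, min_size):
    """Group points that chain together within `radius` of each other.

    Only groups with at least `min_size` members are returned, so isolated
    points and very small groups are ignored.
    """
    unvisited = set(range(len(points)))
    groups = []
    while unvisited:
        seed = unvisited.pop()
        group, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # Collect every unvisited point close enough to the current one.
            near = [j for j in unvisited if dist(points[current], points[j]) <= radius]
            for j in near:
                unvisited.remove(j)
            group.extend(near)
            frontier.extend(near)
        if len(group) >= min_size:
            groups.append(sorted(group))
    return sorted(groups)


# Illustrative (invented) feature vectors: two tight groups and one stray paper.
features = [
    (0.10, 0.10), (0.12, 0.09), (0.11, 0.12),
    (0.90, 0.80), (0.88, 0.82), (0.91, 0.79),
    (0.50, 0.50),
]
groups = find_groups(features, radius=0.1, min_size=3)
print(groups)
```

The `min_size` cutoff mirrors the trade-off described above: a lone generated paper is not flagged, but a batch produced by the same generator shows up as its own group without any labeled examples of that type.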
