Tuesday, August 31, 2010

Fake h-Index

Cyril Labbé has done some interesting work in creating a fake researcher named Ike Antkare and getting him an h-index of 98 using a self-referencing network of SCIgen papers.

Attempting to classify these papers, I ran into a bit of a surprise (green dots are "Ike"'s papers):
As you can see, the classifier is foiled by the extra references added to the SCIgen papers.  It is thrown off by how many times the author is mentioned in the references section: this doesn't just increase the references score, it also increases the word repetition score (the citations have several words in common) and the title/abstract score (since the author appears in that block of text).

The easiest fix for this would be to modify the features so that repetition in the references section and mentions of the authors of the paper in the references section are not counted.  It also looks like we could learn to classify reference-rich papers by training on some examples.  However, Professor Moorthy had a much better suggestion: Look for clusterings of papers rather than attempting to build a traditional classifier.

Keeping the existing features, a clustering approach would identify both current sets of papers (the traditional SCIgen papers and "Ike"'s modified papers) and would be much more future-proof when looking for generated text.  The clustering approach would not detect a single example, or even a handful, without similar training data, but it could automatically detect medium or large groupings of novel types of generated text.
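As a rough sketch of what that might look like, SciPy's hierarchical clustering can group papers whose feature vectors sit unusually close together.  The distance cutoff and the cluster-size threshold below are placeholder numbers, not tuned values:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def find_paper_clusters(features, distance_cutoff=0.5):
    """Group papers whose feature vectors land close together.

    features: (n_papers, n_features) array of the existing feature values.
    Returns an integer cluster label per paper; large, tightly packed
    clusters are the suspicious ones, since text from a single generator
    tends to occupy a small region of feature space.
    """
    tree = linkage(np.asarray(features, dtype=float), method='single')
    return fcluster(tree, t=distance_cutoff, criterion='distance')

# Report any cluster with at least 10 members (the threshold is arbitrary):
# labels = find_paper_clusters(features)
# for label in set(labels):
#     size = int((labels == label).sum())
#     if size >= 10:
#         print 'cluster %d has %d papers' % (label, size)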

Thursday, August 5, 2010

Paper

Professor Moorthy and I have a writeup now live on arXiv.  It's not computer generated, I swear!

Tuesday, April 27, 2010

Undergraduate Research Symposium

Photo courtesy of Professor Moorthy.

Thursday, April 22, 2010

Presentations and Posters

I'll be presenting to RCOS tomorrow.  My slides are here.  I haven't had as much time to work on this project as I'd like over the past week or so, but I'm hopeful for a little free time at the end of next week.

Saturday, April 10, 2010

Pruning

So it turns out that pruning does not produce a U-shaped leave-one-out error curve as I expected:
For reference, a 3d plot of the pruned data with k=3:
The densities of computer generated and human papers do look much more comparable, though. The error looks to be uniformly at least slightly higher than without pruning, which is expected.  I pruned by removing only points which were both classified correctly (leave-one-out) and whose removal did not cause any previously removed points to be classified incorrectly.  There are certainly other (probably better) pruning algorithms, but I would expect at least comparable results from them.
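In Python, that pruning rule looks something like this; classify is a stand-in for a k-NN classification helper (training points, their labels, a query point, and k in, predicted label out), so this is a sketch of the rule rather than the exact code in the repository:

def prune(points, labels, k, classify):
    """Greedy pruning: drop a point only if the remaining data classifies it
    correctly and its removal keeps every previously dropped point correct."""
    kept = list(range(len(points)))
    removed = []
    for i in range(len(points)):
        candidate = [j for j in kept if j != i]
        train = [points[j] for j in candidate]
        train_labels = [labels[j] for j in candidate]
        # Only prune points the remaining data already classifies correctly...
        if classify(train, train_labels, points[i], k) != labels[i]:
            continue
        # ...and whose removal keeps every previously removed point correct.
        if all(classify(train, train_labels, points[j], k) == labels[j]
               for j in removed):
            kept = candidate
            removed.append(i)
    return kept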

I will try using a validation set instead of leave-one-out cross-validation just for kicks, but it's looking increasingly like k=3 is the way to go.

Wednesday, April 7, 2010

Error Analysis for Nearest Neighbor Classifiers

So I finally got around to gathering more data and testing different nearest neighbor classifiers.  Interestingly, k=3 (as in k-nearest-neighbor) gave the best results with 200 data points using leave-one-out cross-validation (2 incorrectly classified points, 1% error):

Although 1% is great, this reminds me that Professor Magdon mentioned the density of points being a problem for nearest neighbor classifiers during the poster session.  A higher k provides regularization and would typically decrease the leave-one-out error estimate (at least up to a point), but the high density of the computer generated papers creates an artificially large region that gets classified as computer generated when k is high.  Unfortunately this problem will only get worse with more data points if I stick with a nearest neighbor classifier and don't address the issue.  A plot of the 200 data points:
The two computer generated groups can be seen in the lower left of the image.  

Fortunately I can use a pruning algorithm to rid the dataset of some 'unneeded' points, which should decrease the density of the computer generated papers.  The idea will be to prune for each k value, run cross validation on each resulting data set with the chosen k value, and finally plot the error again.  
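For reference, the leave-one-out error itself is cheap to compute with the same kd-tree: query each point for k+1 neighbors, throw away the point itself, and take a majority vote.  A sketch of that plan, with prune() standing in for whatever pruning rule I settle on and labels encoded as 0 (human) and 1 (computer generated):

import numpy as np
from scipy.spatial import cKDTree

def knn_loo_error(points, labels, k):
    """Leave-one-out error of a k-nearest-neighbor classifier.

    points: (n, d) array; labels: length-n array of 0/1 labels.
    Each point's nearest neighbor is itself, so query for k+1 neighbors
    and drop the first column."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)
    votes = labels[idx[:, 1:]].mean(axis=1) > 0.5
    return float(np.mean(votes != labels.astype(bool)))

# Planned loop: prune for each k, then estimate the error on what's left.
# for k in range(1, 16, 2):
#     kept = prune(points, labels, k)
#     print k, knn_loo_error(points[kept], labels[kept], k)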

Friday, March 26, 2010

Third Feature

The long-awaited third feature has finally arrived.  I'm measuring the occurrence of keywords from the references section of papers in their body, title and abstract.  Implementing the feature itself was fairly trivial, but 3D plotting with Matplotlib turns out to be somewhat tricky.

You can see that the dots are tiny.  Unfortunately a bug in the latest version of Matplotlib prevents adjusting them.  I may eventually grab Matplotlib from the project's Subversion repository, where the bug is already fixed.

Regardless, all 100 papers are correctly classified with this new feature (still working with k-nearest neighbor, now in 3 dimensions).  My next step will be to get a bunch more data and evaluate various classifiers, since I'm using a quite arbitrary k=11 right now (as Professor Magdon pointed out at the CS poster session).
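For the curious, the new reference-based feature is simple to compute once preprocessing has produced per-section word lists.  The normalization below is one reasonable choice, not necessarily the exact one in the project code:

def reference_score(reference_words, title_words, abstract_words, body_words):
    """Fraction of the title, abstract and body made up of words that also
    appear in the references section (one way to normalize the feature)."""
    keywords = set(reference_words)
    target = title_words + abstract_words + body_words
    if not target:
        return 0.0
    hits = sum(1 for word in target if word in keywords)
    return float(hits) / len(target)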

Wednesday, March 17, 2010

Computer Science Poster Session

I'll be presenting a poster this Friday at 4pm in Lally 102 for a CS department poster session.  Feel free to stop by and chat.  I've uploaded the poster in case anyone's interested.

I'll be working on project updates and a more substantive blog post closer to the weekend.

Wednesday, March 10, 2010

A Classifier and an Interesting Outlier

I've built a classifier using a nearest neighbor algorithm.  Right now I'm using k=11 (selecting the 11 nearest known data points to the one being classified).  The results are as follows:
Using leave-one-out cross-validation, I get a 2% error rate right now.  This is only working with 100 data points, something I will correct shortly.

Notice the one human paper almost exactly in the middle of the computer generated papers.  My first thought was that the paper must be computer generated, but that doesn't appear to be the case.  Instead, it looks like the paper just has a very short abstract and is extremely formula-intensive.  The formulas introduce a lot of three-letter nonsense words that are repeated often, which throws off the 'top ten words' heuristic.  I may experiment with ignoring three-letter words, and I still plan on adding a third feature to the classifier.  Concentrating on this one point probably isn't the best idea, however.

I've implemented the nearest neighbor algorithm using a nifty kd-tree implementation from SciPy.  Before finding it, I experimented with some Python bindings for ANN, but they hadn't been updated in a year and wouldn't compile on a 64-bit system.  The kd-tree does almost all the work involved in a nearest neighbor algorithm (efficiently, too!), so implementing the classifier was a breeze once I found a working kd-tree.
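The classifier boils down to something like this (a sketch of the idea, not a verbatim copy of the repository code); labels are 0 for human papers and 1 for computer generated ones:

import numpy as np
from scipy.spatial import cKDTree

class NearestNeighborClassifier(object):
    """k-nearest-neighbor classifier backed by SciPy's kd-tree."""

    def __init__(self, points, labels, k=11):
        self.tree = cKDTree(np.asarray(points, dtype=float))
        self.labels = np.asarray(labels)
        self.k = k

    def classify(self, point):
        # Look up the k nearest known papers and take a majority vote.
        _, idx = self.tree.query(np.asarray(point, dtype=float), k=self.k)
        return int(np.bincount(self.labels[idx]).argmax())

# clf = NearestNeighborClassifier(feature_vectors, known_labels, k=11)
# print clf.classify(new_paper_features)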

From here I'll be collecting more data, looking around for another feature (maybe something to do with references), investigating the possibility of turning this into a web service, and starting a writeup for the Undergraduate Research Symposium at RPI.  As always, the Python code used in the making of this post (and the data it worked with) is available in my Google Code Subversion repository.

Thursday, March 4, 2010

More Data and Results

I've written a few Bash scripts to speed data collection.  I hope arXiv and SCIgen don't mind too much, although I've only grabbed 40 papers so far from each.

To start, here's the promised scatter plot from last post:

The green dots are human generated papers, and the red dots are SCIgen papers.  The x-axis in this plot represents the occurrence of words from the title of the paper in its body, and the y-axis represents the occurrence of words from the abstract.  Both are normalized for the size of the paper.

These features encode the observation that a paper should really be about whatever it says it's about.  Alternatively, the observation that humans like to repeat themselves a lot.  It doesn't seem like "fixing" this would (will?) be too difficult for SCIgen.

These two features alone, even though they are heavily correlated, look like they're enough to classify papers with a good deal of accuracy.  That said, I plan to combine them into one feature and use at least one more (probably several others).

I'm on the lookout for other sources of computer generated text (generated academic papers specifically, although I don't imagine there are very many of those), so comment if you know of one!  My scatter plots might get boring with just one small grouping of red dots on each.  I suppose I could branch out and do spam email, although that's a very well studied topic.

I'll just edit this in, since it doesn't really need its own post:

Here the vertical axis is the number of times the top ten most used words appear in the text of the paper divided by the number of times all other words are used.  The horizontal axis is the number of times words from the title or abstract appear in the body of the paper, divided by that count plus the total number of words in the body (so the values are normalized).  There are two very distinct blobs here, so I may start building a classifier based on these two features alone.
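Concretely, both axes come straight from the per-paper word lists.  A sketch of how they could be computed (modulo small details in the real code):

from collections import Counter

def top_ten_ratio(body_words):
    """Vertical axis: occurrences of the ten most common words divided by
    occurrences of every other word."""
    counts = Counter(body_words)
    top = sum(count for word, count in counts.most_common(10))
    rest = len(body_words) - top
    return float(top) / rest if rest else 0.0

def title_abstract_score(title_words, abstract_words, body_words):
    """Horizontal axis: occurrences of title/abstract words in the body,
    divided by that count plus the body length."""
    keywords = set(title_words) | set(abstract_words)
    hits = sum(1 for word in body_words if word in keywords)
    return float(hits) / (hits + len(body_words)) if body_words else 0.0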

Monday, March 1, 2010

Getting Started with Preprocessing

Thanks to the Natural Language Toolkit (NLTK), preprocessing for this project has been quite easy so far. It even comes with an awesome free reference ebook. Right now I'm using the NLTK for tokenizing, stemming, and filtering words based on their parts of speech.

The first step in preprocessing a paper is transforming it into plain text. To do this in Python, I'll eventually be using PDFMiner. For the moment, I've been using pdftotext from Poppler. To start, I've converted 6 human-written papers (from arXiv) to text, and grabbed 6 computer-generated papers (from SCIgen) which were already text. I'll need quite a few more once I start using techniques from machine learning, but this small sampling is good enough for preprocessing. At this point, the paper looks something like this:
The Missing Piece Syndrome in Peer-to-Peer Communication
Bruce Hajek and Ji Zhu Department of Electrical and Computer Engineering and the Coordinated Science Laboratory University of Illinois at Urbana-Champaign

1

Abstract

arXiv:1002.3493v1 [cs.PF] 18 Feb 2010

Typical protocols for peer-to-peer ﬁle sharing over the Internet divide ﬁles to be shared into pieces. New peers strive to obtain a complete collection of pieces from other peers and from a seed. In this paper we identify a problem that can occur if the seeding rate is not large enough. The problem is that, even if the statistics of the system are symmetric in the pieces, there can be symmetry breaking, with one piece becoming very rare. If peers depart after obtaining a complete collection, they can tend to leave before helping other peers receive the rare piece.

I. I NTRODUCTION Peer-to-peer (P2P) communication in the Internet is communication provided through the sharing of widely distributed resou
Now that we have a plain text version of the paper, the NLTK starts making life really easy. First, papers are tokenized into a list of words. Next, we convert everything to lowercase and remove strange symbols which come from formulas and PDF artifacts. Then we filter words by their part of speech and stem them. Part of speech filtering means that we'll keep nouns, adjectives and words that aren't in the NLTK's dictionary. Stemming means that different conjugations of words will map to the same stemmed word. At this point, the paper looks something like this:
Section 0:
['peertop', 'zhu', 'syndrom', 'miss', 'bruce', 'commun', 'univers', 'urbanachampaign', 'engin', 'comput', 'scienc', 'depart', 'laboratori', 'piec', 'hajek']

Section 1:
['depart', 'divid', 'obtain', 'piec', 'rate', 'paper', 'identifi', 'seed', 'statist', 'feb', 'system', 'internet', 'complet', 'collect', 'le', 'cspf', 'protocol', 'peertop', 'peer', 'rare', 'arxiv', 'seed', 'problem', 'piec']

Section 2:
['peertop', 'commun', 'internet', 'commun', 'share', 'resourc', 'entiti', 'end', 'user', 'comput', 'client', 'server', 'clientserv', 'paradigm', 'focu', 'peertop', 'network', 'type', 'bittorr', 'mean', 'speci', 'network', 'topolog', 'particip', 'peer', 'network', 'replic', 'constant', 'network', 'exchang', 'piec', 'list', 'bittorr', 'work', 'consid', 'replic', 'arriv', 'peer', 'peer', 'piec', 'arriv', 'paper', 'studi', 'uid', 'limit', 'model', 'theori', 'densiti', 'depend', 'markov', 'process', 'twostat', 'model', 'paper', 'follow', 'model', 'simul', 'result', 'proposit', 'section', 'proposit', 'section', 'help', 'lemma', 'appendix', 'paper', 'discuss', 'extens', 'section', 'odel', 'formul', 'model', 'seed', 'uniform', 'contact', 'random', 'piec', 'select', 'set', 'proper', 'subset', 'number', 'piec', 'peer', 'set',
Notice that we've also split the paper up into sections.  Section 0 is the part of the paper before the word "abstract", namely the title, author and an assortment of other words.  Section 1 is between the word "abstract" and the word "introduction" (typically just the abstract itself), and section 2 is the body of the paper along with the references and any appendixes.  Duplicate words are removed from sections 0 and 1, although I should probably experiment with leaving them there.
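Condensed into one function, the pipeline for a single section looks something like this.  It's a sketch: the check for words missing from the NLTK's dictionary is left out, and the exact part-of-speech tags kept may differ slightly from what I actually use:

import re
import nltk
from nltk.stem.porter import PorterStemmer

def preprocess(section_text):
    """Tokenize, lowercase, filter by part of speech, and stem one section."""
    stemmer = PorterStemmer()
    tokens = nltk.word_tokenize(section_text.lower())
    # Drop formula fragments and PDF artifacts by keeping purely alphabetic tokens.
    tokens = [t for t in tokens if re.match(r'^[a-z]+$', t)]
    # Keep nouns and adjectives (Penn Treebank tags starting with NN or JJ).
    kept = [word for word, tag in nltk.pos_tag(tokens)
            if tag.startswith(('NN', 'JJ'))]
    return [stemmer.stem(word) for word in kept]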

Now we're ready to get the first statistics from the processed papers.  My next blog post will cover this in some detail, but for now I'll leave you with a teaser.  I've calculated a title score and an abstract score for each of the 12 papers.  These scores are simply the number of times words from the title/abstract occur in the body of the paper divided by the number of words in the body.
File: data/robot/1
Title score: 0.027701
Abstract score: 0.060942
File: data/robot/2
Title score: 0.016845
Abstract score: 0.055130
File: data/robot/3
Title score: 0.000000
Abstract score: 0.049875
File: data/robot/4
Title score: 0.016970
Abstract score: 0.048485
File: data/robot/5
Title score: 0.033233
Abstract score: 0.105740
File: data/robot/6
Title score: 0.013405
Abstract score: 0.045576
File: data/human/1
Title score: 0.085758
Abstract score: 0.301685
File: data/human/2
Title score: 0.051014
Abstract score: 0.291334
File: data/human/3
Title score: 0.191415
Abstract score: 0.297854
File: data/human/4
Title score: 0.854369
Abstract score: 0.593851
File: data/human/5
Title score: 0.052889
Abstract score: 0.166395
File: data/human/6
Title score: 0.069283
Abstract score: 0.294298
You'll notice that there do seem to be some significant differences between human generated and (these) computer generated papers. My next post will include lots of scatter plots, and maybe even some statistics.

You can find the code used to generate the quoted text in this post in the project's Subversion repository.

Tuesday, February 16, 2010

Introduction and Acknowledgements

First of all, I'd like to thank Sean O'Sullivan for his continued support of the Rensselaer Center for Open Software. RCOS adds enormous benefit to computer science at Rensselaer, and has been one of the high points of my undergraduate career here.

This semester, I'll be working on applying methods from machine learning to automatically classify academic papers as either computer generated or genuine human productions. The project is summed up fairly well in my research proposal:
Computer programs designed to generate documents which look like academic papers have been used to expose a lack of thorough human review in several conferences and journals. The documents include figures, formatting, and complete sentences which seem on a shallow overview to form a genuine paper. A human attempting to get meaning from such a paper may realize that there is no coherent flow of ideas, and indeed that the paper is simply a well formatted combination of randomly selected keywords.

A human familiar with the apparent subject matter of a paper can classify computer generated papers as such with great accuracy. The question then arises as to whether we can identify computer generated documents without resorting to an attempt at true understanding by a well trained human. We propose an investigation into several potential methods for differentiating between computer generated and authentic documents based on techniques from machine learning.

A preliminary preprocessing step will split the subject document into word tokens, ignoring any figures found in the document. The document can then be analyzed for various features that might differ between computer generated documents and papers written by a human. These features might include the repetition of phrases in a certain section of the document representing a coherent theme in the passage, the occurrence of keywords from a title or abstract in the body of the document, or the occurrence of keywords from the titles of cited papers in the body of the document.

Each candidate feature will be tested against a body of known academic papers and a body of computer generated documents. Features which differ between computer generated papers and true academic papers will then be selected as part of the basis for classifying documents. Scatter plots will be used to visualize the separating power of various features, with similar features being grouped together into feature categories and the group's principal axis used on plots.

The features determined to differ between computer generated and human written papers will be selected for inclusion in a web service. This web service will classify uploaded documents based on the selected features and offer several methods for quantifying and visualizing the likelihood that a paper was computer generated. Scatter plots will show where the uploaded paper stands in relation to known computer generated and known human written papers, along with groupings for several of the known document generators.

The web service may attempt to classify papers not only based on whether they are computer generated or not, but also based on the generator which was most likely used. Doing this may require different features than the binary classification between computer generated and authentic papers. For example, the generator which created a paper might be identified based on the types and distribution of keywords, or the different sentence structures used.
My next post will detail some of my preliminary findings on feature selection. Ultimately, I'll be writing a Python web service (most likely running on Google's App Engine) which will allow users to submit documents for real time classification. When that happens, I'll be posting code to Google Code.