Tuesday, February 16, 2010

Introduction and Acknowledgements

First of all, I'd like to thank Sean O'Sullivan for his continued support of the Rensselaer Center for Open Software. RCOS adds enormous benefit to computer science at Rensselaer, and has been one of the high points of my undergraduate career here.

This semester, I'll be working on applying methods from machine learning to automatically classify academic papers as either computer generated or genuine human productions. The project is summed up fairly well in my research proposal:

Computer programs designed to generate documents which look like academic papers have been used to expose a lack of thorough human review in several conferences and journals. The documents include figures, formatting, and complete sentences which seem at a glance to form a genuine paper. A human attempting to get meaning from such a paper may realize that there is no coherent flow of ideas, and indeed that the paper is simply a well-formatted combination of randomly selected keywords.

A human familiar with the apparent subject matter of a paper can classify computer generated papers as such with great accuracy. The question then arises as to whether we can identify computer generated documents without resorting to an attempt at true understanding by a well-trained human. We propose an investigation into several potential methods for differentiating between computer generated and authentic documents based on techniques from machine learning.

A preliminary preprocessing step will split the subject document into word tokens, ignoring any figures found in the document. The document can then be analyzed for various features that might differ between computer generated documents and papers written by a human. These features might include the repetition of phrases in a certain section of the document representing a coherent theme in the passage, the occurrence of keywords from a title or abstract in the body of the document, or the occurrence of keywords from the titles of cited papers in the body of the document.
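
As a rough illustration of the preprocessing and two of the candidate features above, here is a minimal Python sketch. The function names, the tiny stop-word list, and the choice of two-word phrases are all assumptions made for the example; they are not part of the proposal, and figure removal is assumed to have happened before the text reaches these functions.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with", "is", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keywords(text):
    """Return the set of non-stop-word tokens in a title or abstract."""
    return {token for token in tokenize(text) if token not in STOP_WORDS}


def title_keyword_coverage(title, body):
    """Fraction of title keywords that appear at least once in the body."""
    title_words = keywords(title)
    if not title_words:
        return 0.0
    body_words = set(tokenize(body))
    return len(title_words & body_words) / len(title_words)


def repeated_phrase_ratio(body, n=2):
    """Fraction of n-word phrases in the body that occur more than once."""
    tokens = tokenize(body)
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)
```

Both functions return a number between 0 and 1, so their values can be compared directly across documents of different lengths. Whether either one actually separates generated papers from genuine ones is exactly what the testing step would need to establish.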

Each candidate feature will be tested against a body of known academic papers and a body of computer generated documents. Features which differ between computer generated papers and true academic papers will then be selected as part of the basis for classifying documents. Scatter plots will be used to visualize the separating power of various features, with similar features being grouped together into feature categories and the group's principal axis used on plots.
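
One way to make "features which differ" and "the group's principal axis" concrete is sketched below. The separation score (gap between class means divided by the average class spread) and the use of power iteration are choices made for this example, not something the proposal specifies, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def separation_score(human_values, generated_values):
    """Gap between the two class means, in units of the average class spread."""
    spread = (pstdev(human_values) + pstdev(generated_values)) / 2
    gap = abs(mean(human_values) - mean(generated_values))
    if spread == 0:
        return math.inf if gap > 0 else 0.0
    return gap / spread


def principal_axis(rows, iterations=100):
    """Return (centre, axis): the mean row and the dominant covariance eigenvector.

    Uses plain power iteration, which can fail to converge to the dominant
    direction in the rare case that the starting vector is orthogonal to it.
    """
    dims = len(rows[0])
    centre = [mean(row[j] for row in rows) for j in range(dims)]
    centred = [[row[j] - centre[j] for j in range(dims)] for row in rows]
    cov = [[sum(r[a] * r[b] for r in centred) / len(rows) for b in range(dims)]
           for a in range(dims)]
    axis = [1 / math.sqrt(dims)] * dims  # unit-length starting guess
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * axis[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:  # no variance at all, keep the current guess
            break
        axis = [x / norm for x in nxt]
    return centre, axis


def project(row, centre, axis):
    """Coordinate of one document along the principal axis of its feature group."""
    return sum((x - c) * a for x, c, a in zip(row, centre, axis))
```

Each row here is one document's values for a group of similar features. Projecting every document onto the group's principal axis gives a single coordinate per feature category, which is what would go on one axis of a scatter plot.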

The features determined to differ between computer generated and human written papers will be selected for inclusion in a web service. This web service will classify uploaded documents based on the selected features and offer several methods for quantifying and visualizing the likelihood that a paper was computer generated. Scatter plots will show where the uploaded paper stands in relation to known computer generated and known human written papers, along with groupings for several of the known document generators.
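
To give a sense of how an uploaded paper might be scored against the known groups, here is a small nearest-centroid sketch. It is an assumption for illustration only: the proposal does not commit to a particular classifier, the function names are made up, and the score is a relative distance rather than a calibrated probability. It needs Python 3.8 or later for `math.dist`.

```python
import math


def centroid(points):
    """Mean position of a group of documents in feature space."""
    dims = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(dims)]


def generated_score(doc, human_points, generated_points):
    """Relative closeness to the generated group: 0 is human-like, 1 is generated-like."""
    d_human = math.dist(doc, centroid(human_points))
    d_generated = math.dist(doc, centroid(generated_points))
    total = d_human + d_generated
    if total == 0:  # both centroids coincide with the document
        return 0.5
    return d_human / total
```

The same feature coordinates used for the score are the ones that would be drawn on the scatter plot, so the number and the picture stay consistent with each other.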

The web service may attempt to classify papers not only based on whether they are computer generated or not, but also based on the generator which was most likely used. Doing this may require different features than the binary classification between computer generated and authentic papers. For example, the generator which created a paper might be identified based on the types and distribution of keywords, or the different sentence structures used.
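
A simple way to picture identification by keyword distribution is to compare a document's word counts with a profile built from sample output of each generator. The sketch below uses cosine similarity for that comparison; this is one possible technique chosen for the example, and the function names and generator labels are placeholders.

```python
import math
import re
from collections import Counter


def keyword_profile(texts):
    """Word counts pooled over a list of sample texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_likely_generator(document, samples_by_generator):
    """Return the best-matching generator label and the score for every label."""
    doc_profile = keyword_profile([document])
    scores = {name: cosine_similarity(doc_profile, keyword_profile(texts))
              for name, texts in samples_by_generator.items()}
    return max(scores, key=scores.get), scores
```

Here `samples_by_generator` maps a label such as `"generator_a"` to a list of texts known to come from that generator. Sentence structure, the other cue mentioned above, would need a different representation and is not covered by this sketch.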

My next post will detail some of my preliminary findings on feature selection. Ultimately, I'll be writing a Python web service (most likely running on Google's App Engine) which will allow users to submit documents for real-time classification. When that happens, I'll be posting code to Google Code.