Monday, March 1, 2010

Getting Started with Preprocessing

Thanks to the Natural Language Toolkit (NLTK), preprocessing for this project has been quite easy so far. It even comes with an awesome free reference ebook. Right now I'm using the NLTK for tokenizing, stemming, and filtering words based on their parts of speech.

The first step in preprocessing a paper is transforming it into plain text. To do this in Python, I'll eventually be using PDFMiner. For the moment, I've been using pdftotext from Poppler. To start, I've converted 6 human-written papers (from arXiv) to text, and grabbed 6 computer-generated papers (from SCIgen) which were already text. I'll need quite a few more once I start using techniques from machine learning, but this small sample is good enough for preprocessing. At this point, the paper looks something like this:
The Missing Piece Syndrome in Peer-to-Peer Communication
Bruce Hajek and Ji Zhu Department of Electrical and Computer Engineering and the Coordinated Science Laboratory University of Illinois at Urbana-Champaign



arXiv:1002.3493v1 [cs.PF] 18 Feb 2010

Typical protocols for peer-to-peer ﬁle sharing over the Internet divide ﬁles to be shared into pieces. New peers strive to obtain a complete collection of pieces from other peers and from a seed. In this paper we identify a problem that can occur if the seeding rate is not large enough. The problem is that, even if the statistics of the system are symmetric in the pieces, there can be symmetry breaking, with one piece becoming very rare. If peers depart after obtaining a complete collection, they can tend to leave before helping other peers receive the rare piece.

I. I NTRODUCTION Peer-to-peer (P2P) communication in the Internet is communication provided through the sharing of widely distributed resou
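As a rough sketch of the conversion step, pdftotext can be driven from Python with the standard subprocess module. The function names and file paths here are hypothetical, and this is not the project's actual code; it only assumes that the pdftotext binary from Poppler is on the PATH.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext command line (paths are supplied by the caller)."""
    # -enc UTF-8 asks pdftotext to write its output as UTF-8.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, txt_path]


def pdf_to_text(pdf_path, txt_path):
    """Convert a PDF to plain text and return the text as a string."""
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as handle:
        return handle.read()
```

Calling pdf_to_text("paper.pdf", "paper.txt") would then return the extracted text as a single string.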
Now that we have a plain text version of the paper, the NLTK starts making life really easy. First, papers are tokenized into a list of words. Next, we convert everything to lowercase and remove strange symbols which come from formulas and PDF artifacts. Then we filter words by their part of speech and stem them. Part of speech filtering means that we keep only nouns, adjectives and words that aren't in the NLTK's dictionary. Stemming means that different inflected forms of a word (for example, "piece" and "pieces") map to the same stem. At this point, the paper looks something like this:
Section 0:

['peertop', 'zhu', 'syndrom', 'miss', 'bruce', 'commun', 'univers', 'urbanachampaign', 'engin', 'comput', 'scienc', 'depart', 'laboratori', 'piec', 'hajek']

Section 1:

['depart', 'divid', 'obtain', 'piec', 'rate', 'paper', 'identifi', 'seed', 'statist', 'feb', 'system', 'internet', 'complet', 'collect', 'le', 'cspf', 'protocol', 'peertop', 'peer', 'rare', 'arxiv', 'seed', 'problem', 'piec']

Section 2:

['peertop', 'commun', 'internet', 'commun', 'share', 'resourc', 'entiti', 'end', 'user', 'comput', 'client', 'server', 'clientserv', 'paradigm', 'focu', 'peertop', 'network', 'type', 'bittorr', 'mean', 'speci', 'network', 'topolog', 'particip', 'peer', 'network', 'replic', 'constant', 'network', 'exchang', 'piec', 'list', 'bittorr', 'work', 'consid', 'replic', 'arriv', 'peer', 'peer', 'piec', 'arriv', 'paper', 'studi', 'uid', 'limit', 'model', 'theori', 'densiti', 'depend', 'markov', 'process', 'twostat', 'model', 'paper', 'follow', 'model', 'simul', 'result', 'proposit', 'section', 'proposit', 'section', 'help', 'lemma', 'appendix', 'paper', 'discuss', 'extens', 'section', 'odel', 'formul', 'model', 'seed', 'uniform', 'contact', 'random', 'piec', 'select', 'set', 'proper', 'subset', 'number', 'piec', 'peer', 'set',
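The steps that produce output like this can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the helper names are made up, the rule "drop every character that is not a to z" is only an assumption about how the strange symbols are removed, and the check for words missing from the NLTK's dictionary is not reproduced. The NLTK calls need the tokenizer and tagger data installed through nltk.download, and the exact package names depend on the NLTK version.

```python
import re

# Assumption: anything that is not a lowercase ASCII letter counts as a
# "strange symbol" (digits, punctuation, ligatures, formula debris).
_NON_LETTERS = re.compile(r"[^a-z]")


def clean_tokens(tokens):
    """Lowercase each token, strip non-letters, and drop empty results."""
    cleaned = []
    for token in tokens:
        token = _NON_LETTERS.sub("", token.lower())
        if token:
            cleaned.append(token)
    return cleaned


def preprocess(text):
    """Tokenize, clean, keep nouns and adjectives, then stem."""
    # Imported here so clean_tokens can be used without NLTK installed.
    import nltk
    from nltk.stem import PorterStemmer

    tokens = clean_tokens(nltk.word_tokenize(text))
    tagged = nltk.pos_tag(tokens)  # list of (word, tag) pairs
    stemmer = PorterStemmer()
    # Penn Treebank tags: NN* are nouns, JJ* are adjectives.
    return [stemmer.stem(word) for word, tag in tagged
            if tag.startswith(("NN", "JJ"))]
```

Stripping non-letters before stemming would explain stems such as 'cspf' (from "[cs.PF]") and 'le' (from "ﬁle", once the ligature is dropped).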
Notice that the processed output is also split into sections. Section 0 is the part of the paper before the word "abstract", namely the title, authors and an assortment of other words. Section 1 is between the word "abstract" and the word "introduction" (typically just the abstract itself), and section 2 is the body of the paper along with the references and any appendixes. Duplicate words are removed from sections 0 and 1, although I should probably experiment with leaving them there.
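A minimal sketch of this sectioning, assuming it works on a token list and matches the marker words case-insensitively. The function names are hypothetical, dropping the marker words themselves is a choice made here, and so is treating a paper with no "abstract" marker as all body.

```python
def _find(words, marker, start=0):
    """Index of the first occurrence of marker at or after start, or None."""
    try:
        return words.index(marker, start)
    except ValueError:
        return None


def split_sections(tokens):
    """Return (section 0, section 1, section 2) as lists of tokens."""
    lowered = [token.lower() for token in tokens]
    abstract_at = _find(lowered, "abstract")
    if abstract_at is None:
        # No "abstract" marker: treat the whole paper as body (a choice).
        return [], [], list(tokens)
    intro_at = _find(lowered, "introduction", abstract_at + 1)
    if intro_at is None:
        intro_at = len(tokens)
    return (tokens[:abstract_at],
            tokens[abstract_at + 1:intro_at],
            tokens[intro_at + 1:])


def dedupe(words):
    """Remove duplicate words, keeping the first occurrence of each."""
    return list(dict.fromkeys(words))
```

For example, split_sections("My Title Abstract we study peers Introduction peers share pieces".split()) puts "My Title" in section 0, "we study peers" in section 1, and the rest in section 2.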

Now we're ready to get the first statistics from the processed papers. My next blog post will cover this in some detail, but for now I'll leave you with a teaser. I've calculated a title score and an abstract score for each of the 12 papers. Each score is simply the number of words in the body of the paper that also appear in the title (or abstract), divided by the total number of words in the body.
File: data/robot/1
Title score: 0.027701
Abstract score: 0.060942
File: data/robot/2
Title score: 0.016845
Abstract score: 0.055130
File: data/robot/3
Title score: 0.000000
Abstract score: 0.049875
File: data/robot/4
Title score: 0.016970
Abstract score: 0.048485
File: data/robot/5
Title score: 0.033233
Abstract score: 0.105740
File: data/robot/6
Title score: 0.013405
Abstract score: 0.045576
File: data/human/1
Title score: 0.085758
Abstract score: 0.301685
File: data/human/2
Title score: 0.051014
Abstract score: 0.291334
File: data/human/3
Title score: 0.191415
Abstract score: 0.297854
File: data/human/4
Title score: 0.854369
Abstract score: 0.593851
File: data/human/5
Title score: 0.052889
Abstract score: 0.166395
File: data/human/6
Title score: 0.069283
Abstract score: 0.294298
You'll notice that there do seem to be some significant differences between human-generated and (these) computer-generated papers. My next post will include lots of scatter plots, and maybe even some statistics.
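For reference, the title and abstract scores described above amount to something like the following sketch. The function name is hypothetical, and returning 0.0 for an empty body is a choice made here to avoid dividing by zero.

```python
def overlap_score(section_words, body_words):
    """Fraction of body words that also appear in the given section."""
    if not body_words:
        return 0.0
    vocabulary = set(section_words)
    hits = sum(1 for word in body_words if word in vocabulary)
    return hits / len(body_words)


# Toy stems, not taken from any of the 12 papers.
title = ["peer", "piec"]
body = ["peer", "network", "piec", "peer"]
print(overlap_score(title, body))  # 3 of the 4 body words are title words
```

The same function gives the abstract score when it is passed the abstract's words instead of the title's.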

You can find the code used to generate the quoted text in this post in the project's Subversion repository.


  1. Hi, I'm hoping you'll get a notification for this comment, despite it being over 3 years since you wrote this blog entry.

    I'm doing something similar and converting PDF lecture notes to text using pdftotext.
    Are you able to elaborate on your methods for filtering the 'strange symbols which come from formulas and PDF artifacts'? I have been filtering using a few regular-expression-style techniques, but maybe your methods could help me?

  2. I found the source code. Ignore the above.