The first step in preprocessing a paper is converting it to plain text. To do this in Python, I'll eventually be using PDFMiner; for the moment, I've been using pdftotext from Poppler. To start, I've converted 6 human-written papers (from arXiv) to text and grabbed 6 computer-generated papers (from SCIgen), which were already text. I'll need quite a few more once I start using machine learning techniques, but this small sample is good enough for preprocessing. At this point, a paper looks something like this:
The Missing Piece Syndrome in Peer-to-Peer Communication
Bruce Hajek and Ji Zhu Department of Electrical and Computer Engineering and the Coordinated Science Laboratory University of Illinois at Urbana-Champaign
1
Abstract
arXiv:1002.3493v1 [cs.PF] 18 Feb 2010
Typical protocols for peer-to-peer ﬁle sharing over the Internet divide ﬁles to be shared into pieces. New peers strive to obtain a complete collection of pieces from other peers and from a seed. In this paper we identify a problem that can occur if the seeding rate is not large enough. The problem is that, even if the statistics of the system are symmetric in the pieces, there can be symmetry breaking, with one piece becoming very rare. If peers depart after obtaining a complete collection, they can tend to leave before helping other peers receive the rare piece.
I. I NTRODUCTION Peer-to-peer (P2P) communication in the Internet is communication provided through the sharing of widely distributed resou
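For anyone who wants to drive the conversion step from Python in the meantime, here is a minimal sketch that shells out to pdftotext. It is not the code used for this post: the helper names pdftotext_command and pdf_to_text are made up for illustration, and it assumes Poppler's pdftotext is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list for Poppler's pdftotext: input PDF, then output text file."""
    return ["pdftotext", str(pdf_path), str(txt_path)]


def pdf_to_text(pdf_path):
    """Convert one PDF to a .txt file next to it and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) was not found on PATH")
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    # errors="replace" keeps going if the output contains undecodable bytes.
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling pdf_to_text on a PDF path (any path you have; none is assumed here) writes a .txt file alongside it and returns its contents as a string.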
Now that we have a plain text version of the paper, the NLTK starts making life really easy. First, papers are tokenized into a list of words. Next, we convert everything to lowercase and remove the strange symbols that come from formulas and PDF artifacts. Then we filter words by their part of speech and stem them. Part-of-speech filtering means that we keep only nouns, adjectives and words that aren't in the NLTK's dictionary. Stemming means that different inflections of a word map to the same stem. At this point, the paper looks something like this:
Section 0:
['peertop', 'zhu', 'syndrom', 'miss', 'bruce', 'commun', 'univers', 'urbanachampaign', 'engin', 'comput', 'scienc', 'depart', 'laboratori', 'piec', 'hajek']
Section 1:
['depart', 'divid', 'obtain', 'piec', 'rate', 'paper', 'identifi', 'seed', 'statist', 'feb', 'system', 'internet', 'complet', 'collect', 'le', 'cspf', 'protocol', 'peertop', 'peer', 'rare', 'arxiv', 'seed', 'problem', 'piec']
Section 2:
['peertop', 'commun', 'internet', 'commun', 'share', 'resourc', 'entiti', 'end', 'user', 'comput', 'client', 'server', 'clientserv', 'paradigm', 'focu', 'peertop', 'network', 'type', 'bittorr', 'mean', 'speci', 'network', 'topolog', 'particip', 'peer', 'network', 'replic', 'constant', 'network', 'exchang', 'piec', 'list', 'bittorr', 'work', 'consid', 'replic', 'arriv', 'peer', 'peer', 'piec', 'arriv', 'paper', 'studi', 'uid', 'limit', 'model', 'theori', 'densiti', 'depend', 'markov', 'process', 'twostat', 'model', 'paper', 'follow', 'model', 'simul', 'result', 'proposit', 'section', 'proposit', 'section', 'help', 'lemma', 'appendix', 'paper', 'discuss', 'extens', 'section', 'odel', 'formul', 'model', 'seed', 'uniform', 'contact', 'random', 'piec', 'select', 'set', 'proper', 'subset', 'number', 'piec', 'peer', 'set',
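The lowercase-and-strip-symbols step can be sketched with nothing but the standard library. This is a guess at the idea, not the project's actual code: it assumes that a "strange symbol" is anything other than the letters a-z and that such characters are deleted outright, which is at least consistent with tokens like 'cspf' and 'le' in the output above. The name clean_tokens is hypothetical, a plain whitespace split stands in for the NLTK tokenizer, and the part-of-speech filter and stemmer are left out.

```python
import re

# Assumption: anything that is not a lowercase ASCII letter is a "strange symbol".
NON_LETTERS = re.compile(r"[^a-z]")


def clean_tokens(text):
    """Lowercase the text, split on whitespace, and delete non-letter characters."""
    cleaned = []
    for token in text.lower().split():
        token = NON_LETTERS.sub("", token)
        if token:  # drop tokens that consisted only of symbols or digits
            cleaned.append(token)
    return cleaned


# "\ufb01" is the single-character "fi" ligature that PDF extraction often emits.
print(clean_tokens("Peer-to-Peer \ufb01le sharing [cs.PF] 18 Feb 2010"))
# prints ['peertopeer', 'le', 'sharing', 'cspf', 'feb']
```

Deleting characters instead of replacing them with spaces is what fuses hyphenated words into one token; replacing them with a space would split those words apart instead.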
Notice that we've also split the paper up into sections. Section 0 is the part of the paper before the word "abstract", namely the title, authors and an assortment of other words. Section 1 is between the word "abstract" and the word "introduction" (typically just the abstract itself), and section 2 is the body of the paper along with the references and any appendixes. Duplicate words are removed from sections 0 and 1, although I should probably experiment with leaving them there.
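The section split described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: it assumes the tokens are already lowercase, that the first "abstract" and the first later "introduction" are the markers, and that the marker words themselves are dropped. The names dedupe and split_sections are hypothetical.

```python
def dedupe(tokens):
    """Remove duplicate tokens while keeping first-seen order."""
    return list(dict.fromkeys(tokens))


def split_sections(tokens):
    """Split lowercase tokens into [front matter, abstract, body].

    Raises ValueError if either marker word is missing.
    """
    abstract_at = tokens.index("abstract")
    intro_at = tokens.index("introduction", abstract_at + 1)
    front = tokens[:abstract_at]
    abstract = tokens[abstract_at + 1:intro_at]
    body = tokens[intro_at + 1:]
    # Duplicates are removed from sections 0 and 1 only; the body keeps its counts.
    return [dedupe(front), dedupe(abstract), body]


toy = ["missing", "piece", "syndrome", "abstract", "piece", "peer", "piece",
       "introduction", "peer", "network", "peer"]
print(split_sections(toy))
# prints [['missing', 'piece', 'syndrome'], ['piece', 'peer'], ['peer', 'network', 'peer']]
```

Keeping duplicates in the body matters because repeated body words are what a frequency-based statistic would count.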
Now we're ready to get the first statistics from the processed papers. My next blog post will cover this in some detail, but for now I'll leave you with a teaser. I've calculated a title score and an abstract score for each of the 12 papers. Each score is simply the number of times a word from the title (or abstract) occurs in the body of the paper, divided by the number of words in the body.
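One way to read that definition in code is the sketch below. The function name overlap_score and the toy tokens are made up for illustration, and it assumes the score counts every body token that also appears among the title or abstract words; the real implementation may differ in details.

```python
def overlap_score(section_words, body_tokens):
    """Fraction of body tokens that also appear among the title or abstract words."""
    if not body_tokens:
        return 0.0  # avoid dividing by zero for an empty body
    vocabulary = set(section_words)
    hits = sum(1 for token in body_tokens if token in vocabulary)
    return hits / len(body_tokens)


body = ["peer", "network", "peer", "piec"]
print(overlap_score(["peer", "syndrom"], body))  # prints 0.5
print(overlap_score(["piec", "peer"], body))     # prints 0.75
```

Under this reading a score always lies between 0 and 1. The toy tokens are invented for the example and are not taken from any of the 12 papers.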
File          Title score  Abstract score
data/robot/1  0.027701     0.060942
data/robot/2  0.016845     0.055130
data/robot/3  0.000000     0.049875
data/robot/4  0.016970     0.048485
data/robot/5  0.033233     0.105740
data/robot/6  0.013405     0.045576
data/human/1  0.085758     0.301685
data/human/2  0.051014     0.291334
data/human/3  0.191415     0.297854
data/human/4  0.854369     0.593851
data/human/5  0.052889     0.166395
data/human/6  0.069283     0.294298
You'll notice that there do seem to be some clear differences between human-written and (these) computer-generated papers: on both scores, every human-written paper in this small sample scores higher than every computer-generated one. My next post will include lots of scatter plots, and maybe even some statistics.
You can find the code used to generate the quoted text in this post in the project's Subversion repository.
Hi, I'm hoping you'll get a notification for this comment, despite it being over 3 years since you wrote this blog entry.
I'm doing something similar and converting PDF lecture notes to text using pdftotext.
Are you able to elaborate on your methods for filtering the 'strange symbols which come from formulas and PDF artifacts'? I have been filtering using a few regular-expression-style techniques, but maybe your methods could help me.
I found the source code. Ignore the above.