The long awaited third feature has finally arrived. I'm measuring the occurrence of keywords from the reference section of papers in their body, title and abstract. Implementing the feature itself was fairly trivial, but 3D plotting with Matplotlib turns out to be somewhat tricky.
You can see that the dots are tiny. Unfortunately a bug in the latest version of Matplotlib prevents adjusting them. I may eventually grab Matplotlib from the project's Subversion repository, where the bug is already fixed.
Regardless, all 100 papers are correctly classified with this new feature (still working with k-nearest neighbor, now in 3 dimensions). My next step will be to get a bunch more data and evaluate various classifiers, since I'm using a quite arbitrary k=11 right now (as Professor Magdon pointed out at the CS poster session).
Friday, March 26, 2010
Wednesday, March 17, 2010
Computer Science Poster Session
I'll be presenting a poster this Friday at 4pm in Lally 102 for a CS department poster session. Feel free to stop by and chat. I've uploaded the poster in case anyone's interested.
I'll be working on project updates and a more substantive blog post closer to the weekend.
I'll be working on project updates and a more substantive blog post closer to the weekend.
Wednesday, March 10, 2010
A Classifier and an Interesting Outlier
I've built a classifier using a nearest neighbor algorithm. Right now I'm using k=11 (selecting the 11 nearest known data points to the one being classified). The results are as follows:
Using leave one out cross validation, I get a 2% error rate right now. This is only working with 100 data points, something I will correct shortly.
Notice the one human paper almost exactly in the middle of the computer generated papers. My first thought was that the paper must be computer generated, but that doesn't appear to be the case. Instead, it looks like the paper just has a very short abstract and is extremely formula-intensive. The formulas introduce a lot of 3 letter nonsense words that are repeated a lot, which throws off the 'top ten words' heuristic. I may experiment with ignoring three letter words, and still plan on adding a third feature to the classifier. Concentrating on this one point probably isn't the best idea, however.
I've implemented the nearest neighbor algorithm using a nifty implementation of a kd-tree from Scipy. Before finding it, I experimented with some Python bindings for ANN, but they hadn't been updated for a year and wouldn't compile on a 64 bit system. The kd-tree does almost all the work involved in a nearest neighbor algorithm (efficiently, too!), so implementing the classifier was a breeze once I found a working kd-tree implementation.
From here I'll be collecting more data, looking around for another feature (maybe something to do with references), investigating the possibility of turning this in to a web service, and starting a writeup for the Undergraduate Research Symposium at RPI. As always, the Python code used in the making of this post (and the data it worked with) is available in my Google Code Subversion repository.
Thursday, March 4, 2010
More Data and Results
I've written a few Bash scripts to speed data collection. I hope arXiv and SCIgen don't mind too much, although I've only grabbed 40 papers so far from each.
To start, here's the promised scatter plot from last post:
The green dots are human generated papers, and the red dots are SCIgen papers. The x-axis in this plot represents the occurrence of words from the title of the paper in its body, and the y-axis represents the occurrence of words from the abstract. Both are normalized for the size of the paper.
These features encode the observation that a paper should really be about whatever it says it's about. Alternatively, the observation that humans like to repeat themselves a lot. It doesn't seem like "fixing" this would (will?) be too difficult for SCIgen.
These two features alone, even though they are heavily correlated, look like they're enough to classify papers with a good deal of accuracy. That said, I plan to combine them in to one feature and use at least one more (probably several others).
I'm on the lookout for other sources of computer generated text (generated academic papers specifically, although I don't imagine there are very many of those), so comment if you know of one! My scatter plots might get boring with just one small grouping of red dots on each. I suppose I could branch out and do spam email, although that's a very well studied topic.
I'll just edit this in, since it doesn't really need its own post:
Here the vertical axis is the number of times the top ten most used words appear in the text of the paper divided by the number of times all other words are used. The horizontal axis is the number of times words from the title or abstract appear in the body of the paper divided by that plus the number of words in the body of the paper (so we get normalized values). There are two very distinct blobs here, so I may start building a classifier based on these two features alone.
To start, here's the promised scatter plot from last post:
These features encode the observation that a paper should really be about whatever it says it's about. Alternatively, the observation that humans like to repeat themselves a lot. It doesn't seem like "fixing" this would (will?) be too difficult for SCIgen.
These two features alone, even though they are heavily correlated, look like they're enough to classify papers with a good deal of accuracy. That said, I plan to combine them in to one feature and use at least one more (probably several others).
I'm on the lookout for other sources of computer generated text (generated academic papers specifically, although I don't imagine there are very many of those), so comment if you know of one! My scatter plots might get boring with just one small grouping of red dots on each. I suppose I could branch out and do spam email, although that's a very well studied topic.
I'll just edit this in, since it doesn't really need its own post:
Monday, March 1, 2010
Getting Started with Preprocessing
Thanks to the Natural Language Toolkit (NLTK), preprocessing for this project has been quite easy so far. It even comes with an awesome free reference ebook. Right now I'm using the NLTK for tokenizing, stemming, and filtering words based on their parts of speech.
The first step in preprocessing a paper is transforming it in to plain text. To do this in Python, I'll eventually be using PDFMiner. For the moment, I've been using pdftotext from Poppler. To start, I've converted 6 human-written papers (from arXiv) to text, and grabbed 6 computer-generated papers (from SCIgen) which were already text. I'll need quite a few more once I start using techniques from machine learning, but this small sampling is good enough for preprocessing. At this point, the paper looks something like this:
The Missing Piece Syndrome in Peer-to-Peer CommunicationBruce Hajek and Ji Zhu Department of Electrical and Computer Engineering and the Coordinated Science Laboratory University of Illinois at Urbana-Champaign1AbstractarXiv:1002.3493v1 [cs.PF] 18 Feb 2010Typical protocols for peer-to-peer ï¬le sharing over the Internet divide ï¬les to be shared into pieces. New peers strive to obtain a complete collection of pieces from other peers and from a seed. In this paper we identify a problem that can occur if the seeding rate is not large enough. The problem is that, even if the statistics of the system are symmetric in the pieces, there can be symmetry breaking, with one piece becoming very rare. If peers depart after obtaining a complete collection, they can tend to leave before helping other peers receive the rare piece.I. I NTRODUCTION Peer-to-peer (P2P) communication in the Internet is communication provided through the sharing of widely distributed resou
Now that we have a plain text version of the paper, the NLTK starts making life really easy. First, papers are tokenized into a list of words. Next, we convert everything to lowercase and remove strange symbols which come from formulas and PDF artifacts. Then we filter words by their part of speech and stem them. Part of speech filtering means that we'll keep nouns, adjectives and words that aren't in the NLTK's dictionary. Stemming means that different conjugations of words will map to the same stemmed word. At this point, the paper looks something like this:
Section 0:['peertop', 'zhu', 'syndrom', 'miss', 'bruce', 'commun', 'univers', 'urbanachampaign', 'engin', 'comput', 'scienc', 'depart', 'laboratori', 'piec', 'hajek']Section 1:['depart', 'divid', 'obtain', 'piec', 'rate', 'paper', 'identifi', 'seed', 'statist', 'feb', 'system', 'internet', 'complet', 'collect', 'le', 'cspf', 'protocol', 'peertop', 'peer', 'rare', 'arxiv', 'seed', 'problem', 'piec']Section 2:['peertop', 'commun', 'internet', 'commun', 'share', 'resourc', 'entiti', 'end', 'user', 'comput', 'client', 'server', 'clientserv', 'paradigm', 'focu', 'peertop', 'network', 'type', 'bittorr', 'mean', 'speci', 'network', 'topolog', 'particip', 'peer', 'network', 'replic', 'constant', 'network', 'exchang', 'piec', 'list', 'bittorr', 'work', 'consid', 'replic', 'arriv', 'peer', 'peer', 'piec', 'arriv', 'paper', 'studi', 'uid', 'limit', 'model', 'theori', 'densiti', 'depend', 'markov', 'process', 'twostat', 'model', 'paper', 'follow', 'model', 'simul', 'result', 'proposit', 'section', 'proposit', 'section', 'help', 'lemma', 'appendix', 'paper', 'discuss', 'extens', 'section', 'odel', 'formul', 'model', 'seed', 'uniform', 'contact', 'random', 'piec', 'select', 'set', 'proper', 'subset', 'number', 'piec', 'peer', 'set',
Notice that we've also split the paper up in to sections. Section 0 is the part of the paper before the word "abstract", namely the title, author and an assortment of other words. Section 1 is between the word "abstract" and the word "introduction" (typically just the abstract itself), and section 2 is the body of the paper along with the references and any appendixes. Duplicate words are removed from sections 0 and 1, although I should probably experiment with leaving them there.
Now we're ready to get the first statistics from the processed papers. My next blog post will cover this in some detail, but for now I'll leave you with a teaser. I've calculated a title score and an abstract score for each of the 12 papers. These scores are simply the number of times a word from the abstract/title occur in the body of the paper divided by the number of words in the body.
File: data/robot/1Title score: 0.027701Abstract score: 0.060942File: data/robot/2Title score: 0.016845Abstract score: 0.055130File: data/robot/3Title score: 0.000000Abstract score: 0.049875File: data/robot/4Title score: 0.016970Abstract score: 0.048485File: data/robot/5Title score: 0.033233Abstract score: 0.105740File: data/robot/6Title score: 0.013405Abstract score: 0.045576File: data/human/1Title score: 0.085758Abstract score: 0.301685File: data/human/2Title score: 0.051014Abstract score: 0.291334File: data/human/3Title score: 0.191415Abstract score: 0.297854File: data/human/4Title score: 0.854369Abstract score: 0.593851File: data/human/5Title score: 0.052889Abstract score: 0.166395File: data/human/6Title score: 0.069283Abstract score: 0.294298
You'll notice that there do seem to be some significant differences between human generated and (these) computer generated papers. My next post will include lots of scatter plots, and maybe even some statistics.
You can find the code used to generate the quoted text in this post in the project's subversion repository.
Subscribe to:
Posts (Atom)