Thursday, March 4, 2010

More Data and Results

I've written a few Bash scripts to speed up data collection.  I hope arXiv and SCIgen don't mind too much, although I've only grabbed 40 papers from each so far.

To start, here's the scatter plot I promised in the last post:

The green dots are human-generated papers, and the red dots are SCIgen papers.  The x-axis represents how often words from the paper's title occur in its body, and the y-axis represents how often words from its abstract occur in its body.  Both are normalized for the size of the paper.
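
A rough sketch of how these two features could be computed in Python follows.  It is not the code behind the plot: the tokenizer, the `occurrence_rate` helper, and the choice to normalize by the body's word count are all assumptions made for illustration, and the sample title, abstract, and body are made up.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def occurrence_rate(source_text, body_text):
    """Fraction of body tokens that also appear in source_text.

    Assumption: "normalized for the size of the paper" is read here as
    dividing by the number of words in the body.
    """
    source_words = set(tokenize(source_text))
    body_words = tokenize(body_text)
    if not body_words:
        return 0.0
    hits = sum(1 for word in body_words if word in source_words)
    return hits / len(body_words)


# Made-up sample paper, just to exercise the functions.
title = "Fast Sorting of Large Arrays"
abstract = "We present a fast method for sorting arrays that do not fit in memory."
body = "Sorting large arrays is slow when memory is small. Our sorting method is fast."

x = occurrence_rate(title, body)     # x-axis: title words in the body
y = occurrence_rate(abstract, body)  # y-axis: abstract words in the body
print(x, y)
```

Note that this sketch does not filter out common words such as "of" or "the", so a real version would probably want a stop-word list before counting.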

These features encode the observation that a paper should really be about whatever it says it's about, or alternatively, the observation that humans like to repeat themselves a lot.  It doesn't seem like "fixing" this would (will?) be too difficult for SCIgen.

These two features alone, even though they are heavily correlated, look like they're enough to classify papers with a good deal of accuracy.  That said, I plan to combine them into one feature and use at least one more (probably several others).

I'm on the lookout for other sources of computer-generated text (generated academic papers specifically, although I don't imagine there are very many of those), so comment if you know of one!  My scatter plots might get boring with just one small grouping of red dots on each.  I suppose I could branch out and do spam email, although that's a very well-studied topic.

I'll just edit this in, since it doesn't really need its own post:

Here the vertical axis is the number of times the ten most-used words appear in the text of the paper, divided by the number of times all other words are used.  The horizontal axis is the number of times words from the title or abstract appear in the body of the paper, divided by the sum of that count and the number of words in the body (so we get normalized values).  There are two very distinct blobs here, so I may start building a classifier based on these two features alone.
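
The two ratios described above can be sketched in Python as follows.  This is one reading of the description rather than the code that produced the plot: the function names `top_ten_ratio` and `overlap_ratio`, the simple regex tokenizer, and the handling of the degenerate case with ten or fewer distinct words are all assumptions.

```python
import re
from collections import Counter


def words(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_ten_ratio(body_text):
    """Vertical axis: uses of the ten most-used words / uses of all other words."""
    counts = Counter(words(body_text))
    top = sum(count for _, count in counts.most_common(10))
    rest = sum(counts.values()) - top
    # Assumption: with ten or fewer distinct words there are no "other"
    # words, so the ratio is reported as infinity instead of dividing by zero.
    return top / rest if rest else float("inf")


def overlap_ratio(title, abstract, body_text):
    """Horizontal axis: hits / (hits + body length), where hits counts body
    words that also appear in the title or the abstract."""
    key_words = set(words(title)) | set(words(abstract))
    body_words = words(body_text)
    hits = sum(1 for word in body_words if word in key_words)
    total = hits + len(body_words)
    return hits / total if total else 0.0
```

Because the hit count appears in both the numerator and the denominator, `overlap_ratio` always falls between 0 and 0.5, which keeps papers of very different lengths on the same scale.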
