To start, here's the promised scatter plot from the last post:
These features encode the observation that a paper should really be about whatever it says it's about. Alternatively, the observation that humans like to repeat themselves a lot. It doesn't seem like "fixing" this would (will?) be too difficult for SCIgen.
These two features alone, even though they are heavily correlated, look like they're enough to classify papers with a good deal of accuracy. That said, I plan to combine them into one feature and use at least one more (probably several others).
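One simple way to fold two heavily correlated features into a single one is to standardize each and add them up. This is only a sketch of that idea, not necessarily the combination that will end up being used here: the names `feature_a`, `feature_b`, and `combine_features` are hypothetical, and the numbers are made up purely for illustration. For two standardized features with positive correlation, the scaled sum below is the same as projecting onto the first principal component.

```python
from math import sqrt
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of numbers to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("feature is constant; cannot standardize")
    return [(v - mu) / sigma for v in values]


def combine_features(feature_a, feature_b):
    """Collapse two positively correlated features into one score per paper.

    Assumption: both lists hold one value per paper, in the same order.
    """
    if len(feature_a) != len(feature_b):
        raise ValueError("features must have one value per paper")
    za = zscores(feature_a)
    zb = zscores(feature_b)
    # (za + zb) / sqrt(2) is the projection onto the direction (1, 1) / sqrt(2).
    return [(a + b) / sqrt(2) for a, b in zip(za, zb)]


# Made-up values: three papers scoring high on both features, two scoring low.
feature_a = [0.9, 0.8, 0.85, 0.1, 0.2]
feature_b = [0.7, 0.75, 0.8, 0.05, 0.15]

combined = combine_features(feature_a, feature_b)
for score in combined:
    print(round(score, 3))
```

A single threshold on the combined score then separates the two groups in this toy data. Whether it does so on real papers is exactly what the additional features are meant to check.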
I'm on the lookout for other sources of computer-generated text (generated academic papers specifically, although I don't imagine there are very many of those), so comment if you know of one! My scatter plots might get boring with just one small grouping of red dots on each. I suppose I could branch out and do spam email, although that's a very well-studied topic.
I'll just edit this in, since it doesn't really need its own post: