Sunday, March 27, 2016

The Sillifier

Ah, 90s background images.
The Moby Project has seen no activity in 16 years, but still has some really neat data.  It has a collection of word lists, including first and second edition Scrabble words, names, places, compound words, etc.  But it also has part-of-speech and pronunciation data for lots of words.  This is probably not what Siri uses to figure out what you said, but it's free and good for certain little toy projects, like a Perl script to help find rhymes.

I decided to use the part-of-speech data to write a program that approximates my special sense of humor by replacing random words with other ones.  The idea is pretty simple - for each word, look up what kind of word it is, then replace that word with another one of the same kind - but there were some little hurdles to jump along the way.

A noun
The Moby data says,
potato N
Which means "potato" is a noun.  No surprise there.  But,
light NAVvi
Which means "light" is mostly a noun but also an adjective, a verb, an adverb, and an intransitive verb.  And without parsing the whole sentence my program can't tell which one it is in a given spot.  I didn't want full natural language parsing; we'll save that for another project.  Instead I went with a trick: find another word that can also be all of those things.
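To make the tag format concrete, here's a minimal Python sketch that splits an entry and spells out its codes.  It isn't the Sillifier's actual code.  It assumes a single space between word and tag, as in the examples above (the raw Moby file may use a different delimiter), and the legend only covers the codes mentioned in this post.  The names parse_entry and describe are made up for illustration.

```python
# Partial legend: only the codes this post mentions. The full Moby
# part-of-speech file defines more codes than these.
POS_NAMES = {
    "N": "noun",
    "A": "adjective",
    "V": "verb",
    "v": "adverb",
    "i": "intransitive verb",
}


def parse_entry(line):
    """Split a line like 'light NAVvi' into (word, tag).

    Assumes one space before the tag; splitting from the right keeps
    multi-word entries intact.
    """
    word, tag = line.rsplit(" ", 1)
    return word, tag


def describe(tag):
    """Expand a tag into readable names, most likely usage first."""
    return [POS_NAMES.get(code, "unknown (" + code + ")") for code in tag]


word, tag = parse_entry("light NAVvi")
print(word, describe(tag))
# light ['noun', 'adjective', 'verb', 'adverb', 'intransitive verb']
```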

Oh, but "light" is the only NAVvi word in the set.  It wouldn't be very interesting to replace it with itself.  Moby sorts the parts of speech by likelihood, so I trim off the least likely usage and try again.  "Square" is marked as "NAVv", so except for the intransitive verb usage, it might fit.  One more step, to "NAV", gives 22 more choices, for instance "signal".  I might get even more choices if I considered "NAV" words equivalent to, for instance, "AVN", but I haven't tried that yet.
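The trimming trick could be sketched like this in Python.  This is an illustration under assumptions, not the page's real code: build_index and candidates are hypothetical names, and the four entries are just the examples from this post rather than the full data.

```python
from collections import defaultdict


def build_index(entries):
    """Group words by their exact part-of-speech tag."""
    index = defaultdict(list)
    for word, tag in entries:
        index[tag].append(word)
    return index


def candidates(word, tag, index):
    """Find other words with the same tag.

    If there are none, drop the last (least likely) code from the tag
    and look again, until something turns up or the tag is empty.
    """
    while tag:
        others = [w for w in index.get(tag, []) if w != word]
        if others:
            return others
        tag = tag[:-1]
    return []


# Toy data: only the tags quoted in the post.
entries = [("light", "NAVvi"), ("square", "NAVv"),
           ("signal", "NAV"), ("potato", "N")]
index = build_index(entries)

print(candidates("light", "NAVvi", index))   # ['square']
print(candidates("square", "NAVv", index))   # ['signal']
```

Picking one of the candidates at random (random.choice, say) would give the actual replacement.  Treating "NAV" and "AVN" as equivalent would mean indexing on something like the sorted codes instead of the exact string.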

For some words, replacing them with another word really tends to make the sentence strange.  For instance, "had" and "been" are both marked "V", but when they appear together they take on particular meaning, and it no longer works to replace one of them with "sparge", say.  So I set up a list of words that the program leaves unaltered.
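A leave-alone list is easy to bolt on.  Here's a small Python sketch, assuming the replacement step is passed in as a function; KEEP holds only the two words named above (the real list is presumably longer), and sillify and pick_replacement are hypothetical names.

```python
# Hypothetical keep list: just the two words mentioned in the post.
KEEP = {"had", "been"}


def sillify(words, pick_replacement, keep=KEEP):
    """Replace every word except those on the keep list.

    pick_replacement is any function mapping a word to its substitute.
    The keep check ignores case, so "Had" is protected too.
    """
    return [w if w.lower() in keep else pick_replacement(w) for w in words]


# A stand-in replacement function that always answers "sparge".
print(sillify(["they", "had", "been", "mashing"], lambda w: "sparge"))
# ['sparge', 'had', 'been', 'sparge']
```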

More potatoes. Random.
The Moby parts-of-speech list has around 230,000 entries, but I wanted to embed this list in a web page and keep the footprint as small as possible.  Besides, the program can probably get away without knowing "tonsilitic" or "hoiden" or "abcoulomb".  This is a somewhat common problem with word lists, in my opinion: the top N lists are too short and all the rest are too long.  Somewhere there must be a middle ground between the "1,000 most frequently used English words" and "first edition of the Official Scrabble Players Dictionary(tm)", but I can't seem to find such a list.  For this project I dropped every entry that wasn't in the Scrabble list.
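The filtering step amounts to a set lookup.  A rough Python sketch, where trim_to_wordlist is a hypothetical name and allowed_words is a tiny made-up stand-in for the Scrabble list, not real dictionary data:

```python
def trim_to_wordlist(entries, allowed_words):
    """Keep only the (word, tag) entries whose word is on the allowed list.

    Comparison is case-insensitive, since word lists often come in
    all caps while the part-of-speech data is lowercase.
    """
    allowed = {w.lower() for w in allowed_words}
    return [(word, tag) for word, tag in entries if word.lower() in allowed]


# Made-up sample data for illustration only.
entries = [("potato", "N"), ("light", "NAVvi"), ("abcoulomb", "N")]
allowed_words = ["POTATO", "LIGHT"]

print(trim_to_wordlist(entries, allowed_words))
# [('potato', 'N'), ('light', 'NAVvi')]
```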

Finally, for many words, not all variations of the word were in the data.  So my program had to detect the "-ing" forms of verbs, for instance, replace just the base part, and put the "ing" back on the end afterward.  It's not very good at that (it thinks the past tense of "choose" is "chooseed"), but it's good enough to play with.
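Here's a naive Python sketch of that strip-and-reattach idea, including the same flaw.  The suffix list and the names split_suffix and replace_keeping_suffix are assumptions for illustration; the page's real rules may differ.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def split_suffix(word, known):
    """Return (base, suffix) for a word.

    The suffix is '' when the word is already known, or when stripping
    a suffix does not leave a known base word.
    """
    if word in known:
        return word, ""
    for suffix in SUFFIXES:
        base = word[: -len(suffix)]
        if word.endswith(suffix) and base in known:
            return base, suffix
    return word, ""


def replace_keeping_suffix(word, known, pick_replacement):
    """Swap the base word and glue the original suffix back on, blindly."""
    base, suffix = split_suffix(word, known)
    return pick_replacement(base) + suffix


known = {"walk", "jump", "choose"}
print(replace_keeping_suffix("walking", known, lambda b: "jump"))   # jumping
print(replace_keeping_suffix("walked", known, lambda b: "choose"))  # chooseed
```

The second call shows where it goes wrong: the suffix is pasted on with no spelling rules and no knowledge of irregular verbs.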

The end result is the page linked below.  There are some sample texts you can choose to get started, or enter your own text, and play with the sliders to see what kind of silliness it produces.  Or put another way, steer with the phenomenons to stand what conic of fact it dies.


1 comment:

  1. I've moved the page to GitHub, to avoid the Dropbox hack.
