If you’re a geek in the more traditional sense of the word, sometimes you look at the works of Shakespeare as one big text file and say, “Ooh, let’s look for patterns.” If you were to take a course on natural language processing, chances are very good that one of the first lessons would be in parsing Shakespeare for what are called n-grams, or “runs of N words that appear next to each other in the text.” This is roughly how your phone’s predictive text works: it looks at what you’ve already written and then says, “Statistically, what do I think is typically the next word?”
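For the programmers, here is a minimal sketch of the idea in Python. It is not the code from the source I used; the function names (tokenize, ngrams, guess_next) are my own for illustration, and the sample sentence is made up rather than quoted from any play.

```python
from collections import Counter
import re

def tokenize(text):
    # Lowercase and keep only runs of letters and apostrophes.
    # This is a deliberately crude tokenizer.
    return re.findall(r"[a-z']+", text.lower())

def ngrams(words, n):
    # Slide a window of n words across the list, one word at a time.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def guess_next(words, prefix):
    # Count which word follows each occurrence of `prefix`
    # and return the most common one (None if the prefix never occurs).
    n = len(prefix)
    followers = Counter(
        gram[-1] for gram in ngrams(words, n + 1) if gram[:-1] == tuple(prefix)
    )
    return followers.most_common(1)[0][0] if followers else None

# Made-up sample text, just to exercise the functions.
sample = "I know not what I know not why I know not what to say"
words = tokenize(sample)
trigram_counts = Counter(ngrams(words, 3))

print(trigram_counts.most_common(1))
print(guess_next(words, ["know", "not"]))
```

The "what comes next" guess is nothing more than counting: find every place the prefix occurs and tally the word that follows it.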
But being Shakespeare geeks as well, we can then work backward and look at the context for when and where and why he used them. This post is just one in what’s hopefully a series of interesting discoveries using this technique.
For my fellow programming geeks – here’s the GitHub source I found to get started!
Disclaimer – raw text processing has lots of issues. Special characters, line breaks, and headers/footers all get in your way and have to be stripped out. In one version of this test the phrase “a midsummer nights dream” ranked very highly and I thought, wait, no it doesn’t. That’s because one of the sample text files had used the title of the play as a header on every page. Very hard to strip that out if there are no real markers to identify it. So, take these results with at least a few grains of salt.
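One crude way to attack the repeated-header problem is to throw away any line that shows up, word for word, more often than some cutoff. This is only a sketch of that heuristic, not what the source I used does; strip_repeated_lines and its max_repeats cutoff are hypothetical names and values I picked for illustration.

```python
from collections import Counter

def strip_repeated_lines(text, max_repeats=3):
    # Count how often each non-blank line appears, ignoring case
    # and surrounding whitespace.
    lines = text.splitlines()
    counts = Counter(line.strip().lower() for line in lines if line.strip())
    # Keep a line only if it appears no more than max_repeats times.
    # Blank lines are never counted, so they are always kept.
    kept = [line for line in lines if counts[line.strip().lower()] <= max_repeats]
    return "\n".join(kept)
```

The catch is that this also removes anything else that legitimately repeats on a line by itself, such as a speaker's name before each speech. That may or may not be what you want, so the cutoff needs tuning for each text file.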
The longer the n-gram, the less data you get, which makes sense because you’re going to get fewer hits. So typically you see 2 or 3 words (bi- and tri-grams, respectively). But Shakespeare’s a bit wordy and you tend to get things like “I pray you” or “I know not,” which don’t give you much to work with. So I expanded to look at 4- and 5-word grams. Quads and quints? Not sure what they’re officially called.
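To see the "longer means fewer hits" effect for yourself, here is a small sketch that counts how many distinct n-grams repeat at each length. The helper name repeated_ngrams is mine, and the sample line is invented for the demo, not taken from a play.

```python
from collections import Counter

def repeated_ngrams(words, n):
    # Count every run of n consecutive words, then keep only
    # the runs that occur more than once.
    grams = (tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(grams)
    return {gram: c for gram, c in counts.items() if c > 1}

# Invented sample text with some deliberate repetition.
sample = "i pray you tell me i pray you tell him i pray you"
words = sample.split()

# Number of distinct repeated n-grams for n = 2, 3, 4, 5.
repeats_by_n = {n: len(repeated_ngrams(words, n)) for n in range(2, 6)}
print(repeats_by_n)
```

On this toy input the number of repeated phrases drops steadily as n grows, which is the same squeeze you hit on the real text, just at a much larger scale.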
My quad-grams give me plenty of the usual hits: “I know not what,” “I do not know,” “I do beseech you,” … but one of them appears significantly more than the others (30% more, actually), and it is where I got the title for this post.
With all my heart.
Lovely. Now can you guess which play uses it the most? I’ll give you a hint. Merchant of Venice uses it 4 times, Othello 5, but this play uses it 6 times.
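If you want to check a phrase play by play yourself, the counting can be sketched like this. It assumes you already have each play's text as a separate string; the function names are mine, and the "Play A/B/C" entries are placeholder text, not real excerpts or real counts.

```python
import re

def tokenize(text):
    # Lowercase and keep only runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def count_phrase(text, phrase):
    # Count how many times the words of `phrase` appear
    # consecutively in `text`, ignoring case and punctuation.
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)

def rank_plays(plays, phrase):
    # Return (title, count) pairs, highest count first.
    counts = {title: count_phrase(text, phrase) for title, text in plays.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Placeholder texts standing in for whole plays.
plays = {
    "Play A": "With all my heart, sir. I thank you with all my heart.",
    "Play B": "With all my heart I'll go.",
    "Play C": "Not with my heart.",
}
ranking = rank_plays(plays, "with all my heart")
print(ranking)
```

Because the match ignores case and punctuation, "With all my heart," at the start of a line counts the same as the phrase buried mid-sentence.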
I’m not giving the answer here; I want to see what people guess. It’s not one I would have expected. I wonder if it has something to do with when the play was written relative to Shakespeare’s career. Maybe he had a tendency to repeat himself or use simpler go-to phrases earlier in his writing? Is that a hint?