The most interesting part of this article to me was learning that the 1965 corpus was hand-created and processed on Hollerith cards. Obvious in retrospect, but I hadn’t thought about it. The closing remarks about fitting the Google corpus on punch cards were amusing. 🙂
Originally shared by null
When Research@Google was young, we memorized the top English letter frequencies (ETAOINSHRDLU) and put them to good use, spending countless hours solving cryptograms.
Those frequencies date from 1965, when Mark Mayzner painstakingly tabulated 20,000 words from newspapers, magazines, and books.
Google has now scanned over two trillion words—eight orders of magnitude more. Recently, Mark contacted Googler Peter Norvig to suggest that he use the Google Books Ngram corpus (http://books.google.com/ngrams) to perform an updated analysis for English. The results are at http://norvig.com/mayzner.html.
– R, L, and C are more common than originally thought.
– The average English word is 4.79 letters long.
– The most common 4-gram is “tion”.
– The most common 7-gram is “present”.
– The most common 9-gram is “different”.
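For anyone curious how figures like these can be computed, here is a minimal sketch in Python. It assumes the corpus has already been reduced to a mapping from each word to its occurrence count. The function names and the toy counts are made up for illustration, and this is not the code behind the analysis linked above.

```python
from collections import Counter


def letter_ngram_counts(word_counts, n):
    """Count letter n-grams inside words, weighting each by its word's count.

    With n=1 this yields plain letter frequencies.
    """
    counts = Counter()
    for word, freq in word_counts.items():
        w = word.lower()
        # Slide a window of width n across the word; words shorter
        # than n contribute nothing.
        for i in range(len(w) - n + 1):
            counts[w[i:i + n]] += freq
    return counts


def average_word_length(word_counts):
    """Mean letters per word occurrence (token), not per distinct word."""
    total_letters = sum(len(w) * c for w, c in word_counts.items())
    total_words = sum(word_counts.values())
    return total_letters / total_words


# Toy counts, invented for illustration only.
toy_counts = {"the": 10, "nation": 3, "station": 2, "of": 5}

letters = letter_ngram_counts(toy_counts, 1)
four_grams = letter_ngram_counts(toy_counts, 4)
mean_length = average_word_length(toy_counts)
```

Weighting by occurrence count rather than by distinct word is the design choice that matters here: a short, very frequent word such as "the" pulls the average length down far more than a long, rare word pushes it up.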
If anyone wants to perform a similar analysis for other languages, let Research@Google know—the data for German, French, Italian, Spanish, Chinese, Hebrew, and Russian is available for download at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.