As I explained on the C++ page, Google offers a service that shows word and phrase usage across books up to 2010. The raw data is available to download: each phrase, along with how many times it was used in each year. I had previously written a C++ program to parse this data, and this time I rewrote it in Java. The program adds up the usage counts for each phrase, keeps a running total, and sorts the results at the end. Because there is so much data, however, I had to manage memory by removing low-occurring phrases along the way. Using Java let me automatically download, unzip, and parse all the data. I made the program multi-threaded so that it downloads the next file while it processes the current one. It finished the whole job in just under 5 hours.
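To give a rough idea of the counting, pruning, and sorting steps, here is a minimal sketch in Java. It is not the original program. It assumes each data line is tab-separated, with the phrase in the first column, the year in the second, and the match count in the third. The class name `PhraseCounter`, its methods, and the pruning threshold are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the names and thresholds here are hypothetical.
class PhraseCounter {
    // Running total of uses for each phrase, summed over all years.
    private final Map<String, Long> totals = new HashMap<>();

    // Adds one data line, assumed to look like: phrase TAB year TAB match_count ...
    // Any further columns are ignored, and malformed lines are skipped.
    void addLine(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 3) {
            return;
        }
        long count;
        try {
            count = Long.parseLong(cols[2].trim());
        } catch (NumberFormatException e) {
            return;
        }
        totals.merge(cols[0], count, Long::sum);
    }

    // Frees memory by dropping every phrase whose total is below minCount.
    // This is lossy: a dropped phrase starts again from zero if it reappears.
    void prune(long minCount) {
        totals.values().removeIf(total -> total < minCount);
    }

    // Returns up to n phrases, most frequent first.
    List<Map.Entry<String, Long>> top(int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        PhraseCounter counter = new PhraseCounter();
        counter.addLine("one of the\t1999\t5");
        counter.addLine("one of the\t2000\t7");
        counter.addLine("rare phrase here\t2000\t1");
        counter.prune(2); // hypothetical threshold, e.g. applied after each file
        for (Map.Entry<String, Long> e : counter.top(10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Calling `prune` between files rather than in the middle of one keeps a phrase from being dropped while its yearly counts are still being added up, assuming all the lines for a phrase sit together in the same file.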
This project was certainly a lot of work, but in the end I found that the most common three-word phrase used in all books written up to 2010 is “one of the.”
I have the full list available for download below.
You will notice some entries that are not really three-word phrases. This is because of how Google Ngrams defines a three-word phrase: punctuation can count as a word, so “. it is” is included as a three-word phrase.
Top 3 word phrases