Google Ngrams Analyzer

Google offers a service known as Ngrams that records how many times each word or phrase was used in a given year, and in how many books, across all the books ever written (to the extent that Google has them). They offer the raw data for download, so I wrote a C++ program that parses and analyzes this data to find the most common word or phrase out of all those books.
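To give an idea of what the parsing step involves, here is a minimal sketch. It is not my original program. It assumes each line of the raw data is tab-separated as ngram, year, match count, volume count (the layout of the 2012 data files; the older files carry an extra page-count column), and the names `NgramTotals` and `accumulate_counts` are made up for this example.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Running totals for one ngram, summed over every year in the data.
struct NgramTotals {
    std::uint64_t occurrences = 0;  // sum of the per-year match counts
    std::uint64_t books = 0;        // sum of the per-year volume counts
};

// Reads lines of the assumed form
//   ngram <TAB> year <TAB> match_count <TAB> volume_count
// and adds each line's counts to the totals for its ngram.
// Lines that do not parse are skipped.
void accumulate_counts(std::istream& in,
                       std::unordered_map<std::string, NgramTotals>& totals) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string ngram;
        int year = 0;
        std::uint64_t matches = 0;
        std::uint64_t volumes = 0;
        // The ngram itself may contain spaces, so it is read up to the first tab.
        if (std::getline(fields, ngram, '\t') &&
            (fields >> year >> matches >> volumes)) {
            NgramTotals& t = totals[ngram];
            t.occurrences += matches;
            t.books += volumes;
        }
    }
}
```

Keeping every ngram in a hash map like this is fine for single words, but for three-word phrases the map gets very large, which is part of why that run is so much heavier.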

This is one of the coolest statistical programs I have made, and I am very happy with how it turned out. I ran my program on all books looking for the most common one-word phrase ever written (the most common word), and as you may have guessed, it is “the.” I also ran the same test on all Spanish books. I tried running my program to find the most common three-word phrase ever, but because of the incredible size of the raw data, it was taking days to complete and I didn’t have the processing power to finish the job. Or so I thought: the most common three-word phrase in all books ever written is now available on the Java page (Most common 3 word phrase).

Nevertheless, the files for the data I did finish analyzing are available for download. Stats on the software are shown at the bottom of the file and in the screenshot.

The format of the file is:
x.) phrase : number of occurrences : number of books

Most Commonly Used English Words
Most Commonly Used Spanish Words

ngram1

ngram2

 
