Skip to content

albert-decatur/google-ngrams-db

Repository files navigation

Google N-grams DB

This repo contains Google N-grams English One Million as a SQLite database. These fields are availble:

  • ngram
  • year
  • match_count
  • page_count
  • volume_count

This data is based on Google English One Million 1-grams version 20090715.

Google N-grams is licensed under Creative Commons Attribution 3.0 Unported License. These files have been modified by taking only English One Million 1-grams that are entirely made of the letters A-Z (eg no punctuation or numbers, no accents or non-English characters). These ngrams have been put into a SQLite database, making queries like these easy:

-- get counts of matches by ngram for appearances after 1990
SELECT ngram,SUM(match_count) AS sum_match_count FROM eng_1m_ascii WHERE year > 1990 GROUP BY ngram;
-- get a count of all matches by year
SELECT year,SUM(match_count) AS sum_match_count FROM eng_1m_ascii GROUP BY year;

How To

The directory output_eng_1m_ascii.sqlite.7z/ contains split files that can be concatenated to make eng_1m_ascii.sqlite.7z, which can be unarchived to make eng_1m_ascii.sqlite, which is a SQLite database.

# concatenate the split files under output_eng_1m_ascii.sqlite.7z/
cat $( find output_eng_1m_ascii.sqlite.7z/ -type f | sort ) > eng_1m_ascii.sqlite.7z 
# unarchive the 7zip file to get the SQLite database
7z x eng_1m_ascii.sqlite.7z
# use the SQLite database
sqlite3 eng_1m_ascii.sqlite

About

SQLite database for Google N-grams

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages