Wikipedia & Project Gutenberg Ngram databases
The permanent URL for this post is http://opendna.com/blog/902
I thought I’d follow up “ngram.sh: a script for extracting Google Ngram data” with data sources for Wikipedia and Project Gutenberg ngrams. The ngram.sh script can easily be modified to extract your keywords from these databases too.
Wikipedia Ngram data
Title: Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram
Authors: Javier Artiles & Satoshi Sekine at NYU’s Proteus Project.
Source: Wikipedia [18:12, June 8, 2008 version]
Ngrams: 1gram (31MB); 2gram (447MB); 3gram (1.9GB); 4gram (4.3GB); 5gram (7.1GB); 6gram (10GB); 7gram (13GB)
Also: List of headwords, Infobox data, and sentences tagged by NLP (Natural Language Processing) tools.
Code: None provided.
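Since no code ships with the Wikipedia data, here is a rough Python sketch of the kind of keyword filter ngram.sh performs, adapted for a local ngram file. It is not taken from the TC Wikipedia project. The line layout (n-gram, a tab, then a count) is an assumption I have not checked against the actual files, and the function name filter_ngrams is made up for illustration, so adjust the parsing to whatever the downloads really contain.

```python
from typing import Iterable, Iterator, Tuple


def filter_ngrams(lines: Iterable[str], keyword: str) -> Iterator[Tuple[str, int]]:
    """Yield (ngram, count) for n-grams that contain `keyword` as a whole token."""
    wanted = keyword.lower()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # ASSUMPTION: each line looks like "<ngram>\t<count>".
        # The real TC Wikipedia files may use a different layout.
        ngram, sep, count = line.rpartition("\t")
        if not sep or not count.strip().isdigit():
            continue  # skip lines that do not match the assumed layout
        if wanted in (token.lower() for token in ngram.split()):
            yield ngram, int(count)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a multi-gigabyte ngram file.
    sample = ["open source\t120", "source code\t95", "opened door\t7"]
    for ngram, count in filter_ngrams(sample, "source"):
        print(f"{count}\t{ngram}")
```

For the real files, pass an open file handle instead of a list. The function reads one line at a time, so even the 13GB 7gram file never has to fit in memory.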
Project Gutenberg Ngram data
Title: N-gram data from Project Gutenberg
Author: Prashanth Ellina
Source: Project Gutenberg [n.d. probably 2008]
Ngrams: 2gram & 3gram (624MB)
Also: three tarballs (5.3GB each) of the “complete” text database.
Note: The source post is from May 2008, and Project Gutenberg is always growing, so why not get fresh data from the source? See the Project Gutenberg Mirroring How-To.
Code: Yes, step-by-step instructions.
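If you do pull fresh text from a mirror, counting your own 2grams and 3grams takes only a few lines. The sketch below is not Ellina's method; it is a minimal illustration, and the tokenizer (lowercase runs of letters, digits and apostrophes) and the function name count_ngrams are assumptions chosen for brevity.

```python
import re
from collections import Counter

# ASSUMPTION: a deliberately simple tokenizer; real corpora may need
# better handling of punctuation, hyphens and non-ASCII text.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def count_ngrams(text: str, n: int) -> Counter:
    """Count n-grams (as tuples of tokens) in a block of plain text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = TOKEN_RE.findall(text.lower())
    # Zip the token list against itself shifted by 0..n-1 positions,
    # which yields every run of n consecutive tokens.
    return Counter(zip(*(tokens[i:] for i in range(n))))


if __name__ == "__main__":
    text = "The cat sat on the cat."
    for ngram, count in count_ngrams(text, 2).most_common(3):
        print(count, " ".join(ngram))
```

This holds the whole text and all counts in memory, which is fine for a single book but not for the full collection. For that you would count per file and merge the results, or write intermediate counts to disk.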
Posted: Wednesday, February 15th, 2012 @ 8:27 am
Categories: How To.

