Wikipedia & Project Gutenberg Ngram databases

The permanent URL for this post is http://opendna.com/blog/902

I thought I’d follow up “ngram.sh: a script for extracting Google Ngram data” with data sources for Wikipedia and Project Gutenberg ngrams. The ngram.sh script can easily be modified to extract your keywords from these databases too.

Number of publications offered by Project Gutenberg, 1994-2008.

Wikipedia Ngram data

Title: Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram
Authors: Javier Artiles & Satoshi Sekine at NYU’s Proteus Project.
Source: Wikipedia [18:12, June 8, 2008 version]
Ngrams: 1gram (31MB); 2gram (447MB); 3gram (1.9GB); 4gram (4.3GB); 5gram (7.1GB); 6gram (10GB); 7gram (13GB)
Also: List of headwords, Infobox data, and sentences tagged by NLP (Natural Language Processing) tools.
Code: None provided.
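
Since no code comes with this data set, here is a minimal Python sketch of a keyword lookup in the spirit of ngram.sh. It assumes each line of an n-gram file holds the n-gram, a tab, then a count; I have not verified the actual TC Wikipedia file layout, so adjust `sep` and the field order to match the files you download. The function name `count_keywords` and the sample lines are made up for illustration.

```python
from collections import Counter


def count_keywords(lines, keywords, sep="\t"):
    """Sum the counts of all n-grams that contain each keyword as a whole token.

    ASSUMPTION: every line looks like "<ngram><sep><count>". Lines that do not
    fit that layout are skipped rather than guessed at.
    """
    wanted = {k.lower() for k in keywords}
    totals = Counter()
    for line in lines:
        ngram, _, count = line.rstrip("\n").rpartition(sep)
        if not ngram or not count.strip().isdigit():
            continue  # not in the assumed "<ngram><sep><count>" layout
        # Count each keyword once per n-gram, even if it appears twice in it.
        for token in set(ngram.lower().split()) & wanted:
            totals[token] += int(count)
    return totals


if __name__ == "__main__":
    # Made-up sample data in the assumed layout.
    sample = ["open source\t12\n", "source code\t5\n", "closed door\t7\n"]
    for word, total in sorted(count_keywords(sample, ["source", "door"]).items()):
        print(word, total)
```

For a real file, pass an open file handle instead of a list, for example `open("2gram.txt", encoding="utf-8")`, where `2gram.txt` is a placeholder name. Reading line by line matters here because the larger files run to several gigabytes.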

Number of articles in the English Wikipedia, 2001-2012.

Project Gutenberg Ngram data

Title: N-gram data from Project Gutenberg
Author: Prashanth Ellina
Source: Project Gutenberg [n.d. probably 2008]
Ngrams: 2gram & 3gram (624MB)
Also: three tarballs (5.3GB each) of the “complete” text database.
Note: The source post is from May 2008, and Project Gutenberg is always growing, so why not get fresh data from the source? See the Project Gutenberg Mirroring How-To.
Code: Yes, step-by-step instructions.
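
If you do pull fresh text from a mirror, counting 2grams and 3grams yourself is not much work. The sketch below is my own illustration, not the method from the source post: the tokenizer (lowercase, runs of letters, digits and apostrophes) and the function name `ngram_counts` are assumptions you will probably want to change.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams (as tuples of tokens) in a string of plain text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # ASSUMPTION: a token is a lowercase run of letters, digits or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Zip the token list against itself shifted by 1..n-1 positions, so each
    # tuple is a window of n consecutive tokens.
    return Counter(zip(*(tokens[i:] for i in range(n))))


if __name__ == "__main__":
    text = "the cat sat on the cat"
    for gram, count in ngram_counts(text, 2).most_common(3):
        print(" ".join(gram), count)
```

This holds the whole text in memory, which is fine for a single book but not for the full 5.3GB tarballs; for those you would process one file at a time and add the resulting counters together.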

Posted: Wednesday, February 15th, 2012 @ 8:27 am
Categories: How To.