Wikipedia & Project Gutenberg Ngram databases

February 15th, 2012

The permanent URL for this post is http://opendna.com/blog/902

I thought I’d follow up ngram.sh: a script for extracting Google Ngram data with data sources for Wikipedia and Project Gutenberg ngrams. The ngram.sh script can easily be modified to extract your keywords from these databases too (see the sketch under the Wikipedia listing below).

Number of publications offered by Project Gutenberg, 1994-2008.

Wikipedia Ngram data

Title: Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram
Author: Javier Artiles & Satoshi Sekine at NYU’s Proteus Project.
Source: Wikipedia [18:12, June 8, 2008 version]
Ngrams: 1gram (31MB); 2gram (447MB); 3gram (1.9GB); 4gram (4.3GB); 5gram (7.1GB); 6gram (10GB); 7gram (13GB)
Also: List of headwords, Infobox data, and sentences tagged by NLP (Natural Language Processing) tools.
Code: None provided.
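
Since no extraction code ships with this dataset, here is a minimal sketch of the kind of modification to ngram.sh mentioned above. It assumes you have already downloaded one of the n-gram files to your working directory and that it is tab-separated with the n-gram in the first column; the file name wikipedia_1grams.txt is a placeholder, not the dataset’s actual name.

    #!/bin/bash
    # Hypothetical sketch: pull keywords out of a locally downloaded Wikipedia 1gram file.
    # "wikipedia_1grams.txt" is a placeholder; use whatever file you actually downloaded.
    for word in telegraph telephone television Internet
    do
        # Keep only rows whose first tab-separated field is the keyword (case-insensitive).
        awk -F'\t' -v w="$word" 'tolower($1) == tolower(w)' wikipedia_1grams.txt > "${word}.wikipedia.csv"
    done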

Number of articles in the English Wikipedia, 2001-2012.

Project Gutenberg Ngram data

Title: N-gram data from Project Gutenberg
Author: Prashanth Ellina
Source: Project Gutenberg [n.d. probably 2008]
Ngrams: 2gram & 3gram (624MB)
Also: three tarballs (5.3GB each) of the “complete” text database.
Note: This post is from May 2008, and Project Gutenberg is always growing, so why not get fresh data from the source? See the Project Gutenberg Mirroring How-To.
Code: Yes, step-by-step instructions.
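
If you would rather mirror fresh texts and build your own counts, the Mirroring How-To describes rsync-based mirroring. The command below only illustrates the general shape; the mirror host and module name are placeholders that you should replace with ones listed in the How-To.

    # Hypothetical sketch of a Project Gutenberg mirror command.
    # SOME.MIRROR.HOST::gutenberg is a placeholder; take a real host and module name
    # from the Project Gutenberg Mirroring How-To.
    rsync -av --del SOME.MIRROR.HOST::gutenberg /path/to/local/gutenberg/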

ngram.sh

February 12th, 2012

The permanent URL for this post is http://opendna.com/blog/850

What is ngram.sh?

The Google Ngram Viewer is a database browser used to chart the relative frequency of words or phrases. The data source is the Google Books database (sort of) and the graphic engine is Google Charts. It’s cool. It’s pretty. It’s hard to use for academic work because it doesn’t easily give up the raw data.

This page explains how to run a script — ngram.sh — on a *NIX shell account and extract your keywords into spreadsheet-ready files, without buying a terabyte hard drive.

Ngram frequency graph of four communication technologies — telegraph, telephone, television and Internet — from 1850 to 2000.

Before mapping out your research project, you should spend some time testing out search terms in the Google Ngram Viewer, reading this page’s bibliography and selecting one of the Ngram datasets.

How ngram.sh works

By default, ngram.sh is configured for 1gram searches of the English Version 20090715 dataset. It will download ten ~210MB ZIP files, one at a time, unpack the ~1000MB CSV inside each, and search it with grep for each keyword. It writes the output to a keyword-named CSV and then deletes the source file. You must have a MINIMUM of 2GB of free disk space to run the default script (sorry, your SFU diskshare isn’t big enough).
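
In outline, the default loop looks something like the sketch below. This is a simplified reconstruction, not the script itself, and the download URL pattern is my assumption about where the English Version 20090715 1gram files live; check it against the dataset page before running anything.

    #!/bin/bash
    # Simplified reconstruction of ngram.sh's default behaviour (not the actual script).
    # The URL pattern is an assumption; verify it against the Google Books Ngram dataset page.
    BASE="http://commondatastorage.googleapis.com/books/ngrams/books"
    for i in 0 1 2 3 4 5 6 7 8 9
    do
        zip="googlebooks-eng-all-1gram-20090715-${i}.csv.zip"
        wget "${BASE}/${zip}"                     # ~210MB download
        unzip "$zip"                              # ~1000MB CSV inside
        csv="${zip%.zip}"
        for word in telegraph telephone television Internet
        do
            grep -i "^${word}[[:space:]]" "$csv" >> "${word}.csv"   # append matches per keyword
        done
        rm -f "$zip" "$csv"                       # delete sources to free disk space
    done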

This script can take quite a while to run. In part that’s because it’s searching large amounts of data, but mostly it’s because it’s downloading lots of data. Unless you’re running on a very fast Internet connection, bandwidth is your bottleneck. You can speed things up for future searches by removing the lines that delete the source (or CSV) files. You could then modify the script to run off your hard drive without downloading anew. However, you must have upwards of 10GB available to do this with 1grams. If you change the script to process 2-grams or higher, WATCH OUT! A multi-keyword search of 5-grams without deletes can easily top a terabyte of data!
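
If you have kept the unpacked CSVs from a previous run, a follow-up search can skip the downloads entirely and just grep the local files. A sketch, assuming the ten CSVs are still in the current directory under their original names:

    # Hypothetical re-run against CSVs kept from an earlier run (no downloading).
    for csv in googlebooks-eng-all-1gram-20090715-*.csv
    do
        for word in wireless radio
        do
            grep -i "^${word}[[:space:]]" "$csv" >> "${word}.csv"
        done
    done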

It is really, REALLY easy to fill every last sector of your hard drive — or bust your bandwidth cap! — with this script. Be cautious and pay attention.

How to run ngram.sh

You can run this script on any UNIX or *NIX system, including an Apple/Mac personal computer running OS X (YouTube). Download ngram.sh. Access is restricted to authenticated SFU users, but non-SFU users may email jay@opendna.com or visit the OpenDNA Project for a copy.

Permissions: Some people like to run their scripts with “bash ngram.sh” and the file permissions unchanged, while others prefer to set executable permissions with chmod 755 and run it with “./ngram.sh”.
Configuration: The very first line of the script will need to be edited with your path to bash. Replace the 1grams on line 41 of the script (the line beginning “for word in”) with your search keywords, as in the sketch after this list. Save and run the script.
Cleaning the results: I like to manipulate the CSVs in MS Excel, but any spreadsheet application will do (even GoogleDocs). When reading the results of your query, you’re likely to discover that you grabbed a bunch of words you didn’t intend to, will have to collapse a bunch that are similar, and might have missed a few that you wanted. Keyword selection is a science and an art. Just modify your script, set it to run, make yourself some tea and hope you don’t cause your ISP to blow a gasket.
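
For concreteness, here is roughly what those first two steps might look like; the contents of ngram.sh are paraphrased here, not quoted from the script.

    # Make the script executable (optional; "bash ngram.sh" works with permissions unchanged).
    chmod 755 ngram.sh

    # Find out where bash lives, then put that path on the script's first line:
    which bash        # e.g. /bin/bash or /usr/local/bin/bash

    # The keyword line (line 41, beginning "for word in") then becomes something like:
    #   for word in telegraph telephone television Internet
    ./ngram.sh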

Happy counting!

Bibliography

Lieberman, E., Michel, J.-B., Jackson, J., Tang, T., & Nowak, M. A. (2007). Quantifying the evolutionary dynamics of language. Nature, 449(7163), 713-716. doi:10.1038/nature06137

Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, Pickett, J. P., et al. (2010). Quantitative Analysis of Culture Using Millions of Digitized Books. Science. doi:10.1126/science.1199644