<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Jay McKinnon</title>
	<atom:link href="http://pages.cmns.sfu.ca/jay-mckinnon/feed/" rel="self" type="application/rss+xml" />
	<link>http://pages.cmns.sfu.ca/jay-mckinnon</link>
	<description>CMNS Graduate Student page</description>
	<lastBuildDate>Wed, 15 Feb 2012 15:28:55 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Wikipedia &amp; Project Gutenberg Ngram databases</title>
		<link>http://pages.cmns.sfu.ca/jay-mckinnon/2012/02/15/wikipedia-project-gutenberg-ngram-databases/</link>
		<comments>http://pages.cmns.sfu.ca/jay-mckinnon/2012/02/15/wikipedia-project-gutenberg-ngram-databases/#comments</comments>
		<pubDate>Wed, 15 Feb 2012 15:27:32 +0000</pubDate>
		<dc:creator>Jay McKinnon</dc:creator>
				<category><![CDATA[How To]]></category>

		<guid isPermaLink="false">http://pages.cmns.sfu.ca/jay-mckinnon/?p=80</guid>
		<description><![CDATA[The permanent URL for this post is http://opendna.com/blog/902 I thought I&#8217;d follow up ngram.sh: a script for extracting Google Ngram data with data sources for Wikipedia and Project Gutenberg ngrams. The ngram.sh script can easily be modified to extract your keywords from these databases too. Wikipedia Ngram data Title: Tagged and Cleaned Wikipedia (TC Wikipedia) [...]]]></description>
				<content:encoded><![CDATA[<p><em>The permanent URL for this post is <a title="Wikipedia &amp; Project Gutenberg Ngram databases [opendna project]" href="http://opendna.com/blog/902" target="_blank">http://opendna.com/blog/902</a></em></p>
<p>I thought I&#8217;d follow up <a title="ngram.sh: a script for extracting Google Ngram data" href="http://opendna.com/blog/850">ngram.sh: a script for extracting Google Ngram data</a> with data sources for Wikipedia and Project Gutenberg ngrams. The ngram.sh script can easily be modified to extract your keywords from these databases too.</p>
<div>
<div id="attachment_906" class="wp-caption alignleft" style="width: 265px"><a href="http://pages.cmns.sfu.ca/jay-mckinnon/?attachment_id=906" rel="attachment wp-att-906"><img class="size-medium wp-image-906 " src="http://opendna.com/blog/wp-content/uploads/2012/02/500px-Project_Gutenberg_total_books.svg_-300x199.png" alt="Number of publications offered by Project Gutenberg, 1994-2008." width="255" height="169" /></a><p class="wp-caption-text">Number of publications offered by Project Gutenberg, 1994-2008.</p></div>
<h3>Wikipedia Ngram data</h3>
<p>Title: <a href="http://nlp.cs.nyu.edu/wikipedia-data/">Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram</a><br />
Author: <a title="Javier Artiles" href="http://nlp.cs.qc.cuny.edu/artiles">Javier Artiles</a> &amp; <a title="Satoshi Sekine" href="http://cs.nyu.edu/~sekine/">Satoshi Sekine</a> at NYU&#8217;s <a title="Proteus Project" href="http://nlp.cs.nyu.edu/index.shtml">Proteus Project</a>.<br />
Source: Wikipedia [18:12, June 8, 2008 version]<br />
Ngrams: 1gram (31MB); 2gram (447MB); 3gram (1.9GB); 4gram (4.3GB); 5gram (7.1GB); 6gram (10GB); 7gram (13GB)<br />
Also: List of headwords, Infobox data, and sentences tagged by NLP (Natural Language Processing) tools.<br />
Code: None provided.</p>
</div>
<div>
<div id="attachment_912" class="wp-caption alignright" style="width: 265px"><a href="http://pages.cmns.sfu.ca/jay-mckinnon/?attachment_id=912" rel="attachment wp-att-912"><img class="size-medium wp-image-912 " src="http://opendna.com/blog/wp-content/uploads/2012/02/WikipediaGrowthGraph-300x223.png" alt="Number of articles in the English Wikipedia, 2001-2012." width="255" height="190" /></a><p class="wp-caption-text">Number of articles in the English Wikipedia, 2001-2012.</p></div>
<h3>Project Gutenberg Ngram data</h3>
<p>Title: <a title="N-gram data from Project Gutenberg" href="http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/">N-gram data from Project Gutenberg</a><br />
Author: <a title="Prashanth Ellina" href="http://blog.prashanthellina.com/">Prashanth Ellina</a><br />
Source: Project Gutenberg [n.d. probably 2008]<br />
Ngrams: 2gram &amp; 3gram (624MB)<br />
Also: three tarballs (5.3GB each) of the &#8220;complete&#8221; text database.<br />
Note: Ellina&#8217;s post is from May 2008, and Project Gutenberg is always growing, so why not get fresh data from the source? See the Project Gutenberg <a href="http://www.gutenberg.org/wiki/Gutenberg:Mirroring_How-To">Mirroring How-To</a>.<br />
Code: Yes, step-by-step instructions.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://pages.cmns.sfu.ca/jay-mckinnon/2012/02/15/wikipedia-project-gutenberg-ngram-databases/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ngram.sh</title>
		<link>http://pages.cmns.sfu.ca/jay-mckinnon/2012/02/12/ngram/</link>
		<comments>http://pages.cmns.sfu.ca/jay-mckinnon/2012/02/12/ngram/#comments</comments>
		<pubDate>Sun, 12 Feb 2012 20:58:13 +0000</pubDate>
		<dc:creator>Jay McKinnon</dc:creator>
				<category><![CDATA[How To]]></category>

		<guid isPermaLink="false">http://pages.cmns.sfu.ca/jay-mckinnon/?p=67</guid>
		<description><![CDATA[The permanent URL for this post is http://opendna.com/blog/850 What is ngram.sh? The Google Ngram Viewer is a database browser used to chart the relative frequency of words or phrases. The data source is the Google Books database (sort of) and the graphic engine is Google Charts. It&#8217;s cool. It&#8217;s pretty. It&#8217;s hard to use for [...]]]></description>
				<content:encoded><![CDATA[<p><em>The permanent URL for this post is <a title="ngram.sh [opendna project]" href="http://opendna.com/blog/850" target="_blank">http://opendna.com/blog/850</a></em></p>
<h2>What is ngram.sh?</h2>
<p>The <a href="http://books.google.com/ngrams">Google Ngram Viewer</a> is a database browser used to chart the relative frequency of words or phrases. The data source is the Google Books database (sort of) and the graphic engine is <a href="http://code.google.com/apis/chart/">Google Charts</a>. It&#8217;s cool. It&#8217;s pretty. It&#8217;s hard to use for academic work because it doesn&#8217;t easily give up the raw data.</p>
<p>This page explains how to run a script — ngram.sh — on a *NIX shell account and extract your keywords into spreadsheet-ready files, <em>without</em> buying a terabyte hard drive.</p>
<div class="wp-caption alignnone" style="width: 550px"><a href="http://books.google.com/ngrams/graph?content=telegraph%2Ctelephone%2Ctelevision%2CInternet&amp;year_start=1850&amp;year_end=2000&amp;corpus=0&amp;smoothing=3"><img src="http://books.google.com/ngrams/chart?content=Internet%2Ctelephone%2Ctelevision%2Ctelegraph&amp;corpus=0&amp;smoothing=3&amp;year_start=1850&amp;year_end=2000" alt="Ngrams: telegraph, telephone, television &amp; Internet (1850-2000)" width="540" /></a><p class="wp-caption-text">Ngram frequency graph of four communication technologies — telegraph, telephone, television and Internet — from 1850 to 2000.</p></div>
<p>Before mapping out your research project, you should spend some time testing out search terms in the <a href="http://books.google.com/ngrams">Google Ngram Viewer</a>, reading this page&#8217;s <a href="#bibliography">bibliography</a> and selecting one of the <a href="http://books.google.com/ngrams/datasets">Ngram datasets</a>.</p>
<h2>How ngram.sh works</h2>
<p>By default, ngram.sh is configured for 1gram searches of the English Version 20090715 dataset. It downloads ten ~210MB ZIP files, one at a time, unpacks the ~1000MB CSV inside each, and searches it (with grep) for each keyword. It writes the matches to a keyword-named CSV, then deletes the source file. You need a MINIMUM of 2GB of free disk space to run the default script (sorry, your SFU diskshare isn&#8217;t big enough).</p>
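<p>If it helps to see those steps spelled out, here is a minimal Python sketch of the same download, unpack, search, delete loop. It is not the ngram.sh source: the URL pattern, the function names (<code>append_matches</code>, <code>process_shard</code>) and the local file names are assumptions for illustration, and the substring test only imitates a plain grep.</p>

```python
import os
import urllib.request
import zipfile

# Assumed URL pattern for the English 1gram files (version 20090715).
# It is not copied from ngram.sh; check the Ngram datasets page first.
URL_TEMPLATE = (
    "http://commondatastorage.googleapis.com/books/ngrams/books/"
    "googlebooks-eng-all-1gram-20090715-{shard}.csv.zip"
)


def append_matches(lines, keywords, out_dir):
    """Append each line containing a keyword to out_dir/KEYWORD.csv.

    Matching is a plain substring test, like an unanchored grep, so
    "tele" also catches "telegraph" and "telephone".
    Returns the number of lines written per keyword.
    """
    counts = {word: 0 for word in keywords}
    handles = {
        word: open(os.path.join(out_dir, word + ".csv"), "a", encoding="utf-8")
        for word in keywords
    }
    try:
        for line in lines:
            for word in keywords:
                if word in line:
                    handles[word].write(line)
                    counts[word] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts


def process_shard(shard, keywords, work_dir):
    """Download one ZIP, unpack it, search it, then delete both files."""
    zip_path = os.path.join(work_dir, "shard-{}.zip".format(shard))
    urllib.request.urlretrieve(URL_TEMPLATE.format(shard=shard), zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        # Assumes each ZIP holds exactly one data file.
        csv_path = archive.extract(archive.namelist()[0], work_dir)
    try:
        with open(csv_path, encoding="utf-8") as source:
            return append_matches(source, keywords, work_dir)
    finally:
        # Free the disk space before the next download starts.
        os.remove(zip_path)
        os.remove(csv_path)
```

<p>Assuming the ten files are numbered 0 to 9, calling <code>process_shard(n, ["telegraph", "telephone"], ".")</code> for each n would walk them one at a time, so only one ZIP and one unpacked CSV are on disk at any moment.</p>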
<p>This script can take quite a while to run. In part that&#8217;s because it&#8217;s searching large amounts of data, but mostly it&#8217;s because it&#8217;s <em>downloading</em> lots of data. Unless you&#8217;re running on a very fast Internet connection, bandwidth is your bottleneck. You can speed things up for future searches by removing the lines that delete the source (or CSV) files. You could then modify the script to run off your hard drive without downloading anew. However, you must have upwards of 10GB available to do this with 1grams. If you change the script to process 2-grams or higher, WATCH OUT! A multi-keyword search of 5-grams without deletes can easily top a terabyte of data!</p>
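<p>The disk figures for 1grams come from simple arithmetic on the file sizes quoted above (ten files, roughly 210MB zipped and 1000MB unpacked). A rough Python sketch of that budget follows. The function names are hypothetical, the sizes are the approximate ones from this page, and the estimate ignores the (much smaller) keyword output files.</p>

```python
import shutil


def disk_needed_mb(shards, zip_mb, csv_mb, keep_zip=False, keep_csv=False):
    """Rough peak disk use in MB when files are processed one at a time."""
    # One ZIP and its unpacked CSV always coexist while a file is searched.
    peak = zip_mb + csv_mb
    if keep_zip:
        peak += (shards - 1) * zip_mb  # earlier ZIPs are never deleted
    if keep_csv:
        peak += (shards - 1) * csv_mb  # earlier CSVs are never deleted
    return peak


def has_room(path, needed_mb):
    """True if the filesystem holding `path` has at least needed_mb free."""
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    return free_mb >= needed_mb


default_run = disk_needed_mb(10, 210, 1000)                  # 1210
keep_csvs = disk_needed_mb(10, 210, 1000, keep_csv=True)     # 10210
keep_all = disk_needed_mb(10, 210, 1000, keep_zip=True, keep_csv=True)  # 12100
```

<p>That is where the 2GB minimum and the &#8220;upwards of 10GB&#8221; figure come from. Checking <code>has_room(".", keep_csvs)</code> before a run is one cheap way to avoid filling the drive.</p>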
<p>It is really, REALLY easy to fill every last sector of your hard drive — or bust your bandwidth cap! — with this script. Be cautious and <strong>pay attention</strong>.</p>
<h2>How to run ngram.sh</h2>
<p>You can run this script on any UNIX or *NIX system, including an <a href="http://youtu.be/nZqi3BqqeqI">Apple Mac running OS X or later</a> (YouTube). <a href="http://www.sfu.ca/~jkm9/ngram/ngram.sh">Download ngram.sh</a>. Access is restricted to authenticated SFU users, but non-SFU users may email <a href="mailto:jay@opendna.com?subject=ngram.sh">jay@opendna.com</a> or visit <a title="ngram.sh [opendna project]" href="http://opendna.com/blog/850" target="_blank">the OpenDNA Project</a> for a copy.</p>
<p>Permissions: Some people like to run their scripts with &#8220;<em>bash ngram.sh</em>&#8221; and file permissions unchanged, others prefer to set executable permissions with chmod 755 and run with &#8220;<em>./ngram.sh</em>&#8221;.<br />
Configuration: The very first line will need to be edited with your path to bash. Replace the 1grams on line 41 (beginning &#8220;for word in&#8221;) with your search keywords. Save and run the script.<br />
Cleaning the results: I like to manipulate the CSVs in MS Excel, but any spreadsheet application will do (even Google Docs). When reading the results of your query, you&#8217;re likely to discover that you grabbed a bunch of words you didn&#8217;t intend to, that you have to collapse a bunch that are similar, and that you missed a few you wanted. Keyword selection is a science and an art. Just modify your script, set it to run, make yourself some tea and hope you don&#8217;t cause your ISP to blow a gasket.</p>
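<p>If you would rather do the first pass of cleaning outside a spreadsheet, the collapsing step can be sketched in a few lines of Python. This is only an illustration under assumptions: it supposes each result line is tab-separated as ngram, year, match count, page count, volume count (check your own files), and the function names are made up for this example.</p>

```python
import csv
from collections import defaultdict


def load_rows(path):
    """Read a tab-separated results file into a list of rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters inside ngrams as plain text.
        return list(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def yearly_counts(rows, variants):
    """Sum match counts per year over the spellings you want collapsed.

    Rows whose ngram is not one of `variants` (compared without regard
    to case) are dropped, which removes unintended substring matches.
    """
    wanted = {variant.lower() for variant in variants}
    totals = defaultdict(int)
    for row in rows:
        ngram, year, match_count = row[0], int(row[1]), int(row[2])
        if ngram.lower() in wanted:
            totals[year] += match_count
    return dict(sorted(totals.items()))
```

<p>For example, <code>yearly_counts(load_rows("Internet.csv"), ["Internet"])</code> would fold &#8220;Internet&#8221; and &#8220;internet&#8221; into one series per year and discard anything else the search picked up.</p>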
<p>Happy counting!</p>
<p><a name="bibliography"></a></p>
<h2>Bibliography</h2>
<p>Lieberman, E., Michel, J.-B., Jackson, J., Tang, T., &amp; Nowak, M. A. (2007). <a title="Abstract: Human language is based on grammatical rules. Cultural evolution allows these rules to change over time. Rules compete with each other: as new rules rise to prominence, old ones die away. To quantify the dynamics of language evolution, we studied the regularization of English verbs over the past 1,200 years. Although an elaborate system of productive conjugations existed in English's proto-Germanic ancestor, Modern English uses the dental suffix, '-ed', to signify past tense. Here we describe the emergence of this linguistic rule amidst the evolutionary decay of its exceptions, known to us as irregular verbs. We have generated a data set of verbs whose conjugations have been evolving for more than a millennium, tracking inflectional changes to 177 Old-English irregular verbs. Of these irregular verbs, 145 remained irregular in Middle English and 98 are still irregular today. We study how the rate of regularization depends on the frequency of word usage. The half-life of an irregular verb scales as the square root of its usage frequency: a verb that is 100 times less frequent regularizes 10 times as fast. Our study provides a quantitative analysis of the regularization process by which ancestral forms gradually yield to an emerging linguistic rule." href="http://www.nature.com/nature/journal/v449/n7163/abs/nature06137.html">Quantifying the evolutionary dynamics of language</a>. Nature, 449(7163), 713-716. doi:10.1038/nature06137</p>
<p>Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, Pickett, J. P., et al. (2010). <a title="Abstract: We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities." href="http://www.sciencemag.org/content/331/6014/176.abstract">Quantitative Analysis of Culture Using Millions of Digitized Books</a>. Science. doi:10.1126/science.1199644</p>
]]></content:encoded>
			<wfw:commentRss>http://pages.cmns.sfu.ca/jay-mckinnon/2012/02/12/ngram/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
