<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: American English word frequencies?</title>
	<link>http://ask.metafilter.com/60283/American-English-word-frequencies/</link>
	<description>Comments on Ask MetaFilter post American English word frequencies?</description>
	<pubDate>Mon, 09 Apr 2007 22:01:18 -0800</pubDate>
	<lastBuildDate>Mon, 09 Apr 2007 22:01:18 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: American English word frequencies?</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies</link>	
		<description>Where can I find a good word frequency list for American English? &lt;br /&gt;&lt;br /&gt; Requirements: Not just sorted by frequency, but with specific frequency information for each word.  &lt;em&gt;Not&lt;/em&gt; lemmatized.  &lt;a href=&quot;http://www.comp.lancs.ac.uk/ucrel/bncfreq/flists.html&quot;&gt;These lists&lt;/a&gt; are almost perfect, except they&apos;re based on British English (and they have separate entries for &quot;n&apos;t&quot; and &quot;&apos;s&quot;).&lt;br&gt;
&lt;br&gt;
Does such a list even exist?  Is this the kind of thing I can&apos;t get online?  Google has revealed to me only other lists based on the British National Corpus and a lot of word lists for open-source spell checkers and Scrabble programs.  Am I missing something?&lt;br&gt;
&lt;br&gt;
(Suggestions for other interesting corpora that I could use to generate my own list of this sort would also be appreciated&#8212;this is for a generative poetry project.)</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2007:site.60283</guid>
		<pubDate>Mon, 09 Apr 2007 21:38:40 -0800</pubDate>
		<dc:creator>aparrish</dc:creator>
		
			<category>language</category>
		
			<category>linguistics</category>
		
			<category>corpora</category>
		
			<category>wordfrequency</category>
		
	</item> <item>
		<title>By: demiurge</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#907340</link>	
		<description>I suggest training your generator on a specific poet and seeing if you can get it to emulate that style.  Then you can add in other base text to get more breadth. &lt;br&gt;
&lt;br&gt;
For example, generate frequencies from http://www.infomotions.com/etexts/literature/english/1600-1699/shakespeare-sonnets-59.txt</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-907340</guid>
		<pubDate>Mon, 09 Apr 2007 22:01:18 -0800</pubDate>
		<dc:creator>demiurge</dc:creator>
	</item><item>
		<title>By: epugachev</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#907348</link>	
		<description>The &lt;a href=&quot;http://en.wikipedia.org/wiki/Brown_Corpus&quot;&gt;Brown Corpus&lt;/a&gt; was what they referred us to in a computational linguistics course I took ~5 years ago.  It seems like there may be copyright issues that explain why it is not more widely available, but the first page of Google results includes this &lt;a href=&quot;http://dingo.sbs.arizona.edu/~hammond/ling696f-sp03/browncorpus.txt&quot;&gt;plaintext version&lt;/a&gt;.&lt;br&gt;
&lt;br&gt;
The &lt;a href=&quot;http://ldc.upenn.edu/&quot;&gt;LDC&lt;/a&gt; at Penn may be of help.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-907348</guid>
		<pubDate>Mon, 09 Apr 2007 22:16:29 -0800</pubDate>
		<dc:creator>epugachev</dc:creator>
	</item><item>
		<title>By: demiurge</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#907350</link>	
		<description>Oh, of course if you generate frequencies yourself, you&apos;ll probably need a part-of-speech tagger, which could be extra work for you.&lt;br&gt;
&lt;br&gt;
I&apos;m not sure I understand what&apos;s wrong with the lists you linked to. Do you just not want British English?</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-907350</guid>
		<pubDate>Mon, 09 Apr 2007 22:21:44 -0800</pubDate>
		<dc:creator>demiurge</dc:creator>
	</item><item>
		<title>By: lukemeister</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#907359</link>	
		<description>The first release of the &lt;a href=&quot;http://americannationalcorpus.org/&quot;&gt;American National Corpus&lt;/a&gt; is available for $75 for non-commercial use.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-907359</guid>
		<pubDate>Mon, 09 Apr 2007 22:28:43 -0800</pubDate>
		<dc:creator>lukemeister</dc:creator>
	</item><item>
		<title>By: ALongDecember</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#907361</link>	
		<description>The &lt;a href=&quot;http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm&quot;&gt;MRC Psycholinguistic Database&lt;/a&gt; has helped me before with psychology projects. It runs a Unix dict command on the database based on parameters you set, and it includes Brown Frequency. Take a look at the parameters: to get a list of the most common words, set the minimum BROWN-FREQ to about 50 and you&apos;ll get a large enough list. You&apos;ll have to sort the list in Excel, but it&apos;ll cut and paste easily.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-907361</guid>
		<pubDate>Mon, 09 Apr 2007 22:29:40 -0800</pubDate>
		<dc:creator>ALongDecember</dc:creator>
	</item><item>
		<title>By: UbuRoivas</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#907379</link>	
		<description>&lt;a href=&quot;http://www.wordcount.org/&quot;&gt;wordcount&lt;/a&gt;?&lt;br&gt;
&lt;br&gt;
I think they have an API as well, if you are technically minded.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-907379</guid>
		<pubDate>Mon, 09 Apr 2007 23:15:35 -0800</pubDate>
		<dc:creator>UbuRoivas</dc:creator>
	</item><item>
		<title>By: UbuRoivas</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#907381</link>	
		<description>oh, cancel that.&lt;br&gt;
&lt;br&gt;
&lt;em&gt;WordCount data currently comes from the British National Corpus.&lt;/em&gt;</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-907381</guid>
		<pubDate>Mon, 09 Apr 2007 23:17:53 -0800</pubDate>
		<dc:creator>UbuRoivas</dc:creator>
	</item><item>
		<title>By: Citizen Premier</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#907408</link>	
		<description>Do the &lt;a href=&quot;http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists&quot;&gt;wiktionary:Frequency Lists&lt;/a&gt; help?</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-907408</guid>
		<pubDate>Tue, 10 Apr 2007 00:44:44 -0800</pubDate>
		<dc:creator>Citizen Premier</dc:creator>
	</item><item>
		<title>By: miagaille</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#907588</link>	
		<description>Seconding the &lt;a href=&quot;http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm&quot;&gt;MRC database&lt;/a&gt;. However, please note that the Brown-Freq it gives does NOT come from the Brown Corpus, which is probably the corpus you want.  The labels are by editor, so the data for the Brown Corpus is under Kucera-Francis Frequency (edited by Francis and Kucera, 1967). &quot;Brown&quot; here refers to the London-Lund Corpus (edited by Brown, 1984). The LLC is British English; the Brown Corpus is American.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-907588</guid>
		<pubDate>Tue, 10 Apr 2007 07:05:52 -0800</pubDate>
		<dc:creator>miagaille</dc:creator>
	</item><item>
		<title>By: snownoid</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#907700</link>	
		<description>The &lt;a href=&quot;http://nltk.sourceforge.net/index.html&quot;&gt;nltk-lite&lt;/a&gt; corpora package includes a POS-tagged version of the Brown corpus.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-907700</guid>
		<pubDate>Tue, 10 Apr 2007 09:07:02 -0800</pubDate>
		<dc:creator>snownoid</dc:creator>
	</item><item>
		<title>By: aparrish</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#911013</link>	
		<description>Thanks for the leads, everyone!</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-911013</guid>
		<pubDate>Thu, 12 Apr 2007 19:43:13 -0800</pubDate>
		<dc:creator>aparrish</dc:creator>
	</item><item>
		<title>By: eritain</title>
		<link>http://ask.metafilter.com/60283/American-English-word-frequencies#915866</link>	
		<description>The Brown corpus, be it noted, is from the 1960s, and I think it&apos;s mostly business/light-literary English. If that&apos;s what you want to generate, OK, but the vocabulary might not be very interesting for poetry. &lt;br&gt;
&lt;br&gt;
For creative purposes, I think you&apos;d be OK using British sources. If it really had to taste American, you could post-process it for spelling and &lt;i&gt;have/had gotten&lt;/i&gt; and so forth. Yes, there are still differences of frequency, but I&apos;m not sure anyone is going to stand over your generator long enough to establish for sure that its ratio of sentence-initial &lt;i&gt;so&lt;/i&gt; to sentence-final &lt;i&gt;then&lt;/i&gt; is too low for red-blooded American poetry. &lt;br&gt;
&lt;br&gt;
And for what it&apos;s worth, I once used &lt;a href=&quot;http://view.byu.edu/&quot;&gt;Variation In English Words&lt;/a&gt; (a web interface to the British National Corpus) to search for words that were exceptionally more common in one register than another. For example, I took the adjectives that are relatively prevalent in fiction as contrasted with news; the top five hits were &lt;i&gt;faint, silken, husky, rueful,&lt;/i&gt; and &lt;i&gt;momentary&lt;/i&gt;&amp;mdash;now just try and tell me that isn&apos;t Distilled Essence of Bodice-Ripper. I can see those sorts of lists helping you create a near-parodic, over-the-top feeling&amp;mdash;and that, for me, is what makes generated art work.&lt;br&gt;
&lt;br&gt;
&lt;b&gt;Or&lt;/b&gt; you could slurp the collected works of Emily Dickinson from Project Gutenberg, and Perl up the frequency lists yourself. Possibilities ...</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.60283-915866</guid>
		<pubDate>Tue, 17 Apr 2007 21:23:29 -0800</pubDate>
		<dc:creator>eritain</dc:creator>
	</item>
	</channel>
</rss>
