{"id":2369,"date":"2018-02-04T13:10:12","date_gmt":"2018-02-04T18:10:12","guid":{"rendered":"http:\/\/osric.com\/chris\/accidental-developer\/?p=2369"},"modified":"2018-02-04T13:10:12","modified_gmt":"2018-02-04T18:10:12","slug":"analyzing-text-to-find-common-terms-using-python-and-nltk","status":"publish","type":"post","link":"https:\/\/osric.com\/chris\/accidental-developer\/2018\/02\/analyzing-text-to-find-common-terms-using-python-and-nltk\/","title":{"rendered":"Analyzing text to find common terms using Python and NLTK"},"content":{"rendered":"<p>I recently started playing with the Python <a href=\"http:\/\/www.nltk.org\/\">NLTK<\/a> (Natural Language ToolKit) to analyze text. The book <a href=\"http:\/\/www.nltk.org\/book\/\">Natural Language Processing with Python<\/a> is available online and is very helpful if you&#8217;re just getting started.<\/p>\n<p>At the beginning of the book the examples cover importing and analyzing text (primarily books) that you load from nltk (<a href=\"http:\/\/www.nltk.org\/book\/ch01.html#getting-started-with-nltk\">Getting Started with NLTK<\/a>). It includes texts like <em>Moby-Dick<\/em> and <em>Sense and Sensibility<\/em>.<\/p>\n<p>But you will probably want to analyze a source of your own. For example, I had text from a series of tweets debating political issues. The third chapter (<a href=\"http:\/\/www.nltk.org\/book\/ch03.html#accessing-text-from-the-web-and-from-disk\">Accessing Text from the Web and from Disk<\/a>) has the answers:<\/p>\n<p>First you need to turn raw text into tokens:<\/p>\n<pre><code>tokens = word_tokenize(raw)<\/code><\/pre>\n<p>Next turn your tokens into NLTK text:<\/p>\n<pre><code>text = nltk.Text(tokens)<\/code><\/pre>\n<p>Now you can treat it like the book examples in chapter 1.<\/p>\n<p>I was analyzing a number of tweets. 
One of the things I wanted to do was find common words in the tweets, to see if there were particular keywords that were common.<\/p>\n<p>I was using the Python interpreter for my tests, and I did run into a couple errors with <code>word_tokenize<\/code> and later <code>FreqDist<\/code>, such as:<\/p>\n<pre><code>&gt;&gt;&gt; fdist1 = FreqDist(text)\r\nTraceback (most recent call last):\r\n  File \"&lt;stdin&gt;\", line 1, in &lt;module&gt;\r\nNameError: name 'FreqDist' is not defined<\/code><\/pre>\n<p>You can address this by importing the specific libraries:<\/p>\n<pre><code>&gt;&gt;&gt; from nltk import FreqDist<\/code><\/pre>\n<p>Here are the commands, in order, that I ran to produce my list of common words &#8212; in this case, I was looking for words that appeared at least 3 times and that were at least 5 characters long:<\/p>\n<pre><code>&gt;&gt;&gt; import nltk\r\n&gt;&gt;&gt; from nltk import word_tokenize\r\n&gt;&gt;&gt; from nltk import FreqDist\r\n\r\n&gt;&gt;&gt; with open(\"corpus-twitter\", \"r\") as myfile:\r\n...     
raw = myfile.read().decode(\"utf8\")\r\n\r\n&gt;&gt;&gt; tokens = word_tokenize(raw)\r\n&gt;&gt;&gt; text = nltk.Text(tokens)\r\n\r\n&gt;&gt;&gt; fdist = FreqDist(text)\r\n&gt;&gt;&gt; sorted(w for w in set(text) if len(w) &gt;= 5 and fdist[w] &gt;= 3)\r\n\r\n[u'Americans', u'Detroit', u'Please', u'TaxReform', u'Thanks', u'There', u'Trump', u'about', u'against', u'always', u'anyone', u'argument', u'because', u'being', u'believe', u'context', u'could', u'debate', u'defend', u'diluted', u'dollars', u'enough', u'every', u'going', u'happened', u'heard', u'human', u'ideas', u'immigration', u'indefensible', u'logic', u'never', u'opinion', u'people', u'point', u'pragmatic', u'problem', u'problems', u'proposed', u'public', u'question', u'really', u'restricting', u'right', u'saying', u'school', u'scope', u'serious', u'should', u'solution', u'still', u'talking', u'their', u'there', u'think', u'thinking', u'thread', u'times', u'truth', u'trying', u'tweet', u'understand', u'until', u'welfare', u'where', u'world', u'would', u'wrong', u'years', u'yesterday']<\/code><\/pre>\n<p>Note that I was using Python 2 (hence the <code>u'&#8230;'<\/code> prefixes in the output); in Python 3 you can open the file with <code>open(\"corpus-twitter\", encoding=\"utf8\")<\/code> and skip the <code>.decode()<\/code> call.<\/p>\n<p>It turns out the results weren&#8217;t as interesting as I&#8217;d hoped. There were a few interesting items (<em>Detroit<\/em>, for example), but most of the words aren&#8217;t surprising given that I was looking at tweets about political debate. Perhaps with a larger corpus there would be more stand-out words.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The examples in Getting Started with NLTK have you analyze text imported from the NLTK library. What if you want to analyze your own input text? 
I stumble through how to do that in this post, and I analyze some word frequencies in a series of tweets.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[86,1],"tags":[487,358],"class_list":["post-2369","post","type-post","status-publish","format-standard","hentry","category-python","category-uncategorized","tag-nltk","tag-python"],"_links":{"self":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts\/2369","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/comments?post=2369"}],"version-history":[{"count":8,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts\/2369\/revisions"}],"predecessor-version":[{"id":2383,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/posts\/2369\/revisions\/2383"}],"wp:attachment":[{"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/media?parent=2369"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/categories?post=2369"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/osric.com\/chris\/accidental-developer\/wp-json\/wp\/v2\/tags?post=2369"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}