I recently started playing with the Python NLTK (Natural Language Toolkit) to analyze text. The book Natural Language Processing with Python is available online and is very helpful if you’re just getting started.
At the beginning of the book, the examples cover loading and analyzing the texts (primarily books) that come bundled with nltk (Getting Started with NLTK), including Moby-Dick and Sense and Sensibility.
But you will probably want to analyze a source of your own. For example, I had text from a series of tweets debating political issues. The third chapter (Accessing Text from the Web and from Disk) has the answers:
First you need to turn raw text into tokens:
tokens = word_tokenize(raw)
Next turn your tokens into NLTK text:
text = nltk.Text(tokens)
Now you can treat it like the book examples in chapter 1.
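For instance, the same methods used in chapter 1 work on your own corpus once you have an nltk.Text object. A minimal sketch ("people" is just an example word here; substitute any word that actually appears in your text):
>>> text.concordance("people")    # print each occurrence with its surrounding context
>>> text.similar("people")        # words that appear in similar contexts
>>> text.collocations()           # frequently co-occurring word pairs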
I was analyzing a number of tweets. One of the things I wanted to do was find common words in the tweets, to see whether any particular keywords stood out.
I was using the Python interpreter for my tests, and I ran into a couple of errors with word_tokenize and later FreqDist, such as:
>>> fdist1 = FreqDist(text)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'FreqDist' is not defined
You can address this by importing the specific functions and classes you need:
>>> from nltk import FreqDist
Here are the commands, in order, that I ran to produce my list of common words — in this case, I was looking for words that appeared at least 3 times and that were at least 5 characters long:
>>> import nltk
>>> from nltk import word_tokenize
>>> from nltk import FreqDist
>>> with open("corpus-twitter", "r") as myfile:
...     raw = myfile.read().decode("utf8")
>>> tokens = word_tokenize(raw)
>>> text = nltk.Text(tokens)
>>> fdist = FreqDist(text)
>>> sorted(w for w in set(text) if len(w) >= 5 and fdist[w] >= 3)
[u'Americans', u'Detroit', u'Please', u'TaxReform', u'Thanks', u'There', u'Trump', u'about', u'against', u'always', u'anyone', u'argument', u'because', u'being', u'believe', u'context', u'could', u'debate', u'defend', u'diluted', u'dollars', u'enough', u'every', u'going', u'happened', u'heard', u'human', u'ideas', u'immigration', u'indefensible', u'logic', u'never', u'opinion', u'people', u'point', u'pragmatic', u'problem', u'problems', u'proposed', u'public', u'question', u'really', u'restricting', u'right', u'saying', u'school', u'scope', u'serious', u'should', u'solution', u'still', u'talking', u'their', u'there', u'think', u'thinking', u'thread', u'times', u'truth', u'trying', u'tweet', u'understand', u'until', u'welfare', u'where', u'world', u'would', u'wrong', u'years', u'yesterday']
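Note that the session above is Python 2 (hence the .decode("utf8") call and the u'...' prefixes in the output). On Python 3 you can skip the decode and pass an encoding to open() instead; a rough equivalent of the file-reading step would be:
>>> with open("corpus-twitter", "r", encoding="utf8") as myfile:
...     raw = myfile.read()    # already a unicode str in Python 3, no decode needed
The remaining commands are unchanged.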
It turns out the results weren’t as interesting as I’d hoped. There were a few interesting items (Detroit, for example), but most of the words aren’t surprising given I was looking at tweets around political debate. Perhaps with a larger corpus there would be more stand-out words.