nltk – The Accidental Developer

[This was originally posted at the now-defunct impractical.bot on 23 Feb 2019]

I created a tool that will allow anyone to experiment with NLTK (Natural Language Toolkit) chatbots without writing any Python code. The repository for the backend code is available on GitHub: Docker NLTK chatbot.

I plan to expand on this idea, but it is usable now. In order to create your own bot:

Create a GitHub account
Create a “gist” or fork my demo gist: Greetings Bot Source
Customize the name, match, and replies elements
Note your username and the unique ID of your gist (a hash value, a 32-character string of letters and numbers)
Visit http://osric.com/chat/user/hash, replacing user with your GitHub username and hash with the unique ID of your gist. For an example, see Greetings Bot.

You can now interact with your custom bot, or share the link with your friends!

One more thing: if you update your gist, you’ll need to let the site know to update the code. Just click the “Reload Source” link on the chat page.

I just recently started playing with the Python NLTK (Natural Language ToolKit) to analyze text. The book Natural Language Processing with Python is available online and is very helpful if you’re just getting started.

At the beginning of the book the examples cover importing and analyzing text (primarily books) that you import from nltk (Getting Started with NLTK). It includes texts like Moby-Dick and Sense and Sensibility.

But you will probably want to analyze a source of your own. For example, I had text from a series of tweets debating political issues. The third chapter (Accessing Text from the Web and from Disk) has the answers:

First you need to turn raw text into tokens:

tokens = word_tokenize(raw)

Next turn your tokens into NLTK text:

text = nltk.Text(tokens)

Now you can treat it like the book examples in chapter 1.

I was analyzing a number number of tweets. One of the things I wanted to do was find common words in the tweets, to see if there were particular keywords that were common.

I was using the Python interpreter for my tests, and I did run into a couple errors with word_tokenize and later FreqDist, such as:

>>> fdist1 = FreqDist(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'FreqDist' is not defined

You can address this by importing the specific libraries:

>>> from nltk import FreqDist

Here are the commands, in order, that I ran to produce my list of common words — in this case, I was looking for words that appeared at least 3 times and that were at least 5 characters long:

>>> import nltk
>>> from nltk import word_tokenize
>>> from nltk import FreqDist

>>> with open("corpus-twitter", "r") as myfile:
...     raw = myfile.read().decode("utf8")

>>> tokens = word_tokenize(raw)
>>> text = nltk.Text(tokens)

>>> fdist = FreqDist(text)
>>> sorted(w for w in set(text) if len(w) >= 5 and fdist[w] >= 3)

[u'Americans', u'Detroit', u'Please', u'TaxReform', u'Thanks', u'There', u'Trump', u'about', u'against', u'always', u'anyone', u'argument', u'because', u'being', u'believe', u'context', u'could', u'debate', u'defend', u'diluted', u'dollars', u'enough', u'every', u'going', u'happened', u'heard', u'human', u'ideas', u'immigration', u'indefensible', u'logic', u'never', u'opinion', u'people', u'point', u'pragmatic', u'problem', u'problems', u'proposed', u'public', u'question', u'really', u'restricting', u'right', u'saying', u'school', u'scope', u'serious', u'should', u'solution', u'still', u'talking', u'their', u'there', u'think', u'thinking', u'thread', u'times', u'truth', u'trying', u'tweet', u'understand', u'until', u'welfare', u'where', u'world', u'would', u'wrong', u'years', u'yesterday']

It turns out the results weren’t as interesting as I’d hoped. A few interesting items–Detroit for example–but most of the words aren’t surprising given I was looking at tweets around political debate. Perhaps with a larger corpus there would be more stand-out words.

Tag: nltk

DIY Gist Chatbots

Analyzing text to find common terms using Python and NLTK