Installing NLTK on a Mac with Python 3.6

Installing NLTK 3.0 on a Mac.

This page describes the process I’m working through to try to get NLTK 3 working on OS X Mavericks (10.9.5).

Current situation:

So far I’ve managed to get Python 3.6 and NLTK 3.0 installed on the Mac. However there is a problem related to security certificates so that nltk.download() will not download corpus data. A little investigation on Google shows that there are problems caused by the interaction of OpenSSL supplied by Apple and Python.
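
To see which OpenSSL library a particular Python is linked against, and where it expects to find trusted certificates, something like the following may help. It is only a diagnostic sketch using the standard ssl module; it doesn’t fix anything, and the output will differ from machine to machine.

```python
import ssl

# The OpenSSL build this interpreter was compiled against.
openssl_version = ssl.OPENSSL_VERSION
print(openssl_version)

# Where this Python expects to find trusted CA certificates.
paths = ssl.get_default_verify_paths()
print("cafile:", paths.cafile)
print("capath:", paths.capath)
```

Running it under both the system Python and the newly installed Python 3 should show whether the two are using different OpenSSL builds.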

Initially I thought that the main reason I would need nltk.download() is for testing, since we’ll have our own corpora to work on. However there are other files that may need to be downloaded so it may be a problem worth solving.

Introduction.

The Natural Language Toolkit, or NLTK, is a Python library created by Steven Bird and Edward Loper to help researchers analyse large quantities of language data. I’ve been trying to help a linguist install nltk on a Mac, which I thought should be straightforward. It was pretty easy to do on Windows 10, so I was hoping there wouldn’t be too many problems. NLTK is written in Python and there are currently two versions: one for Python 2.7 and one for Python 3.4 and later.

Python 3 isn’t just a later version of Python 2.

Python 3 is a backwards-incompatible revision of the language rather than a simple upgrade, so scripts written for Python 2 generally don’t run under Python 3 without changes. One of the benefits of Python 3 is that strings are Unicode by default. This Unicode handling is something that I expect we will need, as the languages have plenty of non-ASCII characters.
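
As a small illustration of the Unicode point, here is a sketch of how Python 3 treats a word containing a non-ASCII character. The word itself is just an example I picked, not something from our data.

```python
word = "na\u00efve"  # "naive" spelled with an i-diaeresis, one non-ASCII character

# In Python 3 a str is a sequence of Unicode characters,
# so len() counts characters rather than bytes.
char_count = len(word)

# Encoding to UTF-8 gives bytes; the accented letter takes two bytes.
encoded = word.encode("utf8")
byte_count = len(encoded)

print(char_count, byte_count)           # prints: 5 6
print(encoded.decode("utf8") == word)   # prints: True
```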

OS X comes with Python (and Perl, Ruby and PHP) as part of the OS. However, Python isn’t provided as a convenience to the user; it is a necessary part of the OS, so should it become corrupted the OS may need to be reinstalled. Another consequence of this is that the built-in Python will likely be older than the current version of Python. I had OS X version 10.9 (Mavericks) installed, which came with Python 2.7.5.

So here’s the first hurdle: the Python 3 version of nltk requires Python 3.4 or later. So it will be necessary to install a later version of Python and keep it separate from the Python required by the operating system.
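
A quick way to check whether a given interpreter meets that requirement is sketched below. The helper name python_is_new_enough is my own, and the (3, 4) minimum is simply the figure quoted above.

```python
import sys

REQUIRED = (3, 4)  # minimum version quoted above

def python_is_new_enough(version_info=sys.version_info, required=REQUIRED):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= required

print(sys.version)
print("OK for nltk" if python_is_new_enough() else "Python too old for nltk")
```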

Is a third party package manager required?

John Laudun on his blog suggests using MacPorts to install all the prerequisites for nltk, including python.
However MacPorts requires Apple’s Xcode to be installed. I found Xcode on the App Store, but after pressing the ‘Get’ button a message said that it wasn’t available for this machine.

The Xcode Command Line Tools, however, can be installed from the terminal with the command:
xcode-select --install

As John notes, there are error messages from MacPorts when only the command line tools are installed.

Warning: xcodebuild exists but failed to execute
Warning: Xcode does not appear to be installed; most ports will likely fail to build.

However that didn’t seem to be a problem; the ports I installed built successfully in spite of the warnings.

Once MacPorts was installed, the following command in a terminal window installed Python 3.6:
sudo port install python36

In a terminal window, invoking ‘python’ or ‘python2’ will open the latest version of Python 2.
To start Python 3.6 (or the latest version of Python 3) use the command ‘python3’.
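
With more than one Python on the machine it can be hard to tell which one a command actually starts. The sketch below, run under any of them, reports the interpreter in use and what each command name resolves to on the PATH. The helper name locate is my own.

```python
import shutil
import sys

def locate(names):
    """Map each command name to the path the shell would run, or None."""
    return {name: shutil.which(name) for name in names}

print("running under:", sys.executable)
for name, path in locate(["python", "python2", "python3"]).items():
    print(name, "->", path)
```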

If you would like MacPorts to point the ‘python2’ and ‘python3’ commands at these versions you can use the following:
sudo port select --set python2 python27
sudo port select --set python3 python36

Note that there are two minus signs before the word ‘set’; it’s not always clear on some screens, and I tried it many times with only one.

I do hope that NLTK can be installed without MacPorts or Homebrew or any other third party package manager. That would keep things a lot simpler, with fewer dependencies. So I tried installing Python 3.6 using the download from python.org/downloads, which installed with no problem and without MacPorts.

At some point I gave up with trying to proceed along this route and just installed Python 3 directly. So I’m hopeful that the excursion down the Homebrew/MacPorts with Xcode route was just a lesson in Keeping it Simple.

Back to the mainstream route.

With Python 3 installed, nltk was installed with the following command:
sudo pip3 install nltk

In a terminal window, opening python3 and importing nltk worked without any errors, showing that nltk is installed.
>>> import nltk
>>>
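
The same check can be made from a script without triggering an ImportError. This is a sketch using the standard importlib module; the helper names are my own, and the pip command is only built and printed, not run.

```python
import importlib.util
import sys

def is_installed(package_name):
    """Return True if this interpreter can find the named package."""
    return importlib.util.find_spec(package_name) is not None

def pip_command(package_name):
    """Build (but do not run) a pip command tied to this interpreter."""
    return [sys.executable, "-m", "pip", "install", package_name]

if is_installed("nltk"):
    print("nltk is available to", sys.executable)
else:
    print("nltk not found; try:", " ".join(pip_command("nltk")))
```

Running pip as a module of a specific interpreter is one way to be sure the package lands in the Python you intend, which matters when the system Python and a newer one sit side by side.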

The next thing that is required is to download some of the corpus data and test it. The command for that is:
>>> nltk.download()

This opens a window which should show all the corpora and other resources that are available with nltk.
In my case the window opened and then an error message popped up saying that the SSL security certificate could not be verified. None of the corpora were shown.
I tried to download just one or two using the python interactive interpreter but these attempts also gave a similar warning about SSL security certificates.

>>> nltk.download('brown')
>>> nltk.download('genesis')
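
One direction that might be worth trying is to give Python an explicit bundle of trusted certificates to verify against. The sketch below builds such an SSL context with the standard ssl module, using the third-party certifi package if it happens to be installed (that package is an assumption, not something set up earlier on this page). I have not confirmed that this cures the nltk.download() error; it only shows how a verifying context is put together.

```python
import ssl

def make_verifying_context():
    """Return an SSL context that verifies server certificates.

    Uses certifi's CA bundle if that third-party package is installed,
    otherwise falls back to the system default locations.
    """
    try:
        import certifi  # optional third-party package
        cafile = certifi.where()
    except ImportError:
        cafile = None
    return ssl.create_default_context(cafile=cafile)

context = make_verifying_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)
```

A context like this can be passed to urllib.request.urlopen(url, context=context) for an https address. Whether nltk.download() can be made to use it is a separate question that I haven’t answered here.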

Then I found some code in chapter 3 of the nltk book that allows me to download data directly from a webpage and use that as a corpus.

So I created the following script:

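import nltk
from nltk import word_tokenize
from urllib import request

# Plain-text copy of Crime and Punishment from Project Gutenberg
url = "http://www.gutenberg.org/files/2554/2554.txt"
response = request.urlopen(url)
# The response is bytes; decode it to get a Python string
raw = response.read().decode('utf8')
print(len(raw))    # number of characters
print(raw[:75])    # first 75 characters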

Running this printed the number of characters in the text followed by the first 75 characters:

1176893
The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky
So we’ve got a text at this point, but haven’t used nltk on it yet.
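
Since we’ll mostly be working on our own corpora, it may be handy to keep a local copy of a text like this and read it back from disk. The sketch below shows the round trip with a short stand-in string in place of the downloaded raw text; the file name and helper names are hypothetical.

```python
import os
import tempfile

def save_text(text, path):
    """Write text to path as UTF-8."""
    with open(path, "w", encoding="utf8") as f:
        f.write(text)

def load_text(path):
    """Read a UTF-8 text file back into a string."""
    with open(path, encoding="utf8") as f:
        return f.read()

# Stand-in for the downloaded raw string from the script above.
sample = "The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky"

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.gettempdir(), "crime_and_punishment_sample.txt")
save_text(sample, path)
reloaded = load_text(path)
print(reloaded == sample)
```

Naming the encoding explicitly on both the write and the read avoids depending on whatever the platform default happens to be.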

I added the next few lines:

tokens = word_tokenize(raw)
print(len(tokens))
print(tokens[:10])

However this gave an error saying that it couldn’t find a file required to do tokenization in English.
(The file is tokenizers/punkt/english.pickle).
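
To check whether that file is present anywhere obvious, a small search like the one below can be used. The list of directories is an assumption on my part (the NLTK_DATA environment variable, an nltk_data folder in the home directory, and two common system locations) and is probably not the complete set that nltk searches.

```python
import os

def candidate_data_dirs():
    """Directories where nltk data is commonly kept (an assumed, partial list)."""
    dirs = []
    env = os.environ.get("NLTK_DATA")
    if env:
        dirs.extend(env.split(os.pathsep))
    dirs.append(os.path.join(os.path.expanduser("~"), "nltk_data"))
    dirs.extend(["/usr/share/nltk_data", "/usr/local/share/nltk_data"])
    return dirs

def find_resource(relative_path, data_dirs=None):
    """Return the first existing path for the resource, or None."""
    if data_dirs is None:
        data_dirs = candidate_data_dirs()
    for base in data_dirs:
        full = os.path.join(base, relative_path)
        if os.path.exists(full):
            return full
    return None

found = find_resource(os.path.join("tokenizers", "punkt", "english.pickle"))
print(found if found else "english.pickle not found in the directories checked")
```

With nltk importable, nltk.data.path lists the directories it actually searches, which is a more reliable source than my guessed list.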

Google to the rescue: three quarters of the way down the nltk tokenization page I found this:
“The NLTK data package includes a pre-trained Punkt tokenizer for English.”

So perhaps importing that would work as in their example, using import nltk.data
Now the script looks like this:

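import nltk
import nltk.data
from nltk import word_tokenize
from urllib import request

# Plain-text copy of Crime and Punishment from Project Gutenberg
url = "http://www.gutenberg.org/files/2554/2554.txt"
response = request.urlopen(url)
# The response is bytes; decode it to get a Python string
raw = response.read().decode('utf8')
print(len(raw))
print(raw[:75])
# word_tokenize needs the punkt data file mentioned above
tokens = word_tokenize(raw)
print(len(tokens))
print(tokens[:10])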

However this gave the same error once it got to word_tokenize(raw). For comparison, the output the nltk book shows for those last two print statements is:

254354
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
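
Until the punkt data can be downloaded, a rough stand-in that needs no data files at all is a regular-expression tokenizer. The sketch below uses only the standard re module. It is not equivalent to word_tokenize (contractions and abbreviations, for instance, will be split differently), and the function name simple_tokenize is my own. As far as I know, nltk’s wordpunct_tokenize works on the same pattern and also needs no downloaded data, but I haven’t tested that here.

```python
import re

# Runs of word characters, or runs of punctuation; whitespace is dropped.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_tokenize(text):
    """Split text into word and punctuation tokens using only a regex."""
    return TOKEN_PATTERN.findall(text)

sample = "The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky"
tokens = simple_tokenize(sample)
print(len(tokens))    # prints: 12
print(tokens[:10])
```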

Here’s a glossary:
NLTK – The Natural Language Toolkit. (A Python package).
NLTK requires Python versions 2.7 or 3.4+
Python – The programming language. (Version 3.x is not backwards compatible with version 2.x, so it is not just an upgrade.)
pip – Python package manager for installing, updating and removing Python packages.

venv is available by default in Python 3.3 and later, and installs pip and setuptools into created virtual environments in Python 3.4 and later.
virtualenv needs to be installed separately, but supports Python 2.6+ and Python 3.3+, and pip, setuptools and wheel are always installed for each virtualenv.

virtualenv – a way to copy Python for each project so that differing versions of Python can be used for different scripts.

IDLE – An Integrated Development Environment (IDE) commonly used with Python. (Requires tkinter)
tkinter – A Graphical User Interface toolkit used by IDLE. tkinter requires the Tcl/Tk frameworks.
OS X – The current Mac operating system. (There are various versions).
Xcode – The Apple development environment.
Homebrew – A third party (non Apple) package manager for OS X.
MacPorts – A third party (non Apple) package manager for OS X more mature than Homebrew.
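
As an illustration of the venv entry in the glossary above, here is a sketch that creates a virtual environment from Python itself. The directory name nltk-env and its temporary location are hypothetical; in practice the environment would more likely live in a project folder.

```python
import os
import tempfile
import venv

# Hypothetical location for the environment.
env_dir = os.path.join(tempfile.mkdtemp(), "nltk-env")

# with_pip=False keeps this quick; pass with_pip=True to have pip installed too.
venv.create(env_dir, with_pip=False)

# pyvenv.cfg marks the directory as a virtual environment.
cfg_path = os.path.join(env_dir, "pyvenv.cfg")
print(os.path.exists(cfg_path))
```

The usual command-line equivalent is python3 -m venv nltk-env, followed by source nltk-env/bin/activate to start using it.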

References:
https://packaging.python.org/installing/#creating-virtual-environments/
https://docs.python.org/3/installing/
http://johnlaudun.org/20120131-nltk-on-mac-os-x/
http://johnlaudun.org/20131229-the-complete-python-for-text-analysis/
http://www.janosgyerik.com/working-with-different-versions-of-python-on-osx-using-macports/
http://stackoverflow.com/questions/27077357/can-i-install-xcode-6-on-mavericks
http://osxdaily.com/2014/02/12/install-command-line-tools-mac-os-x/

Dependencies:
With MacPorts
NLTK requires Python3 which may require MacPorts which requires at least Xcode command line tools or Xcode full version which requires an Apple Developer login to download.
