Presentation is loading. Please wait.

Presentation is loading. Please wait.

©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 2 Dr. Paula Matuszek (610) 647-9789.

Similar presentations


Presentation on theme: "©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 2 Dr. Paula Matuszek (610) 647-9789."— Presentation transcript:

1 ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 2 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

2 ©2012 Paula Matuszek Goals l Goals for this lab are: –Get started with Python –Get started with NLTK l If you have a laptop with you, get them installed on it.

3 ©2012 Paula Matuszek NLTK l What is NLTK? –Natural Language ToolKit –Set of modules for Python l A large number of methods for importing and manipulating text l A large set of relevant data l Starting point is http://www.nltk.org/.http://www.nltk.org/

4 ©2012 Paula Matuszek Why NLTK? l Easy to get started l Powerful set of tools useful for preparing documents for mining l Several text mining tools built in l Very well documented l However: this is not a NLP course, and we will be ignoring much of NLTK which isn’t important for our topics

5 ©2012 Paula Matuszek Python l Python is an object-oriented, interpreted language. l Unlike Java, very little required structure. l IDLE is a simple integrated development for Python and the easiest way to use it l Can also be run at command line We will cover just some basics in class, more or less the minimum to use NLTK effectively

6 ©2012 Paula Matuszek Why Python? (besides NLTK) l Built-in types for strings, lists, dictionaries l Strong numeric processing capabilities l Clean syntax, powerful extensions

7 ©2012 Paula Matuszek Getting Started l http://python.org/ http://python.org/ l http://www.nltk.org/ http://www.nltk.org/ l Starting URLs for the systems we will be using: downloads, documentation, etc.

8 ©2012 Paula Matuszek Version Nightmares... l Biggest problem you are likely to run into is version incompatibilities. l Latest version of python is 3.2, but python 3.x does not preserve backward compatibility with 2.x. l Latest version of NLTK is 2.0x. –It is not compatible with python 3.x. –Docs say works with python 2.5.x, 2.6.x, 2.7.x. Didn’t work for me with 2.5.5. –I recommend using python 2.7. l Links in NLTK documentation for numpy and matplotlib may not be appropriate. Start at numpy.scipy.org or matplotlib.sourceforge.net and look for what matches your circumstances

9 ©2012 Paula Matuszek NLTK Data l http://www.nltk.org/data http://www.nltk.org/data l Data includes corpora, dictionaries, gazetteers, trained models l http://nltk.googlecode.com/svn/trunk/nltk _data/index.xml gives a list, with downloadable versions http://nltk.googlecode.com/svn/trunk/nltk _data/index.xml

10 ©2012 Paula Matuszek Go! l Start at http://www.nltk.org/downloadhttp://www.nltk.org/download l You will need to be sure you have –Python –PyYAML –NLTK –numpy –Matplotlib l Test that you can start python and import nltk. l If you didn’t bring a laptop either work with someone who has the same system or go on to the next slide with the PCs in G87

11 ©2012 Paula Matuszek Step 2 l Download the NLTK data from within Python. Go ahead and download all. – >>> nltk.download() l Test with –>>> from nltk.corpus import brown –>>>brown.words()

12 ©2012 Paula Matuszek Step 3 l Test some of the demos at –http://www.nltk.org/getting-startedhttp://www.nltk.org/getting-started l Look through the preface of the NLTK book by Bird et al: http://www.nltk.org/book http://www.nltk.org/book l Work through chapter 1, through section 1.4. This will include both some python and some NLTK methods.


Download ppt "©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 2 Dr. Paula Matuszek (610) 647-9789."

Similar presentations


Ads by Google