1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 1, 2004.

Presentation transcript:

1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 1, 2004

2 Today How shall we transform a huge text collection? Levels of Language Intro to NLTK and Python

3 The Enron Archive Background Originally made public, and posted to the web by the Federal Energy Regulatory Commission during its investigation. –~500,000 messages –Salon article: Later purchased by Leslie Kaelbling at MIT People at SRI, notably Melinda Gervasio, cleaned it up –No attachments –Some messages have been deleted "as part of a redaction effort due to requests from affected employees". –Invalid addresses were converted to Posted online for research on by William Cohen at – Paper describing the dataset: –The Enron Corpus: A New Dataset for Classification Research, Klimt and Yang, ECML

4 The Enron Archive A valuable resource No other large open corpus for research A sensitive resource We need to be respectful and careful about how we treat this information We can add value Idea: this class produces something more valuable and interesting than what we started with. Researchers and practitioners will build on our results

5 The Enron Archive So … what’s in there? 500,000 messages. Let’s search (on a subset of the collection): Now … what more would we like to have?

6 Slide adapted from Robert Berwick's Levels of Language Sound Structure (Phonetics and Phonology) The sounds of speech and their production The systematic way that sounds are differently realized in different environments. Word Structure (Morphology) From morphos = shape (not transform, as in morph) Analyzes how words are formed from minimal units of meaning; also derivational rules –dog + s = dogs; eat, eats, ate Phrase Structure (Syntax) From the Greek syntaxis, arrange together Describes grammatical arrangements of words into hierarchical structure

7 Slide adapted from Robert Berwick's Levels of Language Thematic Structure Getting closer to meaning Who did what to whom –Subject, object, predicate Semantic Structure How the lower levels combine to convey meaning Pragmatics and Discourse Structure How language is used across sentences.

8 Slide adapted from Robert Berwick's Parsing at Every Level Transforming from a surface representation to an underlying representation It’s not straightforward to do any of these mappings! Ambiguity at every level –Word: is “saw” a verb or noun? –Phrase: “I saw the guy on the hill with the telescope.” Who is on the hill? –Semantic: which hill?

9 Python and NLTK The following slides from Diane Litman’s lecture

10 Slide by Diane Litman Python and Natural Language Processing Python is a great language for NLP: Simple Easy to debug: –Exceptions –Interpreted language Easy to structure –Modules –Object oriented programming Powerful string manipulation

11 Slide by Diane Litman Modules and Packages Python modules “package program code and data for reuse.” (Lutz) Similar to library in C, package in Java. Python packages are hierarchical modules (i.e., modules that contain other modules). Three commands for accessing modules: 1.import 2.from…import 3.reload

12 Slide by Diane Litman Modules and Packages: import The import command loads a module: # Load the regular expression module >>> import re To access the contents of a module, use dotted names: # Use the search function from the re module >>> re.search(r'\w+', str) To list the contents of a module, use dir: >>> dir(re) ['DOTALL', 'I', 'IGNORECASE', …]
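The import slide can be tried out with a short script. This is a minimal sketch in current Python 3; the variable names `text`, `first_word`, and `public_names` are assumptions made for illustration and do not come from the slides.

```python
import re

# Sample input; any string would do here.
text = "Applied Natural Language Processing"

# Dotted name: the function is reached through the module object.
match = re.search(r"\w+", text)
first_word = match.group(0)

# dir() lists the names a module defines; keep only the public ones.
public_names = [name for name in dir(re) if not name.startswith("_")]

print(first_word)
print(public_names[:5])
```

Using a separate variable such as `text` avoids rebinding the built-in name `str`.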

13 Slide by Diane Litman Modules and Packages: from…import The from…import command loads individual functions and objects from a module: # Load the search function from the re module >>> from re import search Once an individual function or object is loaded with from…import, it can be used directly: # Use the search function from the re module >>> search(r'\w+', str)

14 Slide by Diane Litman Import vs. from…import Import Keeps module functions separate from user functions. Requires the use of dotted names. Works with reload. from…import Puts module functions and user functions together. More convenient names. Does not work with reload.
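The trade-off between the two import styles shows up when a user function happens to share a name with an imported one. The sketch below is only an illustration; the user-defined `search` function and the variable names are hypothetical.

```python
from re import search

text = "my dog likes his dog"

# The imported function is available under its bare name.
direct_match = search(r"dog", text).group(0)

# A user function with the same name silently replaces the imported one,
# because from...import puts both in the same namespace.
def search(pattern, string):
    return "user-defined search"

shadowed_result = search(r"dog", text)

# With plain import, the dotted name keeps the module function separate.
import re

dotted_match = re.search(r"dog", text).group(0)

print(direct_match, shadowed_result, dotted_match)
```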

15 Slide by Diane Litman Modules and Packages: reload If you edit a module, you must use the reload command before the changes become visible in Python: >>> import mymodule ... >>> reload(mymodule) The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from…import.
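The reload behaviour can be checked with a throwaway module. Note that the slide uses the Python 2 built-in `reload`; in Python 3 the same function is `importlib.reload`, which this sketch uses. The module name `mymodule_demo`, its `greet` function, and the temporary directory are assumptions made only for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Demo-only setting: skip bytecode caching so the edited source is always re-read.
sys.dont_write_bytecode = True

# Create a small module on disk (hypothetical name: mymodule_demo).
tmp_dir = tempfile.mkdtemp()
module_path = Path(tmp_dir) / "mymodule_demo.py"
module_path.write_text("def greet():\n    return 'old'\n")
sys.path.insert(0, tmp_dir)

import mymodule_demo                # loaded with import
from mymodule_demo import greet     # loaded with from...import

# "Edit" the module, then reload it.
module_path.write_text("def greet():\n    return 'new version'\n")
importlib.invalidate_caches()
importlib.reload(mymodule_demo)

dotted_result = mymodule_demo.greet()  # sees the edited code
direct_result = greet()                # still bound to the old function object

print(dotted_result, direct_result)
sys.path.remove(tmp_dir)
```

The dotted name picks up the edit, while the name bound by from…import keeps pointing at the function object created before the reload.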

16 Slide by Diane Litman Introduction to NLTK The Natural Language Toolkit (NLTK) provides: Basic classes for representing data relevant to natural language processing. Standard interfaces for performing tasks, such as tokenization, tagging, and parsing. Standard implementations of each task, which can be combined to solve complex problems.

17 Slide by Diane Litman NLTK: Example Modules nltk.token : processing individual elements of text, such as words or sentences. nltk.probability : modeling frequency distributions and probabilistic systems. nltk.tagger : tagging tokens with supplemental information, such as parts of speech or wordnet sense tags. nltk.parser : high-level interface for parsing texts. nltk.chartparser : a chart-based implementation of the parser interface. nltk.chunkparser : a regular-expression based surface parser.

18 Slide by Diane Litman NLTK: Top-Level Organization NLTK is organized as a flat hierarchy of packages and modules. Each module provides the tools necessary to address a specific task. Modules contain two types of classes: Data-oriented classes are used to represent information relevant to natural language processing. Task-oriented classes encapsulate the resources and methods needed to perform a specific task.

19 Slide by Diane Litman To the First Tutorials Tokens and Tokenization Frequency Distributions

20 Slide by Diane Litman The Token Module It is often useful to think of a text in terms of smaller elements, such as words or sentences. The nltk.token module defines classes for representing and processing these smaller elements. What might be other useful smaller elements?

21 Slide by Diane Litman Tokens and Types The term word can be used in two different ways: 1. To refer to an individual occurrence of a word 2. To refer to an abstract vocabulary item For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items. To avoid confusion use more precise terminology: 1. Word token: an occurrence of a word 2. Word type: a vocabulary item
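The token/type distinction can be counted directly. This is a minimal sketch using only the Python standard library rather than NLTK; splitting on whitespace is an assumption that happens to be adequate for this sentence.

```python
from collections import Counter

sentence = "my dog likes his dog"

# Word tokens: every occurrence counts.
word_tokens = sentence.split()

# Word types: each distinct vocabulary item counts once.
word_types = set(word_tokens)

# How many tokens there are of each type.
type_counts = Counter(word_tokens)

print(len(word_tokens))    # 5 tokens
print(len(word_types))     # 4 types
print(type_counts["dog"])  # 2 tokens of the type "dog"
```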

22 Slide by Diane Litman Tokens and Types (continued) In NLTK, tokens are constructed from their types using the Token constructor: >>> from nltk.token import * >>> my_word = 'dog' >>> my_word_token = Token(TEXT=my_word)

23 Slide by Diane Litman Text Locations A text location [s:e] specifies a region of a text: s is the start index e is the end index The location [s:e] specifies the text beginning at s, and including everything up to (but not including) the text at e. This definition is consistent with Python slices. Think of indices as appearing between elements: I saw a man Shorthand notation when location width = 1.
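Because locations follow the same convention as Python slices, plain slicing illustrates the idea. The sketch below uses the slide's example sentence; the variable names are assumptions.

```python
text = "I saw a man"
words = text.split()

# Word-level indices sit between elements:
#   0  I  1  saw  2  a  3  man  4
word_span = words[1:3]     # starts at index 1, stops before index 3
single_word = words[2:3]   # a location of width 1

# The same convention applies with characters as the unit.
char_span = text[2:5]

print(word_span)    # ['saw', 'a']
print(single_word)  # ['a']
print(char_span)    # saw
```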

24 Slide by Diane Litman Text Locations (continued) Indices can be based on different units: character word sentence Locations can be tagged with sources (files, other text locations – e.g., the first word of the first sentence in the file) Location member functions: start end unit source
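One way to picture a location that carries a unit and a source is a small record type. This is a hypothetical sketch, not NLTK's actual Location class: the class name `SpanLocation`, its `width` and `select` helpers, and the file name `example.txt` are all assumptions; only the four fields start, end, unit, and source come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanLocation:
    """Hypothetical stand-in for a text location."""
    start: int
    end: int
    unit: str = "word"            # e.g. "character", "word", "sentence"
    source: Optional[str] = None  # e.g. a file name

    def width(self) -> int:
        return self.end - self.start

    def select(self, elements):
        # Works on any sliceable sequence expressed in this location's unit.
        return elements[self.start:self.end]


text = "I saw a man"
word_location = SpanLocation(1, 3, unit="word", source="example.txt")
char_location = SpanLocation(2, 5, unit="character", source="example.txt")

selected_words = word_location.select(text.split())
selected_chars = char_location.select(text)

print(selected_words, selected_chars, word_location.width())
```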

25 Slide by Diane Litman Tokenization The simplest way to represent a text is with a single string. Difficult to process text in this format. Often, it is more convenient to work with a list of tokens. The task of converting a text from a single string to a list of tokens is known as tokenization.

26 Slide by Diane Litman Tokenization (continued) Tokenization is harder than it seems I’ll see you in New York. The aluminum-export ban. The simplest approach is to use “graphic words” (i.e., separate words using whitespace) Another approach is to use regular expressions to specify which substrings are valid words. NLTK provides a generic tokenization interface: TokenizerI
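The two approaches can be compared on the slide's example sentences. This is a minimal sketch with the standard `re` module instead of NLTK; the function names and the particular regular expression are assumptions, chosen so that internal apostrophes and hyphens stay inside a word while other punctuation becomes its own token.

```python
import re

sentences = ["I'll see you in New York.", "The aluminum-export ban."]


def whitespace_tokenize(text):
    """Graphic words: split on whitespace only."""
    return text.split()


# A word is a run of word characters, optionally joined by - or ';
# any other non-space character is a token of its own.
WORD_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def regexp_tokenize(text):
    return WORD_PATTERN.findall(text)


for sentence in sentences:
    print(whitespace_tokenize(sentence))
    print(regexp_tokenize(sentence))
```

Whitespace splitting leaves the final period attached to "York." and "ban.", while the pattern separates it. Neither approach treats "New York" as a single unit, which is part of why tokenization is harder than it seems.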

27 Slide by Diane Litman TokenizerI Defines a single method, tokenize, which takes a string and returns a list of tokens. The tokenize method is independent of the level of tokenization and the implementation algorithm.
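The idea of a single-method interface can be sketched with an abstract base class. This is not NLTK's TokenizerI source; the class names `TokenizerInterface`, `SimpleWhitespaceTokenizer`, and `SimpleRegexpTokenizer`, the helper `count_tokens`, and the patterns are assumptions for illustration.

```python
import re
from abc import ABC, abstractmethod


class TokenizerInterface(ABC):
    """Hypothetical interface: one method, string in, list of tokens out."""

    @abstractmethod
    def tokenize(self, text):
        ...


class SimpleWhitespaceTokenizer(TokenizerInterface):
    def tokenize(self, text):
        return text.split()


class SimpleRegexpTokenizer(TokenizerInterface):
    def __init__(self, pattern):
        self._pattern = re.compile(pattern)

    def tokenize(self, text):
        return self._pattern.findall(text)


def count_tokens(tokenizer, text):
    # Calling code depends only on the interface, not on the algorithm or level.
    return len(tokenizer.tokenize(text))


text = "I saw a man. He saw me."
word_tokenizer = SimpleRegexpTokenizer(r"\w+")
sentence_tokenizer = SimpleRegexpTokenizer(r"[^.!?\s][^.!?]*[.!?]")

print(count_tokens(SimpleWhitespaceTokenizer(), text))
print(word_tokenizer.tokenize(text))
print(sentence_tokenizer.tokenize(text))
```

The same `count_tokens` call works for word-level and sentence-level tokenizers, which is the sense in which the interface is independent of the level of tokenization.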

28 For Next Week Monday: holiday, no class Kevin is still installing the software I will send with details when ready Probably by the end of today Sign up for the list! Mail to: Put in msg body: subscribe anlp For Wed Sept 8 Do exercises 1-3 in Tutorial 2 (Tokenizing)