1 Introduction LING 570 Fei Xia Week 1: 9/26/07. 2 Outline Course overview Tokenization Homework #1 Quiz #1.

Slides:



Advertisements
Similar presentations
CS1101: Programming Methodology
Advertisements

Introduction to Computer Programming I CSE 113
Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011.
Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.
General course information Session 1: 7/08/
Semi-supervised learning and self-training LING 572 Fei Xia 02/14/06.
IT 240 Intro to Desktop Databases Introduction. About this course Design a database: Entity Relation (ER) modeling and normalization techniques Create.
CS/CMPE 535 – Machine Learning Outline. CS Machine Learning (Wi ) - Asim LUMS2 Description A course on the fundamentals of machine.
Introduction to CL Session 1: 7/08/2011. What is computational linguistics? Processing natural language text by computers  for practical applications.
Finite state automaton (FSA)
Introduction LING 572 Fei Xia Week 1: 1/3/06. Outline Course overview Problems and methods Mathematical foundation –Probability theory –Information theory.
Introduction LING 572 Fei Xia Week 1: 1/4/06. Outline Course overview Mathematical foundation: (Prereq) –Probability theory –Information theory Basic.
تمرين شماره 1 درس NLP سيلابس درس NLP در دانشگاه هاي ديگر ___________________________ راحله مکي استاد درس: دکتر عبدالله زاده پاييز 85.
COMP 110 Introduction to Programming Mr. Joshua Stough August 22, 2007 Monday/Wednesday/Friday 3:00-4:15 Gardner Hall 307.
Course Summary LING 572 Fei Xia 03/06/07. Outline Problem description General approach ML algorithms Important concepts Assignments What’s next?
1 Introduction LING 572 Fei Xia, Dan Jinguji Week 1: 1/08/08.
1 Introduction LING 575 Week 1: 1/08/08. Plan for today General information Course plan HMM and n-gram tagger (recap) EM and forward-backward algorithm.
COMP 14 – 02: Introduction to Programming Andrew Leaver-Fay August 31, 2005 Monday/Wednesday 3-4:15 pm Peabody 217 Friday 3-3:50pm Peabody 217.
The classification problem (Recap from LING570) LING 572 Fei Xia, Dan Jinguji Week 1: 1/10/08 1.
Data Structures & Agorithms Lecture-1: Introduction.
COP4020/CGS5426 Programming languages Syllabus. Instructor Xin Yuan Office: 168 LOV Office hours: T, H 10:00am – 11:30am Class website:
Computer Science 102 Data Structures and Algorithms V Fall 2009 Lecture 1: administrative details Professor: Evan Korth New York University 1.
Computer Science 2211b Software Tools and Systems Programming.
© 2004 Goodrich, Tamassia CS2210 Data Structures and Algorithms Lecture 1: Course Overview Instructor: Olga Veksler.
CSCI 347 – Data Mining Lecture 01 – Course Overview.
Introduction to Natural Language Processing Heshaam Faili University of Tehran.
CS223 Algorithms D-Term 2013 Instructor: Mohamed Eltabakh WPI, CS Introduction Slide 1.
Math 125 Statistics. About me  Nedjla Ougouag, PhD  Office: Room 702H  Ph: (312)   Homepage:
Course Introduction Bryce Boe 2012/08/06 CS32, Summer 2012 B.
FS100 – Unit 1 Introduction to FS C. Seminar Overview Course Syllabus Important Dates Course Announcements Discussion Boards Assignments and Grading.
COMP Introduction to Programming Yi Hong May 13, 2015.
CSET 3300: Database-Driven Web Applications Spring 2010 William Acosta URL:
CST 229 Introduction to Grammars Dr. Sherry Yang Room 213 (503)
CS 4705 Natural Language Processing Fall 2010 What is Natural Language Processing? Designing software to recognize, analyze and generate text and speech.
CST 320 Compiler Methods Dr. Sherry Yang PV 171 (541)
CNS 4450 Syllabus. Context Language is a tool of thought. We rarely think without words. In solving problems by computer, we eventually get to the point.
CSCI 51 Introduction to Computer Science Dr. Joshua Stough January 20, 2009.
Introduction to Databases Computer Science 557 September 2007 Instructor: Joe Bockhorst University of Wisconsin - Milwaukee.
Computer Science 102 Data Structures and Algorithms CSCI-UA.0102 Fall 2012 Lecture 1: administrative details Professor: Evan Korth New York University.
CS 6961: Structured Prediction Fall 2014 Course Information.
CS 445/545 Machine Learning Winter, 2012 Course overview: –Instructor Melanie Mitchell –Textbook Machine Learning: An Algorithmic Approach by Stephen Marsland.
Programming In Perl CSCI-2230 Thursday, 2pm-3:50pm Paul Lalli - Instructor.
CS-2851 Dr. Mark L. Hornick 1 CS-2852 Data Structures Dr. Mark L. Hornick Office: L341 Phone: web: people.msoe.edu/hornick/
Principles of Computer Science I Honors Section Note Set 1 CSE 1341 – H 1.
1 CSI 5180: Topics in AI: Natural Language Processing, A Statistical Approach Instructor: Nathalie Japkowicz Objectives of.
Introduction to ECE 2401 Data Structure Fall 2005 Chapter 0 Chen, Chang-Sheng
June 19, Liang-Jun Zhang MTWRF 9:45-11:15 am Sitterson Hall 011 Comp 110 Introduction to Programming.
Overview of Bioinformatics 1 Module Denis Manley..
1 Probability theory LING 570 Fei Xia Week 2: 10/01/07.
1 Introduction LING 570 Fei Xia Week 1: 9/30/09. 2 Outline Course overview Tokenization Homework #1 Questionnaire.
ICS202 Data Structures King Fahd University of Petroleum & Minerals College of Computer Science & Engineering Information & Computer Science Department.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
Winter 2016CISC101 - Prof. McLeod1 CISC101 Elements of Computing Science I Course Web Site: The lecture outlines.
Dr. Sajib Datta CSE Spring 2016 INTERMEDIATE PROGRAMMING.
Course Info Instructor U.T. Nguyen Office: CSEB Office hours: Tuesday, 14:30-15:30 Thursday, 12:00-12:45 By.
Data Structures and Algorithms in Java AlaaEddin 2012.
Computer Networks CNT5106C
INTRODUCTION to Operations Management MT435 – 02 Week 1 Instructor – Dr. Stuart Childers 1-1.
CSc 120 Introduction to Computer Programing II
CS6501 Advanced Topics in Information Retrieval Course Policy
CENG 213 Data Structures Dr. Cevat Şener
CSC 135 section 60 or CSC Fall 2017.
Natural Language Processing (NLP)
Computer Science 102 Data Structures CSCI-UA
CSE1320 INTERMEDIATE PROGRAMMING
CSE1320 INTERMEDIATE PROGRAMMING
CSE1320 INTERMEDIATE PROGRAMMING
CSE1320 INTERMEDIATE PROGRAMMING
Natural Language Processing (NLP)
Natural Language Processing (NLP)
Presentation transcript:

1 Introduction LING 570 Fei Xia Week 1: 9/26/07

2 Outline Course overview Tokenization Homework #1 Quiz #1

3 Course overview

4 General info Course url: –Syllabus (incl. slides, assignments, and papers): updated every week. –GoPost: Message board, similar to EPost –Collect it: similar to ESubmit Slides: –The slides will be online after class. –“Additional slides” are not required and not covered in class.

5 Mailing list CLMA students: Non-CLMA students: Please check your s at least once per day.

6 Prerequisites CS 326 (Data Structures) or equivalent: –Ex: hash table, array, tree, … Stat 391 (Prob. and Stats for CS) or equivalent: Basic concepts in probability and statistics –Ex: random variables, chain rule, Bayes’ rule Programming in Perl, C, C++, Java, or Python Basic unix/linux commands (e.g., ls, cd, ln, sort, head): tutorials on unix

7 Reading Textbook: Manning & Schutze –Get it from UW bookstore or amazon.com, etc. Reference book: Jurafsky and Martin (2008?): online Additional papers are online.

8 Grades Assignments: 70-80% Quizzes: 10-20% Class participation: 5-10% No final exams

9 Office hour Fei: – address: Subject line should include “ling570” The 48-hour rule –Office hour: Time: Fr: 10:30-11:30am Location: Padelford A-210G

10 TA hour Bill McNeill – –Time: M: 2:30-3:30pm –Location: Art 337

11 Homework submission Use “Collect it”: submit the tar file. –tar –cvf hw1.tar hw1_dir Due date: every Wed at 3pm. the submission area is closed 4 days after the due date. There is 1% penalty for every hour after the due date. Programming languages: C, C++, Java, Perl, Python, or Ruby. Please follow the instructions in the assignments about the naming convention.

12 Homework Submission (cont) Each submission includes –a note file: hw1.(txt|doc|pdf) for hw1. If your code does not work, explain in the note file what you have implemented so far. –a set of shell scripts: e.g., eng_token.sh –source code: e.g., eng_token.C –binary code (for C/C++/Java): eng_token.out –data files if any. Time spent on an assignment: <= 15 hours/week  I would appreciate it if you could tell me the time you spent on the homework.

13 Course policies No “incomplete” unless you can prove your case. Late assignments: 1% penalty for every hour after the deadline. No submission will be accepted 4 days after the due day.

14 Course description

15 CompLing courses offered in the ling dept Core courses on CompLing: Ling –Ling570: Shallow processing –Ling571: Deep processing –Ling572: Statistical methods for NLP: focusing on supervised learning –Ling573: System/applications Other courses on CompLing: –Grammar engineering: ling566 and ling567 –Seminar (ling575): typology, unsupervised learning, etc.

16 Topics covered in Ling570 Introduction, corpora, evaluation, tokenization: 2 Morphological analysis and FSA: 3 LM, ngram, and smoothing: 4 Part-of-speech (POS) tagging and HMM: 4

17 Topics covered in ling570 (cont) Introduction to classification: 1 Shallow parsing: 1 Named-entity (NE) tagging: 1 Word sense disambiguation (WSD): 1 Information retrieval (IR) and Information extraction (IE): 2 Summary: 1

18 Ling Ling571: Parsing, semantics, discourse, dialogue, natural language generation, … Ling572: Decision tree, Naïve Bayes, Boosting, SVM, MaxEnt, … Ling573: Applications (e.g., Q&A, Dialogue systems, …)

19 Relation between ling /571 are organized by tasks 572 is organized by learning methods and it focuses on statistical methods. 573 focuses on building an end-to-end system.

20 Questions?

21 Outline Course overview Tokenization Homework #1 Quiz #1

22 Tokenization

23 The Task Given the input text, break it into tokens. A token is a word, a number, a punctuation mark, etc. An example: John said: “Please call me tomorrow.”

24 A naïve algorithm Separating punctuation marks from word tokens. Problems: –Period: 1.23, Mr., Ph.D., U.S.A., etc. –Comma: 123,456 –Single apostrophes: dog’s, I’ll –Parenthesis: (202) , (a), 1), –Hyphenation: Line-breaking hyphens , so-called, pro-life, text-based data-base, database, data base a take-it-or-leave-it offer

25 Whitespace is not always reliable Examples: –Names: Hong Kong –Numbers: –Collocation: because of, pick up –Collocation mixed with hyphens: The New York-based company The New York-New Haven railroad

26 Word segmentation No whitespace between words: e.g., Chinese, Japanese, Thai Noun compounds in German: –Ex: Lebensversicherungsgesellschaftsangestellter “Life insurance company employee”

27 What counts as a word? In theory: –Phonological word, syntactic words, lexeme, etc. In practice: –$22.50 –Hyphenated words –Named entity –…–…  Design a set of annotation guidelines, and write a tokenizer based on that.

28 Homework 1

29 Naming convention A tool called “eng_tokenizer” –Source code: eng_tokenizer.(pl | py | C | java|cpp|h|…) –Shell script: eng_tokenizer.sh, which could look like #!/bin/sh eng_tokenizer.pl Or eng_tokenizer.pl abbr_file Shell, perl, python, and binary code must be executable. Your code will be tested on new test data.

30 Part I: implementing an English tokenizer The shell command line should be cat input_file |./eng_tokenizer.sh > output_file Your code can use additional arguments. The i th input line should correspond to the i th output line. Don’t merge the tokens in the input. Design your own criteria and specify that in your note file. Don’t spend too much time trying to make your tokenizer perfect.

31 Part 2: make_voc Given the input text: The teacher bought the bike. The bike is expensive. The output should be The 2 the 1 teacher 1 bought 1 bike. 1 bike 1 is 1 expensive. 1

32 Coming up Next Monday: –Go over Quiz 1 –A quick review of probability theory. –Start on regular expression and FSA. Hw1 is due next Wed at 3pm.

33 Quiz #1