Statistical NLP: Lecture 6 Corpus-Based Work (Ch 4)

Corpus-Based Work
Text corpora are usually big.
– This has been an important practical limitation on using corpora.
– It has been overcome by the development of large-capacity computers.
Corpus-based work involves collecting a large number of counts from corpora, and those counts need to be accessed quickly.
There exists some software for processing corpora.
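A minimal sketch of collecting and looking up counts in Python (the file name "corpus.txt" and the whitespace tokenization are placeholders, not from the lecture):

```python
from collections import Counter

# Count word frequencies in a (hypothetical) corpus file; a Counter is a
# hash table, so individual counts can then be accessed quickly.
with open("corpus.txt", encoding="utf-8") as f:
    counts = Counter(f.read().lower().split())

print(counts.most_common(10))   # the ten most frequent word types
print(counts["corpus"])         # constant-time lookup of one count
```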

Corpora
Linguistically marked up or not.
A corpus should be a representative sample of the population of interest:
– American English vs. British English
– Written vs. spoken
– Subject areas
The performance of a system depends heavily on:
– the entropy of the text
– text categorization
Balanced corpus vs. all the text available.

Software
– Text editor: shows the text exactly as it is.
– Regular expressions: let you find exact patterns.
– Programming languages: C/C++, Perl, awk, Python, Prolog, Java
– Programming techniques
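For instance, a few lines of Python with the standard re module can pull an exact pattern out of a corpus file (the pattern and file name are illustrative only):

```python
import re

# Print every line of a hypothetical corpus file that contains a word
# ending in "-ing", together with its line number.
pattern = re.compile(r"\b\w+ing\b")
with open("corpus.txt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if pattern.search(line):
            print(lineno, line.rstrip())
```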

Looking at Text
Text comes either in a raw format or marked up.
Markup
– A term used for putting codes of some sort into a computer file.
– Commercial word processing: WYSIWYG
Features of text in human languages
– These are what make natural language processing difficult.

Low-Level Formatting Issues
Junk formatting/content
– Document headers and separators, typesetter codes, tables and diagrams, garbled data in the computer file.
– OCR output, if your program is meant to deal only with connected English text.
Uppercase and lowercase
– Should we keep the case or not? "The", "the" and "THE" should all be treated the same, but "brown" in "George Brown" and in "brown dog" should be treated separately.
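A crude sketch of that case decision (lowercase only sentence-initial capitals, keep mid-sentence capitals as likely proper nouns); the heuristic and the pre-tokenized input are assumptions for illustration:

```python
def normalize_case(tokens):
    """Lowercase a capitalized token only at the start of a sentence;
    keep mid-sentence capitals (e.g. the 'Brown' of 'George Brown')."""
    out, sentence_start = [], True
    for tok in tokens:
        out.append(tok.lower() if sentence_start and tok[:1].isupper() else tok)
        sentence_start = tok in {".", "?", "!"}
    return out

print(normalize_case(["The", "brown", "dog", "saw", "George", "Brown", "."]))
# ['the', 'brown', 'dog', 'saw', 'George', 'Brown', '.']
```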

Tokenization: What is a Word? (1)
Tokenization
– Dividing the input text into units called tokens.
– What is a word?
Graphic word (Kucera and Francis, 1967): "a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks."
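The graphic-word definition translates almost directly into a regular expression; a sketch (this simple tokenizer is only an illustration, not the chapter's final answer):

```python
import re

# "contiguous alphanumeric characters ... may include hyphens and
# apostrophes, but no other punctuation marks"
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text):
    return GRAPHIC_WORD.findall(text)

print(tokenize("The take-it-or-leave-it offer isn't final."))
# ['The', 'take-it-or-leave-it', 'offer', "isn't", 'final']
```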

Tokenization: What is a Word? (2)
Period
– Marks the end of a sentence.
– Marks an abbreviation: as in etc. or Wash.
Single apostrophes
– isn't, I'll → two words or one word?
– English contractions: I'll or isn't
Hyphenation
– Typographically marks a single word carried over to the next line.
– text-based, co-operation, A-1-plus paper, "take-it-or-leave-it", the 90-cent-an-hour raise, mark up → mark-up → mark(ed) up

Tokenization: What is a Word? (3)
Word segmentation in other languages:
– No whitespace ==> word segmentation is hard.
Whitespace not indicating a word break:
– New York, data base
– the New York-New Haven railroad
Information with a clear meaning appears in many different surface forms:
– e.g., telephone numbers written in many formats, such as (202) … or (44.171) … (see Table 4.2).

Tokenization: What is a Word? (4)
Table 4.2: Different formats for telephone numbers appearing in an issue of the Economist — e.g. (44.171) … (UK), +44 (0) … (UK), +411/ … (Switzerland), (94-1) … (Sri Lanka), (202) … (USA); other entries cover Denmark, Pakistan, Germany, France and the Netherlands.
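Recognizing even two of these formats already takes fairly specific patterns; a hedged sketch (the two regular expressions and the sample numbers are illustrative assumptions, not a treatment of the whole table):

```python
import re

US_STYLE   = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")         # e.g. (202) 555-0191
INTL_STYLE = re.compile(r"\+\d{1,3}(?:[ /.-]\d{1,4}){2,}")  # e.g. +45 43 48 60 60

for s in ["(202) 555-0191", "+45 43 48 60 60", "not a number"]:
    print(s, "->", bool(US_STYLE.search(s) or INTL_STYLE.search(s)))
```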

Morphology
Stemming: strips off affixes.
– sit, sits, sat
Lemmatization: transforms words into their base form (lemma, lexeme).
– Requires disambiguation.
Not always helpful in English (from an IR point of view), which has very little morphology.
– The IR community has shown that stemming does not help retrieval performance.
– Multiple words → a single morpheme?
– The benefit is poor compared with the extra cost of implementing morphological analysis.

Stemming
Maps the various inflected forms of the same word onto a single index term.
– e.g., converts "computer", "computing", etc. into "compute".
Advantages
– Reduces storage space and improves retrieval speed.
– Improves the quality of search results (a query for "compute" also retrieves words such as "computer" and "computing").
Disadvantages
– Over-stemming: too many characters are removed, so unrelated words are matched.
– Under-stemming: too few characters are removed, so related words fail to match.

Porter Stemming Algorithm
The most widely used stemmer; applies a variety of rules.
Does not remove prefixes; it only removes suffixes or replaces them with a new string.
– Before running the Porter stemmer
– After running the Porter stemmer
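As a small, hedged illustration of that before/after effect, using NLTK's PorterStemmer (assuming the NLTK package is installed):

```python
from nltk.stem import PorterStemmer  # assumes NLTK is available

stemmer = PorterStemmer()
for word in ["compute", "computer", "computing", "computation"]:
    print(word, "->", stemmer.stem(word))
# All four variants should reduce to a single stem (e.g. "comput"),
# so they can share one index entry.
```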

Porter Stemming Algorithm

Error #1: Words ending with "yed" and "ying" and having different meanings may end up with the same stem:
– Dying -> dy (passing away)
– Dyed -> dy (impregnated with dye)
Error #2: The removal of "ic" or "ical" from words having m=2 and ending with a series of consonant, vowel, consonant, vowel, such as generic, politic, …, conflates them with unrelated words:
– Political -> polit
– Politic -> polit
– Polite -> polit
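These cases can be reproduced with an implementation of the unmodified Porter rules; a hedged sketch using NLTK's PorterStemmer in its ORIGINAL_ALGORITHM mode (an assumption; NLTK's default mode patches some of these words, e.g. mapping "dying" to "die", which would hide the error):

```python
from nltk.stem.porter import PorterStemmer

# Use the unmodified Porter rules rather than NLTK's extended defaults.
stemmer = PorterStemmer(mode=PorterStemmer.ORIGINAL_ALGORITHM)

for word in ["dying", "dyed", "political", "politic", "polite"]:
    print(word, "->", stemmer.stem(word))
# Expected, per the errors above: dying/dyed -> dy,
# political/politic/polite -> polit
```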

Sentences
What is a sentence?
– Something ending with a '.', '?' or '!'. True in 90% of the cases.
– Colons, semicolons and dashes can also be regarded as ending a sentence.
Sometimes, however, sentences are split up by other punctuation marks or quotes.
Often, solutions involve heuristic methods; however, these solutions are hand-coded. Some effort has also been made to automate the sentence-boundary process.
Korean is even harder!
– Sometimes there is no period → place the boundary after a sentence-final ending?
– Some endings are both connective and sentence-final.
– Quotation marks.

End-of-Sentence Detection (I)
Place a putative EOS after every . ? ! (and maybe ; : -).
Move the EOS after quotation marks, if any.
Disqualify a period boundary if:
– It is preceded by a known abbreviation that is followed by an upper-case letter and is not normally sentence-final: e.g., Prof. or Mr.

End-of-Sentence Detection (II)
– It is preceded by a known abbreviation that is not followed by an upper-case letter: e.g., Jr. or etc. (abbreviations that can be sentence-final or sentence-medial).
Disqualify a sentence boundary with ? or ! if it is followed by a lower-case word (or a known name).
Keep all the rest as EOS.
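A sketch of this heuristic in Python (the abbreviation lists are tiny placeholders, and the quotation-mark step is omitted):

```python
NOT_SENT_FINAL    = {"Prof.", "Mr.", "Dr."}   # abbreviations rarely sentence-final
SENT_FINAL_OR_MED = {"Jr.", "etc."}           # may be sentence-final or medial

def is_eos(token, next_token):
    """Decide whether `token` ends a sentence, following the rules above."""
    if token.endswith(("?", "!")):
        # disqualify if followed by a lower-case word (or a known name)
        return not (next_token and next_token[0].islower())
    if token.endswith("."):
        if token in NOT_SENT_FINAL:
            return False
        if token in SENT_FINAL_OR_MED:
            return bool(next_token) and next_token[0].isupper()
        return True
    return False

tokens = "He met Prof. Smith. Was it fun? yes, very.".split()
for tok, nxt in zip(tokens, tokens[1:] + [""]):
    if is_eos(tok, nxt):
        print("EOS after:", tok)
# EOS after: Smith.
# EOS after: very.
```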

Marked-Up Data I: Mark-up Schemes
Early markup schemes
– Simply put content information into a header (giving author, date, title, etc.).
SGML
– A grammar language for standardizing the structure and syntax of documents.
XML
– A simplified subset of SGML created to bring SGML to the web.
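A toy example of this kind of markup, parsed with Python's standard xml library (the document and its header fields are invented for illustration):

```python
import xml.etree.ElementTree as ET

doc = """<document>
  <header author="A. Author" date="1996" title="A Sample Text"/>
  <body>This is the text of the document.</body>
</document>"""

root = ET.fromstring(doc)
print(root.find("header").get("title"))   # A Sample Text
print(root.find("body").text)             # This is the text of the document.
```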

Marked-Up Data II: Grammatical Tagging
The first step of analysis:
– classifying words into broad grammatical categories;
– distinguishing superlatives, comparatives, singular vs. plural nouns, etc.
Tag sets (Table 4.5)
– incorporate morphological distinctions.
The design of a tag set:
– The classification view: how useful a word's grammatical information is as a feature.
– The prediction view: how well a tag predicts the effect a word has on other words in its context.

Examples of Tagset (Korean)

Examples of Tagset (English)
– Brown corpus tagset
– Penn Treebank tagset
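As a hedged illustration of Penn Treebank tags in use (via NLTK's pos_tag, assuming NLTK and its tagger model are installed; the exact tags may differ):

```python
import nltk  # assumes NLTK and the model used by nltk.pos_tag are installed

tokens = ["The", "brown", "dog", "saw", "George", "Brown", "."]
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('brown', 'JJ'), ('dog', 'NN'), ('saw', 'VBD'),
#       ('George', 'NNP'), ('Brown', 'NNP'), ('.', '.')]
```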