Foundations of Statistical NLP, Chapter 4: Corpus-Based Work. 홍 정 아

2 Overview
• Getting Set Up
  – Computers, Corpora, Software
• Looking at Text
  – Low-level formatting issues
  – Tokenization: What is a word?
  – Morphology
  – Sentences
• Marked-up Data
  – Markup schemes
  – Grammatical tagging

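3 Getting Set Up (1/2)
• Text corpora are usually big.
  – Their size has been a major limitation on the use of corpora.
  – Advances in high-capacity computers have overcome this.
• Corpora
  – Corpora that the main distributing organizations publish on the web can be used.
  – corpus: a collection of language data
  – A general issue is whether the corpus is a representative sample of the population of interest.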

4 Getting Set Up (2/2)
• Software
  – Text editors: show the text literally, exactly as it is stored.
  – Regular expressions: let you search for exact patterns.
  – Programming languages: C, C++, Perl
  – Programming techniques
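
As an illustration of pattern search with regular expressions, here is a minimal Python sketch. The sample sentence and the helper name `find_ing_words` are assumptions made for this example; the slides list C, C++, and Perl, and the same pattern would work in Perl.

```python
import re

# Hypothetical sample text; a corpus file read into a string works the same way.
text = "Tokenizing and tagging are early steps in processing a corpus."

# \b marks a word boundary; \w+ matches the characters before the "ing" suffix.
ING_WORD = re.compile(r"\b\w+ing\b")

def find_ing_words(s):
    """Return every word in s that ends in 'ing', in order of appearance."""
    return ING_WORD.findall(s)

print(find_ing_words(text))
```

The pattern requires at least one character before "ing", so the word "in" is not matched.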

5 Looking at Text
• Text comes either in a raw format or marked up.
• Markup
  – The term is used for putting codes of some sort into a computer file.
  – Commercial word processing: WYSIWYG
• Features of text in human languages
  – These are what make natural language processing difficult.

6 Low-level formatting issues
• Junk formatting/content
  – junk: document headers, separators, tables, diagrams, etc.
  – OCR: if your program is meant to deal only with connected English text, then text in other languages, tables, and numbers are junk.
• Uppercase and lowercase
  – The original Brown corpus: * was used to mark a capital letter.
  – Should we treat brown in Richard Brown and brown paint as the same?
  – Proper name detection: a difficult problem
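
One simple heuristic for the uppercase/lowercase question is to lowercase only the first token of each sentence and leave other capitals alone as possible name cues. The sketch below assumes the input is already tokenized; the function name `fold_case` and the sample tokens are made up for illustration.

```python
SENTENCE_END = {".", "?", "!"}

def fold_case(tokens):
    """Lowercase only sentence-initial tokens; keep all other capitals."""
    folded = []
    at_sentence_start = True
    for tok in tokens:
        folded.append(tok.lower() if at_sentence_start else tok)
        # The next token starts a sentence if this one is end punctuation.
        at_sentence_start = tok in SENTENCE_END
    return folded

tokens = ["The", "paint", "is", "brown", ".",
          "Richard", "Brown", "bought", "it", "."]
print(fold_case(tokens))
```

The heuristic keeps "Brown" distinct from "brown" but wrongly lowercases the sentence-initial name "Richard", which is one reason proper name detection is hard.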

7 Tokenization: What is a word? (1)
• Tokenization
  – Dividing the input text into units called tokens
  – What is a word? The graphic word (Kucera and Francis, 1967): "a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks"
  – Problem cases for this definition: $22.50, Micro$oft, C|net, :-)
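
The graphic-word definition can be checked mechanically. The following is a rough sketch, assuming ASCII letters and digits and the straight apostrophe only; the names `GRAPHIC_WORD` and `is_graphic_word` are invented for this example.

```python
import re

# Letters, digits, hyphens, and apostrophes only, with at least one
# alphanumeric character somewhere in the chunk.
GRAPHIC_WORD = re.compile(r"(?=.*[A-Za-z0-9])[A-Za-z0-9'\-]+")

def is_graphic_word(chunk):
    """True if a whitespace-delimited chunk fits the graphic-word definition."""
    return GRAPHIC_WORD.fullmatch(chunk) is not None

sample = "isn't co-operate $22.50 Micro$oft C|net :-)"
for chunk in sample.split():
    print(chunk, is_graphic_word(chunk))
```

Splitting on white space and applying the test rejects $22.50, Micro$oft, C|net, and :-), even though a reader would count most of them as words.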

8 Tokenization: What is a word? (2)
• Period
  – Marks the end of a sentence.
  – Marks an abbreviation, as in etc. or Wash.
• Single apostrophes
  – English contractions: I’ll or isn’t
  – dog’s: dog is, dog has, or the possessive
• Hyphenation
  – In print, a hyphen usually marks a single word that continues on the next line.
  – Hyphens also occur inside words: …, 26-year-old, co-operate
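
Line-break hyphenation can be undone with a single substitution. This is a minimal sketch, not a complete solution; `dehyphenate` is a name chosen for this example.

```python
import re

def dehyphenate(text):
    """Join a word that was split by a hyphen at the end of a line."""
    # (\w)-\n(\w): a word character, a hyphen, a newline, a word character.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

print(dehyphenate("apo-\nstrophes"))  # joins the two halves
print(dehyphenate("co-\noperate"))    # also drops a hyphen that may be real
```

The second call shows the ambiguity: when a word such as co-operate breaks at its own hyphen, the rule cannot tell that the hyphen belongs to the word.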

9 Tokenization: What is a word? (3)
• The same form representing multiple "words"
  – Homographs: seal (the animal) and seal (a stamp), etc. (chap. 7)
• Word segmentation in other languages
  – Many languages do not put spaces between words.
• White space not indicating a word break
  – the New York-New Haven railroad: a single word contains a space.
• Information with one clear meaning appears in many different forms
  – e.g., phone numbers written with varied punctuation

10 Table 4.2 Different formats for telephone numbers appearing in an issue of the Economist

Phone number   Country   | Phone number   Country
…              UK        | …              Denmark
(44.171) …     UK        | …              Pakistan
+44 (0) …      UK        | +411/…         Switzerland
…              UK        | (94-1) …       Sri Lanka
(202) …        USA       | …              Germany
…              USA       | …              France
…              USA       | …              The Netherlands
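
One way to compare phone numbers written with different punctuation is to reduce each to its digits. The sketch below uses made-up numbers, not the ones from Table 4.2, and `digits_only` is a hypothetical helper name.

```python
import re

def digits_only(phone):
    """Strip every non-digit character from a phone number string."""
    return re.sub(r"\D", "", phone)

# Made-up variants of one number, written with different punctuation.
variants = ["(202) 555-0147", "202.555.0147", "202 555 0147"]
print({digits_only(v) for v in variants})
```

This handles punctuation only. Country codes and an optional trunk prefix such as the "(0)" in "+44 (0)" would need extra rules.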

11 Morphology
• Stemming
  – Strips prefixes, suffixes, and other affixes to obtain the stem.
• Lemmatization
  – Finds the lemma (headword) and the lexeme for an inflected form.
• The IR community has shown that stemming does not help retrieval performance.
• The benefit is small compared with the extra cost of implementing morphological analysis.
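
To make the idea of affix stripping concrete, here is a deliberately naive suffix stripper. It is not the Porter stemmer or any published algorithm; the suffix list, the minimum stem length, and the name `naive_stem` are all assumptions for illustration.

```python
# Suffixes are tried in this order; the first that applies is removed.
SUFFIXES = ("ing", "ed", "es", "s")
MIN_STEM_LENGTH = 3  # assumed threshold so very short words stay whole

def naive_stem(word):
    """Remove one common English suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

for w in ["tokens", "tagged", "parsing", "corpus", "seal"]:
    print(w, "->", naive_stem(w))
```

The output for "corpus" (the stem "corpu") shows how blind stripping over-stems, which is part of why real morphological analysis costs more to build.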

12 Sentences
• What is a sentence?
  – Something ending with a '.', '?' or '!'
  – Text ending in a colon, semicolon, or dash can also be regarded as a sentence.
• Recent research on sentence boundary detection
  – Riley (1989): statistical classification trees
  – Palmer and Hearst (1994; 1997): a neural network to predict sentence boundaries
  – Mikheev (1998): Maximum Entropy approaches to the problem
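
A rule-based baseline for sentence boundary detection can be sketched in a few lines. This is not the method of Riley, Palmer and Hearst, or Mikheev; it only applies the end-punctuation rule with a small, hypothetical abbreviation list.

```python
# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"etc.", "wash.", "mr.", "dr."}

def split_sentences(text):
    """Split at tokens ending in . ? or ! unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "?", "!"))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without end punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Brown moved to Wash. last year. Is that right? Yes!"))
```

The rule fails when an abbreviation also ends a sentence, which is the kind of case the statistical approaches are meant to handle.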

13 Mark-up Schemes
• Early markup schemes
  – Simply inserted content information in a header (giving author, date, title, etc.)
• SGML
  – A grammar language that standardizes the structure and syntax of documents
• XML
  – A reduced version of SGML, created to apply SGML to the web
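
A small XML document shows how markup separates header information from text. The element names below (`document`, `header`, `body`, `s`) and the contents are invented for this sketch; they do not follow any particular corpus encoding standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up document: a header with metadata, a body of sentences.
doc = """<document>
  <header><author>A. Writer</author><date>1999-01-01</date><title>Sample</title></header>
  <body><s>First sentence.</s><s>Second sentence.</s></body>
</document>"""

root = ET.fromstring(doc)
title = root.findtext("header/title")          # header metadata
sentences = [s.text for s in root.iter("s")]   # marked-up sentence units

print(title)
print(sentences)
```

Because sentence boundaries are marked explicitly, the reader of the file does not have to guess them from punctuation.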

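14 Grammatical tagging
• First step of analysis
  – Distinguishing words by general grammatical category
  – Distinguishing superlative from comparative, singular from plural nouns, etc.
• Tag sets (Table 4.5)
  – Incorporate morphological distinctions.
• The design of a tag set
  – Classification view: how useful the grammatical information about a word is
  – Prediction view: predicting how the word affects other words in context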
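
Tagged text is often stored as word/TAG pairs. The sketch below parses that format; the sample line uses tags written in the style of the Brown tag set (AT, NN, NNS, VBD), and the function name `parse_tagged` is an assumption for this example.

```python
from collections import Counter

def parse_tagged(line):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits at the last slash, so words containing '/' survive.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

line = "The/AT dog/NN chased/VBD the/AT cats/NNS"
pairs = parse_tagged(line)
print(pairs)
print(Counter(tag for _, tag in pairs))
```

The NN and NNS tags illustrate a morphological distinction (singular versus plural noun) folded into the tag set.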
