Chapter 3 : Corpus-Based Work Presented By: Geoff Hulten.



Overview
Computers, corpora, and software
Looking at Text - problems with simple processing
Dealing with markup

Computers
Important considerations are hard disk space and RAM
The text doesn’t make specific recommendations, as computers are changing so quickly
As of the writing, a personal computer with extra RAM seems sufficient

Corpora
For pay
–Marked up text
–Consistent (or at least known) text sources
For free
–Tons of available data on the WWW
–Automatic markup is often reasonably good

Software
Good text editor (with regexp search)
Programming languages
–Mostly done in C/C++ for efficiency
–Use other tools (Perl/awk/Python) for initial formatting
–Prolog, SNOBOL, SPITBOL, and Icon are also used for their specific strengths with data structures or text processing

Programming Techniques
Map words to numbers for processing
–Speeds comparison and saves space
Use a series of programs for counting
–This helps reduce memory requirements
–The first program emits a token for each counted event
–Other programs (perhaps UNIX utilities) sort or count these tokens
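Both techniques on this slide can be sketched in a few lines of Python. This is only an illustration: the function names (build_vocab, encode, emit_events, count_sorted) and the sample sentence are made up for the example, and the second half imitates a "sort, then count adjacent duplicates" pipeline in a single process rather than with separate programs.

```python
from collections import Counter
from itertools import groupby


def build_vocab(tokens):
    """Assign each distinct word a small integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace every word by its integer id."""
    return [vocab[tok] for tok in tokens]


def emit_events(text):
    """Stage 1 of the counting pipeline: emit one token per counted event."""
    for word in text.split():
        yield word


def count_sorted(events):
    """Stage 2: sort the emitted tokens, then count runs of equal tokens."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(sorted(events))]


# Hypothetical sample text, used only for illustration.
sample = "the dog saw the cat and the cat saw the dog"
tokens = sample.split()
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
id_counts = Counter(ids)                      # counting integers instead of strings
word_counts = count_sorted(emit_events(sample))
```

Comparing two integers is a single machine operation, while comparing two strings walks their characters, which is where the speed and space savings come from.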

Online Resources
–A huge collection of linguistic tools
–A large index of online linguistics resources
–Corpora mailing list archive
Text’s web site

Section II: Looking at Text - Problems with Simple Processing
Junk formatting/content
–Document headers
–Tables, figures, and footnotes
–OCR problems
Upper-case and lower-case
–Convert all to upper or lower-case? ‘Richard Brown’ vs ‘brown paint’
–Heuristic: convert sentence starts to lower-case
  Keep a list of proper names to leave upper-case
  Other heuristics?
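The case heuristic might look roughly like the sketch below. The PROPER_NAMES set and the function name are hypothetical, and the input is assumed to be one already-tokenized sentence.

```python
# Hypothetical list of names to leave capitalized; a real list would be far larger.
PROPER_NAMES = {"Richard", "Brown"}


def normalize_sentence_start(tokens, proper_names=PROPER_NAMES):
    """Lower-case the first token of a sentence unless it is a known proper name."""
    if not tokens:
        return []
    first = tokens[0]
    if first not in proper_names:
        first = first.lower()
    return [first] + list(tokens[1:])


plain = normalize_sentence_start(["The", "paint", "is", "brown"])
named = normalize_sentence_start(["Richard", "Brown", "paints"])
ambiguous = normalize_sentence_start(["Brown", "paint", "is", "cheap"])
```

The last call shows the weakness the slide hints at: a sentence-initial "Brown" that really means the color stays capitalized because the name list cannot tell the two readings apart.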

Tokenization: What is a word?
Divide text into words and sentences by finding word and sentence boundaries; strip punctuation.
Graphic word: “a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks”.
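One possible reading of the graphic-word definition as code is sketched below. The regular expression and the function name are assumptions made for illustration; in particular, stripping punctuation from both ends of each whitespace-delimited chunk is one interpretation of "strip punctuation", and it treats only ASCII letters and digits as alphanumeric.

```python
import re
import string

# Alphanumeric runs joined by single internal hyphens or apostrophes.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")

# Characters removed from the edges of each chunk before matching.
EDGE_PUNCTUATION = string.punctuation + "“”‘’"


def graphic_words(text):
    """Return whitespace-delimited chunks that qualify as graphic words."""
    words = []
    for chunk in text.split():
        core = chunk.strip(EDGE_PUNCTUATION)
        if core and GRAPHIC_WORD.fullmatch(core):
            words.append(core)
    return words


found = graphic_words("The so-called dog's bowl, by Micro$oft.")
```

Here "Micro$oft" is rejected because of the internal "$". Edge stripping also removes a word-final apostrophe, so a plural possessive loses its marker under this sketch.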

Problems with Graphic Words
Online data (web pages or news groups)
–Micro$oft and C|net
The punctuation marks . ; :
–Seems simple to strip ; and :
–The period is used as a sentence ender and more: etc., Calif., ‘Wash. state’ vs ‘wash the dog’

Problems with Single Apostrophes
Count them as one word or as two? (I’ll or I will)
What is dog’s?
–The dog is/has
–genitive or possessive case of dog
–a clitic or phrasal affix
Orthographic-word-final single quotations
–end of quotation
–plural possessive

Problems with Hyphenation
Typographical: one word split across lines (doesn’t usually occur in electronic texts)
Other one-word cases
–co-operate
–so-called
Hyphens for word grouping
–the once-quiet study
–the aluminum-export ban

Problems with Hyphenation (cont.)
Quotation or expression of quantity
–take-it-or-leave-it
–the 90-cent-an-hour rise
Inconsistent usage (even in single sources); Dow Jones has
–database
–data-base
–data base
Dashes are usually rendered as two hyphens

More Problems
The same token but two different words
–The mill uses a saw
–I saw the house
Treat multiple words as a single token
–The New York-New Haven railroad
–Phrasal verbs or other fixed phrases: work out, make up, because of

Morphology - Stemming
Collapse sit, sits, and sat to one token
Ambiguity: lying
Extensive empirical research shows that stemming does not help query-based IR
–Stemming can cost a lot of information: operating system vs operating a tractor
–English has a very limited morphology
–Interactive IR, or more context, may allow stemming to be more useful

Sentences
90% of periods are sentence boundaries
The marks : ; and -- may effectively bound sentences
Nested sentences: “You remind me,” she remarked, “of your mother.”
Ending quote after punctuation

Sentence Boundary Heuristic
Place temporary boundaries after . ? ! ; : --
Move them to after following double quotes
Disqualify a period boundary when
–it is preceded by a known abbreviation that doesn’t normally end a sentence (like Prof.)
–it is preceded by a known abbreviation and it isn’t followed by an upper-case word
Disqualify a ? or ! boundary when
–it is followed by a closing quote then a lower-case letter or known name
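The heuristic above can be sketched over whitespace-separated tokens as follows. The abbreviation and name lists are hypothetical placeholders, and the sketch simplifies two points: a closing quote attached to a token is kept with that token (which has the effect of moving the boundary after the quote), and the ? or ! rule is applied whether or not a closing quote is present.

```python
# Hypothetical word lists; real systems need much larger ones.
NON_FINAL_ABBREVS = {"Prof.", "Dr.", "Mr."}   # do not normally end a sentence
OTHER_ABBREVS = {"etc.", "Calif.", "Jr."}     # may or may not end a sentence
KNOWN_NAMES = {"Mary"}

BOUNDARY_MARKS = (".", "?", "!", ";", ":", "--")
CLOSING_QUOTES = '"”'
OPENING_QUOTES = '"“'


def split_sentences(text):
    """Split text into sentences using the slide's boundary heuristic."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        has_next = i + 1 < len(tokens)
        nxt = tokens[i + 1].lstrip(OPENING_QUOTES) if has_next else ""
        # Ignore a trailing closing quote when looking for the boundary mark.
        core = tok.rstrip(CLOSING_QUOTES)
        if not core.endswith(BOUNDARY_MARKS):
            continue
        if core.endswith("."):
            if core in NON_FINAL_ABBREVS:
                continue
            if core in OTHER_ABBREVS and has_next and not nxt[:1].isupper():
                continue
        elif core.endswith(("?", "!")):
            if nxt[:1].islower() or nxt.rstrip(".,;:!?") in KNOWN_NAMES:
                continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


result = split_sentences(
    "Prof. Smith lives in Calif. near the coast. “Really?” she asked. He left."
)
```

In the sample, "Prof." and "Calif." are disqualified by the two abbreviation rules, and the quoted question is kept inside the sentence that reports it because a lower-case word follows.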

Sentence Boundary (cont.)
Statistical methods get 98-99% accuracy
–Used parts of speech of words just before and after potential boundaries
–Is largely language independent
Maximum Entropy approach got 99.25% accuracy

Section III: Marked-up Data
Can mark just about anything
–Sentence or paragraph boundaries
–Full syntactic structure
–Parts of speech (most common)
There are many methods
–Ad hoc angle brackets, slashes or underlines
–SGML (HTML is an application of SGML)
–XML (simplified subset of SGML)

Really Quick SGML Introduction
Document Type Definition (DTD)
–a grammar for the structure of the document
One or more recursively nested elements
–text set off with starting and ending tags, e.g. for a paragraph or a sentence
–character and entity references: &reference;

Nature of Tag Sets
Detailed parts of speech differentiation
–e.g., differentiate comparative and superlative forms of adjectives
Attributes
–Title words
–Foreign words
Contractions may get multiple (combined) tags
Punctuation may get tags

SHOW THE TAGS FROM THE TEXT

Tag Set Design
–Types of part of speech classification
  Semantic (notional) grounds
  Syntactic distributional grounds
  Morphological grounds
–Would like to pick the classification that is the best predictor of the parts of speech of nearby words
–This often doesn’t happen
  Fulton/NP-TL County/NN-TL Purchasing/VBG Department/NN
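Text tagged in the word/TAG style of the example above is easy to read back into a program. The sketch below is illustrative only: the function name is made up, and it assumes that anything after a hyphen in a tag is an attribute suffix (as -TL appears to be in the example), which does not cover attributes written as prefixes.

```python
def parse_tagged(text):
    """Parse 'word/TAG' items into (word, base_tag, attributes) triples."""
    triples = []
    for item in text.split():
        if "/" not in item:
            # Untagged token: keep the word, record no tag.
            triples.append((item, None, []))
            continue
        # Split on the last slash so that words containing '/' survive.
        word, _, tag = item.rpartition("/")
        base, *attributes = tag.split("-")
        triples.append((word, base, attributes))
    return triples


parsed = parse_tagged("Fulton/NP-TL County/NN-TL Purchasing/VBG Department/NN")
```

The parsed result makes the slide's point visible: within one name, the first two words carry a title attribute while the last two do not, so a word's tag is not always a good predictor of its neighbor's.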

Finishing touches
Errors in these steps will lead to errors in later steps
–Removing junk
–Upper-lower case
–Finding words
–Finding sentences
–Automatic or even manual tagging

More finishing touches
How serious will these errors be?
–Will 99% accuracy in these things greatly hurt later processes?
Is the WWW a good corpus?
–Lots of small documents by different authors
–Lots of ‘junk’ content
–Very few copy-editors, spell checkers, grammar checkers