Published by Mae Myrtle Gardner. Modified over 8 years ago.
1
Chapter 3 : Corpus-Based Work Presented By: Geoff Hulten
2
Overview Computers, corpora, and software Looking at Text - problems with simple processing Dealing with markup
3
Computers Important considerations are hard disk space and RAM The text doesn’t make specific recommendations, as computers are changing so quickly As of the writing, a personal computer with extra RAM seems sufficient
4
Corpora For pay –Marked up text –Consistent (or at least known) text sources For free –Tons of available data on the WWW –Automatic markup is often reasonably good
5
Software Good text editor (with regexp search) Programming Languages –Mostly done in C/C++ for efficiency –Use other tools (Perl/awk/Python) for initial formatting –Prolog, SNOBOL, SPITBOL, and Icon are also used for their specific strengths with data structures or text processing
6
Programming Techniques Map words to numbers for processing –Speeds up comparison and saves space Use a series of programs for counting –This helps reduce memory requirements –The first program emits a token for each counted event –Other programs (perhaps UNIX utilities) sort or count these tokens
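A minimal sketch of the two techniques on this slide, assuming whitespace-separated words. The function names (`build_vocab`, `emit_tokens`, `count_tokens`) and the sample lines are illustrative and do not come from the text.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def build_vocab(words: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct word a small integer id, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def emit_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Stage 1: emit one token per counted event (here, one per word)."""
    for line in lines:
        yield from line.split()


def count_tokens(tokens: Iterable[str]) -> Counter:
    """Stage 2: count the emitted tokens, the job `sort | uniq -c` does on UNIX."""
    return Counter(tokens)


lines = ["the dog saw the cat", "the cat sat"]
vocab = build_vocab(emit_tokens(lines))
ids = [vocab[w] for w in emit_tokens(lines)]   # integer ids compare faster than strings
counts = count_tokens(emit_tokens(lines))
```

Because the stages are generators, stage 1 never holds the whole corpus in memory. In a real pipeline the stages could be separate programs connected by pipes.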
7
Online Resources http://www.dfki.de/lt/registry/ –A huge collection of linguistic tools http://www.sil.org/linguistics/ –A large index of online linguistics resources http://nora.hd.uib.no/corpora/archive.html –Corpora mailing list archive Text’s web site
8
Section II: Looking at Text - Problems with Simple Processing Junk formatting/content –Document Headers –Tables, Figures, and Footnotes –OCR problems Upper-case and lower-case –Convert all to upper or lower-case? ‘Richard Brown’ vs ‘brown paint’ –Heuristic: convert sentence starts to lower-case Keep a list of proper names to leave upper-case Other Heuristics?
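The sentence-start heuristic could be sketched as follows, assuming the sentence is already split into tokens. The name list `PROPER_NAMES` and the function name are hypothetical; a real list would be far larger.

```python
from typing import List, Set

# Hypothetical list of proper names that must keep their capital letter.
PROPER_NAMES: Set[str] = {"Richard", "Brown"}


def normalize_sentence_start(tokens: List[str],
                             proper_names: Set[str] = PROPER_NAMES) -> List[str]:
    """Lower-case the first token of a sentence unless it is a known proper name."""
    if not tokens:
        return []
    first = tokens[0]
    if first not in proper_names:
        first = first.lower()
    return [first] + tokens[1:]


print(normalize_sentence_start(["The", "brown", "paint"]))
print(normalize_sentence_start(["Richard", "Brown", "paints"]))
```

The sketch also shows the heuristic's weak spot: a sentence that begins with "Brown paint ..." keeps its capital, because "Brown" is on the name list.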
9
Tokenization: What is a word? Divide text into words and sentences at their boundaries, and strip punctuation. Graphic word: “a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks”.
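One way to read the graphic-word definition as code is sketched below. It assumes ASCII letters and digits only, and it strips punctuation from the edges of each whitespace-delimited chunk before testing it. The names `GRAPHIC_WORD` and `graphic_words` are illustrative.

```python
import re
import string
from typing import List

# Alphanumeric runs, optionally joined by internal hyphens or apostrophes.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9]+(?:[-'’][A-Za-z0-9]+)*")

# Characters removed from the edges of a chunk (ASCII punctuation plus curly quotes).
EDGE_PUNCTUATION = string.punctuation + "“”‘’"


def graphic_words(text: str) -> List[str]:
    """Return the whitespace-delimited chunks that qualify as graphic words."""
    words = []
    for chunk in text.split():
        core = chunk.strip(EDGE_PUNCTUATION)
        # Reject chunks with any other internal punctuation, e.g. Micro$oft.
        if core and GRAPHIC_WORD.fullmatch(core):
            words.append(core)
    return words


print(graphic_words("I'll e-mail Micro$oft, she said."))
```

Stripping edge punctuation also removes a word-final apostrophe, so a plural possessive such as dogs' loses its marker. A chunk like Micro$oft is dropped entirely, which is rarely what a real tokenizer wants.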
10
Problems with Graphic Words Online data (web pages or news groups) –Micro$oft and C|net Punctuation: . ; : –Seems simple to strip ; and : –The period (.) is used as a sentence ender and for more: etc., Calif., ‘Wash. state’ vs ‘wash the dog’
11
Problems with Single Apostrophes Count them as one word or as two? (I’ll or I will) Is dog’s: –a contraction of ‘the dog is/has’ –the genitive or possessive case of dog –a clitic or phrasal affix Orthographic-word-final single quotations –end of quotation –plural possessive
12
Problems with Hyphenation Typographical, one word split across lines (doesn’t usually occur in electronic texts) other one word cases –co-operate –e-mail –so-called hyphens for word grouping –the once-quiet study –the aluminum-export ban
13
Problems with Hyphenation (cont.) quotation or expression of quantity –take-it-or-leave-it –the 90-cent-an-hour rise Inconsistent usage (even in single sources), Dow Jones has –database –data-base –data base dashes usually are rendered as two hyphens
14
More Problems The same word but two tokens –The mill uses a saw –I saw the house Treat multiple words as a single token –The New York-New Haven railroad –Phrasal verbs or other fixed phrases work out, make up, because of
15
Morphology - Stemming collapse sit, sits, and sat to one token Ambiguity: lying Extensive empirical research shows no help for query based IR –Stemming can cost a lot of information operating system vs operating a tractor –English has a very limited morphology –Interactive IR, or more context may allow stemming to be more useful
16
Sentences 90% of periods are sentence boundaries The marks : ; and -- may also effectively bound sentences Nested sentences: “You remind me,” she remarked, “of your mother.” Ending quote after punctuation
17
Sentence Boundary Heuristic Place temporary boundaries after . ? ! ; : -- Move them to after following double quotes Disqualify a period boundary when –it is preceded by a known abbreviation that doesn’t normally end a sentence (like Prof.) –it is preceded by a known abbreviation and it isn’t followed by an upper case word Disqualify ? or ! boundary when –it is followed by a closing quote then a lower case letter or known name
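The heuristic above can be sketched on whitespace-separated tokens. This is a simplified reading, not the text's implementation: the abbreviation lists are hypothetical, the "known name" test is omitted, and a closing quote is assumed to be attached to the token it follows.

```python
from typing import List

# Hypothetical abbreviation lists; a real system would use much larger ones.
NEVER_FINAL = {"Prof.", "Dr.", "Mr."}         # do not normally end a sentence
SOMETIMES_FINAL = {"etc.", "Calif.", "Jr."}   # may or may not end a sentence
CLOSING_QUOTES = '"”'


def split_sentences(text: str) -> List[str]:
    """Split text into sentences using the boundary heuristic on tokens."""
    tokens = text.split()
    sentences: List[str] = []
    current: List[str] = []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        quoted = tok[-1] in CLOSING_QUOTES
        # Looking past the closing quote puts the boundary after the quote.
        core = tok.rstrip(CLOSING_QUOTES)
        if not (core.endswith((".", "?", "!", ";", ":")) or core == "--"):
            continue  # no temporary boundary here
        if core.endswith("."):
            if core in NEVER_FINAL:
                continue
            if core in SOMETIMES_FINAL and not nxt[:1].isupper():
                continue
        if core.endswith(("?", "!")) and quoted and nxt[:1].islower():
            continue
        sentences.append(" ".join(current))
        current = []
    if current:  # flush trailing text that has no final boundary
        sentences.append(" ".join(current))
    return sentences


text = 'Prof. Smith lives in Calif. now. "Why?" she asked. He left.'
print(split_sentences(text))
```

On the sample text, the periods after Prof. and Calif. and the question mark inside the quotation are all disqualified, leaving three sentences.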
18
Sentence Boundary (cont.) Statistical methods get 98-99% accuracy –Used parts of speech of words just before and after potential boundaries –Is largely language independent Maximum Entropy approach got 99.25% accuracy
19
Section III: Marked-up Data Can mark just about anything –Sentence or paragraph boundaries –full syntactic structure –Parts of speech (most common) There are many methods –Ad hoc angle brackets, slashes or underlines –SGML (HTML is an example of SGML) –XML (simplified subset of SGML)
20
Really Quick SGML Introduction Document Type Definition (DTD) –a grammar for the structure of the document One or more recursively nested elements –text set off with starting and ending tags (for example, around a paragraph or a sentence) –character and entity references &reference;
21
Nature of Tag Sets Detailed parts of speech differentiation –e.g., differentiate comparative and superlative forms of adjectives Attributes –Title words –Foreign words Contractions may get multiple (combined) tags Punctuation may get tags
22
SHOW THE TAGS FROM THE TEXT
23
Tag Set Design –Types of part of speech classification Semantic (notional) grounds Syntactic distributional grounds morphological grounds –Would like to pick the classification that is the best predictor of the parts of speech of nearby words –This often doesn’t happen Fulton/NP-TL County/NN-TL Purchasing/VBG Department/NN
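The tagged example on this slide uses the slash convention, where each token is written as word/tag. A minimal sketch of reading that format, assuming tokens are separated by whitespace and the tag follows the last slash; `parse_tagged` is an illustrative name.

```python
from typing import List, Tuple


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split word/tag tokens into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the last slash so that words containing "/" stay whole.
        word, sep, tag = item.rpartition("/")
        if not sep:  # no slash at all: treat the item as an untagged word
            word, tag = item, ""
        pairs.append((word, tag))
    return pairs


tagged = "Fulton/NP-TL County/NN-TL Purchasing/VBG Department/NN"
for word, tag in parse_tagged(tagged):
    print(word, tag)
```

Here "Fulton" and "County" carry the -TL (title word) attribute while "Purchasing" and "Department" do not, which is the inconsistency the slide points to.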
24
Finishing touches Errors in these steps will lead to errors in later steps –Removing Junk –Upper-lower case –Finding words –Finding sentences –Automatic or even manual tagging
25
More finishing touches How serious will these errors be? –Will 99% accuracy in these things greatly hurt later processes? Is the WWW a good corpus? –Lots of small documents by different authors –Lots of ‘junk’ content –Very few copy-editors, spell checkers, grammar checkers