CSCE 590 Web Scraping – Information Retrieval Topics
Information Retrieval: Named Entities
Readings: SLP by Jurafsky and Martin, Chap. 22
March 2, 2017
Information Retrieval
Information retrieval – knowledge extraction from natural language
Speech and Language Processing, Chap. 22: http://www.cs.colorado.edu/~martin/slp.html
Online IR book: http://nlp.stanford.edu/IR-book/
Information Extraction
Information extraction turns unstructured information buried in texts into structured data:
- Named entity recognition – extracting proper nouns
- Reference resolution – resolving named entity mentions and pronoun references
- Relation detection and classification
- Event detection and classification
- Temporal analysis
- Template filling
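To make the first of these tasks concrete, here is a minimal NER sketch using spaCy (our library choice for illustration, not from the chapter; assumes the en_core_web_sm model has been downloaded):

```python
import spacy

# One-time setup:  python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("Citing high fuel prices, United Airlines said Friday it has "
        "increased fares by $6 per round trip.")

# The pipeline tags spans of the text as named entities
for ent in nlp(text).ents:
    print(ent.text, ent.label_)   # e.g. "United Airlines ORG", "$6 MONEY"
```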
Template Filling Example template for “airfare raise”
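As a sketch of what a filled template might look like, here is the "airfare raise" scenario represented as a plain Python dict; the slot names are our own illustration and the values come from the sample airline text used later in these slides:

```python
# A filled "airfare raise" template as a dict (slot names illustrative)
fare_raise = {
    "LEAD_AIRLINE":   "United Airlines",
    "AMOUNT":         "$6",
    "EFFECTIVE_DATE": "Thursday",
    "FOLLOWER":       "American Airlines",
}
```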
Figure 22.1 List of Named Entity Types
Figure 22.2 Examples of Named Entity Types
Figure 22.3
Figure 22.4 Categorical Ambiguity
Figure 22.5 Chunk Parser for Named Entities
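NLTK ships a pretrained NE chunker that works in exactly this pipeline order: tokenize, POS-tag, then chunk. A minimal sketch (requires the NLTK data packages noted in the comment):

```python
import nltk

# One-time setup:
# nltk.download(["punkt", "averaged_perceptron_tagger",
#                "maxent_ne_chunker", "words"])
sentence = "American Airlines, a unit of AMR Corp., matched the move."

tokens = nltk.word_tokenize(sentence)   # tokenize
tagged = nltk.pos_tag(tokens)           # part-of-speech tag
tree = nltk.ne_chunk(tagged)            # chunk into named entity subtrees

# NE chunks appear as subtrees labeled PERSON, ORGANIZATION, GPE, ...
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(),
              " ".join(word for word, tag in subtree.leaves()))
```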
Figure 22.6 Features Used in Training NER
Gazetteers – lists of place names, e.g. www.geonames.com, www.census.gov
Figure 22.7 Selected Shape Features
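The word-shape feature maps each character class to a placeholder symbol; a short variant collapses repeated placeholders. A minimal sketch of both:

```python
import re

def word_shape(word, short=False):
    """SLP-style shape feature: uppercase -> X, lowercase -> x,
    digits -> d; other characters kept as-is."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    if short:
        shape = re.sub(r"(.)\1+", r"\1", shape)  # collapse repeated runs
    return shape

print(word_shape("DC10-30"))               # XXdd-dd
print(word_shape("DC10-30", short=True))   # Xd-d
```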
Figure 22.8 Feature encoding for NER
Figure 22.9 NER as sequence labeling
Figure 22.10 Statistical Seq. Labeling
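Sequence labeling for NER typically uses a BIO encoding: B- opens an entity, I- continues it, O marks non-entity tokens. A small sketch that hand-labels a fragment of the sample sentence and decodes the tags back into entity spans (tokenization simplified; "$6" kept as one token):

```python
# Hand-labeled BIO tags for a fragment of the sample airline sentence
tokens = ["United", "Airlines", "said", "Friday", "it", "increased",
          "fares", "by", "$6"]
tags   = ["B-ORG",  "I-ORG",    "O",    "B-TIME", "O",  "O",
          "O",      "O",  "B-MONEY"]

def bio_to_spans(tokens, tags):
    """Decode a BIO tag sequence back into (entity_text, type) spans."""
    spans, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        spans.append((" ".join(current), etype))
    return spans

print(bio_to_spans(tokens, tags))
# [('United Airlines', 'ORG'), ('Friday', 'TIME'), ('$6', 'MONEY')]
```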
Evaluation of Named Entity Recognition Systems
Recall the terms from information retrieval:
- Recall = # correctly labeled / total # that should have been labeled
- Precision = # correctly labeled / total # labeled
- F-measure combines the two, with β weighting the preference:
  F_β = (β² + 1) · P · R / (β² · P + R)
  β = 1 balanced; β > 1 favors recall; β < 1 favors precision
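A direct translation of these definitions into code (the counts in the example are made up for illustration):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta=1 balances the two; beta>1 favors recall, beta<1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Hypothetical system: labels 90 entities, 80 correctly, against 100 gold entities
precision = 80 / 90    # correctly labeled / total labeled
recall = 80 / 100      # correctly labeled / total that should be labeled
print(round(f_measure(precision, recall), 3))           # 0.842 (balanced F1)
print(round(f_measure(precision, recall, beta=2), 3))   # recall-weighted
```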
Relation Detection and Classification
Consider the sample text:
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PERSON Tim Wagner] said. [ORG United Airlines], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
After identifying the named entities, what else can we extract? Relations.
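Relations over these entities are naturally represented as (subject, relation, object) triples; the relation names below are our own illustration:

```python
# Relations recoverable from the sample text, as triples
relations = [
    ("American Airlines", "unit-of",       "AMR Corp."),
    ("United Airlines",   "unit-of",       "UAL Corp."),
    ("Tim Wagner",        "spokesman-for", "American Airlines"),
]
for subj, rel, obj in relations:
    print(f"{rel}({subj}, {obj})")
```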
Figure 22.11 Example relations
Figure 22.12 Example Extraction
Figure 22.13 Supervised Learning Approaches to Relation Analysis
The algorithm is a two-step process:
1. A classifier is trained to decide whether a pair of named entities is related
2. A second classifier is trained to label the relation type
Factors Used in Classifying
- Features of the named entities:
  - Named entity types of the two arguments
  - Concatenation of the two entity types
  - Headwords of the arguments
  - Bag-of-words from each of the arguments
- Words in the text:
  - Bag-of-words and bag-of-bigrams
  - Stemmed versions
  - Distance between the named entities (in words / in intervening named entities)
- Syntactic structure:
  - Parse-related structures
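A sketch of how such features might be encoded and fed to an off-the-shelf classifier, using scikit-learn's DictVectorizer; the feature set and toy training pairs are ours, chosen to mirror the list above:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def pair_features(type1, type2, head1, head2, between_words):
    """Encode one candidate entity pair using features from the list above."""
    feats = {
        "type1": type1, "type2": type2,
        "type_concat": type1 + "-" + type2,   # concatenated entity types
        "head1": head1, "head2": head2,       # argument headwords
        "word_distance": len(between_words),  # distance between the entities
    }
    for w in between_words:                   # bag-of-words in between
        feats["between=" + w] = True
    return feats

# Tiny toy training set: does the pair stand in a "unit-of" relation?
X_dicts = [
    pair_features("ORG", "ORG", "Airlines", "Corp.", ["a", "unit", "of"]),
    pair_features("ORG", "LOC", "Airlines", "Chicago", ["to"]),
]
y = ["unit-of", "no-relation"]

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_dicts), y)

test = pair_features("ORG", "ORG", "United", "UAL", ["a", "unit", "of"])
print(clf.predict(vec.transform([test])))   # expect: ['unit-of']
```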
Figure 22.14
Figure 22.15 Sample features Extracted
Figure 22.16 Bootstrapping Relation Extraction
Semantic Drift
Note: it is difficult (often impossible) to get enough annotated material for training, which is what motivates bootstrapping. The accuracy of the process is heavily dependent on the initial seeds.
Semantic drift occurs when erroneous patterns (or seeds) lead to the introduction of erroneous tuples.
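A schematic of the bootstrapping loop, with the confidence threshold acting as the main guard against drift. The helper functions extract_patterns, match_patterns, and confidence are hypothetical placeholders for corpus-specific logic:

```python
def bootstrap(seed_tuples, corpus, extract_patterns, match_patterns,
              confidence, threshold=0.8, rounds=5):
    """Pattern-based bootstrapping for relation extraction (sketch).

    Starting from a handful of seed tuples, alternately harvest patterns
    from sentences mentioning known tuples and apply those patterns to
    find new tuples. Admitting only high-confidence additions is the
    defense against semantic drift."""
    tuples = set(seed_tuples)
    for _ in range(rounds):
        patterns = extract_patterns(tuples, corpus)
        candidates = match_patterns(patterns, corpus)
        new = {t for t in candidates
               if t not in tuples and confidence(t, patterns) >= threshold}
        if not new:   # fixed point: nothing confident left to add
            break
        tuples |= new
    return tuples
```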
Figure 22.17 Temporal and Event Processing
- Absolute temporal expressions (e.g., "March 2, 2017")
- Relative temporal expressions (e.g., "last Thursday")
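Two toy regular-expression recognizers, one per class; these are illustrative only, real systems use far richer temporal grammars:

```python
import re

# Absolute: a fully anchored calendar date, e.g. "March 2, 2017"
ABSOLUTE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b")

# Relative: interpreted against the document date, e.g. "last Thursday"
RELATIVE = re.compile(
    r"\b(?:yesterday|today|tomorrow|last\s+\w+|next\s+\w+)\b", re.IGNORECASE)

text = "The increase took effect March 2, 2017, not last Thursday."
print(ABSOLUTE.search(text).group(0))   # March 2, 2017
print(RELATIVE.search(text).group(0))   # last Thursday
```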
Figures 22.18 – 22.31 (see SLP Chap. 22)