1 CSCE 590 Web Scraping – Information Retrieval
Topics: Information Retrieval, Named Entities
Readings: Speech and Language Processing (SLP) by Jurafsky and Martin, Chapter 22
March 2, 2017

2 Information Retrieval
Information retrieval – knowledge extraction from natural language
Reading: Speech and Language Processing, Chapter 22 (available as an online book)

3 Information extraction
Information extraction – turns unstructured information buried in texts into structured data
Extract proper nouns – "named entity recognition"
Reference resolution – named entity mentions, pronoun references
Relation detection and classification
Event detection and classification
Temporal analysis
Template filling
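As a concrete starting point for the first step, named entity recognition, here is a minimal sketch using the spaCy library (assuming the en_core_web_sm model has been installed); the chapter itself is tool-agnostic, so this is only one possible implementation:

    import spacy

    # Assumes the model has been installed: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Citing high fuel prices, United Airlines said Friday it has "
              "increased fares by $6 per round trip.")

    # Each recognized entity carries its surface text, type label, and character offsets
    for ent in doc.ents:
        print(ent.text, ent.label_, ent.start_char, ent.end_char)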

4 Template Filling
Example template for "airfare raise"

5 Figure 22.1 List of Named Entity Types

6 Figure 22.2 Examples of Named Entity Types

7 Figure 22.3

8 Figure 22.4 Categorical Ambiguity

9 Figure 22.5 Chunk Parser for Named Entities
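As a rough illustration of chunk-based NER (a sketch, not the book's exact grammar), NLTK ships a pre-trained named entity chunker that runs over POS-tagged tokens; this assumes the required NLTK data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) have been downloaded:

    import nltk

    sentence = "American Airlines, a unit of AMR Corp., immediately matched the move."

    # Tokenize, POS-tag, then chunk; named entities come back as labeled subtrees
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)

    for subtree in tree:
        if hasattr(subtree, "label"):  # entity chunks are Tree objects, plain tokens are tuples
            entity = " ".join(word for word, tag in subtree.leaves())
            print(subtree.label(), entity)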

10 Figure 22.6 Features used in Training NER
Gazetteers – lists of place names
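A gazetteer feature is essentially a membership test against a curated list of names; a minimal sketch (the file name places.txt is hypothetical):

    # Hypothetical gazetteer file: one place name per line
    with open("places.txt", encoding="utf-8") as f:
        GAZETTEER = {line.strip().lower() for line in f if line.strip()}

    def in_gazetteer(token):
        """Binary feature: is this token a known place name?"""
        return token.lower() in GAZETTEER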

11 Figure 22.7 Selected Shape Features
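Shape features map each character class to a symbol (upper case to X, lower case to x, digits to d, punctuation kept), with a "short shape" that collapses repeated symbols; a sketch consistent with the kinds of shape features listed in the figure:

    import re

    def word_shape(token):
        """Full shape, e.g. 'DC10-30' -> 'XXdd-dd'."""
        shape = re.sub(r"[A-Z]", "X", token)
        shape = re.sub(r"[a-z]", "x", shape)
        return re.sub(r"\d", "d", shape)

    def short_word_shape(token):
        """Collapsed shape, e.g. 'DC10-30' -> 'Xd-d'."""
        return re.sub(r"(.)\1+", r"\1", word_shape(token))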

12 Figure 22.8 Feature encoding for NER

13 Figure 22.9 NER as sequence labeling
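Casting NER as sequence labeling means assigning every token a BIO tag (B-TYPE opens an entity, I-TYPE continues it, O is outside any entity); a small sketch of the encoding and of recovering entity spans from it:

    tokens = ["United", "Airlines", "said", "Friday", "it", "increased", "fares"]
    labels = ["B-ORG",  "I-ORG",    "O",    "B-TIME", "O",  "O",         "O"]

    # Recover entity spans from a BIO-labeled sequence
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            current = [lab[2:], [tok]]
            entities.append(current)
        elif lab.startswith("I-") and current is not None:
            current[1].append(tok)
        else:
            current = None

    print([(etype, " ".join(words)) for etype, words in entities])
    # [('ORG', 'United Airlines'), ('TIME', 'Friday')]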

14 Figure 22.10 Statistical Seq. Labeling

15 Evaluation of Named Entity Rec. Sys.
Recall the terms from information retrieval:
Recall = # correctly labeled / total # that should be labeled
Precision = # correctly labeled / total # labeled
F-measure, where β weights the preference: β = 1 balanced, β > 1 favors recall, β < 1 favors precision
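A small sketch of these measures, matching the definitions above; F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), and the counts used here are purely illustrative:

    def f_measure(precision, recall, beta=1.0):
        """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta=1 is the balanced F1."""
        if precision == 0 and recall == 0:
            return 0.0
        return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

    # Illustrative counts: 80 entities labeled, 70 of them correct, 100 should have been labeled
    precision = 70 / 80   # correctly labeled / total labeled
    recall    = 70 / 100  # correctly labeled / total that should be labeled
    print(f_measure(precision, recall))            # balanced F1
    print(f_measure(precision, recall, beta=2.0))  # favors recall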

16 Relation Detection and classification
Consider the sample text:
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PERSON Tim Wagner] said. [ORG United Airlines], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
After identifying the named entities, what else can we extract? Relations.

17 Figure 22.11 Example relations

18 Figure 22.12 Example Extraction

19 Figure 22.13 Supervised Learning Approaches to Relation Analysis
Algorithm: a two-step process
1. Identify whether a pair of named entities is related
2. Train a classifier to label the relation

20 Factors used in Classifying
Features of the named entities:
  Named entity types of the two arguments
  Concatenation of the two entity types
  Headwords of the arguments
  Bag-of-words from each of the arguments
Words in the text:
  Bag-of-words and bag-of-bigrams
  Stemmed versions
  Distance between the named entities (in words / in intervening named entities)
Syntactic structure:
  Parse-related structures
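A hedged sketch of how a few of these features might feed a supervised relation classifier, using scikit-learn's DictVectorizer and LogisticRegression; the feature set is only a fragment of the list above, and the training pairs and relation labels are illustrative:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def pair_features(e1_type, e2_type, e1_head, e2_head, words_between):
        """Encode a few of the features listed above as a dict."""
        return {
            "type1": e1_type,
            "type2": e2_type,
            "type_concat": e1_type + "+" + e2_type,
            "head1": e1_head.lower(),
            "head2": e2_head.lower(),
            "distance": len(words_between),
            **{"bow=" + w.lower(): 1 for w in words_between},
        }

    # Illustrative training pairs: (feature dict, relation label)
    train = [
        (pair_features("ORG", "ORG", "Airlines", "Corp.", ["a", "unit", "of"]), "part-of"),
        (pair_features("PER", "ORG", "Wagner", "Airlines", ["spokesman", "for"]), "employed-by"),
    ]

    vec = DictVectorizer()
    X = vec.fit_transform([features for features, _ in train])
    y = [label for _, label in train]
    clf = LogisticRegression(max_iter=1000).fit(X, y)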

21 Figure 22.14

22 Figure 22.15 Sample features Extracted

23 Figure 22.16 Bootstrapping Relation Extraction

24 Semantic Drift
Note: it will be difficult (if not impossible) to get enough annotated material for training
Accuracy of the process is heavily dependent on the initial seeds
Semantic drift – occurs when erroneous patterns (seeds) lead to the introduction of erroneous tuples
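A schematic, self-contained sketch of one bootstrapping iteration (seed tuples find patterns, patterns extract new tuples); the toy corpus, the relation, and the string-based pattern matching are all illustrative, and an erroneous pattern admitted at this step is exactly what would start semantic drift:

    # Toy corpus and seed tuple for a "hub-of" style relation (all data illustrative)
    corpus = [
        "United Airlines has a hub in Chicago .",
        "Delta Air Lines has a hub in Atlanta .",
        "American Airlines operates a hub in Dallas .",
    ]
    seeds = {("United Airlines", "Chicago")}

    def find_patterns(tuples, sentences):
        """Turn each occurrence of a known pair into the string pattern between them."""
        patterns = set()
        for x, y in tuples:
            for s in sentences:
                if x in s and y in s:
                    between = s.split(x, 1)[1].split(y, 1)[0].strip()
                    patterns.add(between)
        return patterns

    def apply_patterns(patterns, sentences):
        """Extract new pairs whose context matches a known pattern (very loose matching)."""
        found = set()
        for s in sentences:
            for p in patterns:
                if p and p in s:
                    left, right = s.split(p, 1)
                    found.add((left.strip(), right.strip().rstrip(" .")))
        return found

    # One bootstrapping iteration
    patterns = find_patterns(seeds, corpus)
    new_tuples = apply_patterns(patterns, corpus)
    print(patterns, new_tuples)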

25 Figure 22.17 Temporal and Event
Absolute temporal expressions
Relative temporal expressions
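A hedged sketch distinguishing a few absolute temporal expressions (explicit calendar dates) from relative ones (expressions interpreted against a reference time), using simple regular expressions; real temporal taggers use much richer grammars:

    import re

    # Absolute: explicit calendar dates such as "March 2, 2017" or "2017-03-02"
    ABSOLUTE = re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|September|"
        r"October|November|December)\s+\d{1,2},\s+\d{4}\b|\b\d{4}-\d{2}-\d{2}\b"
    )

    # Relative: expressions interpreted against the document's reference time
    RELATIVE = re.compile(
        r"\b(?:yesterday|today|tomorrow|last\s+\w+|next\s+\w+|\w+\s+ago)\b",
        re.IGNORECASE,
    )

    text = "The increase took effect Thursday; fares rose on March 2, 2017 and again last week."
    print(ABSOLUTE.findall(text))
    print(RELATIVE.findall(text))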

26 Figure 22.18

27 Figure 22.19

28 Figure 22.20

29 Figure 22.21

30 Figure 22.22

31 Figure 22.23

32 Figure 22.24

33 Figure 22.24 (continued)

34 Figure 22.25

35

36 Figure 22.26

37 Figure 22.27

38 Figure 22.28

39 Figure 22.29

40 Figure 22.30

41 Figure 22.31

