1 CSCE 590 Web Scraping – Information Retrieval
Topics: Information Retrieval, Named Entities
Readings: Speech and Language Processing (SLP) by Jurafsky and Martin, Chapter 22
March 2, 2017

2 Information Retrieval
Information retrieval – knowledge extraction from natural language
Reading: Speech and Language Processing, Chapter 22 (available as an online book)

3 Information extraction
Information extraction – turns unstructured information buried in texts into structured data
Extract proper nouns – "named entity recognition"
Reference resolution – named entity mentions, pronoun references
Relation detection and classification
Event detection and classification
Temporal analysis
Template filling
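As a concrete starting point for the first step, named entity recognition, here is a minimal sketch using the spaCy library (assuming the en_core_web_sm model has been installed); the chapter itself is tool-agnostic, so this is only one possible implementation:

    import spacy

    # Assumes the model has been installed: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Citing high fuel prices, United Airlines said Friday it has "
              "increased fares by $6 per round trip.")

    # Each recognized entity carries its surface text, type label, and character offsets
    for ent in doc.ents:
        print(ent.text, ent.label_, ent.start_char, ent.end_char)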

4 Template Filling
Example template for "airfare raise"

5 Figure 22.1 List of Named Entity Types

6 Figure 22.2 Examples of Named Entity Types

7 Figure 22.3

8 Figure 22.4 Categorical Ambiguity

9 Figure 22.5 Chunk Parser for Named Entities
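As a rough illustration of chunk-based NER (a sketch, not the book's exact grammar), NLTK ships a pre-trained named entity chunker that runs over POS-tagged tokens; this assumes the required NLTK data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) have been downloaded:

    import nltk

    sentence = "American Airlines, a unit of AMR Corp., immediately matched the move."

    # Tokenize, POS-tag, then chunk; named entities come back as labeled subtrees
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)

    for subtree in tree:
        if hasattr(subtree, "label"):  # entity chunks are Tree objects, plain tokens are tuples
            entity = " ".join(word for word, tag in subtree.leaves())
            print(subtree.label(), entity)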

10 Figure 22.6 Features used in Training NER
Gazetteers – lists of place names
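A gazetteer feature is essentially a membership test against a curated list of names; a minimal sketch (the file name places.txt is hypothetical):

    # Hypothetical gazetteer file: one place name per line
    with open("places.txt", encoding="utf-8") as f:
        GAZETTEER = {line.strip().lower() for line in f if line.strip()}

    def in_gazetteer(token):
        """Binary feature: is this token a known place name?"""
        return token.lower() in GAZETTEER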

11 Figure 22.7 Selected Shape Features
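Shape features map each character class to a symbol (upper case to X, lower case to x, digits to d, punctuation kept), with a "short shape" that collapses repeated symbols; a sketch consistent with the kinds of shape features listed in the figure:

    import re

    def word_shape(token):
        """Full shape, e.g. 'DC10-30' -> 'XXdd-dd'."""
        shape = re.sub(r"[A-Z]", "X", token)
        shape = re.sub(r"[a-z]", "x", shape)
        return re.sub(r"\d", "d", shape)

    def short_word_shape(token):
        """Collapsed shape, e.g. 'DC10-30' -> 'Xd-d'."""
        return re.sub(r"(.)\1+", r"\1", word_shape(token))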

12 Figure 22.8 Feature encoding for NER

13 Figure 22.9 NER as sequence labeling
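Casting NER as sequence labeling means assigning every token a BIO tag (B-TYPE opens an entity, I-TYPE continues it, O is outside any entity); a small sketch of the encoding and of recovering entity spans from it:

    tokens = ["United", "Airlines", "said", "Friday", "it", "increased", "fares"]
    labels = ["B-ORG",  "I-ORG",    "O",    "B-TIME", "O",  "O",         "O"]

    # Recover entity spans from a BIO-labeled sequence
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            current = [lab[2:], [tok]]
            entities.append(current)
        elif lab.startswith("I-") and current is not None:
            current[1].append(tok)
        else:
            current = None

    print([(etype, " ".join(words)) for etype, words in entities])
    # [('ORG', 'United Airlines'), ('TIME', 'Friday')]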

14 Figure 22.10 Statistical Seq. Labeling

15 Evaluation of Named Entity Rec. Sys.
Recall the terms from information retrieval:
Recall = # correctly labeled / total # that should be labeled
Precision = # correctly labeled / total # labeled
F-measure, where β weights the preference: β = 1 balanced, β > 1 favors recall, β < 1 favors precision
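A small sketch of these measures, matching the definitions above; F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), and the counts used here are purely illustrative:

    def f_measure(precision, recall, beta=1.0):
        """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta=1 is the balanced F1."""
        if precision == 0 and recall == 0:
            return 0.0
        return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

    # Illustrative counts: 80 entities labeled, 70 of them correct, 100 should have been labeled
    precision = 70 / 80   # correctly labeled / total labeled
    recall    = 70 / 100  # correctly labeled / total that should be labeled
    print(f_measure(precision, recall))            # balanced F1
    print(f_measure(precision, recall, beta=2.0))  # favors recall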

16 Relation Detection and classification
Consider the sample text:
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PERSON Tim Wagner] said. [ORG United Airlines], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
After identifying the named entities, what else can we extract? Relations.

17 Figure 22.11 Example relations

18 Figure 22.12 Example Extraction

19 Figure 22.13 Supervised Learning Approaches to Relation Analysis
Algorithm: a two-step process
1. Identify whether a pair of named entities is related
2. Train a classifier to label the relation

20 Factors used in Classifying
Features of the named entities:
  Named entity types of the two arguments
  Concatenation of the two entity types
  Headwords of the arguments
  Bag-of-words from each of the arguments
Words in the text:
  Bag-of-words and bag-of-bigrams
  Stemmed versions
  Distance between the named entities (in words / in intervening named entities)
Syntactic structure:
  Parse-related structures
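A hedged sketch of how a few of these features might feed a supervised relation classifier, using scikit-learn's DictVectorizer and LogisticRegression; the feature set is only a fragment of the list above, and the training pairs and relation labels are illustrative:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def pair_features(e1_type, e2_type, e1_head, e2_head, words_between):
        """Encode a few of the features listed above as a dict."""
        return {
            "type1": e1_type,
            "type2": e2_type,
            "type_concat": e1_type + "+" + e2_type,
            "head1": e1_head.lower(),
            "head2": e2_head.lower(),
            "distance": len(words_between),
            **{"bow=" + w.lower(): 1 for w in words_between},
        }

    # Illustrative training pairs: (feature dict, relation label)
    train = [
        (pair_features("ORG", "ORG", "Airlines", "Corp.", ["a", "unit", "of"]), "part-of"),
        (pair_features("PER", "ORG", "Wagner", "Airlines", ["spokesman", "for"]), "employed-by"),
    ]

    vec = DictVectorizer()
    X = vec.fit_transform([features for features, _ in train])
    y = [label for _, label in train]
    clf = LogisticRegression(max_iter=1000).fit(X, y)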

21 Figure 22.14

22 Figure 22.15 Sample features Extracted

23 Figure 22.16 Bootstrapping Relation Extraction

24 Semantic Drift
Note: it will be difficult (if not impossible) to get enough annotated material for training
Accuracy of the process is heavily dependent on the initial seeds
Semantic drift – occurs when erroneous patterns (seeds) lead to the introduction of erroneous tuples
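A schematic, self-contained sketch of one bootstrapping iteration (seed tuples find patterns, patterns extract new tuples); the toy corpus, the relation, and the string-based pattern matching are all illustrative, and an erroneous pattern admitted at this step is exactly what would start semantic drift:

    # Toy corpus and seed tuple for a "hub-of" style relation (all data illustrative)
    corpus = [
        "United Airlines has a hub in Chicago .",
        "Delta Air Lines has a hub in Atlanta .",
        "American Airlines operates a hub in Dallas .",
    ]
    seeds = {("United Airlines", "Chicago")}

    def find_patterns(tuples, sentences):
        """Turn each occurrence of a known pair into the string pattern between them."""
        patterns = set()
        for x, y in tuples:
            for s in sentences:
                if x in s and y in s:
                    between = s.split(x, 1)[1].split(y, 1)[0].strip()
                    patterns.add(between)
        return patterns

    def apply_patterns(patterns, sentences):
        """Extract new pairs whose context matches a known pattern (very loose matching)."""
        found = set()
        for s in sentences:
            for p in patterns:
                if p and p in s:
                    left, right = s.split(p, 1)
                    found.add((left.strip(), right.strip().rstrip(" .")))
        return found

    # One bootstrapping iteration
    patterns = find_patterns(seeds, corpus)
    new_tuples = apply_patterns(patterns, corpus)
    print(patterns, new_tuples)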

25 Figure 22.17 Temporal and Event
Absolute temporal expressions
Relative temporal expressions
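A hedged sketch distinguishing a few absolute temporal expressions (explicit calendar dates) from relative ones (expressions interpreted against a reference time), using simple regular expressions; real temporal taggers use much richer grammars:

    import re

    # Absolute: explicit calendar dates such as "March 2, 2017" or "2017-03-02"
    ABSOLUTE = re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|September|"
        r"October|November|December)\s+\d{1,2},\s+\d{4}\b|\b\d{4}-\d{2}-\d{2}\b"
    )

    # Relative: expressions interpreted against the document's reference time
    RELATIVE = re.compile(
        r"\b(?:yesterday|today|tomorrow|last\s+\w+|next\s+\w+|\w+\s+ago)\b",
        re.IGNORECASE,
    )

    text = "The increase took effect Thursday; fares rose on March 2, 2017 and again last week."
    print(ABSOLUTE.findall(text))
    print(RELATIVE.findall(text))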

26 Figure 22.18

27 Figure 22.19

28 Figure 22.20

29 Figure 22.21

30 Figure 22.22

31 Figure 22.23

32 Figure 22.24

33 Figure 22.24 (continued)

34 Figure 22.25

35

36 Figure 22.26

37 Figure 22.27

38 Figure 22.28

39 Figure 22.29

40 Figure 22.30

41 Figure 22.31

