
Arnd Christian König Venkatesh Ganti Rares Vernica Microsoft Research Entity Categorization Over Large Document Collections.


1 Arnd Christian König Venkatesh Ganti Rares Vernica Microsoft Research Entity Categorization Over Large Document Collections

2-3 Relationship Extraction from Text
Task: Given a corpus of documents and entity-recognition logic, extract structured relations between entities from text.
Context: "… Donald Knuth works in research …" → is-a-researcher(Donald_Knuth)
Context: "… Yao Ming plays for the Houston Rockets …" → works-for(Yao_Ming, Houston_Rockets)
Motivation: going from unstructured data to structured data; applications in search, business intelligence, etc.
Focus: targeted extraction (vs. open relationship extraction), over large document collections (> 10^7 documents).

4 Using Aggregate Context
Single-context extraction: extraction logic "[E] works … research" yields ([Entity], is-a-researcher).
Multi-context extraction: aggregate context features such as "[Entity] works in research", "[Entity] published", "[Entity]'s paper", and "[Entity] gave a talk" feed a multi-feature relation extractor, again yielding ([Entity], is-a-researcher).
We track an entity across contexts, allowing us to combine less predictive features.

5 Using Co-occurrence Features
Leverage co-occurrence of entity classes (e.g., directors likely co-occur with actors) for extraction.
Example: extraction of the is-a-director relation. Given an actor list (Alan Alda, Richard Gere, Julia Roberts, …), the context "… Julia Roberts starred in a Robert Altman film in 1994 …" yields the aggregate context feature "Robert_Altman co-occurs with actor name".
Co-occurrence features can be between entities of different classes or entities of one class. Combination with text features is possible, e.g., "[Entity] plays for [Team_Name]".
Two questions: (a) What difference do the aggregate contexts make for extraction accuracy? (b) This means keeping track of contexts across documents; can we make this efficient?

6-8 Relationship Extraction from Text
Task: extraction of structured relations from text.
Context: "… Donald Knuth works in research …" → is-a-researcher(Donald_Knuth)
Context: "… Yao Ming plays for the Houston Rockets …" → works-for(Yao_Ming, Houston_Rockets)
Applications: bridging the chasm from unstructured to structured data; advanced search; business intelligence.
Focus: targeted extraction (vs. open relation extraction); unary relations (vs. n-ary relations).

9 Aggregate Features

10 Using Aggregate Context
Each context is not in itself sufficient to infer the category researcher.
Single-context extraction: the rule "[E] works in research" yields ([Entity], is-a-researcher).
Multi-context extraction: contexts such as "[Entity] works in research", "[Entity] published", "[Entity]'s paper", and "[Entity] gave a talk at" feed a multi-feature classifier, yielding ([Entity], is-a-researcher).
We can track an entity across pages, allowing us to combine less predictive features.
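The multi-context idea above can be sketched as follows. This is an illustrative reconstruction only: the weak patterns and the threshold are invented for the example and are not the talk's actual rule set.

```python
# Weak context patterns: none is conclusive alone, but several together
# suggest is-a-researcher. (Illustrative patterns, not the paper's rules.)
WEAK_PATTERNS = ["works in research", "published", "paper", "gave a talk"]

def aggregate_label(contexts_by_entity, threshold=3):
    """Label an entity is-a-researcher if enough distinct weak patterns
    fire across all of its contexts (aggregate context features)."""
    labels = {}
    for entity, contexts in contexts_by_entity.items():
        fired = {p for ctx in contexts for p in WEAK_PATTERNS if p in ctx}
        labels[entity] = len(fired) >= threshold
    return labels

contexts = {
    "Donald Knuth": ["[E] works in research at Stanford",
                     "[E] published a monograph",
                     "[E] gave a talk on typography"],
    "Yao Ming": ["[E] plays for the Houston Rockets"],
}
print(aggregate_label(contexts))
```

No single Knuth context matches more than one pattern, but the aggregate of his contexts fires three distinct weak features, which is the point of tracking an entity across pages.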

11 Using Co-occurrence
In targeted extraction, we can leverage co-occurrences of entities.
Example: extraction of the is-a-movie relation. Given an actor list (Alan Alda, Richard Gere, Julia Roberts, …), the context "… Julia Roberts starred in Pretty Woman in 1988 …" yields the multi-feature classifier feature "co-occurrence between entity and actor name in context".
Co-occurrence features can be between entities of different classes or entities of one class (e.g., actors). Combination with text features: e.g., "[E] plays for [Team_Name]".
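The co-occurrence feature on this slide can be sketched as below; the actor list and the feature name are illustrative stand-ins, not the paper's data.

```python
# A small stand-in for the slide's actor list.
ACTOR_LIST = {"alan alda", "richard gere", "julia roberts"}

def cooccurrence_features(context, entity):
    """Emit a co-occurrence feature for every actor-list member that
    appears in the same context as the candidate entity."""
    text = context.lower()
    return [("cooccurs-with-actor", actor) for actor in sorted(ACTOR_LIST)
            if actor in text and actor != entity.lower()]

ctx = "Julia Roberts starred in Pretty Woman in 1988"
print(cooccurrence_features(ctx, "Pretty Woman"))
```

The candidate movie entity "Pretty Woman" picks up a co-occurrence feature because "Julia Roberts" from the actor list appears in the same context.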

12 Processing Large Document Collections

13 Architecture
Problem setting: |D| > available memory |M|.
Pipeline: Document Corpus D → Rule-based Extraction → Entity-Relation pairs → Aggregation: COUNT(entity, relation) > Δ.

14 Architecture
Problem setting: |D| > available memory |M|; co-occurrence lists |L| > |M|.
Pipeline: Document Corpus D → Feature Extraction (with Aggregate Feature Extraction over the list corpus L) → Entity-Feature pairs → Classification → Entity-Relation pairs → Aggregation: COUNT(entity, relation) > Δ.
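The final aggregation step, COUNT(entity, relation) > Δ, amounts to thresholding extraction output on support. A minimal sketch (the stream and Δ are invented for illustration):

```python
from collections import Counter

def aggregate_extractions(pairs, delta):
    """Keep only (entity, relation) pairs whose support across the
    corpus exceeds the threshold delta."""
    counts = Counter(pairs)
    return {pair for pair, n in counts.items() if n > delta}

# Three independent supporting contexts vs. one noisy match.
stream = [("Donald Knuth", "is-a-researcher")] * 3 + \
         [("Yao Ming", "is-a-researcher")]
print(aggregate_extractions(stream, delta=2))
```

Thresholding on corpus-wide support is what lets individually unreliable per-document extractions yield a reliable final relation set.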

15 Architecture
In the single-context design, Co-Occurrence Detection runs once per co-occurrence list over the Document Corpus D, duplicating overhead from document scanning, document processing, and entity extraction. The new architecture instead shares Context Feature Extraction and Aggregate Feature Extraction over the list corpus L, producing Entity-Feature pairs for Classification and Aggregation (COUNT(entity, relation) > Δ).

16 New Architecture
Challenges: (1) fast and accurate co-occurrence detection using the synopsis; (2) pruning of redundant output.
Pipeline: Document Corpus D feeds Context Feature Extraction and Co-Occurrence Detection against a synopsis of L (built by List-Member Extraction from the co-occurrence list corpus L), producing Entity-Candidate Context pairs; deleting false positives yields Entity-List pairs, which are aggregated into Entity-Feature pairs for Classification / Rule-based Extraction and a final Aggregation.
Fast identification of candidate matches through two-stage filtering; Bloom filters trade off memory footprint against false-positive rate.
The frequency distribution of entities is very skewed; pruning retains the most frequent entities and list members in memory. Challenge: determining frequencies online. Compact hash synopses of frequencies (CM-Sketch) perform well.
Potentially very large output: duplication via very many co-occurrences (e.g., actor-actor) and via repeated pairs (e.g., entity "George Bush" with feature "President").

17-21 Fast Document Processing
Replace each co-occurrence list L_i with an approximate representation Filter_i. Fast detection of candidate contexts via a separate token filter that detects hit sequences.
Example: for "… Julia Roberts starred in Pretty Woman in 1988 …", a straightforward implementation requires testing all subsequences of length up to the longest member in L_i. Token filters identify the subsequences whose individual tokens occur in L, so we test only {starred, pretty, woman, pretty woman}, reducing overhead.
Each Filter_i, as well as the token filters, is realized as a Bloom filter, allowing the false-positive rate to be traded against the memory footprint.
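The two-stage filtering described above can be sketched with a minimal Bloom filter. This is an illustrative reconstruction, not the paper's implementation: Python's salted `hash` stands in for proper independent hash functions, and the member list is invented.

```python
class BloomFilter:
    """Minimal Bloom filter: no false negatives; bit-array size m and
    hash count k trade memory footprint against false-positive rate."""
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        return [hash((salt, item)) % self.m for salt in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

# Stage 1: a token filter holds the individual tokens of all list members.
# Stage 2: a member filter holds the full member strings.
members = ["julia roberts", "richard gere", "pretty woman"]
token_filter, member_filter = BloomFilter(), BloomFilter()
for member in members:
    member_filter.add(member)
    for tok in member.split():
        token_filter.add(tok)

def candidate_phrases(text, max_len=2):
    """Instead of testing every window up to the longest list member,
    test only sub-sequences whose tokens all pass the token filter."""
    toks = text.lower().split()
    cands = set()
    for i, tok in enumerate(toks):
        if tok not in token_filter:
            continue
        for j in range(i + 1, min(i + 1 + max_len, len(toks) + 1)):
            if toks[j - 1] not in token_filter:
                break
            cands.add(" ".join(toks[i:j]))
    return {c for c in cands if c in member_filter}

print(candidate_phrases("Julia Roberts starred in Pretty Woman in 1988"))
```

Tokens like "starred" and "1988" fail the token filter immediately, so only a handful of short runs are ever checked against the member filter, which is the point of the two-stage design.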

22 Pruning redundant output


24 Pruning
Pruning candidate contexts: retain the most frequent list members in memory; list membership is known, and frequencies can be estimated off-line.
Pruning entity-list pairs: retain the most frequent entities, and the list IDs they co-occur with, in memory. Challenge: dynamically determining entity frequencies. The distributions of entity frequencies and list-member frequencies are very skewed, so a compact hash synopsis of frequencies (CM-Sketch) works well.
Pruning entity-feature pairs: similar to online caching problems, but with very small items and a very large space of entities × features, hence few repetitions. A simple algorithm, Write-On-Full, performs within 10% of the best caching approaches.
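The CM-Sketch mentioned above can be sketched as a minimal Count-Min sketch; the width, depth, and entity stream below are illustrative values, and Python's `hash` stands in for proper pairwise-independent hash functions.

```python
class CountMinSketch:
    """Count-Min sketch: a compact hash synopsis of frequencies.
    Estimates may overcount (hash collisions) but never undercount,
    using O(d * w) counters regardless of how many entities appear."""
    def __init__(self, w=1024, d=4):
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]

    def add(self, item, n=1):
        for r in range(self.d):
            self.rows[r][hash((r, item)) % self.w] += n

    def estimate(self, item):
        return min(self.rows[r][hash((r, item)) % self.w]
                   for r in range(self.d))

# Decide online which entities are frequent enough to keep in memory.
cms = CountMinSketch()
for entity in ["Julia Roberts"] * 50 + ["some rare entity"]:
    cms.add(entity)
print(cms.estimate("Julia Roberts"))   # at least 50: never undercounts
```

Because the sketch never undercounts, retaining every entity whose estimate clears a threshold keeps all truly frequent entities in memory, at the cost of an occasional false keep.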

25 Architecture
Problem setting: |D| > available memory |M|; co-occurrence lists |L| > |M|.
The naive pipeline runs List-Member Detection once per list over the corpus: Document Corpus D → Feature Extraction / Aggregate Feature Extraction (list corpus L) → Entity-Feature pairs → Classification → Entity-Relation pairs → Aggregation: COUNT(entity, relation) > Δ.

26 Architecture
Stream documents D through Feature Extraction and List-Member Detection against a synopsis of L, with verification against the list corpus L (List-Member Extraction builds the synopsis). The resulting edge sets Edges(G_E,F), Edges(G_E,L), and Edges(G_E,C) are aggregated and fed to the classifiers C.

27 Experiments

28 Experimental Evaluation
Task: categorization of entities into professions (actor, writer, painter, etc.).
Document corpus: 3.2 million Wikipedia pages.
Training data generated using Wikipedia lists of famous painters, writers, etc.
Aggregate-context classifier: linear SVM using text n-gram and co-occurrence features (binary).
Single-context classifier: 100K extraction rules (incl. gaps) derived from training data (algorithm of [König and Brill, KDD 2006]).
Co-occurrence list: contains 10% of the entity strings in the training data.
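The aggregate-context classifier above is a linear SVM over binary n-gram and co-occurrence features. As a dependency-free sketch of the same feature setup, here is a tiny perceptron; the vocabulary and training pairs are invented for illustration, and the perceptron merely stands in for the linear SVM.

```python
VOCAB = ["works in research", "published", "paper", "plays for",
         "cooc:actor-name"]   # last feature would come from the list corpus

def featurize(contexts):
    """Binary feature vector over an entity's aggregated contexts:
    1 if the feature fires in any context, else 0."""
    joined = " ".join(contexts).lower()
    return [1 if feat in joined else 0 for feat in VOCAB]

def train_perceptron(data, epochs=10):
    """Tiny linear classifier; stands in for the paper's linear SVM."""
    w, b = [0.0] * len(VOCAB), 0.0
    for _ in range(epochs):
        for x, y in data:                      # y is +1 or -1
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

train = [(featurize(["[E] works in research", "[E] published a paper"]), +1),
         (featurize(["[E] plays for the Rockets"]), -1)]
w, b = train_perceptron(train)

test_x = featurize(["[E] published", "[E]'s paper"])
score = sum(wi * xi for wi, xi in zip(w, test_x)) + b
print(score > 0)
```

The test entity shares none of its contexts verbatim with the training entity, but the binary aggregate features still place it on the researcher side of the learned linear boundary.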

29 Experimental Evaluation: Accuracy

30 Reducing the correlation required for extraction rules trades off recall and precision.

31 Experimental Evaluation: Accuracy


34 Experimental Evaluation: Overhead
Skew in co-occurrence-list member frequency enables efficient pruning; as we scale up |D|, pruning efficiency increases.

35 Experimental Evaluation: Overhead
Main remaining overhead: writing of entity-feature pairs. A simple caching strategy reduces this overhead by an order of magnitude.
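One plausible reading of the Write-On-Full strategy mentioned earlier (the slides do not spell out the algorithm, so this is an assumption-laden sketch): buffer (entity, feature) pairs in a bounded in-memory table, merge repeats, and write everything out only when the table fills.

```python
class WriteOnFullBuffer:
    """Sketch of a Write-On-Full cache (details assumed, not from the
    paper): aggregate (entity, feature) pairs in a bounded table; when
    the table is full, flush everything aggregated so far and restart.
    Repeated pairs that hit the buffer are merged instead of written,
    cutting output volume."""
    def __init__(self, capacity, sink):
        self.capacity, self.sink = capacity, sink
        self.table = {}

    def add(self, entity, feature):
        key = (entity, feature)
        if key not in self.table and len(self.table) >= self.capacity:
            self.flush()
        self.table[key] = self.table.get(key, 0) + 1

    def flush(self):
        for (e, f), n in self.table.items():
            self.sink.append((e, f, n))
        self.table.clear()

out = []
buf = WriteOnFullBuffer(capacity=2, sink=out)
for pair in [("George Bush", "president")] * 3 + [("Knuth", "paper")]:
    buf.add(*pair)
buf.flush()
print(out)
```

A frequently repeated pair such as (George Bush, president) is written once with a count instead of once per occurrence, which is where the order-of-magnitude reduction in write overhead would come from.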


37 Conclusions
Studied the effect of aggregate context in relation extraction and proposed efficient processing techniques for large text corpora. Both aggregate and co-occurrence features provide a significant increase in extraction accuracy compared to single-context classifiers. The use of pruning techniques and approximate filters results in a significant reduction in the overall extraction overhead.

38 Questions?

