Wikipedia Knowledge Extraction: Pronoun Resolution module, Infobox extraction, SRL parsing, Improved refinement, Clustering, Hadoop compatibility


1 Wikipedia Knowledge Extraction

2
• Pronoun Resolution module
• Infobox extraction
• SRL parsing
• Improved refinement
• Clustering
• Hadoop compatibility

3 Pronoun Resolution module
• “His mother wanted him to get a good education so she sent him to live with his grandparents in Honolulu, HI” (Barack Obama)

4 Pronoun Resolution module
• Current solution: replace every pronoun with the article title (very primitive)
• Target solution:
  ◦ Pronoun resolution is still an unsolved problem in general
  ◦ Use an existing system that is usually correct?
  ◦ Simple rules for common patterns?
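A minimal sketch of the current title-substitution approach, to make its weakness concrete. The pronoun lists and the function name replace_pronouns are assumptions for illustration, not the project's actual code, and "her" is always treated as possessive for simplicity.

```python
import re

# Hypothetical pronoun tables (assumption: not taken from the project).
SUBJECT_OBJECT = {"he", "him", "she", "it", "they", "them"}
# Simplification: "her" is always treated as possessive here.
POSSESSIVE = {"his", "her", "its", "their"}

_PRONOUN = re.compile(
    r"\b(" + "|".join(sorted(SUBJECT_OBJECT | POSSESSIVE)) + r")\b",
    re.IGNORECASE,
)


def replace_pronouns(sentence: str, title: str) -> str:
    """Replace every pronoun with the article title, with no analysis at all."""

    def substitute(match: re.Match) -> str:
        word = match.group(0).lower()
        # Possessive pronouns become "<title>'s"; all others become the title.
        return title + "'s" if word in POSSESSIVE else title

    return _PRONOUN.sub(substitute, sentence)


sentence = (
    "His mother wanted him to get a good education so she sent him "
    "to live with his grandparents in Honolulu, HI"
)
print(replace_pronouns(sentence, "Barack Obama"))
```

On the example sentence, "she" refers to the mother, yet this sketch rewrites it to "Barack Obama" as well, which is exactly the kind of error that motivates a better resolution module.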

5 Infobox extraction
• Convert infobox information into simple sentences:
  ◦ Joe Biden is Barack Obama’s Vice President
  ◦ Barack Obama is preceded by George W. Bush
• Use the type of phrase (Noun Phrase, Verb Phrase) to determine which sentence to form
• Read papers from the Turing Center (University of Washington)
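One way the phrase-type rule could look in code, as a rough sketch. The infobox is assumed to arrive as a plain dict, and is_verb_phrase is a hypothetical stand-in (it only checks for a trailing preposition) for a real phrase-type decision made by a parser.

```python
# Assumption: a label ending in one of these words is treated as a verb phrase.
TRAILING_PREPOSITIONS = {"by", "in", "at", "to"}


def is_verb_phrase(label: str) -> bool:
    """Crude, hypothetical phrase-type test; a real system would use a parser."""
    words = label.lower().split()
    return bool(words) and words[-1] in TRAILING_PREPOSITIONS


def infobox_to_sentences(title: str, infobox: dict) -> list:
    """Turn each infobox field into one simple sentence about the article."""
    sentences = []
    for key, value in infobox.items():
        label = key.replace("_", " ")
        if is_verb_phrase(label):
            # Verb phrase: "<title> is <label> <value>"
            sentences.append(f"{title} is {label} {value}")
        else:
            # Noun phrase: "<value> is <title>'s <label>"
            sentences.append(f"{value} is {title}'s {label}")
    return sentences


infobox = {"Vice President": "Joe Biden", "preceded_by": "George W. Bush"}
for sentence in infobox_to_sentences("Barack Obama", infobox):
    print(sentence)
```

With the two fields above, the sketch reproduces the two example sentences from the slide.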

6 SRL parsing
• SRL (semantic role labeling) parsing performs a deep analysis on each sentence
• E.g. “Yoshi has a long tongue which he uses to grab enemies and eat them.”
  ◦ has (A0: Yoshi, A1: long tongue)
  ◦ use (A0: Yoshi, A1: long tongue, A2: grab enemies and eat them)
• Use SRL parsing to improve the quality and representation of knowledge
• Problem: speed and complexity
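The frames in the Yoshi example can be held in a small data structure and reduced to the tuples the rest of the system uses. This is only a sketch: the Frame class and frame_to_tuple are hypothetical names, and the frames are typed in by hand rather than produced by an SRL parser.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One predicate with its labeled arguments (A0, A1, A2, ...)."""

    predicate: str
    roles: dict = field(default_factory=dict)


def frame_to_tuple(frame: Frame) -> tuple:
    """Reduce a frame to (A0, predicate, A1); missing roles become None."""
    return (frame.roles.get("A0"), frame.predicate, frame.roles.get("A1"))


# Frames copied by hand from the slide's example sentence.
frames = [
    Frame("has", {"A0": "Yoshi", "A1": "long tongue"}),
    Frame(
        "use",
        {"A0": "Yoshi", "A1": "long tongue", "A2": "grab enemies and eat them"},
    ),
]

for frame in frames:
    print(frame_to_tuple(frame))
```

Keeping the full role dictionary alongside the tuple means extra arguments such as A2 are not lost when a simpler representation is needed.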

7 Improved refinement
• Current system has (Subject, Object, Verb) tuples
• Problem: hard to define which words to incorporate in each phrase
• E.g. “'The dog ( Canis lupus familiaris )' 'is' 'a mammal from the family Canidae'”
  ◦ The dog? dog? The dog ( Canis lupus familiaris )?
  ◦ a mammal? a mammal from the family Canidae?
• Possible solutions:
  ◦ Different levels of information?
  ◦ Simple rules based on part-of-speech tags?
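The two proposed solutions can be combined: a few part-of-speech rules yield several levels of the same phrase. The sketch below assumes tokens already carry Penn Treebank tags (tagged by hand here), and the three rules are illustrative choices, not the project's rules.

```python
def strip_parentheticals(tokens):
    """Drop everything between -LRB- and -RRB- tags, brackets included."""
    depth, kept = 0, []
    for word, tag in tokens:
        if tag == "-LRB-":
            depth += 1
        elif tag == "-RRB-":
            depth = max(depth - 1, 0)
        elif depth == 0:
            kept.append((word, tag))
    return kept


def cut_at_preposition(tokens):
    """Keep only the tokens before the first preposition (tag IN)."""
    kept = []
    for word, tag in tokens:
        if tag == "IN":
            break
        kept.append((word, tag))
    return kept


def drop_determiners(tokens):
    """Remove determiners (tag DT) such as 'the' and 'a'."""
    return [(word, tag) for word, tag in tokens if tag != "DT"]


def phrase_levels(tokens):
    """Return the phrase at three levels: full, core, and head."""
    core = cut_at_preposition(strip_parentheticals(tokens))
    head = drop_determiners(core)
    return [" ".join(word for word, _ in level) for level in (tokens, core, head)]


subject = [("The", "DT"), ("dog", "NN"), ("(", "-LRB-"), ("Canis", "NNP"),
           ("lupus", "NNP"), ("familiaris", "NNP"), (")", "-RRB-")]
obj = [("a", "DT"), ("mammal", "NN"), ("from", "IN"), ("the", "DT"),
       ("family", "NN"), ("Canidae", "NNP")]

print(phrase_levels(subject))
print(phrase_levels(obj))
```

For the subject this gives "The dog ( Canis lupus familiaris )", "The dog", and "dog"; for the object it gives "a mammal from the family Canidae", "a mammal", and "mammal", matching the candidates listed on the slide.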

8 Clustering
• Idea: determine whether two separate mentions point to the same concept
  ◦ ‘The dog’, ‘a dog’, ‘dogs’
  ◦ ‘Cats’, ‘C.A.T.S’, ‘CAT Scan’
  ◦ ‘President Obama’, ‘President Barack Obama’
• Possible solutions:
  ◦ Feature-based classification
  ◦ Self-organizing map
  ◦ Associated terms
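For the feature-based option, a classifier would need features computed for each pair of mentions. The following sketch shows what such features might look like; the feature names, the naive plural stripping, and the acronym test are all assumptions made for illustration.

```python
DETERMINERS = {"the", "a", "an"}


def is_acronym(token: str) -> bool:
    """True for all-uppercase tokens such as 'CAT' or 'C.A.T.S'."""
    letters = token.replace(".", "")
    return len(letters) > 1 and letters.isalpha() and letters.isupper()


def normalize(mention: str) -> list:
    """Lowercase, drop determiners and periods, strip a naive plural 's'."""
    tokens = []
    for raw in mention.split():
        token = raw.replace(".", "").lower()
        if token in DETERMINERS:
            continue
        if not is_acronym(raw) and len(token) > 3 and token.endswith("s"):
            token = token[:-1]
        tokens.append(token)
    return tokens


def pair_features(a: str, b: str) -> dict:
    """Compute simple comparison features for two mentions."""
    norm_a, norm_b = normalize(a), normalize(b)
    set_a, set_b = set(norm_a), set(norm_b)
    union = set_a | set_b
    return {
        "same_normal_form": norm_a == norm_b,
        "token_jaccard": len(set_a & set_b) / len(union) if union else 0.0,
        "token_subset": set_a <= set_b or set_b <= set_a,
        "acronym_mismatch": any(map(is_acronym, a.split()))
        != any(map(is_acronym, b.split())),
    }


print(pair_features("The dog", "dogs"))
print(pair_features("Cats", "C.A.T.S"))
print(pair_features("President Obama", "President Barack Obama"))
```

In this sketch ‘The dog’ and ‘dogs’ share a normal form, ‘President Obama’ is a token subset of ‘President Barack Obama’, and ‘Cats’ versus ‘C.A.T.S’ is flagged as an acronym mismatch. How such features are weighted is left to the classifier.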

9 Hadoop compatibility
• Need to ensure scaling is possible for the move to regular Wikipedia
• Hadoop is an open-source implementation of the MapReduce programming model
• MapReduce parallelizes a job by splitting its input across several machines (map) and then combining the partial results (reduce)
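The map, shuffle, and reduce phases can be illustrated in plain Python. This sketch runs sequentially on one machine and does not use the Hadoop API; the word-count task and all function names are assumptions chosen only to show the shape of the model.

```python
from collections import defaultdict


def map_article(article: str):
    """Map step: emit a (word, 1) pair for every word in one article."""
    for word in article.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a single total."""
    return key, sum(values)


def run_job(articles):
    """Run map, shuffle, and reduce in sequence over a list of articles."""
    mapped = (pair for article in articles for pair in map_article(article))
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())


counts = run_job(["the dog is a mammal", "the cat is a mammal"])
print(counts)
```

Each map_article call depends only on its own article, which is what would let a framework such as Hadoop run the map step for different articles on different machines.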

10

