Web Mining for Extracting Relations
Negin Nejati
Relation Extraction
(James Gleick, Chaos: Making a New Science)
(Charles Dickens, Great Expectations)
(William Shakespeare, The Comedy of Errors)
(Isaac Asimov, The Robots of Dawn)
(David Brin, Startide Rising)
Schema: (author, title)
DIPRE Algorithm
S = SampleTuples
While size(S) < T
    O = FindOccurrences(S)
    P = GenPatterns(O)
    S = MatchingTuples(P)
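The loop above can be sketched in Python over a toy in-memory corpus. This is a minimal illustration under stated assumptions, not the original DIPRE implementation: the corpus is a list of strings rather than web pages, a pattern is simplified to a (prefix, middle, suffix) triple with the title always before the author, and the helper names `find_occurrences`, `gen_patterns`, `matching_tuples` and `dipre` are hypothetical. The sketch also merges new tuples into S and stops when nothing new is found, which the slide's pseudocode does not spell out.

```python
import re
from collections import defaultdict
from os.path import commonprefix


def find_occurrences(tuples, corpus):
    """Collect (prefix, middle, suffix) contexts from lines where a known
    title is followed by its author (ordering is an assumption)."""
    occurrences = []
    for author, title in tuples:
        for line in corpus:
            t = line.find(title)
            if t == -1:
                continue
            a = line.find(author, t + len(title))
            if a == -1:
                continue
            occurrences.append(
                (line[:t], line[t + len(title):a], line[a + len(author):])
            )
    return occurrences


def gen_patterns(occurrences):
    """Group contexts by their middle string and keep the shared
    end of the prefixes and the shared start of the suffixes."""
    by_middle = defaultdict(list)
    for prefix, middle, suffix in occurrences:
        by_middle[middle].append((prefix, suffix))
    patterns = []
    for middle, contexts in by_middle.items():
        # Longest common suffix of the prefixes, via reversed strings.
        prefix = commonprefix([p[::-1] for p, _ in contexts])[::-1]
        suffix = commonprefix([s for _, s in contexts])
        # Crude stand-in for a specificity check: drop empty parts.
        if prefix and middle and suffix:
            patterns.append((prefix, middle, suffix))
    return patterns


def matching_tuples(patterns, corpus):
    """Apply every pattern to every line and return (author, title) pairs."""
    found = set()
    for prefix, middle, suffix in patterns:
        regex = re.compile(
            re.escape(prefix) + "(.+?)" + re.escape(middle)
            + "(.+?)" + re.escape(suffix)
        )
        for line in corpus:
            for m in regex.finditer(line):
                found.add((m.group(2), m.group(1)))
    return found


def dipre(seeds, corpus, target):
    """Grow the tuple set until it reaches `target` or stops changing."""
    tuples = set(seeds)
    while len(tuples) < target:
        occurrences = find_occurrences(tuples, corpus)
        patterns = gen_patterns(occurrences)
        grown = tuples | matching_tuples(patterns, corpus)
        if grown == tuples:  # no new tuples: avoid looping forever
            break
        tuples = grown
    return tuples


# Toy corpus (invented markup, for illustration only).
corpus = [
    "<li><b>Great Expectations</b>, by Charles Dickens</li>",
    "<li><b>The Robots of Dawn</b>, by Isaac Asimov</li>",
    "<li><b>Startide Rising</b>, by David Brin</li>",
    "Isaac Asimov wrote many novels.",
]
seeds = {("Charles Dickens", "Great Expectations")}
pairs = dipre(seeds, corpus, target=3)
```

With the single seed, the first pass learns one pattern from the Dickens line and that pattern then matches the other two list items, so the loop stops after one iteration.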
Pattern Generation
Existing methods assume that the components of a tuple appear close together (e.g. "Foundation, by Isaac Asimov").
This is a very strong assumption (e.g. it misses all the titles listed on the author's own webpage).
Less popular relations with limited sources of data suffer more (for some relations this is not the typical form of appearance, e.g. (service, price)).
Using Heuristics
We are looking for (author, title) pairs.
It is very likely that the works of an author are presented as lists or tables.
Such tables usually have helpful titles such as: bibliography, selected work, novels, stories, etc.
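One way to act on this heuristic is to scan a page for headings that contain one of the helpful titles and collect the list items or table cells that follow. The sketch below uses only Python's standard `html.parser`. The keyword tuple comes from the slide, but the class name `WorkListParser`, the function `candidate_titles`, the choice of heading tags and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser

KEYWORDS = ("bibliography", "selected work", "novels", "stories")
HEADINGS = ("h1", "h2", "h3", "h4")
ITEMS = ("li", "td")


class WorkListParser(HTMLParser):
    """Collect the text of list items and table cells that appear
    under a heading whose text contains one of KEYWORDS."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.in_item = False
        self.relevant = False  # True while under a matching heading
        self.items = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.in_heading = True
            self._buf = []
        elif tag in ITEMS and self.relevant:
            self.in_item = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag in HEADINGS and self.in_heading:
            self.in_heading = False
            text = "".join(self._buf).strip().lower()
            self.relevant = any(k in text for k in KEYWORDS)
        elif tag in ITEMS and self.in_item:
            self.in_item = False
            text = " ".join("".join(self._buf).split())
            if text:
                self.items.append(text)

    def handle_data(self, data):
        if self.in_heading or self.in_item:
            self._buf.append(data)


def candidate_titles(html):
    """Return the item texts found under keyword-matching headings."""
    parser = WorkListParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Invented sample page, for illustration only.
page = """
<h1>Charles Dickens</h1><p>English writer.</p>
<h2>Novels</h2>
<ul><li>Great Expectations (1860–1861)</li><li><i>Oliver Twist</i></li></ul>
<h2>External links</h2>
<ul><li>Some archive</li></ul>
"""
titles = candidate_titles(page)
```

Items under "External links" are skipped because that heading matches none of the keywords. The collected strings are still raw occurrences (title plus surrounding text), so they would feed into pattern generation rather than be used as titles directly.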
New Algorithm
(Charles Dickens, Great Expectations) → occurrences
New Algorithm
Group occurrences using edit distance and generate patterns:
title (VIKING PENGUIN, 1987) & title (1860–1861)
[ title (, ) ]
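The grouping step can be sketched as follows, assuming each occurrence has already been reduced to a context string in which the matched title is replaced by the placeholder `title`. The slides do not say which clustering procedure, threshold or pattern notation is used, so the greedy grouping, the `max_ratio` parameter, the `*` wildcard and the function names here are all illustrative assumptions.

```python
from os.path import commonprefix


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def group_occurrences(contexts, max_ratio=0.75):
    """Greedy grouping: a context joins the first group whose first member
    is within max_ratio (edit distance divided by the longer length)."""
    groups = []
    for ctx in contexts:
        for group in groups:
            rep = group[0]
            ratio = edit_distance(ctx, rep) / max(len(ctx), len(rep), 1)
            if ratio <= max_ratio:
                group.append(ctx)
                break
        else:
            groups.append([ctx])
    return groups


def generalize(group):
    """Keep the shared start and shared end of a group and replace
    whatever differs in between with a '*' wildcard."""
    prefix = commonprefix(group)
    rests = [s[len(prefix):] for s in group]
    if not any(rests):  # all members identical
        return prefix
    suffix = commonprefix([r[::-1] for r in rests])[::-1]
    return prefix + "*" + suffix


contexts = [
    "title (VIKING PENGUIN, 1987)",
    "title (1860–1861)",
    "<td>title</td>",
]
groups = group_occurrences(contexts)
patterns = [generalize(g) for g in groups]
```

With these inputs the two parenthesized contexts fall into one group and generalize to `title (*)`, while the table-cell context stays on its own. A lower `max_ratio` yields more, tighter groups and therefore more specific patterns.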
Pattern Generation (An Alternative)
1. [Charles Dickens, James Gleick, William Shakespeare, ….]
2. "List of authors" → New authors → Run patterns on result pages → New titles
Results
DIPRE: 5 seeds, 3 patterns, 4047 pairs
The proposed algorithm: 5 seeds, 2 patterns, 2596 pairs
Further Investigations
Study the effects of including the titles of the lists and tables in the patterns.
Study the qualitative differences between these two methods.