Pete Bohman, Adam Kunk. ChronoSearch: A System for Extracting a Chronological Timeline

1 Pete Bohman, Adam Kunk

2 ChronoSearch: A System for Extracting a Chronological Timeline

3
• Current search engines do not provide a complete picture
• Latest events dominate top results
• The user is forced to parse through lots of pages to find a complete list of information
• ChronoSearch aims to summarize search results into a concise list of important events related to an entity

4
• Input: An entity, E, and a set of web pages, W, related to E
• Output: A sorted list of events, L, which are related to E

5
• Output: L = { l_i | l_i occurred before l_{i+1} }
  ▪ l_i is a sentence describing an event
  ▪ l_i describes a unique event
  ▪ l_i contains a link to the source web page w belonging to W
• L is Precise
  ▪ Each l_i describes an event the user is interested in
• L is Comprehensive
  ▪ L contains a description of all the events a user is interested in

6
• Extract textual elements from web pages
  ▪ Beautiful Soup to extract elements and remove HTML tags
  ▪ NLTK sentence tokenization
• Sentence sanitization
  ▪ Avg. word length [3.2, 7.2] chars/word
• Extract sentences containing entity and date
  ▪ Regular expressions used for date extraction
• Order events by date
• Remove sentences reporting the same event
  ▪ Cosine similarity
  ▪ Verb similarity
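
The sentence sanitization step can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: the slide gives only the [3.2, 7.2] chars/word range, so the whitespace tokenization, the punctuation stripping, and the function names are all assumptions made for illustration.

```python
import string

# Accepted range of average word length (chars/word), taken from the slide.
MIN_AVG_WORD_LEN = 3.2
MAX_AVG_WORD_LEN = 7.2


def average_word_length(sentence: str) -> float:
    """Return the mean word length of a sentence.

    Assumption of this sketch: words are whitespace-separated tokens with
    surrounding punctuation stripped.
    """
    words = [w.strip(string.punctuation) for w in sentence.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def is_well_formed(sentence: str) -> bool:
    """Keep a sentence only if its average word length is within range."""
    return MIN_AVG_WORD_LEN <= average_word_length(sentence) <= MAX_AVG_WORD_LEN
```

A filter like this is cheap and tends to reject navigation menus, symbol runs, and other non-prose text that survives HTML tag removal.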

7
• Focus on strongest signal
• Absolute entity and date in the same sentence
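
Extracting this strongest signal and ordering the result by date might look like the sketch below. The slides say only that regular expressions were used for date extraction, so the single "Month D, YYYY" pattern, the exact-substring entity match, and the name `extract_events` are assumptions for illustration; a real system would need to recognize more date formats.

```python
import re
from datetime import date

MONTHS = ("January February March April May June July "
          "August September October November December").split()

# Assumed pattern: full month name, day, comma, four-digit year,
# e.g. "October 5, 2011". Other absolute date formats are not handled.
DATE_RE = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s+(\d{4})\b"
)


def extract_events(entity, sentences):
    """Return (date, sentence) pairs for sentences that mention the entity
    by name and contain an absolute date, sorted chronologically."""
    events = []
    for sentence in sentences:
        match = DATE_RE.search(sentence)
        if entity not in sentence or match is None:
            continue
        month, day, year = match.groups()
        try:
            when = date(int(year), MONTHS.index(month) + 1, int(day))
        except ValueError:
            continue  # skip impossible dates such as "February 31, 2011"
        events.append((when, sentence))
    return sorted(events)
```

Sentences that refer to the entity only by pronoun, or that use a relative date, are deliberately dropped here, which favors precision over recall.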

8
• Guiding Principle: Increase Precision
  ▪ Duplicate event descriptions include event descriptions using a similar set of verbs and paraphrased sentences.
• Methodology
  ▪ Verb similarity: remove sentences containing similar sets of verbs that occur around the same date
  ▪ Cosine similarity: remove paraphrased sentences
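
The verb-similarity filter could be sketched as below. The slides do not define the similarity measure, so Jaccard overlap of verb sets is an assumption here, as is treating "around the same date" as "on the same date"; the 0.5 threshold is the one reported with the results. Verb sets are assumed to have been extracted beforehand (for example with a part-of-speech tagger), and the function names are hypothetical.

```python
from datetime import date


def verb_similarity(verbs_a, verbs_b):
    """Jaccard overlap of two verb sets (an assumed measure, not the
    authors' documented one)."""
    a, b = set(verbs_a), set(verbs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def remove_verb_duplicates(events, threshold=0.5):
    """Drop events whose verbs overlap an already kept event on the same date.

    `events` is a list of (date, verbs, sentence) tuples. The first
    sentence seen in each group of near-duplicates is kept.
    """
    kept = []
    for when, verbs, sentence in events:
        duplicate = any(
            when == kept_when
            and verb_similarity(verbs, kept_verbs) > threshold
            for kept_when, kept_verbs, _ in kept
        )
        if not duplicate:
            kept.append((when, verbs, sentence))
    return kept
```

Restricting comparisons to the same date keeps the check cheap and avoids merging distinct events that merely share common verbs.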

9
• Remove sentences with similarity > 0.5
• Sentence 1: “I have to go to school”
• Sentence 2: “I have to go to lecture”
• Terms: [I, have, to, go, school, lecture]
  ▪ V1 = [1, 1, 2, 1, 1, 0]
  ▪ V2 = [1, 1, 2, 1, 0, 1]
• Similarity = (V1 · V2) / (||V1|| * ||V2||)
• Similarity = 7 / (√8 * √8) = 0.875
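
The term-vector cosine similarity from this slide can be reproduced with a short sketch. It assumes lowercase, whitespace-separated terms and raw term counts with no stop-word removal or weighting; the function name is illustrative.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a: str, sentence_b: str) -> float:
    """Cosine similarity between the term-count vectors of two sentences."""
    va = Counter(sentence_a.lower().split())
    vb = Counter(sentence_b.lower().split())
    # Only terms present in both sentences contribute to the dot product.
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


score = cosine_similarity("I have to go to school", "I have to go to lecture")
is_duplicate = score > 0.5  # threshold from the slide
```

For the slide's two sentences the dot product is 7 and both norms are √8, so the score is approximately 0.875, above the 0.5 threshold, and the pair would be treated as duplicates.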

10
• Guiding Principle: Increase Precision
  ▪ The web caters to user interest. The more popular an event description is on the web, the more important the event is, and therefore the more likely it is to be in a user's expected results.
• Methodology
  ▪ Increase precision by removing unpopular event descriptions, as determined by search results.
• Lesson
  ▪ Insufficient correlation between search results and event importance

11

12 Demo time…

13
• Information Retrieval (IR) performance characteristics:
  ▪ Precision: fraction of retrieved documents that are relevant to the query
  ▪ Recall: fraction of the documents relevant to the query that are successfully retrieved
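
Both measures follow directly from set overlap. The sketch below assumes retrieved and relevant items can be compared by identity (for example, event IDs); the function name and the empty-set convention of returning 0.0 are choices made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a truth set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four events retrieved, two of them among the three relevant events.
precision, recall = precision_recall({"e1", "e2", "e3", "e4"}, {"e1", "e2", "e5"})
```

In this example precision is 2/4 and recall is 2/3: removing irrelevant events raises precision, while finding the missing relevant event would raise recall.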

14
• Evaluated each timeline against truth set
• Compared ChronoSearch results to others
• Analyzed results for: Bill Gates, Steve Jobs, Jim Tressel
• Merged existing manual timelines to form truth set

15

16
• Total Sentences: total number of sentences considered for output
• Sentences Removed: total number of sentences removed (3 different mechanisms combined)
• Precision Improvement: percent of non-precise results removed
• Average precision improvement: 29.33%

Entity        Total Results   Results Removed   Precision Improvement
Bill Gates    187             48                23%
Steve Jobs    206             88                35%
Jim Tressel   83              29                30%

17
• Bad Sentences: sentences whose average word length fell outside the accepted range
• Cosine Similar Events: sentences that had a cosine similarity > 0.5 (by term vectors)
• Verb Similar Results: sentences that occurred on the same day and had a verb similarity > 0.5

Entity        Bad Sentences Removed   Cosine Similar Results Removed   Verb Similar Results Removed
Bill Gates    7                       38                               3
Steve Jobs    4                       78                               6
Jim Tressel   0                       28                               1

18
• False Positives for removal techniques
• Average false positive rate: 14.13%

Each cell gives false positives / results removed.

Entity        Bad Sentences   Cosine Similar   Verb Similar   % False Positives for Total Events Removed
Bill Gates    4/7             0/38             1/3            5/48 = 10.42%
Steve Jobs    2/4             13/78            1/6            16/88 = 18.18%
Jim Tressel   0/0             3/28             1/1            4/29 = 13.79%
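
The per-entity rates and the reported average can be checked with a few lines of arithmetic. The numbers are taken from the slide; the variable names are illustrative.

```python
# (false positives, total results removed) per entity, from the slide.
removals = {
    "Bill Gates": (5, 48),
    "Steve Jobs": (16, 88),
    "Jim Tressel": (4, 29),
}

# False positive rate per entity, as a percentage.
rates = {
    entity: 100 * false_positives / removed
    for entity, (false_positives, removed) in removals.items()
}

# The slide's figure is the unweighted mean of the three per-entity rates.
average_rate = sum(rates.values()) / len(rates)
print(f"Average false positive rate: {average_rate:.2f}%")
```

Note that this is an unweighted mean across entities; pooling all removals would instead give 25/165, or about 15.15%.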

19
• Duplicate Events Not Detected
• Average Effectiveness of Duplicate Detection: 84.65%

Entity        Duplicates We Failed to Remove   % of Total Duplicates We Missed
Bill Gates    10                               10/51 = 19.61%
Steve Jobs    21                               21/105 = 20%
Jim Tressel   2                                2/31 = 6.45%
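
The 84.65% figure appears to be 100% minus the unweighted mean of the three miss rates; the slide does not state the formula, so that reading is an assumption, but it reproduces the reported number. Variable names are illustrative.

```python
# (duplicates missed, total duplicates) per entity, from the slide.
missed = {
    "Bill Gates": (10, 51),
    "Steve Jobs": (21, 105),
    "Jim Tressel": (2, 31),
}

# Percentage of duplicates that were not detected, per entity.
miss_rates = {
    entity: 100 * not_removed / total
    for entity, (not_removed, total) in missed.items()
}

# Assumed formula: effectiveness is the complement of the mean miss rate.
mean_miss_rate = sum(miss_rates.values()) / len(miss_rates)
effectiveness = 100 - mean_miss_rate
print(f"Average effectiveness: {effectiveness:.2f}%")
```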

20
• Improve recall by extracting weaker signals
• Attempt to handle relative dates and/or pronouns
  ▪ Could resolve all relative dates in the document to absolute dates based on the last seen absolute date
  ▪ Resolve pronouns to nearest entity
  ▪ Example: Steve Jobs was named the greatest CEO in 2011. One month ago, he passed away.

21
• Improve precision by associating events to verbs
• Attempt to find events by looking for verbs
  ▪ Assumption: An event should contain a verb and an entity
  ▪ If there is no verb, then there is no event
  ▪ Example: “Farewell Steve Jobs” 06 Oct 2011.

22 Thank you, we hope you enjoy ChronoSearch!

