
1 Timedex.org
Brandon Bell, Alex Loddengaard, Robert Gay, Sean McCarthy

2 Abstract
- Extract events from Wikipedia
- Index the events based on date of occurrence
- Display the events in a useful webapp

3 Step 1: Importing Wikipedia into MySQL
- Large dataset
  - 8GB page links table
  - ~5GB page table
- Difficulties:
  - Altering tables (adding columns or indexes)
- Lessons learned:
  - Be careful with alter operations on large tables
  - Use Postgres

4 Step 2: Extracting Sentences and Page Hierarchy
- Used Lingpipe API to find sentences
- Parsed Wikipedia tags to create heading/sentence tree of each page
- Difficulties:
  - Many Wikipedia sentences are terminated by newlines
  - Periods in abbreviations can be confusing
- Lessons learned:
  - 3rd party packages never do exactly what you need
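A minimal sketch of the heading/sentence tree, under stated assumptions: the authors used Lingpipe for sentence detection, and the hand-rolled splitter below is only a stand-in for it. The names `split_sentences`, `parse_page`, and the `ABBREVIATIONS` list are hypothetical, and only `== Heading ==` style wiki markup is handled. It illustrates the two difficulties named on the slide: a newline ends a sentence even without a period, and a period after a known abbreviation does not.

```python
import re

# Wiki headings look like "== Title ==" (level 2) through "====== Title ======" (level 6).
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")

# Assumed, deliberately tiny abbreviation list; a real system would need far more.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "St.", "Jr.", "e.g.", "i.e.", "c."}


def split_sentences(line):
    """Split one line at tokens ending in . ! or ?, except known abbreviations."""
    sentences, current = [], []
    for token in line.split():
        current.append(token)
        if token[-1] in ".!?" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # no closing punctuation: the newline itself ends the sentence
        sentences.append(" ".join(current))
    return sentences


def parse_page(title, wikitext):
    """Build a tree of headings, each holding the sentences directly under it."""
    root = {"heading": title, "level": 1, "sentences": [], "children": []}
    stack = [root]  # path from the page root to the current heading
    for line in wikitext.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            node = {"heading": match.group(2), "level": level,
                    "sentences": [], "children": []}
            # Climb back up until the top of the stack is a shallower heading.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["sentences"].extend(split_sentences(line))
    return root
```

For example, parsing a hypothetical page whose first line is `Born in Ohio` (no period) followed by `Dr. Smith moved in 1950. He retired in 1980.` yields three sentences at the root, with `Dr.` kept inside the second one.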

5 Step 3: Detecting Dates
- Used a set of regular expressions to check for dates
- Difficulties:
  - Deciding which date formats to accept so that date-like constructs that are not dates are minimized
- Lessons learned:
  - Regular expressions are easy to control and tune, so use them if possible
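The slides do not list the actual expressions, so the following is only a sketch of what a small regex set might look like. The patterns, the `DATE_PATTERNS` ordering, and the `find_dates` helper are all assumptions chosen for illustration. A bare four-digit number is accepted as a year only when preceded by "in", which shows the trade-off the slide describes: fewer false positives at the cost of some missed dates.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Ordered from most to least specific, so a full date wins over a partial one.
DATE_PATTERNS = [
    re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b"),  # July 20, 1969
    re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b"),   # 20 July 1969
    re.compile(rf"\b(?:{MONTHS}) \d{{4}}\b"),             # July 1969
    re.compile(r"(?<=\bin )(?:1\d{3}|20\d{2})\b"),        # "in 1948", year only
]


def find_dates(sentence):
    """Return matched date strings in order of appearance, without overlaps."""
    covered = []  # (start, end) spans already claimed by an earlier pattern
    found = []
    for pattern in DATE_PATTERNS:
        for match in pattern.finditer(sentence):
            start, end = match.span()
            if any(s < end and start < e for s, e in covered):
                continue  # a more specific pattern already matched here
            covered.append((start, end))
            found.append((start, match.group(0)))
    return [text for _, text in sorted(found)]
```

With these patterns, `find_dates("The room holds 1500 people.")` returns an empty list, while `find_dates("He had two children by her in 1948 and 1951.")` returns `["1948"]` and misses 1951. Tuning means adding or loosening one pattern at a time and rechecking both kinds of sentence.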

6 Step 4: Event Summaries
- Given a sentence containing a date, find a short phrase describing it
- We thought this would be done best by training an HMM or CRF to extract the events
- Switched to a much simpler system of using headings
- Difficulties:
  - Most sentences do not contain good, complete information!
  - E.g. “He had two children by her in 1948 and 1951.”
- Lessons learned:
  - Try basic methods first, then experiment with more elaborate schemes
  - English is hard; pronouns suck
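The slides do not say exactly how headings were turned into summaries. One plausible reading, sketched below purely as an assumption, is to label each dated sentence with the chain of headings above it, so that the page title supplies the subject a pronoun leaves out. The function name `summarize_event`, the separator, and the length limit are hypothetical.

```python
def summarize_event(heading_path, sentence, max_len=80):
    """Prefix a dated sentence with its headings, page title first.

    heading_path: list of headings from the page title down to the
                  section that contains the sentence.
    """
    label = ": ".join(heading_path)
    if len(sentence) > max_len:
        # Truncate long sentences so the summary stays a short phrase.
        sentence = sentence[:max_len].rstrip() + "..."
    return f"{label} - {sentence}"
```

Applied to the slide's example under a hypothetical page "John Smith" and section "Personal life", the result reads `John Smith: Personal life - He had two children by her in 1948 and 1951.`, which at least tells the reader who "he" is.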

7 Step 5: Ranking Events
- We have more events than anyone would ever want
- Run PageRank at page level using link table
  - Assume highly linked-to pages contain better events
- Count heading levels
  - Assume events at deep subheading levels are less important
- Empirically, first sentence of page often has most useful event
- Difficulties:
  - PageRank can take until the end of time (177 million links)
- Lessons learned:
  - In rare cases, efficiency does matter!
  - Buy more memory
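The ranking signals above can be sketched as follows. `pagerank` is a plain in-memory power iteration over (source, target) link pairs, which is the textbook formulation and not necessarily how the authors ran it over 177 million links. `event_score` and its weights (dividing by heading level, doubling for a first sentence) are invented for illustration; the slides name the signals but not how they were combined.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank from an iterable of (source_page, target_page) pairs."""
    links = list(links)
    pages = {page for edge in links for page in edge}
    out_degree = {page: 0 for page in pages}
    for src, _ in links:
        out_degree[src] += 1

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if out_degree[p] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}
        for src, dst in links:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


def event_score(page_rank, heading_level, is_first_sentence):
    """Combine the three signals; the weights here are assumptions."""
    score = page_rank / heading_level  # deeper subheadings count for less
    if is_first_sentence:
        score *= 2.0                   # first sentence of a page gets a boost
    return score
```

Each iteration touches every link once, so the cost is proportional to the number of links times the number of iterations. This sketch keeps both rank tables and the full link list in memory, which is where the "buy more memory" lesson applies.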

8 Step 6: Writing the Webapp
- Makes an asynchronous JS request
- Results returned in JSON
- Uses Lucene to query for keywords
- Difficulties:
  - Creating a distribution of ranks for the “hide/show” experience to be interesting
  - Creating a good-looking site
- Lessons learned:
  - Use JS libraries whenever possible
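One way to get an even "hide/show" distribution, offered only as a sketch, is to bucket events by rank order instead of by raw score, so that each level reveals roughly the same number of events no matter how skewed the scores are. The `assign_levels` name, the three-level default, and the JSON field names are assumptions; the slides do not describe the actual response format.

```python
import json


def assign_levels(events, levels=3):
    """Give each (summary, score) event a visibility level; 0 is shown first.

    Events are split into equal-sized buckets by rank order, not by score.
    """
    ordered = sorted(events, key=lambda event: event[1], reverse=True)
    bucket_size = -(-len(ordered) // levels)  # ceiling division
    return [
        {"summary": summary, "score": score, "level": index // bucket_size}
        for index, (summary, score) in enumerate(ordered)
    ]


def to_json(events, levels=3):
    """Serialize the levelled events as the body of a JSON response."""
    return json.dumps({"events": assign_levels(events, levels)})
```

A client could then render level 0 immediately and reveal levels 1 and 2 as the user asks for more.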

9 Technologies Used
- Mallet: machine learning API
- Lingpipe: linguistic analysis API
- Hibernate: Java persistence layer
- Spring: Java MVC framework
- MySQL + Tomcat + Apache
- Scriptaculous: JS library
- Lucene: Java indexing API

10 Who Did What
- Brandon: Sentence and event extraction, date extraction, Lucene semantics
- Alex: MySQL import, PageRank, webapp, data access layer
- Robert: Sentence and event extraction, webapp, data access layer
- Sean: Sentence and event extraction, hierarchy parsing

11 Demo

