Timedex.org
Brandon Bell, Alex Loddengaard, Robert Gay, Sean McCarthy
Abstract
- Extract events from Wikipedia
- Index the events based on date of occurrence
- Display the events in a useful webapp
Step 1: Importing Wikipedia into MySQL
- Large dataset
  - 8GB page links table
  - ~5GB page table
- Difficulties:
  - Altering tables (adding columns or indexes)
- Lessons learned:
  - Be careful with alter operations on large tables
  - Use Postgres
Step 2: Extracting Sentences and Page Hierarchy
- Used Lingpipe API to find sentences
- Parsed Wikipedia tags to create heading/sentence tree of each page
- Difficulties:
  - Many Wikipedia sentences are terminated by newlines
  - Periods in abbreviations can be confusing
- Lessons learned:
  - 3rd party packages never do exactly what you need
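The two difficulties above can be illustrated with a minimal sketch. It is not the Lingpipe-based pipeline the project used: it swaps in a naive whitespace tokenizer, and the abbreviation list, the `split_sentences` and `parse_page` names, and the node layout are all assumptions made for illustration. It treats a newline as a sentence terminator, skips periods that belong to known abbreviations, and hangs sentences off a tree built from `== Heading ==` markup.

```python
import re

# Hypothetical abbreviation list; a real system would need a much longer one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "st.", "jr.", "e.g.", "i.e."}

# Wikipedia section headings look like "== Title ==" (2 to 6 equals signs).
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def split_sentences(text):
    """Split text into sentences, treating each newline as a terminator."""
    sentences = []
    for line in text.split("\n"):
        current = []
        for token in line.split():
            current.append(token)
            ends = token.endswith((".", "!", "?"))
            if ends and token.lower() not in ABBREVIATIONS:
                sentences.append(" ".join(current))
                current = []
        if current:  # the newline ends the sentence even without a period
            sentences.append(" ".join(current))
    return sentences


def parse_page(title, wikitext):
    """Build a heading/sentence tree; the page itself is the level-1 root."""
    root = {"title": title, "level": 1, "sentences": [], "children": []}
    stack = [root]
    body = []

    def flush():
        # Attach buffered body text to the innermost open heading.
        if body:
            stack[-1]["sentences"].extend(split_sentences("\n".join(body)))
            body.clear()

    for line in wikitext.split("\n"):
        match = HEADING.match(line.strip())
        if match:
            flush()
            level = len(match.group(1))
            node = {"title": match.group(2), "level": level,
                    "sentences": [], "children": []}
            # Close headings at the same or a deeper level.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            body.append(line)
    flush()
    return root
```

For example, `parse_page("Example", "Dr. Smith was born in 1900.\n== Career ==\nHe retired in 1960.")` keeps "Dr." inside the first sentence and files the second sentence under the "Career" node.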
Step 3: Detecting Dates
- Used a set of regular expressions to check for dates
- Difficulties:
  - Deciding which date formats to accept so that date-like constructs that are not dates are rarely matched
- Lessons learned:
  - Regular expressions are easy to control and tune, so use them if possible
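As a rough illustration of the trade-off described above, here is a small sketch of regex-based date detection. The patterns are assumptions, not the project's actual expressions: full dates are accepted freely, while a bare four-digit year is accepted only after a preposition, so that numbers such as "1200 people" are not mistaken for dates.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Ordered from most to least specific; earlier patterns win on overlap.
DATE_PATTERNS = [
    # "July 4, 1776" or "July 4 1776"
    re.compile(rf"\b(?P<month>{MONTHS}) (?P<day>\d{{1,2}}),? (?P<year>\d{{4}})\b"),
    # "4 July 1776"
    re.compile(rf"\b(?P<day>\d{{1,2}}) (?P<month>{MONTHS}) (?P<year>\d{{4}})\b"),
    # "July 1776"
    re.compile(rf"\b(?P<month>{MONTHS}) (?P<year>\d{{4}})\b"),
    # Bare year, accepted only after a preposition to limit false positives.
    re.compile(r"\b(?:in|since|by|until) (?P<year>1\d{3}|20\d{2})\b"),
]


def find_dates(sentence):
    """Return the date fields found in a sentence, in order of appearance."""
    taken = []   # character spans already claimed by a more specific pattern
    found = []
    for pattern in DATE_PATTERNS:
        for match in pattern.finditer(sentence):
            overlaps = any(match.start() < end and start < match.end()
                           for start, end in taken)
            if overlaps:
                continue
            taken.append(match.span())
            found.append((match.start(), match.groupdict()))
    return [fields for _, fields in sorted(found, key=lambda item: item[0])]
```

Tuning then amounts to adding, removing, or tightening entries in `DATE_PATTERNS`. Note that this sketch finds only "1948" in "He had two children by her in 1948 and 1951.", which shows how a conservative pattern trades recall for precision.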
Step 4: Event Summaries
- Given a sentence containing a date, find a short phrase describing it
- We thought this would be done best by training an HMM or CRF to extract the events
- Switched to a much simpler system that uses headings
- Difficulties:
  - Most sentences do not contain good, complete information!
  - E.g. “He had two children by her in 1948 and 1951.”
- Lessons learned:
  - Try basic methods first, then experiment with more elaborate schemes
  - English is hard; pronouns suck
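The slides do not say exactly how headings were turned into summaries, so the following is only one plausible sketch. It assumes each dated sentence is labeled with its page title plus the chain of enclosing headings, which supplies the context that a pronoun-heavy sentence lacks. The `summarize_event` name, the " > " separator, and the length cap are all hypothetical.

```python
def summarize_event(page_title, heading_path, sentence, max_len=80):
    """Label a dated sentence with its page title and enclosing headings.

    heading_path lists the headings from the outermost to the innermost,
    e.g. ["Personal life", "Family"].
    """
    label = " > ".join([page_title] + list(heading_path))
    if len(sentence) > max_len:
        # Truncate long sentences so the summary stays short.
        sentence = sentence[:max_len - 3].rstrip() + "..."
    return {"label": label, "sentence": sentence}
```

With a hypothetical page titled "Example Person", the sentence “He had two children by her in 1948 and 1951.” under the heading "Personal life" would be labeled `Example Person > Personal life`, which at least tells the reader who "he" is.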
Step 5: Ranking Events
- We have more events than anyone would ever want
- Run PageRank at page level using link table
  - Assume highly linked-to pages contain better events
- Count heading levels
  - Assume events at deep subheading levels are less important
- Empirically, first sentence of page often has most useful event
- Difficulties:
  - PageRank can take until the end of time (177 million links)
- Lessons learned:
  - In rare cases, efficiency does matter!
  - Buy more memory
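The ranking signals above can be sketched as follows. The `pagerank` function is the standard power-iteration formulation on an in-memory link dictionary, which is fine for a toy graph but would not scale to 177 million links as written. The `event_score` function, its multiplicative form, and its weights are assumptions made for illustration; the slides do not state how the three signals were combined.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        dangling = 0.0
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages without outgoing links spread their rank evenly.
                dangling += damping * rank[page]
        for page in pages:
            new_rank[page] += dangling / n
        rank = new_rank
    return rank


def event_score(page_rank, heading_depth, is_first_sentence,
                depth_penalty=0.5, first_sentence_boost=2.0):
    """Combine the three signals; the weights are hypothetical."""
    score = page_rank * (depth_penalty ** heading_depth)
    if is_first_sentence:
        score *= first_sentence_boost
    return score
```

For instance, with `links = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B"]}`, page "B" receives the most incoming links and ends up with the highest rank, so its events would outrank equally placed events on the other pages.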
Step 6: Writing the Webapp
- The page makes an asynchronous JS request
- Results are returned in JSON
- Uses Lucene to query for keywords
- Difficulties:
  - Creating a distribution of ranks that makes the “hide/show” experience interesting
  - Creating a good looking site
- Lessons learned:
  - Use JS libraries whenever possible
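One way to think about the rank-distribution difficulty is to bucket events into equally sized tiers before sending them to the browser, so that each "show more" step reveals a similar number of events. The sketch below is an assumption about how that could work, not the project's implementation, and the field names (`date`, `summary`, `score`, `tier`) are hypothetical.

```python
import json


def assign_tiers(events, tiers=3):
    """Sort events by score and split them into equally sized tiers.

    Tier 0 holds the top-ranked events shown by default; higher tiers
    are revealed as the user asks for more.
    """
    ordered = sorted(events, key=lambda event: event["score"], reverse=True)
    result = []
    for index, event in enumerate(ordered):
        tier = index * tiers // len(ordered)
        result.append({**event, "tier": tier})
    return result


def to_json(events, tiers=3):
    """Serialize the tiered events as the JSON body of a response."""
    return json.dumps({"events": assign_tiers(events, tiers)})
```

Tiering by position rather than by raw score keeps the tiers balanced even when scores are heavily skewed, as PageRank-derived scores tend to be.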
Technologies Used
- Mallet: machine learning API
- Lingpipe: linguistic analysis API
- Hibernate: Java persistence layer
- Spring: Java MVC framework
- MySQL + Tomcat + Apache
- Scriptaculous: JS library
- Lucene: Java indexing API
Who Did What
- Brandon: sentence and event extraction, date extraction, Lucene semantics
- Alex: MySQL import, PageRank, webapp, data access layer
- Robert: sentence and event extraction, webapp, data access layer
- Sean: sentence and event extraction, hierarchy parsing
Demo