STEWARD: A Spatio-Textual Document Search Engine for HUDUSER.ORG
Prof. Hanan Samet
Department of Computer Science, University of Maryland, College Park, MD
Joint work of Mike Lieberman and Jagan Sankaranarayanan of UMD and Jon Sperling of HUD PD&R.
STEWARD!
STEWARD is a document search engine, like Google.
The user specifies a search consisting of a keyword and a location specifier.
  – “HUD Housing Projects” – Keyword
  – “El Paso, TX” – Location
It uses a document tagger that identifies geographical locations in English documents (text, DOC, PDF, HTML).
Alternative: Spatio-Textual Extraction on the Web Aiding Retrieval of Documents
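
A keyword-plus-location search can be pictured with a minimal sketch. The class names, the `search` function, and the sample documents are all hypothetical and only illustrate the idea of filtering on both the text and the tagged locations; they do not describe how STEWARD itself is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedDocument:
    """A document plus the locations a tagger attached to it (assumed shape)."""
    title: str
    text: str
    locations: set = field(default_factory=set)


@dataclass
class Query:
    keyword: str   # e.g. "HUD Housing Projects"
    location: str  # e.g. "El Paso, TX"


def search(query, documents):
    """Return documents that contain the keyword and are tagged with the location."""
    keyword = query.keyword.lower()
    return [
        doc for doc in documents
        if keyword in doc.text.lower() and query.location in doc.locations
    ]


# Made-up sample documents, for illustration only.
docs = [
    TaggedDocument("A", "Survey of HUD housing projects in west Texas.", {"El Paso, TX"}),
    TaggedDocument("B", "HUD housing projects in the capital region.", {"College Park, MD"}),
    TaggedDocument("C", "Road maintenance report.", {"El Paso, TX"}),
]
hits = search(Query("HUD Housing Projects", "El Paso, TX"), docs)
print([doc.title for doc in hits])
```

Only a document that satisfies both parts of the query is returned, which is the difference from a plain keyword search.
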
The Document Tagger
Uses a huge corpus of geographical locations
  – 2.06 million locations in the USA and 1.6 million locations around the world
  – Gleaned from GNIS
Uses data mining and document modeling techniques to disambiguate and correctly identify geographical locations in documents.
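
The first step of corpus-based tagging, finding phrases that appear in the corpus of locations, can be sketched as a longest-match dictionary lookup. This is an assumed, simplified approach with a three-entry toy gazetteer; the function name `tag_candidates` is hypothetical, and the data mining and document modeling steps mentioned above are not shown.

```python
import re

# Toy gazetteer: lowercased name -> possible interpretations (illustrative only).
GAZETTEER = {
    "el paso": ["El Paso, TX"],
    "college park": ["College Park, MD"],
    "london": ["London, UK", "London, Ontario"],
}


def tag_candidates(text, gazetteer=GAZETTEER, max_words=3):
    """Scan left to right, preferring the longest phrase found in the gazetteer."""
    tokens = re.findall(r"[A-Za-z]+", text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so "El Paso" wins over a shorter entry.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase.lower() in gazetteer:
                found.append((phrase, gazetteer[phrase.lower()]))
                i += n
                break
        else:
            i += 1  # no gazetteer entry starts at this token
    return found


print(tag_candidates("HUD housing projects in El Paso and College Park"))
```

Each match carries every interpretation the gazetteer knows, so an ambiguous name such as "London" still has to be resolved by a later step.
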
Tagging Issues
Identifying geographical references in text
  – Is “Jefferson” the name of a person or a geographical location?
Disambiguation of a geographical reference
  – “London” in a document can correspond to “London, UK” or “London, Ontario” or to 2570 other geographical entities in our corpus.
Spatial focus of a document
  – Is “Singapore” relevant to a news article about Hurricane Katrina printed in the Singapore Straits Times?
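
Two of these issues can be illustrated with deliberately simple heuristics. The sketch below assumes that each candidate interpretation comes with a few context words (`hints`) and that the spatial focus is the most frequently mentioned resolved location. Both are illustrative assumptions, not the techniques STEWARD uses, and the candidate data is made up.

```python
import re
from collections import Counter

# Toy candidates; the hint words are illustrative assumptions.
CANDIDATES = {
    "London": [
        {"label": "London, UK", "hints": {"uk", "england", "thames"}},
        {"label": "London, Ontario", "hints": {"ontario", "canada"}},
    ],
}


def disambiguate(name, text, candidates=CANDIDATES):
    """Pick the interpretation whose hint words occur most often in the text.

    Ties go to the first candidate, assumed to be listed in order of prominence.
    """
    words = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(candidate):
        return sum(words[hint] for hint in candidate["hints"])

    return max(candidates[name], key=score)["label"]


def spatial_focus(resolved_mentions):
    """Return the location mentioned most often; one list entry per mention."""
    return Counter(resolved_mentions).most_common(1)[0][0]


print(disambiguate("London", "The London plant ships across Ontario and Canada."))
print(spatial_focus(["Singapore", "New Orleans, LA", "New Orleans, LA", "Louisiana"]))
```

Under the frequency heuristic, a single dateline mention of "Singapore" does not outweigh repeated mentions of the place the article is actually about.
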
The STEWARD System
Maps provided by Google Maps
Search results powered by the SAND database system
Available to anyone with an Internet connection and a Web browser
  – E.g., Microsoft Internet Explorer, Firefox, or Mozilla
STEWARD is on the Web at –
STEWARD as a Research Tool
STEWARD could be used for document retrieval, data exploration, and knowledge discovery
Potential users
  – Researchers at HUD
  – Users of HUDUSER.ORG
STEWARD complements the existing search tools at HUDUSER.ORG
Natural Language Cues Research
Named-entity tagging
  – Tags text phrases with the type of information they represent, such as “location,” “organization,” or “person”
  – An improperly trained tagger will produce incorrect entity classifications
Part-of-speech tagging
  – Tags every word with its part of speech
  – Locations tend to be tagged as proper nouns
  – Does not distinguish between locations and people’s names
Other language-related cues
  – Addresses and zip codes
  – City, State combinations
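
The "other language-related cues" lend themselves to a small pattern-matching sketch. The regular expressions, the four-entry state list, and the function name `find_cues` below are assumptions made for illustration; a real tagger would need a full list of state abbreviations and more careful handling of capitalized words at the start of a sentence.

```python
import re

# Small illustrative subset; a full list of state abbreviations is assumed in practice.
STATE_ABBREVIATIONS = {"TX", "MD", "NY", "CA"}

# Five-digit zip code, optionally followed by a four-digit extension.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")

# One to three capitalized words, a comma, then a two-letter abbreviation.
CITY_STATE_RE = re.compile(r"\b((?:[A-Z][a-z]+ ){0,2}[A-Z][a-z]+), ([A-Z]{2})\b")


def find_cues(text):
    """Collect zip codes and "City, ST" pairs whose abbreviation is a known state."""
    city_states = [
        (city, state)
        for city, state in CITY_STATE_RE.findall(text)
        if state in STATE_ABBREVIATIONS
    ]
    return {"zip_codes": ZIP_RE.findall(text), "city_states": city_states}


# Made-up sample sentence, for illustration only.
print(find_cues("The office moved from El Paso, TX 79901 to College Park, MD."))
```

Checking the abbreviation against a known list is what keeps an arbitrary "Word, XX" pair from being treated as a location.
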
Future Work
1. Hidden Web
2. Incorporation into Other Mapping APIs
   a) Google Earth
   b) Microsoft Virtual Earth
3. Full Spatial Query Capabilities à la the SAND Spatial Browser
4. Natural Language Cues
5. Document Meta-language
6. Incorporation of Machine Learning Techniques to Identify Principal Geographic Focus
7. User Interface and Graphics
8. Applications
   a) Other Federal Agencies
   b) News Reading
   c) Common Alerting Protocol (CAP) of USGS for exchanging all-hazard emergency alerts and public warnings
Acknowledgements
HUD PD&R
Digital Government Program at the NSF
University of Maryland
Live Demo