Automatically Annotating Web Pages Using Google Rich Snippets
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
February 4, 2011
Frederik Hogenboom, Flavius Frasincar, Damir Vandic, Jeroen van der Meer, Ferry Boon, Uzay Kaymak
Erasmus University Rotterdam, PO Box 1738, NL-3000 DR Rotterdam, the Netherlands
This talk is based on the paper A Framework for Automatic Annotation of Web Pages Using the Google Rich Snippets Vocabulary. Meer, J. van der, Boon, F., Hogenboom, F.P., Frasincar, F. & Kaymak, U. (2011). In 26th Symposium on Applied Computing (SAC 2011) (pp ). ACM.
Introduction (1)
Semantically annotating Web pages enhances machine interpretation
Google Rich Snippets (RDFa) enable Web page owners to add semantics to their pages
The vocabulary enables interesting applications
Introduction (2)
Automating annotation for static and third-party Web sites is deemed necessary
Hence, we propose the Automatic Review Recognition and annOtation of Web pages (ARROW) framework
Framework (1)
Four main stages:
–Hotspot identification
–Subjectivity analysis
–Information extraction
–Page annotation
Web pages are converted to DOM trees in order to enable easy processing
Framework (2)
[Figure; surviving label: RDFa]
Framework (3): Hotspots
Reviews are characterized by large blocks of text: hotspots
Headers, navigation elements, footers, etc., do not contain these blocks
Text blocks have few HTML elements
For each element in the DOM tree, we compute the text-to-content-ratio (TTCR): $\mathrm{TTCR} = \frac{t}{c} \times 100\%$, with $t$ = # textual characters, and $c$ = total # characters in the DOM element
Framework (4): Hotspots
Illustrative example:
Intel Core i7-975 Extreme And i7-950 Processors Reviewed
Page 1 of 15
The h1 element contains 64/73 × 100% ≈ 88% text
However, the div element merely contains 34/116 × 100% ≈ 29% text due to its span elements
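The ratio can be sketched in a few lines of Python. This is a minimal illustration and not the ARROW implementation (which is Java-based). The function name `ttcr`, the use of the serialized fragment length as the character total, and the choice to count whitespace as text are all assumptions made for this sketch.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the character data found between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def ttcr(fragment: str) -> float:
    """Text-to-content ratio of a serialized element, in percent.

    Assumption: the total is the length of the serialized fragment,
    and whitespace inside the element counts as text.
    """
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()  # flush any buffered character data
    text_chars = len("".join(collector.chunks))
    total_chars = len(fragment)
    return 100.0 * text_chars / total_chars if total_chars else 0.0


# Hypothetical fragments: a heading with little markup, and a pager
# whose span elements push the ratio down.
heading = "<h1>Processors Reviewed</h1>"
pager = '<div>Page <span class="cur">1</span> of <span class="max">15</span></div>'
print(round(ttcr(heading)), round(ttcr(pager)))
```

A hotspot detector along these lines would keep the elements whose ratio exceeds some threshold; the threshold value is not given on the slides.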
Framework (5): Subjectivity
Hotspots are verified as reviews whenever they are subjective enough
We utilize an updated version of the LightWeight subjectivity Detection mechanism (LWD) of Barbosa et al. (2009):
–Original: check if document has ≥ k sentences that contain ≥ n subjectivity words each
–Modification: check if document has ≥ m percent of all sentences that contain ≥ n subjectivity words each
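The difference between the two checks can be made concrete with a short Python sketch. Everything here apart from the parameters k, m, and n is an assumption: the tiny word list `SUBJECTIVITY_WORDS` is hypothetical (the lexicon used by ARROW is not given on the slides), and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical lexicon, for illustration only.
SUBJECTIVITY_WORDS = {"love", "excellent", "great", "poor", "terrible", "disappointing"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_subjective_sentences(sentences, n, lexicon=SUBJECTIVITY_WORDS):
    """Number of sentences containing at least n subjectivity words."""
    count = 0
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        if sum(1 for w in words if w in lexicon) >= n:
            count += 1
    return count


def is_subjective_original(text, k, n):
    """Original LWD check: at least k qualifying sentences."""
    return count_subjective_sentences(split_sentences(text), n) >= k


def is_subjective_modified(text, m, n):
    """Modified check: at least m percent of all sentences qualify."""
    sentences = split_sentences(text)
    if not sentences:
        return False
    share = 100.0 * count_subjective_sentences(sentences, n) / len(sentences)
    return share >= m


sample = "I love this excellent processor. It has four cores. The price is disappointing."
print(is_subjective_original(sample, k=2, n=1), is_subjective_modified(sample, m=50, n=1))
```

The percentage-based variant scales with document length, so a short review is not rejected merely because it has fewer than k sentences in total.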
Framework (6): IE
Various information is extracted:
–Authors: Named entities are detected in the vicinity of hotspots, using a Named Entity Recognizer (NER)
–Dates: Many different date formats are easily parsed using regular expressions, e.g., (\w)\s(\d{1,2})(th|,)?\s(\d{2,4}) for MM dd yyyy
–Products: Name often found in title and h1 elements, identified through overlapping words
–Ratings: Many formats, e.g., images (90%), which can be numerical (80%), descriptors (15%), or letters (5%). We focus on numerical ratings, using regular expressions on plain text or alt text of images, e.g., ([0-9.,]+)\s?/\s?([0-9.,]+) for 4/5
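As a rough illustration of the regular-expression step, the sketch below applies the two patterns from this slide in Python. It makes several assumptions: the date pattern uses `(\w+)` instead of `(\w)` so that the whole month name is captured, only English month names and four-digit years are accepted, and the helper names `extract_date` and `extract_rating` are hypothetical.

```python
import re
from datetime import date

# Date pattern from the slide, with (\w) widened to (\w+) (assumption).
DATE_PATTERN = re.compile(r"(\w+)\s(\d{1,2})(th|,)?\s(\d{2,4})")
# Rating pattern from the slide, unchanged.
RATING_PATTERN = re.compile(r"([0-9.,]+)\s?/\s?([0-9.,]+)")

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def extract_date(text):
    """Return the first 'Month dd yyyy' style date found, or None."""
    for match in DATE_PATTERN.finditer(text):
        month_name, day, _suffix, year = match.groups()
        # Assumption: skip two- and three-digit years in this sketch.
        if month_name.lower() in MONTHS and len(year) == 4:
            try:
                return date(int(year), MONTHS.index(month_name.lower()) + 1, int(day))
            except ValueError:
                continue  # e.g. "February 31 2008"
    return None


def _to_float(token):
    """Parse '4', '4.5' or '4,5', ignoring trailing punctuation."""
    return float(token.strip(".,").replace(",", "."))


def extract_rating(text):
    """Return the first 'value/maximum' rating as a fraction in [0, 1], or None."""
    for match in RATING_PATTERN.finditer(text):
        try:
            value, best = (_to_float(group) for group in match.groups())
        except ValueError:
            continue
        if best > 0 and value <= best:
            return value / best
    return None


print(extract_date("Posted on December 18, 2008 by Sarah Lee"))
print(extract_rating("Rating: 4/5"))
```

Normalizing the rating to a fraction is a design choice of this sketch. The rating pattern also matches slash-separated dates such as 12/18/2008, so a real extractor would need additional context checks.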
Framework (7): Annotation
Key elements are tagged using Google Rich Snippets
A new annotated Web page is returned
<div xmlns:v=" typeof="v:Review"> Tango Hotel Taichung Sarah Lee 4 stars 18th December 2008 Boutique like hotel without the boutique price
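To show what such an annotation might look like, the sketch below builds a small RDFa review fragment in Python from the fields in the example. It rests on assumptions: the namespace URI is truncated on the slide and is taken here from the Google Rich Snippets RDFa vocabulary, the function `annotate_review` is hypothetical, and ARROW itself annotates elements of the existing page rather than generating a fresh fragment.

```python
from html import escape

# Assumption: the namespace is elided on the slide; this is the URI
# used by the Google Rich Snippets RDFa vocabulary.
RDFA_NAMESPACE = "http://rdf.data-vocabulary.org/#"


def annotate_review(item, reviewer, rating, date_iso, date_text, summary):
    """Return an RDFa-annotated review fragment (illustrative only)."""
    return "\n".join([
        f'<div xmlns:v="{RDFA_NAMESPACE}" typeof="v:Review">',
        f'  <span property="v:itemreviewed">{escape(item)}</span>',
        f'  <span property="v:reviewer">{escape(reviewer)}</span>',
        f'  <span property="v:rating">{escape(str(rating))}</span> stars',
        f'  <span property="v:dtreviewed" content="{escape(date_iso)}">{escape(date_text)}</span>',
        f'  <span property="v:summary">{escape(summary)}</span>',
        "</div>",
    ])


print(annotate_review(
    item="Tango Hotel Taichung",
    reviewer="Sarah Lee",
    rating=4,
    date_iso="2008-12-18",
    date_text="18th December 2008",
    summary="Boutique like hotel without the boutique price",
))
```

Escaping every extracted value keeps the output well-formed even when a product name or summary contains characters such as `&` or `<`.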
Implementation (1)
We have implemented the ARROW framework as a Web application:
–Java-based
–Apache Tomcat server
Input:
–URL
–Preferred output: visualizer or annotated document
Implementation (2)
Evaluation
Test set: 100 review and 100 non-review Web pages
Sub-second performance
Precision and specificity are good (both approximately 90%), while accuracy and recall vary (approximately 40% to 60%)
Main problems are related to detecting authors, likely caused by the use of nicknames
Dependency on Web site structures
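For reference, the four reported measures follow from the standard confusion-matrix definitions, sketched below in Python. The counts in the usage line are toy values chosen purely for illustration; they are not the results of the ARROW evaluation, and the function name `evaluation_metrics` is hypothetical.

```python
def _ratio(numerator, denominator):
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluation_metrics(tp, fp, fn, tn):
    """Standard measures from true/false positive/negative counts."""
    return {
        "precision": _ratio(tp, tp + fp),      # correct among predicted positives
        "recall": _ratio(tp, tp + fn),         # found among actual positives
        "specificity": _ratio(tn, tn + fp),    # correct among actual negatives
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
    }


# Toy counts for illustration only (not the paper's results).
print(evaluation_metrics(tp=3, fp=1, fn=2, tn=4))
```

High precision and specificity combined with lower recall indicate few false positives but a larger number of missed items.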
Conclusions
We presented ARROW, a framework for automatically annotating reviews with Google Rich Snippets
The framework is not bound to a specific vocabulary
Proof-of-concept implementation shows promising results
Future work:
–Improve heuristics
–Add intelligent (semantically enabled) text parsers
–Extend to other domains, e.g., recipes and videos
Questions