Automatically Annotating Web Pages Using Google Rich Snippets 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011) February 4, 2011 Frederik Hogenboom.

Frederik Hogenboom fhogenboom@ese.eur.nl
Flavius Frasincar frasincar@ese.eur.nl
Damir Vandic vandic@ese.eur.nl
Jeroen van der Meer jeroenvdmeer@gmail.com
Ferry Boon ferry.boon@gmail.com
Uzay Kaymak kaymak@ese.eur.nl
Erasmus University Rotterdam, PO Box 1738, NL-3000 DR Rotterdam, the Netherlands
This talk is based on the paper: Meer, J. van der, Boon, F., Hogenboom, F.P., Frasincar, F. & Kaymak, U. (2011). A Framework for Automatic Annotation of Web Pages Using the Google Rich Snippets Vocabulary. In 26th Symposium on Applied Computing (SAC 2011) (pp. 763-770). ACM.

Introduction (1)
Semantically annotating Web pages enhances machine interpretation
Google Rich Snippets (RDFa) enable Web page owners to add semantics to their pages
The vocabulary enables interesting applications

Introduction (2)
Automating annotation for static and 3rd-party Web sites is deemed necessary
Hence, we propose the Automatic Review Recognition and annOtation of Web pages (ARROW) framework

Framework (1)
Four main stages:
  – Hotspot identification
  – Subjectivity analysis
  – Information extraction
  – Page annotation
Web pages are converted to DOM trees in order to enable easy processing

Framework (2)
[Figure: framework overview; label: RDFa]

Framework (3): Hotspots
Reviews are characterized by large blocks of text: hotspots
Headers, navigation elements, footers, etc., do not contain these blocks
Text blocks have few HTML elements
For each element in the DOM tree, we compute the text-to-content ratio (TTCR):
  TTCR = (# textual characters) / (total # characters in DOM) × 100%

Framework (4): Hotspots
Illustrative example (text content of the markup):
  Intel Core i7-975 Extreme And i7-950 Processors Reviewed
  Page 1 of 15
The h1 element contains 64/73 × 100% ≈ 88% text
However, the div element merely contains 34/116 × 100% ≈ 29% text due to its span elements
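
The ratio can be sketched in a few lines of Python. This is a minimal illustration, not the ARROW implementation (which is Java-based): it assumes well-formed markup that `xml.etree.ElementTree` can parse, it measures "total characters" as the length of the serialized element, and the function names, the thresholds `min_ratio` and `min_text_chars`, and the sample markup are all made up for the example. The character counts therefore differ from the 64/73 and 34/116 on the slide.

```python
import copy
import xml.etree.ElementTree as ET


def ttcr(element: ET.Element) -> float:
    """Percentage of an element's serialized characters that are text."""
    # Serialize a shallow copy without its tail, so text that follows the
    # closing tag is not counted as part of this element.
    clone = copy.copy(element)
    clone.tail = None
    total_chars = len(ET.tostring(clone, encoding="unicode"))
    text_chars = len("".join(element.itertext()))
    return 100.0 * text_chars / total_chars if total_chars else 0.0


def find_hotspots(root, min_ratio=70.0, min_text_chars=40):
    """Return (tag, ratio) for elements that look like large text blocks.

    Both thresholds are assumptions chosen for this sketch.
    """
    hotspots = []
    for element in root.iter():
        text_chars = len("".join(element.itertext()))
        ratio = ttcr(element)
        if ratio >= min_ratio and text_chars >= min_text_chars:
            hotspots.append((element.tag, round(ratio, 1)))
    return hotspots


# Hypothetical markup in the spirit of the slide's example.
page = ET.fromstring(
    "<body>"
    "<h1>Intel Core i7-975 Extreme And i7-950 Processors Reviewed</h1>"
    '<div><span class="a">Page</span> <span class="b">1</span> of 15</div>'
    "</body>"
)
for child in page:
    print(child.tag, round(ttcr(child), 1))
print(find_hotspots(page))
```

As in the slide's example, the heading scores much higher than the pagination block, because the nested span elements add markup characters without adding text.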

Framework (5): Subjectivity
Hotspots are verified as reviews whenever they are subjective enough
We utilize an updated version of the LightWeight subjectivity Detection mechanism (LWD) of Barbosa et al. (2009):
  – Original: check if document has ≥ k sentences that contain ≥ n subjectivity words each
  – Modification: check if ≥ m percent of all sentences in the document contain ≥ n subjectivity words each
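
The difference between the two checks can be sketched as follows. The word list, the naive sentence splitter, and the function names are assumptions made for this example; the slides do not specify the lexicon or the values of k, m, and n.

```python
import re

# Hypothetical mini-lexicon; the actual subjectivity word list is not given.
SUBJECTIVITY_WORDS = {
    "great", "terrible", "love", "hate",
    "excellent", "poor", "disappointing", "amazing",
}


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption for this sketch)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def count_subjective_sentences(text, n):
    """Return (# sentences with >= n subjectivity words, # sentences)."""
    sentences = split_sentences(text)
    subjective = 0
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        if sum(word in SUBJECTIVITY_WORDS for word in words) >= n:
            subjective += 1
    return subjective, len(sentences)


def is_subjective_original(text, k, n):
    """Original check: at least k sentences are subjective."""
    subjective, _ = count_subjective_sentences(text, n)
    return subjective >= k


def is_subjective_modified(text, m, n):
    """Modified check: at least m percent of all sentences are subjective."""
    subjective, total = count_subjective_sentences(text, n)
    return total > 0 and 100.0 * subjective / total >= m


review = "I love this processor. The performance is excellent! It ships in a blue box."
specs = "The chip has four cores. It uses the LGA 1366 socket."
print(is_subjective_modified(review, m=50, n=1))
print(is_subjective_modified(specs, m=50, n=1))
```

A relative threshold makes the outcome independent of document length: a long page with only a handful of subjective sentences can pass an absolute count of k but not a percentage m.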

Framework (6): IE
Various information is extracted:
  – Authors: Named entities are detected in the vicinity of hotspots
      Named Entity Recognizer (NER)
  – Dates: Many different date formats are easily parsed
      Regular expressions, e.g., (\w)\s(\d{1,2})(th|,)?\s(\d{2,4}) for MM dd yyyy
  – Products: Name often found in title and h1 elements
      Overlapping words
  – Ratings: Many formats, e.g., images (90%), which can be numerical (80%), descriptors (15%), or letters (5%)
      We focus on numerical ratings
      Regular expressions on plain text or alt text of images, e.g., ([0-9.,]+)\s?/\s?([0-9.,]+) for 4/5
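
The two regular expressions from the slide can be tried out directly, and the "overlapping words" heuristic for product names is easy to approximate. The sketch below uses the patterns verbatim; the helper names, the normalization of ratings to a fraction, and the sample strings are assumptions for illustration only.

```python
import re

# Patterns copied verbatim from the slide.
DATE_PATTERN = re.compile(r"(\w)\s(\d{1,2})(th|,)?\s(\d{2,4})")
RATING_PATTERN = re.compile(r"([0-9.,]+)\s?/\s?([0-9.,]+)")


def extract_day_and_year(text):
    """Return (day, year) from the first date-like match, or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(2)), int(match.group(4))


def extract_rating(text):
    """Return a rating such as '4/5' as a fraction of the maximum, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    try:
        # Accept a decimal comma as well as a decimal point (an assumption).
        score, best = (float(g.replace(",", ".")) for g in match.groups())
    except ValueError:
        return None  # e.g. the pattern matched punctuation only
    return score / best if best else None


def overlapping_words(title, heading):
    """Words of the title that also occur in the h1 heading, in title order."""
    heading_words = {word.lower() for word in heading.split()}
    return [word for word in title.split() if word.lower() in heading_words]


print(extract_day_and_year("Posted on December 18th 2008"))
print(extract_rating("Rating: 4/5"))
print(overlapping_words("Intel Core i7-975 Review - TechSite",
                        "Intel Core i7-975 Extreme Reviewed"))
```

As printed on the slide, the first group of the date pattern, `(\w)`, captures only the last character of the word before the day; `(\w+)` would capture the whole month name.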

Framework (7): Annotation
Key elements are tagged using Google Rich Snippets
A new annotated Web page is returned
Example:
  <div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
  Tango Hotel Taichung
  Sarah Lee
  4 stars
  18th December 2008
  Boutique like hotel without the boutique price
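
To make the annotation step concrete, the sketch below wraps extracted fields in RDFa spans. It is an assumption-laden illustration: the property names (`v:itemreviewed`, `v:reviewer`, `v:rating`, `v:dtreviewed`, `v:summary`) come from the data-vocabulary.org Review vocabulary rather than from the slide, whose inner markup is not preserved, and the function `annotate_review` is hypothetical. It also builds a standalone snippet, whereas the framework tags elements inside the existing page.

```python
from html import escape


def annotate_review(item, reviewer, rating, date_text, summary):
    """Build a standalone RDFa review snippet from extracted fields."""
    # Property names are assumed from the data-vocabulary.org Review vocabulary.
    fields = [
        ("v:itemreviewed", item),
        ("v:reviewer", reviewer),
        ("v:rating", rating),
        ("v:dtreviewed", date_text),
        ("v:summary", summary),
    ]
    spans = "\n".join(
        f'  <span property="{prop}">{escape(str(value))}</span>'
        for prop, value in fields
    )
    return (
        '<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">\n'
        + spans
        + "\n</div>"
    )


snippet = annotate_review(
    item="Tango Hotel Taichung",
    reviewer="Sarah Lee",
    rating=4,
    date_text="18th December 2008",
    summary="Boutique like hotel without the boutique price",
)
print(snippet)
```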

Implementation (1)
We have implemented the ARROW framework as a Web application:
  – Java-based
  – Apache Tomcat server
Input:
  – URL
  – Preferred output:
      Visualizer
      Annotated document

Implementation (2)

Evaluation
Test set: 100 review and 100 non-review Web pages
Sub-second performance
Precision and specificity are good (both approximately 90%), while accuracy and recall vary (approximately 40% to 60%)
Main problems relate to detecting authors, likely caused by the use of nicknames
Dependency on Web site structures
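
For reference, the four measures named above have standard definitions in terms of true/false positives and negatives. The sketch below applies those textbook formulas; the slides do not say at which level (page or extracted element) the measures were computed, and the counts in the example are invented and are not the paper's results.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluation_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix measures as fractions between 0 and 1."""
    return {
        "precision": safe_ratio(tp, tp + fp),      # correct among reported
        "recall": safe_ratio(tp, tp + fn),         # found among actual positives
        "specificity": safe_ratio(tn, tn + fp),    # rejected among actual negatives
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
    }


# Invented counts, for illustration only.
metrics = evaluation_metrics(tp=45, fp=5, tn=95, fn=55)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The example shows how high precision and specificity can coexist with low recall: few false positives are reported, but many true instances are missed.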

Conclusions
We presented ARROW, a framework for automatically annotating reviews with Google Rich Snippets
Framework not bound to vocabulary
Proof-of-concept implementation shows promising results
Future work:
  – Improve heuristics
  – Add intelligent (semantically enabled) text parsers
  – Extend to other domains, e.g., recipes, videos, etc.

Questions
http://www.arrow-project.com/
