Automatically Annotating Web Pages Using Google Rich Snippets 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011) February 4, 2011 Frederik Hogenboom.

Presentation transcript:

Automatically Annotating Web Pages Using Google Rich Snippets
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011), February 4, 2011
Frederik Hogenboom, Flavius Frasincar, Damir Vandic, Jeroen van der Meer, Ferry Boon, Uzay Kaymak
Erasmus University Rotterdam, PO Box 1738, NL-3000 DR Rotterdam, the Netherlands
This talk is based on the paper: Meer, J. van der, Boon, F., Hogenboom, F.P., Frasincar, F. & Kaymak, U. (2011). A Framework for Automatic Annotation of Web Pages Using the Google Rich Snippets Vocabulary. In 26th Symposium on Applied Computing (SAC 2011) (pp ). ACM.

Introduction (1)
Semantically annotating Web pages enhances machine interpretation
Google Rich Snippets (RDFa) enable Web page owners to add semantics to their pages
The vocabulary enables interesting applications

Introduction (2)
Automating annotation for static and third-party Web sites is deemed necessary
Hence, we propose the Automatic Review Recognition and annOtation of Web pages (ARROW) framework

Framework (1)
Four main stages:
– Hotspot identification
– Subjectivity analysis
– Information extraction
– Page annotation
Web pages are converted to DOM trees in order to enable easy processing

Framework (2)
RDFa

Framework (3): Hotspots
Reviews are characterized by large blocks of text: hotspots
Headers, navigation elements, footers, etc., do not contain these blocks
Text blocks have few HTML elements
For each element in the DOM tree, we compute the text-to-content ratio (TTCR):
TTCR = (# textual characters / total # characters) × 100%
where the numerator is the # of textual characters and the denominator is the total # of characters in the DOM

Framework (4): Hotspots
Illustrative example:
The h1 element contains 64/73 × 100% ≈ 88% text
However, the div element merely contains 34/116 × 100% ≈ 29% text due to its span elements
Example page fragment (text only):
Intel Core i7-975 Extreme And i7-950 Processors Reviewed
Page 1 of 15
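The ratio described on the hotspot slides can be sketched in a few lines of Python. This is a minimal illustration, not the ARROW implementation (which the slides describe as Java-based): it assumes that the total character count is the length of an element's serialized HTML and that textual characters are whatever remains after tags are stripped with a simple regular expression. The function name `ttcr` and the sample strings are hypothetical.

```python
import re

# Assumption: anything between '<' and '>' counts as markup, the rest as text.
TAG_RE = re.compile(r"<[^>]*>")


def ttcr(element_html: str) -> float:
    """Return the text-to-content ratio of a serialized element, in percent."""
    total = len(element_html)
    if total == 0:
        return 0.0
    text = TAG_RE.sub("", element_html)
    return 100.0 * len(text) / total


# A text-heavy element scores high; nested markup pulls the ratio down.
heading = "<h1>A long review title with hardly any markup</h1>"
pager = '<div><span class="a">Page</span> <span class="b">1</span></div>'
print(round(ttcr(heading), 1), round(ttcr(pager), 1))
```

Elements whose ratio exceeds some threshold would then be treated as hotspot candidates; the slides do not state the threshold, so none is assumed here.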

Framework (5): Subjectivity
Hotspots are verified as reviews whenever they are subjective enough
We utilize an updated version of the LightWeight subjectivity Detection mechanism (LWD) of Barbosa et al. (2009):
– Original: check if document has ≥ k sentences that contain ≥ n subjectivity words each
– Modification: check if document has ≥ m percent of all sentences that contain ≥ n subjectivity words each
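The modified check can be illustrated with a short Python sketch. It assumes a naive sentence splitter, a tiny made-up lexicon of subjectivity words, and arbitrary default values for n and m; none of these come from the slides, and the real mechanism may tokenize and score differently.

```python
import re


def is_subjective(text: str, lexicon: set, n: int = 2, m: float = 30.0) -> bool:
    """True if at least m percent of sentences hold at least n lexicon words."""
    # Naive split on whitespace that follows '.', '!' or '?' (an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return False

    def subjective_words(sentence: str) -> int:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(1 for w in words if w in lexicon)

    hits = sum(1 for s in sentences if subjective_words(s) >= n)
    return 100.0 * hits / len(sentences) >= m


# Hypothetical lexicon and text, for illustration only.
LEXICON = {"great", "awful", "love", "terrible"}
SAMPLE = ("I love this great phone. It has a screen. "
          "The battery is awful and terrible. It weighs 100 grams.")
print(is_subjective(SAMPLE, LEXICON, n=2, m=50.0))
```

Using a percentage instead of an absolute count k makes the decision independent of document length, which is presumably the motivation for the modification.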

Framework (6): IE
Various information is extracted:
– Authors: Named entities are detected in the vicinity of hotspots, using a Named Entity Recognizer (NER)
– Dates: Many different date formats are easily parsed with regular expressions, e.g., (\w)\s(\d{1,2})(th|,)?\s(\d{2,4}) for MM dd yyyy
– Products: Name often found in title and h1 elements, identified through overlapping words
– Ratings: Many formats, e.g., images (90%), which can be numerical (80%), descriptors (15%), or letters (5%). We focus on numerical ratings, using regular expressions on plain text or alt text of images, e.g., ([0-9.,]+)\s?/\s?([0-9.,]+) for 4/5
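The two regular expressions on this slide can be tried out as follows. The rating pattern is used exactly as shown. For dates, this sketch deliberately uses `(\w+)` instead of the slide's `(\w)` so that the whole month word is captured; that change is an assumption, as are the function names and the decimal-comma handling.

```python
import re

# Rating pattern as given on the slide, e.g. "4/5" or "8,5 / 10".
RATING_RE = re.compile(r"([0-9.,]+)\s?/\s?([0-9.,]+)")
# Date pattern adapted from the slide; (\w+) instead of (\w) is an assumption.
DATE_RE = re.compile(r"(\w+)\s(\d{1,2})(th|,)?\s(\d{2,4})")


def extract_rating(text: str):
    """Return (value, best) for the first 'x/y' rating found, else None."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    try:
        value = float(match.group(1).replace(",", "."))
        best = float(match.group(2).replace(",", "."))
    except ValueError:
        return None  # e.g. stray punctuation that is not a number
    return value, best


def extract_date(text: str):
    """Return (month_token, day, year) for the first match, else None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(4))


print(extract_rating("Rating: 4/5"))
print(extract_date("Posted December 18th 2008 by Sarah"))
```

The same rating function could be applied to the alt text of rating images, which is where the slide says most numerical ratings are found.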

Framework (7): Annotation
Key elements are tagged using Google Rich Snippets
A new annotated Web page is returned
Example (markup partially lost):
<div xmlns:v=" typeof="v:Review"> Tango Hotel Taichung Sarah Lee 4 stars 18th December 2008 Boutique like hotel without the boutique price
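As a rough illustration of the annotation step, the sketch below wraps extracted values in RDFa markup. The slide's own markup is truncated, so the namespace URI and the property names (`v:itemreviewed`, `v:reviewer`, `v:rating`, `v:dtreviewed`, `v:summary`) are assumptions based on the data-vocabulary.org review vocabulary that Google Rich Snippets used at the time; the function name is hypothetical.

```python
from html import escape

# Assumed namespace of the Rich Snippets RDFa vocabulary (not in the slide).
NS = "http://rdf.data-vocabulary.org/#"


def annotate_review(item, reviewer, rating, date_reviewed, summary) -> str:
    """Build an RDFa-annotated review block from extracted values."""
    fields = [
        ("v:itemreviewed", item),
        ("v:reviewer", reviewer),
        ("v:rating", rating),
        ("v:dtreviewed", date_reviewed),
        ("v:summary", summary),
    ]
    spans = "\n".join(
        f'  <span property="{prop}">{escape(str(value))}</span>'
        for prop, value in fields
    )
    return f'<div xmlns:v="{NS}" typeof="v:Review">\n{spans}\n</div>'


print(annotate_review(
    "Tango Hotel Taichung", "Sarah Lee", "4", "18th December 2008",
    "Boutique like hotel without the boutique price",
))
```

In the framework itself the tags are added to the existing page rather than generated as a separate block; this sketch only shows the shape of the resulting markup.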

Implementation (1)
We have implemented the ARROW framework as a Web application:
– Java-based
– Apache Tomcat server
Input:
– URL
– Preferred output: visualizer or annotated document

Implementation (2)

Evaluation
Test set: 100 review and 100 non-review Web pages
Sub-second performance
Precision and specificity are good (both ± 90%), while accuracy and recall vary (± 40% – 60%)
Main problems are related to detecting authors, likely caused by the use of nicknames
Dependency on Web site structures
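For reference, the four measures named on this slide follow the standard confusion-matrix definitions, sketched below. The counts in the example call are invented purely to exercise the function and are not results from the evaluation.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard measures for a binary review / non-review classifier."""

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),      # detected reviews that are real
        "recall": ratio(tp, tp + fn),         # real reviews that are detected
        "specificity": ratio(tn, tn + fp),    # non-reviews correctly rejected
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Made-up counts, for illustration only.
print(detection_metrics(tp=3, fp=1, tn=4, fn=2))
```

High precision and specificity combined with lower recall, as reported above, would correspond to few false positives but a larger number of missed reviews.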

Conclusions
We presented ARROW, a framework for automatically annotating reviews with Google Rich Snippets
Framework not bound to vocabulary
Proof-of-concept implementation shows promising results
Future work:
– Improve heuristics
– Add intelligent (semantically enabled) text parsers
– Extend to other domains, e.g., recipes, videos, etc.

Questions