Information Extraction and Ontology Learning Guided by Web Directory. Authors: Martin Kavalec, Vojtěch Svátek. Presenter: Mark Vickers.

Presentation transcript:

Outline
Introduction
  – Mining Indicator Terms
  – Integrating Rainbow
  – Ontological Analysis of Web Directories
  – IE and Ontology Learning
Future Work
Related Work
Assessment

Introduction
Goal: “…to extract information about (mostly generic) products, services and areas of competence of companies, from the free text chunks embedded in web presentations.”
Taking advantage of:
  – Collections of extraction patterns
  – Ontologies of problem domains
Approach: Combine Information Extraction with Ontologies
  – Ontologies can improve the quality of IE
  – Extracted information can improve/extend ontologies
  – Bootstrapping

Introduction
Uses the Open Directory as a source of:
  – Labeled training data
  – Lightweight ontologies
“The Open Directory Project is the largest, most comprehensive human-edited directory of the Web.”

Mining Indicator Terms
Informative terms = generic names of products
Indicator terms = terms situated near informative terms
  – Examples: ‘our assortment includes…’, ‘in our shop you can buy…’
Assumption: Directory headings coincide with informative terms
Purpose: Generate extraction patterns based on indicator terms
The authors use deeper linguistic techniques

Mining Indicator Terms
Example: …/Manufacturing/Materials/Metals/Steel/… (the headings are the informative terms)
  – Match headings with text pages to find sentences containing informative terms
  – Grab nearby words as indicator terms
  – Generate extraction patterns from indicator terms
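The matching steps on this slide can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names, the regex-based sentence splitter and the fixed word window are assumptions, and the window helper handles single-word terms only.

```python
import re

def heading_terms(path):
    """Split a directory path into lower-cased heading terms (assumed informative)."""
    parts = [p for p in path.strip("/").split("/") if p and p != "…"]
    return [p.replace("_", " ").lower() for p in parts]

def sentences_with_terms(text, terms):
    """Return (sentence, term) pairs where a heading term occurs as a whole word."""
    # Naive sentence split on end punctuation; a real system would use a parser.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = []
    for sentence in sentences:
        for term in terms:
            if re.search(r"\b" + re.escape(term) + r"\b", sentence.lower()):
                hits.append((sentence, term))
    return hits

def nearby_words(sentence, term, window=3):
    """Collect up to `window` words on each side of a single-word term."""
    words = re.findall(r"[a-z]+", sentence.lower())
    if term not in words:
        return []
    i = words.index(term)
    return words[max(0, i - window):i] + words[i + 1:i + 1 + window]

terms = heading_terms("…/Manufacturing/Materials/Metals/Steel/…")
text = "We are a family business. Our assortment includes steel and copper pipes."
for sentence, term in sentences_with_terms(text, terms):
    print(term, nearby_words(sentence, term))
```

Candidate indicator terms would then be picked from the collected neighbours; the source does this with a parser rather than a flat word window.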

Mining Indicator Terms
Choosing Indicator Terms
  – Syntactic analysis: Link Grammar Parser
  – Choose the verbs occurring closest in the parse tree to the informative word
  – Arrange the verbs into a frequency table
  – Order by the ratio of frequency near an informative term to frequency in general
  – Choose the 8 most promising verbs
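The ordering step can be illustrated with a small sketch. It assumes the verb counts have already been collected (the parsing step is not shown), and it normalizes both counts to relative frequencies before taking the ratio, which is one plausible reading of the slide rather than a confirmed detail. The function name, parameters and example counts are hypothetical.

```python
from collections import Counter

def rank_indicators(near_counts, general_counts, top_n=8):
    """Rank verbs by (relative frequency near informative terms) /
    (relative frequency in general text); higher means more specific."""
    near_total = sum(near_counts.values())
    general_total = sum(general_counts.values())
    scored = []
    for verb, near in near_counts.items():
        general = general_counts.get(verb, 0)
        if general == 0:
            continue  # no general count available, ratio undefined
        ratio = (near / near_total) / (general / general_total)
        scored.append((verb, ratio))
    # Highest ratio first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

# Made-up counts, for illustration only.
near = Counter({"offer": 6, "be": 3, "include": 1})
general = Counter({"offer": 10, "be": 80, "include": 10})
for verb, ratio in rank_indicators(near, general):
    print(f"{verb}: {ratio:.2f}")
```

With these made-up counts a common verb such as "be" falls to the bottom even though it appears near informative terms fairly often, which is the point of dividing by the general frequency.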

Mining Indicator Terms
Preliminary Testing
  – Sampled 14,500 sentences containing heading terms
  – Randomly chose 130 sentences with indicators
  – Manually labeled them to estimate whether an informative term was present
Example: “We are equipped to run any grade of corrugated from E-flute to Triplewall, including all government grades.”

Mining Indicator Terms
Preliminary Test Results

                 Coverage
  Non-Filtered   10 – 20 %
  Pre-Filtered   70 – 80 %

Integration into Rainbow
RAINBOW (Reusable Architecture for INtelligent Brokering Of Web information access)
  – Web Analysis Tasks:
      Sentence Extraction
      Explicit Metadata
      HTML Structure*
      Inline Image*
      Link Topology Structure*
      Page Similarity
  – Internal Communication: based on SOAP
  – Will use ontologies for verifying the semantic consistency of web services provided within the distributed system

Integration into Rainbow
Rainbow will help solve the “coverage” problem of directory links pointing to ‘barren’ pages
  – Using analysis of:
      Keywords and HTML structure on start-up pages
      URLs of embedded links
  – The Metadata Extractor will be navigated towards promising pages
  – For example, looking for ‘about-us’ or ‘profile’ pages to find more syntactically correct text
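A minimal sketch of the link heuristic described on this slide, assuming links arrive as (URL, anchor text) pairs already pulled from a start-up page. The hint lists and the function name are illustrative assumptions; only ‘about-us’ and ‘profile’ come from the slide, and this is not Rainbow's actual interface.

```python
from urllib.parse import urlparse

# Hypothetical hint lists; only 'about-us' and 'profile' appear in the source.
URL_HINTS = ("about-us", "about_us", "aboutus", "profile")
TEXT_HINTS = ("about us", "profile")

def promising_links(links):
    """Keep the URLs whose path or anchor text suggests a company-profile page."""
    selected = []
    for url, anchor_text in links:
        path = urlparse(url).path.lower()
        text = anchor_text.lower()
        if any(h in path for h in URL_HINTS) or any(h in text for h in TEXT_HINTS):
            selected.append(url)
    return selected

links = [
    ("http://example.com/products.html", "Products"),
    ("http://example.com/about-us.html", "Who we are"),
    ("http://example.com/c.html", "Company profile"),
]
print(promising_links(links))
```

Checking both the URL path and the anchor text means a page can still be found when only one of the two carries the hint.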

Ontological Analysis of Web Directories
Terms and phrases in a single heading belong to a small set of classes
Parent-child relations belong to particular classes corresponding to ‘deep’ ontological relations
  - Industries
    - Construction_and_Maintenance
      - Materials_and_supplies
        - Masonry_and_Stone
          - Natural_Stone
            - International_Sources
              - Mexico

Ontological Analysis of Web Directories
Meta-ontology of directory headings (figure labels: Class, Named Relations, Class-subclass Relations, Reflexive Binary Relations)

Ontological Analysis of Web Directories Interpretation Rules

IE and Ontology Learning
Extracting with plain indicator terms and simple heuristics works
But even better:
  – Learn indicators for each class
  – Use ontology analysis to classify indicators found
  – Fill in database templates: true IE
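As a rough illustration of extraction with plain indicator terms and a simple heuristic, the sketch below takes whatever follows an indicator phrase up to the end of the clause and splits it into candidate items. The two indicator phrases are the examples quoted earlier in the talk; the splitting rule and the function name are assumptions, not the authors' heuristics.

```python
import re

# Example indicator phrases quoted in the talk; a learned list would replace these.
INDICATORS = ("our assortment includes", "in our shop you can buy")

def extract_candidates(sentence, indicators=INDICATORS):
    """Return candidate informative terms that follow the first matching indicator."""
    lowered = sentence.lower()  # assumes lower-casing keeps character offsets aligned
    for indicator in indicators:
        pos = lowered.find(indicator)
        if pos == -1:
            continue
        tail = sentence[pos + len(indicator):]
        # Stop at the end of the clause, then split the enumeration.
        tail = re.split(r"[.;!?]", tail, maxsplit=1)[0]
        items = re.split(r",|\band\b", tail)
        return [item.strip() for item in items if item.strip()]
    return []

print(extract_candidates("Our assortment includes steel pipes, copper wire and fittings."))
```

Class-specific indicators, as proposed on the slide, would let each extracted item be assigned to a template slot instead of a flat list.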

IE and Ontology Learning
Closed Loop Strategy (figure labels):
  – Human classifies directory headings (WordNet)
  – Classify headings
  – Learn class-specific indicators

Future Work
Complete the information extraction and ontology learning loop.
With relation to the Semantic Web, the authors want to adapt the technique to the standards of usual explicit metadata
  – Example: The extracted information can be converted into RDF triples, with indicator collections accessible over the web
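To make the RDF example concrete, here is a small sketch that writes extracted terms as N-Triples lines using plain string formatting. The URIs under example.org and the `offers` predicate are hypothetical placeholders; the talk does not name a vocabulary. Only backslashes and double quotes are escaped, so literals containing line breaks would need further handling.

```python
def to_ntriples(subject_uri, predicate_uri, values):
    """Serialize (subject, predicate, literal) statements as N-Triples lines."""
    lines = []
    for value in values:
        # Escape backslashes first, then double quotes, as N-Triples requires.
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject_uri}> <{predicate_uri}> "{literal}" .')
    return lines

# Hypothetical URIs for illustration only.
company = "http://example.org/company/acme"
offers = "http://example.org/vocab#offers"
for line in to_ntriples(company, offers, ["steel pipes", "copper wire"]):
    print(line)
```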

Related Work
Combining IE and ontologies (without use of web directories)
  – Bootstrapping an Ontology-Based Information Extraction System
Advantages of using Link Grammar Parser
  – Learning to Generate Semantic Annotation for Domain Specific Sentences
Using Yahoo to classify whole documents
  – Turning Yahoo into an Automatic Web-Page Classifier
Similar work aimed at more structured information using search engines
  – Extracting Patterns and Relations from the World Wide Web
Bootstrapping and other statistical methods for IE
  – Text Classification by Bootstrapping with Keywords
  – Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping

Assessment
  – I don’t think indicator term learning is done (even though they say it is)
  – Relies on ontology learning techniques that have not yet been decided
  – Need to develop an official directory