Mining Topic-Specific Concepts and Definitions on the Web. Bing Liu et al., KDD03. CS591CXZ Web mining: Lexical relationship mining.


Lexical relationship mining
– A lexical relationship is a relationship between words, such as synonymy, antonymy, or hypernymy (“dog” is a hypernym of “poodle”).
– Alternatively: a lexical relationship is a connection between the meanings of two words in a text that helps the text hold together. Relevant connections include (rough) synonymy (e.g. woman - person, win - victory) and connections within a field of meaning (e.g. plane - pilot).
– Thus subtopic mining falls in this category, but definition mining does not.

Information Extraction
MUC Information Extraction: the extraction or pulling out of pertinent information from large volumes of text

Items of Information    Percentile Reliability
Entities                90
Attributes              80   (definition falls here)
Facts                   70
Events                  60

Attribute: a property of an entity, such as its name, alias, descriptor, or type

Mining Topic-Specific Concepts and Definitions on the Web
Goal: systematically learn an unfamiliar topic from the Web
– Definitions
– Topic hierarchy
Input: a term, e.g. “data mining”, “Web mining”
Tasks
– Identify subtopics or salient concepts
  Like building an ontology, but with no clear hierarchy
  E.g.: Genetic Algorithm Algorithms
– Find and organize definition pages
  Definition question answering
– Concept disambiguation

Techniques
A lot of heuristics
– Simple linguistic patterns
  {concept} {-|:} {definition}
  {concept} {refer(s) to | satisfy(ies)} …
  …
– Web page tags, …, …
Frequent pattern mining
– A classic data mining technique

Algorithm WebLearn(T)
1. Submit T to a search engine and get the relevant pages.
2. Mine subtopics or salient concepts of T.
3. Find definition pages.
4. Output the concepts and definition pages to the user.
5. If the user wants to know more about a subtopic T’, run WebLearn(T’).
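
The control flow of WebLearn(T) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `search`, `mine_concepts` and `find_definitions` are hypothetical callables supplied by the caller, and the slides do not specify their interfaces.

```python
from typing import Callable, Dict, List


def web_learn(
    topic: str,
    search: Callable[[str], List[str]],
    mine_concepts: Callable[[str, List[str]], List[str]],
    find_definitions: Callable[[str, List[str]], List[str]],
) -> Dict[str, List[str]]:
    """One round of WebLearn(T). All three callables are hypothetical stand-ins."""
    pages = search(topic)  # submit T to a search engine, get relevant pages
    return {
        "concepts": mine_concepts(topic, pages),        # subtopics / salient concepts
        "definitions": find_definitions(topic, pages),  # definition pages
    }
```

The recursion is driven by the user: to explore a subtopic T’ from the returned concepts, call `web_learn` again with T’ as the topic.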

Mining subtopic/salient concept (1)
Input: a set of top-ranked relevant documents
Steps:
1. Filter out “noisy” documents
   – Publication listing pages: “in proceeding”, “journal”
   – Forum discussion pages: “previous message”, “reply to”
   – Pages that do not contain all query terms
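
The noise filter in step 1 could look roughly like this. It is a sketch under simplifying assumptions: pages are plain-text strings, the cue phrases are exactly the ones listed on the slide, and matching is a crude case-insensitive substring test (so any page mentioning “journal” is dropped). The function names are illustrative.

```python
from typing import List

# Cue phrases taken from the slide; the real system may use more.
PUBLICATION_CUES = ("in proceeding", "journal")
FORUM_CUES = ("previous message", "reply to")


def is_noisy(page_text: str, query: str) -> bool:
    """Return True for publication lists, forum pages, or pages missing a query term."""
    text = page_text.lower()
    if any(cue in text for cue in PUBLICATION_CUES + FORUM_CUES):
        return True
    # A page must contain every query term to be kept.
    return not all(term in text for term in query.lower().split())


def filter_noisy(pages: List[str], query: str) -> List[str]:
    """Keep only the pages that pass the noise heuristics."""
    return [page for page in pages if not is_noisy(page, query)]
```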

Mining subtopic/salient concept (2)
2. Identify important phrases in each page
   Extract text segments in HTML emphasizing tags, …, …
   Except those containing:
   – Salutation titles (Mr., Dr., Professor)
   – A URL or address
   – “conference”, “journal”, …
   – Digits (KDD2004)
   – Images
   – Too many words (limit: 15 words)
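
Step 2 can be approximated with Python's standard `html.parser`. The set of emphasizing tags below is an assumption, since the tag list did not survive in the transcript; the rejection rules follow the slide, with “address” narrowed to URLs and e-mail addresses for simplicity.

```python
import re
from html.parser import HTMLParser
from typing import List

# Assumed tag set; the slide's own list of emphasizing tags is not recoverable.
EMPHASIS_TAGS = {"h1", "h2", "h3", "h4", "b", "strong", "i", "em", "u"}
REJECT = re.compile(
    r"\b(mr|dr|professor)\b|https?://|www\.|@|conference|journal|\d", re.IGNORECASE
)
MAX_WORDS = 15


class EmphasisExtractor(HTMLParser):
    """Collects text inside emphasizing tags, dropping segments the heuristics reject."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # nesting depth inside emphasizing tags
        self.buffer: List[str] = []
        self.has_image = False
        self.segments: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.depth += 1
        elif tag == "img" and self.depth:
            self.has_image = True

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self._flush()

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

    def _flush(self) -> None:
        text = " ".join("".join(self.buffer).split())
        too_long = len(text.split()) > MAX_WORDS
        if text and not self.has_image and not too_long and not REJECT.search(text):
            self.segments.append(text)
        self.buffer, self.has_image = [], False


def emphasized_segments(html: str) -> List[str]:
    parser = EmphasisExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments
```

Nested emphasis such as `<em>Association <i>Rules</i></em>` is treated as one segment, which is a design choice of this sketch and not something the slides state.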

Mining subtopic/salient concept (3)
3. Mine frequent phrases
   Input: emphasized text segments
   Mine frequent word sets using association rule mining
4. Eliminate word sets unlikely to be subtopics
   – Heuristic: drop those that never appear alone in an emphasizing tag in any page, e.g. “process”
   – Remove generic words from the result set: “abstract”, “introduction”, “conclusion”, “research”, …
5. Rank the result sets
   By the number of pages in which they occur
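
Steps 3 to 5 can be illustrated with brute-force counting. This sketch does not use a real association rule miner: it enumerates small word sets directly, which is only workable because segments are short. The thresholds `min_pages` and `max_size` are assumed parameters, and dropping every word set that contains a generic word is one possible reading of the slide.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

GENERIC = {"abstract", "introduction", "conclusion", "research"}  # from the slide


def rank_word_sets(
    pages: List[List[str]], min_pages: int = 2, max_size: int = 3
) -> List[Tuple[Tuple[str, ...], int]]:
    """pages[i] holds the emphasized segments of page i; returns (word set, page count)."""
    support: Dict[FrozenSet[str], Set[int]] = defaultdict(set)
    standalone: Set[FrozenSet[str]] = set()  # word sets that fill a whole segment
    for page_id, segments in enumerate(pages):
        for segment in segments:
            words = frozenset(segment.lower().split())
            standalone.add(words)
            # Step 3: count page support for every small subset of the segment.
            for size in range(1, min(max_size, len(words)) + 1):
                for combo in combinations(sorted(words), size):
                    support[frozenset(combo)].add(page_id)
    results = [
        (tuple(sorted(word_set)), len(page_ids))
        for word_set, page_ids in support.items()
        if len(page_ids) >= min_pages      # frequent enough
        and word_set in standalone         # step 4: appears alone in some tag
        and not (word_set & GENERIC)       # step 4: no generic words
    ]
    # Step 5: rank by number of pages, ties broken alphabetically.
    return sorted(results, key=lambda item: (-item[1], item[0]))
```

In the test data used to check this sketch, “process” occurs on two pages but never alone in a segment, so it is eliminated, matching the slide's example.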

Definition Finding
Definition identification patterns suitable for Web pages
– {concept} {-|:} {definition}
– {concept} {refer(s) to | satisfy(ies)} …
HTML structuring clues and hyperlinks
– If a page has only one header, …, or one big emphasized segment at the beginning, it is treated as a definition page
– Follow hyperlinks up to the second level to look for definition pages, and only hyperlinks whose anchor text matches the concept
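
The two definition patterns shown on the slide translate naturally into a regular expression. The sketch below covers only those two patterns (the slide's “…” indicates there are more) and assumes sentence-level input; the function names are illustrative.

```python
import re
from typing import Optional


def definition_pattern(concept: str) -> re.Pattern:
    """Regex for '{concept} {-|:} {definition}' and '{concept} refer(s) to | satisfy(ies) ...'."""
    c = re.escape(concept)
    return re.compile(
        rf"\b{c}\s*(?:[-:]|\brefers?\s+to\b|\bsatisf(?:y|ies)\b)\s+(?P<definition>.+)",
        re.IGNORECASE,
    )


def find_definition(concept: str, sentence: str) -> Optional[str]:
    """Return the text following a definition cue for the concept, or None."""
    match = definition_pattern(concept).search(sentence)
    return match.group("definition").strip() if match else None
```

Requiring whitespace after the cue keeps hyphenated compounds such as “data mining-based” from being read as definitions.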

Subtopic disambiguation
By adding context terms (usually the parent topic or subtopics)
– Context terms tend to dominate the results
– Cannot work for the first (root) topic
Heuristics to combat the domination of context terms
– Only consider text segments containing the topic or subtopic
– Identify pages with a topic hierarchy
  HTML list tag
  The hierarchy should also contain other subtopics of the parent topic
– Shallow linguistic phenomena
  Topic + “approaches” / “techniques” + ( + “e.g” / “such as” / “including” + subtopics )
Then, how does this help disambiguate?
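
The shallow linguistic pattern above can be matched with a regular expression. This is only a sketch of the pattern itself, with an assumed function name; it reads the list up to the first period, semicolon or closing parenthesis and does not trim trailing words that are not subtopics.

```python
import re
from typing import List


def listed_subtopics(topic: str, sentence: str) -> List[str]:
    """Match: topic + approaches/techniques + e.g./such as/including + subtopic list."""
    pattern = re.compile(
        rf"\b{re.escape(topic)}\s+(?:approaches|techniques)\W+"
        rf"(?:e\.g\.?|such as|including)\W*(?P<items>[^.;)]+)",
        re.IGNORECASE,
    )
    match = pattern.search(sentence)
    if not match:
        return []
    # Split the captured list on commas and on the words "and" / "or".
    parts = re.split(r",|\band\b|\bor\b", match.group("items"))
    return [part.strip() for part in parts if part.strip()]
```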

Evaluation
Google is used to get the initial set of relevant pages
Result 1: subtopics / salient concepts
– Looks pretty good; the terms are closely relevant
– More salient concepts than subtopics
Result 2: definition discovery comparison
– Precision: WebLearn vs Google vs AskJeeves
Result 3: disambiguation
– Seems to be useful

Analysis
– Interesting topic
– Potentially usable in practice
– A complete system
Techniques
– Avoid NLP and machine learning
– Apply heuristics based on shallow text structures

Limitations
– Research topics, not much ambiguity
– Techniques: the heuristics are empirical, by no means flawless or exhaustive, and hard to apply to other domains

How to improve? -- discussion
Better research:
– Do you think it is a good research topic?
Better techniques:
– What techniques would you like to try to solve the problem?

Thank you!