Mining in the Middle: From Search to Integration on the Web Kevin C. Chang Joint with : the UIUC and Cazoodle Teams Mining Integration Search.

Presentation transcript:

Mining in the Middle: From Search to Integration on the Web Kevin C. Chang Joint with : the UIUC and Cazoodle Teams Mining Integration Search

2 Version 0.1– “Web is a SET of PAGES.”

3 Version 1.1– “Web is a GRAPH of PAGES.”

4 What have you been searching lately? But,…

5 First Question: Where is U. of Illinois? Or: Is it in California? Let’s ask the Web …

6 Second Question: San Francisco to Chicago? AA.com or ? Let’s ask the Web …

7 Structured Data--- Prevalent but ignored !

8 Version V.2.1: Our View– Web is “Distributed Bases” of “Data Entities”. ???

9 Challenges on the Web come in “dual”: Getting access to the structured information! Kevin’s 4-quadrants: Access, Structure, Deep Web, Surface Web

10 We are inspired: From search to integration— Mining in the middle! Access Structure Deep Web Surface Web Mining Integration Search

11 Challenge of the Deep Web: MetaQuerier : Holistic Integration over the Deep Web. Access: How to Get There?

12 The previous Web: Search used to be “crawl and index”

13 The current Web: Search must eventually resort to integration

14 How to enable effective access to the deep Web? Cars.com Amazon.com Apartments.com Biography.com 401carfinder.com 411localte.com

15 MetaQuerier : Exploring and integrating the deep Web Explorer source discovery source modeling source indexing Integrator source selection schema integration query mediation FIND sources QUERY sources db of dbs unified query interface Amazon.com Cars.com 411localte.com Apartments.com

16 The challenge – How to deal with “ deep ” semantics across a large scale? “Semantics” is the key in integration! How to understand a query interface?  Where is the first condition? What’s its attribute? How to match query interfaces?  What does “author” on this source match on that? How to translate queries?  How to ask this query on that source?

17 Survey the frontier before going to the battle. Challenge reassured:  450,000 online databases  1,258,000 query interfaces  307,000 deep web sites  3-7 times increase in 4 years Insight revealed:  Web sources are not arbitrarily complex  “Amazon effect” – convergence and regularity naturally emerge

18 Shallow observable clues: the “underlying” semantics often relate to the “observable” presentations through some connection. Holistic hidden regularities: such connections often follow implicit properties, which reveal themselves holistically across sources. Large scale itself presents an opportunity: shallow integration across holistic sources. [Figure labels: Semantics (to be discovered); Presentations (observed); Reverse Analysis; Some Way of Connection; Hidden Regularities]

19 Some evidence for “holistic integration” Evidence 1: [ SIGMOD04 ] Query Interface Understanding: Hidden-syntax parsing Evidence 2: [ SIGMOD03, KDD04, KDD05 ] Matching Query Interfaces: Hidden-model discovery [Figure labels: attribute, operator, value]

20 Demo. Knocking the Door to the Deep Web

21 MetaQuerier: Technology transfer in progress Need for domain-based integration pervasive! In Jobs domain: With Soc. Sec. Admin for “Job-Demands”; a few days crawling collected ~4000 sources. In Real-Estate domain: With Homestore.com for vertical search engine; a few days crawling collected ~15,000 sources.

22 And things are indeed happening! Real Estate.

23 Jobs

24 The “MetaQuerier” Model??? Airfares.

25 Challenge of the Surface Web: WISDM : Holistic Search over the Surface Web. Structure: What to look for?

26 What have you been searching lately? What is the of Marc Snir? What is Marc Snir’s research area? Who are Marc Snir’s coauthors? What are the phones of CS database faculty? How much is “Canon PowerShot A400”? Where is SIGMOD 2006 to be held? When is the due date of SIGMOD 2006? Find PDF files of “SIGMOD 2006”?

27 Regardless of what you want, you are searching for pages… NO!

28 We take an entity view of the Web:

29 What is an “entity”? Your target of information– or, anything. Phone number address PDF Image Person name Book title, author, … Price (of something)

30 Example application: Question answering Q: Who are DB profs at UIUC? WISDM query: #dtf-nnuw100(#entity(professor) #entity(university) #entity(research Database Systems, Data Mining, IR)) results: ranked list of (, ) Query Generation Querying Filtering & Validation A: Geneva Belford, Kevin C. Chang, AnHai Doan, Jiawei Han, Marianne Winslett, ChengXiang Zhai

31 Example application: Relation construction … Marianne Winslett DeWitt phone prof WISDM tagging: #entity(prof) query: #tf-nnow50(#entity(professor) #tf-nnuw20(#entity( ) #entity(phone))) results: ranked list of (, ) App-specific Entity Tagging Querying Relation Construction

32 Example application: Best-effort integration Price of “Hamlet”? WISDM query: #od50(#entity(title Hamlet) #entity(price)) results: ranked list of (, ) Buy.com: $ $10.99, Amazon.com: $12.00 … Query Generation Querying Validation & Ranking

33 How different is “ entity search ”? How to define such searches?

34 Let’s motivate by contrasting… Page Retrieval vs. Entity Search

35 Consider the entire process: Page Retrieval 1. Input : pages. 2. Criteria : content keywords. 3. Scope : Each page itself. 4. Output : one page per result. Marc Snir Marc Snir

36 Is this an entity? In contrast: You just don’t ask this for pages. First, in terms of input:

37 1. input-- Entity is probabilistic: Want to account for imperfect extraction. name? ? location? phone? name? title?

38 How to match an entity? In contrast: You match a page by content keywords. Second, in terms of matching criteria:

39 2. Criteria-- Entity is contextual: Want to match entities by their context keywords.  Q: David DeWitt’s phone number:

40 Seen this entity somewhere else? In contrast: Every page is distinct, by itself. Third, in terms of matching scope:

41 3. Scope-- Entity is holistic: Want to score across all matchings. Q: David DeWitt’s phone: 

42 What is the target of your search? In contrast: One page at a time. Finally, in terms of output:

43 4. Output-- Entity is associative: Want to find association of entities. Q: David DeWitt’s & phone:

44 Entity search is thus different… Entity Search 1. Input : probabilistic entities. 2. Criteria : contextual patterns. 3. Scope : holistic aggregates. 4. Output : associative results.
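
The four properties above can be made concrete with a small sketch. This is not the ranking model from the talk: a plain sum of distance-discounted extraction confidences is swapped in as a stand-in, and the function name `holistic_scores`, the tuple layout, and the discount formula are all assumptions made for illustration.

```python
from collections import defaultdict

def holistic_scores(matches):
    """Aggregate per-page matches into one score per entity value.

    `matches` is an iterable of (value, confidence, distance) tuples:
    confidence is an assumed extraction probability in [0, 1] (probabilistic
    input), and distance is the token gap between the entity and the context
    keywords on that page (contextual criteria).
    """
    totals = defaultdict(float)
    for value, confidence, distance in matches:
        # Assumed local score: extraction confidence discounted by distance.
        totals[value] += confidence / (1 + distance)
    # Holistic scope: rank each value by its score summed over all pages.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

A value that is seen on several pages, each with moderate confidence, can outrank a value seen once with high confidence, which is the point of scoring across all matchings and not page by page.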

45 Query language: Entity-search goes beyond keyword queries. #  entity( type )[ restriction ] | keyword> + ) To qualify:  -- Boolean instantiation of instances.  e.g., uw100, ow50, nnow100 To quantify:  -- Fuzzy scoring function.  e.g., pr, tf, dtf, mi Examples:  #tf-nnow50(#entity(professor DeWitt) fax #entity(phone))  #pr-od20(#entity(title Romeo and Juliet) #entity(author))
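
As a rough illustration of the query form used in the examples above, the sketch below splits the leading operator of an entity-search query into a scoring part and a pattern part. The operator shape `#<scoring>-<pattern><window>(...)`, with the scoring prefix optional as in `#od50`, is inferred from the slide examples only; `parse_operator`, `entity_types`, and the returned field names are hypothetical helpers, not part of WISDM.

```python
import re

# Assumed operator shape, inferred from the slide examples:
#   #tf-nnow50(...)  -> scoring "tf", pattern "nnow", window 50
#   #od50(...)       -> no scoring prefix, pattern "od", window 50
OPERATOR = re.compile(
    r"^#(?:(?P<scoring>[a-z]+)-)?(?P<pattern>[a-z]+)(?P<window>\d+)\("
)

def parse_operator(query):
    """Split the leading operator of a query into its parts."""
    m = OPERATOR.match(query.strip())
    if m is None:
        raise ValueError(f"not an entity-search operator: {query!r}")
    return {
        "scoring": m.group("scoring"),  # None when no scoring prefix is given
        "pattern": m.group("pattern"),
        "window": int(m.group("window")),
    }

def entity_types(query):
    """List the entity types named by #entity(...) terms, in order."""
    return re.findall(r"#entity\(\s*([A-Za-z]+)", query)
```

For the first example query on the slide, this yields scoring `tf`, pattern `nnow`, window `50`, and the entity types `professor` and `phone`.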

46 What are technical challenges? Or, how to write (reviewer-friendly) papers?

47 System architecture: How to realize? (a) Page retrieval system. (b) Entity search system. Entity Aggregator Entity Search Pattern Matcher What-Where Inverted Indexer Entity Indexing Pattern Matcher Page Ranker Where Inverted Indexer Keyword Indexing Mining Application query Page Retrieval Entity Extraction/Merging entities Entity Ranker querypages Mining Application
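
One way to read the contrast between the two indexers is sketched below: a keyword index records only where a term occurs, while an entity index records what was extracted as well as where. This reading of “Where” versus “What-Where” is an assumption, as are the function name `build_indexes` and the input layout (documents already tagged by some upstream extractor).

```python
from collections import defaultdict

def build_indexes(docs):
    """Build a keyword index and an entity index from pre-tagged documents.

    `docs` maps doc_id -> list of tokens. A token is either a plain string
    (a keyword) or an (entity_type, value) tuple produced by an upstream
    tagger, which is assumed to exist and is not shown here.
    """
    keyword_index = defaultdict(list)  # keyword -> [(doc_id, position)]
    entity_index = defaultdict(list)   # entity type -> [(doc_id, position, value)]
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            if isinstance(token, tuple):
                entity_type, value = token
                entity_index[entity_type].append((doc_id, position, value))
            else:
                keyword_index[token.lower()].append((doc_id, position))
    return keyword_index, entity_index
```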

48 Ranking Functions: How to score results? Say, Jiawei Han with # , #phone, #researcharea Entity matters  Is an ? Is “2-3457” a phone? Context matters:  Order, distance Frequency matters:  How often is Jiawei Han – “data mining”? Associativity matters:   “algorithm” Source matters:  Where did you get this info from?

49 Query Processing: How to optimize?  phone tf  #entity(professor)  prof=“…” “fax”-#entity(phone) nnow50 Q: #tf-nnow50(#entity(professor[David DeWitt]) fax #entity(phone)) (pre-materialized context index)
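
To make the pattern step of such a query concrete, here is a minimal sketch of matching terms in order within a token window inside one document. It assumes that `nnow50` behaves like an ordered window of 50 tokens, which the slides do not spell out, and it enumerates candidates by brute force; it does not model the pre-materialized context index or any query plan. The name `ordered_window_matches` and the posting layout are hypothetical.

```python
def ordered_window_matches(postings, window):
    """Find ordered matches within a token window in one document.

    `postings` holds one list per query term, in query order; each entry is
    a (position, value) pair. A match picks one entry per term such that
    positions strictly increase and last - first <= window.
    """
    results = []

    def extend(term_index, chosen):
        if term_index == len(postings):
            results.append(tuple(value for _, value in chosen))
            return
        for position, value in postings[term_index]:
            if chosen and position <= chosen[-1][0]:
                continue  # must appear after the previous term
            if chosen and position - chosen[0][0] > window:
                continue  # would exceed the window
            extend(term_index + 1, chosen + [(position, value)])

    extend(0, [])
    return results
```

With a professor entity at position 10, the keyword `fax` at 14, and phone entities at 16 and 90, only the phone at 16 falls inside a 50-token window, so a single match is returned.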

50 Sample issues– Indexing, Optimization Index configuration:  What “pre-join” into context index?  Tradeoff: space cost vs. time efficiency Query optimization:  Multiple ways to answer-- What plan to use?  Plan generation and cost estimation

51 More issues… Tagging/merging of basic entities?  Application-driven tagging  Web’s redundancy will alleviate accuracy demand. Powerful pattern language  Linguistic; visual Advanced statistical analysis  correlation; sampling Scalable query processing  new components scale?

52 Promises of the Concepts From page at a time to entity-tuple at a time  getting directly to target info and evidences From IR to a mining engine  not only page retrieval but also construction From offline to online Web mining and integration  enable large scale ad-hoc mining over the web From Web to controlled corpus  enhance not only efficiency but also effectiveness From passive to active application-driven indexing  enable mining applications

53 Conclusion: Mining in just the middle!  Dual Challenges:  Getting access to the deep Web.  Getting structure from the surface Web.  Central Techniques:  Holistic mining for both search and integration. Mining Integration Search

54 Implications: Open up mining over the Web. Mining Engines: Mining as primary functions Mining for end users Mining Integration Search

55 What will such a Mining Engine be? Mining Integration Search You tell me! Students’ imagination knows no bounds.