Mining in the Middle: From Search to Integration on the Web. Kevin C. Chang. Joint with: the UIUC and Cazoodle Teams. Mining, Integration, Search.

2 Version 0.1– “Web is a SET of PAGES.”

3 Version 1.1– “Web is a GRAPH of PAGES.”

4 What have you been searching lately? But,…

5 First Question: Where is U. of Illinois? Or: Is it in California? Let’s ask the Web …

6 Second Question: San Francisco to Chicago? AA.com or ? Let’s ask the Web …

7 Structured Data--- Prevalent but ignored !

8 Version V.2.1: Our View– Web is “Distributed Bases” of “Data Entities”. ???

9 Challenges on the Web come in “dual”: Getting access to the structured information! Kevin’s 4-quadrants: Access, Structure; Deep Web, Surface Web.

10 We are inspired: From search to integration, mining in the middle! Access, Structure; Deep Web, Surface Web; Mining, Integration, Search.

11 Challenge of the Deep Web: MetaQuerier : Holistic Integration over the Deep Web. Access: How to Get There?

12 The previous Web: Search used to be “crawl and index”

13 The current Web: Search must eventually resort to integration

14 How to enable effective access to the deep Web? Cars.com Amazon.com Apartments.com Biography.com 401carfinder.com 411localte.com

15 MetaQuerier: Exploring and integrating the deep Web. Explorer (FIND sources): source discovery, source modeling, source indexing; yields a db of dbs. Integrator (QUERY sources): source selection, schema integration, query mediation; yields a unified query interface. Example sources: Amazon.com, Cars.com, 411localte.com, Apartments.com

16 The challenge: How to deal with “deep” semantics across a large scale? “Semantics” is the key in integration! How to understand a query interface? • Where is the first condition? What’s its attribute? How to match query interfaces? • What does “author” on this source match on that? How to translate queries? • How to ask this query on that source?

17 Survey the frontier before going to the battle. Challenge reassured: • 450,000 online databases • 1,258,000 query interfaces • 307,000 deep web sites • 3-7 times increase in 4 years. Insight revealed: • Web sources are not arbitrarily complex • “Amazon effect”: convergence and regularity naturally emerge

18 Shallow observable clues: • “Underlying” semantics often relate to the “observable” presentations through some way of connection. Holistic hidden regularities: • Such connections often follow some implicit properties, which reveal themselves holistically across sources. Large scale itself presents an opportunity: shallow integration across holistic sources. Diagram labels: Semantics (to be discovered); Presentations (observed); Reverse Analysis; Some Way of Connection; Hidden Regularities.

19 Some evidence for “holistic integration”. Evidence 1: [SIGMOD04] Query Interface Understanding: hidden-syntax parsing. Evidence 2: [SIGMOD03, KDD04, KDD05] Matching Query Interfaces: hidden-model discovery. Diagram labels: attribute, operator, value.

20 Demo. Knocking the Door to the Deep Web

21 MetaQuerier: Technology transfer in progress. Need for domain-based integration pervasive! In Jobs domain: • With Soc. Sec. Admin for “Job-Demands” • A few days crawling collected ~4000 sources. In Real-Estate domain: • With Homestore.com for vertical search engine • A few days crawling collected ~15,000 sources

22 And things are indeed happening! Real Estate.

23 Jobs

24 The “MetaQuerier” Model??? Airfares.

25 Challenge of the Surface Web: WISDM : Holistic Search over the Surface Web. Structure: What to look for?

26 What have you been searching lately? What is the email of Marc Snir? What is Marc Snir’s research area? Who are Marc Snir’s coauthors? What are the phones of CS database faculty? How much is “Canon PowerShot A400”? Where is SIGMOD 2006 to be held? When is the due date of SIGMOD 2006? Find PDF files of “SIGMOD 2006”?

27 Regardless of what you want, you are searching for pages… NO!

28 We take an entity view of the Web:

29 What is an “entity”? Your target of information– or, anything. Phone number Email address PDF Image Person name Book title, author, … Price (of something)

30 Example application: Question answering Q: Who are DB profs at UIUC? WISDM query: #dtf-nnuw100(#entity(professor) #entity(university) #entity(research Database Systems, Data Mining, IR)) results: ranked list of (, ) Query Generation, Querying, Filtering & Validation A: Geneva Belford, Kevin C. Chang, AnHai Doan, Jiawei Han, Marianne Winslett, ChengXiang Zhai

31 Example application: Relation construction. Relation (email, phone, prof): (winslett@cs.uiuc.edu, 333-3536, Marianne Winslett); (dewitt@cs.wisc.edu, 608-263-5489, David DeWitt); … WISDM tagging: #entity(prof) query: #tf-nnow50(#entity(professor) #tf-nnuw20(#entity(email) #entity(phone))) results: ranked list of (, ) App-specific Entity Tagging, Querying, Relation Construction

32 Example application: Best-effort integration. Price of “Hamlet”? WISDM query: #od50(#entity(title Hamlet) #entity(price)) results: ranked list of (, ) Buy.com: $ $10.99, Amazon.com: $12.00 … Query Generation, Querying, Validation & Ranking

33 How different is “ entity search ”? How to define such searches?

34 Let’s motivate by contrasting… Page Retrieval vs. Entity Search

35 Consider the entire process: Page Retrieval 1. Input : pages. 2. Criteria : content keywords. 3. Scope : Each page itself. 4. Output : one page per result. Marc Snir Marc Snir

36 Is this an entity? In contrast: You just don’t ask this for pages. First, in terms of input:

37 1. input-- Entity is probabilistic: Want to account for imperfect extraction. name? email? location? phone? name? title?

38 How to match an entity? In contrast: You match a page by content keywords. Second, in terms of matching criteria:

39 2. Criteria-- Entity is contextual: Want to match entities by their context keywords. Q: David DeWitt’s phone number:
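
A minimal sketch of contextual matching, assuming a page is already tokenized and entity occurrences are already tagged with their token positions. The function name `ordered_window_matches` and the reading of a pattern as "keyword first, entity within a fixed number of tokens" are assumptions made for illustration, not the WISDM implementation.

```python
from typing import List, Tuple


def ordered_window_matches(
    keyword_positions: List[int],
    entity_occurrences: List[Tuple[int, str]],
    window: int,
) -> List[str]:
    """Return entity values that follow a context keyword within `window` tokens."""
    matches = []
    for ent_pos, value in entity_occurrences:
        # The keyword must come first (ordered) and be close enough (window).
        if any(0 < ent_pos - kw_pos <= window for kw_pos in keyword_positions):
            matches.append(value)
    return matches


# Hypothetical page: tokens 0..4 are "David", "DeWitt", "phone", ":", "608-263-5489".
dewitt_positions = [1]                  # where the context keyword "DeWitt" occurs
phones = [(4, "608-263-5489")]          # (token position, extracted phone value)

near = ordered_window_matches(dewitt_positions, phones, window=3)
too_far = ordered_window_matches(dewitt_positions, phones, window=2)
```

Here `near` contains the phone because it appears three tokens after the keyword, while `too_far` is empty because the window is narrower than that distance.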

40 Seen this entity somewhere else? In contrast: Every page is distinct, by itself. Third, in terms of matching scope:

41 3. Scope-- Entity is holistic: Want to score across all matchings. Q: David DeWitt’s phone:
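
A minimal sketch of holistic scoring, assuming each matching arrives as a (page, candidate value, extraction confidence) triple. Summing confidences per candidate is an assumption chosen for simplicity; the slides name scoring functions such as pr, tf, dtf, and mi without defining them here. The function name and the sample matchings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_matchings(
    matchings: Iterable[Tuple[str, str, float]],
) -> List[Tuple[str, float]]:
    """Sum extraction confidence per candidate value across all pages, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for _page, value, confidence in matchings:
        scores[value] += confidence
    # Sort by descending score; break ties by value for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical matchings for the query "David DeWitt's phone".
matchings = [
    ("page1", "608-263-5489", 0.9),
    ("page2", "608-263-5489", 0.8),
    ("page3", "333-3536", 0.6),   # a spurious candidate seen on one page only
]
ranked = aggregate_matchings(matchings)
```

A candidate seen on several pages outranks one seen once, which is the point of scoring across all matchings rather than per page.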

42 What is the target of your search? In contrast: One page at a time. Finally, in terms of output:

43 4. Output-- Entity is associative: Want to find association of entities. Q: David DeWitt’s email & phone:
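
A minimal sketch of associative output, assuming email and phone occurrences are already tagged as (page, token position, value) triples. Pairing entities that sit within a small token distance on the same page is an assumption for illustration; the function name `associate` and the positions are hypothetical, while the values are the ones shown on the relation-construction slide.

```python
from typing import List, Tuple

Occurrence = Tuple[str, int, str]  # (page id, token position, extracted value)


def associate(
    emails: List[Occurrence], phones: List[Occurrence], window: int
) -> List[Tuple[str, str]]:
    """Pair each email with phones on the same page within `window` tokens."""
    pairs = []
    for e_page, e_pos, e_val in emails:
        for p_page, p_pos, p_val in phones:
            if e_page == p_page and abs(e_pos - p_pos) <= window:
                pairs.append((e_val, p_val))
    return pairs


emails = [("p1", 0, "winslett@cs.uiuc.edu"), ("p1", 4, "dewitt@cs.wisc.edu")]
phones = [("p1", 1, "333-3536"), ("p1", 5, "608-263-5489")]
tuples = associate(emails, phones, window=1)
```

The result is a list of (email, phone) tuples rather than a list of pages, which is what distinguishes this output from page retrieval.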

44 Entity search is thus different… Entity Search 1. Input : probabilistic entities. 2. Criteria : contextual patterns. 3. Scope : holistic aggregates. 4. Output : associative results.

45 Query language: Entity-search goes beyond keyword queries. #  entity( type )[ restriction ] | keyword> + ) To qualify: • -- Boolean instantiation of instances. • e.g., uw100, ow50, nnow100 To quantify: • -- Fuzzy scoring function. • e.g., pr, tf, dtf, mi Examples: • #tf-nnow50(#entity(professor DeWitt) fax #entity(phone)) • #pr-od20(#entity(title Romeo and Juliet) #entity(author))
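
A minimal sketch of how the operator names in the examples above could be split into their parts, assuming the shape "optional quantifier, hyphen, pattern name, window size" that the examples suggest (tf-nnow50, pr-od20, od50). The `Operator` type and `parse_operator` function are hypothetical names, and the pattern names are treated as opaque labels because the slides do not define them.

```python
import re
from typing import NamedTuple, Optional


class Operator(NamedTuple):
    quantifier: Optional[str]  # scoring function, e.g. "tf"; None if omitted
    pattern: str               # qualifying pattern name, e.g. "nnow"
    window: int                # numeric suffix, e.g. 50


_OPERATOR = re.compile(
    r"^#?(?:(?P<quantifier>[a-z]+)-)?(?P<pattern>[a-z]+)(?P<window>\d+)$"
)


def parse_operator(text: str) -> Operator:
    """Split an operator such as '#tf-nnow50' into quantifier, pattern, window."""
    match = _OPERATOR.match(text)
    if match is None:
        raise ValueError(f"not a recognized operator: {text!r}")
    return Operator(
        match.group("quantifier"),
        match.group("pattern"),
        int(match.group("window")),
    )


with_quantifier = parse_operator("#tf-nnow50")
without_quantifier = parse_operator("#od50")
```

Keeping the quantifier optional lets the same parser accept both `#tf-nnow50` and the bare `#od50` form used in the best-effort integration example.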

46 What are technical challenges? Or, how to write (reviewer-friendly) papers?

47 System architecture: How to realize? (a) Page retrieval system. (b) Entity search system. Entity Aggregator, Entity Search, Pattern Matcher, What-Where Inverted Indexer, Entity Indexing, Pattern Matcher, Page Ranker, Where Inverted Indexer, Keyword Indexing, Mining Application, query, Page Retrieval, Entity Extraction/Merging, entities, Entity Ranker, query, pages, Mining Application
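
A minimal sketch contrasting the two index kinds named on this slide, assuming a "where" posting records only a location and a "what-where" posting also records the extracted value. The class `EntityIndex`, its methods, and the sample postings are hypothetical and only illustrate the idea, not the actual system.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Where = Tuple[str, int]            # (page id, token position)
WhatWhere = Tuple[str, int, str]   # (page id, token position, extracted value)


class EntityIndex:
    """Keyword postings say where a word occurs; entity postings also say what was found."""

    def __init__(self) -> None:
        self.keywords: Dict[str, List[Where]] = defaultdict(list)
        self.entities: Dict[str, List[WhatWhere]] = defaultdict(list)

    def add_keyword(self, word: str, page: str, pos: int) -> None:
        self.keywords[word.lower()].append((page, pos))

    def add_entity(self, etype: str, page: str, pos: int, value: str) -> None:
        self.entities[etype].append((page, pos, value))

    def co_occurring(self, word: str, etype: str) -> List[Tuple[str, str]]:
        """Return (page, value) for entities of `etype` on pages containing `word`."""
        pages = {page for page, _pos in self.keywords.get(word.lower(), [])}
        return [
            (page, value)
            for page, _pos, value in self.entities.get(etype, [])
            if page in pages
        ]


index = EntityIndex()
index.add_keyword("DeWitt", "p1", 1)
index.add_entity("phone", "p1", 4, "608-263-5489")
index.add_entity("phone", "p2", 7, "333-3536")
hits = index.co_occurring("dewitt", "phone")
```

Because the entity postings carry values, a query can return the phone number itself instead of the page that happens to contain it.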

48 Ranking Functions: How to score results? Say, Jiawei Han with #email, #phone, #researcharea. Entity matters: • Is “jhan@” an email? Is “2-3457” a phone? Context matters: • Order, distance. Frequency matters: • How often is Jiawei Han – “data mining”? Associativity matters: • “webmaster@cs.uiuc.edu” • “algorithm”. Source matters: • Where did you get this info from?

49 Query Processing: How to optimize? • phone tf • #entity(professor) • prof=“…” “fax”-#entity(phone) nnow50 Q: #tf-nnow50(#entity(professor[David DeWitt]) fax #entity(phone)) (pre-materialized context index)

50 Sample issues: Indexing, Optimization. Index configuration: • What to “pre-join” into the context index? • Tradeoff: space cost vs. time efficiency. Query optimization: • Multiple ways to answer: what plan to use? • Plan generation and cost estimation

51 More issues… Tagging/merging of basic entities? • Application-driven tagging • Web’s redundancy will alleviate accuracy demand. Powerful pattern language: • Linguistic; visual. Advanced statistical analysis: • correlation; sampling. Scalable query processing: • new components scale?

52 Promises of the Concepts. From page at a time to entity-tuple at a time: • getting directly to target info and evidence. From IR to a mining engine: • not only page retrieval but also construction. From offline to online Web mining and integration: • enable large scale ad-hoc mining over the web. From Web to controlled corpus: • enhance not only efficiency but also effectiveness. From passive to active application-driven indexing: • enable mining applications

53 Conclusion: Mining in just the middle! Dual Challenges: • Getting access to the deep Web. • Getting structure from the surface Web. Central Techniques: • Holistic mining for both search and integration. Mining, Integration, Search

54 Implications: Open up mining over the Web. Mining Engines: Mining as primary functions Mining for end users Mining Integration Search

55 What will such a Mining Engine be? Mining Integration Search You tell me! Students’ imagination knows no bounds.
