1
Mining in the Middle: From Search to Integration on the Web. Kevin C. Chang. Joint work with the UIUC and Cazoodle teams. Mining, Integration, Search.
2
2 Version 0.1– “Web is a SET of PAGES.”
3
3 Version 1.1– “Web is a GRAPH of PAGES.”
4
4 What have you been searching lately? But,…
5
5 First Question: Where is U. of Illinois? Or: Is it in California? Let’s ask the Web …
6
6 Second Question: San Francisco to Chicago? AA.com or ? Let’s ask the Web …
7
7 Structured data: prevalent but ignored!
8
8 Version V.2.1: Our View– Web is “Distributed Bases” of “Data Entities”. ???
9
9 Challenges on the Web come in a “dual”: Getting access to the structured information! Access, Structure. Deep Web, Surface Web. Kevin’s 4 quadrants:
10
10 We are inspired: From search to integration: Mining in the middle! Access, Structure. Deep Web, Surface Web. Mining, Integration, Search.
11
11 Challenge of the Deep Web: MetaQuerier : Holistic Integration over the Deep Web. Access: How to Get There?
12
12 The previous Web: Search used to be “crawl and index”
13
13 The current Web: Search must eventually resort to integration
14
14 How to enable effective access to the deep Web? Cars.com Amazon.com Apartments.com Biography.com 401carfinder.com 411localte.com
15
15 MetaQuerier: Exploring and integrating the deep Web. Explorer: source discovery, source modeling, source indexing. Integrator: source selection, schema integration, query mediation. FIND sources; QUERY sources. db of dbs; unified query interface. Sources: Amazon.com, Cars.com, 411localte.com, Apartments.com.
16
16 The challenge – How to deal with “ deep ” semantics across a large scale? “Semantics” is the key in integration! How to understand a query interface? Where is the first condition? What’s its attribute? How to match query interfaces? What does “author” on this source match on that? How to translate queries? How to ask this query on that source?
17
17 Survey the frontier before going to the battle. Challenge reassured: 450,000 online databases; 1,258,000 query interfaces; 307,000 deep web sites; 3-7 times increase in 4 years. Insight revealed: Web sources are not arbitrarily complex. “Amazon effect”: convergence and regularity naturally emerge.
18
18 Shallow observable clues: “underlying” semantics often relate to the “observable” presentations through some connection. Holistic hidden regularities: such connections often follow implicit properties, which are revealed holistically across sources. Large scale itself presents an opportunity: shallow integration across holistic sources. Diagram labels: Semantics (to be discovered); Presentations (observed); Reverse Analysis; Some Way of Connection; Hidden Regularities.
19
19 Some evidence for “holistic integration”. Evidence 1: [SIGMOD04] Query Interface Understanding: hidden-syntax parsing. Evidence 2: [SIGMOD03, KDD04, KDD05] Matching Query Interfaces: hidden-model discovery. attribute, operator, value.
20
20 Demo. Knocking the Door to the Deep Web
21
21 MetaQuerier: Technology transfer in progress. Need for domain-based integration is pervasive! In the Jobs domain: with Soc. Sec. Admin for “Job-Demands”; a few days of crawling collected ~4000 sources. In the Real-Estate domain: with Homestore.com for a vertical search engine; a few days of crawling collected ~15,000 sources.
22
22 And things are indeed happening! Real Estate.
23
23 Jobs
24
24 The “MetaQuerier” Model??? Airfares.
25
25 Challenge of the Surface Web: WISDM : Holistic Search over the Surface Web. Structure: What to look for?
26
26 What have you been searching lately? What is the email of Marc Snir? What is Marc Snir’s research area? Who are Marc Snir’s coauthors? What are the phones of CS database faculty? How much is “Canon PowerShot A400”? Where is SIGMOD 2006 to be held? When is the due date of SIGMOD 2006? Find PDF files of “SIGMOD 2006”?
27
27 Regardless of what you want, you are searching for pages… NO!
28
28 We take an entity view of the Web:
29
29 What is an “entity”? Your target of information– or, anything. Phone number Email address PDF Image Person name Book title, author, … Price (of something)
30
30 Example application: Question answering Q: Who are DB profs at UIUC? WISDM query: #dtf-nnuw100(#entity(professor) #entity(university) #entity(research Database Systems, Data Mining, IR)) results: ranked list of (, ) Query Generation Querying Filtering & Validation A: Geneva Belford, Kevin C. Chang, AnHai Doan, Jiawei Han, Marianne Winslett, ChengXiang Zhai
31
31 Example application: Relation construction. WISDM tagging: #entity(prof) query: #tf-nnow50(#entity(professor) #tf-nnuw20(#entity(email) #entity(phone))) results: ranked list of (, ). Constructed relation (prof, email, phone): (Marianne Winslett, winslett@cs.uiuc.edu, 333-3536); (David DeWitt, dewitt@cs.wisc.edu, 608-263-5489); … Steps: App-specific Entity Tagging, Querying, Relation Construction.
32
32 Example application: Best-effort integration. Price of “Hamlet”? WISDM query: #od50(#entity(title Hamlet) #entity(price)) results: ranked list of (, ) Buy.com: $ $10.99, Amazon.com: $12.00 … Steps: Query Generation, Querying, Validation & Ranking.
33
33 How different is “ entity search ”? How to define such searches?
34
34 Let’s motivate by contrasting… Page Retrieval vs. Entity Search
35
35 Consider the entire process: Page Retrieval 1. Input : pages. 2. Criteria : content keywords. 3. Scope : Each page itself. 4. Output : one page per result. Marc Snir Marc Snir
36
36 Is this an entity? In contrast: You just don’t ask this for pages. First, in terms of input:
37
37 1. input-- Entity is probabilistic: Want to account for imperfect extraction. name? email? location? phone? name? title?
38
38 How to match an entity? In contrast: You match a page by content keywords. Second, in terms of matching criteria:
39
39 2. Criteria-- Entity is contextual: Want to match entities by their context keywords. Q: David DeWitt’s phone number:
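A minimal sketch of contextual matching, assuming token positions are already known for the context keyword and for the tagged entities. The function name `window_matches`, the `ordered` flag, and the example positions are illustrative assumptions, not part of the talk; the idea is only that an entity matches when it occurs within a window of the keyword, optionally in order.

```python
from typing import List, Tuple


def window_matches(keyword_pos: List[int], entity_pos: List[int],
                   window: int, ordered: bool) -> List[Tuple[int, int]]:
    """Return (keyword_pos, entity_pos) pairs at most `window` tokens apart.

    If `ordered` is True the keyword must come before the entity
    (assumed reading of an "ordered window"); otherwise either order counts.
    """
    pairs = []
    for k in keyword_pos:
        for e in entity_pos:
            gap = e - k
            if ordered and gap <= 0:
                continue  # ordered: the keyword has to precede the entity
            if 0 < abs(gap) <= window:
                pairs.append((k, e))
    return pairs


# Hypothetical page: "DeWitt" at token 1, phone-like entities at tokens 3 and 40.
near = window_matches([1], [3, 40], window=5, ordered=True)
print(near)
```

Only the phone entity at token 3 survives the window of 5; the one at token 40 is too far from the name to count as context.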
40
40 Seen this entity somewhere else? In contrast: Every page is distinct, by itself. Third, in terms of matching scope:
41
41 3. Scope-- Entity is holistic: Want to score across all matchings. Q: David DeWitt’s phone:
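A rough illustration of holistic scoring, assuming the simplest possible aggregate: count the distinct pages on which each candidate value matched. The `aggregate` helper, the page ids, and the noise value `555-0100` are made up for this sketch; the talk does not specify the actual scoring function.

```python
from typing import Dict, Iterable, List, Set, Tuple


def aggregate(matches: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Rank candidate values by the number of distinct pages that support them.

    `matches` holds (page_id, candidate_value) pairs from per-page matching.
    """
    pages_per_value: Dict[str, Set[str]] = {}
    for page, value in matches:
        pages_per_value.setdefault(value, set()).add(page)
    ranked = [(value, len(pages)) for value, pages in pages_per_value.items()]
    # Most-supported value first; ties broken alphabetically for determinism.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical matches for "David DeWitt's phone" across three pages.
matches = [
    ("page1", "608-263-5489"),
    ("page2", "608-263-5489"),
    ("page1", "608-263-5489"),  # repeated on the same page: counted once
    ("page3", "555-0100"),      # made-up noise value
]
ranking = aggregate(matches)
print(ranking)
```

Counting pages rather than raw occurrences is one design choice among several; it keeps a single page that repeats a value many times from dominating the score.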
42
42 What is the target of your search? In contrast: One page at a time. Finally, in terms of output:
43
43 4. Output-- Entity is associative: Want to find association of entities. Q: David DeWitt’s email & phone:
44
44 Entity search is thus different… Entity Search 1. Input : probabilistic entities. 2. Criteria : contextual patterns. 3. Scope : holistic aggregates. 4. Output : associative results.
45
45 Query language: Entity search goes beyond keyword queries. Form: #<quantify>-<qualify>( <#entity(type)[restriction] | keyword>+ ). To qualify: Boolean instantiation of instances, e.g., uw100, ow50, nnow100. To quantify: fuzzy scoring function, e.g., pr, tf, dtf, mi. Examples: #tf-nnow50(#entity(professor DeWitt) fax #entity(phone)) #pr-od20(#entity(title Romeo and Juliet) #entity(author))
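To make the shape of these queries concrete, here is a small parsing sketch. It assumes a query is a scoring function, a hyphen, a window pattern with a size, and a flat argument list of `#entity(...)` terms and keywords, as in the two examples on the slide. The `Query` and `Arg` classes and the `parse` function are hypothetical names, and nested queries are not handled.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Arg:
    kind: str                        # "entity" or "keyword"
    value: str                       # entity type, or the keyword itself
    restriction: Optional[str] = None


@dataclass
class Query:
    quantifier: str                  # scoring function, e.g. "tf"
    window_kind: str                 # pattern name, e.g. "nnow"
    window_size: int                 # e.g. 50
    args: List[Arg]


HEAD = re.compile(r"#([a-z]+)-([a-z]+)(\d+)\((.*)\)\s*$", re.S)
ARG = re.compile(r"#entity\(([^()\[\]]+?)(?:\[([^\]]*)\])?\)|(\S+)")


def parse(text: str) -> Query:
    """Parse a flat (non-nested) query such as '#tf-nnow50(... )'."""
    m = HEAD.match(text.strip())
    if m is None:
        raise ValueError(f"not a recognized query: {text!r}")
    quantifier, kind, size, body = m.groups()
    args: List[Arg] = []
    for am in ARG.finditer(body):
        inner, bracketed, keyword = am.groups()
        if keyword is not None:
            args.append(Arg("keyword", keyword))
            continue
        # Accept both "type[restriction]" and "type restriction" spellings.
        etype, _, inline = inner.strip().partition(" ")
        args.append(Arg("entity", etype, bracketed or inline or None))
    return Query(quantifier, kind, int(size), args)


q = parse("#tf-nnow50(#entity(professor DeWitt) fax #entity(phone))")
print(q.quantifier, q.window_kind, q.window_size, [a.kind for a in q.args])
```

The parser deliberately does not interpret what each window pattern or scoring function means; it only separates the parts so that a matcher and a ranker could be plugged in behind them.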
46
46 What are technical challenges? Or, how to write (reviewer-friendly) papers?
47
47 System architecture: How to realize? (a) Page retrieval system. (b) Entity search system. Diagram labels: Entity Aggregator; Entity Search; Pattern Matcher; What-Where Inverted Indexer; Entity Indexing; Pattern Matcher; Page Ranker; Where Inverted Indexer; Keyword Indexing; Mining Application; query; Page Retrieval; Entity Extraction/Merging; entities; Entity Ranker; query; pages; Mining Application.
48
48 Ranking Functions: How to score results? Say, Jiawei Han with #email, #phone, #researcharea. Entity matters: Is “jhan@” an email? Is “2-3457” a phone? Context matters: order, distance. Frequency matters: How often is Jiawei Han with “data mining”? Associativity matters: “webmaster@cs.uiuc.edu”, “algorithm”. Source matters: Where did you get this info from?
49
49 Query Processing: How to optimize? Q: #tf-nnow50(#entity(professor[David DeWitt]) fax #entity(phone)). Plan diagram labels: phone; tf; #entity(professor), prof=“…”; “fax”-#entity(phone), nnow50 (pre-materialized context index).
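One way to picture a pre-materialized context index is as a join computed ahead of query time: for a chosen keyword and entity type, store the position pairs that already satisfy the window. The sketch below assumes documents arrive as lists of (token, entity type) pairs; the function name `build_context_index`, that input layout, and the placeholder number `555-0100` are assumptions for illustration, not the system's actual design.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, Optional[str]]  # (surface token, entity type or None)


def build_context_index(docs: Dict[str, List[Token]], keyword: str,
                        entity_type: str, window: int
                        ) -> Dict[str, List[Tuple[int, int]]]:
    """Pre-join `keyword` with entities of `entity_type` that follow it
    within `window` tokens; return {doc_id: [(keyword_pos, entity_pos)]}."""
    index: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for doc_id, tokens in docs.items():
        kw_pos = [i for i, (tok, _) in enumerate(tokens)
                  if tok.lower() == keyword]
        ent_pos = [i for i, (_, typ) in enumerate(tokens)
                   if typ == entity_type]
        for k in kw_pos:
            for e in ent_pos:
                if 0 < e - k <= window:
                    index[doc_id].append((k, e))
    return dict(index)


# Hypothetical tagged documents; the phone value is a placeholder.
docs = {
    "d1": [("fax", None), (":", None), ("555-0100", "phone")],
    "d2": [("office", None), ("555-0100", "phone")],
}
fax_phone = build_context_index(docs, "fax", "phone", window=50)
print(fax_phone)
```

At query time the pairs for “fax” and phone are read directly instead of being re-joined, which trades extra index space for faster answers.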
50
50 Sample issues– Indexing, Optimization Index configuration: What “pre-join” into context index? Tradeoff: space cost vs. time efficiency Query optimization: Multiple ways to answer-- What plan to use? Plan generation and cost estimation
51
51 More issues… Tagging/merging of basic entities? Application-driven tagging; the Web’s redundancy will alleviate the accuracy demand. Powerful pattern language: linguistic; visual. Advanced statistical analysis: correlation; sampling. Scalable query processing: do the new components scale?
52
52 Promises of the Concepts. From page at a time to entity-tuple at a time: getting directly to target info and evidence. From IR to a mining engine: not only page retrieval but also construction. From offline to online Web mining and integration: enable large-scale ad-hoc mining over the Web. From Web to controlled corpus: enhance not only efficiency but also effectiveness. From passive to active application-driven indexing: enable mining applications.
53
53 Conclusion: Mining in just the middle! Dual Challenges: Getting access to the deep Web. Getting structure from the surface Web. Central Techniques: Holistic mining for both search and integration. Mining Integration Search
54
54 Implications: Open up mining over the Web. Mining Engines: Mining as primary functions Mining for end users Mining Integration Search
55
55 What will such a Mining Engine be? Mining Integration Search You tell me! Students’ imagination knows no bounds.