1 Deep web 1/26/2016 Jianguo Lu
2 What is deep web Also called hidden web, invisible web –In contrast to the surface web Content is dynamically generated by a search interface. The search interface can be –HTML form –Web service Content in general is stored in a database Usually not indexed by a search engine –That is why the surface web is sometimes defined as the part of the web accessible to a search engine Deep web
3 Deep web vs. surface web Bergman, Michael K. (August 2001). "The Deep Web: Surfacing Hidden Value". The Journal of Electronic Publishing 7 (1).
4 How large is the deep web Deep web
5 Deep and surface web may overlap Some content hidden behind an HTML form or web service can also be available in normal HTML pages Some search engines try to index part of the deep web –Google is also crawling the deep web –Madhavan, Jayant; David Ko, Łucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy (2008). Google's Deep-Web Crawl. VLDB –Only a very small portion of the deep web is indexed Deep web
6 Why is there a deep web Not everything can be on the surface web, for many reasons… Some pages are generated on the fly –There are pages that are generated by a specific request, e.g., –books in a library, –historical weather data, –newspaper archives, –all the accounts/members in Flickr/Twitter/Facebook… web sites –There would be too many items if they were all represented as web pages –It is easier to store them in a database instead of providing them as static web pages –Some pages are the result of integration from various databases Content is not restricted to text or HTML. It can be images, PDF, software, music, books, etc. E.g., –all the paintings in a museum –books in a library Maybe password protected But still, we wish the content were searchable… Deep web
7 Deep web crawling Crawl and index the deep web so that hidden data can be surfaced Unlike the surface web, there are no hyperlinks to follow Two tasks –Find deep web data sources, i.e., HTML forms, web services –Accessing the deep web: A survey, B He, M Patel, Z Zhang, KCC Chang - Communications of the ACM, 2007 –Given a data source, download the data from this data source We focus on the second task Deep web
8 Crawling a deep web data source The only interface is an HTML form or a web service –If the data is hidden by an HTML form –Fill the forms –Select and send appropriate queries –Alexandros, Ntoulas; Petros Zerfos, and Junghoo Cho (2005). Downloading Hidden Web Content. UCLA Computer Science. –Yan Wang, Jianguo Lu, Jessica Chen: Crawling Deep Web Using a New Set Covering Algorithm. ADMA 2009 –Jianguo Lu, Yan Wang, Jie Liang, Jessica Chen, Jiming Liu: An Approach to Deep Web Crawling by Sampling. Web Intelligence 2008 –Extract relevant data from the returned HTML pages –If the data is hidden by a web service –Select and send appropriate queries –Form filling and data extraction are not needed It also attracts public interest –Wright, Alex. "Exploring a 'Deep Web' That Google Can't Grasp". New York Times. Deep web
9 Deep web crawling is not a trivial task It is not easy to obtain all the data Query quota Return limit More importantly, high overlapping [Table: random dictionary queries (maven, disarm, sudanese, profession, compete, …, windsor, bosch, cliff, pursuit, konstantin) with their matches, total, distinct, and new counts; the values were lost in extraction] The overlapping rate is 56996/16204 = 3.5, while the percentage retrieved is only 16204/212000 = 0.076 Deep web
10 The problem Minimize the cost while downloading most of the data –Some approaches minimize the number of queries, while we minimize the total number of documents retrieved Minimize the OR (Overlapping Rate) while reaching a high Hit Rate (HR) –S(qj, DB): the set of results of query qj on database DB. Deep web
11 Random queries What is the cost if random queries are sent? The answer depends on the assumptions about a data source In the diagram, models in the lower layer are more difficult to crawl

Model | All the matched documents returned? | Each document has equal probability of being matched?
M0    | yes | yes
M0r   | no  | yes
Mh    | yes | no
Mhr   | no  | no
12 Notations N: the actual number of documents in a data source; t: the number of queries that are sent to a data source; mj: the number of matched documents for query j, 1 ≤ j ≤ t. –n = Σj mj is the sample size, i.e., the total number of matched documents; uj: the number of new documents retrieved by query j, 1 ≤ j ≤ t. Mi = Σj<i uj is the total number of unique documents that are retrieved before query i. –Note that M1 = 0 and M2 = m1. Let M = Mt+1 denote the total number of distinct documents that are retrieved by all the queries in the estimation process; di: the number of duplicate documents retrieved by query i, di + ui = mi; k: the maximal number of returns from a ranked data source, even if there are mj > k matches. OR = n/M: the Overlapping Rate up to the t-th query, i.e., the ratio between the sample size and the number of distinct documents; P = M/N: the percentage of the documents that have been sampled, i.e., the ratio between the distinct documents and the actual size. Random queries
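The bookkeeping behind these notations can be sketched in a few lines of Python. The function name and the toy query results below are invented for illustration:

```python
# Sketch: computing n, M, OR, and P from a crawl log.
# query_results: list of sets of document ids, one set per query.

def crawl_stats(query_results, N):
    """N is the actual data source size (usually unknown in practice)."""
    seen = set()   # distinct documents retrieved so far (M_i)
    n = 0          # sample size: total matches, duplicates included
    for results in query_results:
        m_j = len(results)         # matched documents for this query
        u_j = len(results - seen)  # new documents (shown for completeness)
        d_j = m_j - u_j            # duplicates; d_j + u_j = m_j
        seen |= results
        n += m_j
    M = len(seen)
    OR = n / M     # overlapping rate
    P = M / N      # percentage of the source sampled
    return M, n, OR, P

# Hypothetical example: three queries over a source of N = 10 documents.
M, n, OR, P = crawl_stats([{1, 2, 3}, {2, 3, 4}, {4, 5}], N=10)
print(M, n, OR, P)   # 5 distinct, 8 matches, OR = 1.6, P = 0.5
```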
13 Example of crawling process Suppose N = 600, limit = 30 [Table: one row per query q1…q4, with columns mi, ri, ui, di, Mi, ni, OR, and P; the values were lost in extraction] Random queries
14 Model M0 Random queries Assumptions –All the matched documents are returned –Each document has equal probability of being matched Result –Jianguo Lu, Dingding Li, Estimating Deep Web Data Source Size by Capture-Recapture Method, Information Retrieval. Springer.
15 Model M0 The more accurate formula for the relationship between P and OR is [formula lost in extraction] Conclusion: in model M0, it is not difficult to crawl a data source at all In most cases OR will be higher than what is calculated by the above formula –Because M0 is the simplest model Random queries
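As a hedged illustration of why M0 is easy to crawl, the following simulation treats each matched document as an independent uniform draw with replacement. The closed-form expectation N(1 − (1 − 1/N)^n) is the standard with-replacement result under these assumptions, not necessarily the slide's exact formula:

```python
# Minimal simulation of model M0: every match is a uniform random draw
# from the N documents, and all matches are returned.
import random

def simulate_m0(N, n, seed=0):
    rng = random.Random(seed)
    # n draws with replacement; distinct documents retrieved
    seen = {rng.randrange(N) for _ in range(n)}
    return len(seen)

N = 10000
for n in (N, 2 * N, 3 * N):
    M = simulate_m0(N, n)
    OR, P = n / M, M / N
    expected_P = 1 - (1 - 1 / N) ** n
    print(f"OR={OR:.2f}  P={P:.3f}  expected P={expected_P:.3f}")
```

Under these assumptions an OR of roughly 3.2 already yields P ≈ 0.95, which matches the slide's conclusion that a data source under M0 is not difficult to crawl.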
16 Model M0 vs Mh The blue line is drawn using the equation P = 1 - OR^(-2.1) Several real data sources show a different trend Why? Random queries
17 Model Mh Assumptions: –Each document has unequal probability of being matched by a query –All matched documents are returned h means heterogeneity in catch probability Mh was first proposed in the capture-recapture method –Originally developed in ecology, to estimate the population of wild animals –Process: capture a group of animals, mark and release them; capture another group of animals, mark and release them again; … Random queries Capture frequency of newsgroup documents by queries (A) is the scatter plot when documents are selected by queries. In total 13,600 documents are retrieved. (B) is the first 100 captures in Figure (A). (C) is the histogram of (A). (D) is the log-log plot of (C).
18 Model Mh The empirical result, obtained by linear regression, is [formula lost in extraction] Random queries
19 File size distributions Random queries
20 Measuring heterogeneity Coefficient of Variation (CV) Assume that the documents in the data source have different but fixed probabilities of being captured, i.e., p = {p1, p2, …, pn}, Σ pj = 1. Scatter plots for various CVs. 200 random numbers within the range of 1 and 20,000 are generated in a Pareto distribution. Sampling based approach
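A minimal sketch of the CV computation, assuming p is a fixed probability vector (the example vectors below are made up): CV is the standard deviation of the pi divided by their mean.

```python
# Coefficient of Variation (CV) of capture probabilities.
import math

def coefficient_of_variation(p):
    mean = sum(p) / len(p)
    var = sum((x - mean) ** 2 for x in p) / len(p)
    return math.sqrt(var) / mean

uniform = [0.25, 0.25, 0.25, 0.25]        # M0-like: every document equal
skewed = [0.7, 0.1, 0.1, 0.1]             # Mh-like: heterogeneous
print(coefficient_of_variation(uniform))  # 0.0
print(coefficient_of_variation(skewed))   # larger CV -> harder to crawl
```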
21 Measuring heterogeneity Relationship between CV (γ) and α Random queries
22 Model M0r Assumptions –Only top k documents are returned –Each document has equal probability of being matched –Documents have static ranking Random queries
23 Model M0r When k and m are fixed for every query [formula lost in extraction] Not a practical assumption Random queries
24 Model Mhr Assumptions –Only top k documents are returned –Documents have unequal probability of being matched –Documents have static ranking When k and m are fixed, we have [formula lost in extraction] Random queries
25 Evolution of the models Comparison of models M0, Mh, M0r, and Mhr. Documents are sorted according to their file size in decreasing order. 600 documents are selected in the four models, including the duplicates. k = 10; m = 20. Subplot M0 shows that all the documents are retrieved uniformly. Subplot Mh shows that large documents are preferred, but most of the documents can eventually be sampled. Subplot M0r exhibits a clear cut around the 500th document. Beyond this line there are almost no documents retrieved. Mhr is the compound of M0r and Mh. Random queries
26 Selecting queries We have learnt the cost when random queries are issued. Can we select the queries to reduce the cost? Which models can we apply this to? –Mh or Mhr?
27 Select queries Incremental approach –Method 1. Send a query to download matched documents; 2. While (not most of the documents downloaded) Analyze the downloaded documents to select the next most appropriate query; send the query to download documents; –Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho, Downloading Textual Hidden Web Content through Keyword Queries. JCDL. –Disadvantages –Need to download many (almost all) documents –The crawler may only need to know the URL, not the entire document Sampling based approach –Jianguo Lu, Yan Wang, Jie Liang, Jessica Chen, Jiming Liu: An Approach to Deep Web Crawling by Sampling. Web Intelligence 2008 Select queries
28 Sampling based approach The queries are selected from a sample set of documents In contrast to the incremental approach Steps –Send a few random queries to TotalDB; –Obtain the matched documents and construct the SampleDB; –Analyze all the documents in SampleDB and construct the QueryPool; –Use set covering algorithms to select the Queries; –Send the Queries to TotalDB to retrieve documents. Can the queries cover most of the data source? Can a low OR in SampleDB be projected to TotalDB? Does SampleDB need to be very large? Sampling based approach
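The steps above can be sketched end to end on a toy in-memory corpus. TotalDB, SampleDB, QueryPool, and Queries follow the naming in the steps, while the documents, seed terms, and the simple greedy covering step are stand-ins for illustration (the real method uses the set covering algorithms discussed later):

```python
# Runnable sketch of the sampling-based approach on a toy "TotalDB".
TotalDB = [set(doc.split()) for doc in [
    "deep web crawling", "web data source", "query selection cost",
    "set covering greedy", "sampling based approach", "deep data sampling",
]]

def match(term, db):
    """Ids of the documents in db that the query term retrieves."""
    return {i for i, doc in enumerate(db) if term in doc}

# Steps 1-2: send a few random queries, build SampleDB from the matches.
seed_terms = ["web", "sampling"]
sample_ids = set().union(*(match(t, TotalDB) for t in seed_terms))
SampleDB = [TotalDB[i] for i in sample_ids]

# Step 3: build QueryPool from the vocabulary of SampleDB.
QueryPool = sorted(set().union(*SampleDB))

# Step 4: greedily pick queries that cover SampleDB (covering stand-in).
uncovered = set(range(len(SampleDB)))
Queries = []
while uncovered:
    best = max(QueryPool, key=lambda t: len(match(t, SampleDB) & uncovered))
    if not match(best, SampleDB) & uncovered:
        break   # remaining sample documents are uncoverable
    Queries.append(best)
    uncovered -= match(best, SampleDB)

# Step 5: send the selected queries to TotalDB.
retrieved = set().union(*(match(q, TotalDB) for q in Queries))
print(Queries, len(retrieved), "of", len(TotalDB))
```

Note that documents whose vocabulary never appears in the sample cannot be retrieved, which is exactly what Hypothesis 1 below examines.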
29 Hypothesis 1: vocabulary learnt from the sample can cover most of the documents in TotalDB Impact of sample size on HR. The queries are selected from SampleDB and cover above 99% of the documents in SampleDB. The HR in the plot is obtained when those queries are sent to the TotalDB. Relative query pool size is 20. Sampling based approach
30 Hypothesis 2: a low OR in SampleDB can be projected to TotalDB Comparison of our method on the four corpora with queries selected randomly from the sample. Sample size is 3000, relative query pool size is 20. Our method achieves a much smaller OR when HR is high. Sampling based approach
31 Hypothesis 3: neither the sample size nor the query pool size needs to be very large Sampling based approach
32 Hypothesis 3 (continued) Sampling based approach
33 Set covering problem Given a universe U and a family of subsets S = {S1, S2, …, Sn} of U, a cover is a subfamily of S whose union is U. Let J = {1, 2, …, n}. J* ⊆ J is a cover if ∪j∈J* Sj = U. Set covering decision problem: the input is a pair (S, U) and an integer k; the question is whether there is a set covering of size k or less. Set covering optimization problem: the input is a pair (S, U), and the task is to find a set covering which uses the fewest sets. The decision version of set covering is NP-complete, and the optimization version is NP-hard. Set covering
34 Set covering example [Matrix: rows t1, t2, t3; columns d1, d2, d3] Suppose each row represents a term and each column represents a document. If the cell (i, j) is 1, term i can retrieve document j, or term i covers document j. Set covering
35 Set covering algorithms The optimal solution is hard to obtain within polynomial time, so various approximation algorithms have been developed –Greedy –A classical algorithm –Weighted greedy –Developed for our particular application –Yan Wang, Jianguo Lu, Jessica Chen: Crawling Deep Web Using a New Set Covering Algorithm. ADMA 2009 –Genetic algorithm –Clustering –… Set covering
36 Greedy algorithm At each step, select the set that covers the largest number of new elements Set covering
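A minimal sketch of the greedy algorithm; the term/document sets in the example are hypothetical:

```python
# Greedy set covering: repeatedly pick the set that contributes the
# most still-uncovered elements. An approximation, not guaranteed optimal.

def greedy_cover(universe, sets):
    """universe: set of elements; sets: dict name -> set of elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # the set contributing the most new elements
        name = max(sets, key=lambda s: len(sets[s] & uncovered))
        if not sets[name] & uncovered:
            break   # remaining elements are uncoverable
        chosen.append(name)
        uncovered -= sets[name]
    return chosen

# Hypothetical example: terms t1..t3 covering documents d1..d4.
sets = {"t1": {"d1", "d2"}, "t2": {"d3", "d4"}, "t3": {"d1", "d2", "d3"}}
print(greedy_cover({"d1", "d2", "d3", "d4"}, sets))
```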
37 Greedy algorithm may not be able to find the optimal solution There can be two solutions –If the first set selected is t1, then the solution is {t1, t2} and the cost is 4 –If the first selection is t2, then the solution is {t2, t3} and the cost is 3 Set covering
38 Weighted greedy algorithm [Figure: queries q1–q5 with their costs and the documents they cover] Set covering
39 One solution obtained by greedy algorithm [Figure: greedy selects q5, q4, and q3 in turn] Total cost is 5+4+5 = 14 Set covering
40 Solution obtained by weighted greedy algorithm [Figure: weighted greedy selects q4, q3, and q1 in turn] Total cost is 4+5+4 = 13 Set covering
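A sketch of the weighted greedy variant: each query is assigned a cost (in the crawling application, e.g., the total number of documents it retrieves, all of which must be downloaded), and each step picks the query with the best new-elements-per-cost ratio. The queries, costs, and coverage below are invented for illustration, not the figures from the slides:

```python
# Weighted greedy set covering: maximize new elements per unit cost.

def weighted_greedy(universe, sets, cost):
    uncovered, chosen, total = set(universe), [], 0
    while uncovered:
        def ratio(name):
            new = len(sets[name] & uncovered)
            return new / cost[name] if new else 0.0
        best = max(sets, key=ratio)
        if not sets[best] & uncovered:
            break   # remaining elements are uncoverable
        chosen.append(best)
        total += cost[best]
        uncovered -= sets[best]
    return chosen, total

sets = {"q1": {"d1", "d2"}, "q2": {"d2", "d3", "d4"}, "q3": {"d4", "d5"}}
cost = {"q1": 4, "q2": 3, "q3": 2}
print(weighted_greedy({"d1", "d2", "d3", "d4", "d5"}, sets, cost))
```

Unweighted greedy would rank queries only by how many new documents they cover; dividing by cost steers the crawler away from expensive queries that retrieve many duplicates.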
41 Review Deep web crawling Random queries and models Sampling based crawling Set covering algorithms –Greedy –Weighted greedy –Clustering –Genetic –Currently these are for model Mh –What is the solution for model Mhr? For model Mhr, we need to predict the frequencies of the terms in TotalDB