Donghui Xu Spring 2011, COMS E6125 Prof. Gail Kaiser.

1 Donghui Xu Spring 2011, COMS E6125 Prof. Gail Kaiser

2
• What is the hidden Web
• Two approaches to searching the hidden Web
  ◦ Browsing a Yahoo!-like Web directory
  ◦ Crawling the hidden Web
• Conclusion

3
• The surface Web
  ◦ Reachable via hyperlinks

4
• The hidden Web
  ◦ No static hyperlink points to the page
  ◦ Accessed via a query interface
  ◦ Dynamically generated based on the query submitted

5

6
• About 500 times larger than the surface Web
  ◦ Surface Web: 1 billion pages
  ◦ Hidden Web: over 550 billion pages
• The sixty largest deep Web sites together are about 40 times larger than the surface Web
The Deep Web vs. the Surface Web (from Bergman)

7
Name | URL | Web size (GB)
National Climatic Data Center (NOAA) | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
NASA EOSDIS | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
MP3.com | http://www.mp3.com/ | 4,300
US PTO - Trademarks + Patents | http://www.uspto.gov/tmdb/, http://www.uspto.gov/patft/ | 2,440
Informedia (Carnegie Mellon Univ.) | http://www.informedia.cs.cmu.edu/ | 1,830
UC Berkeley Digital Library Project | http://elib.cs.berkeley.edu/ | 766
US Census | http://factfinder.census.gov | 610
NCI CancerNet Database | http://cancernet.nci.nih.gov/ | 488
Amazon.com | http://www.amazon.com/ | 461
IBM Patent Center | http://www.patents.ibm.com/boolquery | 345
NASA Image Exchange | http://nix.nasa.gov/ | 337
Some of the largest hidden Web sites (from Bergman)

8
• Browsing a Yahoo!-like Web directory
• Crawling the hidden Web

9
• Manually populate a Yahoo!-like directory
• Classify collections of text databases into categories and subcategories

10
• Pros
  ◦ Intuitive
  ◦ Easy to use
• Cons
  ◦ Labor intensive: the Yahoo! Directory contains 200, 0000 categories, and millions of databases are searchable online
  ◦ Accurate classification is not an easy task

11
• Main challenge in searching the hidden Web
  ◦ How to automatically generate meaningful queries to submit to a query interface
• The query generation problem
  ◦ Assume that a Web site contains a set of pages, S
  ◦ Each query qi issued returns a subset of S, Si
  ◦ The task is to select a set of queries that returns the maximum number of unique pages in the database at minimum cost
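The query generation problem above resembles a budgeted set-cover problem, so one way to picture it is with a greedy heuristic. The sketch below is only an illustration under strong assumptions: it pretends the result set Si of every candidate query is known in advance (a real crawler has to estimate it), and the names `greedy_query_selection`, `results`, `cost`, and `budget` are hypothetical, not taken from the cited papers.

```python
def greedy_query_selection(results, cost, budget):
    """Greedily pick queries that add the most new pages per unit cost.

    results: dict mapping each candidate query to the set of page ids it
             returns (assumed known here, which a real crawler cannot know).
    cost:    dict mapping each query to a positive cost.
    budget:  total cost the crawler is allowed to spend.
    """
    covered = set()      # unique pages retrieved so far
    chosen = []          # queries in the order they were issued
    spent = 0.0
    remaining = set(results)

    while remaining:
        best, best_ratio = None, 0.0
        for q in sorted(remaining):          # sorted for deterministic ties
            if spent + cost[q] > budget:
                continue                     # cannot afford this query
            new_pages = len(results[q] - covered)
            ratio = new_pages / cost[q]
            if ratio > best_ratio:
                best, best_ratio = q, ratio
        if best is None:                     # nothing affordable adds pages
            break
        chosen.append(best)
        covered |= results[best]
        spent += cost[best]
        remaining.remove(best)

    return chosen, covered, spent


# Toy example: four candidate queries over five pages, unit cost each.
results = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}, "d": {5}}
cost = {q: 1 for q in results}
chosen, covered, spent = greedy_query_selection(results, cost, budget=2)
print(chosen, sorted(covered), spent)
```

With a budget of two unit-cost queries, the sketch issues "a" first (three new pages) and then "b" (one new page), skipping "c" because it adds nothing new.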

12
• Random: select the query randomly from a list of keywords (e.g., a random word from an English dictionary).
• Generic Frequency: select the most frequent keywords from a generic document corpus.
• Adaptive: select promising keywords from the documents downloaded by previously issued queries.
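The three policies can be sketched as small keyword-selection functions. Everything here is an assumption made for illustration: the four-page `DATABASE`, the `search` stand-in for a site's query interface, and the function names are hypothetical. In particular, the adaptive function simply ranks terms by how many downloaded pages contain them, which is a deliberately naive reading of "promising keywords" and not the estimator used in the cited work.

```python
import random
from collections import Counter

# Toy "hidden" database: page id -> text. Purely illustrative data.
DATABASE = {
    1: "cell protein",
    2: "cell gene",
    3: "gene trial",
    4: "trial drug",
}


def search(keyword):
    """Stand-in for a query interface: ids of pages containing the keyword."""
    return {pid for pid, text in DATABASE.items() if keyword in text.split()}


def random_policy(dictionary, issued, rng):
    """Random: pick any not-yet-issued word from a keyword list."""
    candidates = [w for w in dictionary if w not in issued]
    return rng.choice(candidates) if candidates else None


def generic_frequency_policy(ranked_words, issued):
    """Generic Frequency: next unused word from a list that is assumed to be
    pre-sorted by frequency in some generic corpus."""
    for word in ranked_words:
        if word not in issued:
            return word
    return None


def adaptive_policy(downloaded, issued):
    """Adaptive (naive): the unused term found in the most downloaded pages."""
    counts = Counter()
    for pid in sorted(downloaded):
        # sorted(set(...)) keeps tie-breaking deterministic across runs
        counts.update(sorted(set(DATABASE[pid].split())))
    for word, _ in counts.most_common():
        if word not in issued:
            return word
    return None


def crawl_adaptive(seed, max_queries):
    """Issue up to max_queries keywords, starting from a seed keyword."""
    issued, downloaded = [], set()
    keyword = seed
    while keyword is not None and len(issued) < max_queries:
        issued.append(keyword)
        downloaded |= search(keyword)
        keyword = adaptive_policy(downloaded, issued)
    return issued, downloaded


issued, downloaded = crawl_adaptive("cell", max_queries=4)
print(issued, sorted(downloaded))
```

Starting from the seed "cell", the adaptive loop in this toy setup follows terms found in pages it has already fetched ("protein", "gene", "trial") and reaches all four pages in four queries. The second query, "protein", returns nothing new, which hints at why a real adaptive policy would try to estimate the number of new pages per query rather than rely on raw frequency.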

13 Comparison of policies for dmoz (modified from Ntoulas et al.)

14 Comparison of policies for PubMed (modified from Ntoulas et al.)

15
• The surface Web is the tip of the iceberg
• Beneath it is an even vaster hidden Web
• Two main approaches to accessing the hidden Web
  ◦ Yahoo!-like Web directory
  ◦ Crawling the hidden Web
• Much work remains to be done.
• Hidden Web search technology would enable us to connect different data sources and allow businesses to use data in new ways.

16
[1] Michael K. Bergman. "The Deep Web: Surfacing Hidden Value." The Journal of Electronic Publishing, August 2001.
[2] Alex Wright. "Exploring a 'Deep Web' That Google Can't Grasp." New York Times, February 3, 2009.
[3] S. Raghavan and H. Garcia-Molina. "Crawling the Hidden Web." In Proceedings of the International Conference on Very Large Databases (VLDB), 2001.
[4] Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano. "Modeling and Managing Content Changes in Text Databases." ACM Transactions on Database Systems, 32(3), June 2007.
[5] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[6] Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho. "Downloading Textual Hidden Web Content by Keyword Queries." In Proceedings of the Joint Conference on Digital Libraries (JCDL), June 2005.
[7] J. P. Callan and M. E. Connell. "Query-Based Sampling of Text Databases." ACM Transactions on Information Systems, 19(2):97–130, 2001.

17 Thanks!

