Published by Edgar Payne. Modified over 9 years ago.
1
Divide and Conquer: Challenges in Scaling Federated Search
Presented by Abe Lederman, President and CTO, Deep Web Technologies, LLC
Search Engine Meeting, 24 April 2006, Boston, MA
2
SEARCH ALL OF THESE SOURCES ONE AT A TIME
3
OR SEARCH THEM ALL AT ONCE
4
Finding the Gold Hidden in the World Wide Web
- “Google-type” search engines “pan” the surface web for gold
- “Deep Web” search engines go mining for gold
6
Challenges Overview
- Managing a large number of sources
- Searching a large number of sources in parallel
- Organizing and ranking the results returned
7
Challenges of Managing Thousands of Data Sources
- Locate Reliable Sources
- Categorize Sources by Content
- Configure Sources for Searching
- Maintain Sources
8
Challenges in Searching Thousands of Sources
- Automatically Select Sources to Search
- Retrieve Results from Cache
- Perform Many Searches in Parallel
- Bring Back Best Results
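The "Perform Many Searches in Parallel" item can be illustrated with a minimal sketch. It assumes a thread pool is an acceptable way to fan out queries; `search_source` is a hypothetical stand-in for a real source connector and is not taken from the presentation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_source(name, query):
    # Hypothetical connector: a real one would call the source over the network.
    return [f"{name}: result for {query!r}"]


def search_all(sources, query, max_workers=8):
    """Query every source concurrently; collect results and note failures."""
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each pending future back to the source it belongs to.
        futures = {pool.submit(search_source, s, query): s for s in sources}
        for future in as_completed(futures):
            try:
                results.extend(future.result())
            except Exception:
                # One failing source should not sink the whole search.
                failed.append(futures[future])
    return results, failed


if __name__ == "__main__":
    print(search_all(["source_a", "source_b"], "federated search"))
```

Results arrive in completion order, not in the order the sources were listed, so any ranking has to happen after collection.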
9
Source Selection Optimizer
[Diagram: Search Conductor; Source Selection Optimizer; Source Descriptions; Previous Results]
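The slide names two inputs to source selection: source descriptions and previous results. One way those inputs might be combined is sketched below. The scoring rule (term overlap plus a small bonus per past hit) and all names are assumptions made for illustration, not the optimizer described in the talk.

```python
def select_sources(query, descriptions, previous_hits, limit=3):
    """Rank sources by description overlap plus a bonus for past results."""
    terms = set(query.lower().split())
    scores = {}
    for source, description in descriptions.items():
        overlap = len(terms & set(description.lower().split()))
        history = previous_hits.get(source, 0)
        # Assumed weighting: each matching term counts 1, each past hit 0.1.
        scores[source] = overlap + 0.1 * history
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Drop sources with no evidence at all, then keep the top `limit`.
    return [s for s in ranked if scores[s] > 0][:limit]


if __name__ == "__main__":
    descriptions = {
        "physics_db": "physics research papers",
        "bio_db": "biology research papers",
        "news": "daily headlines",
    }
    print(select_sources("physics papers", descriptions, {"bio_db": 5}))
```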
10
Caching of Search Results
- Reduces the load (cost) of accessing sources
CHALLENGES
- Requires a large database
- Need to determine how often to update the cache
- Works best with lots of users doing similar searches
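The "how often to update the cache" question can be made concrete with a time-to-live (TTL) cache. The sketch below is an in-memory illustration under assumed names (`ResultCache`, `ttl_seconds`); a production cache would sit on the large database the slide mentions.

```python
import time


class ResultCache:
    """In-memory result cache whose entries expire after a fixed TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._entries = {}  # (source, normalized query) -> (stored_at, results)

    @staticmethod
    def _key(source, query):
        # Normalize case and whitespace so similar searches share an entry.
        return (source, " ".join(query.lower().split()))

    def get(self, source, query):
        key = self._key(source, query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, results = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[key]  # stale: force a fresh search
            return None
        return results

    def put(self, source, query, results):
        self._entries[self._key(source, query)] = (self.clock(), results)
```

A short TTL keeps results fresh but sends more traffic to the sources; a long TTL does the opposite. Normalizing the query is what lets many users doing similar searches hit the same entry.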
11
We Address Scalability Through a Grid-Based Solution
- Uses open standards (Web Services, WSDL, SOAP, XML)
- Runs on distributed nodes
- Is platform independent (Java based)
- Is very flexible, providing a framework for integration of various filtering and analysis tools
12
Distributing the Workload as Grid Services
13
Search Conductor
[Flowchart steps: Select sources to search; Perform Search; Enough good results?; Can I get more results from “good” sources?; Get Next Results; Deliver results to user. Branch labels: YES, YES, NO]
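The flowchart's wiring does not survive in the transcript, so the loop below is only one plausible reading of it: search, check whether enough good results are in hand, and if not, ask the sources that produced good results for their next page. All function and parameter names are hypothetical.

```python
def conduct_search(sources, fetch_page, is_good, wanted=10, max_rounds=5):
    """Keep paging through productive sources until enough good results exist."""
    good = []
    active = list(sources)
    page = 0
    while active and len(good) < wanted and page < max_rounds:
        still_productive = []
        for source in active:
            results = fetch_page(source, page)
            hits = [r for r in results if is_good(r)]
            good.extend(hits)
            if hits:
                # Assumption: only "good" sources are asked for more results.
                still_productive.append(source)
        active = still_productive
        page += 1
    return good[:wanted]


if __name__ == "__main__":
    pages = {"a": [[1, 2], [4, 6], [8]], "b": [[3], [10]]}

    def fetch_page(source, page):
        return pages[source][page] if page < len(pages[source]) else []

    print(conduct_search(["a", "b"], fetch_page, lambda r: r % 2 == 0, wanted=4))
```

In this reading a source that returns nothing good on one page is dropped, which saves load but can miss good results on later pages.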
14
Searching a large number of sources can lead to a flood of results
15
Challenges in Organizing and Ranking Results
- Multi-tier Relevance Ranking
- User-driven Ranking
- Clustering of Results
16
Multi-tier Relevance Ranking
- QuickRank – Ranks results based on occurrence of search terms in title, author, and snippet
- MetaRank – Ranks results utilizing custom algorithms applied to metadata
- DeepRank – Downloads and indexes full-text documents
HEAVY LIFTING REQUIRED!
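The first tier is described only as ranking by occurrence of search terms in title, author, and snippet. A minimal sketch in that spirit follows; the field weights and the simple term counting are assumptions, not the QuickRank algorithm itself.

```python
def quick_rank(query, results, weights=None):
    """Order results by weighted counts of query terms in selected fields."""
    # Assumed weights: a title match counts more than a snippet match.
    weights = weights or {"title": 3.0, "author": 2.0, "snippet": 1.0}
    terms = query.lower().split()

    def score(result):
        total = 0.0
        for field, weight in weights.items():
            words = result.get(field, "").lower().split()
            total += weight * sum(words.count(term) for term in terms)
        return total

    return sorted(results, key=score, reverse=True)


if __name__ == "__main__":
    docs = [
        {"title": "Grid computing", "author": "Smith",
         "snippet": "federated search on grids"},
        {"title": "Federated search at scale", "author": "Lee",
         "snippet": "scaling federated search"},
    ]
    for doc in quick_rank("federated search", docs):
        print(doc["title"])
```

This tier needs only the fields a source returns in its result list, which is why it is cheap compared with downloading and indexing full text.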
17
User-driven Ranking
- Credibility of source
- Date range
- Document length
- Document type
- Geographic proximity
- Popularity of document
- Reading level
- Relevance
Desired: Blending (weighing) of above criteria
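The desired blending of criteria could take many forms. A weighted average is the simplest, sketched here on the assumption that each criterion has already been scaled to the range 0 to 1; the criterion names and weights are illustrative only.

```python
def blended_score(criteria, weights):
    """Weighted average of per-criterion scores, each assumed to be in 0..1."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    # Criteria missing from a document contribute a score of 0.
    weighted = sum(w * criteria.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


if __name__ == "__main__":
    document = {"relevance": 0.8, "credibility": 0.5}
    user_weights = {"relevance": 3, "credibility": 1}
    print(blended_score(document, user_weights))
```

Letting the user set the weights is what makes the ranking user-driven: the same result list reorders as the weights change, with no new search.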
18
Clustering
19
A Grand Challenge for Federated Search
Source: Walter Warnick, Ph.D., DOE OSTI. Global Discovery: Increasing the Pace of Knowledge Diffusion to Increase the Pace of Science. Presented at the Annual Meeting of the American Association for the Advancement of Science, February 16-20, 2006.
20
Knowledge Diffusion in Action
[Diagram: Global Discovery Search Portal with three communities]
- Math Community: Mathematician’s Scientific Discovery; Math Databases: Research Papers, Correspondence, Conferences
- Biology Community: Biology Researcher’s Scientific Discovery; Biology Databases: Research Papers, Correspondence, Conferences
- Physics Community: Physics Scientific Discovery; Physics Databases: Research Papers, Correspondence, Conferences
21
Scaling to the Next Level
Grid of Grids
- Each circle = a portal with 10-100 sources
- End result is thousands of sources in 2 hops
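The "thousands of sources in 2 hops" figure follows from simple fan-out arithmetic. The sketch below counts reachable sources in a nested portal structure. It assumes that hop 1 reaches the top portal and hop 2 reaches the portals it federates; the dictionary layout and the 50 x 40 example are hypothetical.

```python
def sources_within_hops(portal, hops):
    """Count sources reachable from `portal` in at most `hops` hops."""
    if hops <= 0:
        return 0
    # Sources attached directly to this portal cost one hop.
    total = len(portal["sources"])
    for child in portal["portals"]:
        total += sources_within_hops(child, hops - 1)
    return total


if __name__ == "__main__":
    # Assumed example: a top portal federating 50 portals of 40 sources each.
    child = {"sources": list(range(40)), "portals": []}
    top = {"sources": [], "portals": [child] * 50}
    print(sources_within_hops(top, 2))
```

With 50 portals of 40 sources each, two hops reach 50 x 40 = 2,000 sources, which sits inside the slide's range of 10 to 100 sources per portal.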
22
Thank You!
Abe Lederman
122 Longview Drive
Los Alamos, NM 87544
abe@deepwebtech.com
www.deepwebtech.com