1
By: Jamie McPeek
2
1. Background Information
   1. Metasearch
   2. Sets
   3. Surface Web/Deep Web
   4. The Problem
   5. Application Goals
3
Metasearch is a way of querying any number of other search engines and combining their results to generate more precise data. Metasearch engines began appearing on the web in the mid-90s. Many metasearch engines exist today but are generally overlooked due to Google's grasp on the search industry. An example: www.info.com
4
A set is a collection of objects with no repeated values. For our purposes, a set consists of any number of web pages, named after the search engine that returned them. The combined results of all search engines are no longer a set, as they may contain many repeated values. Removing search engines to reduce or eliminate that redundancy is an NP-complete problem.
5
A cover is a collection of sets that together contain at least one copy of each element in the "universe"; in our case, every web page in the total pool of pages returned. With the cover in mind, we have a goal: remove as many search engines as possible while maintaining a cover. If possible, remove all redundant search engines. This yields a minimal cover.
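As an illustration of the cover idea (not the project's code), here is a minimal C++ sketch with hypothetical names: each search engine is treated as a set of page IDs, and a collection of engines is a cover when their union equals the universe of returned pages.

    #include <iostream>
    #include <set>
    #include <vector>

    // Returns true if the chosen engines together contain every page in 'universe'.
    bool isCover(const std::vector<std::set<int>>& engines,
                 const std::vector<int>& chosen,
                 const std::set<int>& universe) {
        std::set<int> covered;
        for (int idx : chosen)
            covered.insert(engines[idx].begin(), engines[idx].end());
        return covered == universe;   // every page appears in at least one chosen engine
    }

    int main() {
        // Three engines returning overlapping pages (IDs stand in for URLs).
        std::vector<std::set<int>> engines = { {1, 2, 3}, {3, 4}, {2, 4, 5} };
        std::set<int> universe = {1, 2, 3, 4, 5};

        std::cout << isCover(engines, {0, 2}, universe) << '\n';   // 1: engines 0 and 2 cover everything
        std::cout << isCover(engines, {0, 1}, universe) << '\n';   // 0: page 5 is missed
        return 0;
    }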
6
The surface web is any content that is publicly available and accessible on the internet. The deep web is all other content: data accessible only through an on-site query system, data generated on the fly, and frequently updated or changed content. Most ordinary search engines are incapable of retrieving data from the deep web, or capture it in only one state.
7
If various search engines capture different states of this content, compiling their results can provide a clearer picture of the actual content. Specialized and site-only search systems can almost always be generalized to allow remote searching. With the above in mind, metasearch becomes an intriguing way to view not only the surface web, but the deep web as well.
8
A finite and known number of search engines each return a set of a finite and known number of web pages. Across all search engines, there may be redundancy. The idea is to remove as many unnecessary search engines from the meta set as possible while leaving a complete cover of the web pages. Accuracy (relative to the true minimal cover) and speed are the most important aspects.
9
Using two different languages, we want to:
1. Compare the accuracy and speed of two different algorithms.
2. Compare different structures for the data using the same algorithm.
3. Assess the impact of "regions" on overall time.
Regions are a way of grouping elements based on which search engines contain them. At most one element in a region is necessary; all others are fully redundant.
10
2. Source Code
   1. Original Setup
   2. Key Structures – C
   3. Key Structures – C++
   4. Procedure
   5. Reasons For Changing
11
System: UW-Platteville's IO System. Language: C. Only minor work was done on this code, as it was already written; it serves as a baseline for improvements. The code was managed using Subversion to allow rollbacks and check-ins.
12
Structure for storing the sets. Each web page is mapped to a specific bit, and bitwise operators are used to access or edit that bit. For example, clearing a bit:

    /* clear this page's bit within its word of the bitmap */
    list[index] &= ~(1 << (position % (sizeof(unsigned int) << 3)));
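For completeness, here is a short sketch (not the original source) of the companion operations under the same assumption of (sizeof(unsigned int) << 3) bits per word; the helper names are made up for illustration.

    // Bits per word (8 * sizeof(unsigned int) on typical platforms).
    const unsigned BITS = sizeof(unsigned int) << 3;

    // Set the bit for page 'position' (mark it as present).
    void setBit(unsigned int* list, unsigned position) {
        list[position / BITS] |= 1u << (position % BITS);
    }

    // Test whether the bit for page 'position' is set.
    bool testBit(const unsigned int* list, unsigned position) {
        return (list[position / BITS] >> (position % BITS)) & 1u;
    }

    // Clear the bit for page 'position' (the operation shown above, with the word index made explicit).
    void clearBit(unsigned int* list, unsigned position) {
        list[position / BITS] &= ~(1u << (position % BITS));
    }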
13
Structure for storing web pages, kept in a tree for faster insertion. The BITMAP in this instance records which search engine(s) the document exists in. nID allows reference back to the specific web page without dragging the string around the entire time.
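The slides do not show the struct itself; the following is a hypothetical sketch of what such a tree node might look like, with assumed field names and layout.

    /* Hypothetical sketch: field names and layout are assumed, not taken from the original code. */
    typedef struct {
        unsigned int* words;    /* one bit per search engine that returned this page */
        unsigned int  count;    /* number of words allocated */
    } BITMAP;

    typedef struct PageNode {
        int     nID;            /* numeric ID used instead of the URL string after input */
        char*   szURL;          /* kept only for reporting at the end */
        BITMAP  engines;        /* which search engines the page appears in */
        struct PageNode* left;  /* children in the balanced tree used for insertion */
        struct PageNode* right;
    } PageNode;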
14
The bitmap structure as changed for C++:
- Added members that store "fixed" calculations so they are not recomputed.
- Added members to hold the new data available when reading in the web pages.
- Converted to a class (OOP).
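A plausible sketch of such a C++ bitmap class is shown below; the member names and the cached values are assumptions, not the actual class.

    #include <cstddef>
    #include <vector>

    class Bitmap {
    public:
        explicit Bitmap(std::size_t bitCount)
            : bits_(bitCount),
              words_((bitCount + BITS - 1) / BITS),   // "fixed" calculation done once, not per call
              data_(words_, 0u) {}

        void set(std::size_t pos)        { data_[pos / BITS] |=  1u << (pos % BITS); }
        void clear(std::size_t pos)      { data_[pos / BITS] &= ~(1u << (pos % BITS)); }
        bool test(std::size_t pos) const { return (data_[pos / BITS] >> (pos % BITS)) & 1u; }

        std::size_t size() const { return bits_; }

    private:
        static const std::size_t BITS = sizeof(unsigned int) * 8;
        std::size_t bits_;                  // stored so it never has to be recomputed
        std::size_t words_;
        std::vector<unsigned int> data_;
    };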
15
A new structure implemented for C++: a two-dimensional grid of nodes implemented as a linked list. It eliminates "empty" bits, is self-destructive in use, and is accessed by matrix coordinates (i, j).
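One way such a structure could look is sketched below as an orthogonally linked node; the field names and linking scheme are assumptions based on the description, not the original code.

    // Hypothetical node for the sparse grid: only (search engine, document) pairs that exist
    // get a node, so no "empty" bits are stored.
    struct MatrixNode {
        int i;                  // row: search engine index
        int j;                  // column: document index
        MatrixNode* nextInRow;  // next document held by the same engine
        MatrixNode* nextInCol;  // next engine holding the same document
    };

    // "Self-destructive" use: as documents become covered, their nodes are unlinked
    // (and eventually freed), so later passes only walk what is still uncovered.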
16
1. Read the web pages in from the file:
   1. Number of search engines.
   2. Number of documents per search engine.
2. Store each incoming web page as a node in a balanced tree, tracking:
   1. The total number of web pages.
   2. The total number of unique web pages.
3. Set up whichever structure is to be used, based on the counts learned while reading in and storing the web pages (a counting sketch follows this list).
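A minimal sketch of the counting in steps 1 and 2, assuming a hypothetical input layout (the two header counts followed by one URL per token); the real file format is not shown in the slides, and std::set stands in for the balanced tree.

    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>

    int main() {
        std::ifstream in("pages.txt");           // hypothetical file name and layout
        std::size_t engineCount = 0, perEngine = 0;
        in >> engineCount >> perEngine;          // header: engines, documents per engine

        std::set<std::string> unique;            // stands in for the balanced tree
        std::size_t total = 0;
        std::string url;
        while (in >> url) {                      // every remaining token is a page URL
            ++total;
            unique.insert(url);                  // duplicates across engines collapse here
        }

        std::cout << "total pages:  " << total << '\n'
                  << "unique pages: " << unique.size() << '\n';
        return 0;
    }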
17
Populate the structures based on the data available in the tree; this can be the original tree or the region tree. The bitmap structure is stored in two ways (sketched below):
1. Search engine "major" – used in the original C code.
2. Document "major" – used in the new C++ code.
Then run one of the two algorithms over the structure and record the results: cover size and amount of time taken.
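A rough sketch of the two bit layouts; the grid type and helper names are assumptions, not the project's code.

    #include <cstddef>
    #include <vector>

    const std::size_t BITS = sizeof(unsigned int) * 8;
    typedef std::vector<std::vector<unsigned int>> BitGrid;

    // Search-engine major (original C layout): row e is a bitmap over all documents.
    void markEngineMajor(BitGrid& grid, std::size_t engine, std::size_t doc) {
        grid[engine][doc / BITS] |= 1u << (doc % BITS);
    }

    // Document major (new C++ layout): row d is a bitmap over all search engines.
    void markDocMajor(BitGrid& grid, std::size_t doc, std::size_t engine) {
        grid[doc][engine / BITS] |= 1u << (engine % BITS);
    }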
18
- Personal preference: I'm significantly more familiar with C++ than I am with C.
- Additional compiler options for the language.
- OOP and additional language features: inline functions, operator overloading.
- More readable code.
19
3. Algorithms
   1. Greedy Algorithm
   2. Check and Remove (CAR) Algorithm
4. Results
   1. Data Sets
   2. Baseline Results
   3. Updated (C++) Results
5. Regions
   1. Impact (Pending)
20
Straightforward, brute force: add the largest set, then the next largest, and so on until a cover is reached (a sketch follows). It is easily translated to code, but makes no provision for removing redundant sets after reaching a cover.
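A minimal sketch of a standard greedy set-cover pass (not the project's exact code): each round adds the engine that covers the most still-uncovered pages, which matches the "largest first" description once sizes are measured against what remains.

    #include <cstddef>
    #include <set>
    #include <vector>

    // Greedy cover: repeatedly pick the engine that covers the most uncovered pages.
    std::vector<std::size_t> greedyCover(const std::vector<std::set<int>>& engines,
                                         std::set<int> uncovered) {
        std::vector<std::size_t> chosen;
        while (!uncovered.empty()) {
            std::size_t best = 0, bestGain = 0;
            for (std::size_t e = 0; e < engines.size(); ++e) {
                std::size_t gain = 0;
                for (int page : engines[e])
                    if (uncovered.count(page)) ++gain;        // pages this engine would newly cover
                if (gain > bestGain) { bestGain = gain; best = e; }
            }
            if (bestGain == 0) break;                         // nothing left to gain: done (or uncoverable)
            chosen.push_back(best);
            for (int page : engines[best]) uncovered.erase(page);
        }
        return chosen;                                        // no post-pass: redundant picks are never removed
    }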
21
A less direct approach: the check phase adds sets based on "uncovered" elements, and the remove phase makes a single pass at removing any redundant sets (a sketch follows).
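The slides give only the outline of CAR; below is a sketch of one reasonable reading (check phase: add an engine for any page still uncovered; remove phase: a single pass dropping any engine whose pages are all covered by the others). The details are assumptions, not the published algorithm.

    #include <cstddef>
    #include <map>
    #include <set>
    #include <vector>

    std::set<std::size_t> carCover(const std::vector<std::set<int>>& engines,
                                   const std::set<int>& universe) {
        std::set<std::size_t> chosen;

        // Check phase: for every page not yet covered, add some engine containing it.
        std::set<int> covered;
        for (int page : universe) {
            if (covered.count(page)) continue;
            for (std::size_t e = 0; e < engines.size(); ++e) {
                if (engines[e].count(page)) {
                    chosen.insert(e);
                    covered.insert(engines[e].begin(), engines[e].end());
                    break;
                }
            }
        }

        // Remove phase: a single pass dropping engines whose pages are all covered by the rest.
        std::map<int, int> owners;                    // page -> how many chosen engines contain it
        for (std::size_t e : chosen)
            for (int page : engines[e]) ++owners[page];
        for (auto it = chosen.begin(); it != chosen.end(); ) {
            bool redundant = true;
            for (int page : engines[*it])
                if (owners[page] == 1) { redundant = false; break; }
            if (redundant) {
                for (int page : engines[*it]) --owners[page];
                it = chosen.erase(it);
            } else {
                ++it;
            }
        }
        return chosen;
    }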
22
The structures and algorithms were tested on moderately large to very large data sets. The number of documents ranged from 100,000 to 1,000,000, while the number of search engines was held constant at 1,000. The distribution was uniform (all search engines contained the same number of documents). Non-uniform sets were tested by Dr. Qi; it apparently worked, or he would have let me know.
23
Baseline (C) results. Greedy: min 1,500 seconds, max 29,350 seconds (8h 9m 10s). CAR: min 16.5 seconds, max 138.5 seconds.
24
Updated (C++) results. Greedy: min 4.5 seconds, max 19.25 seconds. CAR: min 1.0 seconds, max 7.75 seconds. Matrix (both algorithms): min 0.20 seconds, max 0.40 seconds.
25
The idea is to find and remove redundant web pages in an intermediate step between reading the data and running the algorithm. Redundant web pages are identified by the search engines that contain them. Currently, the process of removing these web pages takes more time than it saves; this was not true for the baseline code, whose algorithms run significantly longer. Whether this can be improved has not yet been determined.
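A sketch of how such regions could be formed, assuming grouping by each page's engine-membership signature (names are illustrative, not the project's code): pages with identical signatures are interchangeable for covering purposes, so one representative per region suffices.

    #include <map>
    #include <set>
    #include <vector>

    // Group page IDs by the set of engines that contain them ("regions").
    // Keeping one representative per region shrinks the input without changing any cover.
    std::vector<int> regionRepresentatives(const std::map<int, std::set<int>>& pageToEngines) {
        std::map<std::set<int>, int> firstSeen;             // engine signature -> representative page
        for (const auto& entry : pageToEngines)
            firstSeen.insert({entry.second, entry.first});  // only the first page with this signature sticks
        std::vector<int> reps;
        for (const auto& region : firstSeen)
            reps.push_back(region.second);
        return reps;
    }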