Sinergia: Improving searches through community clustering of information
Motivation Finding useful information with today’s search-engine technology is usually a time-consuming process. Users in a “community” who share similar interests usually search for the same kinds of things. There is no mechanism for sharing information about good search results, so the same time-consuming searches get repeated.
Main idea Collect information about search queries made by users in the community. Allow people in the community to rank pages. Combine that information to provide a way to reuse previous searches made by users to speed up similar search queries from other users of the community.
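The idea of reusing earlier searches can be illustrated with a minimal sketch. It is not the Sinergia implementation: the class name `CommunitySearchLog`, the word-set normalization of queries, and the simple vote sum are all assumptions made for illustration. It records which pages community members voted for under a query and returns those pages, best first, when a similar query arrives.

```python
from collections import defaultdict


class CommunitySearchLog:
    """Hypothetical in-memory store of community votes per query."""

    def __init__(self):
        # normalized query -> url -> accumulated votes
        self._votes = defaultdict(lambda: defaultdict(int))

    @staticmethod
    def normalize(query):
        # Assumption: two queries are "similar" when they use the same
        # set of words, ignoring case and word order.
        return " ".join(sorted(set(query.lower().split())))

    def record_vote(self, query, url, vote=1):
        self._votes[self.normalize(query)][url] += vote

    def suggest(self, query, limit=5):
        # .get() avoids creating an empty entry for unknown queries.
        scores = self._votes.get(self.normalize(query), {})
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [url for url, score in ranked[:limit] if score > 0]
```

A real system would need a more forgiving notion of query similarity; exact word-set matching is only the simplest choice that shows the reuse step.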
Main idea Cluster users by their surfing preferences and use that information to suggest/filter related links.
Proposed solution Watch users’ queries to search engines and keep track of the URLs they find interesting. Build up a profile of each user based on his/her surfing behavior and correlate it with the profiles of other users. Complement the results obtained from traditional search engines with our knowledge of what is interesting to users, thus potentially saving search time.
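One way to correlate profiles, sketched here under stated assumptions: each profile is a dictionary mapping a URL to how often the user showed interest in it, and cosine similarity is the correlation measure. The slides do not name a specific technique, so cosine similarity and the function names below are illustrative choices only.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two {url: interest_count} profiles."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[u] * profile_b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile correlates with nothing
    return dot / (norm_a * norm_b)


def most_similar_users(target_id, profiles, limit=3):
    """Return ids of users whose profiles overlap with the target's,
    most similar first. `profiles` maps user id -> profile dict."""
    target = profiles[target_id]
    scored = [
        (cosine_similarity(target, profile), user_id)
        for user_id, profile in profiles.items()
        if user_id != target_id
    ]
    scored = [(s, uid) for s, uid in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [uid for _, uid in scored[:limit]]
```

Pages that the most similar users found interesting, and that the target user has not yet seen, are natural candidates for the "related links" suggestions.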
Goals of the implementation Use an approach that requires as few changes as possible on the client side and in the existing Internet infrastructure. Bring our approach to as many client platforms as possible; ideally, provide a truly platform-independent solution.
Goals of the implementation Complement existing search technology instead of reinventing it. Use a modular architecture that allows flexible support of different search engines and easy switching between correlation techniques. Design the solution for high scalability and availability.
Architecture (diagram) Components: Proxy, Web Server, DB. Hard state is stored in the DB; soft state is stored in the Proxy. Labeled flows: most requests; search requests (modified). Web Server functions: search queries, vote for pages, find related.
Architecture: Overview Community: the group of users that the proxy serves. The proxy collects information transparently. The Web Server provides information to the users, such as related pages and search results. Both the proxy and the Web Server talk to the same Database. The Database maintains the hard state of the system.
Architecture: Proxy Maintains the concept of a session for each user. For most HTTP requests it works as a normal proxy. For requests that are search queries on a search engine, it keeps track of the URLs returned for that query and logs activity on those URLs.
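The step of recognizing a search query among ordinary requests might look like the sketch below. The engine table is hypothetical: the host names, paths, and parameter names are placeholders, not the engines Sinergia actually supported. Supporting another engine then amounts to adding one table entry.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical engine table: host -> (search path, query parameter name).
SEARCH_ENGINES = {
    "search.example.com": ("/search", "q"),
    "find.example.org": ("/results", "query"),
}


def extract_search_query(url):
    """Return the search terms if `url` is a query to a known engine,
    otherwise None (the proxy then treats it as a normal request)."""
    parts = urlsplit(url)
    rule = SEARCH_ENGINES.get(parts.hostname or "")
    if rule is None:
        return None
    path, param = rule
    if parts.path != path:
        return None
    values = parse_qs(parts.query).get(param)
    return values[0] if values else None
```

Once a request is identified as a search, the proxy can associate the URLs in the returned result page with that query and log later activity on them.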
Architecture: Proxy Requests to the Sinergia web server are modified to carry the user ID in a fat URL. When a user’s session ends, or when he/she starts a new search, the proxy stores the information collected for that user in the DB. The proxy uses a modular architecture that makes it easy to add support for additional search engines.
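A fat URL simply carries the user's identity inside the URL itself. The sketch below assumes the ID travels as a query parameter named `sinergia_uid`; both the parameter name and the choice of the query string (rather than the path) are assumptions, since the slides do not specify the encoding.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

UID_PARAM = "sinergia_uid"  # hypothetical parameter name


def to_fat_url(url, user_id):
    """Proxy side: rewrite a request URL so it carries the user ID."""
    parts = urlsplit(url)
    # Drop any ID already present so a client cannot supply its own.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != UID_PARAM
    ]
    pairs.append((UID_PARAM, user_id))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


def user_id_from_fat_url(url):
    """Web server side: recover the user ID, or None if absent."""
    for key, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        if key == UID_PARAM:
            return value
    return None
```

Because the proxy adds the ID, the client needs no cookies or extra software, which fits the goal of requiring few client-side changes.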
Architecture: Web Server Provides a front end for the Sinergia search engine. Provides the back end for the “Vote”, “Get Rank” and “Find Related” functionality. To relate a request to the user ID of the client, it receives information from the proxy in the form of a fat URL.
Architecture: Scalability Scalability is achieved through a clear division of tasks between the Proxy, the Web Server and the DB. An increase in the number of search requests to our search engine should not create a bottleneck, because the Database is decoupled from the Web Server and the Proxy, so each component can be scaled as needed.
Architecture: Scalability Soft state information about current searches is maintained in the Proxy, and transactions to the database occur only when the user’s session ends or he/she starts a new search. The Proxy itself can be scaled through clustering, as can the Web Server.
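The soft-state policy can be sketched as follows. This is an illustration only: an in-process SQLite database stands in for the real DB, and the class name `ProxySessions`, the table `search_log`, and its columns are invented for the example. URLs are buffered in memory and written in a single transaction when the user starts a new search or the session ends.

```python
import sqlite3


class ProxySessions:
    """Hypothetical proxy-side buffer of per-user search activity."""

    def __init__(self, db):
        self.db = db
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS search_log "
            "(user_id TEXT, query TEXT, url TEXT)"
        )
        self._active = {}  # soft state: user_id -> (query, [urls])

    def new_search(self, user_id, query):
        self._flush(user_id)  # persist the previous search first
        self._active[user_id] = (query, [])

    def visit(self, user_id, url):
        # Only track URLs while a search is in progress for this user.
        if user_id in self._active:
            self._active[user_id][1].append(url)

    def end_session(self, user_id):
        self._flush(user_id)

    def _flush(self, user_id):
        entry = self._active.pop(user_id, None)
        if entry is None:
            return
        query, urls = entry
        with self.db:  # one transaction per flushed search
            self.db.executemany(
                "INSERT INTO search_log VALUES (?, ?, ?)",
                [(user_id, query, url) for url in urls],
            )
```

If a proxy node fails, only the activity of searches still in progress is lost, which is the usual trade-off of keeping soft state outside the database.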
Architecture: Availability Our architecture addresses availability by decoupling the Proxy, Web Server and Database, which allows each of these components to be clustered to provide a highly available system. Managing the hard state only in the Database keeps the information available as long as the database is working. Database availability can be ensured by standard database techniques such as distribution and replication.