FACT: A Learning Based Web Query Processing System Hongjun Lu, Yanlei Diao Hong Kong U. of Science & Technology Songting Chen, Zengping Tian Fudan University
Demonstration, SIGMOD Outline Introduction Learning Based Web Query Processing FACT: A Prototype System Preliminary System Evaluation Conclusions
How Do We Query the Web? Use a search engine. Form query keywords. An example: find room rates of hotels in Hong Kong. Search engine keywords used: Hong Kong+hotel
[Screenshot: result page listing Hotel 1, Hotel 2] Look at the Number!
Query the Web -- Current Situation. Search engines return a long list of URLs. The user is required to browse the web pages to find the information. The information required is often not on the returned page -- navigation through hyperlinks is often required (those links may or may not be obvious). The target information is in different forms (paragraphs, lists, tables, …). A lot of web pages have to be browsed. Are we happy with this?
Efforts to Improve the Situation. Search engines: better index, improve precision/recall, metasearch engines, better presentation of results, … IR techniques to Web: document clustering/indexing, better model, similarity functions, document ranking, … Intelligent agent: user profiling, hyperlink recommendation, … Database approach: wrappers, query languages, …
Our Dream: querying the Web as easily as querying a relational database. An SQL query returns a table of hotel prices: SELECT room rates FROM web.hotel WHERE city = “hong kong”. May remain a dream for a while :-(
A Practical Goal. Use keywords to express query requirements: simple, no need to know the schema of the data, but inaccurate. Relieve users from tedious browsing as much as possible: not URLs, not Web sites, not even Web pages. Present query results to users as accurately and concisely as possible: tables, lists, paragraphs, … containing the information the user requires.
Query Results -- Queried Segments. Return query results as accurately and concisely as possible. Basic idea: break a Web page into segments (a row in a table, a table, an item in a list, a list, a paragraph) and return only the queried segments to users. Queried segments: segments that contain the information the user is interested in.
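The segmentation idea on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not FACT's implementation: the class `SegmentParser`, the function `queried_segments`, and the keyword-containment test used to pick queried segments are all assumptions made for the example (the slides say FACT learns from user actions which segments are queried). The sketch also assumes well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Segment-level tags, following the slide: table, table row, list, list item, paragraph.
SEGMENT_TAGS = {"table", "tr", "ul", "ol", "li", "p"}


class SegmentParser(HTMLParser):
    """Hypothetical helper: collects (tag, text) for every segment-level element."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open segments: [tag, list of text pieces]
        self.segments = []  # finished segments, in the order their tags close

    def handle_starttag(self, tag, attrs):
        if tag in SEGMENT_TAGS:
            self.stack.append([tag, []])

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text counts toward every enclosing segment (a row is part of its table).
            for _, pieces in self.stack:
                pieces.append(text)

    def handle_endtag(self, tag):
        if tag in SEGMENT_TAGS and self.stack and self.stack[-1][0] == tag:
            seg_tag, pieces = self.stack.pop()
            self.segments.append((seg_tag, " ".join(pieces)))


def queried_segments(html, keywords):
    """Assumed selection rule: keep segments whose text contains every keyword."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    wanted = [k.lower() for k in keywords]
    return [(t, s) for t, s in parser.segments
            if all(k in s.lower() for k in wanted)]
```

For a page with a welcome paragraph and a two-row rate table, asking for "single" and "room" returns the matching row and its enclosing table, but not the paragraph or the other row.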
Outline Introduction Learning Based Web Query Processing FACT: A Prototype System Preliminary System Evaluation Conclusions
Learning Based Query Processing. The fundamental difficulties in Web query processing: the Web is a huge, ever-growing, heterogeneous, semi-structured data source; most users of the Web are naïve users issuing ad hoc queries. Learn the knowledge for query processing from the User!
A Learning Based Technique. Learn from the user, as the user browses from the first few URLs, how to navigate through the web pages and how to identify the required information in a web page. Process the remaining URLs automatically and retrieve queried segments.
[Screenshot: result page listing Hotel 1, Hotel 2] User browses it!
[Screenshot] User clicks here!
[Screenshot: room information] User marks it!
[Screenshot] FACT starts here!
[Screenshot: roomrates] FACT chooses it!
[Screenshot] FACT finds it!
Outline Introduction Learning Based Web Query Processing FACT: A Prototype System Preliminary System Evaluation Conclusions
A Query Processing System. A learning based query processing system consists of: User Interface: accepts user queries, presents query results, includes a browser capable of capturing user actions; Query Analyzer: analyzes and transforms user queries; Session Controller: coordinates learning and locating; Learner: generates knowledge from captured user actions; Locator: applies knowledge and locates query results; Crawler & Parser: retrieves pages and parses them into trees; Knowledge Base: stores learned knowledge.
[Figure: Reference Architecture] Components shown: User, User Interface, Query Analyzer, Session Controller, Learner, Locator, Knowledge Base, Crawler & Parser, Search Engine, Web.
[Figure: A Query Session] Labels shown: Session Controller, Training Strategy, Segment Graph, Result Buffer, Knowledge Base, User Actions, Query Results, Checking URLs, Locating Process, Locator, Query Result Presenter, Learning Process, Learner, Browser Scripts.
Training Strategies. Sequential: for the first n sites, the user browses and the system learns; the system processes the next N - n sites. Random: n sites are chosen at random, the user browses them and the system learns; the system processes the rest. Interleaved: for the first n_0 sites, the user browses and the system learns; for the next n - n_0 sites, the system makes the decision, and for incorrect ones the user browses and the system re-learns; the system processes the next N - n sites.
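The three strategies differ only in which sites the user has to browse. The sketch below partitions a list of N sites accordingly. It is a hedged illustration, not FACT's code: the function names are invented, `n0` stands for the size of the first batch in the interleaved strategy, and `system_is_correct` is a hypothetical callback standing in for the user's check of the system's decision.

```python
import random


def sequential_split(sites, n):
    """First n sites are used for training; the remaining N - n are processed automatically."""
    return sites[:n], sites[n:]


def random_split(sites, n, seed=None):
    """n randomly chosen sites are used for training; the rest are processed automatically."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(sites)), n))
    train = [s for i, s in enumerate(sites) if i in chosen]
    auto = [s for i, s in enumerate(sites) if i not in chosen]
    return train, auto


def interleaved_split(sites, n0, n, system_is_correct):
    """First n0 sites: user trains. Next n - n0: system decides and the user
    re-trains on the incorrect ones. Remaining N - n: processed automatically."""
    train = list(sites[:n0])
    accepted = []
    for site in sites[n0:n]:
        if system_is_correct(site):   # hypothetical check of the system's decision
            accepted.append(site)
        else:
            train.append(site)        # user browses, system re-learns
    return train, accepted, sites[n:]
```

With ten sites, `interleaved_split(sites, 2, 5, check)` asks the user to browse two sites up front plus only those of the next three on which the system was wrong.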
Outline Introduction Learning Based Web Query Processing FACT: A Prototype System Preliminary System Evaluation Conclusions
System Evaluation. Functionality. Performance: precision, recall, correctness; efficiency: in a site, how many pages the system visits to find a result; training efficiency: how many training samples are needed. User interface.
System Evaluation - Effectiveness. Given a set of keywords, the system makes N decisions, N = N1 + N2 + N3 + N4. Precision = N1 / (N1 + N3), Recall = N1 / # relevant sites, Correctness = (N1 + N2) / N.
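The three measures can be computed directly from the four counts. The slide text does not define N1 to N4, so the sketch below treats them as opaque counts and applies the formulas exactly as given; reading N1 and N2 as the correct decisions and N3 and N4 as the incorrect ones is only an inference from the formulas. The function name `effectiveness` and the numbers in the usage note are invented for illustration.

```python
def effectiveness(n1, n2, n3, n4, relevant_sites):
    """Apply the slide's formulas; zero denominators yield 0.0 (an assumption)."""
    n = n1 + n2 + n3 + n4                      # N = N1 + N2 + N3 + N4
    precision = n1 / (n1 + n3) if (n1 + n3) else 0.0
    recall = n1 / relevant_sites if relevant_sites else 0.0
    correctness = (n1 + n2) / n if n else 0.0
    return precision, recall, correctness
```

For example, with made-up counts N1 = 6, N2 = 2, N3 = 2, N4 = 0 and 8 relevant sites, the formulas give precision 0.75, recall 0.75, and correctness 0.8.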
System Evaluation - Efficiency. How efficiently does the system find a queried segment in a site? Level of a Queried Segment = the length of the shortest path to find it. Absolute Path Length = # crawled pages, Relative Path Length = # crawled pages / Level of the Queried Segment.
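A small sketch of these efficiency measures, under stated assumptions: the site is modeled as a dictionary mapping each page to the pages it links to, and the level is counted in pages on the shortest path, so the entry page has level 1. The slide does not say whether path length counts pages or links; pages are used here so the ratio is always defined. `segment_level` and `path_lengths` are hypothetical names.

```python
from collections import deque


def segment_level(links, start, target):
    """Breadth-first search: number of pages on the shortest path from start to target.

    Returns None when the target page is unreachable from the start page.
    """
    seen = {start}
    queue = deque([(start, 1)])  # the entry page itself counts as level 1 (assumption)
    while queue:
        page, level = queue.popleft()
        if page == target:
            return level
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return None


def path_lengths(crawled_pages, level):
    """Absolute path length = # crawled pages; relative = # crawled pages / level."""
    return crawled_pages, crawled_pages / level
```

Under this counting, a relative path length of 1.0 means the system crawled no page beyond the shortest path to the queried segment.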
Basic Performance. Q11: Hong Kong Hotel Room Rate. Q12: Hong Kong Hotel. Sequential training.
Query Q12: Effects of Training Strategies
Improved Performance. Interleaved training.
Outline Introduction Learning Based Web Query Processing FACT: A Prototype System Preliminary System Evaluation Conclusions
Conclusions. Proposed and implemented learning based Web query processing with the following features: returning succinct results (segments of pages); no a priori knowledge or preprocessing, suited for ad hoc queries; exploiting page formatting and linkage information simultaneously. The preliminary results are promising.
Future Work. Better knowledge: key factor that affects system performance. Dynamic web pages? Integrating results from another project. System evaluation. Prototype, product, dot com company, $$$ ???