A Quality Focused Crawler for Health Information Tim Tang

2 Outline
- Overview
- Contributions
- Experiments and results
- Issues for discussion
- Future work
- Questions & suggestions

3 Overview
- Many people use the Internet to search for health information.
- But health web pages may contain low-quality information and may put users at risk (example).
- It is therefore important to find ways to evaluate the quality of health websites and to provide high-quality results in health search.

4 Motivation
- Web users can search for health information using general engines or domain-specific engines such as health portals.
- 79% of Web users in the U.S. search for health information on the Internet (Fox, S., Health Information Online, 2005).
- No measurement technique is available for assessing the quality of Web health search results.
- There is also no method for automatically improving the quality of health search results.
- Therefore, people building a high-quality health portal have to do it manually, and without work on measurement we cannot tell how good a job they are doing.
- An example of such a health portal is BluePages Search, developed by the ANU's Centre for Mental Health Research.

5 BluePages Search (BPS)

6 BPS result list

7 Research Objectives
To produce a health portal search engine that:
- is built automatically, saving time, effort, and expert knowledge (cost saving);
- contains (only) high-quality information in its index, by applying quality criteria;
- satisfies users' demand for good, evidence-based advice about specific health topics from the Internet.

8 Contributions
- New and effective quality indicators for health websites, built using IR-related techniques
- Techniques to automate the manual quality assessment of health websites
- Techniques to automate the process of building high-quality health search engines

9 Expt1: General vs. domain-specific search engines
- Aim: to compare the performance of general search engines (Google, GoogleD) and a domain-specific engine (BPS) on domain relevance and quality.
- Details: 100 depression queries were run against each engine; the top 10 results for each query from each engine were evaluated.
- Results: next slide.

10 Expt1: Results
[Table comparing relevance (mean MAP) and quality (NDCG score) across GoogleD, BPS, and Google.]
MAP = Modified Average Precision; NDCG = Normalised Discounted Cumulative Gain
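
MAP and NDCG are the evaluation measures reported above. As a quick illustration, here is a minimal sketch of an NDCG@10 computation from graded judgments; the gain values below are invented for illustration, not the judgments used in the experiment.

```python
import math

def ndcg_at_k(gains, k=10):
    """NDCG@k for one query: gains are graded judgments of the ranked results."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Example: invented graded quality judgments (0-2) for the top 10 results of one query.
print(ndcg_at_k([2, 1, 0, 2, 0, 1, 0, 0, 1, 0]))
```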

11 Expt1: Findings
- Findings: GoogleD retrieves more relevant pages, but fewer high-quality pages, than BPS. The domain-specific engine (BPS) has poor coverage, which hurts its relevance performance.
- What next: How can coverage be improved for domain-specific engines? How can the process of constructing a domain-specific engine be automated?

12 Expt2: Prospects of focused crawling for building domain-specific engines
- Aim: to investigate the prospect of using focused crawling (FC) techniques to build health portals (a sketch of the basic crawl loop follows this slide). In particular:
- Seed list: BPS uses a seed list (the start list for a crawl) that was manually selected by experts in the field. Can this process be automated?
- Relevance of outgoing links: Is it feasible to follow outgoing links from the currently crawled pages to obtain more relevant pages?
- Link prediction: Can relevant links be predicted successfully from the available link information?
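
To make the focused-crawling setup concrete, below is a minimal sketch of a priority-driven crawl loop. The hypothetical `predict_relevance` scorer stands in for the trained link classifier discussed on the following slides, and the loop itself is a generic sketch rather than the thesis's implementation.

```python
import heapq
import urllib.request
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def predict_relevance(url, anchor_text=""):
    # Placeholder scorer: a trained link classifier would go here.
    return 0.5

def focused_crawl(seeds, max_pages=100):
    # Frontier ordered by predicted relevance (max-heap via negated scores).
    frontier = [(-predict_relevance(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen, collected = set(seeds), []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue
        collected.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)
            if target.startswith("http") and target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-predict_relevance(target), target))
    return collected
```

Seeds would come from a source such as the relevant DMOZ URLs examined on the next slide.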

13 Expt2: Results & Findings
- Out of 227 URLs from DMOZ, 186 were relevant (81%) => DMOZ provides a good starting list of URLs for a FC.
- An unrestricted crawler starting from the BPS crawl can reach 25.3% more known relevant pages in a single step from the currently crawled pages => outgoing links from a constrained crawl lead to additional relevant content.
- The C4.5 decision-tree learner can predict link relevance with a precision of 88.15% => a decision tree built from features such as anchor text, URL words, and link anchor context can help a focused crawler obtain new relevant pages (see the sketch below).
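
A rough sketch of how such a link classifier might be set up. scikit-learn's CART-style DecisionTreeClassifier stands in for C4.5, and the feature extraction and the tiny training set are invented for illustration, not the features or data used in the thesis.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

def link_features(url, anchor_text, anchor_context):
    # Combine URL words, anchor text and surrounding context into one token string.
    url_words = " ".join(re.split(r"[^a-z0-9]+", url.lower()))
    return f"{url_words} {anchor_text.lower()} {anchor_context.lower()}"

# Invented examples: (url, anchor text, anchor context, relevant=1 / not=0).
train = [
    ("http://example.org/depression/treatment", "depression treatment", "evidence based therapies", 1),
    ("http://example.org/depression/symptoms", "symptoms of depression", "clinical signs and diagnosis", 1),
    ("http://example.org/shop/ringtones", "free ringtones", "download now", 0),
    ("http://example.org/sports/scores", "latest scores", "football results", 0),
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([link_features(u, a, c) for u, a, c, _ in train])
y = [label for *_, label in train]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

new_link = link_features("http://example.org/depression/help", "getting help",
                         "where to find support for depression")
print(clf.predict(vectorizer.transform([new_link])))  # predicted link relevance
```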

14 Expt3: Automatic evaluation of websites
- Aim: to investigate whether a relevance feedback (RF) technique can help in the automatic evaluation of health websites.
- Details: RF is used to learn terms (words and phrases) representing high-quality documents, together with their weights. This weighted query is then compared with the text of web pages to obtain a degree of similarity. We call this the "Automatic Quality Tool" (AQT); a sketch follows this slide.
- Findings: a significant correlation was found between human-rated (EBM) scores and AQT scores.
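
One common way to realise this is a Rocchio-style feedback step: build a weighted term vector from pages judged high (and low) quality, then score new pages by their cosine similarity to that vector. The TF-IDF weighting, the Rocchio constants, and the toy documents below are assumptions for illustration, not necessarily the exact formulation used in the thesis.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented examples of pages judged high / low quality against EBM guidelines.
high_quality = [
    "cognitive behavioural therapy and antidepressant medication are effective treatments for depression",
    "evidence from randomised controlled trials supports structured psychological treatment",
]
low_quality = [
    "miracle herbal cure for depression buy now limited offer",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
docs = vectorizer.fit_transform(high_quality + low_quality)

# Rocchio-style weighted query: centroid of high-quality pages minus a damped
# centroid of low-quality pages (beta and gamma are illustrative constants).
beta, gamma = 1.0, 0.25
pos = docs[: len(high_quality)].mean(axis=0)
neg = docs[len(high_quality):].mean(axis=0)
query = np.asarray(beta * pos - gamma * neg)

def aqt_score(page_text):
    """Quality estimate: similarity between the page text and the learned weighted query."""
    page_vec = vectorizer.transform([page_text]).toarray()
    return float(cosine_similarity(query, page_vec)[0, 0])

print(aqt_score("treatment options for depression include therapy and medication"))
```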

15 Expt3: Results – Correlation between AQT score and EBM score

16 Expt3: Results – Correlation between Google PageRank and EBM score
- Correlation: small and non-significant (r = 0.23, p = 0.22, n = 30).
- Excluding sites with a PageRank of 0, we obtained a better correlation, but it was still significantly lower than the correlation between AQT and EBM.

17 Expt4: Building a health portal using FC
- Aim: to build a high-quality health portal automatically, using FC techniques.
- Details:
- Relevance scores for links are predicted using the decision tree from Expt. 2 and transformed into probabilities using the Laplace correction formula.
- Machine learning did not work well for predicting quality, but RF does: the quality of a target page is predicted as the mean of the quality scores of all the known (visited) source pages linking to it.
- Combination of relevance and quality: the product of the relevance score and the quality score determines crawling priority (see the sketch below).
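
A minimal sketch of that scoring step. The helper names are invented, and the leaf-count form of the Laplace correction is an assumption about how the relevance probabilities are obtained; the product combination follows the description above.

```python
def laplace_corrected_relevance(relevant_count, total_count, num_classes=2):
    """Turn a decision-tree leaf's class counts into a smoothed probability of relevance."""
    return (relevant_count + 1) / (total_count + num_classes)

def predicted_quality(source_quality_scores):
    """Quality of an unvisited target page: mean quality of the visited pages linking to it."""
    return sum(source_quality_scores) / len(source_quality_scores)

def crawl_priority(leaf_relevant, leaf_total, source_quality_scores):
    relevance = laplace_corrected_relevance(leaf_relevant, leaf_total)
    quality = predicted_quality(source_quality_scores)
    return relevance * quality  # product keeps a balance between the two signals

# Example: a leaf with 18 relevant links out of 20, target linked from two visited pages.
print(crawl_priority(18, 20, [0.7, 0.9]))
```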

18 Expt4: Results – Quality scores. Three crawls were built: BF (breadth-first), Relevance, and Quality.

19 Expt4: Results – Below Average Quality (BAQ) pages in each crawl

20 Expt4: Findings
- RF is a good technique for predicting the quality of web pages based on the quality of known source pages.
- Quality is an important measure in health search, because a lot of relevant information is of poor quality (e.g. in the relevance-only crawl).
- Further analysis shows that quality of content might be improved further by post-filtering a very large BF crawl, but at the cost of substantially increased network traffic.

21 Issues for discussion
- Combination of scores
- Untrusted sites
- Quality evaluation
- Relevance threshold choice
- Coverage
- Combination of quality indicators
- RF vs machine learning

22 Issue: Combination of scores
- The decision to multiply the relevance and quality scores was taken somewhat arbitrarily; the idea was to keep a balance between relevance and quality, so that both quality and coverage are maintained.
- Question: Would addition (or another linear combination) be a better way to calculate this score (compare the two rules sketched below)? Or should only the quality score be considered? In general, how should relevance and quality scores be combined?
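
For discussion, the two candidate rules look like this; the mixing weight `lam` is an illustrative parameter, not something used in the thesis.

```python
def priority_product(relevance, quality):
    return relevance * quality

def priority_linear(relevance, quality, lam=0.5):
    # Weighted linear combination; lam trades off relevance against quality.
    return lam * relevance + (1 - lam) * quality

# A page that is very relevant but of mediocre quality is penalised much more
# heavily by the product rule than by the linear rule.
print(priority_product(0.9, 0.3), priority_linear(0.9, 0.3))
```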

23 Issue: Untrusted sites
- RF was used for predicting high quality, but analysis showed that low-quality health sites are often untrusted sites, such as commercial sites, chat sites, forums, bulletin boards, and message boards. Our results do not seem to exclude some of these sites.
- Question: Is it feasible to use RF, or any other means, to detect these sources? How should that be incorporated into the crawler?

24 Issue: Quality evaluation experiment
- Manual evaluation for quality is expensive because it requires a lot of expert knowledge and effort: to know the quality of a site, we have to judge all the pages of that site.
- Question: How can a cheaper but still effective evaluation experiment for quality be designed? Can lay judgments of quality be used somehow?

25 Issue: Relevance threshold choice
- A relevance classifier was built to help reduce the relevance-judging effort, which requires choosing a cut-off point for the relevance score. The classifier runs on 2000 pre-judged documents, half of which are relevant. I chose as the cut-off threshold the score at which the total number of false positives and false negatives is minimised (see the sketch below).
- Question: Is this a reasonable way to decide a relevance threshold? Are there alternatives?
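
A small sketch of that threshold search, assuming we have a classifier score and a binary relevance judgment for each pre-judged document; the scores and labels below are invented.

```python
def choose_threshold(scores, labels):
    """Pick the cut-off that minimises false positives + false negatives."""
    best_t, best_errors = None, float("inf")
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if fp + fn < best_errors:
            best_t, best_errors = t, fp + fn
    return best_t, best_errors

# Invented scores for a handful of judged documents (1 = relevant, 0 = not).
scores = [0.95, 0.80, 0.72, 0.55, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    0,    0]
print(choose_threshold(scores, labels))
```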

26 Issue: Coverage
- The FC may not explore all parts of the Web, resulting in low coverage. It is important to know how much of the Web's high-quality documents the FC can index.
- Question: How can an experiment be designed to evaluate coverage (i.e. how can recall be measured)?

27 Issue: Combination of quality indicators
- Health experts have identified several quality indicators that may help in evaluating quality, such as currency of content, authoring information, and disclosure information.
- Question: How can or should these indicators be used in my work to predict quality?

28 Issue: RF vs machine learning
- Compared to RF, ML has the flexibility of adding more features, such as an "inherited quality score" (from source pages), into the learning process to predict the quality of the results.
- However, we tried ML initially to predict quality and found that RF worked much better. Maybe we did not do it right!?
- Question: Could ML be used in a similar way to RF? Does it promise better results?

29 Future work
- A better combination of quality and relevance scores to improve quality
- Incorporating a quality dimension into the ranking of health search results (something similar to BM25, with a quality measure incorporated?)
- Moving to another topic in the health domain, or to an entirely new topic?
- Combining heuristics and other medical quality indicators with RF?

30 Suggestions
- Any suggestions to improve my work?
- Any suggestions for future work?
- Other suggestions?
The end!