Relevance and Quality of Health Information on the Web Tim Tang DCS Seminar October, 2005.

2 Outline Motivation & aims Experiments & results –Domain-specific vs. general search –A quality-focused crawler Conclusion & future work

3 Why health information on the Web? The Internet is a free medium There is high user demand for health information Health information varies widely in quality Incorrect health advice is dangerous

4 Problems The usual definition of relevance: topical relevance The usual way to search: word matching Q: Are these sufficient for health information? A: No: we also need quality, i.e., the usefulness of the information

5 Problem: Quality of health info Search results contain health information of widely varying quality

6 Wrong advice

7 Dangerous information


9 Dangerous Information

10 Problem: Commercial sites Health information for commercial purposes

11 Commercial promotion

12 Problem: Types of search engine The difference between domain-specific search and general-purpose search.

13 Querying BPS

14 Querying Google: Irrelevant information

15 Problem of domain-specific portals Domain-specific portals may be good, but … they often require intensive effort to build and maintain (discussed further in experiment 2)

16 Aims To analyse the relative performance of domain specific and general purpose search engines To discover how to provide effective domain specific search, particularly in the health domain To automate the quality assessment of medical web sites

17 Two experiments First: Compare search results for health info between general and domain specific engines Second: Build and evaluate a Quality focused crawler for a health topic

18 The First Experiment A comparison of the relative performance of general-purpose search engines and domain-specific search engines Published in the Journal of Information Retrieval ’05 (Special Issue), with Nick Craswell, Dave Hawking, Kathy Griffiths and Helen Christensen

19 Domain-specific vs. general engines General search engines: Google, Yahoo, MSN Search, … Domain-specific: search services for scientific papers, for health, or for a single topic within the health domain. A depression portal: BluePages

20 BluePages Search (BPS)

21 BPS result list

22 Engines –Google –GoogleD (Google with “depression” added to each query) –BPS –4sites (four high-quality depression sites) –HealthFinder (HF): a health portal search –HealthFinderD (HFD): HF with “depression” added to each query

23 Queries 101 queries about depression: –50 treatment queries suggested by domain experts –51 non-treatment queries collected from two query logs: a domain-specific log and a general log Examples: –Treatment queries: acupuncture, antidepressant, chocolate –Non-treatment queries: depression symptoms, clinical depression

24 Experiment details Run the 101 queries on the 6 engines. For each query, the top 10 results from each engine were collected. All results were judged by research assistants: degree of relevance and whether the advice was recommended. Relevance and quality were then compared across all engines.

25 Results

Engine     Relevance   Quality
GoogleD
BPS
4sites
Google
HFS

26 Findings Google is not good at either relevance or quality. GoogleD retrieves more relevant pages, but fewer high-quality ones. 4sites and BPS provide good quality but have poor coverage. It is important to have a domain-specific portal that provides both high quality and high coverage. How can coverage be improved?

27 Experiment 2 Building a high-quality domain-specific portal using focused-crawling techniques Published in CIKM ’05, with Dave Hawking, Nick Craswell and Kathy Griffiths

28 A Quality-Focused Crawler Why? –The first experiment shows that quality can be achieved using domain-specific portals –The current method for building such a portal is expensive –Focused crawling may be a good way to build a health portal with high coverage while reducing human effort

29 The problems of BPS Domain experts spent two weeks manually judging health sites to decide what to include. Only 207 Web sites were included, so many useful web pages were left out. Maintenance is tedious: web pages change, cease to exist, new pages appear, etc. Also, the first experiment showed high quality but quite low coverage.

30 Focused Crawling (FC) Designed to selectively fetch content relevant to a specified topic of interest using the Web’s hyperlink structure. Examples of topics: sport, health, cancer, or scientific papers, etc.

31 FC Process The crawler loop: dequeue {URLs, link info} from the URL frontier, download the page, extract its links, classify them, and enqueue {URLs, scores} back onto the frontier. Link info = anchor text, URL, source page’s content, and so on.
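The loop above can be sketched as a priority-queue crawl. This is a minimal sketch, not the authors' implementation: `fetch`, `extract_links` and `score_link` are placeholder hooks for the downloader, link extractor and classifier.

```python
import heapq

def focused_crawl(seed_urls, score_link, fetch, extract_links, max_pages=10000):
    """Generic focused-crawl loop: the frontier is a priority queue
    ordered by the classifier's score for each outgoing link."""
    # Seeds get top priority; heapq is a min-heap, so scores are negated.
    frontier = [(-1.0, url, {}) for url in seed_urls]
    heapq.heapify(frontier)
    seen, crawled = set(seed_urls), []
    while frontier and len(crawled) < max_pages:
        neg_score, url, link_info = heapq.heappop(frontier)  # dequeue best URL
        page = fetch(url)
        if page is None:
            continue
        crawled.append(url)
        # Link info here would carry anchor text, URL words, surrounding text.
        for out_url, info in extract_links(page):
            if out_url not in seen:
                seen.add(out_url)
                heapq.heappush(frontier, (-score_link(info), out_url, info))
    return crawled
```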

32 FC: simple example Crawling pages about psychotherapy

33 Relevance prediction Features of the link context: –Anchor text: the text appearing in a hyperlink –Text around the link: 50 bytes before and after the link –URL words: tokens parsed from the URL address

34 Relevance Indicators URL: …herapy.html => URL words: depression, com, psychotherapy Anchor text: psychotherapy Text around the link: –50 bytes before: section, learn –50 bytes after: talk, therapy, standard, treatment
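The indicators above can be collected with simple tokenisation. A sketch follows; the URL in the test is a hypothetical stand-in (example.com), since the slide's full URL is truncated:

```python
import re

def url_words(url):
    """Split a URL into lowercase word tokens on non-letter characters."""
    return [w for w in re.split(r"[^a-z]+", url.lower()) if len(w) > 1]

def link_context_features(url, anchor_text, before, after, window=50):
    """Collect the three relevance indicators for one link: URL words,
    anchor text, and ~50 bytes of text before and after the link."""
    return {
        "url_words": url_words(url),
        "anchor": anchor_text.lower().split(),
        "before": before[-window:].lower().split(),
        "after": after[:window].lower().split(),
    }
```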

35 Methods Machine-learning approach: train and test on relevant and irrelevant URLs using the features above. We evaluated several learning algorithms: k-nearest neighbour, Naïve Bayes, C4.5 and perceptron. Result: the C4.5 decision tree was best at predicting relevance. The same method was applied to predict quality, but was not successful!
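To illustrate the learning setup, here is a tiny perceptron over bag-of-words link features, one of the learners compared above (C4.5 was the winner, but is harder to sketch briefly). The training pairs below are invented examples, not the authors' data:

```python
def train_perceptron(examples, epochs=10):
    """Train a sparse perceptron; examples are (feature_set, label) pairs
    with label +1 for relevant and -1 for irrelevant links."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for features, label in examples:
            pred = 1 if sum(w.get(f, 0.0) for f in features) + b > 0 else -1
            if pred != label:  # mistake-driven update
                for f in features:
                    w[f] = w.get(f, 0.0) + label
                b += label
    return w, b

def predict(model, features):
    """Classify a link's feature set with a trained perceptron."""
    w, b = model
    return 1 if sum(w.get(f, 0.0) for f in features) + b > 0 else -1
```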

36 Quality prediction Using evidence-based medicine, and Using Relevance Feedback (RF) technique

37 Evidence-based Medicine Interventions whose effectiveness is supported by a systematic review of the evidence. Examples of effective treatments for depression: –Antidepressants –ECT (electroconvulsive therapy) –Exercise –Cognitive behavioural therapy These treatments were divided into single-word and two-word terms.

38 Relevance Feedback A well-known IR approach: query by example. Basic idea: run an initial query, get feedback from users about which documents are relevant, then add words from the relevant documents to the query. Goal: add terms to the query in order to retrieve more relevant results.

39 RF Algorithm 1.Identify the N top-ranked documents 2.Identify all terms in these documents 3.Select the terms with the highest weights 4.Merge these terms with the original query 5.Identify the new top-ranked documents for the new query (Usually, 20 terms are added in total)
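A minimal sketch of steps 2–4, assuming tf·idf term weights (the slide does not specify the weighting scheme, so this is an assumption):

```python
import math
from collections import Counter

def expand_query(query_terms, top_docs, all_docs, n_terms=20):
    """Relevance-feedback expansion sketch: weight each term in the
    top-ranked documents by tf * idf and merge the best into the query."""
    n = len(all_docs)
    df = Counter()                      # document frequency over the collection
    for doc in all_docs:
        df.update(set(doc))
    tf = Counter()                      # term frequency over the top documents
    for doc in top_docs:
        tf.update(doc)
    weights = {t: tf[t] * math.log(n / df[t]) for t in tf}
    ranked = sorted(weights, key=weights.get, reverse=True)
    new_terms = [t for t in ranked if t not in query_terms][:n_terms]
    return list(query_terms) + new_terms
```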

40 Our Modified RF approach Not for relevance, but for quality Not only single terms, but also phrases Generate a list of single terms and 2-word phrases with their associated weights Select the top-weighted terms and phrases Cut off at the lowest-ranked term that appears in the evidence-based treatment list 20 phrases and 29 single words form a ‘quality query’

41 Terms representing the topic “depression”

Term               Weight
depression         13.3
health             6.9
treatment          5.7
mental             5.4
patient            3.3
medication         3
ECT                2.4
antidepressants    1.9
mental health      1.2
cognitive therapy  0.84

42 Predicting Quality For downloaded pages, a quality score (QScore) is computed using a modification of the BM25 formula that takes the term weights into account. The quality of a page is then predicted from the quality of all downloaded pages linking to it. (Assumption: good pages are usually inter-connected.) Predicted quality score of a page with n downloaded source pages: PScore = (Σ QScore) / n
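The two scores can be sketched as follows. The tf-saturation form and the k1 constant are assumptions standing in for the modified BM25 the slide mentions; only the PScore averaging is taken directly from the slide:

```python
def quality_score(page_terms, quality_query):
    """Simplified BM25-flavoured QScore: for each quality-query term,
    its weight scaled by a saturating term-frequency component."""
    k1 = 1.2  # assumed BM25 tf-saturation constant
    score = 0.0
    for term, weight in quality_query.items():
        tf = page_terms.count(term)
        score += weight * (tf * (k1 + 1)) / (tf + k1)
    return score

def predicted_quality(source_qscores):
    """PScore of a not-yet-downloaded page: the mean QScore of the
    n downloaded pages that link to it (PScore = sum(QScore) / n)."""
    return sum(source_qscores) / len(source_qscores) if source_qscores else 0.0
```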

43 Combining relevance and quality We need a way of balancing relevance and quality Combining quality and relevance scores is new Our method uses the product of the two scores Other ways of combining the scores will be explored in future work The quality-focused crawler relies on this combined score to order the crawl queue

44 The Three Crawlers A Web crawler (spider): –A program that browses the WWW in a methodical, automated manner –Usually used by a search engine to index web pages and provide fast searches We built three crawlers: –The breadth-first (BF) crawler: traverses the link graph in FIFO order (serves as the baseline for comparison) –The relevance crawler: orders the crawl queue using the C4.5 decision tree –The quality crawler: targets both relevance and quality, ordering the crawl queue with the combination of the C4.5 decision tree and the RF technique

45 Results

46 Relevance

47 Relevance Results The relevance and quality crawls each stabilised after 3,000 pages, at 80% and 88% relevance respectively. The BF crawl continued to degrade over time, down to 40% at 10,000 pages. The quality crawler outperformed the relevance crawler thanks to the incorporation of the RF quality scores.

48 Quality

49 High quality pages AAQ = Above Average Quality: top 25%

50 Low quality pages BAQ = Below Average Quality: bottom 25%

51 Quality Results The quality crawler performed significantly better than the relevance crawler (50% better towards the end of the crawl). All the crawls did well at fetching high-quality pages; the quality crawler did especially well, with more than 50% of its pages being high quality. Only about 5% of the quality crawl’s pages came from low-quality sites, while the BF crawl had about three times as many.

52 Findings Topical relevance could be predicted well using link anchor context. Link anchor context could not be used to predict quality. The relevance feedback technique proved useful for quality prediction.

53 Overall Conclusions Domain-specific search engines can offer better-quality results than general search engines. The current way to build a domain-specific portal is expensive. We successfully used focused-crawling techniques, a relevance decision tree and relevance feedback to build high-quality portals cheaply.

54 Future work So far we have experimented with only one health topic. We plan to repeat the experiments with another topic and to generalise the technique to another domain. Other ways of combining relevance and quality should be explored. Experiments comparing our quality crawl with other health portals are necessary. Removing spam from the crawl is another important step.