Focused Crawling in Depression Portal Search: A Feasibility Study Thanh Tin Tang (ANU) David Hawking (CSIRO) Nick Craswell (Microsoft) Ramesh Sankaranarayana(ANU)



2 Why Depression?
- Leading cause of disability burden in Australia
- One in five people suffer from a mental disorder in any one year
- The Web is a good way to deliver information and treatments, but...
- A lot of depression information on the Web is of poor quality

3 BluePages Search (BPS)

4 BluePages Search

5 BluePages Search
- Indexes approximately 200 sites, e.g.:
  - Whole server: suicidal.com/
  - Directory:
  - Individual page:
- Approximately two weeks of manual effort to create / update the seed list and include patterns
- Experiments showed that Google (with the query 'depression') had better relevance but more bad advice
- Relevance: only 17% of the relevant pages returned by Google were contained in the BPS crawl

6 Approach
- BPS: higher quality but much lower coverage, and...
- It is time-consuming to identify and maintain the list of sites to be included
  - Is it worth it? Can it be done more cheaply?
  - How can coverage be increased while still maintaining high quality? Can we automate the process?
=>
- Seed list: use an existing directory, e.g. DMOZ or the Yahoo! Directory
- Crawling:
  - Use a general crawler with inclusion/exclusion rules, or
  - Use a focused crawler with mechanisms to predict relevant/high-quality links from source pages

7 DMOZ Depression Directory
- DMOZ is "the most comprehensive human-edited directory of the web"
- The depression directory contains:
  - Links to a few other DMOZ pages
  - Links to servers, directories, and individual pages about depression

8 DMOZ Seed List
- How to generate:
  - Start from the depression directory
  - Decide whether to include links to other pages within the DMOZ site (little manual effort)
  - Automatically generate most of the seed URLs
- Seed URLs are the same as the directory's URLs, except that default page suffixes are removed. E.g.: has the pattern
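The suffix-stripping step can be sketched as follows. This is only an illustration: the exact list of default-page suffixes and the function name are assumptions, not the authors' actual rules.

```python
import re

# Hypothetical set of default-page filenames; the slides do not list the exact suffixes.
DEFAULT_PAGE = re.compile(r"(index|default|home)\.(html?|php|asp)$", re.IGNORECASE)

def to_seed_pattern(url: str) -> str:
    """Strip a trailing default-page filename so the seed pattern
    covers the whole directory rather than a single page."""
    return DEFAULT_PAGE.sub("", url)

# e.g. http://example.org/depression/index.html -> http://example.org/depression/
```

A URL that does not end in a default page is left unchanged, so individual-page seeds survive intact.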

9 Should DMOZ be used?
- Requires very little effort in boundary setting
- Provides a large seed list of URLs located heterogeneously across the Web (three times bigger than BPS's)
- Using 101 judged queries from our previous study, we retrieved 227 judged URLs from DMOZ, of which 186 were relevant (81%)
=> DMOZ provided a good set of relevant pages with little effort, but... can we find more relevant pages elsewhere?

10 Focused Crawler
- Seeks, acquires, indexes and maintains pages on a specific set of topics
- Requires only a small investment in hardware and network resources
- Starts with a seed list of URLs relevant to the topics of interest
- Follows links from seed pages to identify the most promising links to crawl
Is focused crawling a promising technique for building a depression portal?
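The crawl loop described above can be sketched as a best-first search over a link frontier. The callables `fetch`, `extract_links` and `score_link` are hypothetical placeholders for a page downloader, a link extractor and the link classifier's confidence score; this is a sketch, not the system's implementation.

```python
import heapq

def focused_crawl(seeds, fetch, extract_links, score_link, max_pages=1000):
    """Best-first focused crawl: repeatedly expand the highest-scoring
    frontier URL, scoring each outgoing link before it is fetched."""
    frontier = [(-1.0, url) for url in seeds]  # negate scores: heapq is a min-heap
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        for link in extract_links(page):
            if link not in visited:
                heapq.heappush(frontier, (-score_link(page, link), link))
    return visited
```

The priority queue is what distinguishes a focused crawler from a breadth-first one: poorly scored links sink to the bottom and may never be fetched before `max_pages` is reached.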

11 One Link Away URLs
(Diagram: the DMOZ crawl, with additional link-accessible relevant information one link away)
- If pages in the current crawl have no links to additional relevant content, the prospect of successful focused crawling is very low

12 Additional Link Experiments
- Experiment: relevance of outgoing links from a crawled collection
  - An unrestricted crawler starting from the BPS crawl can reach 25.3% (quite high) more known relevant pages in a single step from the current crawled pages
- Experiment: linking patterns between relevant pages
  - Of 196 new relevant URLs, 158 were linked to by known relevant pages
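The first experiment amounts to a one-step reachability measure. The function and argument names below are illustrative, not from the paper:

```python
def one_step_reach(crawled, links_from, known_relevant):
    """Fraction of known relevant pages outside the crawl that are
    reachable by following a single outgoing link from crawled pages."""
    frontier = set()
    for url in crawled:
        frontier.update(links_from(url))
    frontier -= set(crawled)          # only pages not already in the crawl
    targets = set(known_relevant) - set(crawled)
    return len(frontier & targets) / len(targets) if targets else 0.0
```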

13 Findings for Additional Links
- Relevant pages tend to link to each other
- The outgoing link set of a good collection contains quite a large number of additional relevant pages
- These findings support the idea of focused crawling, but...
- How can a crawler tell which links lead to relevant content?

14 Hypertext Classification
- Traditional text classification looks only at the text in each document
- Hypertext classification also uses link information
- We experimented with anchor text, text around the link, and URL words
- Here is an example

15 Features
- URL: ...sychotherapy.html => URL words: depression, com, psychotherapy
- Anchor text: psychotherapy
- Text around the link:
  - 50 bytes before: section, learn
  - 50 bytes after: talk, therapy, standard, treatment
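Extracting these three feature groups for a single link might look like the sketch below. The URL and context strings used in the test are hypothetical stand-ins for the example above, and the stop-word set is an assumption.

```python
import re

# Assumed stop-words for URL tokenisation; not specified in the slides.
STOP = {"http", "https", "www", "html", "htm"}

def link_features(url, anchor, before, after):
    """Collect the three feature groups: URL words, anchor text words,
    and words from ~50 bytes of text on either side of the link."""
    words = lambda s: [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP]
    return {
        "url": words(url),
        "anchor": words(anchor),
        "before": words(before[-50:]),   # last 50 bytes before the link
        "after": words(after[:50]),      # first 50 bytes after the link
    }
```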

16 Input Data & Measures
- Calculate tf.idf for all the features appearing in each URL
- 10-fold cross-validation on 295 relevant and 251 irrelevant URLs
- Classifiers: IBk, ZeroR, Naïve Bayes, C4.5, Bagging, AdaBoostM1, etc.
- Measures: accuracy, precision and recall
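The tf.idf weighting can be written in a few lines. The experiments used standard classifier implementations (Weka names such as IBk and AdaBoostM1 appear above); this standalone version is only illustrative of the weighting itself.

```python
import math
from collections import Counter

def tfidf(docs):
    """tf.idf for tokenised documents: term frequency times
    log(N / document frequency), one weight dict per document."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # document frequency per term
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]
```

A term appearing in every document gets weight zero, so uninformative features drop out before classification.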

17 Hypertext Classification - Results
(Results table: Classifier vs. Accuracy (%), Precision (%) and Recall (%) for J48, Naïve Bayes, Complement Naïve Bayes and ZeroR; values not recovered)
=> Overall, J48 is the best classifier

18 Hypertext Classification - Others
- Bagging and boosting showed little improvement in recall
- There are no applicable results in the literature on the depression topic to compare against
- A classifier looking at the content of the target pages showed similar results
=> Hypertext classification is quite effective

19 Findings
- Web pages about depression are strongly interlinked
- The DMOZ depression category seems to provide a good seed list for a focused crawl
- Predictive classification of outgoing links using link features achieves promising results
=> A cheap, high-coverage depression portal might be built and maintained using focused crawling techniques starting from the DMOZ seed list

20 Future Work
- Build a domain-specific search portal:
  - Rank URLs in order of degree of relevance
  - Design data structures to hold accumulated information for unvisited URLs
- Determine how to use the focused crawler operationally:
  - No include/exclude rules, but appropriate stopping conditions
  - What to do if none of the outgoing links are classified as relevant?

21 Future Work
- Incorporate site quality into the focused crawler, or filter for high-quality pages after crawling
- Extend the techniques to other domains, such as other health-related areas: are they applicable there?