Focused Crawling in Depression Portal Search: A Feasibility Study Thanh Tin Tang (ANU) David Hawking (CSIRO) Nick Craswell (Microsoft) Ramesh Sankaranarayana (ANU)

2 Why Depression?  Leading cause of disability burden in Australia  One in five people suffer from a mental disorder in any one year  The Web is a good way to deliver information and treatments, but...  A lot of depression information on the Web is of poor quality

3 Bluepages Search (BPS)

4 BluePages Search

5 Bluepages Search  Indexes approximately 200 sites, e.g. Whole server: suicidal.com/ Directory: www.healingwell.com/depression/ Individual page: www.mcmanweb.com/article-226.htm  Approximately 2 weeks of manual effort to create / update seed list and include patterns  Experiments showed that Google (with ‘depression’) had better relevance but more bad advice  Relevance: Only 17% of relevant pages returned by Google were contained in the BPS crawl

6 Approach  BPS: higher quality but much lower coverage, and …  It is time consuming to identify and maintain the list of sites to be included: Is it worth it? Can it be done more cheaply? How to increase coverage but still maintain high quality? Can we automate the process? =>  Seed list: Use an existing directory, e.g. DMOZ, Yahoo! Directory  Crawling: Use a general crawler with inclusion/exclusion rules; use a focused crawler with mechanisms to predict relevant/high quality links from source pages

7 DMOZ Depression Directory  DMOZ is “the most comprehensive human-edited directory of the web”  Depression directory contains: Links to a few other DMOZ pages; links to servers, directories, and individual pages about depression (Diagram labels: Other pages in DMOZ; Servers, directories & individual pages)

8 DMOZ Seed List  How to generate: Start from the depression directory; decide whether to include links to other pages within the DMOZ site (little manual effort); automatically generate most of the seed URLs. Seed URLs are the same as the listed URLs, except that default page suffixes are removed. E.g.: www.depression.com/default.asp has the pattern www.depression.com
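The suffix-stripping rule on this slide can be sketched as below. This is a minimal illustration, not the authors' tool: the function name `seed_pattern` and the set `DEFAULT_PAGES` are hypothetical, and the slide itself only shows `default.asp` as a default page name.

```python
from urllib.parse import urlsplit

# Hypothetical list of default page names; only "default.asp" appears on the slide.
DEFAULT_PAGES = {"default.asp", "default.htm", "default.html", "index.htm", "index.html"}

def seed_pattern(url: str) -> str:
    """Return an include pattern for a listed URL, dropping a default page suffix."""
    # Prefix "//" when no scheme is given so urlsplit puts the host in netloc.
    parts = urlsplit(url if "//" in url else "//" + url)
    host, path = parts.netloc, parts.path
    head, _, last = path.rpartition("/")
    if last.lower() in DEFAULT_PAGES:
        path = head  # keep only the directory part
    return host + path

for listed in ["www.depression.com/default.asp",
               "http://www.healingwell.com/depression/",
               "www.mcmanweb.com/article-226.htm"]:
    print(listed, "->", seed_pattern(listed))
```

Directory and individual-page URLs pass through unchanged apart from the scheme, which matches the three kinds of entry (whole server, directory, individual page) listed for BPS.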

9 Should DMOZ be used?  Requires very little effort in boundary setting  Provides a big seed list of URLs spread heterogeneously across the Web (three times bigger than BPS)  Using 101 judged queries from our previous study, we retrieved 227 judged URLs from DMOZ of which 186 were relevant (81%) => DMOZ provided a good set of relevant pages with little effort, but… can we find more relevant pages elsewhere?

10 Focused Crawler  Seeks, acquires, indexes and maintains pages on a specific set of topics  Requires small investment in hardware and network resources  Starts with a seed list of URLs relevant to the topics of interest  Follows links from seed pages to identify the most promising links to crawl Is focused crawling a promising technique for building a depression portal?
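The crawl behaviour described on this slide (start from seeds, follow the most promising links first) can be sketched as a best-first traversal. This is an illustrative sketch only: `focused_crawl`, `fetch_links` and `score_link` are hypothetical names, and the toy link graph and keyword scorer stand in for real page fetching and a trained link classifier.

```python
import heapq
from typing import Callable, Iterable, List

def focused_crawl(seeds: Iterable[str],
                  fetch_links: Callable[[str], List[str]],
                  score_link: Callable[[str, str], float],
                  max_pages: int) -> List[str]:
    """Best-first crawl: always visit the highest-scoring unvisited URL next."""
    # heapq is a min-heap, so priorities are stored negated; seeds get top priority.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = {url for _, url in frontier}
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for target in fetch_links(url):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score_link(url, target), target))
    return visited

# Toy link graph standing in for the Web (assumption, for illustration only).
toy_graph = {
    "dmoz/depression": ["site/depression-help", "site/cars"],
    "site/depression-help": ["site/depression-treatment"],
    "site/cars": [],
    "site/depression-treatment": [],
}
order = focused_crawl(
    ["dmoz/depression"],
    fetch_links=lambda url: toy_graph.get(url, []),
    score_link=lambda src, tgt: 1.0 if "depression" in tgt else 0.0,
    max_pages=3,
)
print(order)
```

With a budget of three pages, the low-scoring `site/cars` link stays in the frontier and is never fetched, which is the intended saving over an unrestricted crawl.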

11 Additional Link-accessible Relevant Information  Illustration of one link away collection (figure labels: DMOZ Crawl; One link away URLs)  If pages in the current crawl have no link to additional relevant content, the prospect of successful focused crawling is very low

12 Additional Link Experiments  Experiment: Relevance of outgoing links from a crawled collection. An unrestricted crawler starting from the BPS crawl can reach 25.3% (quite high) more known relevant pages in a single step from the currently crawled pages.  Experiment: Linking patterns between relevant pages. Out of 196 new relevant URLs, 158 (about 81%) were linked to by known relevant pages.
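The two measurements on this slide can be expressed with simple set operations. The sketch below is an assumption about how such counts might be computed, not the authors' code; the function names and the toy link data are hypothetical, and only the figures 158 and 196 come from the slide.

```python
from typing import Dict, List, Set

def one_link_away(crawl_links: Dict[str, List[str]]) -> Set[str]:
    """URLs linked from crawled pages that are not themselves in the crawl."""
    crawled = set(crawl_links)
    return {t for targets in crawl_links.values() for t in targets} - crawled

def linked_from_relevant(new_relevant: Set[str],
                         crawl_links: Dict[str, List[str]],
                         known_relevant: Set[str]) -> Set[str]:
    """New relevant URLs with at least one in-link from a known relevant page."""
    targets = {t for src in known_relevant for t in crawl_links.get(src, [])}
    return new_relevant & targets

# Toy crawl: keys are crawled pages, values are their outgoing links.
toy_crawl = {"p1": ["p2", "x1"], "p2": ["x2"], "p3": ["x3"]}
print(one_link_away(toy_crawl))
print(linked_from_relevant({"x1", "x3"}, toy_crawl, known_relevant={"p1", "p2"}))

# Share reported on the slide: 158 of 196 new relevant URLs.
print(round(100 * 158 / 196, 1))
```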

13 Findings for Additional Links  Relevant pages tend to link to each other  Outgoing link set of a good collection contains quite a large number of additional relevant pages  These support the idea of focused crawling, but …  How can a crawler tell which links lead to relevant content?

14 Hypertext Classification  Traditional text classification only looks at the text in each document  Hypertext classification uses link information  We experimented with anchor text, text around the link and URL words  Here is an example

15 Features  URL: http://www.depression.com/psychotherapy.html => URL words: depression, com, psychotherapy  Anchor text: psychotherapy  Text around the link: 50 bytes before: section, learn; 50 bytes after: talk, therapy, standard, treatment
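The three feature groups above can be extracted roughly as follows. This is a sketch under several assumptions: the regular expressions, the `IGNORED_URL_WORDS` set and the function names are hypothetical; the window is counted in characters rather than bytes; and, unlike the slide's example, no stop-word removal is applied to the surrounding text.

```python
import re
from typing import Dict, List

WORD = re.compile(r"[a-z]+")
TAG = re.compile(r"<[^>]+>")
LINK = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)
# Assumed set of uninformative URL tokens; the slide keeps "com" but drops these.
IGNORED_URL_WORDS = {"http", "https", "www", "html", "htm"}

def words(text: str) -> List[str]:
    """Lower-case alphabetic tokens, with complete HTML tags removed first."""
    return WORD.findall(TAG.sub(" ", text).lower())

def link_features(html: str, window: int = 50) -> List[Dict[str, List[str]]]:
    """One feature record per link: URL words, anchor text, text before and after."""
    records = []
    for m in LINK.finditer(html):
        url, anchor = m.group(1), m.group(2)
        records.append({
            "url": [w for w in words(url) if w not in IGNORED_URL_WORDS],
            "anchor": words(anchor),
            # A window may start or end inside a tag; such fragments are not stripped.
            "before": words(html[max(0, m.start() - window):m.start()]),
            "after": words(html[m.end():m.end() + window]),
        })
    return records

page = ('In this section you can learn about '
        '<a href="http://www.depression.com/psychotherapy.html">psychotherapy</a>'
        ', a talk therapy and standard treatment.')
features = link_features(page)[0]
print(features)
```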

16 Input Data & Measures  Calculate tf.idf for all the features appearing in each URL  10-fold cross validation on 295 relevant and 251 irrelevant URLs  Classifiers: IBk, ZeroR, Naïve Bayes, C4.5 (J48), Bagging, AdaBoostM1, etc.  Measures: Accuracy, precision and recall.
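As a rough illustration of the weighting and evaluation setup, the sketch below computes tf.idf weights and splits 546 items into 10 folds. The slides do not say which tf.idf variant was used, so raw term frequency times log(N/df) is an assumption here; the function names are hypothetical, and the fold split is neither shuffled nor stratified.

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by tf * log(N / df), where df counts documents containing it."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: tf[term] * math.log(n / df[term]) for term in tf})
    return vectors

def k_folds(n_items: int, k: int = 10) -> List[List[int]]:
    """Assign item indices to k folds round-robin (no shuffling or stratification)."""
    return [list(range(i, n_items, k)) for i in range(k)]

# Each "document" is the bag of feature words collected for one URL.
docs = [["depression", "com", "psychotherapy"], ["depression", "cars"]]
vectors = tf_idf(docs)
print(vectors[0])

folds = k_folds(295 + 251)  # 295 relevant + 251 irrelevant URLs
print([len(f) for f in folds])
```

A term that appears in every document, such as "depression" in this toy input, receives weight zero, so only the discriminating words contribute to the classifier input.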

17 Hypertext Classification - Results
Classifier               Accuracy (%)   Precision (%)   Recall (%)
J48                      77.83          88.15           68.13
Naïve Bayes              73.07          78.03           69.83
Complement Naïve Bayes   71.06          77.51           65.42
ZeroR                    -              54.02           100
=> Overall, J48 is the best classifier
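The three measures can be checked against the ZeroR baseline, which labels every URL with the majority class. The sketch below assumes the 295 relevant / 251 irrelevant split from the previous slide; the function name `metrics` is hypothetical. The result is close to, though not digit-for-digit identical with, the ZeroR figures of 100 and 54.02 on this slide, which may reflect rounding or per-fold averaging.

```python
from typing import Tuple

def metrics(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float]:
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# ZeroR predicts "relevant" for everything: all 295 relevant URLs are true
# positives and all 251 irrelevant URLs are false positives.
acc, prec, rec = metrics(tp=295, fp=251, tn=0, fn=0)
print(round(100 * acc, 2), round(100 * prec, 2), round(100 * rec, 2))
```

For this baseline, accuracy and precision coincide because every prediction is positive.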

18 Hypertext Classification - Others  Bagging and boosting showed little improvement for recall  No comparable results for the topic of depression were found in the literature  A classifier looking at the content of the target pages showed similar results => Hypertext classification is quite effective

19 Findings  Web pages about depression are strongly interlinked  DMOZ depression category seems to provide a good seed list for a focused crawl  Predictive classification of outgoing links using link features achieves promising results => A cheap, high-coverage depression portal might be built and maintained using focused crawling techniques starting with the DMOZ seed list

20 Future Work  Build a domain-specific search portal: URL ranking in order of degree of relevance; data structures to hold accumulated information for unvisited URLs  Determine how to use the focused crawler operationally: no include/exclude rules, but appropriate stopping conditions; what to do if none of the outgoing links are classified as relevant?

21 Future Work  Incorporate site quality into the focused crawler, or filter for high quality pages after crawling  Extend the techniques to other domains, such as other health-related domains: are they applicable there?
