University of Economics Prague - UEP
MedIEQ Web Spider and Link Scoring Component
Marek Ruzicka
Project meeting, TKK, Helsinki, Finland, 23 October 2006

Presentation overview
– Navigation component (Spider)
– Link scoring component
– Current state
– Next steps

Navigation Component (Spider)
Input: list of URLs from the Crawler
Spidering process
– Retrieve the web page and convert its encoding to UTF-8
– Extract all links on the page
– Put internal links in the link queue
– Repeat the process for each link in the queue
Configuration of the spider
– Supported/activated link types
– Supported/activated file (web page) types
[Diagram: the CRAWLER supplies URLs to the SPIDER; the SPIDER extracts links, visits internal links, and passes UTF-8 content to the Content Classification Component; positively classified pages go to IE.]

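The spidering process on this slide can be sketched roughly as follows. This is a minimal illustration, not the MedIEQ implementation: the `fetch` callable, the `max_pages` limit, and the reading of "internal link" as "same host as the current page" are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def to_utf8(raw, declared_charset="utf-8"):
    """Decode with the declared charset and re-encode as UTF-8 bytes."""
    return raw.decode(declared_charset, errors="replace").encode("utf-8")


def spider(seed_urls, fetch, max_pages=100):
    """Visit seed URLs and their internal links breadth-first.

    fetch(url) is a hypothetical callable returning (raw_bytes, charset),
    or None when the page cannot be retrieved.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        result = fetch(url)
        if result is None:
            continue  # unreachable page, skipped in this sketch
        raw, charset = result
        content = to_utf8(raw, charset)
        pages[url] = content
        parser = LinkExtractor()
        parser.feed(content.decode("utf-8"))
        site = urlparse(url).netloc
        for href in parser.hrefs:
            link, _fragment = urldefrag(urljoin(url, href))
            # Assumption: "internal" means same host as the current page.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the loop independent of any HTTP library, so it can be exercised against an in-memory dictionary of pages.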
Navigation Component (Spider)
Storing web pages
– The content of each page is passed to the Content Classification Component (CCC)
– Positively classified pages are stored locally for IE
[Diagram: CRAWLER, SPIDER, Content Classification Component, and IE, as on the previous slide.]

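A rough sketch of the storage step might look like the code below. The `classify` callable stands in for the CCC and is hypothetical, as are the function name, the output directory layout, and the choice of SHA-1 hashes of URLs as file names.

```python
import hashlib
from pathlib import Path


def store_positive_pages(pages, classify, out_dir):
    """Store pages that the classifier accepts; return their URLs.

    pages maps URL -> UTF-8 encoded content (bytes).
    classify(content) is a hypothetical stand-in for the CCC and
    returns True for positively classified pages.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stored = []
    for url, content in pages.items():
        if classify(content):
            # Assumption: file names are derived from a hash of the URL.
            name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
            (out / name).write_bytes(content)
            stored.append(url)
    return stored
```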
Link Scoring Component
– Extracts "link objects" (links together with link text, surrounding text, alt text, etc.)
– Consists of several modules, each specialized for a given kind of content (e.g. contact pages)
– If at least one module scores a link positively, the spider explores it
Link scoring modules
– Created by ML or heuristics
– Tested so far with heuristics
[Diagram: the CRAWLER supplies URLs; the SPIDER extracts link objects and passes them to the Link Scoring Component, which returns positively scored links; UTF-8 content goes to the Content Classification Component, and positively classified pages go to IE.]

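The "at least one module scores positively" rule can be illustrated with a small sketch. The `LinkObject` fields follow the slide; the `keyword_module` heuristic and the `select_links` function are hypothetical names invented for this example and say nothing about how the actual modules work.

```python
from dataclasses import dataclass


@dataclass
class LinkObject:
    """A link plus the context the slide lists for it."""
    url: str
    link_text: str = ""
    surrounding_text: str = ""
    alt_text: str = ""


def keyword_module(keywords):
    """Build a heuristic module that fires when any keyword appears."""
    lowered = [k.lower() for k in keywords]

    def score(link):
        haystack = " ".join(
            [link.url, link.link_text, link.surrounding_text, link.alt_text]
        ).lower()
        return any(k in haystack for k in lowered)

    return score


def select_links(links, modules):
    """Keep links that at least one module scores positively."""
    return [link for link in links if any(m(link) for m in modules)]
```

A module here is any callable taking a `LinkObject` and returning a boolean, so a trained classifier could later be dropped in beside the heuristic ones without changing `select_links`.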
Current state
– The spider successfully retrieves about 95% of web pages
– The list of "unreachable" pages is stored for the next run
– The spider runs multi-threaded, with one thread per web site
Spidering experience
– The "correct" number of threads depends strongly on hardware and network capacity
– Common "spider traps" are usually harmless
– There are still "spider-killer" pages in the medical domain
– Link scoring modules (LSMs) based on heuristics do not give good results

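One possible way to process each web site in a single thread while keeping the total thread count tunable is a bounded thread pool, sketched below. The `crawl_site` callable and the `max_threads` default are assumptions for illustration; the slides do not say how the threads are actually managed.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_sites(sites, crawl_site, max_threads=4):
    """Crawl each site in one worker thread; collect unreachable URLs.

    crawl_site(site) is a hypothetical callable that spiders one web
    site and returns the list of URLs it could not retrieve.
    max_threads should be tuned to the hardware and network at hand.
    """
    unreachable = []
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() hands each site to exactly one worker thread.
        for failed in pool.map(crawl_site, sites):
            unreachable.extend(failed)
    return unreachable
```

The returned list could then be written out and used as the input of the next run.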
Next steps
Spider
– Examine the influence of spider traps on the spider
– Avoid spider-killer pages
– Enable spider configuration through a web interface
Link scoring component (LSC)
– Train link scoring modules using ML
– Enable LSC configuration through a web interface