17th APAN Meetings & Joint Techs Workshop FilipinianaWeb Nestor Michael C. Tiglao Computer Networks Lab (CNL) University of the Philippines Jan. 30, 2004
World Wide Web: enormous growth (10 billion pages); imagine the Web without search engines; need for intelligent document discovery mechanisms
Web Crawlers: programs that retrieve Web pages. Two kinds: general-purpose crawlers and focused crawlers
Sample Query: anthrax
Result 1
Result 2
Focused Crawler: selectively seeks out pages that are relevant to a pre-defined set of topics; topics are specified by sample documents
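The idea on this slide can be illustrated with a minimal sketch. It is not the implementation presented in the talk: the cosine-similarity scoring, the `threshold` value, the `fetch` callback and the `TOY_WEB` graph are all assumptions made for illustration. Pages are scored against the sample documents, and only links found on sufficiently relevant pages are followed.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(text, samples):
    """Score a page by its best match against the sample documents."""
    page = Counter(tokenize(text))
    return max(cosine(page, Counter(tokenize(s))) for s in samples)


def focused_crawl(seeds, fetch, samples, threshold=0.2, max_pages=100):
    """Best-first crawl: links inherit the score of the page they came from.

    `fetch(url)` is a hypothetical callback returning (text, links).
    """
    frontier = [(-1.0, url) for url in seeds]  # negated score: max-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        fetched += 1
        score = relevance(text, samples)
        if score < threshold:
            continue  # off-topic page: do not follow its links
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant


# Hypothetical in-memory "web" standing in for real HTTP fetches.
TOY_WEB = {
    "a": ("anthrax bacteria spores infection", ["b", "c"]),
    "b": ("anthrax vaccine and infection treatment", ["d"]),
    "c": ("basketball scores and league standings", ["e"]),
    "d": ("bacteria spores in soil", []),
    "e": ("more basketball news", []),
}


def toy_fetch(url):
    return TOY_WEB[url]
```

Calling `focused_crawl(["a"], toy_fetch, ["anthrax spores cause infection by bacteria"])` visits pages `a`, `b`, `c` and `d`, keeps the three on-topic ones, and never requests `e`, because the page linking to it scored below the threshold. A general-purpose crawler would have fetched all five.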
Research on Search Engines: implemented the focused crawler on a Linux cluster using Beowulf and MPI (2002); Philippine-specific search engine using the openMosix platform (2003)
Focused Crawler Architecture (diagram components): User Interface, Results, Sample Document, Classifier, Crawl Tables, Distiller, Crawler
Focused Crawler Design
Flowchart
Performance (Crawl Time)
Why another search engine? Existing Philippine search engines (Yehey.com, Alleba, Tanikalang Ginto, Pugad.com and EdsaWorld) are actually web directories. We need a better search engine
Unique Situation: many Philippine-related sites are not registered under the .ph domain; many sites are hosted outside the Philippines; English is the de facto language
System Design (Gagambot)
Filters: .ph domain filter (gov.ph, edu.ph); language filter (ISO 639; iso-8859-1/latin1 and windows-1252; subset of Unicode characters; utf-8 and us-ascii)
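The two filters on this slide can be sketched as simple predicates. This is only a sketch under assumptions: the function names and the `ACCEPTED_CHARSETS` set are hypothetical, the charset is read from a `Content-Type` header value, and the slide does not say how Gagambot combines the filters, so they are left separate here. The ISO 639 language-code check is omitted.

```python
from urllib.parse import urlsplit

# Character sets named on the slide (assumed to be the accepted ones).
ACCEPTED_CHARSETS = {"iso-8859-1", "latin1", "windows-1252", "utf-8", "us-ascii"}


def in_ph_domain(url):
    """True if the URL's host is under .ph (e.g. gov.ph, edu.ph).

    URLs without a scheme have no parsed hostname and are rejected.
    """
    host = (urlsplit(url).hostname or "").lower()
    return host == "ph" or host.endswith(".ph")


def charset_of(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset":
            return value.strip("\"' ").lower()
    return None


def passes_charset_filter(content_type):
    """True if the declared charset is one of the accepted encodings."""
    return charset_of(content_type) in ACCEPTED_CHARSETS
```

For example, `in_ph_domain("http://www.up.edu.ph/index.html")` is true and `passes_charset_filter("text/html; charset=Shift_JIS")` is false. Matching on the `.ph` suffix of the parsed hostname, rather than searching the whole URL string, avoids false hits such as a path ending in `.php`.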
Filters 2: GeoURL filter: location-to-URL reverse directory that finds URLs by their proximity to a given location (www.geourl.org); Bayesian filter: analyzes the textual content of the HTML document
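A Bayesian text filter of the kind named on this slide could look roughly like the sketch below. The talk does not describe its model, so this assumes a multinomial naive Bayes classifier with add-one smoothing; the class name, the labels `"ph"` and `"other"`, and the training sentences in the usage note are hypothetical. The input is assumed to be text already extracted from the HTML document.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training docs
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        """log P(label) + sum of log P(token | label) over the text."""
        counts = self.word_counts[label]
        total = sum(counts.values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokenize(text):
            score += math.log((counts[tok] + 1) / (total + len(self.vocab)))
        return score

    def classify(self, text):
        """Return the label with the highest posterior log score."""
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))
```

As a usage example with made-up training data:

```python
f = NaiveBayesFilter()
f.train("manila cebu davao barangay pinoy news", "ph")
f.train("philippine senate manila peso jeepney", "ph")
f.train("london weather football premier league", "other")
f.train("new york stock market football", "other")
label = f.classify("jeepney routes in manila")  # "ph" for this toy data
```

Working in log space avoids numeric underflow on long pages, and the smoothing keeps a single unseen word from zeroing out a class.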
FilipinianaWeb
Current Plans: develop FilipinianaWeb on a grid platform; better filtering techniques; integrate focused crawling; support for other object formats (documents, images, XML, etc.)
Conclusion: FilipinianaWeb is a work in progress and a proof-of-application. Grid infrastructure will help provide the computational and resource requirements of a production-level search engine