Ontological Classification of Web Pages Zafer Erenel Many users use search engines to locate and buy goods and services (such as choosing a vacation). Web pages presented on the internet do not conform to any data organization standard and search engines provide primitive query capabilities for users to retrieve relevant data [1]. In addition to that, they do not list sites equally and are inclined toward listing more popular pages. These tendencies brush many web pages aside and leave a limited number of alternatives to the users. I have created a lightweight domain ontology that consists of a taxonomic hierarchy and made use of it by an automated agent to classify web pages on the internet. The automated agent discovers and classifies relevant pages with the help of Yahoo and Google search engines.
Related Research Desai and Spink presented a clustering scheme that groups documents into partially and substantially relevant pages by using similarity measures and ranking heuristics [2]. They worked with the end-user queries (limited number of terms) to obtain the relevance score. Instead, my automated agent will act along with an established ontology to discover and classify documents. Chiang, Chua, and Storey parsed snippets of returned links to find the ratio of the number of matching terms to rank the web pages for relevance [1]. I believe snippets consist of a very few number of words and we cannot judge the web page on the basis of snippets. My agent scours the entire web page which is more time-consuming but more effective
Yahoo and Google search results contain scores of links. My lightweight domain ontology consists of 7 branches. Each branch is comprised of predetermined terms. Score of the web page increases in a certain branch as the agent comes across these predetermined terms on the html code. I’ve chosen country ontology because internet users’ interest in a certain country can be quite high.
The ranks of web pages in each cluster will clarify their content to the user. In addition to that, we can compare result sets of different search engines (Yahoo and Google) for the same queries and find complement and intersection of their result sets to have a clear understanding of search engines’ behaviors.
I’ve used web stream classes in C# Programming language to create my agent. A WebRequest is an object that requests a Uniform Resource Identifier (URI) such as the URL for a web page [3]. You can use a WebRequest object to create a WebResponse object that will encapsulate the object pointed to by the URI. Once you get the actual object (e.g., a web page) pointed to by the URI, what you get back is a stream of the web page.
I used this capability for reading a page from a site to extract the information I need. I have created two web requests using search syntax given below
Google search engine has returned 200 URLs and I have created 200 web requests to extract relevant information from each web page. Yahoo search engine has been used in the same manner to extract relevant information.
If we analyze price rankings, we come across pages that have information about student flights, travel insurances, vacation package discounts, cheap flights and etc.. If we analyze nature rankings, we come across web pages that offer adventure and etc. If I want to do scuba diving on my vacation, I know that hawai and fiji are among my options by looking at activities rankings
Interestingly enough, in the top 100 search lists, the number of web pages that both appear on Google and Yahoo is 19.
In the top 200 search lists, the number of web pages that both appear on Google and Yahoo is 24.
As a result, ontologically organized clusters of web sites that are offering information about a given country regarding vacation and travel alternatives serve our objective to a greater extent in finding what we are in search of. In my work, I have used 2 search engines and a single ontology. Search Engines’ shortcomings can be prevented by combining multiple engines with multiple ontologies to ease the search for most needed information on the internet. Venn Diagrams prove that a specific search engine is not very effective by itself.
References [1] R.H.L. Chiang, C.E.H. Chua, V.C. Storey, A smart web query method for semantic retrieval of web data, Data & Knowledge Engineering 38 (2001) [2] M. Desai, A. Spink, An algorithm to cluster documents based on relevance, Information Processing and Management 41 (2005) [3] Liberty, J., Programming C#,3rd ed. O’REILLY, 2003.