- University of North Texas - DSCI 5240 Fall 2012 - Graduate Presentation - Option A Slides Modified From 2008 Jones and Bartlett Publishers, Inc. Version.

Rajendra Akerkar, Pawan Lingras

OUTLINE
- Introduction
- Crawlers
- Queries
- Search Engine

INTRODUCTION
1. Web content mining
2. Uses of Web-content mining techniques
3. Problems with the web data
4. Two approaches of web-content mining

INTRODUCTION
Web content and uses of Web-content mining techniques
o Web-content mining techniques are used to discover useful information from content on the web.
o Some of the web content is generated dynamically using queries to database management systems.
o Other web content may be hidden from general users.

INTRODUCTION
Problems with the web data
o Distributed data
o Large volume
o Unstructured data
o Redundant data
o Quality of data
o High percentage of volatile data
o Varied data

INTRODUCTION
Two approaches of web-content mining
o Agent-based: software agents perform the content mining.
o Database-oriented: view the Web data as belonging to a database.

CRAWLERS

CRAWLERS
Crawling process
o A crawler is a computer program that navigates the hypertext structure of the web.
o It builds an index by visiting a number of pages and then replaces the current index.
- Begin with a group of URLs
- Breadth-first or depth-first
- Extract more URLs
Numerous crawlers
- Problem of redundancy
- Web partition: one robot per partition
Context graph
- Focused crawling has proposed the use of context graphs, which in turn created the context focused crawler (CFC).
- The CFC performs crawling in two steps.
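The crawling steps (begin with a group of URLs, traverse breadth-first or depth-first, extract more URLs) can be illustrated with a minimal sketch. It assumes a hypothetical in-memory link graph, `TOY_WEB`, in place of real HTTP fetching; the function name and the `max_pages` limit are also illustrative and not from the slides.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
TOY_WEB = {
    "a.example/": ["a.example/1", "a.example/2"],
    "a.example/1": ["a.example/3"],
    "a.example/2": ["a.example/3", "a.example/"],
    "a.example/3": [],
}

def crawl_breadth_first(seeds, get_links, max_pages=100):
    """Visit pages breadth-first from the seed URLs; return the visit order."""
    frontier = deque(seeds)
    seen = set(seeds)  # URLs already queued, so no page is visited twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()  # frontier.pop() would traverse depth-first instead
        visited.append(url)
        for link in get_links(url):  # extract more URLs from the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl_breadth_first(["a.example/"], lambda u: TOY_WEB.get(u, []))
print(order)
```

The `seen` set is one simple answer to the redundancy problem within a single crawler; partitioning the web across several robots is not modeled here.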

CRAWLERS
Focused crawler
o Visits pages related to topics of interest.
o Generally recommended for use due to the large size of the Web.
Two major parts
o The focused crawler structure consists of two major parts: the distiller and the hypertext classifier.
Priority-based structure
o The pages that the crawler visits are selected using a priority-based structure managed by the priority associated with pages by the classifier and the distiller.
Documents
o Sample documents are identified and classified based on a hierarchical classification tree.
o These documents are used as the seed documents to begin the focused crawling.
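A priority-based frontier can be sketched with a heap. This is only an illustration under stated assumptions: the toy `PAGES` data, the `TOPIC_TERMS` set, the `threshold` parameter, and the word-overlap `relevance` score are all hypothetical. The score is a crude stand-in for the hypertext classifier, and the distiller is not modeled at all.

```python
import heapq

# Hypothetical toy data: page id -> (page text, outgoing links).
PAGES = {
    "seed": ("web mining overview", ["sports", "crawler", "index"]),
    "sports": ("football scores", ["crawler"]),
    "crawler": ("web crawler mining", ["index"]),
    "index": ("search index mining web data", []),
}
TOPIC_TERMS = {"web", "mining", "crawler", "index"}

def relevance(text):
    """Stand-in for the classifier: fraction of words that are topic terms."""
    words = text.split()
    return sum(w in TOPIC_TERMS for w in words) / len(words)

def focused_crawl(seed, threshold=0.5):
    """Visit the highest-priority page first; skip pages scoring below the threshold."""
    frontier = [(-relevance(PAGES[seed][0]), seed)]  # negate scores: heapq is a min-heap
    seen = {seed}
    visited = []
    while frontier:
        neg_score, url = heapq.heappop(frontier)
        if -neg_score < threshold:
            continue  # off-topic page: neither visited nor expanded
        visited.append(url)
        for link in PAGES[url][1]:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(PAGES[link][0]), link))
    return visited

visited_pages = focused_crawl("seed")
print(visited_pages)
```

With this toy data the off-topic "sports" page is queued but never visited, which is the point of a focused crawl: effort goes to pages related to the topics of interest.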

SEARCH ENGINE
1. Examples of search engines
2. Components of a search engine
3. Search engine mechanism
4. Responsibilities of search engines

SEARCH ENGINE
o A search engine uses a ‘spider’ or ‘crawler’ that crawls the Web hunting for new or updated Web pages to store in an index.
o Basic components of a search engine:
- The spider: gathers new or updated information on Internet websites.
- The index: stores information about several websites.
- The search software: searches through the huge index in an effort to generate an ordered list of useful search results.
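The relationship between the index and the search software can be sketched as an inverted index plus a simple ranking function. The documents in `DOCS` are hypothetical stand-ins for what a spider would gather, and ranking by raw term counts is an assumption made for illustration, not how any particular search engine scores results.

```python
from collections import defaultdict

# Hypothetical documents a spider might have gathered.
DOCS = {
    "d1": "web content mining discovers useful information",
    "d2": "a crawler visits web pages",
    "d3": "web mining and web crawling",
}

def build_index(docs):
    """The index: map each term to {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """The search software: order documents by total count of matching query terms."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

index = build_index(DOCS)
results = search(index, "web mining")
print(results)
```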

SEARCH ENGINE
Search engine mechanism
o The generic structure of all search engines is basically the same.
o However, the search results differ from search engine to search engine for the same search terms.
Responsibilities of search engines
o Document collection: choose the documents to be indexed.
o Document indexing: represent the content of the selected documents; frequently two indices are maintained.
o Searching: translate the user's information need into a query; retrieval.
o Document and query management: present the outcome; virtual collection.

QUERIES
o Three-tier process of translating the user's need into a search engine query:
- The first level involves the user formulating the information need into a question or a list of terms, using experiences and vocabulary, and entering it into the search engine.
- On the next level, the search engine must translate the words, with possible spelling errors, into processing tokens.
- On the third level, the search engine must use the processing tokens to search the document database and retrieve the appropriate documents.
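The three levels can be traced in a small sketch. It assumes a hypothetical `VOCABULARY` of known tokens and a toy `DOCUMENTS` database, and it uses the standard library's `difflib.get_close_matches` as one possible way to absorb spelling errors; the slides do not say how the translation into tokens is actually done.

```python
import difflib

# Hypothetical vocabulary of processing tokens and a toy document database.
VOCABULARY = ["crawler", "index", "mining", "search", "web"]
DOCUMENTS = {
    "d1": {"web", "mining"},
    "d2": {"search", "index"},
    "d3": {"web", "crawler"},
}

def to_tokens(query):
    """Level 2: map each word, possibly misspelled, to a known processing token."""
    tokens = []
    for word in query.lower().split():
        match = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.7)
        if match:  # words with no close vocabulary entry are dropped
            tokens.append(match[0])
    return tokens

def retrieve(tokens):
    """Level 3: return the documents that contain every processing token."""
    if not tokens:
        return []
    return sorted(d for d, terms in DOCUMENTS.items() if set(tokens) <= terms)

user_query = "web minning"  # Level 1: the user's terms, including a typo
tokens = to_tokens(user_query)
hits = retrieve(tokens)
print(tokens, hits)
```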

QUERIES
o Term searches: the most common type of query on the Web is when a user provides a few words or phrases for the search.
o Boolean queries: Boolean logic queries connect words in the search using operators such as AND or OR.
o Natural language queries: the user frames the query as a question or a statement.
o Thesaurus queries: the user selects the term from a set of terms predetermined by the retrieval system.
o Fuzzy queries: reflect no specificity.
o Probabilistic queries: refer to the way in which the IR system retrieves documents according to relevancy.
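Boolean queries map naturally onto set operations. The sketch below assumes a hypothetical inverted index, `INDEX`, from terms to sets of document ids; the helper names are illustrative only.

```python
# Hypothetical inverted index: term -> set of document ids containing it.
INDEX = {
    "web": {"d1", "d2", "d3"},
    "mining": {"d1", "d3"},
    "crawler": {"d2"},
}

def boolean_and(*terms):
    """Documents containing every term (set intersection)."""
    sets = [INDEX.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(*terms):
    """Documents containing at least one term (set union)."""
    return set().union(*(INDEX.get(t, set()) for t in terms))

and_hits = sorted(boolean_and("web", "mining"))
or_hits = sorted(boolean_or("mining", "crawler"))
print(and_hits, or_hits)
```

AND narrows the result to documents matching all terms, while OR widens it to documents matching any of them.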

Thank you for your attention!
