Data Mining for Web Intelligence Presentation by Julia Erdman.


1 Data Mining for Web Intelligence Presentation by Julia Erdman

2 Data Mining the Web: Searching, comprehending, and using the semi-structured data on the web poses a greater challenge than data mining in a commercial database system. Web data is more sophisticated and dynamic. Data mining helps search engines find high-quality web pages.

3 Why Data Mining? Challenges of data mining the web: Web page complexity far exceeds the complexity of any traditional text document collection. The Web constitutes a highly dynamic information source. The Web serves a broad spectrum of user communities. Only a small portion of the Web’s pages contain truly relevant or useful information.

4 Why Data Mining? Approaches to accessing information on the web: keyword-based search or topic-directory browsing (e.g. Google, Yahoo); querying deep Web sources (e.g. Amazon.com, Realtor.com); random surfing.

5 Design Challenges: Traditional schemes for accessing data on the web are text-oriented and keyword-based. The current access schemes must be replaced with more sophisticated schemes in order to exploit the Web completely.

6 Access Limitations: Lack of high-quality keyword-based searches. A search can return many answers, e.g. when searching popular categories like sports or politics. Overloaded keyword semantics can return many low-quality answers, e.g. a search for jaguar could be for an animal, a car, or a sports team. A search can miss many highly related pages that do not contain the posed keywords.
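The two keyword-search failures on this slide can be reproduced with a few lines of Python. This is only a toy sketch: the `keyword_search` function, the document ids, and the document texts are all made up for illustration, and real search engines are far more elaborate.

```python
def keyword_search(query, documents):
    """Return ids of documents that contain every query term as a whole word."""
    terms = query.lower().split()
    return [
        doc_id
        for doc_id, text in documents.items()
        if all(term in text.lower().split() for term in terms)
    ]

# Hypothetical mini-collection: three senses of "jaguar", plus one page
# about the animal that never uses the word.
documents = {
    "zoo": "the jaguar is a large cat native to the americas",
    "dealer": "test drive the new jaguar sedan today",
    "football": "the jaguar offense struggled on sunday",
    "big_cats": "panthera onca hunts in rainforest habitats",
}

results = keyword_search("jaguar", documents)
print(results)
```

The query returns the animal, car, and sports pages together (overloaded semantics) and skips `big_cats`, which is highly related but does not contain the posed keyword.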

7 Access Limitations: Lack of effective deep-Web access. There are at least 100,000 searchable databases on the Web with high-quality, well-maintained information, but they are not effectively accessible. They form an extremely large collection of autonomous and heterogeneous databases, each supporting specific query interfaces with different schemas and query constraints.

8 Access Limitations: Lack of automatically constructed directories. A topic- or type-oriented Web information directory creates an organized picture of a web sector. Developers must organize these directories manually, which is costly, provides only limited coverage, and is not easily scalable or adaptable.

9 Access Limitations: Lack of semantics-based query primitives. Most keyword-based searches allow only a small set of search options.

10 Access Limitations: Lack of feedback on human activities. Web links may not be updated frequently, regularly, or at all. Changes in access frequency do not automatically adjust search results.

11 Access Limitations: Lack of multidimensional analysis and data mining support. Users cannot drill deeply into sites to find the data they are looking for.

12 Mining Web search-engine data: Current keyword-based search engines have several deficiencies. A widely covered topic can span hundreds of thousands of documents. Highly relevant documents may not contain the keywords used in the search.

13 Analyzing the Web’s link structure: When one web page contains a link to another, this can be considered an endorsement of the linked page. Collected endorsements of the same page from many different web authors indicate an authoritative web page. A hub is a single web page that contains a collection of links to authoritative web pages.
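The hub and authority idea on this slide matches the mutual-reinforcement scheme of Kleinberg's HITS algorithm, although the slide does not name a specific algorithm, so treating it as HITS is an assumption. A minimal sketch on a made-up link graph (the page names and the `hits` function are illustrative only):

```python
import math

def hits(links, iterations=50):
    """Iteratively score pages as hubs and authorities.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages that link in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for target in targets:
                new_auth[target] += hub[src]
        # Hub: sum of the authority scores of the pages linked to.
        new_hub = {
            p: sum(new_auth[t] for t in links.get(p, [])) for p in pages
        }
        # Normalize so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth

# Hypothetical graph: two link collections pointing at three content sites.
toy_links = {
    "directory": ["site_a", "site_b", "site_c"],
    "blog": ["site_a", "site_b"],
}
hub, auth = hits(toy_links)
print(max(hub, key=hub.get))
print(sorted(auth, key=auth.get, reverse=True))
```

In this toy graph `directory` ends up as the strongest hub, and `site_a` and `site_b`, which are endorsed by both link collections, outrank `site_c` as authorities.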

14 Classifying Web documents automatically: Generally, human readers classify Web documents, but automatic classification is highly desirable. Hyperlinks contain high-quality semantic clues to a page’s topic, which can help achieve accurate classifications. However, links to unrelated sites can cloud the classification, e.g. many sites have a link to weather.com but generally are not weather sites. Automatic classification can determine which class a web page belongs to, but not which classes it does not belong to.
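One way to picture the link-clue idea, and the weather-link problem that clouds it, is a weighted vote over the topics of the pages a document links to. The sketch below is not the method from the cited article: the IDF-style down-weighting of widely linked targets, the `classify_by_links` function, and the `.example` site names are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def classify_by_links(page, out_links, known_topics):
    """Guess a page's topic from the known topics of its link targets.

    Targets that almost every page links to get a weight near zero,
    so a ubiquitous weather link cannot dominate the vote.
    """
    in_degree = Counter(
        target for targets in out_links.values() for target in set(targets)
    )
    n_pages = len(out_links)
    votes = defaultdict(float)
    for target in out_links.get(page, []):
        topic = known_topics.get(target)
        if topic is None:
            continue  # no label for this target, so it casts no vote
        weight = math.log((1 + n_pages) / (1 + in_degree[target]))
        votes[topic] += weight
    return max(votes, key=votes.get) if votes else None

# Hypothetical data: every page links to the weather site,
# but only sports-related pages link to the scores site.
out_links = {
    "unknown_page": ["scores.example", "weather.example"],
    "news_a": ["weather.example"],
    "travel_b": ["weather.example"],
    "fan_c": ["scores.example", "weather.example"],
}
known_topics = {"scores.example": "sports", "weather.example": "weather"}

label = classify_by_links("unknown_page", out_links, known_topics)
print(label)
```

With an unweighted vote the two links would tie. The down-weighting gives the ubiquitous weather link zero weight, so the page is labelled `sports`.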

15 Mining Web page semantic structures and page contents: Fully automatic extraction of Web page structures and semantic contents can be difficult due to the limitations of automated natural-language parsing. Semiautomatic methods can recognize a portion of such structures. Further analysis can then show how the contents fit into these structures.

16 Mining Web page semantic structures and page contents: To identify the structures to extract, either an expert manually specifies them or techniques must be developed to produce them automatically. Alternatively, developers can use Web page classes for automatic extraction. Semantic page structure and content recognition will provide for more in-depth analysis of Web pages.

17 Mining Web dynamics: Contents, structures, and access patterns change on the Web. Storing historical data about Web pages assists in finding changes in content and links. But due to the phenomenal breadth of the Web, it is impossible to store every page image and update. Mining web log records can provide quality results. This data needs to be analyzed and transformed into useful, significant information.
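As a small illustration of turning raw web log records into usable information, the sketch below counts successful page requests per path. It assumes the logs follow the Common Log Format; the sample lines, addresses, and paths are invented, and the slide itself does not specify a log format or an analysis.

```python
import re
from collections import Counter

# One Common Log Format record: host, identity, user, [time], "request", status, size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def count_page_hits(lines):
    """Count successful GET requests per path, skipping unparsable lines."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue
        if match.group("method") == "GET" and match.group("status").startswith("2"):
            hits[match.group("path")] += 1
    return hits

# Made-up sample records for illustration.
sample_log = [
    '192.0.2.1 - - [10/Nov/2002:13:55:36 -0500] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.2 - - [10/Nov/2002:13:56:01 -0500] "GET /products.html HTTP/1.0" 200 512',
    '192.0.2.1 - - [10/Nov/2002:13:57:12 -0500] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.3 - - [10/Nov/2002:13:58:40 -0500] "GET /missing.html HTTP/1.0" 404 209',
    'malformed line',
]

page_hits = count_page_hits(sample_log)
print(page_hits.most_common())
```

Comparing such counts across time windows is one simple way to observe changing access patterns without storing the pages themselves.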

18 Building a multilayered, multidimensional Web: Systematically analyze a set of Web pages. Group closely related local Web pages, or an individual page, into a cluster called a semantic page. The analysis provides a descriptor for the cluster. Then create a semantics-based, evolving, multidimensional, multilayered Web information directory.
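The grouping and descriptor steps on this slide can be sketched with a simple greedy clustering over page vocabularies. Everything here is an assumption made for illustration: the Jaccard similarity measure, the 0.3 threshold, the crude tokenizer, and the sample pages. The cited article does not prescribe this procedure.

```python
from collections import Counter

def tokenize(text):
    """Crude tokenizer: lowercase words longer than three characters."""
    return {word for word in text.lower().split() if len(word) > 3}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_semantic_pages(pages, threshold=0.3):
    """Greedily group pages whose vocabulary overlaps an existing cluster."""
    clusters = []
    for url, text in pages.items():
        words = tokenize(text)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(words, set(cluster["terms"]))
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            best = {"urls": [], "terms": Counter()}
            clusters.append(best)
        best["urls"].append(url)
        best["terms"].update(words)
    # Descriptor: the terms shared by the most pages in the cluster.
    for cluster in clusters:
        cluster["descriptor"] = [t for t, _ in cluster["terms"].most_common(3)]
    return clusters

# Hypothetical local pages from one site.
pages = {
    "/faculty/list": "faculty members research database systems",
    "/faculty/smith": "faculty research database mining profile",
    "/store/books": "textbook prices store shipping orders",
}

semantic_pages = build_semantic_pages(pages)
for cluster in semantic_pages:
    print(cluster["urls"], cluster["descriptor"])
```

The two faculty pages fall into one semantic page described by their shared terms, while the store page forms its own cluster. Descriptors like these could then serve as entries in a higher layer of the directory.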

19 Questions? Comments?

20 Han, J. & Chang, K.C.-C. "Data mining for Web intelligence" IEEE Computer, Volume 35, Issue 11, Nov. 2002, pp. 64-70.

