Data Mining for Web Intelligence. Presentation by Julia Erdman.

Data Mining the Web
Searching, comprehending, and using the semi-structured data on the Web poses a greater challenge than data mining in a commercial database system. Web data is more sophisticated and dynamic. Data mining helps search engines find high-quality Web pages.

Why Data Mining?
Challenges of data mining the Web: Web page complexity far exceeds that of any traditional text document collection. The Web is a highly dynamic information source. The Web serves a broad spectrum of user communities. Only a small portion of the Web's pages contain truly relevant or useful information.

Why Data Mining?
Approaches to accessing information on the Web: keyword-based search or topic-directory browsing (e.g., Google, Yahoo); querying deep-Web sources (e.g., Amazon.com, Realtor.com); and random surfing.

Design Challenges
Traditional schemes for accessing data on the Web are based on a text-oriented, keyword-based view of Web pages. These access schemes must be replaced with more sophisticated ones in order to exploit the Web fully.

Access Limitations
Lack of high-quality keyword-based searches. A search can return many answers, e.g., when searching popular categories such as sports or politics. Overloaded keyword semantics can return many low-quality answers, e.g., a search for "jaguar" could refer to an animal, a car, or a sports team. A search can also miss many highly related pages that do not contain the posed keywords.
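The two failure modes on this slide can be illustrated with a minimal sketch of a naive keyword search. The corpus, the URLs, and the `keyword_search` helper are invented for illustration and are not taken from the presentation.

```python
# Toy corpus: URLs and page texts are hypothetical.
PAGES = {
    "zoo.example/jaguar": "the jaguar is a large cat native to the americas",
    "cars.example/xj": "the jaguar xj is a luxury car",
    "nfl.example/jax": "the jaguar is the mascot of a football team",
    "cats.example/panthera": "panthera onca is the largest cat in the americas",
}


def keyword_search(query, pages):
    """Return the URLs of pages that contain every query term as a whole word."""
    terms = query.lower().split()
    hits = []
    for url, text in pages.items():
        words = set(text.lower().split())
        if all(term in words for term in terms):
            hits.append(url)
    return sorted(hits)


results = keyword_search("jaguar", PAGES)
# Overloaded semantics: the animal, the car, and the team all match.
# Missed page: "cats.example/panthera" describes the same animal
# but never uses the word "jaguar", so it is not returned.
print(results)
```

The search returns three pages with unrelated meanings and silently drops a fourth that is on topic, which is the gap that semantics-aware mining is meant to close.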

Access Limitations
Lack of effective deep-Web access. There are at least 100,000 searchable databases on the Web with high-quality, well-maintained information, but they are not effectively accessible. Together they form an extremely large collection of autonomous and heterogeneous databases, each supporting specific query interfaces with different schemas and query constraints.

Access Limitations
Lack of automatically constructed directories. A topic- or type-oriented Web information directory presents an organized picture of a Web sector. Developers must organize these directories manually, which is costly, provides only limited coverage, and is not easily scalable or adaptable.

Access Limitations
Lack of semantics-based query primitives. Most keyword-based searches allow only a small set of search options.

Access Limitations
Lack of feedback on human activities. Web links may not be updated frequently, regularly, or at all. Changes in access frequency do not automatically adjust search results.

Access Limitations
Lack of multidimensional analysis and data mining support. Users cannot drill deeply into sites to find the data they are looking for.

Mining Web search-engine data
Current keyword-based search engines have several deficiencies. A search on a widely covered topic can return hundreds of thousands of documents. Highly relevant documents may not contain the keywords used in the search.

Analyzing the Web's link structure
When one Web page links to another, the link can be considered an endorsement of the linked page. Endorsements of the same page collected from many different Web authors mark it as an authoritative Web page. A hub is a single Web page that contains a collection of links to authoritative Web pages.
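The hub and authority idea can be sketched with a mutual-reinforcement iteration in the style of Kleinberg's HITS algorithm. The slide does not name a specific algorithm, so treating it as HITS is an assumption; the link graph and page names below are hypothetical.

```python
import math

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "portal": ["a", "b"],
    "blog": ["a", "b"],
    "fan": ["a"],
    "a": [],
    "b": [],
}


def hits(links, iterations=50):
    """Compute hub and authority scores by repeated mutual reinforcement."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


hub, auth = hits(LINKS)
best_authority = max(auth, key=auth.get)  # "a": endorsed by three pages
```

In this toy graph the page with the most inbound endorsements becomes the top authority, and the pages that link to both authorities score higher as hubs than the page that links to only one.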

Classifying Web documents automatically
Generally, human readers classify Web documents, but automatic classification is highly desirable. Hyperlinks contain high-quality semantic clues to a page's topic, which can help achieve accurate classification. However, links to unrelated sites can cloud the classification, e.g., many sites link to weather.com but are generally not weather sites. Automatic classification can determine which class a Web page belongs to, but not which classes it does not belong to.
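One way to picture how hyperlinks carry topic clues is a vote over the anchor text of links pointing at a page. This is only a rough sketch under stated assumptions: the topic vocabularies, the anchor texts, and the `classify_by_anchor_text` helper are all hypothetical, and a real classifier would be trained rather than hand-coded.

```python
from collections import Counter

# Hypothetical topic vocabularies.
TOPIC_TERMS = {
    "weather": {"forecast", "weather", "radar"},
    "sports": {"scores", "league", "football"},
}


def classify_by_anchor_text(anchor_texts, topic_terms=TOPIC_TERMS):
    """Pick the topic whose vocabulary overlaps most with inbound anchor text."""
    votes = Counter()
    for text in anchor_texts:
        words = set(text.lower().split())
        for topic, terms in topic_terms.items():
            votes[topic] += len(words & terms)
    # No evidence at all: leave the page unclassified.
    if not votes or max(votes.values()) == 0:
        return None
    return votes.most_common(1)[0][0]


# Anchor text of links pointing at one page; the last link is off-topic noise.
inbound = ["local weather forecast", "live radar", "football scores"]
label = classify_by_anchor_text(inbound)  # "weather" (3 votes against 2)
```

The off-topic link contributes votes for the wrong class, which shows how unrelated links can cloud the result when the on-topic evidence is thin.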

Mining Web page semantic structures and page contents
Fully automatic extraction of Web page structures and semantic contents can be difficult because of the limitations of automated natural-language parsing. Semiautomatic methods can recognize a portion of such structures. Further analysis can then show how the contents fit into these structures.

Mining Web page semantic structures and page contents
To identify the structures to extract, either an expert specifies them manually or techniques must be developed to produce them automatically. Alternatively, developers can use Web page classes for automatic extraction. Semantic page structure and content recognition will allow more in-depth analysis of Web pages.

Mining Web dynamics
Contents, structures, and access patterns change on the Web. Storing historical data about Web pages helps detect changes in content and links. However, because of the phenomenal breadth of the Web, it is impossible to store all page images and updates. Mining Web log records can provide quality results. This data needs to be analyzed and transformed into useful, significant information.
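As a small illustration of turning raw Web log records into usable information, the sketch below counts successful page requests per path. It assumes the server writes the Common Log Format; the sample lines, the `LOG_LINE` pattern, and the `page_hits` helper are illustrative and not part of the presentation.

```python
import re
from collections import Counter

# Pattern for one Common Log Format line (an assumption about the log layout):
# host ident user [timestamp] "METHOD path protocol" status size
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)$'
)

# Hypothetical sample records.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2002:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.2 - - [10/Oct/2002:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [10/Oct/2002:13:57:12 -0700] "GET /missing.html HTTP/1.0" 404 209',
    'not a log line',
]


def page_hits(lines):
    """Count successful GET requests per path, skipping malformed lines."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        _host, _timestamp, method, path, status, _size = match.groups()
        if method == "GET" and status == "200":
            hits[path] += 1
    return hits


hits_by_page = page_hits(SAMPLE_LOG)
print(hits_by_page.most_common(1))
```

Counts like these are the starting point for access-pattern analysis; tracking how they shift over time is one way to observe the changes in usage the slide refers to.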

Building a multilayered, multidimensional Web
Systematically analyze a set of Web pages. Group closely related local Web pages, or an individual page, into a cluster called a semantic page. The analysis provides a descriptor for the cluster. Then create a semantics-based, evolving, multidimensional, multilayered Web information directory.
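The grouping and descriptor steps can be sketched as follows. This is a simplified stand-in, not the method from the source: it assumes pages are compared by Jaccard similarity of their word sets, uses a greedy single pass, and takes the most widely shared words as the descriptor. The sample pages, the threshold, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical local pages and their (already cleaned) text.
SITE_PAGES = {
    "site.example/courses/db": "database course syllabus sql",
    "site.example/courses/dm": "data mining course syllabus",
    "site.example/news/1": "campus news football",
}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(pages, threshold=0.3):
    """Greedy pass: a page joins the first cluster whose first page is similar enough."""
    clusters = []
    for url, text in pages.items():
        terms = set(text.split())
        for cluster in clusters:
            seed_terms = set(pages[cluster[0]].split())
            if jaccard(terms, seed_terms) >= threshold:
                cluster.append(url)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([url])
    return clusters


def descriptor(cluster, pages, k=2):
    """Describe a cluster by the k words that appear on the most of its pages."""
    counts = Counter(w for url in cluster for w in set(pages[url].split()))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


clusters = cluster_pages(SITE_PAGES)
labels = [descriptor(c, SITE_PAGES) for c in clusters]
```

Here the two course pages fall into one cluster described by the words they share, while the news page stays in a cluster of its own; a directory layer could then be built over such descriptors.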

Questions? Comments?

Han, J. & Chang, K. C.-C. "Data mining for Web intelligence." IEEE Computer, Volume 35, Issue 11, Nov pp