A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.

Slides:



Advertisements
Similar presentations
Data Mining and the Web Susan Dumais Microsoft Research KDD97 Panel - Aug 17, 1997.
Advertisements

Web Mining Research: A Survey Authors: Raymond Kosala & Hendrik Blockeel Presenter: Ryan Patterson April 23rd 2014 CS332 Data Mining pg 01.
ELPUB 2006 June Bansko Bulgaria1 Automated Building of OAI Compliant Repository from Legacy Collection Kurt Maly Department of Computer.
Web Categorization Crawler – Part I Mohammed Agabaria Adam Shobash Supervisor: Victor Kulikov Winter 2009/10 Final Presentation Sep Web Categorization.
WebMiningResearch ASurvey Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007.
Semantic Web and Web Mining: Networking with Industry and Academia İsmail Hakkı Toroslu IST EVENT 2006.
© Prentice Hall1 DATA MINING TECHNIQUES Introductory and Advanced Topics Eamonn Keogh (some slides adapted from) Margaret Dunham Dr. M.H.Dunham, Data Mining,
FACT: A Learning Based Web Query Processing System Hongjun Lu, Yanlei Diao Hong Kong U. of Science & Technology Songting Chen, Zengping Tian Fudan University.
The Web is perhaps the single largest data source in the world. Due to the heterogeneity and lack of structure, mining and integration are challenging.
WebMiningResearch ASurvey Web Mining Research: A Survey By Raymond Kosala & Hendrik Blockeel, Katholieke Universitat Leuven, July 2000 Presented 4/18/2002.
1 ETT 429 Spring 2007 Microsoft Publisher II. 2 World Wide Web Terminology Internet Web pages Browsers Search Engines.
Web Mining Research: A Survey
WebMiningResearchASurvey Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007 Revised.
A Web-based Collaboratory for Supporting Environmental Science Research Xiaorong Xiang Yingping Huang Greg Madey Department of Computer Science and Engineering.
Supported in part by the National Science Foundation – ISS/Digital Science & Technology Analysis of the Open Source Software development community using.
Social Network Analysis: Tasks and Tools Steven Loscalzo and Lei Yu Department of Computer Science Watson School of Engineering and Applied Science State.
Proxy Cache Leonid Romanovsky Olga Fomenko Winter 2003 Instructor: Konstantin Sinyuk.
DESIGN AND IMPLEMENTATION OF A WEB MINING RESEARCH SUPPORT SYSTEM
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
Introduction to Data Mining Engineering Group in ACL.
1 Web Development Life Cycle  Ensures project consistency and completeness –Planning –Analysis –Design and Development –Testing –Implementation and Maintenance.
Knowledge Science & Engineering Institute, Beijing Normal University, Analyzing Transcripts of Online Asynchronous.
Web Usage Mining Sara Vahid. Agenda Introduction Web Usage Mining Procedure Preprocessing Stage Pattern Discovery Stage Data Mining Approaches Sample.
Analysis and Modeling of the Open Source Software Community Yongqin Gao, Greg Madey Computer Science & Engineering University of Notre Dame Vincent Freeh.
Research paper: Web Mining Research: A survey SIGKDD Explorations, June Volume 2, Issue 1 Author: R. Kosala and H. Blockeel.
Page 1 WEB MINING by NINI P SURESH PROJECT CO-ORDINATOR Kavitha Murugeshan.
Using Social Networks to Harvest Addresses Reporter: Chia-Yi Lin Advisor: Chun-Ying Huang Mail: 9/14/
MINING AND MODELING THE OPEN SOURCE SOFTWARE COMMUNITY
Project MLExAI Machine Learning Experiences in AI Ingrid Russell, University.
Master Thesis Defense Jan Fiedler 04/17/98
Web Categorization Crawler Mohammed Agabaria Adam Shobash Supervisor: Victor Kulikov Winter 2009/10 Design & Architecture Dec
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
Use of Hierarchical Keywords for Easy Data Management on HUBzero HUBbub Conference 2013 September 6 th, 2013 Gaurav Nanda, Jonathan Tan, Peter Auyeung,
Data Mining with Oracle using Classification and Clustering Algorithms Presented by Nhamo Mdzingwa Supervisor: John Ebden.
A Self-Manageable Infrastructure for Supporting Web-based Simulations Yingping Huang Xiaorong Xiang Gregory Madey Computer Science & Engineering University.
Data Mining By Dave Maung.
Topology and Evolution of the Open Source Software Community Advisors: Dr. Vincent W. Freeh Dr. Kevin Bowyer Supported in part by the National Science.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
Agent 2004Scott Christley, Public Goods Theory of Open Source Community Public Goods Theory of the Open Source Development Community using Agent-based.
Chapter 5: Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization DECISION SUPPORT SYSTEMS AND BUSINESS.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
An Investigation of Commercial Data Mining Presented by Emily Davis Supervisor: John Ebden.
Design a full-text search engine for a website based on Lucene
Augmenting Focused Crawling using Search Engine Queries Wang Xuan 10th Nov 2006.
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
By R. O. Nanthini and R. Jayakumar.  tools used on the web to find the required information  Akeredolu officially described the Web as “a wide- area.
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
Collaborative Development Services Learning From the Open Source Agile Development Process Richard Kilmer, InfoEther LLC.
Erik Jonsson School of Engineering and Computer Science The University of Texas at Dallas Cyber Security Research on Engineering Solutions Dr. Bhavani.
Integrated Departmental Information Service IDIS provides integration in three aspects Integrate relational querying and text retrieval Integrate search.
A Research Collaboratory for Open Source Software Research Yongqin Gao, Matt van Antwerp, Scott Christley, Greg Madey Computer Science & Engineering University.
Lake Arrowhead 2005Scott Christley, Understanding Open Source Understanding the Open Source Software Community Presented by Scott Christley Dept. of Computer.
Computer Science Department 1 Studying the Impact of More Complete Server Information on Web Caching Craig E. Wills and Mikhail Mikhailov Worcester Polytechnic.
Chapter 8: Web Analytics, Web Mining, and Social Analytics
Glencoe Introduction to Multimedia Chapter 2 Multimedia Online 1 Internet A huge network that connects computers all over the world. Show Definition.
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
Data mining in web applications
Cloud-Computing Cloud Web-Blog Software Application Download Software.
DATA MINING © Prentice Hall.
Introduction to Data Mining
Knowledge Discovery from Users Web-Page Navigation
RELATIONSHIP MINING IN SOCIAL NETWORKS
Huayi Zhagn and Haiyan Liang
Understanding the Features of a Web Site
Web Mining Department of Computer Science and Engg.
17th APAN Meetings & Joint Techs Workshop
Web Mining Research: A Survey
Information Retrieval and Web Design
Web Application Development Using PHP
Representing Higher-order Dependencies in Networks: Hands-on Tasks
Presentation transcript:

A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University of Notre Dame Notre Dame, IN WSS ’ 03: WI/IAT 2003 Workshop on Applications, Products of Web-based Support Systems October 13, 2003, Halifax This research was partially supported by NSF, CISE/IIS-Digital Society and Technology

OUTLINE INTRODUCTION INTRODUCTION FRAMEWORK OVERVIEW FRAMEWORK OVERVIEW INFORMATION RETRIEVAL INFORMATION RETRIEVAL DATA MINING TECHNIQUES DATA MINING TECHNIQUES CASE CASE CONCLUSIONS & FUTURE WORK CONCLUSIONS & FUTURE WORK

INTRODUCTION World Wide Web World Wide Web Abundant information Abundant information Important resource for research Important resource for research Web Data Features Web Data Features Semi-structured Semi-structured Heterogeneous Heterogeneous Dynamic Dynamic A Research Support System for Web Data Mining A Research Support System for Web Data Mining

FRAMEWORK Web Source Identification Content Selection Information Retrieval Data Mining Discovery

INFORMATION RETRIEVAL Searching Tools Searching Tools Directory Directory Search engine Search engine Web Crawler Web Crawler URL access method URL access method Web page parser Web page parser Table extractor Table extractor Link extractor – absolute links/relative links Link extractor – absolute links/relative links Word extractor Word extractor

DATA MINING FUNCTIONS Association Rules Association Rules Find interesting association or correlation relationship among data items Find interesting association or correlation relationship among data items Classification Classification Predict classes Predict classes Two steps – build model, apply model Two steps – build model, apply model Clustering Clustering Find natural groups of data Find natural groups of data

OPEN SOURCE SOFTWARE Open Source Software (OSS) Open Source Software (OSS) Apache, Perl, Linux Apache, Perl, Linux Developed by part time contributors Developed by part time contributors SourceForge Developer Site SourceForge Developer Site Sponsored by VA Software Sponsored by VA Software Largest OSS development site Largest OSS development site 70,000 projects 70,000 projects 90,000 developers 90,000 developers 700,000 users 700,000 users

DATA COLLECTON Data sources Data sources Statistics, forums Statistics, forums Project statistics Project statistics 9 fields – project ID, lifespan, rank, page views, downloads, bugs, support, patches and CVS 9 fields – project ID, lifespan, rank, page views, downloads, bugs, support, patches and CVS Developer statistics Developer statistics Project ID and developer ID Project ID and developer ID

DATA COLLECTON (Cont.) Web Crawler Web Crawler Perl and CPAN Perl and CPAN LWP – fetch pages LWP – fetch pages HTML parser – parse pages HTML parser – parse pages HTML::TableExtract – extract information HTML::TableExtract – extract information Link extractor – extract links Link extractor – extract links

DATA MINING Association Rules Association Rules “all tracks”, “downloads” and “CVS” are associated “all tracks”, “downloads” and “CVS” are associated Classification Classification Predict “downloads” Predict “downloads” Naïve Bayes – Build Time 30 sec, accuracy 9% Naïve Bayes – Build Time 30 sec, accuracy 9% Adaptive Bayes Network - Build Time 20 min, accuracy 63% Adaptive Bayes Network - Build Time 20 min, accuracy 63% Clustering Clustering K-means: User specified number of clusters K-means: User specified number of clusters O-cluster: Automatically detect the number of clusters O-cluster: Automatically detect the number of clusters

CONCLUSIONS Conclusions Conclusions Build a framework Build a framework Describe procedures Describe procedures Discuss techniques Discuss techniques Provide a case study Provide a case study Future Work Future Work Exploratory study Exploratory study Implement all stages Implement all stages