Published by Margery Robinson. Modified over 9 years ago.
1
A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING
Jin Xu, Yingping Huang, Gregory Madey
Department of Computer Science and Engineering
University of Notre Dame, Notre Dame, IN 46556
WSS ’03: WI/IAT 2003 Workshop on Applications, Products of Web-based Support Systems
October 13, 2003, Halifax
This research was partially supported by NSF, CISE/IIS-Digital Society and Technology.
2
OUTLINE
  - INTRODUCTION
  - FRAMEWORK OVERVIEW
  - INFORMATION RETRIEVAL
  - DATA MINING TECHNIQUES
  - CASE
  - CONCLUSIONS & FUTURE WORK
3
INTRODUCTION
  - World Wide Web
    - Abundant information
    - Important resource for research
  - Web Data Features
    - Semi-structured
    - Heterogeneous
    - Dynamic
  - A Research Support System for Web Data Mining
4
FRAMEWORK
  - Web Source Identification
  - Content Selection
  - Information Retrieval
  - Data Mining Discovery
5
INFORMATION RETRIEVAL
  - Searching Tools
    - Directory
    - Search engine
  - Web Crawler
    - URL access method
    - Web page parser
    - Table extractor
    - Link extractor: absolute links and relative links
    - Word extractor
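The link extractor component can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the crawler described in the slides; the class name `LinkExtractor`, the helper `extract_links`, and the sample page and URLs are all hypothetical. It shows how relative links are resolved against the page URL while absolute links pass through unchanged.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin leaves absolute links alone and resolves relative ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html as an absolute URL."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content and base URL, for illustration only.
SAMPLE_HTML = (
    '<a href="/projects/1">one</a> '
    '<a href="http://example.org/x">two</a> '
    '<a href="stats.html">three</a>'
)
SAMPLE_LINKS = extract_links(SAMPLE_HTML, "http://example.com/dir/index.html")
```

Resolving every link to an absolute URL at extraction time lets a crawler compare and queue links without tracking which page each one came from.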
6
DATA MINING FUNCTIONS
  - Association Rules
    - Find interesting associations or correlations among data items
  - Classification
    - Predict classes
    - Two steps: build the model, then apply the model
  - Clustering
    - Find natural groups of data
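How "interesting" an association is tends to be measured with support and confidence. The sketch below assumes those two standard measures, which the slides do not name; the function names and the transactions are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Made-up transactions, for illustration only.
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]
```

For the rule a => b on this toy data, `support(TRANSACTIONS, {"a", "b"})` is 0.5 and `confidence(TRANSACTIONS, {"a"}, {"b"})` is 2/3: half of all transactions contain both items, and two of the three transactions containing a also contain b.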
7
OPEN SOURCE SOFTWARE
  - Open Source Software (OSS)
    - Apache, Perl, Linux
    - Developed by part-time contributors
  - SourceForge Developer Site
    - Sponsored by VA Software
    - Largest OSS development site
    - 70,000 projects
    - 90,000 developers
    - 700,000 users
8
DATA COLLECTION
  - Data sources
    - Statistics, forums
  - Project statistics
    - 9 fields: project ID, lifespan, rank, page views, downloads, bugs, support, patches and CVS
  - Developer statistics
    - Project ID and developer ID
9
DATA COLLECTION (Cont.)
  - Web Crawler
    - Perl and CPAN
    - LWP: fetch pages
    - HTML parser: parse pages
    - HTML::TableExtract: extract information
    - Link extractor: extract links
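The table-extraction step might look roughly like the sketch below. The crawler in the slides uses Perl with HTML::TableExtract; this is only a loose Python standard-library analogue, and the class name `TableExtractor`, the helper `extract_rows`, and the sample table values are hypothetical. It ignores nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table rows in html as lists of cell strings."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.rows


# Made-up statistics table, for illustration only.
SAMPLE_TABLE = (
    "<table><tr><th>Project ID</th><th>Downloads</th></tr>"
    "<tr><td>1</td><td> 250 </td></tr></table>"
)
SAMPLE_ROWS = extract_rows(SAMPLE_TABLE)
```

Each extracted row is a plain list of strings, so the first row can serve as field names and the remaining rows as records for the mining stage.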
10
DATA MINING
  - Association Rules
    - “all tracks”, “downloads” and “CVS” are associated
  - Classification
    - Predict “downloads”
    - Naïve Bayes: build time 30 sec, accuracy 9%
    - Adaptive Bayes Network: build time 20 min, accuracy 63%
  - Clustering
    - K-means: user-specified number of clusters
    - O-cluster: automatically detects the number of clusters
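To make the "user-specified number of clusters" point concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is not the tool used in the case study; the function name `kmeans`, the choice of the first k points as starting centroids, and the sample points are assumptions made for illustration.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Naive deterministic start: the first k points serve as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Made-up 2-D points forming two obvious groups, for illustration only.
SAMPLE_POINTS = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0)]
SAMPLE_CENTROIDS, SAMPLE_LABELS = kmeans(SAMPLE_POINTS, k=2)
```

The caller must pass `k` explicitly, which is the limitation the slide contrasts with O-cluster's automatic detection of the cluster count.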
11
CONCLUSIONS
  - Conclusions
    - Build a framework
    - Describe procedures
    - Discuss techniques
    - Provide a case study
  - Future Work
    - Exploratory study
    - Implement all stages