A Model for Fast Web Mining Prototyping Nivio Ziviani UFMG – Brazil Álvaro Pereira Ricardo Baeza-Yates Jesus Bisbal UPF – Spain


2nd ACM International Conference on Web Search and Data Mining – WSDM'09

Motivation
Our focus:
– Web mining as the process of discovering useful information in Web data by means of data mining techniques
Web mining
– Computation-intensive task
– Iterative process
Prototyping plays an important role
– Experimenting with different alternatives
– Incorporating the knowledge from previous iterations
Mining software is developed ad hoc
– Time-consuming tasks
– Not scalable
– Not reusable

Main Objective: Design and Development of WIM
WIM – Web Information Mining model
WIM goal: facilitate fast Web mining prototyping
Main research challenges:
– Data model
– Algebra
– Software prototype: architecture and implementation issues

Web Mining Problems WIM Has Been Applied To So Far
Study of genealogical trees on the Web (WWW'08)
– A study on how the Web textual content evolves
A usage PageRank for ranking improvement
– A logical graph is created based on usage data
Linkage evolution for new pages
– Hypothesis: duplicates tend to have no evolution of links (inlinks)
A user intent study
– Identifying queries that cannot be classified as either navigational or informational
Creation of a reference dataset for learning to rank

Outline
– Related work
– WIM data model
– WIM algebra
– Software architecture
– Conclusions and future work

Related Work

First Research Line: Data Mining Tools
Business-driven solutions
Not specifically designed for Web data
SQL extensions
Examples:
– Microsoft SQL Server
– Oracle Data Mining
– IBM DB2 Intelligent Miner
– BI tools: Angoss, Infor CRM Epiphany, Portrait Software, SAS
– Weka

Second Research Line: Query Languages for Web Data
Not for mining
Web data manipulation
– Acquisition, storage, management
Examples:
– TSIMMIS, W3QL, WebLog, WebSQL, ARANEUS, StruQL, WebOQL, Whoweda, WEBMINER, WUM, Squeal, WebBase, WEBVIEW

Data Model

Data Model – Design Goals
– Feasibility
– Simplicity
– Extensibility
– Data representativity
– Uniformity among operators
– Applicability to other scenarios

Relation Type
Node relations represent nodes of a graph, such as:
– Documents of a Web dataset
– Terms of a document
– Queries of a query log
– Sessions of a query log
Link relations represent edges of a graph, such as:
– Links between Web documents
– Word distance among terms of a document
– Similarity among queries
– Clicks of a query log
– Association between queries and sessions
Usage data can be represented as either node or link relations

Node Relation

Link Relation
Main difference: link relations must represent the start and end nodes of the edges of a graph
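
One way to picture the two relation types is the minimal sketch below. The slides do not show WIM's internal storage, so the class names `NodeRelation` and `LinkRelation`, their fields, and the sample attributes are hypothetical: a node relation is keyed by a node identifier, and a link relation is keyed by a (start, end) pair.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRelation:
    """Hypothetical node relation: one attribute dict per node id."""
    name: str
    tuples: dict = field(default_factory=dict)  # node_id -> attributes


@dataclass
class LinkRelation:
    """Hypothetical link relation: one attribute dict per (start, end) edge."""
    name: str
    tuples: dict = field(default_factory=dict)  # (start_id, end_id) -> attributes


# Documents of a Web dataset as nodes, hyperlinks between them as edges.
documents = NodeRelation("documents", {1: {"url": "a.example"}, 2: {"url": "b.example"}})
hyperlinks = LinkRelation("hyperlinks", {(1, 2): {"anchor": "next page"}})

# Every link tuple carries its start and end node explicitly.
for (start, end), attrs in hyperlinks.tuples.items():
    print(start, "->", end, attrs)
```

The same two shapes would also cover the other examples on the slides, for instance queries as nodes and query similarity as links.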

Compatibility
A link relation is compatible with a node relation if the nodes of the graph (the link relation) are foreign keys referencing the node relation
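
The compatibility condition can be read as a foreign-key check. The sketch below is an illustration under that reading, not WIM's implementation; the function name `is_compatible` and the sample data are assumptions.

```python
def is_compatible(link_tuples, node_ids):
    """Return True if every start and end node of the links exists in the node relation.

    link_tuples: iterable of (start_id, end_id) pairs (hypothetical link relation)
    node_ids:    iterable of node identifiers (hypothetical node relation keys)
    """
    known = set(node_ids)
    return all(start in known and end in known for start, end in link_tuples)


documents = {1, 2, 3}
links_ok = [(1, 2), (2, 3)]
links_bad = [(1, 4)]  # node 4 is not in the node relation

print(is_compatible(links_ok, documents))
print(is_compatible(links_bad, documents))
```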

Operation
The act of applying an operator to a relation
An operator is a function defined by the WIM algebra
– Unary or binary

WIM Program
Sequence of operations applied to relations
– Result of users' interaction through the WIM language
The WIM language:
– Is built upon the WIM algebra
– Is declarative
– Is a dataflow programming language
  Facilitates parallelism
  Allows graphical implementation
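
The idea of a program as a sequence of operations flowing over relations can be sketched as follows. This is not the WIM language, whose syntax the slides do not show; `select`, `calculate`, and `run_program` are hypothetical helpers, and a relation is assumed to be a plain list of attribute dicts.

```python
from functools import reduce


def select(predicate):
    """Hypothetical unary operator: keep the tuples that satisfy the predicate."""
    return lambda relation: [t for t in relation if predicate(t)]


def calculate(attr, fn):
    """Hypothetical unary operator: add a computed attribute to every tuple."""
    return lambda relation: [{**t, attr: fn(t)} for t in relation]


def run_program(relation, operations):
    """Apply the operations in order; each output relation feeds the next operation."""
    return reduce(lambda rel, op: op(rel), operations, relation)


docs = [
    {"id": 1, "size": 120},
    {"id": 2, "size": 4000},
    {"id": 3, "size": 900},
]
program = [
    select(lambda t: t["size"] >= 500),
    calculate("size_kb", lambda t: t["size"] / 1000),
]
result = run_program(docs, program)
print(result)
```

Because each step only depends on the relation produced by the previous one, independent branches of such a pipeline could in principle run in parallel, which is the property the slide attributes to dataflow languages.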

WIM Program Example – Genealogical Tree Study


WIM Algebra

Two Classes of Operators
Seven data manipulation operators
– Select, Calculate, CalcGraph, Aggregate, Set, Join, Materialize
Eight data mining operators
– Search, Compare, CompGraph, Cluster, Disconnect, Associate, Analyze, Relink

Select
Select tuples from the input

Calculate
For mathematical and statistical calculations

CalcGraph
For calculations between nodes of the graph

Aggregate
Group tuples with the same value for one or two attributes
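
A rough illustration of grouping tuples that share the same value for one or two attributes is given below. The function `aggregate`, its parameters, and the `count` and `total` output attributes are assumptions made for this sketch; the slides do not specify which aggregation functions the operator offers.

```python
from collections import defaultdict


def aggregate(relation, key_attrs, value_attr):
    """Group tuples by the values of key_attrs; report group size and the sum of value_attr."""
    groups = defaultdict(list)
    for t in relation:
        groups[tuple(t[a] for a in key_attrs)].append(t[value_attr])
    return [
        dict(zip(key_attrs, key), count=len(values), total=sum(values))
        for key, values in groups.items()
    ]


clicks = [
    {"query": "wim", "clicks": 2},
    {"query": "web", "clicks": 1},
    {"query": "wim", "clicks": 3},
]
per_query = aggregate(clicks, ["query"], "clicks")
print(per_query)
```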

Set
For union, intersection and difference of tuples in two relations

Join
Add an external attribute into a given relation
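
Adding an external attribute to a relation can be sketched as a key lookup. The function `join`, the choice of `None` for missing matches, and the sample relations are assumptions for illustration only.

```python
def join(relation, external, key, attr):
    """Copy attribute `attr` from `external` into `relation`, matching tuples on `key`.

    Tuples without a match receive None (an assumption; the slides do not say
    how unmatched tuples are handled).
    """
    lookup = {t[key]: t[attr] for t in external}
    return [{**t, attr: lookup.get(t[key])} for t in relation]


docs = [{"id": 1, "url": "a.example"}, {"id": 2, "url": "b.example"}]
ranks = [{"id": 1, "pagerank": 0.7}]
joined = join(docs, ranks, "id", "pagerank")
print(joined)
```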

Search
Used for querying (TF-IDF, BM25, AND, OR)
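
As an illustration of TF-IDF querying, the sketch below scores documents with one common variant (raw term frequency times log(N/df)). Which weighting WIM actually uses is not stated on the slides, so the formula, the whitespace tokenizer, and the function name `tf_idf_search` are all assumptions.

```python
import math
from collections import Counter


def tf_idf_search(docs, query):
    """Rank documents for a query; docs maps doc_id -> text.

    Returns (doc_id, score) pairs sorted by descending score,
    omitting documents whose score is zero.
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query.lower().split()
            if df[term]
        )
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    1: "web mining model",
    2: "web search engine",
    3: "data mining and web mining",
}
results = tf_idf_search(docs, "mining")
print(results)
```

Note that a term occurring in every document (here "web") gets weight log(1) = 0 under this variant, so a query for it alone returns nothing.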

Compare
Compare elements of a textual attribute

Disconnect
Identify clusters in a graph
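
The slide does not say how clusters are defined. One plausible reading, assumed here, is that a cluster is a connected component of the graph when edges are treated as undirected. The sketch below finds such components with a depth-first traversal; `connected_components` is a hypothetical name, not a WIM function.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of an undirected view of the edge list."""
    adjacency = defaultdict(set)
    for start, end in edges:
        adjacency[start].add(end)
        adjacency[end].add(start)

    seen = set()
    components = []
    for root in adjacency:
        if root in seen:
            continue
        stack, component = [root], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


edges = [(1, 2), (2, 3), (4, 5)]
clusters = connected_components(edges)
print(clusters)
```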

Analyze
For link analysis (PageRank, Authority, Indegree)
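
Two of the listed link-analysis measures can be sketched over a plain edge list. The code below is a textbook-style illustration, not WIM's implementation: `indegree` counts incoming links, and `pagerank` runs a fixed number of power iterations with an assumed damping factor of 0.85, spreading the rank of nodes without outlinks evenly over all nodes.

```python
def indegree(edges):
    """Count incoming links per node; nodes with no inlinks get 0."""
    counts = {}
    for start, end in edges:
        counts.setdefault(start, 0)
        counts[end] = counts.get(end, 0) + 1
    return counts


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list (assumed parameters)."""
    nodes = sorted({node for edge in edges for node in edge})
    outlinks = {node: [] for node in nodes}
    for start, end in edges:
        outlinks[start].append(end)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outlinks is redistributed uniformly.
        dangling = sum(rank[node] for node in nodes if not outlinks[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for start in nodes:
            for end in outlinks[start]:
                new_rank[end] += damping * rank[start] / len(outlinks[start])
        rank = new_rank
    return rank


edges = [(1, 3), (2, 3), (3, 1)]
in_counts = indegree(edges)
ranks = pagerank(edges)
print(in_counts)
print(max(ranks, key=ranks.get))
```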

Software Architecture


Conclusions and Future Work

Conclusions
WIM – a model and software for fast Web mining prototyping
– Data model
– Algebra
– A software prototype
Efficient
– Several tens of millions of tuples
– Running time is higher for the mining operations
  Ad-hoc solutions also need the mining step
Scalable
– A future implementation could store the attributes on different servers and run different parts of a program in a distributed fashion

Conclusions
Extensible
– New operators, and new options/methods for the current operators, can be added
  We have designed and implemented an extension of the operator Analyze
  – Calculates PageRank taking into account the labels of the graph
Effective for a set of Web mining applications

Future Work on WIM
Finish the implementation and make a version of the prototype available
– Users could contribute extensions
– Improve the prototype so that it becomes a tool
Design new operators for other mining tasks
Add a Web crawler and a data visualization interface
Implement a graphical interface to the WIM language

Thank you!