A Model for Fast Web Mining Prototyping
Nivio Ziviani UFMG – Brazil Álvaro Pereira Ricardo Baeza-Yates Jesus Bisbal UPF – Spain

2nd ACM International Conference on Web Search and Data Mining – WSDM'09

Motivation
Our focus:
  – Web mining as the process of discovering useful information in Web data by means of data mining techniques
Web mining
  – Computation-intensive task
  – Iterative process
Prototyping plays an important role
  – Experimenting with different alternatives
  – Incorporating the knowledge from previous iterations
Mining software is developed ad hoc
  – Time-consuming tasks
  – Not scalable
  – Not reusable

Main Objective: Design and Development of WIM
WIM – Web Information Mining model
WIM goal: facilitate fast Web mining prototyping
Main research challenges:
  – Data model
  – Algebra
  – Software prototype
Architecture and implementation issues

Web Mining Problems WIM Has Been Applied to So Far
Study of genealogical trees on the Web (WWW'08)
  – A study on how the Web textual content evolves
A usage PageRank for ranking improvement
  – A logical graph is created based on usage data
Linkage evolution for new pages
  – Hypothesis: duplicates tend to have no evolution of links (inlinks)
A user intent study
  – Identifying queries that cannot be classified as either navigational or informational
Creation of a reference dataset for learning to rank

Outline
Related work
WIM data model
WIM algebra
Software architecture
Conclusions and future work

Related Work
First Research Line: Data Mining Tools
Business-driven solutions
Not specially designed for Web data
SQL extensions
Examples:
  – Microsoft SQL Server
  – Oracle Data Mining
  – IBM DB2 Intelligent Miner
  – BI tools: Angoss, Infor CRM Epiphany, Portrait Software, SAS
  – Weka

Second Research Line: Query Languages for Web Data
Not for mining
Web data manipulation
  – Acquisition, storage, management
Examples:
  – TSIMMIS, W3QL, WebLog, WebSQL, ARANEUS, StruQL, WebOQL, Whoweda, WEBMINER, WUM, Squeal, WebBase, WEBVIEW

Data Model
Data Model – Design Goals
Feasibility
Simplicity
Extensibility
Data representativity
Uniformity among operators
Applicability to other scenarios

Relation Type
Node relations represent nodes of a graph, such as:
  – Documents of a Web dataset
  – Terms of a document
  – Queries of a query log
  – Sessions of a query log
Link relations represent edges of a graph, such as:
  – Links between Web documents
  – Word distance among terms of a document
  – Similarity among queries
  – Clicks of a query log
  – Association between queries and sessions
Usage data can be represented as either node or link relations

Node Relation

Link Relation
Main difference: link relations must represent start and end nodes of a graph

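The two relation types can be pictured with a minimal sketch. The class names `NodeRelation` and `LinkRelation`, the dictionary layout, and the sample attributes are all hypothetical and chosen only for illustration; they are not WIM's actual storage format.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRelation:
    """One tuple per graph node, keyed by a node id (hypothetical layout)."""
    name: str
    tuples: dict = field(default_factory=dict)  # node id -> attribute dict


@dataclass
class LinkRelation:
    """One tuple per graph edge; start and end node ids are mandatory."""
    name: str
    tuples: list = field(default_factory=list)  # (start, end, attribute dict)

    def add(self, start, end, **attrs):
        self.tuples.append((start, end, attrs))


# Documents of a Web dataset as a node relation (illustrative data).
docs = NodeRelation("documents", {1: {"url": "a.example"}, 2: {"url": "b.example"}})

# Links between those documents as a link relation.
links = LinkRelation("hyperlinks")
links.add(1, 2, anchor="see b")
```

The only structural difference in this sketch is the one the slide names: every link tuple carries a start and an end node.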
Compatibility
A link relation is compatible with a node relation if the start and end nodes of the link relation are foreign keys into the node relation

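The compatibility rule can be read as a foreign-key check. The sketch below assumes a link relation is a list of `(start, end)` pairs and a node relation is represented by its collection of node ids; the function name `is_compatible` is hypothetical.

```python
def is_compatible(link_tuples, node_ids):
    """Return True if every start and end node of the link relation
    appears as a key of the node relation."""
    node_ids = set(node_ids)
    return all(start in node_ids and end in node_ids
               for start, end in link_tuples)


# Illustrative data: three documents and two candidate link relations.
documents = [1, 2, 3]
good_links = [(1, 2), (2, 3)]
bad_links = [(1, 4)]  # node 4 is not in the node relation

compatible = is_compatible(good_links, documents)
incompatible = is_compatible(bad_links, documents)
```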
Operation
The act of applying an operator to a relation
An operator is a function defined by the WIM algebra
  – Unary or binary

WIM Program
Sequence of operations applied to relations
  – Result of users' interaction through the WIM language
The WIM language:
  – Is built upon the WIM algebra
  – Is declarative
  – Is a dataflow programming language
Facilitates parallelism
Allows graphical implementation

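A program as a sequence of operations can be sketched as a small dataflow pipeline, where each step consumes the relation produced by the previous one. This is not the WIM language itself: the functions `select`, `calculate`, and `run_program`, and the representation of a relation as a list of dictionaries, are assumptions made for illustration.

```python
def select(relation, predicate):
    """Keep only the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def calculate(relation, attr, fn):
    """Add a derived attribute computed from each tuple."""
    return [{**t, attr: fn(t)} for t in relation]


def run_program(relation, steps):
    """Apply each (operator, kwargs) step to the output of the previous step."""
    for operator, kwargs in steps:
        relation = operator(relation, **kwargs)
    return relation


# Illustrative node relation of documents with a size in bytes.
docs = [{"id": 1, "size": 1200}, {"id": 2, "size": 300}]

program = [
    (select, {"predicate": lambda t: t["size"] > 500}),
    (calculate, {"attr": "kb", "fn": lambda t: t["size"] / 1000}),
]
result = run_program(docs, program)
```

Because each step depends only on its input relation, independent branches of such a program could in principle run in parallel, which is the property the slide attributes to dataflow languages.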
WIM Program Example – Genealogical Tree Study

WIM Algebra
Two Classes of Operators
Seven data manipulation operators
  – Select, Calculate, CalcGraph, Aggregate, Set, Join, Materialize
Eight data mining operators
  – Search, Compare, CompGraph, Cluster, Disconnect, Associate, Analyze, Relink

Select
Select tuples from the input

Calculate
For mathematical and statistical calculations

CalcGraph
For calculations between nodes of the graph

Aggregate
Group tuples with the same value for one or two attributes

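Grouping on one or two attributes can be sketched as follows. The function `aggregate`, its parameters, and the click data are hypothetical; summing is only one possible way to combine the grouped values.

```python
from collections import defaultdict


def aggregate(relation, keys, value_attr, fn=sum):
    """Group tuples sharing the same values for `keys` and combine
    `value_attr` within each group using `fn`."""
    groups = defaultdict(list)
    for t in relation:
        groups[tuple(t[k] for k in keys)].append(t[value_attr])
    return {key: fn(values) for key, values in groups.items()}


# Illustrative click tuples from a query log.
clicks = [
    {"query": "wim", "url": "a", "n": 2},
    {"query": "wim", "url": "a", "n": 3},
    {"query": "wim", "url": "b", "n": 1},
]

by_query = aggregate(clicks, ["query"], "n")             # one attribute
by_query_url = aggregate(clicks, ["query", "url"], "n")  # two attributes
```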
Set
For union, intersection and difference of tuples in two relations

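The three set operations map directly onto Python's built-in sets, assuming each tuple of a relation is hashable. The function name `set_op` and the string-valued `op` parameter are hypothetical.

```python
def set_op(left, right, op):
    """Combine the tuples of two relations by union, intersection or difference."""
    left, right = set(left), set(right)
    if op == "union":
        return left | right
    if op == "intersection":
        return left & right
    if op == "difference":
        return left - right
    raise ValueError(f"unknown set operation: {op}")


# Illustrative link relations from two crawls, as (start, end) pairs.
crawl_1 = [(1, 2), (2, 3)]
crawl_2 = [(2, 3), (3, 4)]

all_links = set_op(crawl_1, crawl_2, "union")
stable_links = set_op(crawl_1, crawl_2, "intersection")
removed_links = set_op(crawl_1, crawl_2, "difference")
```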
Join
Add an external attribute into a given relation

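Adding an external attribute can be sketched as a key lookup. The function `join`, the choice of `None` for unmatched tuples, and the sample relations are assumptions for illustration.

```python
def join(relation, external, key, attr, default=None):
    """Return a copy of `relation` with `attr` copied from `external`,
    matching tuples on the shared `key` attribute."""
    lookup = {t[key]: t[attr] for t in external}
    return [{**t, attr: lookup.get(t[key], default)} for t in relation]


# Illustrative relations: documents, and a rank computed elsewhere.
docs = [{"id": 1, "url": "a.example"}, {"id": 2, "url": "b.example"}]
ranks = [{"id": 1, "rank": 0.7}]

joined = join(docs, ranks, key="id", attr="rank")
```

Tuples without a match keep all their original attributes and receive the default value for the new one.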
Search
Used for querying (TF-IDF, BM25, AND, OR)

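The Boolean modes (AND, OR) can be sketched with a small inverted index. This sketch covers only Boolean matching, not TF-IDF or BM25 ranking; `build_index`, `search`, whitespace tokenization, and the sample documents are assumptions.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, terms, mode="AND"):
    """Return ids of documents matching all terms (AND) or any term (OR)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


# Illustrative textual attribute of a node relation.
docs = {1: "web mining model", 2: "web search", 3: "data mining"}
index = build_index(docs)

both = search(index, ["web", "mining"], mode="AND")
either = search(index, ["web", "mining"], mode="OR")
```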
Compare
Compare elements of a textual attribute

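One way to compare textual elements is a set-based similarity. The sketch below uses Jaccard similarity over word sets purely as a stand-in; the slide does not say which comparison methods Compare offers, and the names `jaccard` and `compare` are hypothetical.

```python
def jaccard(text_a, text_b):
    """Jaccard similarity between the word sets of two texts."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def compare(relation, attr, reference):
    """Attach to each tuple its similarity to a reference text."""
    return [{**t, "similarity": jaccard(t[attr], reference)} for t in relation]


# Illustrative node relation with a textual attribute.
pages = [
    {"id": 1, "text": "web mining model"},
    {"id": 2, "text": "data mining tools"},
]
scored = compare(pages, "text", "web mining")
```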
Disconnect
Identify clusters in a graph

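A simple reading of "clusters in a graph" is connected components of the undirected graph. That interpretation is an assumption, as is the function name `connected_components`; the operator may use other definitions.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of the graph, ignoring edge direction."""
    adjacency = defaultdict(set)
    for start, end in edges:
        adjacency[start].add(end)
        adjacency[end].add(start)

    seen, components = set(), []
    for node in list(adjacency):
        if node in seen:
            continue
        # Breadth-first search collects everything reachable from `node`.
        component, queue = set(), deque([node])
        seen.add(node)
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Illustrative link relation with two separate clusters.
links = [(1, 2), (2, 3), (4, 5)]
clusters = connected_components(links)
```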
Analyze
For link analysis (PageRank, Authority, Indegree)

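Two of the listed measures can be sketched over a link relation given as `(start, end)` pairs. This is a textbook power-iteration PageRank with uniform handling of dangling nodes, not necessarily the variant used in the prototype; the damping factor, iteration count, and function names are assumptions.

```python
def indegree(edges):
    """Count incoming links for every node that appears in the link relation."""
    counts = {}
    for start, end in edges:
        counts.setdefault(start, 0)
        counts[end] = counts.get(end, 0) + 1
    return counts


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; rank of dangling nodes is spread uniformly."""
    nodes = sorted({n for edge in edges for n in edge})
    outlinks = {n: [] for n in nodes}
    for start, end in edges:
        outlinks[start].append(end)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not outlinks[n])
        new_rank = {n: (1 - damping + damping * dangling) / len(nodes)
                    for n in nodes}
        for start in nodes:
            for end in outlinks[start]:
                new_rank[end] += damping * rank[start] / len(outlinks[start])
        rank = new_rank
    return rank


# Illustrative link relation: 1 -> 2, 3 -> 2, 2 -> 1.
links = [(1, 2), (3, 2), (2, 1)]
in_counts = indegree(links)
ranks = pagerank(links)
```

In this small graph, node 2 has the most inlinks and ends up with the highest rank, while node 3, which nothing links to, keeps only the baseline share.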
Software Architecture
Conclusions and Future Work
Conclusions
WIM – a model and software for fast Web mining prototyping
  – Data model
  – Algebra
  – A software prototype
Efficient
  – Several tens of millions of tuples
  – Running time is higher for the mining operations
Ad hoc solutions also need the mining step
Scalable
  – A future implementation could store the attributes on different servers and run different parts of a program in a distributed fashion

Conclusions (continued)
Extensible
  – New operators, and new options/methods for the current operators, can be added
We have designed and implemented an extension of the Analyze operator
  – Calculates PageRank taking into account the label of the graph
Effective for a set of Web mining applications

Future Work on WIM
Finish the implementation and make a version of the prototype available
  – Users would contribute extensions
  – Improve the prototype so that it becomes a tool
Design new operators for other mining tasks
Integrate a Web crawler and a data visualization interface
Implement a graphical interface to the WIM language

Thank you!