Published by Briana Jones. Modified over 9 years ago.
1
A Model for Fast Web Mining Prototyping
Nivio Ziviani (UFMG – Brazil)
Álvaro Pereira
Ricardo Baeza-Yates, Jesus Bisbal (UPF – Spain)
2
2nd ACM International Conference on Web Search and Data Mining – WSDM'09
Motivation
Our focus:
– Web mining as the process of discovering useful information in Web data by means of data mining techniques
Web mining
– Computation-intensive task
– Iterative process
Prototyping plays an important role
– Experimenting with different alternatives
– Incorporating the knowledge from previous iterations
Mining software is developed ad hoc
– Time-consuming tasks
– Not scalable
– Not reusable
3
Main Objective: Design and Development of WIM
WIM – Web Information Mining model
WIM goal: facilitate fast Web mining prototyping
Main research challenges:
– Data model
– Algebra
– Software prototype
Architecture and implementation issues
4
Web Mining Problems to Which WIM Has Been Applied So Far
Study of genealogical trees on the Web (WWW'08)
– A study on how the Web textual content evolves
A usage PageRank for ranking improvement
– A logical graph is created based on usage data
Linkage evolution for new pages
– Hypothesis: duplicates tend to have no evolution of links (inlinks)
A user intent study
– Identifying queries that cannot be classified as either navigational or informational
Creation of a reference dataset for learning to rank
5
Outline
Related work
WIM data model
WIM algebra
Software architecture
Conclusions and future work
6
Related Work
7
First Research Line: Data Mining Tools
Business-driven solutions
Not specifically designed for Web data
SQL extensions
Examples:
– Microsoft SQL Server
– Oracle Data Mining
– IBM DB2 Intelligent Miner
– BI tools: Angoss, Infor CRM Epiphany, Portrait Software, SAS
– Weka
8
Second Research Line: Query Languages for Web Data
Not for mining
Web data manipulation
– Acquisition, storage, management
Examples:
– TSIMMIS, W3QL, WebLog, WebSQL, ARANEUS, StruQL, WebOQL, Whoweda, WEBMINER, WUM, Squeal, WebBase, WEBVIEW
9
Data Model
10
Data Model – Design Goals
Feasibility
Simplicity
Extensibility
Data representativity
Uniformity among operators
Applicability to other scenarios
11
Relation Type
Node relations represent nodes of a graph, such as:
– Documents of a Web dataset
– Terms of a document
– Queries of a query log
– Sessions of a query log
Link relations represent edges of a graph, such as:
– Links between Web documents
– Word distance among terms of a document
– Similarity among queries
– Clicks of a query log
– Association between queries and sessions
Usage data can be represented as either node or link relations
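The two relation types can be pictured with a small in-memory sketch. The class names, field names, and sample rows below are hypothetical and chosen only for illustration; the slides do not describe how WIM stores relations.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for WIM relations (not the actual WIM storage format).
@dataclass
class NodeRelation:
    # Maps a node identifier to its attribute values.
    rows: dict = field(default_factory=dict)

@dataclass
class LinkRelation:
    # Each edge is (start node id, end node id, attribute values).
    edges: list = field(default_factory=list)

# Documents of a Web dataset as a node relation.
documents = NodeRelation({
    1: {"url": "a.example/index", "text": "web mining"},
    2: {"url": "b.example/page", "text": "data mining tools"},
})

# Links between Web documents as a link relation.
links = LinkRelation([(1, 2, {"anchor": "tools"})])
```

A query log could be modeled the same way, with queries or sessions as node rows and clicks as edges.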
12
Node Relation
13
Link Relation
Main difference: link relations must represent start and end nodes of a graph
14
Compatibility
A link relation is compatible with a node relation if the nodes of the graph (the link relation) are foreign keys referencing the node relation
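One way to read the compatibility rule is as a referential-integrity check: every start and end node of an edge must exist as a key of the node relation. The function name and the sample data below are assumptions made for this sketch.

```python
def is_compatible(link_edges, node_ids):
    """Return True if every start and end node of the edges is a key in node_ids."""
    return all(start in node_ids and end in node_ids for start, end in link_edges)

node_ids = {1, 2, 3}               # keys of a hypothetical node relation
ok_edges = [(1, 2), (2, 3)]        # every endpoint exists in node_ids
bad_edges = [(1, 2), (2, 9)]       # node 9 is not in the node relation

print(is_compatible(ok_edges, node_ids))
print(is_compatible(bad_edges, node_ids))
```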
15
Operation
The act of applying an operator to a relation
An operator is a function defined by the WIM algebra
– Unary or binary
16
WIM Program
Sequence of operations applied to relations
– Result of users' interaction through the WIM language
The WIM language:
– Is built upon the WIM algebra
– Is declarative
– Is a dataflow programming language
Facilitates parallelism
Allows graphical implementation
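The idea of a program as a sequence of operations, each taking a relation and returning a relation, can be sketched as a simple pipeline. This is not WIM language syntax; `select`, `calculate`, and `run_program` are hypothetical Python helpers that only mimic the dataflow style.

```python
from functools import reduce

def select(predicate):
    # Unary operation: keep the tuples that satisfy the predicate.
    return lambda relation: [row for row in relation if predicate(row)]

def calculate(name, fn):
    # Unary operation: add a computed attribute to every tuple.
    return lambda relation: [{**row, name: fn(row)} for row in relation]

def run_program(relation, operations):
    # Feed the output of each operation into the next one.
    return reduce(lambda rel, op: op(rel), operations, relation)

docs = [
    {"id": 1, "length": 120},
    {"id": 2, "length": 30},
    {"id": 3, "length": 300},
]
program = [
    select(lambda row: row["length"] >= 100),
    calculate("is_long", lambda row: row["length"] >= 200),
]
result = run_program(docs, program)
```

Because each step depends only on the relation it receives, independent branches of such a pipeline could in principle run in parallel, which is the property the slide attributes to dataflow languages.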
17
WIM Program Example – Genealogical Tree Study
18
WIM Program Example – Genealogical Tree Study
28
WIM Algebra
29
Two Classes of Operators
Seven data manipulation operators
– Select, Calculate, CalcGraph, Aggregate, Set, Join, Materialize
Eight data mining operators
– Search, Compare, CompGraph, Cluster, Disconnect, Associate, Analyze, Relink
30
Select
Select tuples from the input
31
Calculate
For mathematical and statistical calculations
32
CalcGraph
For calculations between nodes of the graph
33
Aggregate
Group tuples with the same value for one or two attributes
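Grouping tuples that share an attribute value can be sketched as follows. The `aggregate` signature and the click data are hypothetical; the slides do not list the options of the real operator.

```python
from collections import defaultdict

def aggregate(relation, key, value, fn=sum):
    """Group rows by row[key] and reduce each group's row[value] with fn."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key]].append(row[value])
    return {group: fn(values) for group, values in groups.items()}

clicks = [
    {"query": "wim", "clicks": 3},
    {"query": "ufmg", "clicks": 1},
    {"query": "wim", "clicks": 2},
]
totals = aggregate(clicks, "query", "clicks")           # sum of clicks per query
counts = aggregate(clicks, "query", "clicks", fn=len)   # number of rows per query
```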
34
Set
For union, intersection and difference of tuples in two relations
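The three set operations map directly onto Python's built-in set type when tuples are hashable. The function name `set_operation` and the two sample crawls are assumptions for this sketch.

```python
def set_operation(left, right, kind):
    """Combine two relations (iterables of hashable tuples) as sets."""
    left_set, right_set = set(left), set(right)
    if kind == "union":
        return left_set | right_set
    if kind == "intersection":
        return left_set & right_set
    if kind == "difference":
        return left_set - right_set
    raise ValueError(f"unknown set operation: {kind}")

crawl_a = [(1, "a.example"), (2, "b.example")]
crawl_b = [(2, "b.example"), (3, "c.example")]

in_both = set_operation(crawl_a, crawl_b, "intersection")
only_in_a = set_operation(crawl_a, crawl_b, "difference")
everything = set_operation(crawl_a, crawl_b, "union")
```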
35
Join
Add an external attribute into a given relation
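Adding an external attribute can be sketched as a key lookup against a second relation. The helper `join_attribute`, the choice to fill missing values with `None`, and the sample rows are all assumptions made for illustration.

```python
def join_attribute(relation, external, key, attribute):
    """Copy `attribute` from `external` into `relation`, matching rows on `key`."""
    lookup = {row[key]: row[attribute] for row in external}
    # Rows without a match receive None (a choice made for this sketch).
    return [{**row, attribute: lookup.get(row[key])} for row in relation]

docs = [{"id": 1, "url": "a.example"}, {"id": 2, "url": "b.example"}]
ranks = [{"id": 1, "pagerank": 0.7}]

joined = join_attribute(docs, ranks, "id", "pagerank")
```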
36
Search
Used for querying (TF-IDF, BM25, AND, OR)
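As an illustration of the TF-IDF option, the sketch below ranks documents with one common variant (raw term frequency multiplied by log(N/df)). The exact weighting used by WIM is not given in the slides, and `tfidf_search` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_search(docs, query):
    """Rank document ids by the summed TF-IDF weight of the query terms."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = sum(
            term_freq[term] * math.log(n_docs / doc_freq[term])
            for term in query.lower().split()
            if term in term_freq
        )
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: "web mining model", 2: "data tools", 3: "web search"}
ranked = tfidf_search(docs, "web mining")
```

Document 1 matches both query terms and document 3 only the more common one, so document 1 ranks first; document 2 matches nothing and is omitted.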
37
Compare
Compare elements of a textual attribute
38
Disconnect
Identify clusters in a graph
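The slide does not say how clusters are identified. One plausible reading, assumed here, is finding the connected components of the graph with edges treated as undirected. The breadth-first sketch below and its function name are illustrative only.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return the connected components of an undirected graph as sets of nodes."""
    adjacency = defaultdict(set)
    for start, end in edges:
        adjacency[start].add(end)
        adjacency[end].add(start)
    seen, components = set(), []
    for root in list(adjacency):
        if root in seen:
            continue
        # Breadth-first traversal collects every node reachable from root.
        component, queue = set(), deque([root])
        seen.add(root)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

clusters = connected_components([(1, 2), (2, 3), (4, 5)])
```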
39
Analyze
For link analysis (PageRank, Authority, Indegree)
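Two of the listed link-analysis measures can be sketched on a plain edge list: indegree by counting incoming edges, and PageRank by power iteration. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from nodes without outlinks are conventional choices assumed here, not details taken from WIM.

```python
from collections import Counter

def indegree(edges):
    """Count incoming edges per node."""
    return Counter(end for _, end in edges)

def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for start, end in edges:
        out_links[start].append(end)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outlinks is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank

edges = [(1, 3), (2, 3), (3, 1)]
ranks = pagerank(edges)
in_counts = indegree(edges)
```

In this small graph node 3 receives links from both other nodes, so it ends up with the highest score under both measures.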
40
Software Architecture
41
Software Architecture
42
Conclusions and Future Work
43
Conclusions
WIM – a model and software for fast Web mining prototyping
– Data model
– Algebra
– A software prototype
Efficient
– Several tens of millions of tuples
– Running time is higher for the mining operations
Ad hoc solutions also need the mining step
Scalable
– A future implementation could store the attributes on different servers and run different parts of a program in a distributed fashion
44
Conclusions
Extensible
– New operators, and new options/methods for the current operators, can be added
We have designed and implemented an extension of the Analyze operator
– Calculates PageRank taking into account the labels of the graph
Effective for a set of Web mining applications
45
Future Work on WIM
Finish the implementation and make a version of the prototype available
– Users could contribute extensions
– Improve the prototype so that it becomes a tool
Design new operators for other mining tasks
Integrate a Web crawler and a data visualization interface
Implement a graphical interface to the WIM language
46
Thank you! alvaro@dcc.ufmg.br