A Model for Fast Web Mining Prototyping
Nivio Ziviani UFMG – Brazil Álvaro Pereira Ricardo Baeza-Yates Jesus Bisbal UPF – Spain

1 A Model for Fast Web Mining Prototyping
Nivio Ziviani UFMG – Brazil Álvaro Pereira Ricardo Baeza-Yates Jesus Bisbal UPF – Spain

2 Motivation
2nd ACM International Conference on Web Search and Data Mining – WSDM'09
Our focus:
– Web mining as the process of discovering useful information in Web data by means of data mining techniques
Web mining
– Computation-intensive task
– Iterative process
Prototyping plays an important role
– Experimenting with different alternatives
– Incorporating the knowledge from previous iterations
Mining software is developed ad hoc
– Time-consuming tasks
– Not scalable
– Not reusable

3 Main Objective: Design and Development of WIM
WIM – Web Information Mining model
WIM goal: facilitate fast Web mining prototyping
Main research challenges:
– Data model
– Algebra
– Software prototype
  Architecture and implementation issues

4 Web Mining Problems WIM Has Been Applied To So Far
Study of genealogical trees on the Web (WWW'08)
– A study on how the Web textual content evolves
A usage PageRank for ranking improvement
– A logical graph is created based on usage data
Linkage evolution for new pages
– Hypothesis: duplicates tend to have no evolution of links (inlinks)
A user intent study
– Identifying queries that cannot be classified as either navigational or informational
Creation of a reference dataset for learning to rank

5 Outline
Related work
WIM data model
WIM algebra
Software architecture
Conclusions and future work

6 Related Work

7 First Research Line: Data Mining Tools
Business-driven solutions
Not specially designed for Web data
SQL extensions
Examples:
– Microsoft SQL Server
– Oracle Data Mining
– IBM DB2 Intelligent Miner
– BI tools: Angoss, Infor CRM Epiphany, Portrait Software, SAS
– Weka

8 Second Research Line: Query Languages for Web Data
Not for mining
Web data manipulation
– Acquisition, storage, management
Examples:
– TSIMMIS, W3QL, WebLog, WebSQL, ARANEUS, StruQL, WebOQL, Whoweda, WEBMINER, WUM, Squeal, WebBase, WEBVIEW

9 Data Model

10 Data Model – Design Goals
Feasibility
Simplicity
Extensibility
Data representativity
Uniformity among operators
Applicability to other scenarios

11 Relation Type
Node relations represent nodes of a graph, such as:
– Documents of a Web dataset
– Terms of a document
– Queries of a query log
– Sessions of a query log
Link relations represent edges of a graph, such as:
– Links between Web documents
– Word distance among terms of a document
– Similarity among queries
– Clicks of a query log
– Association between queries and sessions
Usage data can be represented as either node or link relations

12 Node Relation

13 Link Relation
Main difference: link relations must represent start and end nodes of a graph

14 Compatibility
A link relation is compatible with a node relation if the nodes of the graph (link relation) are foreign keys in the node relation
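
The node relation, link relation and compatibility definitions can be illustrated with a minimal sketch. The class names, field layout and the `is_compatible` helper below are hypothetical and are not taken from the WIM prototype; the sketch assumes that "compatible" means every start and end node of the link relation is a key of the node relation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class NodeRelation:
    """Hypothetical node relation: one tuple per graph node, keyed by node id."""
    name: str
    tuples: Dict[int, dict] = field(default_factory=dict)  # node id -> attributes


@dataclass
class LinkRelation:
    """Hypothetical link relation: one tuple per edge, with start and end nodes."""
    name: str
    tuples: List[Tuple[int, int, dict]] = field(default_factory=list)  # (start, end, attributes)


def is_compatible(links: LinkRelation, nodes: NodeRelation) -> bool:
    """Assumed reading of compatibility: every start and end node of the
    link relation must be a key of the node relation."""
    return all(s in nodes.tuples and e in nodes.tuples for s, e, _ in links.tuples)


docs = NodeRelation("documents", {1: {"url": "a.example"}, 2: {"url": "b.example"}})
hrefs = LinkRelation("links", [(1, 2, {"anchor": "next"})])
dangling = LinkRelation("dangling", [(1, 3, {})])  # node 3 is not in docs

print(is_compatible(hrefs, docs))     # True
print(is_compatible(dangling, docs))  # False
```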

15 Operation
The act of applying an operator to a relation
An operator is a function defined by the WIM algebra
– Unary or binary

16 WIM Program
Sequence of operations applied to relations
– Result of users' interaction through the WIM language
The WIM language:
– Is built upon the WIM algebra
– Is declarative
– Is a dataflow programming language
  Facilitates parallelism
  Allows graphical implementation
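
The idea of a program as a sequence of operations, each consuming the relation produced by the previous one, might be sketched as follows. This is not WIM language syntax, which the slides do not show; `select`, `calculate` and `run_program` are hypothetical stand-ins, and only unary operators are modelled.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]                        # one tuple of a relation
Relation = List[Row]                        # a relation as a list of tuples
Operation = Callable[[Relation], Relation]  # a unary operator bound to its options


def select(predicate: Callable[[Row], bool]) -> Operation:
    """Hypothetical Select: keep the tuples that satisfy the predicate."""
    return lambda rel: [t for t in rel if predicate(t)]


def calculate(new_attr: str, fn: Callable[[Row], Any]) -> Operation:
    """Hypothetical Calculate: add an attribute computed from each tuple."""
    return lambda rel: [{**t, new_attr: fn(t)} for t in rel]


def run_program(relation: Relation, program: List[Operation]) -> Relation:
    """Apply each operation in order; each output feeds the next operation."""
    for operation in program:
        relation = operation(relation)
    return relation


docs = [
    {"id": 1, "inlinks": 0, "outlinks": 4},
    {"id": 2, "inlinks": 3, "outlinks": 1},
    {"id": 3, "inlinks": 5, "outlinks": 5},
]
program = [
    select(lambda t: t["inlinks"] > 0),
    calculate("degree", lambda t: t["inlinks"] + t["outlinks"]),
]
result = run_program(docs, program)
print(result)
```

Because each step depends only on the relation it receives, independent branches of such a dataflow could in principle run in parallel, which is the property the slide attributes to the WIM language.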

17 WIM Program Example – Genealogical Tree Study

18 WIM Program Example – Genealogical Tree Study

28 WIM Algebra

29 Two Classes of Operators
Seven data manipulation operators
– Select, Calculate, CalcGraph, Aggregate, Set, Join, Materialize
Eight data mining operators
– Search, Compare, CompGraph, Cluster, Disconnect, Associate, Analyze, Relink

30 Select
Select tuples from the input

31 Calculate
For mathematical and statistical calculations

32 CalcGraph
For calculations between nodes of the graph

33 Aggregate
Group tuples with the same value for one or two attributes
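
Grouping on one or two attributes can be sketched as below. The `aggregate` function and the click-log data are hypothetical; the slide does not say what the real operator computes per group, so the sketch simply collects the grouped tuples and counts them.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def aggregate(relation: List[Row], keys: List[str]) -> Dict[Tuple[Any, ...], List[Row]]:
    """Group tuples that share the same values for the given attributes."""
    groups: Dict[Tuple[Any, ...], List[Row]] = defaultdict(list)
    for t in relation:
        groups[tuple(t[k] for k in keys)].append(t)
    return dict(groups)


clicks = [
    {"query": "wim", "url": "a.example"},
    {"query": "wim", "url": "a.example"},
    {"query": "wim", "url": "b.example"},
    {"query": "web mining", "url": "a.example"},
]

by_query = aggregate(clicks, ["query"])        # grouping on one attribute
by_pair = aggregate(clicks, ["query", "url"])  # grouping on two attributes
counts = {key: len(group) for key, group in by_pair.items()}
print(counts)
```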

34 Set
For union, intersection and difference of tuples in two relations

35 Join
Add an external attribute into a given relation

36 Search
Used for querying (TF-IDF, BM25, AND, OR)
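
As an illustration of the TF-IDF option, here is a minimal ranking sketch. The weighting used, raw term frequency times log(N / df), is one common textbook variant and is an assumption; the slides do not state which formula WIM uses. `tfidf_search` and the sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_search(docs: Dict[int, str], query: str) -> List[Tuple[int, float]]:
    """Rank documents by the sum of tf * log(N / df) over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = []
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        score = sum(
            tf[q] * math.log(n_docs / df[q])
            for q in query.lower().split()
            if df[q]  # skip query terms that occur in no document
        )
        if score > 0:
            scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = {
    1: "web mining model",
    2: "web search engine",
    3: "data mining mining tools",
}
ranking = tfidf_search(docs, "mining tools")
print(ranking)
```

Document 3 ranks first here because it contains both query terms and repeats "mining"; document 2 matches neither term and is left out.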

37 Compare
Compare elements of a textual attribute

38 Disconnect
Identify clusters in a graph
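
The slide does not say how clusters are identified. One plausible reading, assumed here, is that a cluster is a connected component of the graph when edge direction is ignored. The function name and sample links are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def connected_components(edges: List[Tuple[int, int]]) -> List[Set[int]]:
    """Return the connected components of the graph, ignoring edge direction."""
    adjacency: Dict[int, Set[int]] = defaultdict(set)
    for start, end in edges:
        adjacency[start].add(end)
        adjacency[end].add(start)

    seen: Set[int] = set()
    components: List[Set[int]] = []
    for root in sorted(adjacency):
        if root in seen:
            continue
        # Depth-first traversal from an unvisited node collects one component.
        component: Set[int] = set()
        stack = [root]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


links = [(1, 2), (2, 3), (4, 5)]
clusters = connected_components(links)
print(clusters)
```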

39 Analyze
For link analysis (PageRank, Authority, Indegree)
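
Two of the named link-analysis measures can be sketched on a small link relation. The damping factor of 0.85, the fixed iteration count and the uniform redistribution of rank from nodes without outlinks are conventional choices assumed for illustration, not details from the slides; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (start node, end node)


def indegree(edges: List[Edge], nodes: List[int]) -> Dict[int, int]:
    """Count the incoming links of every node."""
    counts = {node: 0 for node in nodes}
    for _, end in edges:
        counts[end] += 1
    return counts


def pagerank(edges: List[Edge], nodes: List[int],
             damping: float = 0.85, iterations: int = 100) -> Dict[int, float]:
    """Power-iteration PageRank; rank of nodes without outlinks is spread uniformly."""
    n = len(nodes)
    out: Dict[int, List[int]] = {node: [] for node in nodes}
    for start, end in edges:
        out[start].append(end)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not out[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for target in out[node]:
                new_rank[target] += damping * rank[node] / len(out[node])
        rank = new_rank
    return rank


nodes = [1, 2, 3]
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
degrees = indegree(edges, nodes)
ranks = pagerank(edges, nodes)
print(degrees)
print(ranks)
```

In this toy graph node 3 has the most inlinks and also ends up with the highest PageRank, while node 1 outranks node 2 despite an equal indegree because its single inlink comes from the top-ranked node.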

40 Software Architecture

41 Software Architecture

42 Conclusions and Future Work

43 Conclusions
WIM – a model and software for fast Web mining prototyping
– Data model
– Algebra
– A software prototype
Efficient
– Several tens of millions of tuples
– Running time is higher for the mining operations
  Ad hoc solutions also need the mining step
Scalable
– A future implementation could store the attributes on different servers and run different parts of a program in a distributed fashion

44 Conclusions
Extensible
– New operators, and new options/methods for the current operators, can be added
We have designed and implemented an extension of the Analyze operator
– Calculate PageRank taking into account the label of the graph
Effective for a set of Web mining applications

45 Future Work on WIM
Finish the implementation and make a version of the prototype available
– Users would contribute extensions
– Improve the prototype to become a tool
Design new operators for other mining tasks
Integrate a Web crawler and a data visualization interface
Implement a graphical interface to the WIM language

46 Thank you! alvaro@dcc.ufmg.br

