Presentation is loading. Please wait.

Presentation is loading. Please wait.

Ranking Ida Mele. Introduction The set of software components for the management of large sets of data is made of: MG4J Fastutil the DSI Utilities Sux4J.

Similar presentations


Presentation on theme: "Ranking Ida Mele. Introduction The set of software components for the management of large sets of data is made of: MG4J Fastutil the DSI Utilities Sux4J."— Presentation transcript:

1 Ranking Ida Mele

2 Introduction The set of software components for the management of large sets of data is made of: MG4J Fastutil the DSI Utilities Sux4J WebGraph the LAW software These software components have been developed by the DSI of the University of Milan Ida MeleRanking1

3 Fastutil Fastutil 6 is a free software, developed in Java Technical requirement: Java >= 6 Useful links: http://fastutil.di.unimi.it/ http://fastutil.di.unimi.it/docs/ Ida MeleRanking2

4 Fastutil Fastutil extends Java Collections, and it provides: Type-specific maps, sets, and lists Priority queues with a small memory footprint and fast access and insertion 64-bit arrays, sets, and lists Fast I/O classes for text and binary files Ida MeleRanking3

5 Fastutil: Advantages Advantages in using Fastutil: Classes of Fastutil are implemented in order to work on huge collections of data in an efficient way Fastutil provides a new set of classes to deal with collections whose size exceeds 2 31 Ida MeleRanking4

6 Fastutil: Advantages Advantages in using Fastutil: There are additional features (e.g., bidirectional iterators) that are not available in the standard classes Classes can be plugged into existing code, because they implement their standard counterpart (e.g., Map is used for maps) Ida MeleRanking5

7 Fastutil: Big Arrays BigArrays: This class provides static methods and objects for working with big arrays Big arrays are arrays-of-arrays. For example, a big array of integers has type int[][] Methods handle these arrays-of-arrays as if they are monodimensional arrays with 64-bit indices The length of a big array is bounded by Long.MAX_VALUE rather than Integer.MAX_VALUE Ida MeleRanking6

8 Fastutil: Big Arrays Given a big array a, a[0], a[1], … a[n] are called segments. Each one has length SEGMENT_SIZE (the last segment can have a smaller size) Each index i is associated with a segment and a displacement into the segment Methods segment/displacement compute the segment/displacement associated with a given index Method index receives the segment and the displacement and returns the corresponding index Methods get/set allow to return/set the value of a given element in the big array Ida MeleRanking7

9 Fastutil Big Arrays – example We want to scan the big array a First solution: for( int s = 0; s < a.length; s++ ) { final int[] t = a[ s ]; for( int d = 0; d < t.length; d++ ) { //do something with t[ d ] } Ida MeleRanking8

10 Fastutil Big Arrays – example Second solution: for( int s = a.length; s-- != 0; ) { final int[] t = a[ s ]; for( int d = t.length; d-- != 0; ) { //do something with t[ d ] } Ida MeleRanking9

11 Fastutil Big Arrays – example Third solution: for( int s = a.length; s-- != 0; ) { final long[] t = a[ s ]; for( int d = t.length; d-- != 0; ) t[d] = index( s, d ); } We can use the index method, which returns the index associated with a segment and displacement Ida MeleRanking10

12 Fastutil: Big Data Structures Fastutil provides classes also for other data structures: BigList: a list with indices. The instances of this class implement the same semantics of traditional List HashBigSet: the instances of this class use a hash table to represent a big set. The number of elements in the set is limited only by the amount of core memory Ida MeleRanking11

13 Dsiutils The DSI utilities are a mishmash of classes Free software Developed in Java Useful links: http://dsiutils.di.unimi.it/ http://dsiutils.di.unimi.it/docs/ Ida MeleRanking12

14 Dsiutils: MultipleString In large-scale text indexing we want to use a mutable string that, once frozen, can be used in the same optimized way of an immutable string In Java we have String and StringBuffer, which can be used for immutable and mutable strings respectively The solution is MultipleString MultipleString does not need synchronization Ida MeleRanking13

15 Dsiutils: Packages Some important packages: it.unimi.dsi.bits contains main classes for manipulating bits. For example, the class BitVectors provides static methods and objects that do useful things with bit vectors it.unimi.dsi.compression provides word-based compression/decompression classes it.unimi.dsi.util offers implementations of BloomFilters, PrefixMaps, StringMaps, BinaryTries and others Ida MeleRanking14

16 WebGraph WebGraph is a framework for graph compression It exploits modern compression techniques to manage very large graphs Useful links: http://webgraph.di.unimi.it/ http://webgraph.di.unimi.it/docs/ Ida MeleRanking15

17 WebGraph WebGraph provides: ζ-codes, which are suitable for storing web graphs Algorithm for compressing the graph that exploit gap compression as well as ζ-codes. The parameters provide different tradeoffs between access speed and compression ratio Algorithms to access to compressed graphs without decompression. The lazy techniques delay the decompression until it is necessary Ida MeleRanking16

18 WebGraph: Classes Some important classes: ImmutableGraph is an abstract class representing an immutable graph BVGraph allows to store and access web graphs in a compressed form ASCIIGraph is used to store the graph in a human- readable ASCII format Ida MeleRanking17

19 WebGraph: Classes Some important classes: ArcLabelledImmutableGraph is an abstract implementation of a graph with labeled arcs Transform returns the transformed version of an immutable graph. We can use the transpose method of this class if we want to create the transpose graph Ida MeleRanking18

20 LAW Java software developed by the Laboratory for Web Algorithms It is free and contains several implementations of the Pagerank algorithm Useful links: http://law.di.unimi.it/software.php http://law.di.unimi.it/software/docs/index.html Ida MeleRanking19

21 LAW: PageRank PageRank of the package it.unimi.dsi.law.rank is an abstract class that defines methods and attributes for the PageRank algorithm Provided features: we can set the preference vectors we can set the damping factor we can program stopping criteria step-by-step execution reusability Ida MeleRanking20

22 Exercise Download the archive with libraries: lib.zip Download the files: set-classpath.sh example Text2ASCII.class and PrintRanks.class available at: http://www.dis.uniroma1.it/~mele/WebIR.html Set the classpath using the command: source set-classpath.sh Ida MeleRanking21

23 Build the Graph: Step1 Step 1 - Create the file in the format ASCIIGraph using the command: java Text2ASCII example Output: example.graph-txt: the first line contains the number of nodes (e.g., n). The following n lines contain the list of out-neighbours of the nodes. In particular, the line i-th contains the successors of the node i, sorted in an increasing order and separated by a space Ida MeleRanking22

24 output: example.graph-txt Ida MeleRanking23 Build the Graph: Step1 10 1 8 9 4 7 9 1 3 4 5 6 7 8 9 1 4 5 6 9 1 2 1 1 2 3 4 5 5 9 0 1 3 4 6 10 1 8 9 4 7 9 1 3 4 5 6 7 8 9 1 4 5 6 9 1 2 1 1 2 3 4 5 5 9 0 1 3 4 6 0 1 0 8 0 9 1 4 1 7 1 9 2 1 2 3 2 4 2 5 … 0 1 0 8 0 9 1 4 1 7 1 9 2 1 2 3 2 4 2 5 … input: example Text2ASCII

25 more example.graph-txt Ida MeleRanking24 Build the Graph: Step1 10 1 8 9 4 7 9 1 3 4 5 6 7 8 9 1 4 5 6 9 1 2 1 1 2 3 4 5 5 9 0 1 3 4 6 10 1 8 9 4 7 9 1 3 4 5 6 7 8 9 1 4 5 6 9 1 2 1 1 2 3 4 5 5 9 0 1 3 4 6 Num of nodes Lists of successors 01234567890123456789 01234567890123456789 Node id......

26 Build the Graph: Step2 We can use the main method of the BVGraph class to load and compress an ImmutableGraph The compressed graph is described by: basename.graph: the graph file. It contains the successor lists, one for each node. Each list is a sequence of natural number that are coded as sequence of bits in a efficient way basename.offsets: the offset file. It stores the offset for each node of the graph basename.properties: the file with properties and statistics Ida MeleRanking25

27 Build the Graph: Step2 Step 2 - Conversion from the ASCIIGraph to the BVGraph: java it.unimi.dsi.big.webgraph.BVGraph -g ASCIIGraph example exampleBV Output: exampleBV.graph exampleBV.offsets exampleBV.properties Ida MeleRanking26

28 more exampleBV.properties Ida MeleRanking27 … compratio=1,89 bitsforblocks=22 residualarcs=15 version=0 … nodes=10 compressionflags= intervalisedarcs=10 bitspernode=16,8 arcs=34 … compratio=1,89 bitsforblocks=22 residualarcs=15 version=0 … nodes=10 compressionflags= intervalisedarcs=10 bitspernode=16,8 arcs=34 … Build the Graph: Step2

29 Compute PageRank To compute the PageRank we can use the following implementations: PowerMethod PageRankPowerSeries GaussSeidel Jacobi The output is made of 2 files: basename.ranks: binary file with the results of computation basename.properties: text files with general info Ida MeleRanking28

30 Compute PageRank: Step 1 Step 1 - We use the main method of the class PageRankPowerMethod by issuing the following command: java it.unimi.dsi.law.rank.PageRankPowerMethod exampleBV examplePR Output: examplePR.ranks examplePR.properties Ida MeleRanking29

31 more examplePR.properties Ida MeleRanking30 rank.alpha = 0.85 rank.stronglyPreferential = false method.numberOfIterations = 12 method.norm.type = INFTY method.norm.value = 8.396275630317973E-7 graph.nodes = 10 graph.fileName = example rank.alpha = 0.85 rank.stronglyPreferential = false method.numberOfIterations = 12 method.norm.type = INFTY method.norm.value = 8.396275630317973E-7 graph.nodes = 10 graph.fileName = example Compute PageRank: Step 1

32 Compute PageRank: Step 2 Step 2 – Print the scores The file.ranks is a binary file with the scores of the nodes, so we can print PageRank scores by using the class PrintRanks: java PrintRanks examplePR.ranks > ranks Output: ranks: file with n lines, one for each node. The i-th line contains the score of node number i Ida MeleRanking31

33 Compute PageRank: Step 2 more ranks Ida MeleRanking32 0.0515659940361598 0.20197850631669495 0.07982657817906964 0.07587785830476211 0.14600457683651308 0.08608501191896127 0.07294688611466064 0.0931194920828582 0.05050241152172527 0.14209268468859523 0.0515659940361598 0.20197850631669495 0.07982657817906964 0.07587785830476211 0.14600457683651308 0.08608501191896127 0.07294688611466064 0.0931194920828582 0.05050241152172527 0.14209268468859523 PageRank values 01234567890123456789 01234567890123456789 Node id......

34 Compute PageRank: sorting sort –r ranks Ida MeleRanking33 0.20197850631669495 0.14600457683651308 0.14209268468859523 0.0931194920828582 0.08608501191896127 0.07982657817906964 0.07587785830476211 0.07294688611466064 0.0515659940361598 0.05050241152172527 0.20197850631669495 0.14600457683651308 0.14209268468859523 0.0931194920828582 0.08608501191896127 0.07982657817906964 0.07587785830476211 0.07294688611466064 0.0515659940361598 0.05050241152172527 PageRank values sorted in decreasing order

35 Homework 1.Repeat the exercise with the graphs: WikiIT WikiPT available at: http://www.dis.uniroma1.it/~mele/WebIR.html 2.Create a new graph by using synthetic or real data, and repeat the exercise Ida MeleRanking34


Download ppt "Ranking Ida Mele. Introduction The set of software components for the management of large sets of data is made of: MG4J Fastutil the DSI Utilities Sux4J."

Similar presentations


Ads by Google