Ranking Ida Mele

Introduction
The set of software components for the management of large sets of data is made of:
– MG4J,
– Fastutil,
– the DSI Utilities,
– Sux4J,
– WebGraph,
– the LAW software.
These software components have been developed by the DSI of the University of Milan.

Fastutil
Fastutil 6 is free software, developed in Java.
Technical requirement:
– Java >= 6
Useful links:
–
–

Fastutil
Fastutil extends Java Collections, and it provides:
– Type-specific maps, sets, and lists;
– Priority queues with a small memory footprint and fast access and insertion;
– 64-bit arrays, sets, and lists;
– Fast I/O classes for text and binary files.

Fastutil
Advantages of using Fastutil:
– Fastutil classes are implemented to work efficiently on huge collections of data.
– Fastutil provides a new set of classes to deal with collections whose size exceeds

Fastutil
Advantages of using Fastutil:
– There are additional features (e.g. bidirectional iterators) that are not available in the standard classes.
– Classes can be plugged into existing code, because they implement their standard counterparts (e.g. Map for maps).

Fastutil: Big Arrays
BigArrays: a class that provides static methods and objects for working with big arrays.
Big arrays are arrays-of-arrays. For example, a big array of integers has type int[][].
The methods handle these arrays-of-arrays as if they were one-dimensional arrays with 64-bit indices.
The length of a big array is bounded by Long.MAX_VALUE rather than Integer.MAX_VALUE.

Fastutil: Big Arrays
Given a big array a, the arrays a[0], a[1], …, a[n] are called segments. Each segment has length SEGMENT_SIZE (the last segment can be shorter).
Each index i is associated with a segment and a displacement into that segment.
– The methods segment/displacement compute the segment/displacement associated with a given index.
– The method index receives a segment and a displacement and returns the corresponding index.
– The methods get/set return/set the value of a given element of the big array.
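The segment/displacement arithmetic described above can be sketched in plain Java, without Fastutil on the classpath. This is only an illustration: the class BigArraySketch, the helper newBigArray and the tiny SEGMENT_SHIFT value are assumptions made for the demo, not Fastutil's own code or constants.

```java
// Sketch only: re-implements the segment/displacement idea in plain Java.
// SEGMENT_SHIFT is deliberately tiny here (an assumption for the demo).
final class BigArraySketch {
    static final int SEGMENT_SHIFT = 2;                   // 4 elements per segment
    static final int SEGMENT_SIZE = 1 << SEGMENT_SHIFT;
    static final int SEGMENT_MASK = SEGMENT_SIZE - 1;

    // Segment that holds the element with the given 64-bit index.
    static int segment(long index) { return (int) (index >>> SEGMENT_SHIFT); }

    // Position of the element inside its segment.
    static int displacement(long index) { return (int) (index & SEGMENT_MASK); }

    // Inverse operation: rebuilds the 64-bit index.
    static long index(int segment, int displacement) {
        return ((long) segment << SEGMENT_SHIFT) + displacement;
    }

    // Allocates an array-of-arrays; only the last segment may be shorter.
    static long[][] newBigArray(long length) {
        int segments = (int) ((length + SEGMENT_MASK) >>> SEGMENT_SHIFT);
        long[][] a = new long[segments][];
        for (int s = 0; s < segments; s++) {
            long remaining = length - index(s, 0);
            a[s] = new long[(int) Math.min(SEGMENT_SIZE, remaining)];
        }
        return a;
    }

    static long get(long[][] a, long i) { return a[segment(i)][displacement(i)]; }

    static void set(long[][] a, long i, long value) { a[segment(i)][displacement(i)] = value; }

    public static void main(String[] args) {
        long[][] a = newBigArray(10);                     // segments of length 4, 4 and 2
        for (long i = 0; i < 10; i++) set(a, i, i * i);
        System.out.println(get(a, 9));                    // value stored at index 9
        System.out.println(segment(9) + " " + displacement(9));
    }
}
```

With a power-of-two segment size, both lookups reduce to a shift and a mask, which is why accessing an element by its 64-bit index stays cheap.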

Fastutil Big Arrays - example
We want to scan the big array a. First solution:
    for( int s = 0; s < a.length; s++ ) {
        final int[] t = a[ s ];
        for( int d = 0; d < t.length; d++ ) {
            // do something with t[ d ]
        }
    }

Fastutil Big Arrays - example
Second solution (both loops run backwards):
    for( int s = a.length; s-- != 0; ) {
        final int[] t = a[ s ];
        for( int d = t.length; d-- != 0; ) {
            // do something with t[ d ]
        }
    }

Fastutil Big Arrays - example
Third solution (here a is a big array of longs, and each element is set to its own index):
    for( int s = a.length; s-- != 0; ) {
        final long[] t = a[ s ];
        for( int d = t.length; d-- != 0; )
            t[ d ] = index( s, d );
    }
We can use the index method, which returns the index associated with a segment and a displacement.

Fastutil: Big data structures
Fastutil also provides classes for other big data structures:
– BigList: a list with 64-bit indices. Instances of this class implement the same semantics as the traditional List.
– HashBigSet: instances of this class use a hash table to represent a big set. The number of elements in the set is limited only by the amount of core memory.

Dsiutils
The DSI utilities are a mishmash of classes.
Free software. Developed in Java.
Useful links:
–
–

Dsiutils: MutableString
In large-scale text indexing we want a mutable string that, once frozen, can be used in the same optimized way as an immutable string.
In Java we have String and StringBuffer, which can be used for immutable and mutable strings respectively.
The solution is MutableString. MutableString does not need synchronization.

Dsiutils: packages
Some important packages:
– it.unimi.dsi.bits contains the main classes for manipulating bits. Example: the class BitVectors provides static methods and objects that do useful things with bit vectors.
– it.unimi.dsi.compression provides word-based compression/decompression classes.
– it.unimi.dsi.util offers implementations of BloomFilters, PrefixMaps, StringMaps, BinaryTries and others.
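To give an idea of what a Bloom filter (one of the structures listed for it.unimi.dsi.util) does, here is a minimal plain-Java sketch built on java.util.BitSet. The class BloomFilterSketch, its constructor parameters and its hashing scheme are assumptions chosen for illustration; this is not the Dsiutils implementation or its API.

```java
import java.util.BitSet;

// Sketch only: a tiny Bloom filter for strings, not the Dsiutils class.
final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derives the k-th bit position from two base hashes (double hashing).
    private int position(String key, int k) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;                 // forced odd so that it is never zero
        return Math.floorMod(h1 + k * h2, numBits);
    }

    void add(String key) {
        for (int k = 0; k < numHashes; k++) bits.set(position(key, k));
    }

    // false means "certainly absent"; true means "possibly present".
    boolean mightContain(String key) {
        for (int k = 0; k < numHashes; k++)
            if (!bits.get(position(key, k))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1 << 10, 3);
        filter.add("webgraph");
        System.out.println(filter.mightContain("webgraph")); // inserted keys are always reported
        System.out.println(filter.mightContain("sux4j"));    // false positives are possible
    }
}
```

The filter never forgets an inserted key, but it may report a key that was never added; the memory saved is paid for with this small error probability.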

WebGraph
WebGraph is a framework for graph compression. It exploits modern compression techniques to manage very large graphs.
Useful links:
–
–

WebGraph
WebGraph provides:
– ζ-codes, which are suitable for storing web graphs.
– Algorithms for compressing graphs that exploit gap compression as well as ζ-codes. The parameters provide different tradeoffs between access speed and compression ratio.
– Algorithms to access compressed graphs without decompressing them entirely. Lazy techniques delay decompression until it is necessary.
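The idea behind gap compression can be sketched in a few lines of plain Java: a sorted successor list is replaced by the differences between consecutive elements, which are small numbers and therefore cheap to store with a variable-length code. The class GapCodingSketch and its exact gap convention are simplifying assumptions; WebGraph's actual format is more elaborate.

```java
import java.util.Arrays;

// Sketch only: simplified gap coding of a strictly increasing successor list.
final class GapCodingSketch {

    // First element is kept as is; every other element becomes (difference - 1).
    static int[] toGaps(int[] successors) {
        int[] gaps = new int[successors.length];
        for (int i = 0; i < successors.length; i++)
            gaps[i] = i == 0 ? successors[0] : successors[i] - successors[i - 1] - 1;
        return gaps;
    }

    // Inverse transformation: rebuilds the successor list from the gaps.
    static int[] fromGaps(int[] gaps) {
        int[] successors = new int[gaps.length];
        for (int i = 0; i < gaps.length; i++)
            successors[i] = i == 0 ? gaps[0] : successors[i - 1] + gaps[i] + 1;
        return successors;
    }

    public static void main(String[] args) {
        int[] successors = { 13, 15, 16, 17, 18, 19, 23, 24, 203 };
        int[] gaps = toGaps(successors);
        System.out.println(Arrays.toString(gaps));           // [13, 1, 0, 0, 0, 0, 3, 0, 178]
        System.out.println(Arrays.toString(fromGaps(gaps))); // the original list again
    }
}
```

Because pages tend to link to pages with nearby identifiers, most gaps are small, and a code that spends few bits on small integers (such as a ζ-code) then compresses the list well.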

WebGraph: classes
Some important classes:
– ImmutableGraph is an abstract class representing an immutable graph.
– BVGraph allows storing and accessing web graphs in a compressed form.
– ASCIIGraph is used to store the graph in a human-readable ASCII format.

WebGraph: classes
Some important classes:
– ArcLabelledImmutableGraph is an abstract implementation of a graph with labelled arcs.
– Transform returns the transformed version of an immutable graph. We can use the transpose method of this class if we want to create the transpose graph.
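What transposing a graph means can be shown on plain adjacency lists: every arc u → v becomes v → u. The sketch below works on an int[][] of successor lists and does not use WebGraph; the class TransposeSketch is a hypothetical helper written for this illustration only.

```java
import java.util.Arrays;

// Sketch only: transposes a graph stored as arrays of successors.
final class TransposeSketch {

    static int[][] transpose(int[][] successors) {
        int n = successors.length;

        // First pass: count the predecessors of each node.
        int[] indegree = new int[n];
        for (int[] list : successors)
            for (int target : list) indegree[target]++;

        int[][] result = new int[n][];
        for (int v = 0; v < n; v++) result[v] = new int[indegree[v]];

        // Second pass: the arc u -> v becomes v -> u.
        // Sources are visited in increasing order, so each list comes out sorted.
        int[] filled = new int[n];
        for (int u = 0; u < n; u++)
            for (int v : successors[u]) result[v][filled[v]++] = u;
        return result;
    }

    public static void main(String[] args) {
        int[][] graph = { { 1, 2 }, { 2 }, { 0 } };
        System.out.println(Arrays.deepToString(transpose(graph))); // [[2], [0], [0, 1]]
    }
}
```

The transpose graph gives direct access to the predecessors of each node, which is what algorithms that propagate scores along incoming links need.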

LAW
Java software developed by the Laboratory for Web Algorithms. It is free and contains several implementations of the PageRank algorithm.
Useful links:
–
–

LAW: PageRank
PageRank, in the package it.unimi.dsi.law.rank, is an abstract class that defines methods and attributes for the PageRank algorithm.
Provided features:
– we can set the preference vector;
– we can set the damping factor;
– we can program the stopping criteria;
– step-by-step execution;
– reusability.
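To make the roles of the damping factor and of the stopping criterion concrete, here is a minimal plain-Java sketch of the power method on adjacency lists. It is not the LAW implementation: the class PageRankSketch and its parameters are hypothetical, the preference vector is assumed to be uniform, and the rank of nodes without successors is assumed to be spread uniformly over all nodes.

```java
import java.util.Arrays;

// Sketch only: PageRank by power iteration with a uniform preference vector.
final class PageRankSketch {

    static double[] pageRank(int[][] successors, double alpha, double threshold, int maxIterations) {
        int n = successors.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                      // start from the uniform distribution

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[] next = new double[n];
            double dangling = 0;                          // rank held by nodes without successors
            for (int u = 0; u < n; u++) {
                if (successors[u].length == 0) {
                    dangling += rank[u];
                } else {
                    double share = rank[u] / successors[u].length;
                    for (int v : successors[u]) next[v] += share;
                }
            }
            double delta = 0;                             // infinity norm of the change
            for (int v = 0; v < n; v++) {
                next[v] = alpha * (next[v] + dangling / n) + (1 - alpha) / n;
                delta = Math.max(delta, Math.abs(next[v] - rank[v]));
            }
            rank = next;
            if (delta < threshold) break;                 // stopping criterion
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] graph = { { 1 }, { 0 }, { 0 } };          // nodes 1 and 2 both link to node 0
        System.out.println(Arrays.toString(pageRank(graph, 0.85, 1e-7, 1000)));
    }
}
```

Here alpha plays the role of the damping factor and threshold bounds the largest change of a single score between two iterations, which corresponds to stopping on the infinity norm.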

Exercise
Download the files:
– law-1.4.jar and webgraph jar
– example
– Text2ASCII.class and PrintRanks.class
available at: ml ml
Add law-1.4.jar and webgraph jar to the directory containing all jar files (e.g. lib_mg4j).
Update the file set-classpath.sh, and set the classpath:
    source set-classpath.sh

Build the graph: step1
Create the file in the ASCIIGraph format:
    java Text2ASCII example
Output:
– example.graph-txt: the first line contains the number of nodes, e.g. n. The following n lines contain the lists of out-neighbours of the nodes. In particular, the i-th line contains the successors of node i, sorted in increasing order and separated by spaces.
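The file layout described above (node count on the first line, then one line of space-separated successors per node) can be read with a few lines of plain Java. The sketch below parses the text from a string; the class AsciiGraphSketch is a hypothetical helper and not part of WebGraph, and it assumes that an empty line denotes a node without successors.

```java
import java.util.Arrays;

// Sketch only: parses text in the layout described for example.graph-txt.
final class AsciiGraphSketch {

    static int[][] parse(String text) {
        String[] lines = text.split("\n", -1);
        int n = Integer.parseInt(lines[0].trim());        // first line: number of nodes
        int[][] successors = new int[n][];
        for (int i = 0; i < n; i++) {
            // Line i + 1 holds the successors of node i (assumed empty if missing).
            String line = i + 1 < lines.length ? lines[i + 1].trim() : "";
            if (line.isEmpty()) {
                successors[i] = new int[0];
                continue;
            }
            String[] tokens = line.split("\\s+");
            successors[i] = new int[tokens.length];
            for (int j = 0; j < tokens.length; j++)
                successors[i][j] = Integer.parseInt(tokens[j]);
        }
        return successors;
    }

    public static void main(String[] args) {
        String text = "3\n1 2\n2\n\n";                    // node 2 has no successors
        System.out.println(Arrays.deepToString(parse(text))); // [[1, 2], [2], []]
    }
}
```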

Build the graph: step1
    more example.graph-txt
[Figure: contents of example.graph-txt, annotated with the number of nodes, the node ids and the lists of successors.]

Build the graph: step2
We can use the main method of the BVGraph class to load and compress an ImmutableGraph. The compressed graph is described by:
– basename.graph: the graph file. It contains the successor lists, one for each node. Each list is a sequence of natural numbers that are coded efficiently as sequences of bits.
– basename.offsets: the offset file. It stores the offset for each node of the graph.
– basename.properties: the file with properties and statistics.

Build the graph: step2
Step 2: conversion from the ASCIIGraph to the BVGraph:
    java it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph example example
Output:
– example.graph
– example.offsets
– example.properties

Build the graph: step2
    more example.properties
#BVGraph properties
#Wed Nov 21 12:48:44 CET 2012
compratio=1,89
bitsforblocks=22
…
version=0
…
nodes=10
…
arcs=34
…
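Since the listing above is in the standard Java properties format, it can be inspected with java.util.Properties. The sketch below loads a shortened copy of the file from a string and derives the average outdegree; the class GraphPropertiesSketch and the method averageOutdegree are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch only: reads the key=value pairs of a .properties file.
final class GraphPropertiesSketch {

    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));      // lines starting with '#' are comments
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // Arcs divided by nodes, computed from the two keys shown in the listing.
    static double averageOutdegree(Properties properties) {
        long nodes = Long.parseLong(properties.getProperty("nodes"));
        long arcs = Long.parseLong(properties.getProperty("arcs"));
        return (double) arcs / nodes;
    }

    public static void main(String[] args) {
        String text = "#BVGraph properties\ncompratio=1,89\nnodes=10\narcs=34\n";
        Properties properties = parse(text);
        System.out.println(properties.getProperty("nodes")); // 10
        System.out.println(averageOutdegree(properties));
    }
}
```

Values are returned as strings, so a locale-dependent value such as compratio=1,89 has to be parsed explicitly if it is needed as a number.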

Compute PageRank
To compute PageRank we can use the implementations:
– PowerMethod
– GaussSeidel
– Jacobi
The output is made of 2 files:
– basename.ranks: binary file with the results of the computation.
– basename.properties: text file with general information.

Compute PageRank: step1
We use the main method of the class PageRankPowerMethod by issuing the following command:
    java it.unimi.dsi.law.rank.PageRankPowerMethod example examplePR
Output:
– examplePR.ranks
– examplePR.properties

Compute PageRank: step1
    more examplePR.properties
rank.alpha = 0.85
rank.stronglyPreferential = false
method.numberOfIterations = 12
method.norm.type = INFTY
method.norm.value = E-7
graph.nodes = 10
graph.fileName = example

Compute PageRank: step2
The .ranks file is a binary file with the scores of the nodes. We can print these scores by using the class PrintRanks:
    java PrintRanks examplePR.ranks > ranks
Output: ranks. This file has n lines, one for each node. The i-th line contains the score of node number i.
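The source of PrintRanks is not shown in the slides, so the following is only a sketch of how such a tool could work. It assumes that the .ranks file is a plain sequence of 8-byte doubles in the big-endian format written by java.io.DataOutput; this assumption and the class name RanksReaderSketch are not confirmed by the slides.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Sketch only: reads doubles until the end of the stream (assumed file layout).
final class RanksReaderSketch {

    static double[] readRanks(InputStream in) {
        List<Double> values = new ArrayList<>();
        try (DataInputStream data = new DataInputStream(new BufferedInputStream(in))) {
            while (true) values.add(data.readDouble());   // one score per node, in node order
        } catch (EOFException endOfFile) {
            // Normal termination: no more scores to read.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        double[] ranks = new double[values.size()];
        for (int i = 0; i < ranks.length; i++) ranks[i] = values.get(i);
        return ranks;
    }

    public static void main(String[] args) throws IOException {
        // Usage (hypothetical): java RanksReaderSketch examplePR.ranks
        for (double rank : readRanks(new FileInputStream(args[0])))
            System.out.println(rank);                     // the i-th line is the score of node i
    }
}
```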

Compute PageRank: step2
    more ranks
[Figure: contents of the file ranks, annotated with the node ids and the PageRank values.]

Homework
1) Repeat the exercise with the graphs:
– WikiIT
– WikiPT
available at: html html
2) Create a new graph by using synthetic or real data, and repeat the exercise with this new graph.