
UbiCrawler : a scalable fully distributed Web crawler P. Boldi, B. Codenotti, M. Santini, and S. Vigna, SPE Vol.34 No.2 pages 213-237, Feb. 2004 Kyoung.


1 UbiCrawler: a scalable fully distributed Web crawler
P. Boldi, B. Codenotti, M. Santini, and S. Vigna, SPE Vol. 34 No. 2, pages 213-237, Feb. 2004
Kyoung Hoon Kwak (May 02, 2006)

2 Contents
- The Assignment Function
  - Background
  - Consistent hashing
  - Identifier-seeded consistent hashing
- Implementation Issues
  - Space/time-efficient type-specific collections
  - Robust, fast, error-tolerant HTML parsing
  - String and StringBuffer
- Appendix

3 4. The Assignment Function (1/3)
- Set up a function δ that, for each set L ⊆ A of alive agents and each host h, delegates the responsibility of fetching h to the agent δ_L(h) ∈ L
- Balancing: each agent a ∈ L should get about the same number of hosts, m / |L|
- Notation
  - A: set of agent identifiers
  - L: set of alive agents
  - h: host
  - δ: delegation function
  - m: total number of hosts
  - |L|: number of alive agents
  - a: agent
- Example: m = 100, |L| = 10, so m / |L| = 10 hosts per agent

4 4. The Assignment Function (2/3)
- Modulo-based hash function
  - Good balancing properties
  - When an agent crashes, the contents of the sets change in a completely chaotic way
- Example: L = {a, b, c}, L' = {a, b}, hosts = {1, 2, 3, 4, 5, 6, 7}
  - L: a = {1, 4, 7}, b = {2, 5}, c = {3, 6}
  - L': a = {1, 3, 5, 7}, b = {2, 4, 6}
  - Hosts 4 and 5 change agent even though a and b are both still alive
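The modulo example on this slide can be reproduced with a few lines of Java. This is only a sketch: the class name `ModuloAssignment`, the rule "host h goes to agent number (h - 1) mod |L|", and the helper `movedHosts` are assumptions chosen so that the output matches the sets shown on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ModuloAssignment {
    /** Assigns host h (numbered from 1) to agents.get((h - 1) % agents.size()). */
    static Map<String, List<Integer>> assign(List<String> agents, int hostCount) {
        Map<String, List<Integer>> result = new TreeMap<>();
        for (String agent : agents) {
            result.put(agent, new ArrayList<>());
        }
        for (int h = 1; h <= hostCount; h++) {
            result.get(agents.get((h - 1) % agents.size())).add(h);
        }
        return result;
    }

    /** Counts the hosts whose agent differs between the two agent lists. */
    static int movedHosts(List<String> before, List<String> after, int hostCount) {
        int moved = 0;
        for (int h = 1; h <= hostCount; h++) {
            String oldAgent = before.get((h - 1) % before.size());
            String newAgent = after.get((h - 1) % after.size());
            if (!oldAgent.equals(newAgent)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        List<String> l = Arrays.asList("a", "b", "c");
        List<String> lPrime = Arrays.asList("a", "b");
        System.out.println(assign(l, 7));             // {a=[1, 4, 7], b=[2, 5], c=[3, 6]}
        System.out.println(assign(lPrime, 7));        // {a=[1, 3, 5, 7], b=[2, 4, 6]}
        System.out.println(movedHosts(l, lPrime, 7)); // 4
    }
}
```

Four of the seven hosts move when c crashes, although only the two hosts owned by c (3 and 6) had to be reassigned.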

5 4. The Assignment Function (3/3)
- Contravariance: if the number of agents grows, the portion of the Web crawled by each agent must shrink; that is, if L ⊆ L', every agent a ∈ L keeps under L' only a subset of the hosts it had under L
- Example: L = {a, b}, L' = {a, b, c}, hosts = {1, 2, 3, 4, 5, 6, 7}
  - L: a = {1, 2, 3, 4}, b = {5, 6, 7}
  - L': a = {1, 2, 3}, b = {5, 7}, c = {4, 6}
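Contravariance can be checked mechanically on small examples. The sketch below is illustrative only: `ContravarianceCheck`, `isContravariant`, and `hosts` are hypothetical names, and the data in `main` are the sets from this slide plus the modulo-based sets from the previous slide read in the growing direction (two agents, then three).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ContravarianceCheck {
    /** True if every agent of the smaller configuration keeps only a subset of its hosts. */
    static boolean isContravariant(Map<String, Set<Integer>> smaller,
                                   Map<String, Set<Integer>> larger) {
        for (Map.Entry<String, Set<Integer>> e : smaller.entrySet()) {
            Set<Integer> after = larger.getOrDefault(e.getKey(), Collections.<Integer>emptySet());
            if (!e.getValue().containsAll(after)) {
                return false; // the agent received a host it did not own before
            }
        }
        return true;
    }

    static Set<Integer> hosts(Integer... ids) {
        return new TreeSet<>(Arrays.asList(ids));
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> l = new HashMap<>();
        l.put("a", hosts(1, 2, 3, 4));
        l.put("b", hosts(5, 6, 7));
        Map<String, Set<Integer>> lPrime = new HashMap<>();
        lPrime.put("a", hosts(1, 2, 3));
        lPrime.put("b", hosts(5, 7));
        lPrime.put("c", hosts(4, 6));
        System.out.println(isContravariant(l, lPrime)); // true

        Map<String, Set<Integer>> mod2 = new HashMap<>();
        mod2.put("a", hosts(1, 3, 5, 7));
        mod2.put("b", hosts(2, 4, 6));
        Map<String, Set<Integer>> mod3 = new HashMap<>();
        mod3.put("a", hosts(1, 4, 7));
        mod3.put("b", hosts(2, 5));
        mod3.put("c", hosts(3, 6));
        System.out.println(isContravariant(mod2, mod3)); // false: a gains host 4
    }
}
```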

6 Background
- A simple technique to obtain a balanced, contravariant assignment function is to generate random permutations
  - Use some bits extracted from the host name to seed a random generator
  - Randomly permute the set of possible agents
- This solution has a big disadvantage
  - It runs in time and space proportional to the size of the set of possible agents (which one wants to keep as large as possible)
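A minimal sketch of the permutation technique follows. Several details are assumptions not stated on the slide: the seed is taken from `String.hashCode()`, `java.util.Random` is the generator, and the host goes to the first alive agent in the permuted order. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Set;

class PermutationAssignment {
    /**
     * Permutes all possible agents with a generator seeded by the host name
     * and returns the first agent of the permutation that is alive.
     */
    static String delegate(String host, List<String> allAgents, Set<String> alive) {
        List<String> permuted = new ArrayList<>(allAgents);          // space proportional to |A|
        Collections.shuffle(permuted, new Random(host.hashCode()));  // time proportional to |A|
        for (String agent : permuted) {
            if (alive.contains(agent)) {
                return agent;
            }
        }
        throw new IllegalArgumentException("no alive agent");
    }
}
```

The permutation depends only on the host and on the full agent set, so starting a new agent can only take hosts away from the existing ones. The cost is the copy and shuffle of the whole list of possible agents on every call, which is the disadvantage the slide points out.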

7 Consistent hashing (1/2)
- Consistent hashing is used to obtain consistency and contravariance
- Each bucket is replicated k times and each replica is mapped randomly on the unit circle
- To hash a key, compute a point on the unit circle and find the nearest replica
- All agents should compute the same set of replicas corresponding to a given agent
- Here a bucket denotes an agent and a key denotes a host

8 Consistent hashing (2/2)
- Example: L = {a, b}, L' = {a, b, c}, k = 3, hosts = {0, 1, ..., 9}
- [Figure: replicas of agents a, b, c and hosts 0-9 placed on the unit circle, once for L' and once for L]
- With L' = {a, b, c}
  - δ_L'^-1(a) = {4, 5, 6, 8}
  - δ_L'^-1(b) = {0, 2, 7}
  - δ_L'^-1(c) = {1, 3, 9}
- With L = {a, b}
  - δ_L^-1(a) = {1, 4, 5, 6, 8, 9}
  - δ_L^-1(b) = {0, 2, 3, 7}
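The nearest-replica rule can be sketched directly on the unit circle. This is a conceptual illustration, not the slide's figure: the replica positions come from `java.util.Random` seeded with `String.hashCode()` (an assumption), the lookup is a plain linear scan, and all names are hypothetical. No claim about balance is made for these particular seeds.

```java
import java.util.Random;
import java.util.Set;

class UnitCircleHashing {
    /** k replica positions in [0, 1); the same agent always yields the same positions. */
    static double[] replicas(String agent, int k) {
        Random rnd = new Random(agent.hashCode()); // assumption: seed derived from the agent name
        double[] points = new double[k];
        for (int i = 0; i < k; i++) {
            points[i] = rnd.nextDouble();
        }
        return points;
    }

    /** Maps a host to a point in [0, 1) (assumption: derived from String.hashCode). */
    static double point(String host) {
        return new Random(host.hashCode()).nextDouble();
    }

    /** Distance between two points measured along a circle of circumference 1. */
    static double circularDistance(double x, double y) {
        double d = Math.abs(x - y);
        return Math.min(d, 1.0 - d);
    }

    /** Returns the alive agent that owns the replica nearest to the host's point. */
    static String delegate(String host, Set<String> alive, int k) {
        double p = point(host);
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (String agent : alive) {
            for (double r : replicas(agent, k)) {
                double d = circularDistance(p, r);
                if (d < bestDistance) {
                    bestDistance = d;
                    best = agent;
                }
            }
        }
        return best;
    }
}
```

Adding an agent only adds replicas to the circle, so a host either keeps its agent or moves to the new one, which is the pattern in the example: hosts 1, 3, and 9 move to c and nothing else changes.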

9 Identifier-seeded consistent hashing (1/3)
- Derive the set of replicas from a good random number generator (Mersenne Twister) seeded with the agent identifier
  - This fixes the set of replicas associated with an agent
  - It tries to maintain the good randomness properties of consistent hashing
- Birthday paradox
  - Even with a very large number of points, the probability that two replicas overlap becomes non-negligible
  - When a new agent is started, its identifier is used to generate the replicas for the agent
  - If a replica is already assigned to some other agent, the new agent must choose another identifier
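The identifier-seeding and the clash rule can be sketched as follows. Note the substitutions: `java.util.Random` stands in for the Mersenne Twister named on the slide (the Java standard library does not ship one), identifiers are plain `long` values, and `ReplicaRegistry`, `replicasFor`, and `tryRegister` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

class ReplicaRegistry {
    private final Map<Long, String> owners = new HashMap<>(); // replica position -> agent
    private final int k;                                       // replicas per agent

    ReplicaRegistry(int k) {
        this.k = k;
    }

    /** Positions depend only on the identifier, so every agent computes the same ones. */
    static long[] replicasFor(long identifier, int k) {
        Random rnd = new Random(identifier); // stand-in for the Mersenne Twister
        long[] positions = new long[k];
        for (int i = 0; i < k; i++) {
            positions[i] = rnd.nextLong();
        }
        return positions;
    }

    /** Registers the agent unless one of its replicas is already taken. */
    boolean tryRegister(long identifier, String agentName) {
        long[] positions = replicasFor(identifier, k);
        for (long p : positions) {
            if (owners.containsKey(p)) {
                return false; // clash: the caller must choose another identifier
            }
        }
        for (long p : positions) {
            owners.put(p, agentName);
        }
        return true;
    }
}
```

A caller would retry `tryRegister` with a fresh identifier until it succeeds, which mirrors the rule on the slide.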

10 Identifier-seeded consistent hashing (2/3)
- The unit circle is mapped onto the whole set of representable integers
- All replicas are stored in a balanced tree
- A host is hashed in time logarithmic in the number of alive agents
- Leaves are kept in a doubly linked chain so that the next-nearest replica can be found very fast
- The number of replicas depends on the capacity of the hardware
- [Figure: replicas labeled b, a, c]
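A sketch of the integer-circle lookup, assuming `long` positions and using `java.util.TreeMap` (a red-black tree) as the balanced tree. `IntegerRing` and its methods are hypothetical names. `TreeMap` does not link its leaves in a chain; neighbouring replicas are reached with `lowerEntry` and `higherEntry`, which cost logarithmic time.

```java
import java.util.Map;
import java.util.TreeMap;

class IntegerRing {
    private final TreeMap<Long, String> replicas = new TreeMap<>(); // position -> agent

    void addReplica(long position, String agent) {
        replicas.put(position, agent);
    }

    /** Agent owning the replica nearest to the point on the circle of all 2^64 long values. */
    String nearest(long point) {
        if (replicas.isEmpty()) {
            throw new IllegalStateException("no replicas");
        }
        Map.Entry<Long, String> below = replicas.floorEntry(point);
        if (below == null) {
            below = replicas.lastEntry();   // wrap around the circle
        }
        Map.Entry<Long, String> above = replicas.ceilingEntry(point);
        if (above == null) {
            above = replicas.firstEntry();  // wrap around the circle
        }
        // Overflowing subtraction yields the distance modulo 2^64 when read as unsigned.
        long down = point - below.getKey();
        long up = above.getKey() - point;
        return Long.compareUnsigned(down, up) <= 0 ? below.getValue() : above.getValue();
    }
}
```

Each lookup performs two tree searches, so its cost grows with the logarithm of the number of stored replicas.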

11 Identifier-seeded consistent hashing (3/3)
- With k = 100 replicas per agent, the deviation from perfect balancing is less than 6%
- With k = 200, the deviation decreases to 4.5%

12 5. Implementation Issues
- The authors decided to develop UbiCrawler as a 100% pure Java application
  - Platform independence
  - Fully distributed, P2P-like application
  - Java has a certain system overhead, but the speed of UbiCrawler is limited by network bandwidth, not by CPU power
- Java made it possible to adopt Remote Method Invocation (RMI)
  - RMI enables one to create distributed applications in which the methods of remote Java objects can be invoked from other Java virtual machines

13 Space/time-efficient type-specific collections
- The Collection and Map hierarchies in the java.util package
  - A basic tool for code development
  - Because of their awkward management of primitive types, these hierarchies are not suitable for handling primitive types
- The authors developed a package named fastUtil
  - 537 classes that offer type-specific mappings
  - All classes implement the standard Java interfaces
  - They offer polymorphic methods for easier access and reduced object creation
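To illustrate what "type-specific" means, here is a minimal set of `int` values that stores its keys in an `int[]` instead of boxing each one into an `Integer` as `java.util.HashSet<Integer>` does. It is a sketch of the idea only, not the fastUtil API or its implementation; the class name `SimpleIntSet`, linear probing, and the 0.75 load factor are all assumptions.

```java
class SimpleIntSet {
    private int[] keys = new int[16];          // capacity is always a power of two
    private boolean[] used = new boolean[16];  // marks occupied slots
    private int size;

    /** Adds the key without boxing it; returns false if it was already present. */
    boolean add(int key) {
        if ((size + 1) * 4 > keys.length * 3) {
            grow(); // keep the load factor at or below 0.75
        }
        int i = slot(key, keys.length);
        while (used[i]) {
            if (keys[i] == key) {
                return false;
            }
            i = (i + 1) & (keys.length - 1); // linear probing
        }
        used[i] = true;
        keys[i] = key;
        size++;
        return true;
    }

    boolean contains(int key) {
        int i = slot(key, keys.length);
        while (used[i]) {
            if (keys[i] == key) {
                return true;
            }
            i = (i + 1) & (keys.length - 1);
        }
        return false;
    }

    int size() {
        return size;
    }

    private static int slot(int key, int capacity) {
        int h = key * 0x9E3779B9; // multiplicative scrambling of the key bits
        return (h ^ (h >>> 16)) & (capacity - 1);
    }

    private void grow() {
        int[] oldKeys = keys;
        boolean[] oldUsed = used;
        keys = new int[oldKeys.length * 2];
        used = new boolean[oldKeys.length * 2];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                add(oldKeys[i]);
            }
        }
    }
}
```

No object is allocated per element, which is where the space and time savings of a type-specific collection come from.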

14 Robust, fast, error-tolerant HTML parsing
- Every crawling thread needs to parse a page before storing it
  - Extract hyperlinks so that the crawl can proceed
  - Obtain other relevant information
- The current version of UbiCrawler uses a highly optimized HTML/XHTML parser
  - It is able to work around the most common errors
  - On a standard PC, performance is about 600 pages/s (including URL parsing and word occurrence extraction)
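As a rough illustration of error-tolerant link extraction, the sketch below pulls `href` values out of anchor tags with a regular expression that accepts upper-case tags, unquoted or single-quoted attribute values, and missing closing tags. It is not the UbiCrawler parser and is far less capable than a real HTML parser; `TolerantLinkExtractor` and `links` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TolerantLinkExtractor {
    // href value in double quotes, in single quotes, or unquoted; tag and attribute in any case.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>\"']+))",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href values of all anchor tags, in document order. */
    static List<String> links(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    out.add(m.group(g).trim());
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<A HREF=page1.html>x<a href='p2'>y</a><a class=c href = \"p3\" >z";
        System.out.println(links(html)); // [page1.html, p2, p3]
    }
}
```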

15 String and StringBuffer
- The Java string classes are a well-known cause of inefficiency
- StringBuffer implies a huge performance hit in a multithreaded application
  - Synchronization
  - Equality defined by reference (i.e. two buffers with the same content are not equal)
- The same problems have been reported by the authors of Mercator
- The authors rewrote a string class lying halfway between String and StringBuffer
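The reference-equality problem is easy to demonstrate, and a class "halfway between String and StringBuffer" can be sketched as a mutable, unsynchronized character buffer whose `equals` and `hashCode` depend on content. `ContentString` is a hypothetical name for illustration; it is not the class the authors wrote.

```java
import java.util.Arrays;

final class ContentString {
    private char[] chars = new char[16];
    private int length;

    /** Appends without synchronization, growing the backing array when needed. */
    ContentString append(CharSequence s) {
        if (length + s.length() > chars.length) {
            chars = Arrays.copyOf(chars, Math.max(chars.length * 2, length + s.length()));
        }
        for (int i = 0; i < s.length(); i++) {
            chars[length++] = s.charAt(i);
        }
        return this;
    }

    /** Equality by content, unlike StringBuffer. */
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ContentString)) {
            return false;
        }
        ContentString other = (ContentString) o;
        if (length != other.length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (chars[i] != other.chars[i]) {
                return false;
            }
        }
        return true;
    }

    /** Same formula as String.hashCode, computed over the current content. */
    @Override
    public int hashCode() {
        int h = 0;
        for (int i = 0; i < length; i++) {
            h = 31 * h + chars[i];
        }
        return h;
    }

    @Override
    public String toString() {
        return new String(chars, 0, length);
    }

    public static void main(String[] args) {
        System.out.println(new StringBuffer("crawl").equals(new StringBuffer("crawl"))); // false
        System.out.println(new ContentString().append("crawl")
                .equals(new ContentString().append("cr").append("awl")));               // true
    }
}
```

Because the hash code changes when the content changes, an instance used as a key in a hash-based collection should not be modified while it is stored there.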

16 Appendix
- Birthday paradox: the probability that, in a group of people, at least two share a birthday
- Example: the probability that no two of 50 people share a birthday is
  {1 - (1/365)} × {1 - (2/365)} × {1 - (3/365)} × … × {1 - (49/365)} ≈ 2.96%
  (so the probability that at least two share a birthday is about 97%)
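The product in the appendix is straightforward to evaluate. In this sketch `BirthdayParadox` and `collisionProbability` are hypothetical names; n stands for the number of people (or replicas) and d for the number of days (or points on the circle), both assumed to be drawn uniformly at random.

```java
class BirthdayParadox {
    /** Probability that at least two of n uniformly random picks from d slots coincide. */
    static double collisionProbability(int n, int d) {
        double noCollision = 1.0;
        for (int i = 1; i < n; i++) {
            noCollision *= 1.0 - (double) i / d; // the (i+1)-th pick avoids the i taken slots
        }
        return 1.0 - noCollision;
    }

    public static void main(String[] args) {
        System.out.printf("%.4f%n", collisionProbability(50, 365)); // about 0.97
        System.out.printf("%.4f%n", collisionProbability(23, 365)); // about 0.51
    }
}
```

The same function applies to replicas: with n replicas placed on d representable points, the chance of an overlap grows roughly with n²/d, which is why it cannot be ignored even when d is very large.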

