Published by Lia Herrin. Modified over 9 years ago.
1
Digital Library Service – An overview
Introduction
System Architecture
Components and their functionalities
Experimental Results
2
Introduction
Peer-to-Peer (P2P) Information Retrieval framework
Peers that share information
Cumulative bandwidth
High processing power and storage
Absence of high cost hardware
Three generations of P2P networks
3
1st Generation: centralized DB for coordinated lookup (Napster)
2nd Generation: flooding to search every node on the network (Gnutella)
3rd Generation: Distributed Hash Tables (Tapestry, Chord, Pastry, CAN, Kademlia); each node uses routing tables to maintain the addresses of its neighbours
4
In 3G P2P networks, between log N and N nodes have to be contacted to reach the destination.
In the proposed method, the target peer can be contacted directly from the source peer.
The search then runs within the target peer, which retrieves the file reference using keyword indices in a B+ tree.
5
System Architecture
P2P cluster and Hadoop cluster
Hadoop cluster: extracts keywords for efficient searching, using the MapReduce programming paradigm
P2P cluster: uploads files and services search requests
6
Figure: System architecture. The Hadoop cluster consists of a MapReduce master (JobTracker) with the DFS master (NameNode), and MapReduce slaves (TaskTracker) each paired with a DFS slave (DataNode). Keyword extraction links the Hadoop cluster to the P2P cluster.
7
Hadoop
Software platform to handle vast amounts of data
Moves computation to the place of data rather than moving large data blocks to the place of computation
HDFS and MapReduce framework
HDFS: NameNode and DataNode
MapReduce computation, expressed over (K,V) pairs
Map: splits the input data set into fragments and assigns each fragment to a map task
Reduce: merges all intermediate values associated with a key
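The Map and Reduce roles described on this slide can be illustrated with a small in-memory sketch. This is not Hadoop code: the function names `map_task`, `reduce_task` and `run_job` are hypothetical, the shuffle step is simulated with a dictionary, and treating every word as a keyword is an assumption made only for illustration.

```python
import re
from collections import defaultdict

def map_task(doc_name, text):
    """Map: emit one (keyword, doc_name) pair per word in a fragment."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, doc_name

def reduce_task(keyword, doc_names):
    """Reduce: merge all intermediate values that share the same key."""
    return keyword, sorted(set(doc_names))

def run_job(fragments):
    """Simulate map, sort-and-group, and reduce in a single process."""
    grouped = defaultdict(list)
    for doc_name, text in fragments:
        for key, value in map_task(doc_name, text):
            grouped[key].append(value)  # group intermediate values by key
    return dict(reduce_task(k, v) for k, v in sorted(grouped.items()))

# Hypothetical input: two tiny "documents"
index = run_job([("doc1", "Peer search"), ("doc2", "peer index")])
```

In a real cluster each fragment would be handled by a separate map task on the node that holds the data; the sketch only shows the shape of the (K,V) flow.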
8
Figure: MapReduce data flow. Blocks of documents D1, D2 and D3 (D1,B1 through D3,B2) are assigned to map tasks (M) in Map Task 1 to 3, which emit (K, C) pairs. The pairs are sorted and grouped per document, for example K3,[C1,C3] for D1 and K3,[C2,C6] for D2, and the reduce tasks (R) in Reduce Task 1 and 2 emit (K, I) pairs.
9
B+ Tree – IP and its hash
Represents sorted data indexed by a key for efficient insertion, retrieval and removal of records
Inserting or searching a record requires O(log_B N) operations in the worst case, where B is the order of the tree and N is the number of nodes
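To make the logarithmic lookup concrete, here is a minimal sketch of a B+ tree search over a hand-built two-level tree. The classes `Leaf` and `Internal` and the sample keys are hypothetical; insertion, node splitting and leaf chaining are deliberately left out, so this shows only the descent from root to leaf.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values

class Internal:
    def __init__(self, keys, children):
        # children[i] holds keys < keys[i]; children[i + 1] holds keys >= keys[i]
        self.keys, self.children = keys, children

def search(node, key):
    """Descend one level per iteration, then look the key up in the leaf."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_right(node.keys, key) - 1
    if i >= 0 and node.keys[i] == key:
        return node.values[i]
    return None

# Hypothetical tree: separator keys 30 and 60 over three leaves
root = Internal([30, 60], [
    Leaf([10, 20], ["a", "b"]),
    Leaf([30, 40], ["c", "d"]),
    Leaf([60, 70], ["e", "f"]),
])
```

Each loop iteration moves down one level, so the number of nodes visited equals the height of the tree, which is where the O(log_B N) bound comes from.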
10
DLS Components
Start up component:
Starting up the Hadoop cluster
Identifying nodes to participate in the P2P cluster
Determining the IP hash values for the peers using SHA1 (160-bit to 40-bit)
Forming the B+ tree
Uploading the B+ tree to the other peers
Starting the Web Server
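The IP hashing step can be sketched as follows. The slides do not say how the 160-bit SHA1 digest is reduced to 40 bits, so keeping the first 40 bits (10 hex digits) is an assumption, as are the helper name `hash40` and the sample IP addresses. A sorted list stands in for the B+ tree of (IP hash, IP) entries.

```python
import hashlib

def hash40(text):
    """SHA1 digest (160 bits = 40 hex digits) reduced to 40 bits.

    Assumption: the reduction keeps the first 10 hex digits.
    """
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return int(digest[:10], 16)

# Hypothetical peers taking part in the P2P cluster
peer_ips = ["192.168.1.10", "192.168.1.11", "192.168.1.12"]

# Sorted (ip_hash, ip) pairs: the key order a B+ tree would maintain
peer_index = sorted((hash40(ip), ip) for ip in peer_ips)
```

Every peer would hold a copy of this index, which is what lets a source peer identify the target peer without intermediate hops.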
11
DB Distribution Component
Keyword extraction using the Hadoop cluster
Hashing keywords with SHA1 (160-bit to 40-bit)
Find the peer with the relatively closest match
Upload the file to the target peer
Update the B+ tree (keyword to file-ref) in the target
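A rough sketch of target identification and upload, under stated assumptions: "closest match" is taken to mean the smallest absolute difference between the keyword hash and a peer's IP hash, the 40-bit reduction keeps the first 10 hex digits, and a plain dictionary per peer stands in for its keyword B+ tree. The names `closest_peer` and `distribute` are hypothetical.

```python
import hashlib
from bisect import bisect_left

def hash40(text):
    # Assumption: keep the first 40 bits of the SHA1 digest
    return int(hashlib.sha1(text.encode("utf-8")).hexdigest()[:10], 16)

def closest_peer(peer_index, key_hash):
    """peer_index is a non-empty sorted list of (ip_hash, ip) pairs."""
    hashes = [h for h, _ in peer_index]
    i = bisect_left(hashes, key_hash)
    # Only the neighbours on either side of the insertion point can be closest
    candidates = peer_index[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda p: abs(p[0] - key_hash))[1]

def distribute(peer_index, doc_keywords):
    """Record each keyword -> file reference in the closest peer's index."""
    stores = {ip: {} for _, ip in peer_index}
    for file_ref, keywords in doc_keywords.items():
        for keyword in keywords:
            target = closest_peer(peer_index, hash40(keyword))
            stores[target].setdefault(keyword, []).append(file_ref)
    return stores
```

For example, `distribute(peer_index, {"doc1.txt": ["hadoop", "peer"]})` would place each keyword's file reference on exactly one peer.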
12
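Figure: DB distribution flow. The Hadoop cluster processes Doc 1, Doc 2, ..., Doc n and outputs each file name with its list of keywords. The search keys are hashed, the target is identified, and the document is uploaded to the target node among the peers in the P2P network.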
13
Search Component
Process keywords
Find the 40-bit hash value
Search the B+ tree in the peer to identify the target node
Search the B+ tree in the target node to retrieve the file reference
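The two-step search can be sketched as below. This is an illustration under assumptions, not the presented implementation: keyword processing is reduced to lowercasing and splitting on whitespace, the nearest peer is found with a linear scan where the real system would consult its B+ tree, and each peer's keyword index is a plain dictionary. All names and addresses are hypothetical.

```python
import hashlib

def hash40(keyword):
    # Assumption: keep the first 40 bits of the SHA1 digest
    return int(hashlib.sha1(keyword.encode("utf-8")).hexdigest()[:10], 16)

def target_peer(peer_hashes, key_hash):
    """Step 1: pick the peer whose IP hash is nearest to the keyword hash."""
    return min(peer_hashes, key=lambda ip: abs(peer_hashes[ip] - key_hash))

def search(query, peer_hashes, peer_indices):
    """Step 2: look each keyword up in its target peer's own index."""
    results = {}
    for keyword in query.lower().split():
        ip = target_peer(peer_hashes, hash40(keyword))
        results[keyword] = peer_indices[ip].get(keyword, [])
    return results

# Hypothetical two-peer network holding a single indexed keyword
peer_hashes = {ip: hash40(ip) for ip in ("10.0.0.1", "10.0.0.2")}
peer_indices = {ip: {} for ip in peer_hashes}
owner = target_peer(peer_hashes, hash40("hadoop"))
peer_indices[owner]["hadoop"] = ["doc1.txt"]
```

Because the same hash and distance rule are used at upload and at search time, a query reaches the peer that stored the keyword in one step.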
14
Figure: Search flow. A search request with a list of keywords arrives at PEER1 in the P2P network, which hashes the search keys and identifies the search node using the relative difference between the hash values of the keywords and the IP addresses in its B+ tree. The document is then searched for in the target peer, PEER2.
15
Add/Delete Peer
Update the IP address table
Compute the IP hash of the newly added peer
Reconstruct the B+ tree and update it in the peers
Relocate the appropriate files to the new peer
Modify metadata in the peers
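The relocation step when a peer joins might look like the following sketch. It assumes entries belong to the peer with the smallest absolute hash difference, works directly on integer hashes instead of SHA1 output to keep the example short, and uses dictionaries in place of the B+ trees. The function names and sample values are hypothetical.

```python
def nearest(peers, key_hash):
    """Peer whose IP hash has the smallest absolute distance to key_hash."""
    return min(peers, key=lambda ip: abs(peers[ip] - key_hash))

def add_peer(peers, stores, new_ip, new_hash):
    """peers: ip -> ip_hash; stores: ip -> {key_hash: [file_refs]}."""
    peers[new_ip] = new_hash          # update the IP address table
    stores[new_ip] = {}
    for ip in list(stores):
        if ip == new_ip:
            continue
        for key_hash in list(stores[ip]):
            # Move only the entries that are now closest to the new peer
            if nearest(peers, key_hash) == new_ip:
                stores[new_ip][key_hash] = stores[ip].pop(key_hash)
    return peers, stores

# Hypothetical network with two peers, then a third one joins at hash 500
peers = {"10.0.0.1": 100, "10.0.0.2": 900}
stores = {
    "10.0.0.1": {120: ["a.txt"], 450: ["b.txt"]},
    "10.0.0.2": {800: ["c.txt"]},
}
add_peer(peers, stores, "10.0.0.3", 500)
```

Here only the entry with hash 450 moves, since it is the only one closer to the new peer than to its previous owner.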
16
Experimental Results – Keyword extraction from multiple files (1 MB each)
Observation: extraction time depends on the number of keywords
17
Cluster Setup Time
Setup time depends on the number of nodes
18
Add a New Peer
Time to add one peer depends on the number of keywords
19
Performance of Data Distribution Component
Load time depends on the number of keywords
20
Performance of Search Component
Search time remains constant (9 msec), owing to the B+ tree and search distribution
21
Conclusion
The P2P Information Retrieval framework uses the 3G P2P DHT approach
B+ trees are maintained in the peers
Hadoop is used for keyword extraction from multiple files in parallel
Efficient search on peers
22
THANK YOU