Trustworthy Information Distribution and Retrieval Michael Melliar-Smith University of California, Santa Barbara Research conducted in collaboration with Louise E. Moser, Isai Michel Lombera and Yung-Ting Chuang Supported in part by NSF Grant CNS
METIS'2011 iTrust Michael Melliar-Smith 2 Information Access over the Internet Modern society and commerce depend on access to information over the Internet Information is accessed over the Internet using centralized search engines and search indexes –Internet search engines are centralized for efficiency and scalability We cannot assume that centralized search engines will always deliver the information we seek, uncensored and unbiased iTrust is a system for publishing, searching for, and retrieving information over the Internet that provides trustworthy access to information
METIS'2011 iTrust Michael Melliar-Smith 3 What Is Trust? "Trust is the mental background to delegation" (Castelfranchi and Falcone [1]) When you delegate an activity, you trust that: –Good things will happen –Bad things will not happen However, what is good and what is bad is a matter of your intent Thus, it is not possible to provide a universal definition of trust Several useful surveys [2], [3] on trust, related to reliability, security, and privacy
METIS'2011 iTrust Michael Melliar-Smith 4 Trust in Algorithms Trust in a strong encryption algorithm is based on estimates of the cost of breaking the keys –Publication of the encryption algorithm –Independent validation of the strength of the algorithm –No need to trust the people who created the algorithm We aim to provide the same kind of trust for iTrust –Published algorithms –Statistical analysis –No need to trust an administrator Trust a social network with a large number of users
METIS'2011 iTrust Michael Melliar-Smith 5 Related Work Useful surveys [4], [5], [6] on publish/subscribe and distributed search, categorized as: –Structured – Require nodes to be organized in an overlay network, e.g., Distributed Hash Tables (DHTs), rings, trees More efficient than unstructured Involve administrative control and additional overhead for constructing and maintaining the overlay network Still incur a trust risk through administrative control –Unstructured – Gossip-based; typically use randomization Gnutella [7] - Great grandfather of unstructured systems; uses flooding of requests to find information Freenet [8] - More sophisticated and efficient than Gnutella, because it learns from previous requests Zhong and Shen [9] - Uses random walks; number of nodes visited by a request is proportional to the square root of the content popularity Ferreira, et al. [10] - Uses random walks; replicates both the queries and the data in a sparse network
METIS'2011 iTrust Michael Melliar-Smith 6 Related Work Several recent information distribution and retrieval systems are concerned with security, privacy, and trust –Quasar [11] – A probabilistic publish/subscribe system, using a sparse structured overlay that is concerned with the release of sensitive information by a central node –OneSwarm [12] – A peer-to-peer data sharing system that uses a combination of trusted and untrusted peers Part of an effort to provide an alternative to cloud computing that does not depend on centralized trust Initial goal is to protect the privacy of the users Uses trusted intermediary nodes to preserve anonymity
METIS'2011 iTrust Michael Melliar-Smith 7 Objectives of iTrust Main Objective: Provide users with information that is important to them Publish information for other people to access Search for, and retrieve, published information Detect that the system is under attack When the system is under attack, adapt to increase the probability of information distribution and retrieval, with some increase in costs
METIS'2011 iTrust Michael Melliar-Smith 8 Non-Objectives of iTrust No attempt to prevent the distribution of misinformation No attempt to maintain secrecy and privacy No attempt to minimize communication, processing and storage costs –Costs are greater than for conventional, centralized Internet search engines –We assume that the additional costs are acceptable, given the primary objective –Information distribution and retrieval are probabilistic
METIS'2011 iTrust Michael Melliar-Smith 9 Characteristics of iTrust Falls into the general category of random walk publish/subscribe systems –All nodes are equal –No central search engine –Replication of metadata and requests –Random search or random walk –No supernodes –Distributed membership –No central control over membership
METIS'2011 iTrust Michael Melliar-Smith 10 Basic Idea of iTrust The participating nodes constitute the membership of iTrust A source node is a participating node that –Produces information, which it makes available to other participating nodes –Produces metadata (keywords) describing the information, and distributes the metadata to a subset of participating nodes chosen at random A requesting node is a participating node that –Generates requests containing metadata and distributes the requests to a subset of participating nodes chosen at random A participating node that receives a request –Compares the metadata in the request with the metadata it holds –If it finds a match, it returns the URL of the associated information to the requesting node The requesting node then uses the URL to retrieve the information from the source node
METIS'2011 iTrust Michael Melliar-Smith 11 Distribution of Metadata Source of Information
METIS'2011 iTrust Michael Melliar-Smith 12 Distribution of a Request Source of Information Requester of Information Request Encounters Metadata
METIS'2011 iTrust Michael Melliar-Smith 13 Retrieval of Information Source of Information Requester of Information Request Matched
METIS'2011 iTrust Michael Melliar-Smith 14 Probabilistic Analysis Membership contains n participating nodes Metadata are distributed to m nodes Requests are distributed to r nodes Of the participating nodes, a proportion x are operational
METIS'2011 iTrust Michael Melliar-Smith 15 Probabilistic Analysis A request is distributed to r nodes (r trials) Probability of no match on 1st trial: n-m Probability of no match on 2nd trial: n-m-1 Probability of no match on rth trial: n-m-r+1 Probability q of no match on r trials: n-m n-m-1 … n-m-r+1 = (n-m)! (n-r)! n n-1 n-r+1 n n-1 n-r+1 n! (n-m-r)!
METIS'2011 iTrust Michael Melliar-Smith 16 Probabilistic Analysis Probability p of a match on r trials: 1 - n-m n-m-1 … n-m-r+1 = 1 - (n-m)! (n-r)! m=r = n p > 1 – e -1 m=r = 2n p > 1 – e -2 m=r =2 n p > 1 – e -4 If m+r > n, then p = 1 n n-1 n-r+1 n! (n-m-r)! where n ≥ m+r
METIS'2011 iTrust Michael Melliar-Smith 17 Probabilistic Analysis Now assume that only a proportion x of the participating nodes are operational Probability that 1st node has the metadata: m Probability that 1st node has the metadata and is operational: mx Probability of no match on 1st trial: 1- mx = n-mx Probability of no match on 2nd trial: n-mx-1 Probability of no match on rth trial: n-mx-r+1 n n n n n-r+1 n-1
METIS'2011 iTrust Michael Melliar-Smith 18 Probability q of no match on r trials: n-mx n-mx-1 … n-mx-r+1 = (n-mx)! (n-r)! Probability p of a match on r trials: 1 - n-mx n-mx-1 … n-mx-r+1 = 1 - (n-mx)! (n-r)! If mx+r > n, then p = 1 Probabilistic Analysis n n-1 n-r+1 n! (n-mx-r)! n n-1 n-r+1 n! (n-mx-r)! where n ≥ mx+r
METIS'2011 iTrust Michael Melliar-Smith 19 Probability of a Match
METIS'2011 iTrust Michael Melliar-Smith 20 Probability of a Match
METIS'2011 iTrust Michael Melliar-Smith 21 Time-to-Live Metadata can be provided with a time-to-live –Receiver of the metadata deletes the metadata when the time-to-live expires Similarly, a request can be provided with a time-to-live –Receiver of the request stores the request until the time-to-live and then deletes the request –Receiver attempts to match newly arrived metadata with the metadata in the request until the time-to-live
METIS'2011 iTrust Michael Melliar-Smith 22 Small Information Many information items are small Distribute the information itself, rather than the metadata about the information
METIS'2011 iTrust Michael Melliar-Smith 23 Different Classes of Nodes Some nodes are less capable, or are only intermittently connected to the network Distribute the metadata and the requests only to the more capable nodes Less capable nodes might have more powerful proxy nodes or home agents
METIS'2011 iTrust Michael Melliar-Smith 24 Forwarding Metadata and Requests To exploit the parallelism of the Internet, the originator of the metadata or request does not necessarily send the metadata or request to all m or r of the participating nodes When a node receives the metadata or request, with some probability, it forwards the metadata or request to another participating node selected at random Doing so introduces some variability in the number of nodes to which the metadata and requests are distributed
METIS'2011 iTrust Michael Melliar-Smith 25 Differential Distribution If there are many more requests than metadata, it might be appropriate to distribute the metadata to more nodes and the requests to fewer nodes Similarly, long-lived metadata and requests might be distributed to more nodes than short-lived metadata and requests Likewise, frequently requested metadata might be distributed to more nodes than rarely requested metadata
METIS'2011 iTrust Michael Melliar-Smith 26 Network Load We are investigating the effects of metadata and request distribution on the network load and also on the load of participating nodes If the network load is too high, it might be necessary to reduce the number of nodes to which the metadata and requests are distributed
METIS'2011 iTrust Michael Melliar-Smith 27 Membership The membership of participating nodes need not be exact and up-to-date Small differences in the membership are equivalent to small proportions of non-operational nodes It is essential, to the iTrust strategy, that the membership should not be centrally managed Thus, we employ a membership algorithm that is based on iTrust itself
METIS'2011 iTrust Michael Melliar-Smith 28 Membership Algorithm A node wishing to join the membership contacts any current member to obtain the current membership –It does so using mechanisms that are outside the iTrust strategy, perhaps , Twitter, etc. It then publishes its joining the membership, through the iTrust distribution and retrieval mechanisms All nodes periodically request and retrieve information about new nodes that have joined the membership
METIS'2011 iTrust Michael Melliar-Smith 29 Joining the Membership
METIS'2011 iTrust Michael Melliar-Smith 30 Discovering New Members
METIS'2011 iTrust Michael Melliar-Smith 31 Leaving the Membership
METIS'2011 iTrust Michael Melliar-Smith 32 Rapidly Changing Memberships At times of rapid membership change, it might be appropriate to request and retrieve membership information more frequently, with increased computation and communication costs At times of rapid membership change, it might be appropriate to distribute the metadata and requests to more nodes to compensate for inaccurate membership information
METIS'2011 iTrust Michael Melliar-Smith 33 Large Memberships Large memberships (perhaps millions of nodes) might be expensive to retrieve and store The potentially high rate of notifications of membership changes for a large membership might impose a heavy load on the network We are investigating strategies for creating and maintaining memberships in which each node is aware of only a small subset of the membership We are also investigating the effects of small subsets of the membership, on the effectiveness of information distribution and retrieval
METIS'2011 iTrust Michael Melliar-Smith 34 Encryption In iTrust, there is no intention to use encryption to ensure secrecy or privacy at the node level Necessarily, metadata and requests must be readable by large numbers of nodes and, thus, they are public However, encryption can be used to make it prohibitively expensive for routers to use deep packet inspection to censor metadata or requests For this purpose, iTrust uses standard public key encryption
METIS'2011 iTrust Michael Melliar-Smith 35 Encrypted Metadata and Requests When a node sends metadata or requests, it encrypts the message with –Its private key –Destination’s public key The sending node includes its public key in the message –Some receiver nodes might not yet have its information in their membership tables When a node finds that a request matches its metadata, –It uses the public key in the request to encrypt the response reporting the match to the requester –The response supplies the URL that directs the requester to the source of the information, and also the source’s public key
METIS'2011 iTrust Michael Melliar-Smith 36 Potential Malicious Attacks A malicious attacker might seed the network with covertly subverted nodes that behave normally, except that they fail to report matches involving information that the attacker wants to suppress A malicious attacker must ensure that a large number of nodes that participate in matching have been subverted In iTrust, it is important to detect a malicious attack, and to prevent the malicious attack from being effective
METIS'2011 iTrust Michael Melliar-Smith 37 Detecting a Malicious Attack In iTrust, it is likely that a request will result in several reports of matches The probability of multiple reports of matches depends on: –Number n of participating nodes –Number m of nodes to which the metadata are distributed –Number r of nodes to which the requests are distributed –Probability x that a node is operational The effect of a malicious attack is to increase the probability that a substantial number of subverted nodes appear to be non-operational for certain metadata or requests
METIS'2011 iTrust Michael Melliar-Smith 38 Detecting a Malicious Attack k=0k=1k=2k=3k=4k=5 80% 60% 40% 20% 1000 Node Network with Distribution to 60 Nodes Percentage of nodes operational 100% Number of Matches Probability of Number of Matches
METIS'2011 iTrust Michael Melliar-Smith 39 Responding to a Malicious Attack If the iTrust network is under attack, it is appropriate to increase the number of nodes to which the metadata and requests are distributed We are investigating an adaptive algorithm that increases the number of nodes to which the metadata and requests are distributed, as the probability of an attack increases, i.e., our estimate of the number of subverted nodes increases
METIS'2011 iTrust Michael Melliar-Smith 40 Prototype Implementation of iTrust Based on the Apache Web server, compiled with several PHP standard modules and library extensions Uses HTTP for distribution of metadata and requests, and for retrieval of documents Multiple iTrust nodes can be installed on a single Web server by creating multiple virtual Web sites on that server Comprises three components: –Web Server Foundation –Application Infrastructure –Public Interface
METIS'2011 iTrust Michael Melliar-Smith 41 The Components of iTrust
METIS'2011 iTrust Michael Melliar-Smith 42 Web Server Foundation cURL is used for inter-node communication and resource-specific actions SQLite database tables are used to store node, metadata, and resource information –A node uses the SQLite LIKE function to match the metadata in a request with the metadata that it holds The session module tracks and distinguishes users The log module is used for debugging and for simulation The PHP Extension Community Library (PECL) for HTTP is used for inter-node search and requests
METIS'2011 iTrust Michael Melliar-Smith 43 Application Infrastructure The metadata XML engine scans the resources and creates an XML list describing the relationship between the metadata and the resource The node and resource-related helper functions insert nodes into the membership, insert keywords into the database, and upload or fetch resources The Apache Tika and Lucene packages are used to generate metadata from resources, if the user opts not to generate the metadata manually The WordNet dictionary is used to provide spell checking and synonym suggestions
METIS'2011 iTrust Michael Melliar-Smith 44 Public Interface Comprises two kinds of interfaces: –Computer interfaces – Handle all inter-node communication such as queries, resource distribution, and metadata list distribution Request is sent to participating nodes using computer interfaces in a simple inbox-type fashion A participating nodes reads its inbox for queries, and sends back a response if it has a match –Human interfaces – Consist of PHP HTML Web pages Administrator can add nodes or metadata keywords using HTML form text boxes User generates requests using HTML form text boxes User settings and statistics Web pages provide feedback on the membership size, resource count, etc.
METIS'2011 iTrust Michael Melliar-Smith 45 Prototype Implementation of iTrust
METIS'2011 iTrust Michael Melliar-Smith 46 Prototype Implementation of iTrust
METIS'2011 iTrust Michael Melliar-Smith 47 Prototype Implementation of iTrust
METIS'2011 iTrust Michael Melliar-Smith 48 Simulation Results Based on the Implementation
METIS'2011 iTrust Michael Melliar-Smith 49 Conclusion and Future Work We have described iTrust [13], [14], a trustworthy information distribution and retrieval network We plan to do experimental evaluations of the prototype implementation using PlanetLab We are investigating other implementations of iTrust based on: –SMS –Twitter –What else? We need your advice. We plan to make the iTrust source code, tools, documentation, etc. freely available for all to use
METIS'2011 iTrust Michael Melliar-Smith 50 References [1] C. Castelfranchi and R. Falcone, Principles of Trust for MAS: Cognitive Anatomy, Social Importance, and Quantification, Proceedings of the International Conference on Multi-Agent Systems, Paris, France, [2] D. Artz and Y. Gil, A Survey of Trust in Computer Science and the Semantic Web, Journal of Web Semantics, 5:58-71, Elsevier, [3] T. Grandison and M. Sloman, A Survey of Trust in Internet Applications, IEEE Communications Survey Tutorials, 4(4):2-16, [4] P. T. Eugster, P. A. Felber, R. Guerraoui and A. M. Kermarrec, The Many Faces of Publish/Subscribe, ACM Computing Surveys, 35(2): , June [5] J. Mischke and B. Stiller, A Methodology for the Design of Distributed Search in P2P Middleware, IEEE Network 19(1):30-37, January [6] J. Risson and T. Moors, Survey of Research Towards Robust Peer-to-Peer Networks: Search Methods, Technical Report UNSW-EE-P2P-1-1. University of New South Wales, September 2007, RFC 4982, [7] Gnutella, [8] I. Clarke, O. Sandberg, B. Wiley and T. Hong, Freenet: A Distributed Anonymous Information Storage and Retrieval System, Proceedings of the Workshop on Design Issues in Anonymity and Unobservability, Lecture Notes in Computer Science, Berkeley, CA, July 2000, [9] M. Zhong and K. Shen, Popularity-Based Random Walks for Peer-to-Peer Search under the Square-Root Principle, Lecture Notes in Computer Science 4490, 2007, [10] R. A. Ferreira, M. K. Ramanathan, A. Awan, A. Grama and S. Jagannathan, Search with Probabilistic Guarantees in Unstructured Peer-to-Peer Networks, Proceedings of the Fifth IEEE International Conference on Peer-to-Peer Computing, Konstance, Germany, August 2005, [11] B. Wong and S. Guha, Quasar: A Probabilistic Publish-Subscribe System for Social Networks, Proceedings of the 7th International Workshop on Peer-to-Peer Systems, Tampa Bay, FL, February [12] T. Isdal, M. Piatek, A. Krishnamurthy and T. Anderson, Privacy Preserving P2P Data Sharing with OneSwarm, Technical Report UW-CSE, Department of Computer Science, University of Washington, [13] I. Michel Lombera, Y. T. Chuang, P. M. Melliar-Smith and L. E. Moser, Trustworthy Distribution and Retrieval of Information over HTTP and the Internet, Proceedings of the Third International Conference on the Evolving Internet, INTERNET 2011, Luxembourg, June [14] Y. T. Chuang, I. Michel Lombera, L. E. Moser and P. M. Melliar-Smith, Trustworthy Distributed Search and Retrieval over the Internet, Proceedings of the 2011 WORLDCOMP, International Conference on Internet Computing, ICOMP, Las Vegas, NV, July 2011.
METIS'2011 iTrust Michael Melliar-Smith 51 Thank You! Questions? Comments…