Approximate Text Addressing and Spam Filtering on P2P Systems
Feng Zhou
CS262A, 12/13/2002

Presentation transcript:

Slide 1: Approximate Text Addressing and Spam Filtering on P2P Systems
Feng Zhou, Li Zhuang

Slide 2: DOLR and DHT Background

Existing P2P Overlay Networks
  - Map unique IDs to network locations: Decentralized Object Location and Routing (DOLR) layer
  - Locate nearby copies of object replicas: Distributed Hash Table (DHT)
  - Examples: Tapestry, Chord, CAN, Pastry
  - Tapestry: prefix routing

Benefits of DOLR and DHT
  - Excel at locating objects by ID and locating object replicas
  - Example: Tapestry API
      Publish / Unpublish (Object ID)
      RouteToNode (Node ID)
      RouteToObject (Object ID)

[Figure: Tapestry routing mesh with example node IDs (0xE932, 0xE324, 0xEF32, 0x0999, 0x099F, 0xE399, 0xEF40, 0xEF34, 0xEFBA, 0xEF37, ...)]

Slide 3: Approximate DOLR

Problem of Current DOLR and DHT
  - Locating relies on Globally Unique IDs (GUIDs)
  - The GUID is obtained by hashing, so only exactly identical objects can be found

Approximate DOLR (ADOLR)
  - Approximate objects (AOs): similar in content but not identical
  - Approximate objects are described using a set of features:
      AO ≡ Feature Vector (FV) = {f_1, f_2, f_3, …, f_n}
  - Locating AOs in the P2P network ≡ finding all AOs in the network with
      |FV* ∩ FV| ≥ THRES, 0 < THRES ≤ |FV|

Primitives
  - PublishApproxObject (Object ID, FV) / UnpublishApproxObject (Object ID, FV)
  - RouteToApproxObject (FV, THRES)
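The matching condition on this slide can be written out as a few lines of Python. This is only a sketch of the set condition |FV* ∩ FV| ≥ THRES, not the authors' code: the function names `matches` and `locate` are hypothetical, the query vector is assumed to play the role of FV (so THRES is bounded by its size), and `locate` is a brute-force scan used as a reference definition rather than an overlay lookup.

```python
def matches(query_fv, object_fv, thres):
    """True if the two feature vectors share at least `thres` features."""
    query, obj = set(query_fv), set(object_fv)
    # Slide constraint: 0 < THRES <= |FV| (FV taken to be the query here, an assumption).
    if not 0 < thres <= len(query):
        raise ValueError("THRES must satisfy 0 < THRES <= |FV|")
    return len(query & obj) >= thres


def locate(query_fv, published, thres):
    """Reference definition: scan every published object (hypothetical helper)."""
    return sorted(oid for oid, fv in published.items()
                  if matches(query_fv, fv, thres))


# Illustrative data only: object ID -> feature vector.
published = {"A": {10, 11}, "B": {11}, "C": {12}}
print(locate({10, 11, 12}, published, 2))  # ['A']
print(locate({10, 11, 12}, published, 1))  # ['A', 'B', 'C']
```

Lowering THRES widens the match: with THRES = 1 any object sharing a single feature with the query is returned.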

Slide 4: Potential Applications of ADOLR

Approximate Text Addressing
  - Problem: find similar document copies in the P2P network
  - Feature Vector: a text fingerprint vector
  - Applications: P2P spam filter, P2P content-based "e-pinions", etc.

Database Queries on P2P
  - Hash the values of a tuple into a feature vector
  - Feature Vector: hashes of the values to query
  - Approximate query: THRES < |FV|

Media Retrieval on P2P
  - Most pattern recognition results for images or video are represented as a feature vector
  - Discretize feature values to use ADOLR

Slide 5: ADOLR on Tapestry

A substrate on top of Tapestry

Feature Object
  - The set of IDs of all objects matching a feature value

PublishApproxObject
  - Add the Object ID to all involved Feature Objects
  - Publish new Feature Objects if needed

RouteToApproxObject
  - Look up all involved Feature Objects
  - Count the occurrences of each ID and compare with THRES
  - RouteToObject

[Figure: ADOLR layer over the Tapestry overlay. Feature objects: 10 → A; 11 → A,B; 12 → C; 20 → D,E; 23 → F. Example call RouteToApproxObj({10,11,12}, 2): the lookups return 10 → A, 11 → A,B and 12 → C; A occurs twice, which meets THRES = 2, so the call ends with RouteToObj(A).]
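The publish and lookup steps on this slide can be sketched with an in-memory index. This is a minimal sketch under stated assumptions, not the Tapestry implementation: the class `FeatureIndex` and its method names are hypothetical, a plain dictionary stands in for feature objects that would really be published in the overlay, and the final RouteToObject hop is replaced by simply returning the matching IDs.

```python
from collections import Counter, defaultdict


class FeatureIndex:
    """In-memory stand-in for feature objects (hypothetical, no networking)."""

    def __init__(self):
        # feature value -> set of object IDs, i.e. one "feature object" per key
        self.feature_objects = defaultdict(set)

    def publish_approx_object(self, object_id, fv):
        # Add the object ID to every involved feature object,
        # creating the feature object if it does not exist yet.
        for feature in fv:
            self.feature_objects[feature].add(object_id)

    def unpublish_approx_object(self, object_id, fv):
        for feature in fv:
            ids = self.feature_objects.get(feature)
            if ids is None:
                continue
            ids.discard(object_id)
            if not ids:
                del self.feature_objects[feature]

    def route_to_approx_object(self, fv, thres):
        # Look up every involved feature object and count ID occurrences.
        counts = Counter()
        for feature in set(fv):
            counts.update(self.feature_objects.get(feature, ()))
        # IDs reaching THRES would be passed to RouteToObject in the real system.
        return sorted(oid for oid, n in counts.items() if n >= thres)


# Data taken from the slide's figure.
index = FeatureIndex()
for oid, fv in {"A": [10, 11], "B": [11], "C": [12],
                "D": [20], "E": [20], "F": [23]}.items():
    index.publish_approx_object(oid, fv)

print(index.route_to_approx_object([10, 11, 12], 2))  # ['A']
```

Each query costs one lookup per feature, which is why the size of the feature vector shows up later as a bandwidth concern.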

Slide 6: Approximate Text Addressing

Fingerprint Vector [manber94finding]
  - Divide the document (length n) into (n-L+1) overlapping substrings of length L
  - Calculate checksums of the substrings
  - The largest N checksums → FV

Parameters
  - Length of substrings: L
  - Length of checksums: L_ck
  - FV size: |FV| = N

Two sets of experiments
  - "Similarity": are slightly changed documents digested into the same or a similar FV? The higher the better (1 - false-negative rate). We developed an analytical model for this.
  - "False positive": are totally different documents digested into the same or a similar FV? The lower the better.
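The three fingerprinting steps can be sketched as follows. The slides do not name the checksum function, so CRC-32 from Python's `zlib` is used purely as a stand-in; the function name `fingerprint_vector`, the default parameter values, and the choice to drop duplicate checksums before taking the largest N are all assumptions made for illustration.

```python
import zlib


def fingerprint_vector(text, L=8, N=4, ck_bits=32):
    """Largest N checksums over all overlapping substrings of length L.

    CRC-32 is a stand-in checksum (an assumption); ck_bits plays the role
    of the checksum length L_ck by truncating each checksum.
    """
    data = text.encode("utf-8")
    if len(data) < L:
        return []
    mask = (1 << ck_bits) - 1
    # n - L + 1 overlapping substrings; the set removes repeated checksums.
    checksums = {zlib.crc32(data[i:i + L]) & mask
                 for i in range(len(data) - L + 1)}
    return sorted(checksums, reverse=True)[:N]


original = "the quick brown fox jumps over the lazy dog"
modified = "the quick brown fox jumps over the lazy cat"

fv_a = fingerprint_vector(original)
fv_b = fingerprint_vector(modified)
# Only substrings touching the edit change, so some fingerprints may survive.
print(len(set(fv_a) & set(fv_b)), "of", len(fv_a), "fingerprints shared")
```

Because a local edit only disturbs the substrings that overlap it, the FV of a lightly modified document tends to overlap the original's, which is what the "Similarity" experiments measure.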

Slide 7: Spam Filtering on P2P Networks

Fingerprint Vectors for Spam Filtering
  - Length of substring (L): one to several phrases
  - |FV|: large enough to avoid collisions
  - THRES is chosen from the results of the "Similarity" test and the "False Positive" test

Locating Spam Using the Extended API
  - Vote: RouteToApproxObject() to vote on an existing entry, or PublishApproxObject() to publish a new one
  - Check: RouteToApproxObject() to get the current votes

Performance Considerations
  - N↑ → accuracy↑, network bandwidth consumption↑
  - Non-spam is never published in the network, so a query must route all the way to the ROOT before getting a negative result
  - Solution: TTL (a tradeoff between accuracy and bandwidth)
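The vote and check operations can be sketched on top of the same feature-object idea. This is a local, hypothetical model only: the class `SpamVotes`, its internal message IDs, and the policy of adding one vote to every matching entry are assumptions, and neither overlay routing nor the TTL cutoff is modeled.

```python
from collections import Counter, defaultdict


class SpamVotes:
    """Toy vote/check model over fingerprint vectors (hypothetical, no networking)."""

    def __init__(self, thres):
        self.thres = thres
        self.features = defaultdict(set)  # fingerprint -> IDs of published spam entries
        self.votes = {}                   # entry ID -> number of votes
        self._next_id = 0

    def _lookup(self, fv):
        # Stand-in for RouteToApproxObject(FV, THRES).
        counts = Counter()
        for f in set(fv):
            counts.update(self.features.get(f, ()))
        return [mid for mid, n in counts.items() if n >= self.thres]

    def vote(self, fv):
        found = self._lookup(fv)
        if found:
            for mid in found:
                self.votes[mid] += 1
        else:
            # Stand-in for PublishApproxObject(): first report of this spam.
            mid, self._next_id = self._next_id, self._next_id + 1
            for f in set(fv):
                self.features[f].add(mid)
            self.votes[mid] = 1

    def check(self, fv):
        # Highest vote count among matching entries, 0 if nothing matches.
        return max((self.votes[mid] for mid in self._lookup(fv)), default=0)


filt = SpamVotes(thres=2)
filt.vote([1, 2, 3])           # first report publishes a new entry
filt.vote([1, 2, 9])           # modified copy matches 2 of 3 fingerprints
print(filt.check([1, 2, 7]))   # 2
print(filt.check([4, 5, 6]))   # 0
```

In this toy model a miss is free; in the overlay it is the expensive case, since a never-published message has no feature object to find short of the root, which is the cost the TTL is meant to bound.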

Slide 8: Evaluation of FV on Random Text

Slide 9: Evaluation of FV on Real Emails

Spam (29631 junk emails from …)
  - … (unique)
  - 5630 (exact copies)
  - 9076 (modified copies of 4585 unique ones)
  - 86% of spam ≤ 5K

Normal emails
  - 9589 (total) = 50% newsgroup posts + 50% personal emails

"Similarity" Test: 3440 modified copies of 39 emails, 5~629 copies each
  [Table: THRES | Detected | Fail | %. Rows for three THRES values, the first being 3/…; cell values lost in the transcript]

"False Positive" Test: 9589 (normal) × 14925 (spam) pairs
  [Table: Match FP | # pair | probability. Rows 1/… (probability …e-6), 2/… (probability …e-8) and >2/…; remaining cell values lost in the transcript]

Slide 10: Evaluation and Status

Effective Fingerprint Routing w/ TTL
  - Network of 5000 nodes
  - Diameter latency = 400 ms
  - 4096 Tapestry nodes

Status
  - Approximate Text Addressing prototype implemented on Tapestry
  - SpamWatch: P2P spam filtering system prototype implemented
  - Outlook add-in usable
  - See our DEMO!
  - Website: