Download presentation
Presentation is loading. Please wait.
Published bySherman McCormick Modified over 9 years ago
1
Approximate Object Location and Spam Filtering on Peer-to-Peer Systems Feng Zhou, Li Zhuang, Ben Y. Zhao, Ling Huang, Anthony D. Joseph and John D. Kubiatowicz University of California, Berkeley
2
ACM Middleware 2003 ravenben@eecs.berkeley.edu 2 The Problem of Spam Spam Unsolicited, automated emails Radicati Group: $20B cost in 2003, $198B in 2007 Proposed solutions Economic model for spam prevention Attach cost to mass email distribution Weakness: needs wide-spread deployment, prevent but not filter Bayesian network / machine learning (independent) “Train” mailer with spam, rely on recognizing words / patterns Weakness: key words can be masked (images, invis. characters) Collaborative filtering Store / query for spam signatures on central repository Other users query signatures to filter out incoming spam Weakness: central repository limited in bandwidth, computation
3
ACM Middleware 2003 ravenben@eecs.berkeley.edu 3 Our Contribution Can signatures effectively detect modified spam? Goals: Minimize false positives (marking good email as spam) Recognize modified/customized spam as same as original Present signature scheme based on approx. fingerprints Evaluate against random text and real email messages Can we build a scalable, resilient signature repository Leverage structured peer-to-peer networks Constrain query latency and limit bandwidth usage Orthogonal issues we do not address: Preprocessing emails to extract content Interpreting collective votes via reputation systems
4
ACM Middleware 2003 ravenben@eecs.berkeley.edu 4 Outline Introduction An Approximate Signature Scheme Evaluation using random text and real emails Approximate object location Similarity search on P2P systems Constraining latency and bandwidth usage Conclusion
5
ACM Middleware 2003 ravenben@eecs.berkeley.edu 5 Collaborative Spam Filtering 1. Spam sent to user A 2. Signatures stored 3. Spam sent to user B 4. B queries repository 5. B filters out spam store signature query response
6
ACM Middleware 2003 ravenben@eecs.berkeley.edu 6 An Approximate Signature Scheme How to match documents with very similar content Calculate checksums of all substrings of length L Select deterministic set of N checksums A matches B iff |sig(A) sig(B)| > Threshold Computation tput: 13MByte/s on P-III 1Ghz A BCDE F Sort by value checksum Choose NRandomize Vector Email Message
7
ACM Middleware 2003 ravenben@eecs.berkeley.edu 7 Accuracy of Signature Vectors 10000 random text documents, size = 5KB, calculate 10 signatures Compare signatures of before and after modifications Analytical results match experimental results
8
ACM Middleware 2003 ravenben@eecs.berkeley.edu 8 Eliminating False positives Compare pair-wise signatures between 10000 random docs None matched 3 of 10 signatures (100,000,000 pairs)
9
ACM Middleware 2003 ravenben@eecs.berkeley.edu 9 Evaluation on Real Messages 29631 Spam Emails from www.spamarchive.orgwww.spamarchive.org Processed visually by project members 14925 (unique), 86% of spam ≤ 5K Robustness to modification test Most popular 39 msgs have 3440 modified copies Examine recognition between copies and originals THRESDetectedFailed% 3/1033568497.56 4/10317226892.21 5/10296747386.25
10
ACM Middleware 2003 ravenben@eecs.berkeley.edu 10 False Positive Test Non-spam emails 9589 messages: 50% newsgroup posts + 50% personal emails Compare against 14925 unique spam messages Sweet spot, using threshold of 3/10 signatures Recognition rate > 97.5% False positive rate < 1 in 140 million pairs THRES# of pairsProbability 1/102701.89e-6 2/1042.79e-8 3/1000
11
ACM Middleware 2003 ravenben@eecs.berkeley.edu 11 A Distributed Signature Repository? store signature query response How do we build a scalable distributed repository? How do we limit bandwidth consumption and latency?
12
ACM Middleware 2003 ravenben@eecs.berkeley.edu 12 Structured Peer-to-Peer Overlays Storage / query via structured P2P overlay networks Large sparse ID space N (160 bits: 0 – 2 160 ) Nodes in overlay network have nodeIDs N Given k N, overlay deterministically maps k to its root node (a live node in the network) E.g. Chord, Pastry, Tapestry, Kademlia, Skipnet, etc… Decentralized Object Location and Routing (DOLR) Objects identified by Globally Unique IDs (GUIDs) N Decentralized directory service for endpoints/objects Route messages to nearest available endpoint Object location with locality: routing stretch (overlay location / shortest distance) O(1)
13
ACM Middleware 2003 ravenben@eecs.berkeley.edu 13 DOLR on Tapestry Routing Mesh Object O Server Client Root(O) Client
14
ACM Middleware 2003 ravenben@eecs.berkeley.edu 14 More Than Just Unique Identifiers Objects named by Globally Unique ID (GUID) Application maps secondary characteristics to ID: versioning, modified replicas, app-specific info Simplify the search problem out of m search fields, or “features,” find objects matching at least n exactly Simpler Queries More Powerful QLs DHT/DOLRs Databases Search on Features
15
ACM Middleware 2003 ravenben@eecs.berkeley.edu 15 A Layered Perspective ADOLR layer Introduce naming mapping from feature vector to GUIDs Rely on overlay infrastructure for storage Abstraction of feature vectors as approximate names for object(s) IP Network Layer Structured P2P Overlay Network (DOLR) Approximate DOLR/DHT route (IPAddr, msg) Publish (GUID) route (GUID, msg) approxPublish(FV,GUID) approxSearch(FV) route (FV, msg)
16
ACM Middleware 2003 ravenben@eecs.berkeley.edu 16 Marking a New Spam Message Signatures stored as inverted index (feature object) inside overlay User on C gets spam E 2, calculates signatures S: {S 1, S 2, S 3 } For each feature in S, if feature object exists, add E 2 If no feature object exists, create one locally and publish node C E 2 Overlay S 2 E 2 put (S 1, E 2 ) S 3 E 2 node A E 3 S 1 E 3 X X put (S 2, E 2 ) S 1 E 3, E 2 put (S 3, E 2 )
17
ACM Middleware 2003 ravenben@eecs.berkeley.edu 17 Filtering New Emails for Spam User at node D receives new email E 2 ’ with signatures {S 1, S 2, S 4 } Queries overlay for signatures, retrieve matching GUIDs for each Threshold = 2/3, contact GUIDs that occur in 2 of 3 result sets Contact E 2 via overlay for any additional info (votes etc) node C E 2 Overlay S 2 E 2 S 3 E 2 node A E 3 S 1 E 3, E 2 node D X S4S4 S2S2 S1S1 E 2 ’ E 2
18
ACM Middleware 2003 ravenben@eecs.berkeley.edu 18 Constraining Bandwidth and Latency Need to constrain bandwidth and latency Limit signature query to h overlay hops Return null set if h hops reached without result Simulation on transit-stub topologies 5K nodes, 4K overlay nodes, diameter = 400ms Each spam message only reaches small group For each message: % of users seen and marked = mark rate Measure tradeoff between latency and success rate of locating known spam, for different mark rates
19
ACM Middleware 2003 ravenben@eecs.berkeley.edu 19 Simulation Result network diameter
20
ACM Middleware 2003 ravenben@eecs.berkeley.edu 20 Feature-based Queries Approximate Text Addressing Text objects: features hashed signatures Applications: plagiarism detection, replica management Other feature-based searching Music similarity search Extract musical characteristics Signatures: {hash(field1=value1), hash(field2=value2)…} E.g. Fourier transform values, # of wavetable entries Image similarity search Locate files with similar geometric properties, etc.
21
ACM Middleware 2003 ravenben@eecs.berkeley.edu 21 Finally… Status ADOLR infrastructure implemented on Tapestry SpamWatch: P2P spam filter implemented, including Microsoft Outlook plug-in Available for download: http://www.cs.berkeley.edu/~zf/spamwatch http://www.cs.berkeley.edu/~zf/spamwatch Tapestry http://www.cs.berkeley.edu/~ravenben/tapestry http://www.cs.berkeley.edu/~ravenben/tapestry Thank you…Questions?
22
ACM Middleware 2003 ravenben@eecs.berkeley.edu 22 Backup Slides Follow
23
ACM Middleware 2003 ravenben@eecs.berkeley.edu 23 DHT on Routing Mesh
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.