
1 Cost-Effective Spam Detection in P2P File-Sharing Systems
LSDS-IR’08, www.ir.iit.edu
Dongmei Jia, Information Retrieval Lab, Illinois Institute of Technology
jia@ir.iit.edu

2 Goal
Create cost-effective ways of automatically detecting P2P spam results without downloading the actual files

3 Introduction
Spam:
– Any file that is deliberately misrepresented, or described so as to manipulate established retrieval and ranking techniques
Spam is harmful:
– Degrades the user search experience
– Assists the propagation of viruses in the network
– Significantly increases the P2P traffic load

4 Problem Statement
Naïve spam detection method:
– Download each file and check it manually
– Cons:
  Time- and labor-intensive
  Wastes bandwidth and storage resources
  Risks opening malware
Hence, automatic spam detection is needed!

5 eMule Example
Query (number of results), descriptors, group size, file key
It is hard to detect spam automatically in a query result set!

6 Types of Spam
Type 1: Files whose replicas have semantically different descriptors
– E.g., different song titles for the same key 26NZUBS655CC66COLKMWHUVJGUXRPVUF:
  “12 days after christmas.mp3”
  “i want you thalia.mp3”
  “comon be my girl.mp3”
  …

7 Types of Spam (Cont’d)
Type 2: Files with long descriptors that contain semantically nonsensical term combinations
– The single-descriptor problem
– E.g., a single replica descriptor for key 1200473A4BB17724194C5B9C271F3DC4:
  “Aerosmith, Van Halen, Quiet Riot, Kiss, Poison, Acdc, Accept, Def Leappard, Boney M, Megadeth, Metallica, Offspring, Beastie Boys, Run Dmc, Buckcherry, Salty Dog Remix.mp3”

8 Types of Spam (Cont’d)
Type 3: Files with descriptors that contain no query terms
– Ads, or warnings about the illegal distribution of copyrighted material
– E.g., “Can you afford 0.09 www.BuyLegalMP3.com.mp3”

9 Types of Spam (Cont’d)
Type 4: Files that are highly replicated on a single peer
– Normal users do not create multiple replicas of the same file on a single server
– Manipulates the “group size” ranking
– E.g., 177 replicas of the file DY2QXX3MYW75SRCWSSUG6GY3FS7N7YC shared on a single peer

10 Feature-Based Spam Detection
Basic idea: detect spam results using P2P features that are strongly correlated with spam
– Vocabulary size of a file’s group descriptor
– Variance of terms between a replica descriptor D and its file group G:
  Jaccard distance: 1 − |D ∩ G| / |D ∪ G|
  Cosine distance: 1 − (V_G · V_D) / (|V_G| |V_D|)
– Per-host replication degree of a file: numRep / numHost
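A minimal sketch of how these features could be computed, assuming each replica descriptor has been tokenized into a list of terms; the function names and the dict-based feature vector are illustrative, not the authors' implementation:

```python
from collections import Counter
from math import sqrt

def jaccard_distance(d, g):
    """Jaccard distance between a replica descriptor d and the
    group descriptor g, each treated as a set of terms."""
    d, g = set(d), set(g)
    return 1 - len(d & g) / len(d | g)

def cosine_distance(d, g):
    """Cosine distance between the term-frequency vectors of d and g."""
    vd, vg = Counter(d), Counter(g)
    dot = sum(vd[t] * vg[t] for t in vd)
    norm = (sqrt(sum(c * c for c in vd.values()))
            * sqrt(sum(c * c for c in vg.values())))
    return 1 - dot / norm if norm else 1.0

def spam_features(replica_descriptors, num_replicas, num_hosts):
    """Feature vector for one file group: group vocabulary size,
    average descriptor-to-group distances, per-host replication degree."""
    group = [t for d in replica_descriptors for t in d]  # group descriptor
    n = len(replica_descriptors)
    return {
        "vocab_size": len(set(group)),
        "avg_jaccard": sum(jaccard_distance(d, group) for d in replica_descriptors) / n,
        "avg_cosine": sum(cosine_distance(d, group) for d in replica_descriptors) / n,
        "per_host_replication": num_replicas / num_hosts,
    }
```

High vocabulary size flags Type-2 spam, high distances flag Type-1, and a high per-host replication degree flags Type-4.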

11 Probe Query
Problem:
– Results have insufficient and biased description info (conjunctive query matching)
Solution:
– Gather more info for a result from the network:
  Other replica descriptors of the file
  Statistics of peers who share the file (num of files, num of unique files, peer ID)
– Implementation: the probe contains only a file key, not a “term” query
– Intuition:
  Probing helps create a more complete view of a file
  Ranking is more effective with more adequate file info
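The aggregation step can be sketched as follows; the response field names (`peer_id`, `num_files`, `num_unique_files`) are hypothetical placeholders for the statistics listed above:

```python
def merge_probe_responses(initial_descriptor, responses):
    """Combine the descriptor seen in the regular query result with the
    replica descriptors and peer statistics returned by a probe query,
    yielding a more complete view of the probed file."""
    descriptors = [initial_descriptor]
    peer_stats = {}
    for r in responses:  # one response per probed peer
        descriptors.append(r["descriptor"])
        peer_stats[r["peer_id"]] = {
            "num_files": r["num_files"],
            "num_unique_files": r["num_unique_files"],
        }
    return {"descriptors": descriptors, "peer_stats": peer_stats}
```

The merged descriptor list is what the group-level features (vocabulary size, Jaccard/Cosine distance) would then be computed over.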

12 Evaluation
Dataset:
– P2P audio files crawled from the Gnutella network: numRep = 25,137,217; numFile = 9,575,113; numPeer = 226,786
– 50 most popular queries in the crawled dataset (representative of most users, and a more likely target for spam)
Metric:
– Number of spam results in the top-N ranked results, especially for small N
Effectiveness:
– Improves performance by 9% for the top-200 results and by 92.5% for the top-20 results (base case: noprobe+numRep)

13 Cost Control
Tradeoff: performance vs. cost
Cost: number of responses for regular queries and probe queries
Problem: network cost is dramatically increased by probing
How can the cost be reduced?

14 Cost Control Approaches
– Random sampling of probe query results
– Piggy-backing of descriptor data in probe queries
– Limiting the scope of probing

15 Random Sampling
Server-side random sampling of probe query results
– Each matching result is returned with a predefined probability P, 0 ≤ P ≤ 1
– Predictably reduces the expected cost by a factor of P
– What is the impact on the effectiveness of spam detection?
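The server-side sampling step amounts to an independent coin flip per matching replica; a minimal sketch (function name assumed):

```python
import random

def respond_to_probe(matching_replicas, p):
    """Server-side random sampling: each matching replica descriptor is
    returned with predefined probability p, so the expected response
    volume (and hence network cost) shrinks by a factor of p."""
    return [r for r in matching_replicas if random.random() < p]
```

With p = 1 the server behaves like an unsampled probe; with p = 0.25 it returns roughly a quarter of the matching replicas on average.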

16 Experimental Results
– Cost is reduced significantly by sampling fewer probe results
– In all sampling cases, overall performance is still 1.7%–9% better than noprobe, and performance for the top-20 results is 71%–92% better than noprobe
– But the cost is still high: with 25% sampling, the cost is ~7 times higher than noprobe

17 Piggy-backing of Descriptor Data
Piggy-backing of descriptor data in probe queries
– New type of probe query: file key + the descriptor of the result file being probed
– The server does not respond if its descriptor contains no new terms compared with the descriptor in the probe query
– Goal: limit the number of probe results returned to the client
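The server-side filter reduces to a set-difference test; a minimal sketch, with the function name assumed:

```python
def should_respond(server_descriptor, piggybacked_descriptor):
    """Piggy-backed probe: the client sends the descriptor it already has
    along with the file key; the server answers only if its own replica
    descriptor contributes at least one term the client has not yet seen."""
    return bool(set(server_descriptor) - set(piggybacked_descriptor))
```

Responses that would add no new vocabulary are suppressed, which is why total cost drops while the group-descriptor features stay largely intact.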

18 Experimental Results
– Compared with the original type of probe, total cost decreases by 35%–39% for all sampling rates
– Compared with the original type of probe, overall performance drops by ~15%
– E.g., the cost with sampling rate 0.25 is ~4 times higher than noprobe
– However, performance for the top-20 results is still improved by 71%–88% in all sampling cases

19 Limiting Probing Scope
Limiting the scope of probing
– Only probe a few top-ranked (e.g., top-20) regular query results
– Intuition: users tend to consider downloading a file only from a few top-ranked results
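A minimal sketch of scope-limited re-ranking: only the head of the ranked list incurs probing cost, while the tail keeps its original order (function and parameter names assumed):

```python
def rerank_with_limited_probing(results, probe_and_score, k=20):
    """Probe and re-rank only the k top-ranked regular query results;
    lower-ranked results, which users rarely consider, are left unprobed
    and keep their original order below the re-ranked head."""
    head = sorted(results[:k], key=probe_and_score)  # probing cost paid here only
    return head + results[k:]
```

`probe_and_score` stands in for the full probe-then-feature-scoring pipeline; only k probes are ever issued, which is where the cost reduction comes from.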

20 Experimental Results
– Performance of probing only the top-20 results is always 22%–56% better than noprobe
– Probing only the top-20 results significantly reduces cost
– E.g., the cost with sampling rate 0.25 is only twice that of noprobe

21 Conclusion
Feature-based spam detection techniques successfully decrease the amount of spam
– By 9% in the top-200 results and 92% in the top-20 results
Cost control methods are effective in reducing network cost
– The factor of cost increase over noprobe drops from 7 to 2
– At the same time, performance is at least 22% better than noprobe for the top-20 results

22 References
– LimeWire junk filter. http://wiki.limewire.org/index.php?title=Junk_Filter
– J. Liang, R. Kumar, Y. Xi, and K. Ross. Pollution in P2P File Sharing Systems. In Proc. INFOCOM’05, May 2005.
– K. Svore, Q. Wu, C. J. C. Burges, and A. Raman. Improving Web Spam Classification Using Rank-time Features. In Proc. AIRWeb Workshop at WWW, 2007.
– S. Hershkop and S. J. Stolfo. Combining Email Models for False Positive Reduction. In Proc. KDD’05, Chicago, Aug. 2005.
– P. A. Chirita, J. Diederich, and W. Nejdl. MailRank: Using Ranking for Spam Detection. In Proc. CIKM’05, Bremen, Germany, 2005.
– A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting Spam Web Pages through Content Analysis. In Proc. WWW’06.
– S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina. The EigenTrust Algorithm for Reputation Management in P2P Networks. In Proc. WWW’03.
– Z. Gyöngyi, P. Berkhin, H. Garcia-Molina, and J. Pedersen. Link Spam Detection Based on Mass Estimation. In Proc. VLDB’06, 439–450.
– LimeWire. www.limewire.org
– R. Zhou and K. Hwang. Gossip-based Reputation Aggregation for Unstructured Peer-to-Peer Networks. In Proc. IPDPS’07, Los Angeles, March 2007.
– K. Walsh and E. G. Sirer. Experience with an Object Reputation System for Peer-to-Peer Filesharing. In Proc. NSDI’06.
– U. Lee, M. Choi, J. Cho, M. Y. Sanadidi, and M. Gerla. Understanding Pollution Dynamics in P2P File Sharing. In Proc. IPTPS’06.

23 Questions?
Contact info:
– WWW: www.ir.iit.edu
– Email: jia@ir.iit.edu
Thanks from IIT’s IR Lab!

24 Related Work
Email spam detection:
– Hershkop et al., KDD’05: analyze email content and syntax
– Chirita et al., CIKM’05: construct social networks from email addresses
Web spam detection:
– Ntoulas et al., WWW’06: analyze the content of Web pages
– Gyongyi et al., VLDB’06: analyze the link structure of Web pages

25 Related Work (Cont’d)
P2P spam detection:
– Spam filter in LimeWire: user-controlled spam learning
– Liang et al., INFOCOM’05: detect spam using extra info, e.g., the official CD length of a media file
– Kamvar et al., WWW’03: build reputation systems to rank peers

26 Simulating P2P Search
Built a system to simulate P2P search on the client side
Simulating query routing:
– A query is randomly sent to 50 peers at a time
– Repeat until either stop condition is satisfied:
  Condition 1: the number of unique results reaches 200
  Condition 2: the number of peers that have received the query reaches 50K
– Threshold values were chosen based on the specifications of real-world P2P systems (e.g., LimeWire’s Gnutella)
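The routing loop above can be sketched as follows; `search_fn` stands in for a peer's local query matching, and all names are illustrative rather than the authors' simulator:

```python
import random

def simulate_query_routing(query, peers, search_fn,
                           max_results=200, max_peers=50_000, batch=50):
    """Simulated query routing: repeatedly send the query to a batch of 50
    random not-yet-queried peers, stopping once 200 unique results have been
    collected or 50K peers have received the query (the slide's thresholds)."""
    unique_results = set()
    remaining = list(peers)
    random.shuffle(remaining)           # randomize the order peers are contacted in
    num_queried = 0
    while (remaining
           and len(unique_results) < max_results
           and num_queried < max_peers):
        for peer in remaining[:batch]:  # one batch of 50 peers per round
            num_queried += 1
            unique_results.update(search_fn(peer, query))
        remaining = remaining[batch:]
    return unique_results, num_queried
```

Checking the stop conditions only between batches mirrors the round-based routing described above: a round completes even if the thresholds are crossed mid-batch.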

27 Experimental Results
– Compared with noprobe+numRep, probe+Cosine improves performance by 9% for the top-200 results and by 92.5% for the top-20 results
– Compared with noprobe+CosineQD, the improvements are 21.6% and 97.8%, respectively
(Figure legend: noprobe+numRep, probe+Cosine, noprobe+CosineQD, probe+numUniqueTerms, probe+Jaccard)

28 Experimental Results (Cont’d)
To compare the Cosine/Jaccard distances with numUniqueTerms fairly, only multi-replica files are considered

