SDP-MARCH-Talk: Malicious Task Detection (恶意任务检测). 姚大海, 2013/11/24. Papers: Characterizing and Detecting Malicious Crowdsourcing; Detecting Deceptive Opinion Spam Using Human Computation.

Presentation transcript:

SDP-MARCH-Talk: Malicious Task Detection, 姚大海, 2013/11/24

papers
– Characterizing and Detecting Malicious Crowdsourcing
– Detecting Deceptive Opinion Spam Using Human Computation
– SmartNotes: Application of Crowdsourcing to the Detection of Web Threats

paper 1: Characterizing and Detecting Malicious Crowdsourcing

outline
– malicious crowdsourcing
– measured datasets
– some initial results

malicious crowdsourcing: increasing secrecy
– tracking jobs is more difficult, and tracking attempts are easier to detect
– details of a task are only revealed to workers who take it on
– worker accounts require association with phone numbers or bank accounts

malicious crowdsourcing: behavioral signatures
– output from crowdturfing tasks is likely to display specific patterns that distinguish it from "organically" generated content.
– signatures
  – worker accounts (their behavior)
  – content (bursts of content generation when tasks are first posted)
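The content signature (a burst of posts right after a task is published) can be illustrated with a small sketch. This is not the authors' detector: the function name `early_burst_ratio`, the one-hour window, and the sample timestamps are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def early_burst_ratio(post_times, task_posted_at, window=timedelta(hours=1)):
    """Fraction of posts created within `window` after the task was posted.

    The one-hour default is an illustrative assumption, not a value
    taken from the paper.
    """
    if not post_times:
        return 0.0
    window_end = task_posted_at + window
    early = sum(1 for t in post_times if task_posted_at <= t < window_end)
    return early / len(post_times)


# Hypothetical campaign: four posts in the first 20 minutes, one five hours later.
task_time = datetime(2013, 1, 1, 12, 0)
times = [task_time + timedelta(minutes=m) for m in (1, 3, 5, 20, 300)]
ratio = early_burst_ratio(times, task_time)
```

A ratio close to 1 means almost all content arrived right after the task appeared, which is the bursty pattern the slide describes; organic content would be expected to spread out more evenly.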

malicious crowdsourcing: our methodology
– we limit our scope to campaigns that target microblogging platforms (Sina Weibo).
– first, we gather "ground truth" content generated by turfers and "organic" content generated by normal users.
– second, we compare and contrast these datasets.
– our end goal is to develop detectors by testing them against new crowdturfing campaigns as they arrive.

measured datasets
– crowdturf accounts on Weibo: downloaded full user profiles of the Weibo account IDs
– crowdturf campaigns: crawled tweets, retweets and comments of campaigns
– 61.5 million tweets, 118 million comments and 86 million retweets (~2013.1)

some initial results
– turfers tend to straddle the line between malicious and normal users.
– crowdturfing campaigns have a higher ratio of repeated users.
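The slides do not define "ratio of repeated users". A minimal sketch, assuming a user counts as repeated when the same account posts more than once within one campaign (the function name and sample IDs are hypothetical):

```python
from collections import Counter


def repeated_user_ratio(user_ids):
    """Share of distinct users who appear more than once in one campaign.

    Assumption: "repeated" means the same account ID posts two or more
    times; the slides do not give the exact definition.
    """
    counts = Counter(user_ids)
    if not counts:
        return 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / len(counts)


# Hypothetical author IDs for the posts of one campaign and one organic topic.
campaign_posts = ["u1", "u2", "u1", "u3", "u2", "u1"]
organic_posts = ["a", "b", "c", "d"]

campaign_ratio = repeated_user_ratio(campaign_posts)
organic_ratio = repeated_user_ratio(organic_posts)
```

Comparing the two ratios across many campaigns and organic topics is one way the "compare and contrast" step of the methodology could be carried out.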

paper 2: Detecting Deceptive Opinion Spam Using Human Computation

outline
– introduction
– data preparation
– human assessor measurements
– writing style measurements
– classifier measurements
– hybrid measurements
– conclusion

introduction
– review spam
  – hyper spam: positive reviews
  – defaming spam: negative reviews
– limitation of related work
  – focus on hyper spam

data preparation
– truthful reviews (each of 8 products)
  – 25 highly-rated reviews
  – 25 low-rated reviews
– fake reviews (created on AMT)
  – 25 highly-rated reviews
  – 25 low-rated reviews

human assessor measurements
– balanced: 5 truthful and 5 deceptive reviews
– random: n deceptive reviews and (10-n) truthful reviews
1. students performed better than the crowd, but the difference was not significant.
2. detecting highly-rated reviews is easier than detecting low-rated reviews.
– an assessor has a "default" belief that a review must be true.

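writing style measurements
– three linguistic qualities (language metrics)
  – polarity
  – sentiment
  – readability: ARI (Automated Readability Index)
    ARI = 4.71 (C/W) + 0.5 (W/S) - 21.43
    C: number of characters, W: number of words, S: number of sentences
– sentiment API at text-processing.com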
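The readability score can be computed directly from the C, W and S counts using the standard Automated Readability Index formula. The sketch below is illustrative only: the tokenization (whitespace-split words, sentences split on `.`, `!`, `?`, letters and digits counted as characters) is an assumption and may differ from what the paper's authors used.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (C / W) + 0.5 * (W / S) - 21.43.

    C counts letters and digits, W counts whitespace-separated words,
    S counts non-empty chunks between '.', '!' and '?'. These counting
    rules are simplifying assumptions.
    """
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(1 for ch in text if ch.isalnum())
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


simple = automated_readability_index("The room was clean. We liked it.")
complex_ = automated_readability_index(
    "Extraordinary accommodations exceeded expectations."
)
```

A higher score corresponds to text that demands a higher reading level, so the second example scores well above the first.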

writing style measurements
– truthful reviews require higher readability (ARI) than fake reviews
– highly-rated reviews require higher readability (ARI) than low-rated reviews

classifier measurements
– QuickLM language model toolkit
– language model, sentiment score and ARI as the feature set input to an SVM
– our classifier outperformed our human and crowd assessors.
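To make the language-model feature concrete, here is a rough sketch of how such a feature could be built. It does not use QuickLM: a unigram model with add-one smoothing is swapped in for illustration, and the function names, the tiny training corpora, and the placeholder sentiment and ARI values are all hypothetical.

```python
import math
from collections import Counter


def train_unigram(docs):
    """Count word frequencies over a list of documents."""
    counts = Counter(w for d in docs for w in d.lower().split())
    return counts, sum(counts.values())


def log_prob(text, model, vocab_size):
    """Add-one smoothed unigram log-probability of `text`."""
    counts, total = model
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size))
        for w in text.lower().split()
    )


def lm_feature(text, truthful_model, deceptive_model, vocab_size):
    """Positive when the text looks more like the truthful corpus."""
    return (log_prob(text, truthful_model, vocab_size)
            - log_prob(text, deceptive_model, vocab_size))


# Hypothetical one-document corpora, just to exercise the functions.
truthful = train_unigram(["the room was small but clean"])
deceptive = train_unigram(["amazing amazing hotel best stay ever"])
vocab_size = len(set(truthful[0]) | set(deceptive[0])) + 1  # +1 for unseen words

score = lm_feature("clean small room", truthful, deceptive, vocab_size)

# Feature vector that would be handed to an SVM; the sentiment and ARI
# values here are placeholders computed elsewhere.
features = [score, 0.2, 5.1]
```

In a full pipeline, each review would get one such vector, and the SVM would be trained on vectors labeled truthful or deceptive.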

hybrid measurements
– students and the crowd were given additional measurement data: sentiment scores and ARI scores
– providing assessors with meaningful metrics is likely to improve the quality of assessment.

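conclusion
– outlook: if crowd workers who are very familiar with the problem domain were used, would the results be better than automatic classification?
– question: the SVM performs better than the hybrid method, so why use the hybrid method at all?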

paper 3: SmartNotes: Application of Crowdsourcing to the Detection of Web Threats

outline
– introduction
– related work
– design of SmartNotes
– web scam detection technique

introduction
– two types of cybersecurity threats
  – threats created by factors outside the end user's control, such as security flaws in applications and protocols.
  – threats caused by the user's actions, such as phishing.
– ways to identify these websites
  – statistical analysis
  – blacklists

introduction
– our crowdsourcing approach
  – users report security threats
  – machine learning integrates their responses
– features
  – combining data from multiple sources
  – combining social bookmarking with question answering
  – applying machine learning and natural-language processing

related work
– social bookmarking
  – sharing bookmarks among users
– question answering
  – post questions and answer questions posed by others
– safe browsing: browser extensions
– web scam detection
  – closely related to spam detection
  – content based

design of SmartNotes: user interface
– Chrome browser extension
– post a comment or ask a question
– share your notes and questions with others
– analyze the current website

design of SmartNotes
– read and write notes, account...
– JavaScript and Chrome extension API
– machine learning algorithms
– collecting 43 features from 11 sources

web scam detection technique
We need a training set of websites labeled scam or non-scam to apply our supervised machine learning technique.
approaches to constructing a training set
1. Scam queries (random)
   – select 100 domain names from each query and submit them to AMT.
2. Web of Trust (scam)
   – 200 most recent discussion threads
3. Spam emails (scam)
   – 1551 spam emails from a corporate email system.
4. hpHosts (scam)
   – top 100 most recently reported websites on the blacklist
5. Top websites (non-scam)
   – top 100 websites according to the ranking on alexa.com.

validation & result
– F-measure: the harmonic mean of precision and recall
– AUC: the area under the ROC curve
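Both validation metrics can be computed with a few lines of plain Python. The sketch below assumes binary labels (1 = scam, 0 = non-scam) and made-up predictions; it illustrates the metric definitions and is not the evaluation code used for SmartNotes.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive example scores above a random negative one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels, hard predictions, and classifier scores.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.1]

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise AUC computation is quadratic in the number of examples, which is fine for training sets of a few hundred websites like the ones described above.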

Q&A