Juxtapp: A Scalable System for Detecting Code Reuse Among Android Applications
Steve Hanna, Ling Huang, Edward Wu, Saung Li, Charles Chen, and Dawn.

Presentation transcript:

• Juxtapp: A Scalable System for Detecting Code Reuse Among Android Applications
  Steve Hanna, Ling Huang, Edward Wu, Saung Li, Charles Chen, and Dawn Song
• On the Feasibility of Internet-Scale Author Identification
  Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Eui Chul Richard Shin, Dawn Song, Emil Stefanov

• Used to be applicable only to literary corpora and academia
• Source code similarity and plagiarism detection is very important
• "Moss" is the most widely known software similarity detection tool
• Can provide valuable insight into malware detection

• Generally not true
• In the Android app domain, it can be!
• 86% of Android malware samples are repackaged versions of legitimate apps with malicious payloads (source: "Dissecting Android Malware: Characterization and Evolution")
• Similarity detection is crucial

• Each Android app is packaged as an APK file, with a .apk extension
• Each APK file contains a .dex file, a Dalvik executable that is run by the Dalvik virtual machine
• Fingerprint the APK using bit hashing

• Application preprocessing: each app is segmented into basic blocks. Only the opcodes are retained, except for opcodes that store constant data, e.g. the const-string opcode. In that case the opcode is concatenated with the value it references.
• Feature extraction: k-grams of opcodes are extracted by sliding a window of size k over the opcodes, and each k-gram is hashed with the djb2 hash function. For each hash value, the corresponding bit in the bitvector is set.
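The feature extraction step can be sketched roughly as follows. This is a minimal illustration, not Juxtapp's actual code: the names `djb2` and `fingerprint`, the representation of opcodes as strings, the space separator used when joining a k-gram, the 32-bit truncation of the hash, and the choice to skip basic blocks shorter than k are all assumptions made for the example. A Python integer stands in for the bitvector.

```python
def djb2(data: bytes) -> int:
    """Classic djb2 string hash (h = h * 33 + byte), truncated to 32 bits here."""
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def fingerprint(basic_blocks, k=5, m=240007):
    """Return a bitvector (stored as a Python int) with one bit set per hashed k-gram.

    basic_blocks: list of basic blocks, each a list of opcode strings.
    Assumption: k-grams do not cross basic-block boundaries, so blocks
    shorter than k contribute nothing.
    """
    bits = 0
    for block in basic_blocks:
        # Slide a window of size k over the opcodes of this block.
        for start in range(len(block) - k + 1):
            gram = " ".join(block[start:start + k]).encode("utf-8")
            bits |= 1 << (djb2(gram) % m)
    return bits


if __name__ == "__main__":
    # Hypothetical opcode sequence for a single basic block.
    block = ["const-string:hello", "invoke-virtual", "move-result",
             "if-eqz", "const/4", "return"]
    fp = fingerprint([block])
    print(bin(fp).count("1"), "bit(s) set")
```

Storing the bitvector as a single integer keeps the sketch short; a real system working on many apps would more likely use a packed byte array.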

• K was set to 5, a value chosen by experiment. Pairs of apps were drawn from 6,000 randomly sampled apps and the distance between each pair was computed. From K = 5 onward, the value of K has little impact on the distance calculation.
• Basic block size: the mean is 5.35 opcodes and the median is 2 opcodes, while the largest basic block in the dataset contains opcodes

• The bitvector size m is chosen by experiment, with m >> N, where N is the number of k-grams extracted from an application
• A sample of apps was used to determine m: m = N_90 × 9 = 240,007, a prime number

• Given the bitvector representations of two apps A and B, their similarity is computed as:
  J(A, B) = |A ∧ B| / |A ∨ B|
• This formula is a variation of the original Jaccard similarity.
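As a small worked example of the formula, the sketch below computes the bitvector Jaccard similarity with Python integers standing in for bitvectors. The function name `jaccard_bits` and the convention that two empty bitvectors count as identical are assumptions for illustration only.

```python
def jaccard_bits(a: int, b: int) -> float:
    """Jaccard similarity of two bitvectors stored as Python ints.

    |A AND B| is the number of bits set in both vectors,
    |A OR B| is the number of bits set in either vector.
    """
    union = bin(a | b).count("1")
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    return bin(a & b).count("1") / union


if __name__ == "__main__":
    a = 0b1011
    b = 0b0110
    # a AND b = 0b0010 (1 bit), a OR b = 0b1111 (4 bits)
    print(jaccard_bits(a, b))       # 0.25
    print(1 - jaccard_bits(a, b))   # the corresponding distance, 0.75
```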

• If an app is heavily obfuscated, Juxtapp may not perform well
• Third-party libraries can add a lot of noise and adversely affect the similarity score

• Who wrote it?
• Identify an anonymous author by comparing their writing style against a corpus of texts of known authorship
• The primary application has shifted from the literary domain to forensics: terrorist threats, harassment

• 2.4 million posts from 100,000 blogs (almost a billion words)
• Stylometry: identify the author based on writing style
• Are N-gram techniques suitable? Not really, because they reveal more about the context than about the author

• Prepare a test set and a training set
• Build a classifier with the training set
• Test the classifier with the test set
• Which features should be considered?
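The train/test workflow above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say which classifier is used here, so a simple nearest-centroid classifier is swapped in, and the `FUNCTION_WORDS` list, the function names, and the (author, text) data layout are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical feature set: relative frequencies of a few common function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "i"]


def features(text):
    """Map a text to a vector of function-word relative frequencies."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]


def train(training_set):
    """Build one centroid per author from (author, text) pairs."""
    sums = {}
    n_texts = Counter()
    for author, text in training_set:
        vec = features(text)
        acc = sums.setdefault(author, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        n_texts[author] += 1
    return {a: [v / n_texts[a] for v in acc] for a, acc in sums.items()}


def predict(model, text):
    """Return the author whose centroid is closest in Euclidean distance."""
    vec = features(text)
    return min(model, key=lambda author: math.dist(model[author], vec))


def accuracy(model, test_set):
    """Fraction of (author, text) test pairs attributed correctly."""
    hits = sum(predict(model, text) == author for author, text in test_set)
    return hits / len(test_set)
```

Calling `train` on the training pairs and then `accuracy` on the held-out pairs mirrors the three steps listed above; the open question of which features to use corresponds to what `features` computes.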

• Syntax trees produced by the Stanford parser
• Yule's K:
  K = 10000 × (M − N) / N²
  N = total number of words in the text
  M = Σ_i i² × V_i, where V_i is the number of words that occur i times
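A small worked example of Yule's K, following the definition of N, M, and V_i given above. The function name `yules_k` and the choice to take an already tokenized word list as input are assumptions for illustration.

```python
from collections import Counter


def yules_k(words):
    """Yule's K = 10000 * (M - N) / N^2 for a list of word tokens.

    N is the total number of words; M is the sum over i of i^2 * V_i,
    where V_i is the number of distinct words occurring exactly i times.
    """
    n = len(words)
    if n == 0:
        raise ValueError("Yule's K is undefined for an empty text")
    word_counts = Counter(words)                   # word -> occurrences
    v = Counter(word_counts.values())              # i -> V_i
    m = sum(i * i * v_i for i, v_i in v.items())
    return 10000 * (m - n) / (n * n)


if __name__ == "__main__":
    # N = 4; "a" occurs twice, "b" and "c" once: V_1 = 2, V_2 = 1.
    # M = 1*1*2 + 2*2*1 = 6, so K = 10000 * (6 - 4) / 16 = 1250.0
    print(yules_k("a a b c".split()))
```

A text in which every word is distinct gives M = N and therefore K = 0; more repetition gives a larger K.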

• In 20% of cases the classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors
• In 35% of cases the correct author is one of the top 20 guesses

• Malware author identification from:
  • Plain-text source code
  • Binary executables
  • Intermediate code