Presentation transcript:

• Used to be applicable only to literary corpora and academia
• Source code similarity/plagiarism detection is very important
• "Moss" is the most widely known software similarity detection tool
• Can provide valuable insight into malware detection

• Generally not true
• In the Android app domain, it can be!
• 86% of Android malware samples are repackaged versions of legitimate apps with malicious payloads (source: "Dissecting Android Malware: Characterization and Evolution")
• Similarity detection is crucial

• Each Android app is an APK file, which ends with a .apk extension
• Each APK file has a .dex file, a Dalvik executable that is run by the Dalvik virtual machine
• Fingerprint the APK using bithashing

• Application preprocessing: Each app is segmented into basic blocks. Only the opcodes are retained, except for opcodes that store constant data, e.g. the const-string opcode. In that case the opcode is concatenated with the value it references.
• Feature extraction: k-grams of opcodes are extracted by sliding a window of size k over the opcodes, and each k-gram is hashed with the djb2 hash function. For each hash value, the corresponding bit in the bitvector is set.
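The feature-extraction step above can be sketched roughly as follows. This is a minimal illustration, not Juxtapp's actual implementation: the function names, the 32-bit truncation of djb2, the space-joined encoding of a k-gram, the choice to slide the window within each basic block separately, and the use of a Python integer as the bitvector are all assumptions made for the example. The defaults k = 5 and m = 240,007 are the values quoted in the slides.

```python
def djb2(data: bytes) -> int:
    """Classic djb2 string hash (h = h * 33 + byte), starting from 5381.

    Assumption: the hash is truncated to 32 bits, as in typical C versions.
    """
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def fingerprint(basic_blocks, k=5, m=240007):
    """Hash every k-gram of opcodes into an m-bit vector.

    basic_blocks: list of basic blocks, each a list of opcode strings.
    Returns a Python int used as a bitvector (bit i set => some k-gram hashed to i).
    Assumption: windows do not cross basic-block boundaries.
    """
    bits = 0
    for block in basic_blocks:
        # Blocks shorter than k yield no k-grams at all.
        for start in range(len(block) - k + 1):
            gram = " ".join(block[start:start + k]).encode("utf-8")
            bits |= 1 << (djb2(gram) % m)
    return bits


# Hypothetical basic block of six Dalvik opcodes -> two overlapping 5-grams.
block = ["const/4", "invoke-virtual", "move-result",
         "if-eqz", "const-string:hello", "return-void"]
print(bin(fingerprint([block])).count("1"))
```

A constant-carrying opcode is represented here as `const-string:hello`, i.e. the opcode concatenated with its value, which mirrors the preprocessing rule described above; the separator is arbitrary.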

• The value of k was set to 5, selected by experiment. Pairs of apps were drawn from 6,000 randomly sampled apps, and the distance between each pair was computed. From k = 5 upward, the value of k has little impact on the distance calculation.
• The mean basic-block size is 5.35 opcodes and the median is 2 opcodes, while the largest basic block in the dataset contains opcodes

• The bitvector size m is chosen by experiment. m >> N, the number of k-grams extracted from an application between two k-gram feature sets
• apps were used to determine m. m ≈ N_90 × 9, giving m = 240,007, a prime number

• Given the bitvector representations of two apps A and B, their similarity is computed by the formula: J(A,B) = |A ∧ B| / |A ∨ B|
• This formula is a variation of the original Jaccard similarity.
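As a rough sketch of the formula above, the bitwise Jaccard similarity can be computed by counting set bits in the AND and OR of the two vectors. Representing each bitvector as a Python integer and returning 1.0 for two empty vectors are assumptions of this example, not details from the slides.

```python
def jaccard_similarity(a: int, b: int) -> float:
    """J(A, B) = |A AND B| / |A OR B| for bitvectors stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    return bin(a & b).count("1") / union


# Bits shared: 0b0110 (2 bits); bits in either: 0b1111 (4 bits).
print(jaccard_similarity(0b1110, 0b0111))  # 0.5
```

Because many k-grams can hash to the same bit, this value approximates the Jaccard similarity of the underlying k-gram sets rather than computing it exactly.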

• If the app is heavily obfuscated, Juxtapp may not perform well
• Use of third-party libraries can add a lot of noise and adversely affect the similarity score

• Who wrote it?
• Identify an anonymous author by comparing his/her writing style against a corpus of texts of known authorship
• Primary application has shifted from the literary domain to forensics: terrorist threats, harassment

• 2.4 million posts from 100,000 blogs (almost a billion words)
• Stylometry: identify the author based on writing style
• Are N-gram techniques suitable? Not really, because they reveal more about the context than about the author

• Prepare a test set and a training set
• Build a classifier with the training set
• Test the classifier with the test set
• Which features should be considered?
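The train/test workflow above can be illustrated with a very small sketch. The nearest-centroid classifier used here is a stand-in chosen for brevity, not the classifier from the presentation, and the two-dimensional feature vectors and author names are hypothetical toy data.

```python
import math
from collections import defaultdict


def train_centroids(samples):
    """samples: list of (feature_vector, author). Returns author -> mean vector."""
    sums = {}
    counts = defaultdict(int)
    for vec, author in samples:
        if author not in sums:
            sums[author] = [0.0] * len(vec)
        sums[author] = [s + x for s, x in zip(sums[author], vec)]
        counts[author] += 1
    return {a: [s / counts[a] for s in sums[a]] for a in sums}


def predict(centroids, vec):
    """Return the author whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda a: math.dist(centroids[a], vec))


def accuracy(centroids, test_samples):
    """Fraction of test samples attributed to the correct author."""
    correct = sum(predict(centroids, v) == a for v, a in test_samples)
    return correct / len(test_samples)


# Hypothetical stylometric feature vectors, split into training and test sets.
train = [([1.0, 0.0], "alice"), ([0.9, 0.1], "alice"),
         ([0.0, 1.0], "bob"), ([0.1, 0.9], "bob")]
test = [([0.8, 0.2], "alice"), ([0.2, 0.8], "bob")]

model = train_centroids(train)
print(accuracy(model, test))
```

Keeping the test set disjoint from the training set is the point of the split: accuracy measured on texts the classifier has already seen would overstate how well it identifies an unseen anonymous post.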

Syntax tree by Stanford parser
Yule's K: K = 10000 × (M − N) / (N × N)
N = total number of words in the text
M = ∑ i² · V_i, where V_i is the number of words that occur i times
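A short sketch of the Yule's K computation defined above, assuming the text has already been tokenized into a list of words (tokenization and case handling are left out and are not specified in the slides):

```python
from collections import Counter


def yules_k(words):
    """Yule's K = 10000 * (M - N) / N^2, with M = sum over i of i^2 * V_i."""
    n = len(words)
    if n == 0:
        raise ValueError("Yule's K is undefined for an empty text")
    word_counts = Counter(words)             # word -> number of occurrences
    spectrum = Counter(word_counts.values())  # i -> V_i (words occurring i times)
    m = sum(i * i * v_i for i, v_i in spectrum.items())
    return 10000 * (m - n) / (n * n)


# N = 8; "the" occurs 3 times and five words occur once, so M = 9 + 5 = 14.
print(yules_k("the cat sat on the mat the end".split()))  # 937.5
```

A text in which every word is distinct gives M = N and therefore K = 0; more repetition pushes K higher, which is why it serves as a vocabulary-richness feature.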

• In 20% of cases the classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors
• In 35% of cases the correct author is one of the top 20 guesses

• Malware author identification from:
• Plain-text source code
• Binary executables
• Intermediate code