Presented by Teererai Marange

According to Caliskan-Islam et al. (2015), authorship attribution using the Code Stylometry feature set remains possible after code has been run through commercial obfuscation software, with no significant change in accuracy. Is this a good thing, or even a great thing? In this presentation, I will discuss the implications of this claim, as well as its relevance to the problem of authorship attribution and to the software security field.

Given: A set of authors O. A labelled set of code samples C, where each label identifies the author in O who wrote that sample. An unlabelled code sample c. Find: The author o in O who wrote c.
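The problem statement can be sketched in a few lines of Python. This is only a toy illustration, not the method of Caliskan-Islam et al.: the `features`, `cosine`, and `attribute` helpers are hypothetical names, raw token counts stand in for a real stylometric feature set, and a nearest-profile comparison by cosine similarity stands in for a trained classifier. The sample snippets and author names are invented.

```python
from collections import Counter
import math
import re


def features(code: str) -> Counter:
    """Toy stand-in for a stylometric feature set: token frequencies."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\S", code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(labelled: list[tuple[str, str]], c: str) -> str:
    """Return the author o whose pooled labelled samples best match c."""
    profiles: dict[str, Counter] = {}
    for code, author in labelled:
        profiles.setdefault(author, Counter()).update(features(code))
    target = features(c)
    return max(profiles, key=lambda o: cosine(profiles[o], target))


# Labelled set C: (code sample, author) pairs. All invented for illustration.
C = [
    ("for (int i = 0; i < n; i++) { total += a[i]; }", "alice"),
    ("for (int j = 0; j < m; j++) { sum += b[j]; }", "alice"),
    ("int k = 0; while (k < n) { total = total + a[k]; k = k + 1; }", "bob"),
    ("int t = 0; while (t < m) { acc = acc + b[t]; t = t + 1; }", "bob"),
]

# Unlabelled sample c, written in the while-loop style.
c = "int p = 0; while (p < len) { res = res + v[p]; p = p + 1; }"
print(attribute(C, c))
```

Note that raw token counts are sensitive to identifier names, which is exactly the weakness a syntactic feature set is meant to avoid.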

[Diagram labels: Code set, Owners set, Unattributed code, Syntactic feature set, Owner set, Classifier]

According to Caliskan-Islam et al. (2015), authorship attribution using the Code Stylometry feature set remains possible after code has been run through commercial obfuscation software, with no significant change in accuracy. Is this a good thing?

[Diagram labels: Code set, Owners set, Unattributed code, Syntactic feature set, Owner set, Classifier, Obfuscator]

1. Such a system would have more data since it can also consider obfuscated software.
2. Results are not affected by tampering with lexical features and changing names in the code.
3. This also implies that hiding one’s authorship from such a system would require skill.

Machine learning algorithms generally perform better when they have a larger training dataset. Such a system could also train on obfuscated code and hence could be more accurate. It would also be possible to submit a piece of obfuscated code for authorship attribution, so the system could solve a larger set of problems.

This speaks to the robustness of the system against changes in the lexical and presentation (layout) features of code. Using such a system in practice requires trust. If adding random whitespace throughout the code affected the results, that trust would go out of the window. Thus this form of robustness is a good thing.
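A small sketch can make this robustness concrete. It uses Python's standard `ast` module as a stand-in parser and counts AST node types as a crude "syntactic feature set". Both choices are assumptions for illustration, not the feature set used in the paper. The `syntactic_features` helper and the three code snippets are hypothetical.

```python
import ast
from collections import Counter


def syntactic_features(source: str) -> Counter:
    """Count AST node types; layout and identifier names are not recorded."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


original = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

# Same program after a toy "obfuscation": identifiers renamed, layout changed.
obfuscated = """
def a(b):
    c   =   0

    for d in b:
            c += d
    return c
"""

# Same behaviour, but written with a different syntactic structure.
restructured = """
def total(values):
    return sum(values)
"""

same = syntactic_features(original) == syntactic_features(obfuscated)
changed = syntactic_features(original) != syntactic_features(restructured)
print(same, changed)
```

Renaming and whitespace leave the node-type counts untouched, whereas actually restructuring the code changes them. That is the sense in which layout and lexical tampering does not affect a purely syntactic feature set.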

This is useful for plagiarism detection when the offender does not understand the code whose true authorship they are trying to hide. In such cases, running such a system would potentially be sufficient. But what if the offender is skilled? According to the authors, “We do not claim that our feature set resists attempts at manipulating one’s coding style. However, we do find that our syntactic feature set is impervious to off-the shelf code obfuscators which only change layout and SOME lexical features”.

By definition, obfuscation software is designed to make code difficult for humans to understand, in order to prevent reverse engineering. It is not intended to hide authorship. Thus anyone who uses such software to hide authorship is using the wrong tool for the purpose.
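To illustrate what a purely lexical obfuscation does, here is a minimal sketch of a toy identifier renamer built on Python's `ast` module (`ast.unparse` needs Python 3.9 or later). The `obfuscate` function is hypothetical and far simpler than any commercial obfuscator: it only renames functions, arguments, and assigned variables, and it ignores attributes, imports, keyword arguments, and scoping.

```python
import ast


def obfuscate(source: str) -> str:
    """Toy lexical obfuscator: rename every name bound in the module."""
    tree = ast.parse(source)

    # Pass 1: collect names the code itself defines (not builtins like len).
    bound = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)
    mapping = {name: f"v{i}" for i, name in enumerate(sorted(bound))}

    # Pass 2: rewrite definitions and uses with the meaningless aliases.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.name = mapping[node.name]
        elif isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    return ast.unparse(tree)


source = """
def mean(values):
    total = 0
    for item in values:
        total += item
    return total / len(values)
"""

hidden = obfuscate(source)
print(hidden)
```

The output is harder for a human to read, but the statements, loops, and expressions are arranged exactly as before, so the author's structural habits remain in the code.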

However, to the author’s knowledge, no software meant to hide authorship or obfuscate coding style has been written. Thus determining how the feature set performs when code is run through such software is left to future work.

The authors’ claim is a good thing because: Results are more robust. The algorithm potentially has a larger training set. A larger set of problems can potentially be solved. Hiding authorship now takes skill. However, obfuscation is not intended to hide authorship (performance against software built for that purpose is left to future research).

Any questions?