
Predicting Zero-Day Software Vulnerabilities through Data Mining -- Third Presentation. Su Zhang

Outline
Quick Review. Data Source – NVD. Data Preprocessing. Experimental Results. An Essential Limitation. An Alternative Feature. Conclusion. Future Work.

Quick Review

Source Database – NVD
National Vulnerability Database
– U.S. government repository of standard vulnerability management data.
– Data included in each NVD entry:
  Published date and time
  Vulnerable software's CPE specification
  CVSS (Common Vulnerability Scoring System) scores
  External links / references / summary

Instances
An instance is a tuple combining configuration information and a vulnerability.
– E.g. (Microsoft, windows7, sp1, CVSS, vulnerability1)
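As a minimal sketch (the field names are our assumption, not NVD's exact schema), such an instance could be modeled as a named tuple:

```python
from collections import namedtuple

# Hypothetical instance record: one (configuration, vulnerability) pair.
# Field names are illustrative, not NVD's exact schema.
Instance = namedtuple("Instance", ["vendor", "product", "update", "cvss", "vuln"])

example = Instance("microsoft", "windows7", "sp1", 7.5, "vulnerability1")
```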

Number of Instances [chart]

Number of CVEs [chart]

Data Preprocessing
NVD data – training/testing dataset
– Starting from 2005, since earlier data looks unstable.
– Remove some obvious errors in NVD (e.g. "cpe:/o:linux:linux_kernel:390").
Attributes
– Published time: month and day / epoch time.
– Version: discretization/binning.
– Versiondiff: a normalized difference between two versions.
  Radix-based versiondiff.
  Counter (rank) based versiondiff.
– Vendor: removed (we built only one model per vendor).
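A minimal sketch of the two cleaning steps above, assuming each entry is a dict with a 'published' datetime and a 'cpe' string (the malformed-version test is our assumption about what counts as an obvious error):

```python
import re

def clean_nvd(entries):
    """Drop pre-2005 entries and obviously malformed CPE version strings."""
    # A kernel "version" of 3+ digits with no dots (e.g. ":390") is treated
    # as an entry error such as "cpe:/o:linux:linux_kernel:390".
    bad_version = re.compile(r":linux_kernel:\d{3,}$")
    for entry in entries:
        if entry["published"].year < 2005:
            continue  # data before 2005 looks unstable
        if bad_version.search(entry["cpe"]):
            continue  # obvious error in NVD
        yield entry
```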

Predictive & Predicted Attributes
Predictive features
– Time
– Versiondiff
– TTPV (time to previous vulnerability)
– CVSS (Common Vulnerability Scoring System)
Predicted feature (intermediate result)
– TTNV (time to next vulnerability)
– We believe this feature can quantify the risk level of software.
Final result
– A quantitative risk-level indicator

Fitness Indicator – Correlation Coefficient [13]
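The slide cites the standard (Pearson) correlation coefficient between predicted and actual values [13]; for reference:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Values near 1 mean the predicted TTNV tracks the observed TTNV closely; values near 0 mean no linear relationship.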

Training/Testing Dataset
We used a training-to-testing ratio of 2 : 1 for our experiments.
All training data is earlier than the testing data.
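A minimal sketch of such a chronological split (the 'published' key is an assumption):

```python
def chronological_split(instances, train_fraction=2/3):
    """Earliest 2/3 of instances become training data, latest 1/3 testing."""
    ordered = sorted(instances, key=lambda e: e["published"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```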

Correlation Coefficient for Linux Vulnerabilities Using Two Formats of Time [chart]

Counter (Rank) Based Versiondiff
We rank all versions regardless of their numeric values.
– If a product has only three versions, 5.0, 2.2 and 2.1, their values are replaced by 3, 2 and 1.
– I.e. versiondiff(5.0, 2.2) = versiondiff(2.2, 2.1), and versiondiff(5.0, 2.1) = 2 * versiondiff(2.2, 2.1).
Characteristic:
– This scheme neglects the quantitative differences between versions. The radix is a "dynamic" number that depends on how many distinct versions the product has.
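A minimal sketch of the counter (rank) based versiondiff, assuming dotted numeric version strings:

```python
def rank_versiondiff(v1, v2, all_versions):
    """Counter (rank) based versiondiff: distance between version ranks."""
    # Rank versions 1..n by their component-wise numeric order,
    # ignoring the magnitude of the gaps between them.
    numeric = lambda v: tuple(int(p) for p in v.split("."))
    rank = {v: i + 1 for i, v in enumerate(sorted(all_versions, key=numeric))}
    return abs(rank[v1] - rank[v2])

versions = ["5.0", "2.2", "2.1"]          # ranks: 2.1 -> 1, 2.2 -> 2, 5.0 -> 3
assert rank_versiondiff("5.0", "2.2", versions) == 1
assert rank_versiondiff("5.0", "2.1", versions) == 2
```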

Fixed Radix (100) Versiondiff
The radix for each subversion is a fixed value, 100.
– versiondiff(2.1, 3.1) = 100
– versiondiff(3.3, 3.1) = 2
Underlying principle:
– A difference between major versions suggests a higher degree of dissimilarity than a difference between (relatively) minor versions. [14]
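A minimal sketch of the fixed-radix variant (it assumes both versions have the same number of subversions):

```python
def radix_versiondiff(v1, v2, radix=100):
    """Fixed-radix versiondiff: each subversion weighs 100x the next one."""
    def value(version):
        total = 0
        for part in version.split("."):
            total = total * radix + int(part)
        return total
    return abs(value(v1) - value(v2))

assert radix_versiondiff("2.1", "3.1") == 100  # major versions differ by 1
assert radix_versiondiff("3.3", "3.1") == 2    # minor versions differ by 2
```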

Correlation Coefficient for Linux Vulnerabilities Using Two Formats of Versiondiff [chart]

CVSS Metrics
Access vector: {ADJACENT_NETWORK, NETWORK, LOCAL}
Confidentiality: {COMPLETE, PARTIAL, NONE}
Integrity: {COMPLETE, PARTIAL, NONE}
Availability: {COMPLETE, PARTIAL, NONE}
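These metrics are categorical; one way to feed them to a learner is one-hot encoding. A minimal sketch (the encoding choice is our assumption, not necessarily what the experiments used):

```python
CVSS_LEVELS = {
    "access_vector":   ["ADJACENT_NETWORK", "NETWORK", "LOCAL"],
    "confidentiality": ["COMPLETE", "PARTIAL", "NONE"],
    "integrity":       ["COMPLETE", "PARTIAL", "NONE"],
    "availability":    ["COMPLETE", "PARTIAL", "NONE"],
}

def encode_cvss(metrics):
    """One-hot encode the four categorical CVSS metrics as a flat 0/1 vector."""
    vector = []
    for name, levels in CVSS_LEVELS.items():
        vector.extend(1 if metrics[name] == level else 0 for level in levels)
    return vector
```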

Correlation Coefficient for Adobe Vulnerabilities Using CVSS Metrics or Not [chart]

Software (Linux Kernel) Version Discretization/Binning
Rationale: group values with high similarity.
How?
– Round each version down to its first three subversions.
– E.g. Bin(2.6.18.3) = 2.6.18

Software Version (Linux Kernel) Discretization/Binning (Cont.)
Why & why not?
– Why 3? More than half of the instances (31834/56925) have versions longer than three subversions.
– Why not 4? Only about 1% of instances (665/56925) have versions longer than four subversions.
– Why not 2? A difference in the third subversion already indicates a large dissimilarity for the Linux kernel. [1]
– Why not Microsoft? Versions of Microsoft products are naturally discrete (all of them have numeric versions below 20).
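A minimal sketch of the binning step described above:

```python
def bin_version(version, keep=3):
    """Bin a version string by keeping only its first `keep` subversions."""
    return ".".join(version.split(".")[:keep])

assert bin_version("2.6.18.3") == "2.6.18"
assert bin_version("2.6") == "2.6"  # shorter versions pass through unchanged
```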

Correlation Coefficient for Linux Vulnerabilities Using Binned Versions or Not [chart]

An Essential Problem of Versiondiff
Most new vulnerabilities affecting the current version affect previous versions as well.
– Microsoft Bulletin.
– Adobe Bulletin.
– Therefore, most versiondiff values are zero (or unknown):
  Microsoft: 85.2% (14229/16699)
  Linux: 61.5% (39448/64052)
  Mozilla: 53.4% (12057/22566)
  …

A Possible Alternative Attribute
Occurrence count of each version of each software.
– This could, to some extent, illustrate the trend of each version, since the occurrence count keeps increasing and most instances get a meaningful value (instead of zero).
– This attribute follows our intuition, but we could not find a firm rationale behind it.
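A minimal sketch of computing this count per (software, version) pair (the dict keys are assumptions):

```python
from collections import Counter

def occurrence_counts(instances):
    """Count how many instances each (product, version) pair has accumulated."""
    return Counter((e["product"], e["version"]) for e in instances)
```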

Microsoft Windows
– Instances carry no version information.
– Instead of the aforementioned attribute, we use the occurrence count of the given software (Windows).
Non-Windows applications
– Instances include version information.
– We used the aforementioned attribute as one of the predictive features.

Windows and Non-Windows Instances [chart]

Different Applications Have Quite Different Trends
Firefox
– Results were promising when we built models on it (correlation coefficient close to 0.7 for both training and test data).
– Adding CVSS metrics or not does not affect the results.
Internet Explorer
– It shows similar results when CVSS metrics are added.
– But its results are extremely poor without CVSS.

Correlation Coefficient for IE Vulnerabilities Using CVSS Metrics or Not [chart]

Correctly Classified Rate for Firefox Vulnerabilities Using CVSS Metrics or Not [chart]

Google (Chrome)
– Google is becoming a more and more vulnerable vendor (in terms of number of instances).
– It has more than 10,000 instances.
– However, more than half of them appeared within two months (Apr-May 2010).

Conclusion
Vendor-based models cannot be built yet because of the limitations of NVD data. However, grouping similar applications into application-based models is another possibility. Why?
– The trend of TTNV is not stable (as shown in the previous tests).
– A few errors can dramatically affect the results.
– Inconsistent definitions (caused by different maintainers). [12]
– Version information cannot be used effectively.

Future Work
Number of zero-day vulnerabilities for each software package
– This may require life-cycle information.
CVSS score
– Indicates the risk levels of different vulnerabilities.

Questions & Discussions. Thank you!

References
[1] Andrew Buttner et al., "Common Platform Enumeration (CPE) – Specification."
[2] NVD.
[3] O. H. Alhazmi et al., "Modeling the Vulnerability Discovery Process."
[4] Omar H. Alhazmi et al., "Prediction Capabilities of Vulnerability Discovery Models."
[5] Andy Ozment, "Improving Vulnerability Discovery Models."
[6] R. Gopalakrishna and E. H. Spafford, "A Trend Analysis of Vulnerabilities."
[7] Christopher M. Bishop, "Pattern Recognition and Machine Learning."
[8] Xinming Ou et al., "MulVAL: A Logic-Based Network Security Analyzer."
[9] Kyle Ingols et al., "Modeling Modern Network Attacks and Countermeasures Using Attack Graphs."
[10] Miles A. McQueen et al., "Empirical Estimates and Observations of 0Day Vulnerabilities."
[11] Alex J. Smola et al., "A Tutorial on Support Vector Regression."
[12] Andy Ozment, "Vulnerability Discovery & Software Security," Ph.D. dissertation.
[13] Correlation Coefficient.
[14] Microsoft Software Versioning.