On the Semantic Patterns of Passwords and their Security Impact RAFAEL VERAS, CHRISTOPHER COLLINS, JULIE THORPE UNIVERSITY OF ONTARIO INSTITUTE OF TECHNOLOGY.

PRESENTER: KYLE WALLACE

A Familiar Scenario… User Name: CoolGuy90 Password: “What should I pick as my new password?”

A Familiar Scenario… “Musical!Snowycat90”

A Familiar Scenario… But how secure is “Musical!Snowycat90” really? (18 chars) ◦“Musical” – Dictionary word, possibly related to a hobby ◦“!” – Filler character ◦“Snowy” – Dictionary word, attribute of “cat” ◦“cat” – Dictionary word, animal, possibly a pet ◦“90” – Number, possibly truncated year of birth 15/18 characters are related to dictionary words! Why do we pick the passwords that we do?

Password Patterns? “Even after half a century of password use in computing, we still do not have a deep understanding of how people create their passwords” –Authors Are there ‘meta-patterns’ or preferences that can be observed across how people choose their passwords? Do these patterns/preferences have an impact on security?

Contributions Use NLP to segment, classify, and generalize semantic categories Describe most common semantic patterns in RockYou database A PCFG that captures structural, semantic, and syntactic patterns Evaluation of security impact, comparison with previous studies

Contributions Use NLP to segment, classify, and generalize semantic categories Describe most common semantic patterns in RockYou database A PCFG that captures structural, semantic, and syntactic patterns Evaluation of security impact, comparison with previous studies

Segmentation Decomposition of passwords into constituent parts ◦Passwords contain no whitespace characters (usually) ◦Passwords contain filler characters (“gaps”) between segments Ex: crazy2duck93^ -> {crazy, duck} & {2,93^} Issue: What about strings that parse multiple ways?
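The segmentation idea above can be sketched as follows. The tiny word list and the greedy longest-match strategy are illustrative assumptions, not the paper's exact algorithm:

```python
# Toy password segmentation: split a password into dictionary words and
# "gap" segments (anything not matched). The word list below is a
# hypothetical stand-in for the paper's corpora.
WORDS = {"crazy", "duck", "any", "anyone", "one", "barks"}

def segment(password):
    """Return (words, gaps) for a lowercase password."""
    words, gaps, i, gap = [], [], 0, ""
    while i < len(password):
        # Try the longest dictionary word starting at position i.
        for j in range(len(password), i, -1):
            if password[i:j] in WORDS:
                if gap:
                    gaps.append(gap)
                    gap = ""
                words.append(password[i:j])
                i = j
                break
        else:
            gap += password[i]  # no word starts here: extend the gap
            i += 1
    if gap:
        gaps.append(gap)
    return words, gaps

print(segment("crazy2duck93^"))  # (['crazy', 'duck'], ['2', '93^'])
```

A greedy left-to-right pass like this picks only one parse; the ambiguity question on the slide is exactly why the paper needs a scoring criterion (coverage) to choose among candidate splits.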

Coverage Prefer segmentations with fewer and smaller gaps Ex: Anyonebarks98 (13 characters long)
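A simple way to score competing parses is the fraction of characters covered by dictionary words. This is a simplified proxy for the slide's criterion (the full tie-breaking also considers the number of gaps):

```python
def coverage(segments):
    """Fraction of the password covered by word segments.
    segments: list of (text, is_word) pairs in order."""
    total = sum(len(t) for t, _ in segments)
    covered = sum(len(t) for t, w in segments if w)
    return covered / total

# "anyonebarks98": two candidate parses of the same 13 characters
full = [("anyone", True), ("barks", True), ("98", False)]
partial = [("any", True), ("onebarks98", False)]
print(coverage(full))     # 11/13 ≈ 0.846
print(coverage(partial))  # 3/13 ≈ 0.231
```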

Splitting Algorithm Source corpora: Raw word list ◦Taken from COCA (Corpus of Contemporary American English) Trimmed version of COCA: ◦3-letter words: Frequency of 100+ ◦2-letter words: Top 37 ◦1-letter words: a, I Also collected lists of names, cities, surnames, months, and countries

Splitting Algorithm

Common Words

Part-of-Speech Tagging

Semantic Classification Assigns a semantic classifier to each password segment ◦Only assigned to nouns and verbs WordNet: A graph of concepts (“synsets”), each expressed as a set of synonyms ◦Synsets are arranged into hierarchies, with more general concepts at the top Fall back to source corpora for proper nouns ◦Tag with female name, male name, surname, country, or city

Semantic Classification Tags represented as word.pos.#, where # is the WordNet ‘sense’

Semantic Generalization

W=1000 (gold), W=5000 (red), W=10000 (blue)
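The generalization step can be sketched with a toy hypernym tree standing in for WordNet; both the tree and the fixed category cut are illustrative assumptions (the paper chooses the cut from corpus frequencies, controlled by the weight W):

```python
# Toy hypernym hierarchy (child -> parent), standing in for WordNet.
HYPERNYM = {
    "puppy.n.01": "dog.n.01",
    "dog.n.01": "canine.n.02",
    "canine.n.02": "animal.n.01",
    "tabby.n.01": "cat.n.01",
    "cat.n.01": "feline.n.01",
    "feline.n.01": "animal.n.01",
}

def generalize(synset, categories):
    """Climb the hypernym chain until we reach a chosen category node;
    synsets outside the hierarchy are returned unchanged."""
    while synset not in categories and synset in HYPERNYM:
        synset = HYPERNYM[synset]
    return synset

print(generalize("puppy.n.01", {"animal.n.01"}))  # animal.n.01
```

With a narrower category set such as {"feline.n.01", "animal.n.01"}, "tabby.n.01" stops at "feline.n.01" rather than climbing all the way up, mirroring how some synsets generalize further than others.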

Contributions Use NLP to segment, classify, and generalize semantic categories Describe most common semantic patterns in RockYou database A PCFG that captures structural, semantic, and syntactic patterns Evaluation of security impact, comparison with previous studies

Classification RockYou leak (2009) contained over 32 million passwords Effect of generalization can be seen in a few cases (in blue) ◦Some generalizations better than others (Ex: ‘looted’ vs ‘bravo100’) Some synsets are not generalized (in red) ◦Ex: puppy.n.01 -> puppy.n.01

Summary of Categories Love (6, 7) Places (3, 13) Sexual Terms (29, 34, 54, 69) Royalty (25, 59, 60) Profanity (40, 70, 72) Animals (33, 36, 37, 92) Food (61, 66, 76, 82, 93) Alcohol (39) Money (46, 74) *Some categories expanded from two-letter acronyms +Some categories contain noise from the names dictionary

Top 100 Semantic Categories

Contributions Use NLP to segment, classify, and generalize semantic categories Describe most common semantic patterns in RockYou database A PCFG that captures structural, semantic, and syntactic patterns Evaluation of security impact, comparison with previous studies

Probabilistic Context-Free Grammar

Semantic PCFG

Sample PCFG
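A minimal sketch of how a semantic PCFG assigns a probability to a guess: the probability of a base structure times the probabilities of the terminals filling each slot. The rule probabilities below are hypothetical, not the paper's learned values:

```python
# Toy semantic PCFG. A base structure is a tuple of semantic
# non-terminals; each non-terminal expands into concrete segments.
BASE = {("animal", "number"): 0.6, ("love", "number"): 0.4}
TERMINALS = {
    "animal": {"bear": 0.5, "dog": 0.3, "cat": 0.2},
    "love": {"iloveyou": 0.7, "luv": 0.3},
    "number": {"123": 0.6, "90": 0.4},
}

def guess_probability(structure, segments):
    """P(guess) = P(base structure) * product of terminal probabilities."""
    p = BASE[structure]
    for tag, seg in zip(structure, segments):
        p *= TERMINALS[tag][seg]
    return p

print(guess_probability(("animal", "number"), ["bear", "123"]))  # 0.18
```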

RockYou Base Structures (Top 50)

Contributions Use NLP to segment, classify, and generalize semantic categories Describe most common semantic patterns in RockYou database A PCFG that captures structural, semantic, and syntactic patterns Evaluation of security impact, comparison with previous studies

Building a Guess Generator Cracking attacks consist of three steps: ◦Generate a guess ◦Hash the guess using the same algorithm as the target ◦Check for matches in the target database Most popular methods (using the John the Ripper program) ◦Word lists (from previous breaches) ◦Brute force (usually after exhausting word lists)
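The three-step loop above can be sketched as follows; unsalted SHA-1 and the hard-coded guess list are stand-ins for illustration only (real targets use other hash schemes, and guesses would come from the grammar):

```python
import hashlib

# Hypothetical target database: a set of password hashes.
target_hashes = {hashlib.sha1(b"bear123").hexdigest()}

def try_guesses(guesses, targets):
    """Hash each guess and report those matching a target hash."""
    cracked = []
    for g in guesses:
        if hashlib.sha1(g.encode()).hexdigest() in targets:
            cracked.append(g)
    return cracked

print(try_guesses(["iloveyou", "bear123"], target_hashes))  # ['bear123']
```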

Guess Generator At a high level: ◦Output terminals in highest probability order ◦Iteratively replaces higher probability terminals with lower probability ones ◦Uses priority queue to maintain order Will this produce the same list of guesses every time?
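The priority-queue enumeration described above can be sketched in the spirit of a Weir-style "next" function. The slot contents and probabilities below are hypothetical; the pivot trick ensures each combination is pushed exactly once, and the output is fully deterministic for a fixed grammar (answering the slide's question):

```python
import heapq

# Each slot holds (segment, probability) pairs sorted by descending
# probability. These values are illustrative, not learned.
SLOTS = [
    [("bear", 0.5), ("dog", 0.3), ("cat", 0.2)],  # animal slot
    [("123", 0.6), ("90", 0.4)],                  # number slot
]

def guesses_in_order(slots):
    """Yield (guess, probability) in non-increasing probability order."""
    p0 = 1.0
    for s in slots:
        p0 *= s[0][1]
    heap = [(-p0, (0,) * len(slots), 0)]  # (neg prob, index tuple, pivot)
    while heap:
        neg_p, idx, pivot = heapq.heappop(heap)
        yield "".join(slots[k][i][0] for k, i in enumerate(idx)), -neg_p
        # Children: advance one slot at or after the pivot, so every
        # index combination is generated exactly once.
        for k in range(pivot, len(slots)):
            if idx[k] + 1 < len(slots[k]):
                child = idx[:k] + (idx[k] + 1,) + idx[k + 1:]
                p = -neg_p / slots[k][idx[k]][1] * slots[k][idx[k] + 1][1]
                heapq.heappush(heap, (-p, child, k))

for guess, prob in guesses_in_order(SLOTS):
    print(f"{guess}  {prob:.2f}")
```

Because a child's probability never exceeds its parent's, the heap pops guesses in non-increasing probability order without ever materializing the full guess list.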

Guess Generator Example

Mangling Rules Passwords aren’t always strictly lowercase ◦Beardog123lol ◦bearDOG123LoL ◦BearDog123LoL Three types of rules: ◦Capitalize first word segment ◦Capitalize whole word segment ◦CamelCase on all segments Any others?

Comparison to Weir Approach The authors' approach can be seen as an evolution of Weir's ◦Weir's grammar contains far fewer non-terminals (less precise probability estimates) ◦Weir does not learn semantic rules (fewer overall terminals) ◦Weir treats grammar and dictionary input separately ◦The authors' semantic classification must be re-run whenever the training input changes

Password Cracking Experiments Considered 5 methods: ◦Semantic approach w/o mangling rules ◦Semantic approach w/ custom mangling rules ◦Semantic approach w/ JtR’s mangling rules ◦Weir approach ◦Wordlist w/ JtR’s default rules + incremental brute force Attempted to crack LinkedIn and MySpace leaks

Experiment 1: RockYou vs LinkedIn 5,787,239 unique passwords Results: ◦Semantic approach outperforms non-semantic versions ◦Weir approach performs worst (semantic approach yields a 67% improvement over it) ◦The authors' approach is more robust against differing demographics

Experiment 2: RockYou vs MySpace 41,543 unique passwords Results: ◦Semantic approach outperforms all ◦No-rules variant performs best ◦Weir approach performs worst (semantic approach yields a 32% improvement over it) ◦Passwords were phished, possibly lowering their quality?

Experiment 3: Maximum Crack Rate

Experiment 3: Time to Maximum Crack Fit non-linear regression to a sample of guess probabilities Results: ◦Semantic method has a lower guesses-per-second rate ◦Grammar is much larger than in the Weir method

Issues with Semantic Approach Further study needed into performance bottlenecks ◦Though the semantic method is more efficient per guess (fewer guesses needed per hit) Approach requires a significant amount of memory ◦Workaround involves a probability threshold for adding to the queue Duplicates could be produced due to ambiguous splits ◦Ex: (one, go) vs (on, ego)

Conclusions There are underlying semantic patterns in password creation These semantics can be captured in a probabilistic grammar This grammar can be used to efficiently generate probable passwords This generator shows (up to) a 67% improvement over previous efforts

Thank you! QUESTIONS?