SESSION 2.5 WHARTON SUMMER TECH CAMP Regex Data Acquisition.

Agenda
1: REGEX INTRO
2: DATA ACQUISITION

Regular Expression

What is Regular Expression (RE)?
RE, or REGEX, is a way to describe string patterns to computers.
Basically, an advanced “Find” and “Find and Replace”.
Originated in theoretical computer science. For the interested: “Formal Language Theory”, “Chomsky hierarchy”, “Automata theory”.
It is the same theory that guides programming languages.
Popularized by Perl; ubiquitous in Unix.
Almost all programming languages support REGEX, and the implementations are mostly the same.

What is Regular Expression (RE)?
Given a text T, an RE matches the part of T that the RE describes: RE(T) = Subset_of_matched(T).
A regular expression can be complicated and can consist of multiple patterns, so you can match multiple patterns at the same time.
With the matched part of T, you can do whatever you wish, or substitute part of it with something else.
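A minimal Python sketch of the "match, then use or substitute" idea described above. The sample text and the date pattern are made up for illustration; the pattern syntax itself is covered in the slides that follow.

```python
import re

# Hypothetical sample text T (not from the slides).
text = "Order 66 shipped on 2014-07-15 to Philadelphia"

# Pattern for a date written as four digits, two digits, two digits.
date_pattern = r"\d{4}-\d{2}-\d{2}"

# RE(T): find the part of T that the pattern describes.
match = re.search(date_pattern, text)
matched_part = match.group(0) if match else None

# Substitute the matched part with something else.
redacted = re.sub(date_pattern, "<DATE>", text)

print(matched_part)
print(redacted)
```

re.search returns only the first match, while re.sub replaces every match in the text.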

“Oh it’s just a text searching tool, so what?”

Well, Google is a text search tool, albeit for different purposes. The power comes from the fact that by learning regex, you are essentially learning to represent complex text patterns to computers efficiently. The data may be too big or too tedious for humans to go through. Learn their language and tell computers what to do!

True (paraphrased) quotes from some doctoral students/faculty before I introduced them to REGEX
“I despise aggregating data from the AMT – it took me a week to go through them all”
“[Grunt noise]. I had to filter out IP addresses from surveys by hand and it took me forever”
“I have this data with many different ways of representing the same variables and need to do “fuzzy” matching but don’t know a good way to do this”

Reasons to use regex
1. Regular expressions will be very useful for data cleaning and aggregating.
2. Very useful in basic web scraping.
3. Text data is everywhere, and “If you take “text” in the widest possible sense, perhaps 90% of what you do is 90% text processing” (Programming Perl book).
4. Once you learn regex, you can use it in any language, since the implementations are similar.
5. Learning regex is one of the first steps in learning NLP (natural language processing).
6. You are learning a language of the machines.

Usage Examples
You get an output from Amazon Mech Turk (or Qualtrics) and need to extract and aggregate data and make it usable by R or Stata.
You can check survey outcomes for quality control. Useful for checking whether the participants are paying attention, or for quality control at a massive scale. A related use in web development is checking whether an input format is correct (password requirements).
You want to scrape simple information from a website for your project.
One simple algorithm in NLP is matching and counting words. Regex can do that.
You want to obtain addresses for your evil spamming purposes. You can do that, but don’t.
Etc. Many possibilities for increases in productivity.
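The "matching and counting words" use case can be sketched in a few lines of Python. The survey text below is invented for illustration, and treating a word as a run of lowercase letters and apostrophes is a simplifying assumption, not a full tokenizer.

```python
import re
from collections import Counter

# Hypothetical free-text survey responses (made up for this sketch).
responses = "Great product. great price! Shipping was slow, but the product is great."

# Lowercase the text, then pull out runs of letters/apostrophes as words.
words = re.findall(r"[a-z']+", responses.lower())

# Count how often each word occurs.
counts = Counter(words)

print(counts.most_common(2))
```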

But it takes some time to master
You will need to practice with a cheat sheet next to you. Literally, this is a language (“regular language”) you are learning. Just like any language, this one has vocabulary and grammar to learn.

Tools to practice REGEX
There are great tools to practice regex:
Website
If you have a Mac: Reggy
If you have Windows: RegexBuddy

Basics of REGEX
Can represent strings literally or symbolically.
Literal representations are not powerful but convenient for small tasks.
Symbolic representation is the workhorse. There are a few concepts you need to learn to use this representation.
There are also many special characters with special meanings in REGEX, e.g., . ^ $ * + ? { } [ ] \ | ( )
Cheat sheet: regex-cheatsheet/cheatsheet.pdf

Literal Matching
Match strings literally.
String = “I am a string”
RE = “string”
Matched string = “string”
That’s it.
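The literal-matching example above, sketched with Python's re module (the variable names are arbitrary):

```python
import re

s = "I am a string"

# A pattern with no special characters matches itself literally.
m = re.search(r"string", s)

matched = m.group(0)   # the text that matched
position = m.span()    # (start, end) indices of the match within s

print(matched, position)
```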

Literal Matching & Quantifiers
Symbolic matching has many special characters to learn. Quantifiers are one such concept; each applies to whatever comes immediately before it.
"ba" matches only "ba".
+ means match whatever comes before 1 or more times: "ba+" matches ba, baa, baaa, baaaa, etc.
? means match whatever comes before 0 or 1 time: "ba?" matches b or ba.
* means match whatever comes before 0 or more times: "ba*" matches b or ba or baa or baaa and so on.
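A small sketch to check the quantifier claims above. It uses re.fullmatch so that a candidate counts only if the whole string matches; the helper name matches_whole and the candidate list are made up for this example.

```python
import re

def matches_whole(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["b", "ba", "baa", "baaa", "bab"]

plus = matches_whole(r"ba+", candidates)      # 'a' one or more times
question = matches_whole(r"ba?", candidates)  # 'a' zero or one time
star = matches_whole(r"ba*", candidates)      # 'a' zero or more times

print(plus, question, star)
```

Note that the quantifier applies only to the "a", not to the whole "ba"; grouping with parentheses would be needed to repeat "ba" as a unit.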

More Quantifiers
{start,end} means match whatever comes before at least “start” and at most “end” times: "ba{1,3}" matches ba, baa, baaa.
Leaving out “end” removes the upper limit: "ba{2,}" matches baa, baaa, baaaa and so on.
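The counted quantifiers can be checked the same way. This is a sketch with made-up candidate strings; the {2} form (exactly two repetitions) is not on the slide and is added here only for comparison.

```python
import re

candidates = ["b", "ba", "baa", "baaa", "baaaa"]

def whole_matches(pattern):
    """Candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

one_to_three = whole_matches(r"ba{1,3}")  # between 1 and 3 'a's
two_or_more = whole_matches(r"ba{2,}")    # at least 2 'a's
exactly_two = whole_matches(r"ba{2}")     # exactly 2 'a's (extra form)

print(one_to_three, two_or_more, exactly_two)
```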

Special Meta characters
As you’ve seen, some characters have special meanings: . ^ $ * + ? { } [ ] \ | ( )
. means any one character except the newline character \n.
^ dictates that the regex pattern should only be matched if it occurs at the beginning. String = “the book”: RE = “book” YES, RE = “^book” NO.
$ is similar to ^ but for the ending.
[] is used to signify a set or range of characters: [0-9] means any one character from 0 to 9.
() is used for grouping patterns. It can also be used to memorize (capture) a certain part of the match.
| is used as “OR”: (5|4) matches 5 or 4.
\ is the special character to rule them all: it escapes any special meta character so that it is represented as is. \. matches an actual period.
[^stuff] means match any one character that is not in “stuff”: [^9] matches any character but 9.
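A short Python sketch exercising each meta character described above. Apart from “the book”, the test strings are invented for illustration.

```python
import re

s = "the book"

# ^ anchors at the beginning, $ at the end.
anywhere = re.search(r"book", s) is not None
at_start = re.search(r"^book", s) is not None
at_end = re.search(r"book$", s) is not None

# [] matches one character from a set or range; [^...] negates the set.
digits = re.findall(r"[0-9]", "room 101")
not_nine = re.findall(r"[^9]", "1929")

# | means OR; with one group, findall returns what the group captured.
five_or_four = re.findall(r"(5|4)", "3456")

# . matches any character; \. matches only a literal period.
dot_any = re.search(r"a.c", "abc") is not None
dot_literal = re.search(r"a\.c", "abc") is not None

# () memorizes (captures) part of the match for later use.
m = re.search(r"(\d+)-(\d+)", "pages 10-20")
first_page, last_page = m.group(1), m.group(2)

print(anywhere, at_start, at_end, digits, not_nine, five_or_four)
```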

Hey Jude Hey Jude, don't make it bad Take a sad song and make it better Remember to let her under your skin Then you'll begin to make it (better ){6}, oh (na ){7}, (na ){4}, Hey Jude

Special Vocabulary Shortcuts
Some vocabularies are so common that shortcuts were made.
\d matches any digit [0-9]
\w any alphanumeric character plus underscore [a-zA-Z0-9_]
\s white space: spaces, tabs, newlines etc. [ \t\n] (notice the space at the beginning)
\W any character that is not alphanumeric or underscore [^a-zA-Z0-9_]
\S guess?
\D again?
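The shortcuts above, including the two left as a guess (\S and \D are the negations of \s and \d), sketched in Python on a made-up string:

```python
import re

# Hypothetical sample containing a space, a comma, a tab, and a newline.
s = "Room 42,\tfloor_3\n"

digits = re.findall(r"\d", s)           # single digits
words = re.findall(r"\w+", s)           # runs of letters, digits, underscore
spaces = re.findall(r"\s", s)           # whitespace characters
non_word = re.findall(r"\W", s)         # anything \w does not match
non_space = re.findall(r"\S+", s)       # runs of non-whitespace
non_digit = re.findall(r"\D+", "a1b22c")  # runs of non-digits

print(digits, words, non_space, non_digit)
```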

Flags
Change the way regex works.
i: ignore case.
s: changes the way . works. Usually . matches anything except the newline \n; this flag makes . match everything.
m: multiline. Changes the way ^ and $ work with newlines. Usually ^ and $ match strictly at the start or end of the string, but this flag makes them match at the start or end of each line.
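In Python's re module, the i, s, and m flags are spelled re.IGNORECASE, re.DOTALL, and re.MULTILINE (short forms re.I, re.S, re.M). A sketch of their effect on a made-up two-line string:

```python
import re

text = "First line\nsecond LINE"

# i: ignore case.
case_sensitive = re.findall(r"line", text)
ignore_case = re.findall(r"line", text, flags=re.IGNORECASE)

# s: let . match the newline too, so a match can span lines.
without_dotall = re.search(r"First.+LINE", text) is not None
with_dotall = re.search(r"First.+LINE", text, flags=re.DOTALL) is not None

# m: let ^ (and $) match at the start (and end) of every line.
first_words_default = re.findall(r"^\w+", text)
first_words_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

print(ignore_case, with_dotall, first_words_multiline)
```

Several flags can be combined with |, for example flags=re.I | re.M.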

REGEX in python
Python library: re (import re).
The function used is re.search(pattern, string, flags=0): scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern.
Pattern: specifies what is to be matched.
String: the actual string to match from.
Flags: options that change the way regex works. Again, flag "i" says ignore case; in Python it is passed as re.I (or re.IGNORECASE).
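Because re.search returns None when nothing matches, calling .group() on the result without checking raises an AttributeError. A small sketch of the usual guard; the helper name first_match is hypothetical, not part of the re library.

```python
import re

def first_match(pattern, string, flags=0):
    """Return the first matched text, or None if the pattern does not match."""
    m = re.search(pattern, string, flags)
    return m.group(0) if m else None

title = "REGEX in python"

no_flag = first_match(r"regex", title)                    # case-sensitive
with_flag = first_match(r"regex", title, re.IGNORECASE)   # ignore case

print(no_flag, with_flag)
```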

REGEX in python
re.search(pattern, string, flags=0) returns the first match; re.findall(pattern, string, flags=0) returns a list of all matches.
Pattern: always wrap the pattern with r"" in python. r"" says to interpret everything between the quotes as a raw string; this is particular to python, which otherwise gives some backslash sequences a meaning of its own.
s = "This is an example string"
matchedobject = re.search(r"This", s)  # finds a match
matchedobject = re.search(r"this", s)  # returns None: matching is case-sensitive by default
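A sketch contrasting re.search with re.findall on the slide's example string, and showing why the r"" prefix matters. The \b word-boundary token is not covered in these slides; it is used here only because it is a case where a non-raw string silently changes the pattern.

```python
import re

s = "This is an example string"

hit = re.search(r"This", s)    # match object
miss = re.search(r"this", s)   # None: case-sensitive by default

all_is = re.findall(r"is", s)  # every occurrence, including inside "This"

# In a raw string, \b reaches the regex engine as a word boundary.
whole_word = re.findall(r"\bis\b", s)

# Without r"", Python turns \b into a backspace character first,
# so the pattern looks for backspaces and finds nothing.
broken = re.findall("\bis\b", s)

print(hit.group(0), miss, all_is, whole_word, broken)
```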

Regex is easy to learn but hard to master
Example of a complex regex: the regex in the next slide validates based on the RFC 822 grammar, which is now obsolete. It was not written by hand; it was produced by combining a set of simpler regexes.

(?:(?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ \t] \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \000-\031]+(?:(?:( ?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \000-\0 31]+(?:(?:(?:\r\n)?[ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \000-\031]+ (?:(?:(?:\r\n)?[ (?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z \t]))*"(?:(?:\r\n) ?[ \000-\031]+(?:(?:(?:\ r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \000- \031]+(?:(?:(?:\r\n) ?[ \t] \000-\031]+(?:(?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ \t] \t])*))*) *:(?:(?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ \t])+ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \000-\031]+(?:(?:(?: \r\n)?[ \t ]))*"(?:(?:\r\n)?[ \000-\031 ]+(?:(?:(?:\r\n)?[ ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \000-\031]+(? :(?:(?:\r\n)?[ :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \000-\031]+(?:(? :(?:\r\n)?[ [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \000- \031]+(?:(?:(?:\r\n)?[ (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t] \000-\031]+(?:(?:(?:\r\n)?[ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? 
\000-\031]+(?:(?:(?:\r\n)?[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \000- \031]+(?:(?:(?:\r\n)?[ ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ ]\r\\]|\\.)*\](?:(?:\r\n)?[ [\] \000-\031]+(?:(?:(?:\r\n)?[ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \0 00-\031]+(?:(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ ;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])* \000-\031]+(?:(?:(?:\r\n)?[ \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ \000-\031]+(?:(?:(?:\r\n)?[ ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( \000-\031]+(?:(?:(?:\r\n)?[ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( ?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \t]))*"(?:(?:\r\n)?[ \t \000-\031]+(?:(?:(?:\r\n)?[ \t \t])*)(? :\.(?:(?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \t])*))*|(?: \000-\031]+(?:(?:(?:\r\n)?[ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" \t])*)(?:\.(?:(?:\r\n) ?[ \000- \031]+(?:(?:(?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] \000-\031]+(?:(?:(?:\r\n)?[ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? \000-\031]+(?:(?:(?:\r\n)?[ \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ \t]))*"(?:(?:\r\n)?[ \t]) \000- \031]+(?:(?:(?:\r\n)?[ \t]) \t])*)(?:\.(?:(?:\r\n)?[ \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*) NO+!+

Lab
Try some REGEX tutorial: expressions.info/tutorial.html
The scripts I uploaded
Play around with the regex tool
5-10 minutes

Fire up the REGEX.py

Next Session Preview

THE BIGGEST concern for doctoral students doing empirical work (year 2-4), excluding the quals/prelims:
“WHERE AND HOW DO I GET DATA?!”

Data sources
1. Companies
2. Wharton Organizations
3. Scraping Web
4. APIs: application programming interfaces

We are going to use the following for the next session
Download WGET and make sure it works. You may already have wget if you use a Mac (in the terminal, type wget).
Get Firefox Developer’s Toolbox
Data acquisition (Wharton, Company, Scraping, API)

REGEX-FU Contest with small prizes!