PolyAnalyst Web Report Training: Manipulating Text Data in PolyAnalyst - Text Extraction and Regular Expressions. Megaputer Intelligence, www.megaputer.com, © 2014 Megaputer Intelligence Inc.
Agenda: Extract Terms node; Basics of Regular Expression; Example of Regex with PolyAnalyst
Extract Terms Node: Extract text segments from a column using regular expressions
Extract Terms Node: Select Text or String columns
Extract Terms Node: Add a new rule
Extract Terms Node: Simplest Regex Rule; Case Insensitive
Extract Terms Node
Basics of Regular Expression: The simplest regex is simply a string of characters (the Simplest Regex Rule)
Basics of Regular Expression: If we expand it to a longer pattern, then it fails!
Basics of Regular Expression: \s represents a whitespace character (a space, tab, or line break)
Basics of Regular Expression: PDL: Phrase(parking, lot)
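A minimal sketch of why a literal pattern can fail and how \s helps. It uses Python's re module only as a stand-in (PolyAnalyst's regex engine may differ in details). The phrase "parking lot" is borrowed from the PDL example, and the sample notes are hypothetical.

```python
import re

# Hypothetical sample notes; only "parking lot" comes from the slides.
notes = [
    "Car stolen from PARKING LOT behind the mall",  # plain space
    "Suspect fled across the parking\tlot",         # tab between the words
    "Vehicle found near the parking\nlot exit",     # line break between the words
]

# A literal pattern requires exactly one ordinary space.
literal = re.compile(r"parking lot", re.IGNORECASE)
# \s accepts any single whitespace character instead.
flexible = re.compile(r"parking\slot", re.IGNORECASE)

literal_hits = [bool(literal.search(n)) for n in notes]
flexible_hits = [bool(flexible.search(n)) for n in notes]

print(literal_hits)
print(flexible_hits)
```

The literal rule only finds the first note, while the \s version finds all three. The re.IGNORECASE flag plays the role of the Case Insensitive option.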
Basics of Regular Expression: Vertical bar | represents “or”; parentheses () represent grouping
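The effect of grouping on “or” can be sketched as follows. This uses Python's re module as a stand-in for the Extract Terms rule editor; the sample sentences and the words car/truck are hypothetical.

```python
import re

# Hypothetical sample sentences.
samples = [
    "Car theft reported",
    "Truck theft downtown",
    "Bike theft at noon",
    "A car was parked legally",
]

# With parentheses: (car OR truck) followed by " theft".
grouped = re.compile(r"(car|truck) theft", re.IGNORECASE)
# Without parentheses: "car" alone OR "truck theft".
ungrouped = re.compile(r"car|truck theft", re.IGNORECASE)

grouped_hits = [bool(grouped.search(s)) for s in samples]
ungrouped_hits = [bool(ungrouped.search(s)) for s in samples]

print(grouped_hits)
print(ungrouped_hits)
```

Without the parentheses, the last sentence matches on the word "car" alone, which is usually not what the rule intends.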
Basics of Regular Expression: \d matches any digit (0 to 9); plus sign + denotes one or more matches
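A short sketch of \d with and without the plus sign, using Python's re module as a stand-in; the sample sentence is hypothetical.

```python
import re

# Hypothetical sample text.
text = "Suspect is 27 years old, seen at 1450 Main St, apt 3."

# \d alone matches one digit at a time.
single_digits = re.findall(r"\d", text)
# \d+ matches each run of one or more digits as a whole number.
whole_numbers = re.findall(r"\d+", text)

print(single_digits)
print(whole_numbers)
```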
Basics of Regular Expression: Question mark ? denotes zero or one match; asterisk * denotes zero or more matches
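The three quantifiers can be compared side by side. This is a sketch in Python's re module, used as a stand-in; the test words are made up for illustration.

```python
import re

# ? : the "u" may appear zero times or once.
optional_u = re.compile(r"colou?r")
spellings = ["color", "colour", "colouur"]
optional_hits = [bool(optional_u.fullmatch(w)) for w in spellings]

# * : the "o" may appear zero or more times.
# + : the "o" must appear at least once.
words = ["ggle", "gogle", "gooogle"]
star_hits = [bool(re.fullmatch(r"go*gle", w)) for w in words]
plus_hits = [bool(re.fullmatch(r"go+gle", w)) for w in words]

print(optional_hits)
print(star_hits)
print(plus_hits)
```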
Other Useful Syntax: Dot . matches any character except newline. Caret ^ denotes the beginning of the string. Dollar sign $ denotes the end of the string. Curly brackets {} specify the number of repetitions. For example: w{3} matches www; p{1,5} matches the run of p's in happy or happpppy
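Each of these symbols can be tried in a few lines. This sketch uses Python's re module as a stand-in; the strings other than www, happy and happpppy are hypothetical.

```python
import re

# Dot: any single character except a newline, so "h\nt" is skipped.
dot_matches = re.findall(r"h.t", "hat hot h\nt hit")

# Caret and dollar: anchor the pattern to the start or end of the string.
starts_with_www = bool(re.search(r"^www", "www.megaputer.com"))
www_not_at_start = bool(re.search(r"^www", "see www.megaputer.com"))
ends_with_com = bool(re.search(r"com$", "www.megaputer.com"))

# Curly brackets: exact count {3} or a range {1,5}.
exactly_three_w = bool(re.fullmatch(r"w{3}", "www"))
p_counts = [bool(re.fullmatch(r"hap{1,5}y", w))
            for w in ["hay", "happy", "happpppy", "happppppy"]]

print(dot_matches)
print(starts_with_www, www_not_at_start, ends_with_com)
print(exactly_three_w, p_counts)
```

The last word in the list has six p's, one more than {1,5} allows, so it is rejected.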
Metacharacters: Some characters are reserved for use in regex notation. The metacharacters are: { } [ ] ( ) ^ $ . | * + ? \ To match one of them literally, put a backslash in front of it. For example: \$\d+\.\d+ matches $19.99
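A sketch of why escaping matters, using Python's re module as a stand-in. The pattern \$\d+\.\d+ and the value $19.99 come from the slide; the other strings are hypothetical.

```python
import re

# Escaped $ and . are treated as literal characters.
price = re.compile(r"\$\d+\.\d+")
prices = price.findall("Lunch was $19.99, parking was $5.00, tip 3 dollars")

# An unescaped dot matches any character, so "19x99" slips through.
unescaped = re.findall(r"\d+.\d+", "19x99")
escaped = re.findall(r"\d+\.\d+", "19x99")

# re.escape builds a pattern that matches a piece of text literally.
literal_pattern = re.escape("$19.99")
literal_ok = re.fullmatch(literal_pattern, "$19.99") is not None

print(prices)
print(unescaped, escaped)
print(literal_ok)
```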
More? See the PolyAnalyst Help Manual. Online resources: http://en.wikipedia.org/wiki/Regular_expression and http://www.regular-expressions.info/ Test patterns and see the highlights: http://www.regexr.com/
Extract [Age] of Suspect: Besides grouping, parentheses () are also used for storing (capturing) the matched text so it can be extracted or reused
Extract and Sort [Age]
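Extracting and sorting an age with a stored group might look like the sketch below. It uses Python's re module as a stand-in for the Extract Terms node. The report texts, the pattern, and the helper name extract_age are all hypothetical, not the rule shown in the training.

```python
import re

# Hypothetical narratives; the pattern is an assumed way of writing the rule.
reports = [
    "Suspect is a 34 year old male wearing a red jacket",
    "Witness described a 19-year-old woman",
    "No description of the suspect was given",
]

# Group 1 stores the digits; the other groups only allow space or hyphen.
age_pattern = re.compile(r"(\d+)(\s|-)years?(\s|-)old", re.IGNORECASE)

def extract_age(text):
    """Return the stored age as an integer, or None if no age is mentioned."""
    match = age_pattern.search(text)
    return int(match.group(1)) if match else None

ages = [extract_age(r) for r in reports]
sorted_ages = sorted(a for a in ages if a is not None)

print(ages)
print(sorted_ages)
```

Converting the stored text to an integer is what makes numeric sorting possible; as text, "9" would sort after "34".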
Clean Up Text / String Columns: .* matches any number of characters (including none) except newline
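One way .* can be used for clean-up is to delete everything from a marker to the end of the line. This sketch uses Python's re.sub as a stand-in; the "Disclaimer:" marker and the sample text are hypothetical.

```python
import re

# Hypothetical text with an unwanted trailer on the first line.
raw = "Door was forced open. Disclaimer: this report is preliminary."
cleaned = re.sub(r"\s*Disclaimer:.*", "", raw)

# Because . stops at a newline, only the rest of that line is removed.
two_lines = "Line one. Disclaimer: abc\nLine two."
cleaned_two_lines = re.sub(r"\s*Disclaimer:.*", "", two_lines)

print(cleaned)
print(cleaned_two_lines)
```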
Delimiter and Extraction: \w matches any alphanumeric character or the underscore character: [A-Za-z0-9_]
Delimiter and Extraction: Besides grouping, parentheses () are also used for storing (capturing) the matched text
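A sketch of delimiter-based extraction that combines \w with stored groups, using Python's re module as a stand-in. The record format (name=value pairs separated by semicolons) is a hypothetical example, not the dataset from the training.

```python
import re

# Hypothetical delimited record.
record = "make=Toyota;model=Camry;year=2012;plate_state=IL"

# \w+ captures a run of letters, digits and underscores on each side of "=".
pairs = re.findall(r"(\w+)=(\w+)", record)
fields = dict(pairs)

print(pairs)
print(fields["plate_state"])
```

Since \w includes the underscore, a field name such as plate_state is captured whole, and the semicolon delimiter ends each value because it is not a \w character.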
Replace Terms Node Find and replace patterns of characters in one or more string or text columns.
Data Redaction
Regex in Replace Terms Node
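Find-and-replace redaction can be sketched with Python's re.sub, used here as a stand-in for the Replace Terms node. The ID format, the sample text, and the \1 replacement syntax are assumptions for illustration; PolyAnalyst's own replacement syntax may differ.

```python
import re

# Hypothetical text containing an ID in the form ddd-dd-dddd.
text = "SSN 123-45-6789 on file; callback 555-0199."

# Replace the whole match with a fixed mask.
redacted = re.sub(r"\d{3}-\d{2}-\d{4}", "XXX-XX-XXXX", text)

# Store the last four digits in group 1 and reuse them via \1.
partially_masked = re.sub(r"\d{3}-\d{2}-(\d{4})", r"XXX-XX-\1", text)

print(redacted)
print(partially_masked)
```

The callback number is left alone because it does not fit the ddd-dd-dddd shape.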
Contacting Megaputer Questions?
Appendix: An Example of Regular Expression with a Web Scraping Project of Glassdoor Data
Polish the Information
Remove Unnecessary Info: (?s) turns on single-line mode, in which the dot . also matches line breaks, so the whole text is treated as if it were on the same line
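The effect of (?s) can be seen when the unwanted block spans several lines. This sketch uses Python's re module as a stand-in; the page text and the nav tags are hypothetical.

```python
import re

# Hypothetical scraped text with a multi-line block to remove.
page = "Header text\n<nav>Home\nAbout\nContact</nav>\nReview body"

# Without (?s), .* cannot cross the line breaks, so nothing is removed.
without_flag = re.sub(r"<nav>.*</nav>", "", page)

# With (?s), the dot also matches newlines and the whole block is removed.
with_flag = re.sub(r"(?s)<nav>.*</nav>", "", page)

print(without_flag == page)
print(with_flag)
```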
Find a Delimiter: For forums or blogs with multiple posts on one webpage, look for a common pattern that marks where each post begins
Separate Records of Info
Records Separated!
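Separating records on a delimiter can be sketched with Python's re.split, used here as a stand-in for the PolyAnalyst workflow. The "Posted by" marker and the page text are hypothetical; a real page needs its own recurring pattern.

```python
import re

# Hypothetical page text holding three posts.
page = (
    "Posted by alice\nGreat place to work.\n"
    "Posted by bob\nLong hours, good pay.\n"
    "Posted by carol\nFriendly team."
)

# Split wherever the recurring marker appears, then drop empty pieces.
pieces = re.split(r"Posted by\s", page)
records = [p.strip() for p in pieces if p.strip()]

print(len(records))
print(records[0])
```

The first piece is empty because the text starts with the marker, which is why empty pieces are filtered out.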
Different Ways to Extract Data: Right from the parsed text; option to work on the raw HTML code
Data Extraction – Parsed Text: Title of Review, Location, Job Title, Date & Time
Data Extraction – Raw HTML Code: Title of Review, Job Title, Location
Resulting Dataset
Making Good Use of the Info
Contacting Megaputer Questions?