CSC 594 Topics in AI – Text Mining and Analytics, Fall 2015/16
4. Document Search and Regular Expressions



Information Retrieval vs. Text Mining
Information Retrieval (IR) is the task of retrieving documents that are relevant to a query.
– So the fundamental technique is document similarity.
Text Mining/Analytics (TM/A) is to (a) examine a collection of documents, (b) learn decision criteria or a model for classification, and (c) apply those criteria or that model to new documents to classify them.
– So the goal is to classify new documents for prediction, using a model derived from the collection of documents.
But in TM/A, step (b), learning the decision criteria for classification, relies on document similarity. So TM/A uses the same techniques as IR (although not in all TM/A tasks).

Document Search
'Information Retrieval (IR)' implies a query (e.g. search terms).
– For a given query, relevant or similar documents are returned.
But the most basic document retrieval technique is keyword/search term matching.
– Retrieve all (or selected) documents which contain the search terms, by string matching.
– Python example:
    >>> s1 = 'public'
    >>> s2 = 'public'
    >>> s2 == s1
    True

    myword = "month python"
    # Print every line of the file that contains the search term.
    with open("textfile.txt") as openfile:
        for line in openfile:
            if myword in line:
                print(line)
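The same substring test can be applied to a whole collection to retrieve matching documents. The following is a minimal sketch, assuming a small in-memory collection; the names `find_documents` and `documents` are hypothetical and are not from the slides.

```python
def find_documents(documents, term):
    """Return the names of documents whose text contains `term` as a substring."""
    return [name for name, text in documents.items() if term in text]


# Hypothetical in-memory collection; a real one would be read from files.
documents = {
    "doc1": "Regular expressions describe search patterns.",
    "doc2": "Information retrieval returns relevant documents.",
    "doc3": "A search pattern can be a plain keyword.",
}

hits = find_documents(documents, "search")
print(hits)  # ['doc1', 'doc3']
```

Note that the `in` test is case-sensitive, so the term "Search" would return no documents here.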

String Matching Using Patterns
Often, we wish to find a substring which matches a pattern, e.g. addresses:
1. Any number of alphanumeric characters and/or dots (not a dot at beginning or end)
3. Any number of alphanumeric characters and/or dots (not a dot at beginning or end); must be at least one dot
Examples:
– valid:
But if you want to specify search words by patterns, regular expressions are commonly used.
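The numbered description above reads like the two sides of an email-style address. As a rough sketch only, the pattern below assumes that the part missing from the list is a literal "@" separator; that separator, the name `looks_like_address`, and the sample strings are assumptions, not taken from the slides. The sketch is also slightly stricter than the description, because it rejects consecutive dots.

```python
import re

# Runs of alphanumerics separated by single dots (no leading or trailing dot).
LOCAL = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*"
# Same shape, but at least one dot is required.
DOMAIN = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+"
# ASSUMPTION: the two parts are joined by a literal "@".
ADDRESS = re.compile(LOCAL + "@" + DOMAIN)


def looks_like_address(s):
    """Return True if the whole string fits the assumed address pattern."""
    return ADDRESS.fullmatch(s) is not None


print(looks_like_address("jane.doe@example.com"))   # True
print(looks_like_address(".jane@example.com"))      # False (leading dot)
print(looks_like_address("jane@example"))           # False (no dot after "@")
```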

Regular Expressions (1)
A regular expression is an algebra for defining patterns. For example, the regular expression "a*b" matches the string "aaaab". Without going through the formal definitions, here is a (partial) summary.
1. Simple patterns
– Characters match themselves. Note that characters are case-sensitive.
– Metacharacters cannot be used literally as is: . ^ $ * + ? { } [ ] \ | ( )
– To match a metacharacter literally, put a backslash before it: \. \^ \+ etc.
– Other special characters: \t, \n, \r, \f, etc.
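A short sketch of these rules using Python's re module. The sample strings are made up for illustration.

```python
import re

# Ordinary characters match themselves, and matching is case-sensitive.
lower_ok = re.fullmatch(r"a*b", "aaaab") is not None
upper_ok = re.fullmatch(r"a*b", "AAAAB") is not None
print(lower_ok, upper_ok)                  # True False

price = "cost: $3.99 (approx.)"

# "$" and "." are metacharacters, so they are escaped to match literally.
literal_price = re.search(r"\$3\.99", price).group()
print(literal_price)                       # $3.99

# Unescaped, "." matches any character, so "3.99" also matches "3x99".
dot_any = re.search(r"3.99", "3x99") is not None
dot_literal = re.search(r"3\.99", "3x99") is not None
print(dot_any, dot_literal)                # True False

# re.escape() adds the backslashes needed to match a string literally.
escaped = re.escape("(approx.)")
found = re.search(escaped, price).group()
print(found)                               # (approx.)
```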

Regular Expressions (2)
2. Character classes
– [abc] – a, b, or c
– [^abc] – any character except a, b, or c
– [a-zA-Z] – a through z, or A through Z, inclusive (range)
3. Predefined character classes
– . (dot) – any character (except a newline, by default)
– \d – a digit ([0-9])
– \D – a non-digit ([^0-9])
– \s – a whitespace character (e.g. space, \t, \n, \r)
– \S – a non-whitespace character
– \w – a word character ([a-zA-Z_0-9])
– \W – a non-word character ([^\w])
4. Boundary matchers
– ^ – the beginning of a line
– $ – the end of a line
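The classes and boundary matchers above can be tried out with re.findall() and re.search(). This is a small sketch with made-up sample text. Note that in Python, without the re.M flag, ^ and $ anchor to the start and end of the whole string rather than of each line.

```python
import re

text = "Room 42, floor 3\nopen 9am"

digits = re.findall(r"\d", text)               # every single digit
words = re.findall(r"\w+", text)               # runs of word characters
vowels = re.findall(r"[aeiou]", "regex")       # a character class
non_vowels = re.findall(r"[^aeiou]", "regex")  # a negated class

print(digits)      # ['4', '2', '3', '9']
print(words)       # ['Room', '42', 'floor', '3', 'open', '9am']
print(vowels)      # ['e', 'e']
print(non_vowels)  # ['r', 'g', 'x']

# Boundary matchers: by default they anchor to the whole string.
starts_with_room = re.search(r"^Room", text) is not None
starts_with_open = re.search(r"^open", text) is not None  # "open" starts line 2 only
ends_with_9am = re.search(r"9am$", text) is not None

print(starts_with_room, starts_with_open, ends_with_9am)  # True False True
```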

Regular Expressions (3)
5. Greedy quantifiers
– X? – X, once or not at all
– X* – X, zero or more times
– X+ – X, one or more times
– X{n} – X, exactly n times
– X{n,m} – X, at least n but no more than m times
6. Logical operators
– XY – X followed by Y
– X|Y – either X or Y
– (X) – X, as a capturing group
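A brief sketch of the quantifiers and operators above in Python. The sample strings are made up. The "<.+>" case illustrates what "greedy" means: the quantifier takes as much text as it can while still allowing a match.

```python
import re

# X? : once or not at all
optional_u = [w for w in ("color", "colour", "colouur") if re.fullmatch(r"colou?r", w)]
print(optional_u)                                          # ['color', 'colour']

# X{n} and X{n,m}
exactly_three = re.fullmatch(r"\d{3}", "123") is not None
too_many = re.fullmatch(r"\d{2,4}", "12345") is not None
print(exactly_three, too_many)                             # True False

# X|Y : either alternative
pets = re.findall(r"cat|dog", "cat dog bird")
print(pets)                                                # ['cat', 'dog']

# Greedy matching: ".+" extends to the last ">" it can reach.
greedy = re.search(r"<.+>", "<b>bold</b>").group()
print(greedy)                                              # <b>bold</b>

# (X) : a capturing group, here repeated with "+"
m = re.search(r"(ab)+", "ababab!")
print(m.group(), m.group(1))                               # ababab ab
```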

Regular Expression in Python (1)
Regular expressions are in the 're' module. Notation for patterns is slightly different from other languages: a raw string can be used as an alternative to a regular string, so that backslashes do not have to be doubled.
Regular string     Raw string
"ab*"              r"ab*"
"\\\\section"      r"\\section"
"\\w+\\s+\\1"      r"\w+\s+\1"
First compile an expression (into an re object). Then match it against a string.
    >>> import re
    >>> p = re.compile('ab*')
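The equivalences in the table can be checked directly, since a raw string and its doubled-backslash form are the same Python string. The sketch below also compiles two of the patterns; the input strings are made up for illustration.

```python
import re

# Each pair denotes the same sequence of characters.
same_1 = "ab*" == r"ab*"
same_2 = "\\\\section" == r"\\section"
same_3 = "\\w+\\s+\\1" == r"\w+\s+\1"
print(same_1, same_2, same_3)        # True True True

# r"\\section" is the regex for a literal backslash followed by "section".
section = re.compile(r"\\section")
section_match = section.match(r"\section{Intro}").group()
print(section_match)                 # \section

# Compile once, then match against a string.
p = re.compile(r"ab*")
ab_match = p.match("abbbc").group()
print(ab_match)                      # abbb
```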

Regular Expression in Python (2)
Matching an re object against a string is done in several ways.
Method/Attribute   Purpose
match()            Determine if the RE matches at the beginning of the string.
search()           Scan through a string, looking for any location where this RE matches.
findall()          Find all substrings where the RE matches, and return them as a list.
finditer()         Find all substrings where the RE matches, and return them as an iterator.

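    >>> import re
    >>> sent = "This book on tennis cost $3.99 at Walmart."
    >>> p1 = re.compile("ten")
    >>> m1 = p1.match(sent)
    >>> m1                          # None: "ten" is not at the beginning of the string
    >>> p2 = re.compile(".*ten.*")
    >>> m2 = p2.match(sent)
    >>> m2                          # a match object covering the whole sentence
    >>> m3 = re.search(p1, sent)
    >>> m3                          # a match object for 'ten' (inside 'tennis')
    >>> m4 = re.search(p2, sent)
    >>> m4                          # a match object covering the whole sentence
    >>> pp1 = re.compile("is")
    >>> m5 = re.findall(pp1, sent)
    >>> m5
    ['is', 'is']
    >>> pp2 = re.compile("\\d")
    >>> m6 = re.search(pp2, sent)
    >>> m6                          # a match object for '3'
    >>> pp3 = re.compile("\\d+")
    >>> m7 = re.search(pp3, sent)
    >>> m7                          # a match object for '3' (the '.' ends the run of digits)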

    >>> pp3 = re.compile("\\$\\d+\\.\\d\\d")
    >>> m8 = re.search(pp3, sent)
    >>> m8                          # a match object for '$3.99'
    >>> pp4 = re.compile(r"\$\d+\.\d\d")    # the same pattern written as a raw string
    >>> m9 = re.search(pp4, sent)
    >>> m9                          # a match object for '$3.99'
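The match objects returned in the session above can be inspected with their group() and span() methods, and finditer() yields one such object per match. A small sketch using the same sentence:

```python
import re

sent = "This book on tennis cost $3.99 at Walmart."

# match() only succeeds at the beginning of the string, so this is None.
at_start = re.match(r"ten", sent)
print(at_start)                          # None

# search() returns a match object; group() is the text, span() its position.
price = re.search(r"\$\d+\.\d\d", sent)
print(price.group(), price.span())       # $3.99 (25, 30)

# finditer() yields a match object for every non-overlapping match.
is_spans = [m.span() for m in re.finditer(r"is", sent)]
print(is_spans)                          # [(2, 4), (17, 19)]
```

Slicing the sentence with a span, e.g. sent[25:30], gives back the matched text.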

Regular Expression in Python (3)
Grouping: you can retrieve the matched substrings using parentheses.
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1. ((A)(B(C)))
2. (A)
3. (B(C))
4. (C)
Group zero always stands for the entire expression.
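The numbering can be checked in Python by matching the example expression against the string "ABC" (the input string is chosen here for illustration):

```python
import re

m = re.match(r"((A)(B(C)))", "ABC")

whole = m.group(0)      # group zero: the entire match
numbered = m.groups()   # groups 1..4, in order of their opening parentheses

print(whole)            # ABC
print(numbered)         # ('ABC', 'A', 'BC', 'C')
```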

    >>> ppp1 = re.compile("(\\w+) cost (\\$\\d+\\.\\d\\d)")
    >>> mm1 = re.search(ppp1, sent)
    >>> mm1                         # a match object for 'tennis cost $3.99'
    >>> mm1.group(0)
    'tennis cost $3.99'
    >>> mm1.group(1)
    'tennis'
    >>> mm1.group(2)
    '$3.99'

TutorialsPoint, 14

Regular Expression Modifiers: Option Flags (source: TutorialsPoint)
The re functions accept an optional modifier to control various aspects of matching. The modifiers are specified as an optional flag. You can provide multiple modifiers by combining them with bitwise OR (|). A modifier may be one of these:
Modifier   Description
re.I       Performs case-insensitive matching.
re.L       Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B).
re.M       Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
re.S       Makes a period (dot) match any character, including a newline.
re.U       Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B.
re.X       Permits "cuter" regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.
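A short sketch of how these flags are passed and combined in Python. The sample text is made up, and flags are joined with the bitwise OR operator.

```python
import re

text = "Python\npython rocks\nPYTHON"

# re.I alone: "^" still anchors only at the start of the string.
ignore_case = re.findall(r"^python", text, re.I)
print(ignore_case)                  # ['Python']

# re.I | re.M: case-insensitive, and "^" matches at the start of every line.
every_line = re.findall(r"^python", text, re.I | re.M)
print(every_line)                   # ['Python', 'python', 'PYTHON']

# re.S: the dot also matches the newline between the first two lines.
without_s = re.search(r"Python.python", text) is not None
with_s = re.search(r"Python.python", text, re.S) is not None
print(without_s, with_s)            # False True

# re.X: whitespace and "#" comments inside the pattern are ignored.
price = re.compile(r"""
    \$      # a literal dollar sign
    \d+     # dollars
    \.      # decimal point
    \d\d    # cents
""", re.X)
verbose_match = price.search("cost $3.99").group()
print(verbose_match)                # $3.99
```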