REGULAR EXPRESSIONS 2 DAY 7 - 9/10/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.

Presentation transcript:


Course organization
The syllabus is under construction.
10-Sept-2014, NLP, Prof. Howard, Tulane University

Regular expressions: Review

Regular expressions
re.findall(' be ', S)
Regex meta-characters: |, (), [], [^], {}, .
' to | be | it | as '
' (to|be|it|as) '
' ([a-z][a-z]) '
' ([a-z]{2}) '
' (..) '
' (.{2}) '
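The review patterns can be tried on a short test string. The sketch below is only an illustration: the string `sample` is invented here and is not the lecture's `S`, and the variable names are hypothetical. It shows how the three families of pattern differ in what they pick up.

```python
import re

# Hypothetical test string (an assumption, not the lecture's S).
# No two-character words are adjacent, so the surrounding spaces never overlap.
sample = 'plan to visit, then be calm as the clock hits 19 and sit at home'

# Disjunction of four specific words: finds only the words that are listed.
by_disjunction = re.findall(' (to|be|it|as) ', sample)

# Any two lowercase letters between spaces.
by_letter_class = re.findall(' ([a-z]{2}) ', sample)

# Any two characters between spaces, letters or not.
by_dot = re.findall(' (..) ', sample)

print(by_disjunction)   # ['to', 'be', 'as']
print(by_letter_class)  # ['to', 'be', 'as', 'at']
print(by_dot)           # ['to', 'be', 'as', '19', 'at']
```

Because each pattern consumes the spaces on both sides of the word, two adjacent two-letter words (as in 'to be') would share a space and only the first would be found; the sample string was chosen to avoid that case.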

Open Spyder
import re

Will the best regex please stand up?
§4. Regular expressions 2

Under-fitting vs. over-fitting
This challenge of finding the regular expression that is just right may remind you of the story of Goldilocks and the three bears, in which Goldilocks tried to find the bowl of porridge that was neither too hot nor too cold.
Statisticians have their own version of Goldilocks, which evaluates how well a statistical analysis fits the data that it is applied to.
An analysis that over-fits the data is too specific, in that it excludes data points from a larger data set that should be included.
Conversely, an analysis that under-fits the data is too general, in that it includes data points from a larger data set that should be excluded.
In our example, the first two regular expressions over-fit the data set ('at' should be included), while the last two under-fit it ('19' should be excluded).

False positives and false negatives
Statistical test theory provides an alternative way of conceptualizing the problem, which I unfortunately can’t figure out how to tie in to Goldilocks.
Though it is usually illustrated in terms of medical tests, I believe that explaining it in terms of legal ‘tests’ is easier to understand.

A trial
Imagine that a person is charged with a crime and goes through a trial.
If she is guilty and the verdict is guilty, the trial has produced a true positive data point: a guilty person is found guilty.
Conversely, if she is innocent and the verdict is not guilty, the trial has produced a true negative data point: an innocent person is found not guilty.
We expect that an accurate test only produces true positives and true negatives, but there are two more logical possibilities, which make a test less than fully accurate.
One is for an innocent person to be found guilty. This is called a false positive data point, because the accused should have failed the test but instead passed it.
Alternatively, if a guilty person is found not guilty, the legal test has produced a false negative data point, because the accused should have passed the test but instead failed it.

Four outcomes of a trial
true positive: guilty found guilty
false positive: innocent found guilty
true negative: innocent found not guilty
false negative: guilty found not guilty

Summary of the two sorts of regex evaluation
true positive: evaluation of ‘to’ by [a-z]{2} results in a good fit (it should match and does)
false positive: evaluation of ‘19’ by .{2} results in an under-fit (it should not match but does)
true negative: evaluation of ‘the’ by [a-z]{2} results in a bad fit (it should not match and does not)
false negative: evaluation of ‘at’ by (?:to|be|it|as) results in an over-fit (it should match but does not)
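The four outcomes can be counted mechanically once we decide by hand which strings ought to match. The following is a minimal sketch under stated assumptions: the token list, the `wanted` set and the helper `evaluate` are all hypothetical names made up for illustration, and `re.fullmatch` is used so that each token is tested as a whole rather than searched inside a longer string.

```python
import re

# Hypothetical tokens, and a hand-made set of the ones that SHOULD match
# (here: the two-letter words). Both are assumptions for illustration.
tokens = ['to', 'be', 'at', 'as', 'the', '19']
wanted = {'to', 'be', 'at', 'as'}

def evaluate(pattern):
    """Count true/false positives/negatives of pattern over tokens."""
    counts = {'TP': 0, 'FP': 0, 'TN': 0, 'FN': 0}
    for tok in tokens:
        matched = re.fullmatch(pattern, tok) is not None
        if matched and tok in wanted:
            counts['TP'] += 1   # should match, does match
        elif matched:
            counts['FP'] += 1   # should not match, but does (pattern too general)
        elif tok in wanted:
            counts['FN'] += 1   # should match, but does not (pattern too specific)
        else:
            counts['TN'] += 1   # should not match, does not match
    return counts

too_specific = evaluate('(?:to|be|it|as)')  # misses 'at'
just_right = evaluate('[a-z]{2}')
too_general = evaluate('.{2}')              # lets '19' through

print(too_specific)  # {'TP': 3, 'FP': 0, 'TN': 2, 'FN': 1}
print(just_right)    # {'TP': 4, 'FP': 0, 'TN': 2, 'FN': 0}
print(too_general)   # {'TP': 4, 'FP': 1, 'TN': 1, 'FN': 0}
```

On this toy data the over-fitting pattern shows up as a false negative and the under-fitting pattern as a false positive, while [a-z]{2} produces neither.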

More on ranges and negation
>>> S2 = 'otolaryngologist'
English only has five letters for vowels, so it would be easy enough to list them in a disjunction:
>>> re.findall('a|e|i|o|u', S2)
['o', 'o', 'a', 'o', 'o', 'i']
>>> re.findall('[aeiou]', S2)
['o', 'o', 'a', 'o', 'o', 'i']
>>> re.findall('[^aeiou]', S2)
['t', 'l', 'r', 'y', 'n', 'g', 'l', 'g', 's', 't']
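A quick check, offered as a sketch, that the disjunction and the bracketed class are interchangeable here, and that the negated class picks up everything else. The second string, `mixed`, is a made-up example that is not from the lecture; it shows that [^aeiou] means "any character that is not a lowercase vowel", not "any consonant".

```python
import re

S2 = 'otolaryngologist'

vowels_by_disjunction = re.findall('a|e|i|o|u', S2)
vowels_by_class = re.findall('[aeiou]', S2)
non_vowels = re.findall('[^aeiou]', S2)

# The two ways of writing the vowel pattern give the same list.
print(vowels_by_disjunction == vowels_by_class)           # True
# Every character is either a vowel or a non-vowel: 6 + 10 = 16.
print(len(vowels_by_class) + len(non_vowels) == len(S2))  # True

# Hypothetical string: the negated class also matches capitals,
# punctuation, spaces and digits.
mixed = 'Uh-oh 2'
others = re.findall('[^aeiou]', mixed)
print(others)  # ['U', 'h', '-', 'h', ' ', '2']
```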

A range of repetition with {}
character{minimum,maximum}
>>> S3 = 'bookkeeper'
>>> S4 = 'goddessship'
>>> re.findall('[aeiou]{2}', S3)
['oo', 'ee']
>>> re.findall('[^aeiou]{3}', S4)
['sss']
>>> re.findall('[^aeiou]{2,3}', S4)
['dd', 'sss']
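To see how the minimum and maximum interact, the sketch below runs a few more variants on the same word. The variable names are hypothetical; the point is that a run of four non-vowels ('sssh') is split differently depending on the bounds, because the match takes as many characters as the maximum allows and then starts over.

```python
import re

S4 = 'goddessship'

# Exactly two: the run 'sssh' is cut into 'ss' and 'sh'.
exactly_two = re.findall('[^aeiou]{2}', S4)
# Two or three: takes 'sss', and the leftover 'h' is too short to match.
two_to_three = re.findall('[^aeiou]{2,3}', S4)
# Two or more (no maximum): takes the whole run 'sssh'.
two_or_more = re.findall('[^aeiou]{2,}', S4)

print(exactly_two)   # ['dd', 'ss', 'sh']
print(two_to_three)  # ['dd', 'sss']
print(two_or_more)   # ['dd', 'sssh']
```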

Match the beginning or end of a string with ^ and $
>>> re.findall('^.|.$', S)
['T', '.']
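The string S is not defined on this slide, so the sketch below uses an invented stand-in, `text`, that happens to start with 'T' and end with a period. It also shows, as an aside, what changes when the `re.MULTILINE` flag is passed: ^ and $ then match at the start and end of every line, not just of the whole string.

```python
import re

# Hypothetical stand-in for the lecture's S (an assumption).
text = 'This is it.'
ends = re.findall('^.|.$', text)
print(ends)  # ['T', '.']

# Hypothetical two-line string.
two_lines = 'ab\ncd'
# By default ^ and $ refer to the whole string.
whole = re.findall('^.|.$', two_lines)
# With re.MULTILINE they refer to each line.
per_line = re.findall('^.|.$', two_lines, flags=re.MULTILINE)
print(whole)     # ['a', 'd']
print(per_line)  # ['a', 'b', 'c', 'd']
```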

Next time
x.html#further-practice-of-fixed-length-matching
4.3. Variable-length matching