Presentation is loading. Please wait.

Presentation is loading. Please wait.

Natural Language Processing (NLP)

Similar presentations


Presentation on theme: "Natural Language Processing (NLP)"— Presentation transcript:

1 Natural Language Processing (NLP)
Lecture No 5 Institute of Southern Punjab Multan Department of Computer Science

2 Basic Text Processing Regular Expression

3 Basic Process of NLU Phonological / morphological analyser
Spoken input For speech understanding Phonological / morphological analyser Phonological & morphological rules Sequence of words SYNTACTIC COMPONENT Grammatical Knowledge Indicating relns (e.g., mod) between words Syntactic structure (parse tree) Thematic Roles SEMANTIC INTERPRETER Semantic rules, Lexical semantics Selectional restrictions Logical form CONTEXTUAL REASONER Pragmatic & World Knowledge Meaning Representation

4 NLP Representations State Machines Rule Systems Logic-based Formalisms
FSAs, FSTs, HMMs, ATNs, RTNs Rule Systems CFGs, Unification Grammars, Probabilistic CFGs Logic-based Formalisms 1st Order Predicate Calculus, Temporal and other Higher Order Logics Models of Uncertainty Bayesian Probability Theory

5 Regular Expressions Can be viewed as a way to specify:
Search patterns over text string Design of a particular kind of machine, called a Finite State Automaton (FSA) These are really equivalent

6 Regular Expressions Regular Expression: Formula in algebraic notation for specifying a set of strings String: Any sequence of alphanumeric characters Letters, numbers, spaces, tabs, punctuation marks Regular Expression Search Pattern: specifying the set of strings we want to search for Corpus: the texts we want to search through

7 Regular Expressions The “foundational” operations
Concatenation abc abc Disjunction a|b a b (a|bb)d ad bbd Kleene star a* ε a aa aaa ... c(a|bb)* ca cbba Regular expressions / Finite-state automata are “closed under these operations” Pattern Matches The empty string

8 Practical Applications of RegEx’s
• Web search • Word processing, find, substitute • Validate fields in a database (dates, addr, URLs) • Searching corpus for linguistic patterns – and gathering stats...

9 Regular expressions A formal language for specifying text strings
How can we search for any of these? woodchuck woodchucks Woodchuck Woodchucks

10 Regular Expressions: Disjunctions
Letters inside square brackets [] Ranges [A-Z] Pattern Matches [wW]oodchuck Woodchuck, woodchuck [ ] Any digit Pattern Matches [A-Z] An upper case letter Drenched Blossoms [a-z] A lower case letter my beans were impatient [0-9] A single digit Chapter 1: Down the Rabbit Hole

11 Regular Expressions: Negation in Disjunction
Negations [^Ss] Carat means negation only when first in [] Pattern Matches [^A-Z] Not an upper case letter Oyfn pripetchik [^Ss] Neither ‘S’ nor ‘s’ I have no exquisite reason” [^e^] Neither e nor ^ Look here a^b The pattern a carat b Look up a^b now

12 Regular Expressions: More Disjunction
Woodchucks is another name for groundhog! The pipe | for disjunction Pattern Matches groundhog|woodchuck yours|mine yours mine a|b|c = [abc] [gG]roundhog|[Ww]oodchuck Photo D. Fletcher

13 Regular Expressions: ? * + .
Pattern Matches colou?r Optional previous char color colour oo*h! 0 or more of previous char oh! ooh! oooh! ooooh! o+h! 1 or more of previous char baa+ baa baaa baaaa baaaaa beg.n begin begun begun beg3n Stephen C Kleene Kleene *, Kleene +

14 Regular Expressions: Anchors ^ $
Pattern Matches ^[A-Z] Palo Alto ^[^A-Za-z] 1 “Hello” \.$ The end. .$ The end? The end!

15 Some Examples Note: [^ - is not – i.e., at the beginning of a disjunction. Later in the disjunction it really means the ^.

16 Morphological variants of ‘puppy’
RE Description Uses? /a*/ Zero or more a’s Optional doubled modifiers /a+/ One or more a’s Non-optional... /a?/ Zero or one a’s Optional... /cat|dog/ ‘cat’ or ‘dog’ Words modifying pets /^cat\.$/ A line that contains only cat. ^anchors beginning, $ anchors end of line. ?? /\bun\B/ Beginnings of longer strings Words prefixed by ‘un’ /pupp(y|ies)/ Morphological variants of ‘puppy’ Note: yet another use of ^ but this time in a different context… it means to start a line! \b matches a word boundary; \B matches a word-non-boundary – a word is a sequence of digits, underscores, letters / (.+)ier and \1ier / happier and happier, fuzzier and fuzzier

17 Example Find me all instances of the word “the” in a text. the
Misses capitalized examples [tT]he Incorrectly returns other or theology [^a-zA-Z][tT]he[^a-zA-Z]

18 Optionality and Repetition
/[Ww]oodchucks?/ matches woodchucks, Woodchucks, woodchuck, Woodchuck /colou?r/ matches color or colour /he{3}/ matches heee /(he){3}/ matches hehehe /(he){3,} matches a sequence of at least 3 he’s

19 Operator Precedence Hierarchy
1. Parentheses () 2. Counters * + ? {} 3. Sequence of Anchors the ^my end$ 4. Disjunction | Examples /moo+/ /try|ies/ /and|or/ Because counters have a higher precidence than sequence, this matches mooooooo. The second one probably does not do what is expected – would match try or ies, but not tries. The third does work as expected.

20 A Simple Exercise /the/ /[tT]he/ /\b[tT]he\b/
Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z][tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.

21 A Simple Exercise /the/ /[tT]he/ /\b[tT]he\b/
Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with southern factions.

22 A Simple Exercise /the/ /[tT]he/ /\b[tT]he\b/
Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z][tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.

23 A Simple Exercise /the/ /[tT]he/ /\b[tT]he\b/
Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z][tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.

24 A Simple Exercise /the/ /[tT]he/ /\b[tT]he\b/
Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z][tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions.

25 A Simple Exercise /the/ /[tT]he/ /\b[tT]he\b/
Write a regular expression to find all instances of the determiner “the”: /the/ /[tT]he/ /\b[tT]he\b/ /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/ The recent attempt by the police to retain their current rates of pay has not gathered much favor with the southern factions. The \b counts words as any sequence of digits, undercores, or letters. Suppose you wanted to find “the” in the context of _the or the28. Could you write that one. Last one: either it starts a line, or the thing in front of it is not a letter; followed by either t or T; followed by he where the following character is not a letter.

26 Errors The process we just went through was based on fixing two kinds of errors Matching strings that we should not have matched (there, then, other) False positives (Type I) Not matching things that we should have matched (The) False negatives (Type II)

27 Errors cont. In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves two antagonistic efforts: Increasing accuracy or precision (minimizing false positives) Increasing coverage or recall (minimizing false negatives).

28 Summary Regular expressions play a surprisingly large role
Sophisticated sequences of regular expressions are often the first model for any text processing text For many hard tasks, we use machine learning classifiers But regular expressions are used as features in the classifiers Can be very useful in capturing generalizations


Download ppt "Natural Language Processing (NLP)"

Similar presentations


Ads by Google