Text Parsing in Python - Gayatri Nittala - Madhubala Vasireddy
Text Parsing ► The three W’s! ► Efficiency and Perfection
What is Text Parsing? ► A common programming task ► Extracting or splitting a sequence of characters
Why Text Parsing? ► Simple file parsing A tab-separated file ► Data extraction Extract specific information from a log file ► Find and replace ► Parsers - syntactic analysis ► NLP Extract information from a corpus POS tagging
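The first use case above can be sketched in two lines; the field names and sample line are made up for illustration:

```python
# Parse one line of a hypothetical tab-separated log file.
line = "2024-01-15\tERROR\tdisk full"
fields = line.split("\t")
date, level, message = fields
print(date, level, message)
```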
Text Parsing Methods ► String Functions ► Regular Expressions ► Parsers
String Functions ► The built-in string methods in Python Faster, easier to understand and maintain ► If you can do it with them, DO IT! ► Different built-in methods Find-Replace Split-Join Startswith and Endswith Is-methods
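Of the methods listed, split and join are the quickest to show; a minimal sketch:

```python
# Split on whitespace, then rejoin with a different separator.
words = "text parsing in python".split()
joined = "-".join(words)
print(joined)  # text-parsing-in-python
```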
Find and Replace ► find, index, rindex, replace ► Ex: replace a string in all files in a directory

import glob, fileinput, sys

files = glob.glob(path)
for line in fileinput.input(files, inplace=1):
    if line.find(stext) >= 0:   # >= 0, since find() returns 0 for a match at the start
        line = line.replace(stext, rtext)
    sys.stdout.write(line)
startswith and endswith ► Detect quoted strings in the given text

myString = '"123"'
if myString.startswith('"'):
    print('string with double quotes')

► Find whether a sentence is interrogative or exclamative ► What an amazing game that was! ► Do you like this?

endings = ('!', '?')
sentence.endswith(endings)
is-methods ► Check for alphabetic characters, digits, character case, etc.

m = 'xxxasdf '
m.isalpha()   # False - the trailing space is not alphabetic
Regular Expressions ► concise way for complex patterns ► amazingly powerful ► wide variety of operations ► when you go beyond simple, think about regular expressions!
Real world problems ► Match IP addresses, email addresses, URLs ► Match balanced sets of parentheses ► Substitute words ► Tokenize ► Validate ► Count ► Delete duplicates ► Natural language processing
RE in Python ► Unleash the power - the built-in re module ► compile to compile patterns ► match, search, findall, finditer to perform matches ► group, start, end, span to operate on a match object ► sub, subn to substitute ► Metacharacters
Compiling patterns ► re.compile() ► Patterns for an IP address, from loose to strict:

^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
^\d+\.\d+\.\d+\.\d+$
^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
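The strict octet pattern can be exercised like this; the version below is a compact equivalent that repeats the octet with {3} instead of writing it four times:

```python
import re

# Strict IPv4 check: each octet limited to 0-255.
ip_pat = re.compile(
    r'^([01]?\d\d?|2[0-4]\d|25[0-5])'
    r'(\.([01]?\d\d?|2[0-4]\d|25[0-5])){3}$'
)

print(bool(ip_pat.match('192.168.0.1')))  # True
print(bool(ip_pat.match('256.1.1.1')))    # False - 256 is out of range
```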
Compiling patterns ► Patterns for matching parentheses:

\(.*\)
\([^)]*\)
\([^()]*\)
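The difference between the three attempts only shows up on nested input; a quick sketch on a made-up string:

```python
import re

s = 'f(a(b)c)d(e)'
print(re.findall(r'\(.*\)', s))      # greedy: one span from first ( to last )
print(re.findall(r'\([^)]*\)', s))   # stops at the first ), may split a nested pair
print(re.findall(r'\([^()]*\)', s))  # innermost balanced pairs only
```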
Substitute ► Perform several string substitutions on a given string in one pass

import re

def make_xlat(*args, **kwargs):
    adict = dict(*args, **kwargs)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlate(match):
        return adict[match.group(0)]
    def xlate(text):
        return rx.sub(one_xlate, text)
    return xlate
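A usage sketch of the make_xlat recipe (its definition is repeated so the example is self-contained); swapping 'cat' and 'dog' shows why a single-pass substitution beats two chained replace() calls, which would turn every animal into the same one:

```python
import re

def make_xlat(*args, **kwargs):
    # One regex matching any key; each match is replaced via the dict.
    adict = dict(*args, **kwargs)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlate(match):
        return adict[match.group(0)]
    def xlate(text):
        return rx.sub(one_xlate, text)
    return xlate

translate = make_xlat({'cat': 'dog', 'dog': 'cat'})
print(translate('cat chases dog'))  # dog chases cat
```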
Count ► Split and count words in the given text

p = re.compile(r'\W+')
len(p.split('This is a test for split().'))
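One caveat with this recipe: splitting on \W+ leaves an empty string at the edge when the text ends in punctuation, so the raw length over-counts by one here. A sketch that filters the empties before counting:

```python
import re

p = re.compile(r'\W+')
parts = p.split('This is a test for split().')
words = [w for w in parts if w]  # drop the empty string left by the final '.'
print(len(parts), len(words))   # 7 6
```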
Tokenize ► Parsing and Natural Language Processing

s = 'tokenize these words'
words = re.compile(r'\b\w+\b|\$')   # word tokens, plus a literal '$' token
words.findall(s)
['tokenize', 'these', 'words']
Common Pitfalls ► For operations on fixed strings, a single character class, or no case-sensitivity issues, plain string methods suffice: ► re.sub() vs. string.replace() ► re.sub() vs. string.translate() ► match vs. search ► greedy vs. non-greedy
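The last two pitfalls in one sketch: match() anchors at the start of the string while search() scans forward, and *? is the non-greedy form of *:

```python
import re

# match() anchors at the start; search() scans the whole string.
print(re.match(r'\d+', 'abc 123'))           # None
print(re.search(r'\d+', 'abc 123').group())  # 123

# Greedy .* takes as much as possible; non-greedy .*? as little.
html = '<b>bold</b> and <i>italic</i>'
print(re.findall(r'<.*>', html))   # one long match
print(re.findall(r'<.*?>', html))  # the four tags
```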
PARSERS ► Flat and nested texts ► Nested tags, programming language constructs ► Better to do less than to do more!
Parsing non-flat texts ► Grammar ► States ► Generate tokens and act on them ► Lexer - generates a stream of tokens ► Parser - generates a parse tree out of the tokens ► Lex and Yacc
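The lexer/parser split can be sketched without any library; a toy lexer yielding (type, value) tokens and a parser that consumes them, for an assumed grammar of digits separated by '+':

```python
import re

def lex(text):
    # Lexer: turn the raw string into a stream of (type, value) tokens.
    token_spec = [('NUMBER', r'\d+'), ('PLUS', r'\+'), ('SKIP', r'\s+')]
    pattern = '|'.join('(?P<%s>%s)' % pair for pair in token_spec)
    for m in re.finditer(pattern, text):
        if m.lastgroup != 'SKIP':
            yield (m.lastgroup, m.group())

def parse(tokens):
    # Parser: expr ::= NUMBER ('+' NUMBER)* -- evaluated while consuming tokens.
    tokens = list(tokens)
    total = int(tokens[0][1])
    for i in range(1, len(tokens), 2):
        assert tokens[i][0] == 'PLUS'
        total += int(tokens[i + 1][1])
    return total

print(parse(lex('1 + 2 + 3')))  # 6
```

Real tools like Lex and Yacc generate this machinery from declarative rules instead of hand-written loops.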
Grammar vs. RE ► Floating point

#---- EBNF-style description of Python floats ----#
floatnumber   ::= pointfloat | exponentfloat
pointfloat    ::= [intpart] fraction | intpart "."
exponentfloat ::= (intpart | pointfloat) exponent
intpart       ::= digit+
fraction      ::= "." digit+
exponent      ::= ("e" | "E") ["+" | "-"] digit+
digit         ::= "0"..."9"
Grammar vs. RE

pat = r'''(?x)
    (                      # exponentfloat
        (                  # intpart or pointfloat
            (              # pointfloat
                (\d+)?[.]\d+   # optional intpart with fraction
              | \d+[.]         # intpart with period
            )              # end pointfloat
          | \d+            # intpart
        )                  # end intpart or pointfloat
        [eE][+-]?\d+       # exponent
    )                      # end exponentfloat
  | (                      # pointfloat
        (\d+)?[.]\d+       # optional intpart with fraction
      | \d+[.]             # intpart with period
    )                      # end pointfloat
'''
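The verbose pattern can be exercised with re.fullmatch; the sketch below is a condensed transcription of the same alternatives, with sample inputs chosen for illustration:

```python
import re

# Verbose-mode float pattern mirroring the EBNF:
# exponentfloat (intpart or pointfloat, then exponent) | pointfloat.
pat = r'''(?x)
    (                   # exponentfloat
        ( (\d+)?[.]\d+  #   pointfloat: optional intpart with fraction
        | \d+[.]        #   pointfloat: intpart with trailing period
        | \d+           #   bare intpart
        )
        [eE][+-]?\d+    #   exponent
    )
  | (                   # pointfloat alone
        (\d+)?[.]\d+
      | \d+[.]
    )
'''
for s in ['3.14', '.5', '2.', '1e10', '6.02e23', '42']:
    print(s, bool(re.fullmatch(pat, s)))
```

Note that a bare integer like '42' is correctly rejected: the grammar requires either a period or an exponent.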
PLY - The Python Lex and Yacc ► A higher-level and cleaner grammar language ► LALR(1) parsing ► Extensive input validation, error reporting, and diagnostics ► Two modules: lex.py and yacc.py
Using PLY - Lex and Yacc ► Lex: ► Import the lex module ► Define a list or tuple 'tokens' naming the token types the lexer is allowed to produce ► Define each token - by assigning to a specially named variable or function ('t_TOKENNAME') ► Build the lexer

mylexer = lex.lex()
mylexer.input(mytext)  # the token stream is then consumed by yacc
Lex

t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print("Integer value too large", t.value)
        t.value = 0
    return t

t_ignore = ' \t'
Yacc ► Import the yacc module ► Get a token map from a lexer ► Define a collection of grammar rules ► Build the parser

yacc.yacc()
yacc.parse('x=3')
Yacc ► Grammar rules are specially named functions with a 'p_' prefix; the docstring holds the rule

def p_statement_assign(p):
    'statement : NAME "=" expression'
    names[p[1]] = p[3]

def p_statement_expr(p):
    'statement : expression'
    print(p[1])
Summary ► String functions A rule of thumb - if you can do it with them, do it. ► Regular expressions For complex patterns - anything beyond simple! ► Lex and Yacc For parsing non-flat texts that follow a grammar
References ► http://docs.python.org/ ► http://code.activestate.com/recipes/langs/python/ ► http://www.regular-expressions.info/ ► http://www.dabeaz.com/ply/ply.html ► Mastering Regular Expressions by Jeffrey E. F. Friedl ► Python Cookbook by Alex Martelli, Anna Martelli Ravenscroft & David Ascher ► Text Processing in Python by David Mertz
Thank You Q & A