Text Parsing in Python - Gayatri Nittala - Madhubala Vasireddy


1 Text Parsing in Python - Gayatri Nittala - Madhubala Vasireddy

2 Text Parsing ► The three W’s! ► Efficiency and Perfection

3 What is Text Parsing? ► common programming task ► extract or split a sequence of characters

4 Why Text Parsing? ► Simple file parsing: a tab-separated file ► Data extraction: extract specific information from a log file ► Find and replace ► Parsers: syntactic analysis ► NLP: extract information from a corpus, POS tagging

5 Text Parsing Methods ► String Functions ► Regular Expressions ► Parsers

6 String Functions ► String methods in Python (formerly the string module): faster, easier to understand and maintain ► If you can do it with string functions, DO IT! ► Different built-in functions: find and replace, split and join, startswith and endswith, the is... methods
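As a minimal sketch of the split and join style, a tab-separated line can be taken apart and reassembled with string methods alone. The sample record and its field values are made up for illustration.

```python
# Hypothetical tab-separated record; the values are illustrative only.
record = "alice\t42\tengineer\n"

# rstrip drops the trailing newline, split breaks the line on the tab delimiter.
fields = record.rstrip("\n").split("\t")

# join reassembles the fields with a different delimiter.
csv_line = ",".join(fields)

print(fields)
print(csv_line)
```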

7 Find and Replace ► find, index, rindex, replace ► EX: Replace a string in all files in a directory
    import fileinput
    import glob
    import sys
    files = glob.glob(path)
    for line in fileinput.input(files, inplace=True):
        if stext in line:  # find() returns 0 for a match at the start of a line
            line = line.replace(stext, rtext)
        sys.stdout.write(line)
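The differences within the find family are easy to get wrong, so here is a small sketch on a made-up string. The point to notice is that find returns 0 for a match at the very start and -1 for no match, so a "not found" check should compare against -1 (or use the in operator), while index and rindex raise ValueError instead.

```python
text = "spam and eggs and spam"  # illustrative sample string

first = text.find("spam")     # position of the first occurrence
missing = text.find("ham")    # find signals "not found" with -1
last = text.rindex("spam")    # position of the last occurrence

# index behaves like find but raises instead of returning -1.
try:
    text.index("ham")
    index_raised = False
except ValueError:
    index_raised = True

replaced = text.replace("spam", "ham")
print(first, missing, last, index_raised, replaced)
```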

8 startswith and endswith ► Extract quoted words from the given text
    myString = '"123"'
    if myString.startswith('"'):
        print("string with double quotes")
► Find if the sentences are interrogative or exclamatory: "What an amazing game that was!", "Do you like this?"
    endings = ('!', '?')
    sentence.endswith(endings)
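A short sketch of the sentence check, assuming a hypothetical helper named classify and a made-up third sentence for contrast. Passing a tuple to endswith tests all endings in one call.

```python
def classify(sentence):
    """Label a sentence by its final punctuation mark (hypothetical helper)."""
    s = sentence.strip()
    if s.endswith("?"):
        return "interrogative"
    if s.endswith("!"):
        return "exclamatory"
    return "other"

sentences = [
    "What an amazing game that was!",
    "Do you like this?",
    "It rained all day.",  # added for contrast, not from the slides
]

# A tuple argument matches if the string ends with any of its members.
endings = ("!", "?")
flagged = [s for s in sentences if s.endswith(endings)]
labels = [classify(s) for s in sentences]
print(flagged)
print(labels)
```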

9 is... methods ► check for alphabetic characters, numerals, character case, etc. ► m = 'xxxasdf ' ► m.isalpha() returns False (because of the trailing space)
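A few of the is... methods side by side, as a quick sketch. The extra sample strings are illustrative; the first one is the value from the slide.

```python
m = "xxxasdf "  # note the trailing space

raw_alpha = m.isalpha()               # the space is not alphabetic
stripped_alpha = m.strip().isalpha()  # stripping whitespace first changes the answer
digits = "2023".isdigit()
upper = "ABC".isupper()
blank = " \t".isspace()

print(raw_alpha, stripped_alpha, digits, upper, blank)
```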

10 Regular Expressions ► concise way for complex patterns ► amazingly powerful ► wide variety of operations ► when you go beyond simple, think about regular expressions!

11 Real world problems ► Match IP addresses, email addresses, URLs ► Match balanced sets of parentheses ► Substitute words ► Tokenize ► Validate ► Count ► Delete duplicates ► Natural language processing

12 RE in Python ► Unleash the power: built-in re module ► Functions to compile patterns: compile ► to perform matches: match, search, findall, finditer ► to perform operations on a match object: group, start, end, span ► to substitute: sub, subn ► Metacharacters

13 Compiling patterns ► re.compile() ► patterns for an IP address, from loosest to strictest
    ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
    ^\d+\.\d+\.\d+\.\d+$
    ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
    ^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
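To see why the stricter pattern matters, the sketch below compiles a loose and a strict version and runs both over a few made-up candidates. The names OCTET, loose and strict are illustrative, and the octet uses a non-capturing group so it can be repeated with join.

```python
import re

# Octet alternation from the strictest pattern: accepts 0-255 only.
OCTET = r"(?:[01]?\d\d?|2[0-4]\d|25[0-5])"

loose = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
strict = re.compile(r"^" + r"\.".join([OCTET] * 4) + r"$")

# Illustrative inputs: two valid addresses, one out-of-range, one too short.
candidates = ["192.168.0.1", "255.255.255.255", "999.1.1.1", "1.2.3"]

loose_ok = [ip for ip in candidates if loose.match(ip)]
strict_ok = [ip for ip in candidates if strict.match(ip)]
print(loose_ok)
print(strict_ok)
```

The loose pattern accepts 999.1.1.1 because it only counts digits; the strict one rejects it because no alternative of the octet can consume 999.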

14 Compiling patterns ► patterns for matching parentheses
    \(.*\)
    \([^)]*\)
    \([^()]*\)
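The three patterns behave quite differently on nested input. A small sketch on a made-up expression shows what each one captures:

```python
import re

text = "f(a) + g(b(c))"  # illustrative input with one nested pair

greedy = re.findall(r"\(.*\)", text)         # first "(" through the last ")"
no_close = re.findall(r"\([^)]*\)", text)    # stops at the first ")"
innermost = re.findall(r"\([^()]*\)", text)  # only pairs with no parentheses inside

print(greedy)
print(no_close)
print(innermost)
```

None of the three matches arbitrarily deep balanced nesting; the last one only finds innermost pairs, which is one reason nested text tends to call for a parser.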

15 Substitute ► Perform several string substitutions on a given string
    import re
    def make_xlat(*args, **kwargs):
        adict = dict(*args, **kwargs)
        rx = re.compile('|'.join(map(re.escape, adict)))
        def one_xlate(match):
            return adict[match.group(0)]
        def xlate(text):
            return rx.sub(one_xlate, text)
        return xlate
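A usage sketch for make_xlat follows. The function is restated only so the block runs on its own, and the substitution table and sample sentence are illustrative.

```python
import re

def make_xlat(*args, **kwargs):
    """Build a function that applies all substitutions in a single pass."""
    adict = dict(*args, **kwargs)
    # One alternation of all escaped keys, so each position is matched once.
    rx = re.compile("|".join(map(re.escape, adict)))
    def one_xlate(match):
        return adict[match.group(0)]
    def xlate(text):
        return rx.sub(one_xlate, text)
    return xlate

# Illustrative substitution table.
translate = make_xlat({
    "Larry Wall": "Guido van Rossum",
    "creator": "Benevolent Dictator for Life",
    "Perl": "Python",
})
result = translate("Larry Wall is the creator of Perl")
print(result)
```

Because the text is scanned once, a replacement is never itself replaced. Alternation tries keys in order, so when one key is a prefix of another the longer key should come first.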

16 Count ► Split and count words in the given text  p = re.compile(r'\W+')  len(p.split('This is a test for split().'))
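One thing to watch when counting this way: if the text ends with non-word characters, split leaves an empty string at the end of the list, so the length overcounts by one. The sketch below compares it with findall on the same sentence.

```python
import re

text = "This is a test for split()."

# Splitting on runs of non-word characters leaves a trailing '' here,
# because the text ends with "()."
pieces = re.compile(r"\W+").split(text)

# Collecting the word runs directly avoids the empty entry.
words = re.findall(r"\w+", text)

print(pieces, len(pieces))
print(words, len(words))
```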

17 Tokenize ► Parsing and Natural Language Processing  s = 'tokenize these words'  words = re.compile(r'\b\w+\b|\$')  words.findall(s)  ['tokenize', 'these', 'words']

18 Common Pitfalls ► Using re where string methods are enough: operations on fixed strings, a single character class, no case-sensitivity issues ► re.sub() vs. str.replace() ► re.sub() vs. str.translate() ► match vs. search ► greedy vs. non-greedy
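Three of these pitfalls fit in a few lines. The sample strings are illustrative.

```python
import re

# match only tries at the start of the string; search scans forward.
at_start = re.match("super", "insuperable")
anywhere = re.search("super", "insuperable")

# Greedy ".*" runs to the last ">"; non-greedy ".*?" stops at the first one.
html = "<html><head><title>Title</title>"
greedy = re.match("<.*>", html).group()
lazy = re.match("<.*?>", html).group()

# For a fixed string, str.replace gives the same result as re.sub
# without any escaping.
via_sub = re.sub(r"\.", "-", "a.b.c")
via_replace = "a.b.c".replace(".", "-")

print(at_start, anywhere.span(), greedy, lazy, via_sub, via_replace)
```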

19 PARSERS ► Flat and Nested texts ► Nested tags, Programming language constructs ► Better to do less than to do more!

20 Parsing Non-flat Texts ► Grammar ► States ► Generate tokens and act on them ► Lexer: generates a stream of tokens ► Parser: generates a parse tree out of the tokens ► Lex and Yacc
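To make the lexer step concrete, here is a hand-rolled sketch using only the re module. It is not Lex or PLY; the names TOKEN_SPEC, MASTER and tokenize, and the token set itself, are assumptions chosen for illustration.

```python
import re

# Hypothetical token set: each entry is (token type, regular expression).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME",   r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("OP",     r"[=+\-*/()]"),
    ("SKIP",   r"[ \t]+"),
    ("ERROR",  r"."),   # anything else is an error
]

# One master pattern with a named group per token type.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Yield (type, value) pairs: the stream of tokens a parser would consume."""
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(f"unexpected character {m.group()!r}")
        value = int(m.group()) if kind == "NUMBER" else m.group()
        yield kind, value

tokens = list(tokenize("x = 3 + 42"))
print(tokens)
```

A parser would then consume this stream and build a tree according to the grammar, which is the part a tool like Yacc generates.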

21 Grammar Vs RE ► Floating Point
    #---- EBNF-style description of Python ---#
    floatnumber   ::= pointfloat | exponentfloat
    pointfloat    ::= [intpart] fraction | intpart "."
    exponentfloat ::= (intpart | pointfloat) exponent
    intpart       ::= digit+
    fraction      ::= "." digit+
    exponent      ::= ("e" | "E") ["+" | "-"] digit+
    digit         ::= "0"..."9"

22 Grammar Vs RE
    pat = r'''(?x)
        (                   # exponentfloat
          (                 # intpart or pointfloat
            (               # pointfloat
              (\d+)?[.]\d+  # optional intpart with fraction
              |
              \d+[.]        # intpart with period
            )               # end pointfloat
            |
            \d+             # intpart
          )                 # end intpart or pointfloat
          [eE][+-]?\d+      # exponent
        )                   # end exponentfloat
        |
        (                   # pointfloat
          (\d+)?[.]\d+      # optional intpart with fraction
          |
          \d+[.]            # intpart with period
        )                   # end pointfloat
    '''
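One way to keep a regular expression closer to the grammar is to build it from named fragments, one per grammar rule. This is a sketch of that idea rather than the pattern from the slide; the variable names mirror the EBNF rules and the test strings are illustrative.

```python
import re

# One fragment per grammar rule, composed bottom-up.
digit = r"\d"
intpart = rf"{digit}+"
fraction = rf"[.]{digit}+"
exponent = rf"[eE][+-]?{digit}+"
pointfloat = rf"(?:(?:{intpart})?{fraction}|{intpart}[.])"
exponentfloat = rf"(?:(?:{pointfloat}|{intpart}){exponent})"
floatnumber = re.compile(rf"(?:{exponentfloat}|{pointfloat})")

# fullmatch requires the whole string to be a float literal.
samples = ["3.14", "10.", ".5", "1e10", "1.5E-3", "42", "1e", "."]
accepted = [s for s in samples if floatnumber.fullmatch(s)]
print(accepted)
```

Plain integers such as 42 are rejected, as the grammar intends: a float needs a point or an exponent.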

23 PLY - The Python Lex and Yacc ► higher-level and cleaner grammar language ► LALR(1) parsing ► extensive input validation, error reporting, and diagnostics ► Two modules: lex.py and yacc.py

24 Using PLY - Lex and Yacc ► Lex: ► Import the 'lex' module ► Define a list or tuple variable 'tokens' naming the token types the lexer is allowed to produce ► Define tokens by assigning to specially named variables ('t_tokenName') ► Build the lexer: mylexer = lex.lex() ► Feed it input: mylexer.input(mytext) # handled by yacc

25 Lex
    t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    def t_NUMBER(t):
        r'\d+'
        try:
            t.value = int(t.value)
        except ValueError:
            print("Integer value too large", t.value)
            t.value = 0
        return t
    t_ignore = " \t"

26 Yacc ► Import the 'yacc' module ► Get a token map from a lexer ► Define a collection of grammar rules ► Build the parser  yacc.yacc()  yacc.parse('x=3')

27 Yacc ► Specially named functions having a 'p_' prefix
    def p_statement_assign(p):
        'statement : NAME "=" expression'
        names[p[1]] = p[3]
    def p_statement_expr(p):
        'statement : expression'
        print(p[1])

28 Summary ► String Functions: a rule of thumb - if string functions can do the job, use them. ► Regular Expressions: complex patterns - anything beyond simple! ► Lex and Yacc: parse non-flat texts that follow some rules

29 References ► http://docs.python.org/ ► http://code.activestate.com/recipes/langs/python/ ► http://www.regular-expressions.info/ ► http://www.dabeaz.com/ply/ply.html ► Mastering Regular Expressions by Jeffrey E. F. Friedl ► Python Cookbook by Alex Martelli, Anna Martelli & David Ascher ► Text Processing in Python by David Mertz

30 Thank You Q & A

