Published by Harrison Burger. Modified over 9 years ago.
1
CORPUS PROCESSING
Kristina Kocijan (krkocijan@ffzg.hr)
Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb
Tutorial Part 1
2
OUTLINE
1. Text (*.not)
   - Create a new text
   - Import a text in any file format (e.g. PDF / XML)
   - Open a text file
2. Corpus (*.noc)
   - Create a corpus
   - Open a corpus
3. Query
   - NooJ RegEx
4. Concordances
   - Build concordances
   - Export concordances
5. Statistical analysis
21.5.2012. LREC 2012 - NooJ Tutorial: Corpus Processing
3
TEXT UNITS = TUs
Independent areas inside which linguistic resources are applied.
Text file: *.not
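To make the idea of independent text units concrete, here is a minimal Python sketch. It is not NooJ code: it assumes, purely for illustration, that each line of the input is one text unit, and it uses Python's `re` module in place of NooJ's own matching. The sample text and variable names are hypothetical.

```python
import re

# Hypothetical sample text; assumption: one text unit (TU) per line.
raw_text = "A young lady\narrived.\nThe lady left."
text_units = raw_text.split("\n")

# A two-word pattern whose words sit in two different text units.
pattern = re.compile(r"\blady\s+arrived\b")

# Searching the raw text as one block finds the sequence across the line break...
found_in_raw_text = bool(pattern.search(raw_text))

# ...but applying the pattern unit by unit finds nothing, because each TU is
# processed independently of its neighbours.
hits_per_unit = [
    (index, match.group())
    for index, unit in enumerate(text_units)
    for match in pattern.finditer(unit)
]

# A single-word pattern is still found inside every unit that contains it.
lady_units = [i for i, unit in enumerate(text_units) if re.search(r"\blady\b", unit)]
```

The point of the sketch is only that a match never spans two units; how a real NooJ text is split into units depends on the settings chosen when the text is created or imported.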
4
TEXT: OPEN
5
TEXT: IMPORT - PDF
6
TEXT: IMPORT - XML
7
TEXT: IMPORT - XML
8
TEXT: CREATE NEW
- Type a new text.
- Copy and paste text from another document.
9
CORPUS
Collection of text files that share the same linguistic resources.
Corpus file: *.noc
10
CORPUS: CONSTRUCTING
11
CORPUS: CONSTRUCTING
12
CORPUS: OPEN EXISTING
13
CORPUS: NEW
14
LOCATE PATTERN – 4 TYPES OF PATTERNS
15
LOCATE PATTERN – 1. STRING OF CHARACTERS
Example: lady
16
LOCATE PATTERN – 2. PERL REGEX
[0-9]
[1-2][0-9][0-9][0-9]
[aeiou] [aeiou] [aeiou] [aeiou]
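The character classes on this slide can be tried outside NooJ as well. The sketch below uses Python's `re` module, whose character-class syntax agrees with Perl's for these simple cases. It is only an illustration, not NooJ's search engine, and the sample string is made up from the tutorial's own footer.

```python
import re

# Hypothetical sample string built from the slide footer.
sample = "LREC 2012 - NooJ Tutorial, 21.5.2012."

# [0-9] matches one digit at a time.
digits = re.findall(r"[0-9]", sample)

# [1-2][0-9][0-9][0-9] matches a four-digit number starting with 1 or 2,
# which is a rough way to find years.
years = re.findall(r"[1-2][0-9][0-9][0-9]", sample)

# [aeiou] matches a single lowercase vowel.
vowels = re.findall(r"[aeiou]", "lady")

print(digits, years, vowels)
```

Note that these patterns work on characters, not on word forms: `[0-9]` finds each digit of `2012` separately.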
17
LOCATE PATTERN – 3. NooJ REGEX
lady
young lady          AND: concatenation
(a | the) lady      parentheses
lady | girl         OR: disjunction, written |
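For readers more used to character-level regular expressions, the three operators can be approximated in Python as below. This is a rough analogy under stated assumptions, not NooJ syntax: NooJ expressions operate on word forms, so the sketch adds explicit word boundaries (`\b`) and whitespace (`\s+`) to imitate that. The sample sentence is hypothetical.

```python
import re

# Hypothetical sample sentence.
text = "The young lady met a lady and the girl near an old lady."

# Concatenation (AND): "young" immediately followed by "lady".
concatenation = re.findall(r"\byoung\s+lady\b", text)

# Parentheses: a disjunction grouped inside a sequence, as in (a | the) lady.
grouped = re.findall(r"\b(?:a|the)\s+lady\b", text)

# Disjunction (OR): either "lady" or "girl".
disjunction = re.findall(r"\b(?:lady|girl)\b", text)

print(concatenation, grouped, disjunction)
```

Without the parentheses, `a | the lady` would read as "a" or "the lady", which is why the grouping matters.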
18
EXAMPLES: WRITE A NooJ REGEX
- that will find all the Mr, Mrs and Miss followed by a name
- that will find all the words written in upper case followed by any string of digits
  (Mr.|Mrs.|Miss) ( )*
If <E> = empty string, write a NooJ RegEx
- that will find all the examples where 'is' is followed by 0, 1 or 2 words of any kind, followed by 'the', 'this' or 'that'
  is (<E> | <WF> | <WF> <WF>) (the|this|that)
- that, instead of 'is', recognizes any form of the verb 'to be'
  <be> (<E> | <WF> | <WF> <WF>) (the|this|that)
- that allows any number of word forms between the 'to be' forms and 'the', 'this', 'that'
  <be> <WF>* (the|this|that)
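The last three exercises can be imitated at token level in Python. This is a sketch under simplifying assumptions, not a NooJ implementation: words are extracted with a crude regular expression, punctuation is ignored, and the forms of "to be" are listed by hand where NooJ would look them up in its dictionary. All names (`BE_FORMS`, `find_matches`, the sample text) are hypothetical.

```python
import re

# Hand-listed forms of "to be" (assumption: NooJ would use its dictionary).
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
DETERMINERS = {"the", "this", "that"}


def tokenize(text):
    """Split text into lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[A-Za-z]+", text.lower())


def find_matches(tokens, heads, max_gap):
    """Find a head word, then 0..max_gap words, then 'the', 'this' or 'that'."""
    matches = []
    for i, token in enumerate(tokens):
        if token not in heads:
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(tokens) and tokens[j] in DETERMINERS:
                matches.append(" ".join(tokens[i:j + 1]))
    return matches


tokens = tokenize("She is the lady. He is not quite the answer. They were on that boat.")

# 'is' followed by 0, 1 or 2 words, then the/this/that.
is_matches = find_matches(tokens, {"is"}, max_gap=2)

# Any form of 'to be' in place of 'is'.
be_matches = find_matches(tokens, BE_FORMS, max_gap=2)

print(is_matches)
print(be_matches)
```

Raising `max_gap` to the length of the token list gives the "any number of word forms" variant.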
19
20
LOCATE PATTERN – 4. NooJ GRAMMAR
21
LOCATE PATTERN ?
- Will probably see
- Shall probably never see
- Is probably going to see
- Are probably about to see
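One way to think about the question is to look for what the four phrases share: an auxiliary, the word "probably", an optional middle part, and "see". The Python sketch below encodes that observation as a single regular expression. It is one possible answer written with Python's `re` module, not the solution given in the tutorial and not NooJ syntax; the variable names are hypothetical.

```python
import re

# auxiliary + "probably" + optional (never | going to | about to) + "see"
pattern = re.compile(
    r"\b(?:will|shall|is|are)\s+probably"
    r"(?:\s+(?:never|going\s+to|about\s+to))?"
    r"\s+see\b",
    re.IGNORECASE,
)

phrases = [
    "Will probably see",
    "Shall probably never see",
    "Is probably going to see",
    "Are probably about to see",
]

# True for each phrase that the pattern covers completely.
covered = [bool(pattern.fullmatch(phrase)) for phrase in phrases]

# A phrase without "probably" should not be matched.
rejected = pattern.search("Will see") is None

print(covered, rejected)
```

A pattern like this grows quickly as more variants are added, which is where a graphical grammar becomes easier to maintain than a single expression.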
22
STATISTICAL ANALYSIS
23
STATISTICAL ANALYSIS
24
STATISTICAL ANALYSIS
25
LINGUISTIC UNITS AND ANNOTATIONS
Max Silberztein (max.silberztein@gmail.com)
University of Besançon
Next: Tutorial Part 2