CORPUS PROCESSING Kristina Kocijan Department of Information and Communication Sciences Faculty of Humanities and Social Sciences University.


1 CORPUS PROCESSING Kristina Kocijan krkocijan@ffzg.hr Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb. Tutorial Part 1

2 OUTLINE
1. Text (*.not): Create a new text. Import a text in any file format (e.g. pdf / xml). Open a text file.
2. Corpus (*.noc): Create a corpus. Open a corpus.
3. Query: NooJ RegEx.
4. Concordances: Build concordances. Export concordances.
5. Statistical analysis.
21.5.2012. LREC 2012 - NooJ Tutorial: Corpus Processing

3 TEXT UNITS = TUs Independent areas inside which linguistic resources are applied. *.not

4 TEXT: OPEN

5 TEXT: IMPORT - PDF

6 TEXT: IMPORT - XML

7 TEXT: IMPORT - XML

8 TEXT: CREATE NEW Type a new text, or copy and paste text from another document.

9 CORPUS Collection of text files that share the same linguistic resources. *.noc

10 CORPUS: CONSTRUCTING

11 CORPUS: CONSTRUCTING

12 CORPUS: OPEN EXISTING

13 CORPUS: NEW

14 LOCATE PATTERN – 4 TYPES OF PATTERNS

15 LOCATE PATTERN – 1. STRING OF CHARACTERS Example: lady

16 LOCATE PATTERN – 2. PERL RegEx Examples: [0-9] [1-2][0-9][0-9][0-9] [aeiou] [aeiou] [aeiou] [aeiou]
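A minimal sketch of how the first two character-class patterns on this slide behave, using Python's `re` module as a stand-in for NooJ's PERL-style search. The sample sentence and variable names are invented for illustration and do not come from the tutorial.

```python
import re

# Hypothetical sample text (not from the slides).
text = "Jane Austen was born in 1775; the novel appeared in 1813 in 3 volumes."

# [0-9] matches any single digit.
digits = re.findall(r"[0-9]", text)

# [1-2][0-9][0-9][0-9] matches four digits whose first digit is 1 or 2,
# a rough way to pick out years.
years = re.findall(r"[1-2][0-9][0-9][0-9]", text)

# [aeiou] matches a single lowercase vowel.
vowels = re.findall(r"[aeiou]", "lady")

print(digits)
print(years)
print(vowels)
```

Each bracketed class consumes exactly one character, so a sequence of classes matches a sequence of characters of the same length.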

17 LOCATE PATTERN – 3. NooJ RegEx
lady
young lady : AND, concatenation
( a | the ) lady : parentheses
lady | girl : OR, disjunction |
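The three operators on this slide (concatenation, parentheses, disjunction) can be approximated outside NooJ. The sketch below uses Python's `re` module, which works on characters rather than on NooJ's word forms, so word boundaries (`\b`) are added by hand. The sample sentence is invented for illustration.

```python
import re

# Hypothetical sample text (not from the slides).
text = "The young lady met a lady and the girl."

# Concatenation (AND): "young" immediately followed by "lady".
concatenation = re.findall(r"\byoung lady\b", text)

# Parentheses group the alternatives "a" and "the" before "lady".
grouped = re.findall(r"\b(?:a|the) lady\b", text, flags=re.IGNORECASE)

# Disjunction (OR): either "lady" or "girl".
disjunction = re.findall(r"\b(?:lady|girl)\b", text)

print(concatenation)
print(grouped)
print(disjunction)
```

Note that "The young lady" is not returned by the grouped pattern, because "The" is not directly followed by "lady".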

18 EXAMPLES: WRITE A NooJ RegEx
- that will find all the Mr, Mrs and Miss followed by a name
- that will find all the words written in upper case followed by any string of digits
If <E> = empty string, write a NooJ RegEx
- that will find all the examples where 'is' is followed by 0, 1 or 2 words of any kind that are followed by 'the', 'this' or 'that'
- that, instead of 'is', recognizes any form of the verb 'to be'
- that allows any number of word forms between the 'to be' forms and 'the', 'this', 'that'
Patterns as shown on the slide: (Mr.|Mrs.|Miss) ( )* is ( | | ) (the|this|that) ( | | ) (the|this|that) * (the|this|that)
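The last three exercises can be approximated with an ordinary regular expression. The sketch below is a Python analogue, not the NooJ answer: the optional group `{0,2}` plays the role of "empty string, one word, or two words", and the list of 'to be' forms is hard-coded as an assumption, whereas NooJ would look the forms up in its dictionaries. The sample text is invented.

```python
import re

# Hypothetical sample text (not from the slides).
text = ("This is the house. It is not that simple. "
        "He is by far the best. She was the one.")

# 'is', then 0, 1 or 2 arbitrary words, then the/this/that.
is_pattern = re.compile(r"\bis(?:\s+\w+){0,2}\s+(?:the|this|that)\b")
is_matches = [m.group(0) for m in is_pattern.finditer(text)]

# Same idea, but any listed form of 'to be' instead of 'is'.
# The form list is an illustrative assumption, not a full paradigm.
be_pattern = re.compile(
    r"\b(?:am|is|are|was|were)\b(?:\s+\w+){0,2}\s+(?:the|this|that)\b"
)
be_matches = [m.group(0) for m in be_pattern.finditer(text)]

print(is_matches)
print(be_matches)
```

Replacing `{0,2}` with `*` gives the "any number of word forms" variant asked for in the final exercise.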

20 LOCATE PATTERN – 4. NooJ GRAMMAR

21 LOCATE PATTERN?
Will probably see
Shall probably never see
Is probably going to see
Are probably about to see
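The slide leaves the pattern open. One possible generalization of the four phrases, written as a Python regular expression rather than as a NooJ grammar, is sketched below. The choice of optional parts ("never", "going to", "about to") is an assumption read off the four examples only.

```python
import re

phrases = [
    "Will probably see",
    "Shall probably never see",
    "Is probably going to see",
    "Are probably about to see",
]

# auxiliary + "probably" + optional "never" + optional "going to"/"about to" + "see"
pattern = re.compile(
    r"\b(?:will|shall|is|are)\s+probably"
    r"(?:\s+never)?"
    r"(?:\s+(?:going|about)\s+to)?"
    r"\s+see\b",
    re.IGNORECASE,
)

# True only if every phrase from the slide is matched in full.
all_match = all(pattern.fullmatch(p) is not None for p in phrases)
print(all_match)
```

A single pattern like this also accepts combinations the slide does not list, such as "Is probably never going to see", which may or may not be wanted.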

22 STATISTICAL ANALYSIS

23 STATISTICAL ANALYSIS

24 STATISTICAL ANALYSIS

25 LINGUISTIC UNITS AND ANNOTATIONS Max Silberztein max.silberztein@gmail.com University of Besançon. Next: Tutorial Part 2

