Published by Diana Ross. Modified over 5 years ago.
The Linguistics Department Presents:
Big Data: Text Mining
The Kucera Server
What we are doing today
- Introducing the corpora
- Searching the data
- Sorting your data
- Saving & exporting
- Manipulating your search data
Available Corpora A simple overview
These are most of the corpora we are making available right now.
The yellow circles are the spoken corpora.
The smaller circle you see here is the sub-set of corpora that are usable with the CQP interface.
Kucera: Available Resources
The possibilities are endless. The resources available are:
The Brown Corpus: The first million-word electronic corpus of English, created at Brown University. It spans about fifteen different categories of text.
The Penn Treebank: Manually corrected phrase-structure trees for English, including 1.2 million words of newspaper text from the Wall Street Journal.
COCA: Corpus of Contemporary American English: 531 million tokens of American English sampled across categories such as Academic, Fiction, Magazine, Newspaper, and Spoken. This version is annotated with lemmas and parts of speech. Provided courtesy of the UGA Library.
Kucera: Available Resources
COHA: Corpus of Historical American English: 400+ million words of historical American English, facilitating diachronic investigation. Provided courtesy of the UGA Library.
British National Corpus: 100 million words of British text annotated with PoS and lemmas, as well as speaker age, social class, and geographical region. 91% was published between 1985 and 1993.
AudioBNC: Audio and all available transcriptions of the 7.5 million words of the spoken portion of the British National Corpus.
SpokenBNC 2014: 11.4 million tokens, orthographically transcribed from smartphone recordings made over several years beginning in 2012. Substantial speaker metadata is included, along with PoS and semantic tags.
Kucera: Available Resources
Arabic Treebank: Approximately 800 thousand words of newswire text from Agence France-Presse, annotated with parts of speech, morphology, and phrase structure.
DEFT Spanish Treebank: About 100 thousand words from both Spanish newswire and discussion forums, with extensive morphological and syntactic annotations.
CETEMPúblico: 180 million words from the Portuguese newspaper "Público", with morphological and syntactic annotations.
French Treebank: About 650 thousand words drawn from the newspaper Le Monde, annotated with syntactic constituents, syntactic categories, lemmas, and compounds.
Kucera: Available Resources
SPMRL 2014: Dependency, constituency, and morphology annotations for Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish, and Swedish.
NEGRA corpus: 355 thousand tokens of text from the German newspaper Frankfurter Rundschau, annotated with syntactic structures.
EuroParl: About 40 million words of European Parliament proceedings, aligned across translations into English, German, Spanish, French, Italian, and Dutch.
CALLHOME: 5-10 minute transcribed snippets from 120 phone calls, each up to 30 minutes in length.
Kucera: Available Resources
The Buckeye Speech Corpus: Recordings of 40 speakers from Columbus, Ohio, totaling more than 300,000 words of speech.
CELEX2: Orthography, phonology, morphology, and attestation-frequency information for words in English, German, and Dutch.
Concretely Annotated New York Times: About 1.3 billion words from articles that appeared in the New York Times, with automatically assigned lemmas and part-of-speech tags.
WaCky corpora: Between 1.2 and 1.9 billion tokens each of French, German, and Italian, crawled from the web, plus about 800 million tokens from a snapshot of English Wikipedia. These corpora are annotated with lemmas and parts of speech.
Conducting Searches A simple overview
Available Corpora: CQP corpora and non-CQP corpora
This is a sub-grouping of all available corpora.
CQP corpora: These are searched through the CQP interface, using regular expressions, PoS, lemma, and other tags.
Non-CQP corpora: These are searched with Linux commands and bash scripting.
The first group uses CQP: the Corpus Query Processor.
CQP Corpora
Single words: an individual word, e.g. “judge”
Strings of words: e.g. “kick” “the” “bucket”
Wildcards: “.”, “?”, “*”, “+”, “( )”, “|”, “[ ]”, e.g. “come” “(for|because)” [ ]* “stay(.+)?”
Tags: [ pos = “vvd” ], [ lemma = “eat” ]
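CQP's wildcard syntax is essentially that of regular expressions, so the same patterns can be tried out with egrep on plain text. A minimal sketch, using a small made-up sample file rather than any of the corpora above:

```shell
# Hypothetical sample lines standing in for corpus text
printf '%s\n' "come for dinner" "come because you must" \
              "stay home" "stayed late" > sample.txt

# “come” “(for|because)”  ->  "come" followed by "for" or "because"
egrep 'come (for|because)' sample.txt

# “stay(.+)?”  ->  "stay" optionally followed by more characters
egrep '^stay(.+)?' sample.txt
```

The first search matches both "come" lines; the second matches both "stay" lines, because the trailing group is optional.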
Other Corpora: Linux Commands
grep with regular expressions: wildcard characters, exact matches, non-matches.
egrep for extended regular expressions.
Useful options: -n (show which lines matched), -c (count matching lines), -i (ignore case), -v (invert the match), and more.
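The options above can be tried on any plain-text file. A quick sketch with a hypothetical three-line file:

```shell
# Hypothetical sample file
printf '%s\n' "The judge ruled." "A JUDGE entered." "No verdict yet." > corpus.txt

grep -n  'judge' corpus.txt   # -n: prefix matches with their line numbers
grep -c  'judge' corpus.txt   # -c: count matching lines (1 here)
grep -ci 'judge' corpus.txt   # -i: ignore case, so "JUDGE" also counts (2)
grep -v  'judge' corpus.txt   # -v: print the lines that do NOT match
```

Options combine freely, as in `-ci` above.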
Other Corpora Linux Commands
Regex and Bash scripting are both well documented and supported: intro webpages, YouTube videos, Lynda.com, and books.
Sorting your data A simple overview
Counting your data: counting commands
Use the command “count” to count your results in various ways.
> count by (attribute)
Attributes include word, lemma, pos, etc.
+ cut (number): cuts the output to only that number of lines.
+ descending: reverses the order.
+ reverse: sorts the matches by suffix.
+ %cd: normalizes for case and/or diacritics.
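The “count” command is specific to the CQP interface, but for the non-CQP corpora a comparable frequency count can be built from standard Linux commands. A sketch with made-up data:

```shell
# Hypothetical plain-text data
printf '%s\n' "the cat sat on the mat" "the dog ran" > text.txt

# Word-frequency table: one word per line, sorted, counted, highest first
tr ' ' '\n' < text.txt | sort | uniq -c | sort -rn
```

Here “the” appears three times, so it tops the list; appending `| head -10` would mimic CQP's `cut 10`.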
Sorting your data: sorting commands
Use the command “sort” to reorder your results; with no arguments it puts the results back in the order they had in the corpus. Additional commands modify the sort.
> sort by (attribute)
Attributes include word, lemma, pos, etc.
+ randomize: shuffles the results so you don’t see only the top.
+ descending: reverses the order.
+ reverse: sorts the matches by suffix.
+ %cd: normalizes for case and/or diacritics.
After sorting, you’ll notice that the original numerical order is now all over the place.
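For the non-CQP corpora, the Linux `sort` command offers similar options. A sketch with a hypothetical word list (the `rev | sort | rev` trick stands in for CQP's suffix sort):

```shell
# Hypothetical word list
printf '%s\n' "banana" "apple" "cherry" > words.txt

sort    words.txt             # ascending alphabetical order
sort -r words.txt             # descending order (cf. CQP's "descending")
rev words.txt | sort | rev    # sort by suffix (cf. CQP's "reverse")
```

The suffix sort orders by final letters, so here "banana" (ending in -a) comes before "apple" (-e) and "cherry" (-y).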
Saving & Exporting A simple overview
Saving Searches: naming searches (CQP)
Each search is stored as the named search “Last”. You can save the last search to call on it later by using the “cat” (concatenate) command with the “>” (write out) or “>>” (append) operator.
> cat Last >> “FileName.txt” (adds to the bottom of the named file)
> cat Last > “FileName.txt” (creates a file of that name, or saves over that file if it already exists)
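The “>” and “>>” operators behave the same way in any shell, so the overwrite-versus-append distinction can be demonstrated with `echo` standing in for a saved search:

```shell
echo "first search"  >  results.txt   # ">" creates (or overwrites) the file
echo "second search" >> results.txt   # ">>" appends to the end
cat results.txt                       # now two lines

echo "new search" > results.txt      # ">" again: the previous two lines are gone
cat results.txt                       # now just one line
```

Prefer “>>” when accumulating results across several searches; use “>” only when you mean to replace the file.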
Saving Searches: naming searches (Bash)
Know your directory and pathways.
Use the “>” (write out) or “>>” (append) operators, but you must put the output file in your home folder.
You don’t have access to an auto-saved last search.
Saving Searches: scripting
Both avenues (CQP and Bash) allow for scripts. You can test and improve a set of commands, adding complexity, until you are happy with the result.
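A minimal sketch of the scripting workflow on the Bash side, with hypothetical filenames: a pipeline is saved as a script so it can be rerun and refined over time.

```shell
# search.sh -- hypothetical reusable search script
cat > search.sh <<'EOF'
#!/bin/bash
# Usage: ./search.sh PATTERN FILE
# Find matches (case-insensitively), drop duplicate lines, report the count.
grep -i "$1" "$2" | sort -u | wc -l
EOF
chmod +x search.sh

# Hypothetical test data, then run the script
printf '%s\n' "the judge" "The judge" "a dog" > mini.txt
./search.sh judge mini.txt
```

Once a pipeline like this works on a small test file, extra steps (counting, filtering, formatting) can be added one at a time.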
Exporting
Best way: WinSCP (with various options for Mac users).
Other ways: FTP (MobaXterm, the Linux command line).
Format: these will be “.txt” files, so using Notepad or Notepad++ is an easy way to see what you have.
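Before (or after) transferring, the exported file can also be inspected from the command line. The transfer command below is commented out because the hostname and paths are hypothetical; the inspection commands work on any local .txt file:

```shell
# Transfer sketch (hypothetical host and path; run from your own machine):
# scp username@server.example.edu:~/FileName.txt .

# Stand-in for an exported results file, then two quick checks:
printf '%s\n' "1: kick the bucket" "2: kick the ball" > FileName.txt
head -2 FileName.txt    # peek at the first lines
wc -l < FileName.txt    # how many result lines were exported
```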
Manipulating Data A simple overview
Manipulating Data
Excel: simple, familiar, short learning curve.
Python, Perl, etc.: steeper learning curve, more powerful, very flexible.
R: also has a steeper learning curve, and is a powerful stats tool.
Also, Linux tools on a local machine: Bash, vi, vim, Atom, etc.
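Staying with the Linux-tools option, even simple one-liners go a long way with exported results. A sketch assuming a hypothetical tab-separated file of words and frequencies:

```shell
# Hypothetical export: word <TAB> frequency
printf 'the\t3\ncat\t1\nmat\t1\n' > freq.tsv

cut -f1 freq.tsv                                    # just the word column
awk -F'\t' '{sum += $2} END {print sum}' freq.tsv   # total token count
```

The same file opens directly in Excel or reads into Python/R, so these tools complement rather than replace each other.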
Summation: What next?
Fall 2019: LING 4886/6886. An excellent opportunity to learn the how and, just as importantly, the why. There will be significant digital humanities content. Counts toward the DH Certificate.
The End