Big Data: Text Mining The Linguistics Department Presents:

Presentation transcript:

Big Data: Text Mining The Linguistics Department Presents: The Kucera Server

What we are doing today: introducing the corpora, searching the data, sorting your data, saving & extracting, and manipulating your search data.

Available Corpora A simple overview

These are most of the corpora we are making available right now. The yellow ones are the Spoken Corpora

The smaller circle you see here is the subset of corpora that are usable with the CQP interface.

Kucera: Available Resources
The possibilities are endless. The resources available are:
The Brown Corpus: This was the first million-word electronic corpus of English, compiled at Brown University from texts published in 1961. It spans about fifteen different categories of text.
The Penn Treebank: Manually-corrected phrase structure trees for English, including 1.2 million words of newspaper text from the Wall Street Journal.
COCA (Corpus Of Contemporary American English): 531 million tokens of American English sampled from 1990-2017 across categories such as Academic, Fiction, Magazine, Newspaper, and Spoken. This version is annotated with lemmas and parts of speech. Provided courtesy of the UGA Library.

Kucera: Available Resources
COHA (Corpus Of Historical American English): 400+ million words from the period 1810-2008, facilitating diachronic investigation. Provided courtesy of the UGA Library.
British National Corpus: 100 million words of British text annotated with PoS and lemmas as well as speaker age, social class and geographical region. 91% was published between 1985 and 1993.
AudioBNC: Audio and all available transcriptions of the 7.5 million words of the spoken portion of the British National Corpus.
SpokenBNC 2014: 11.4 million tokens, orthographically transcribed from smartphone recordings made between 2012 and 2016. Substantial speaker metadata is included with PoS and semantic tags.

Kucera: Available Resources
Arabic Treebank: Approximately 800 thousand words of newswire text from Agence France-Presse annotated with parts of speech, morphology and phrase structure.
DEFT Spanish Treebank: About 100 thousand words from both Spanish newswire and discussion forums, with extensive morphological and syntactic annotations.
CETEMPúblico: 180 million words from the Portuguese newspaper "Público", 1991-1998, with morphological and syntactic annotations.
French Treebank: About 650 thousand words drawn from the newspaper Le Monde, 1989-1994, annotated with syntactic constituents, syntactic categories, lemmas and compounds.

Kucera: Available Resources
SPMRL2014: Dependency, constituency and morphology annotations for Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish.
NEGRA corpus: 355 thousand tokens of the German newspaper Frankfurter Rundschau annotated with syntactic structures.
EuroParl: About 40 million words of European Parliamentary proceedings aligned across translations into English, German, Spanish, French, Italian and Dutch.
CALLHOME: This corpus consists of 5-10 minute snippets from 120 phone calls, each 30 minutes in length.

Kucera: Available Resources
The Buckeye Speech Corpus: This corpus comprises speech from 40 speakers from Columbus, Ohio, totaling more than 300,000 words.
CELEX2: Orthography, phonology, morphology and attestation frequency information for words in English, German and Dutch.
Concretely Annotated New York Times: About 1.3 billion words from articles that appeared in the New York Times 1982-2007, with automatically-assigned lemmas and part-of-speech tags.
WaCky corpora: Between 1.2 and 1.9 billion tokens each of French, German and Italian as crawled from the world wide web. Also includes about 800 million tokens of English Wikipedia as it was in 2009. These corpora are annotated with lemmas and parts of speech.

Conducting Searches A simple overview

Available Corpora
The available corpora fall into two sub-groups: CQP corpora and non-CQP corpora.
CQP corpora: These are searched using the CQP interface, with regular expressions, PoS, lemma, and other tags.
Non-CQP corpora: These are searched with Linux commands and bash scripting.
The first group uses CQP: the Corpus Query Processor.

CQP Corpora
Single words: an individual word, e.g. "judge"
String of words: e.g. "kick" "the" "bucket"
Wild cards: ".", "?", "*", "+", "( )", "|", "[ ]", e.g. "come" "(for|because)" [ ]* "stay(.+)?"
Tags: [ pos = "vvd" ], [ lemma = "eat" ]

Other Corpora: Linux Commands
Grep + regular expressions: wildcard characters, exact matches, non-matches
Egrep + regular expressions
Optional flags: -n (show which lines matched), -c (count how many lines matched), -i (ignore case), -v (invert match), and more.
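The flags above can be tried out with a short sketch like the following. The sample sentences and the temporary file are hypothetical stand-ins for a non-CQP corpus, not files on the Kucera server; `grep -E` is used as the current spelling of `egrep`.

```shell
# Hypothetical sample file standing in for a non-CQP corpus.
corpus_file="$(mktemp)"
printf '%s\n' \
  'The judge entered the room.' \
  'Judge not, lest you be judged.' \
  'They stayed for dinner.' \
  'She will stay because it is late.' > "$corpus_file"

# -n: prefix each matching line with its line number.
grep -n 'judge' "$corpus_file"

# -c: print how many lines match; -i: ignore case.
judge_lines=$(grep -c -i 'judge' "$corpus_file")
echo "lines containing judge (any case): $judge_lines"

# -v: invert the match, keeping lines that do NOT contain the pattern.
other_lines=$(grep -c -v -i 'judge' "$corpus_file")
echo "lines without judge: $other_lines"

# grep -E enables extended regular expressions,
# such as optional groups and alternation.
stay_lines=$(grep -c -E 'stay(ed)? (for|because)' "$corpus_file")
echo "lines with stay/stayed followed by for/because: $stay_lines"

rm -f "$corpus_file"
```

Note that grep matches whole lines of a file, so it has no notion of tokens, lemmas, or PoS tags the way CQP queries do.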

Other Corpora: Linux Commands
Regex and Bash scripting are both well documented and supported. Resources include intro webpages, YouTube videos, Lynda.com, and books.

Sorting your data A simple overview

Counting your data: Counting commands
Use the command "count" to count your results in various ways.
> count by (attribute)
Attributes include word, lemma, pos, etc.
+ cut (number) – cuts to only the number included.
+ descending – reverse the order
+ reverse – sorts the matches by suffix
+ %cd – normalizes for case and/or diacritics

Sorting your data: Sorting commands
Used on its own, the command "sort" puts the results back in the order they appear in the corpus. Additional commands modify the sort.
> sort by (attribute)
Attributes include word, lemma, pos, etc.
+ randomize – shuffle the results so you don't see only the top.
+ descending – reverse the order
+ reverse – sorts the matches by suffix
+ %cd – normalizes for case and/or diacritics
You'll notice that the numerical order is now all over the place.
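The "count" and "sort" commands above belong to CQP. For the non-CQP corpora, a rough analogue can be sketched with standard Linux tools. This is an illustration under assumptions rather than the server's own workflow: the list of hits is made up, and folding case with `tr` only approximates the case part of %cd.

```shell
# Hypothetical search results: one matched word per line.
hits="$(mktemp)"
printf '%s\n' Stay stay stayed stay Stayed stays > "$hits"

# Fold case, group identical forms, count them,
# and list the most frequent forms first.
freq_table=$(tr '[:upper:]' '[:lower:]' < "$hits" | sort | uniq -c | sort -k1,1nr -k2)
echo "$freq_table"

# Pull the count and the form from the first (most frequent) row.
top_count=$(echo "$freq_table" | head -n 1 | awk '{print $1}')
top_form=$(echo "$freq_table" | head -n 1 | awk '{print $2}')
echo "most frequent form: $top_form ($top_count hits)"

rm -f "$hits"
```

The `sort | uniq -c` pairing is needed because `uniq` only merges adjacent identical lines.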

Saving & Exporting A simple overview

Saving Searches: Naming Searches (CQP)
Each search is automatically stored under the name "Last". To keep the last search so you can call on it later, write it to a file using the "cat" (concatenate) command with the ">" (write out) or ">>" (append) command.
> cat Last >> "FileName.txt" (adds to the bottom of the named file)
> cat Last > "FileName.txt" (creates a file of that name, or saves over that file if it already exists)

Saving Searches: Naming Searches (Bash)
Know your directory and pathways.
Use the ">" (write out) or ">>" (append) commands, but you must put the file in your home folder.
You don't have access to auto-saving of last searches.

Scripting
Both avenues allow for scripts. You can test and improve a set of commands, adding complexity, until you like it.
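As a minimal sketch of saving Bash search results, the commands below write one search to a file with ">" and add a second search with ">>". The file names and sample lines are hypothetical, and a scratch directory stands in for your home folder. Saved in a file (say, a hypothetical search.sh) and run with "bash search.sh", the same lines become a small script you can keep refining.

```shell
# Hypothetical paths: a scratch directory stands in for your home folder.
work_dir="$(mktemp -d)"
corpus_file="$work_dir/sample_corpus.txt"
results_file="$work_dir/results.txt"
printf '%s\n' 'kick the bucket' 'kick the ball' 'pass the bucket' > "$corpus_file"

# ">" creates results.txt, or overwrites it if it already exists.
grep 'kick' "$corpus_file" > "$results_file"

# ">>" appends to the bottom of the existing file.
grep 'pass' "$corpus_file" >> "$results_file"

# Check what was saved: total number of lines and the first line.
saved_lines=$(grep -c '' "$results_file")
first_line=$(head -n 1 "$results_file")
echo "saved $saved_lines lines; first line: $first_line"

rm -rf "$work_dir"
```

Because ">" silently overwrites, it is safer to use ">>" whenever you are unsure whether the file already holds results you want to keep.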

Exporting
Best Way - WinSCP (various options for iOS users)
Other Ways - FTP (MobaXterm, Linux command line)
Format - These will be ".txt" files, so using Notepad or Notepad++ is an easy way to see what you have.

Manipulating Data A simple overview

Manipulating Data
Excel – Simple, familiar, short learning curve.
Python, Perl, etc. – Steeper learning curve, more powerful, very flexible.
R – Also has a steeper learning curve; also a powerful stats tool.
Also, Linux tools on a local machine: Bash, vi, vim, atom, etc.

Summation What next?

Fall 2019: LING 4886/6886. An excellent opportunity to learn the how and, just as importantly, the why. There will be significant digital humanities content. Counts toward the DH Certificate.

The End