Text Editing Kim Shepherd Digital Development Team The University of Auckland Library Tools, tips, tricks LIANZA ITSIG webinar.

Presentation transcript:

Text Editing Kim Shepherd Digital Development Team The University of Auckland Library Tools, tips, tricks LIANZA ITSIG webinar series

Summary
General (large) text files
– We manage and manipulate text data daily
– It’s tedious and time consuming
– Find & Replace is too limited and dangerous
– We know there must be a better way...
Tabular data files (e.g. spreadsheets)
– We work with these all the time, usually in Excel
– What tools can help us clean messy data?

Topics
– Regular Expressions
– Text Editors
– Operating on lines, not entire files
– Google Refine

Regular Expressions
/^\s+[a-zA-Z0-9](?:\W+)/

Regular Expressions
A way to describe a set of strings and capture parts of them
Originated in old UNIX/POSIX tools
Now used all over the place
Test your regexes out on the web:
–
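To make "describe a set of strings and capture parts of them" concrete, here is a minimal Python sketch. The "Surname, Forename (year)" input format, the pattern and the function names are all assumptions made up for illustration; they do not come from the slides.

```python
import re

# Hypothetical input format: "Surname, Forename (year)".
# Named groups capture the parts of each matching string.
PATTERN = re.compile(
    r"^\s*(?P<surname>[A-Za-z'-]+),\s+(?P<forename>[A-Za-z]+)\s+\((?P<year>\d{4})\)\s*$"
)


def parse_entry(line):
    """Return the captured parts as a dict, or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()


def swap_names(line):
    """Rewrite a matching line as "Forename Surname (year)"; leave other lines alone."""
    return PATTERN.sub(r"\g<forename> \g<surname> (\g<year>)", line)
```

Because the pattern is anchored with ^ and $, a line that does not fit the expected shape is left untouched, which is the kind of safety that a plain Find & Replace does not give you.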

Text Editors & Useful Languages
sed, grep, awk

Text Editors
Word processors aren’t text editors
Shop around, compare features
My favourite: Vim (UNIX, Windows, Mac)
– Wikipedia comparison of editor features
– Wikipedia list of regex software

Useful Languages / Interpreters
Perl
– An old favourite, great for string manipulation
Python
– The cool kids tell me it’s better than Perl
GREL
– We’ll get to this later...

Line-by-line processing while( ) {.... }

Line-by-line processing
Large files are large!
– If they’re big on disk, they’ll be big in memory
Lines are (usually!) small
– Read a line
– Do something with it
– Output the modified line
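The read, modify, output cycle above can be sketched in Python as follows. The whitespace clean-up in transform is only a placeholder assumption standing in for "do something with it"; the function names are hypothetical.

```python
import re


def transform(line):
    """Placeholder edit: strip trailing whitespace and collapse runs of spaces/tabs."""
    return re.sub(r"[ \t]+", " ", line.rstrip())


def process(lines, out):
    """Read one line at a time, modify it, and write it straight out.

    Only the current line is held in memory, so the size of the
    input file does not matter. Returns the number of lines handled.
    """
    count = 0
    for line in lines:
        out.write(transform(line) + "\n")
        count += 1
    return count
```

Any iterable of lines works as input, so an open file object or sys.stdin can be passed in directly, for example process(sys.stdin, sys.stdout) after importing sys.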

Google Refine
Cleans messy tabular data
– Easy faceting and filtering of columns/values
– Easy transformation of values
Google Refine Expression Language (GREL)
– Extensive use of regular expressions and other standard string manipulation techniques
Other features
– Perform web service calls directly, reconcile row IDs
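For readers who have not seen a facet before, the idea can be approximated in a few lines of Python. This is a rough sketch of the concept only, not Google Refine's implementation and not GREL; the column name, sample data and helper names are assumptions.

```python
import csv
import io
from collections import Counter


def normalise(value):
    """Simple value transformation: trim, collapse inner whitespace, lower-case."""
    return " ".join(value.split()).lower()


def facet(rows, column, key=normalise):
    """Count how often each (transformed) value occurs in one column."""
    return Counter(key(row[column]) for row in rows)


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))
```

Calling facet(rows, "publisher") on a hypothetical catalogue export would group variants such as "Penguin" and " penguin " under one key, which is the kind of inconsistency that faceting is meant to expose.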

Conclusion
Our problems are solvable!
– Regular expressions
– Decent text editors for general/unformatted text
– Google Refine for tabular data
Contact me
– Please feel free to contact me with questions, corrections or ideas
– Google+: