Lesson 1: Introduction to Trifacta Wrangler

Lesson 1: Introduction to Trifacta Wrangler Chapter 1C – Trifacta Interface Navigation

Lesson 1 – Chapter 1C Chapter 1C – TRIFACTA INTERFACE NAVIGATION In this chapter, you will learn important Trifacta terminology: Transformer, Recipe, Data Quality Bar, Histogram, Data Type, Column Menus, Column Browser Panel, and Column Details. A datasource is a reference to a set of data that has been imported into the system. The source is not modified within the application, so a datasource can be used in multiple datasets. It is important to note that when you use Trifacta to wrangle a source file, the original file is not modified; it can therefore be used over and over, for example to prepare output in multiple ways. Datasources are created on the Datasources page, or when a new dataset is created. There are two ways to add a datasource to your Trifacta instance: you can locate and select a file in HDFS (the Hadoop Distributed File System) using the file browser, or you can upload a local file from your machine. Note that there is a 1 GB file size limit for local files. Several file formats are supported: CSV, LOG, JSON, Avro, and Excel. Note that if you upload an Excel file with multiple worksheets, each worksheet is imported as a separate source.
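
Trifacta handles ingestion through its UI, so there is no code to write here. Purely as a conceptual illustration, the following Python sketch encodes the two upload constraints described above (the 1 GB local-file limit and the supported formats). The function name and extension list are hypothetical, not part of any Trifacta API.

```python
from pathlib import Path

# Formats the lesson lists as supported; Excel workbooks are split
# into one source per worksheet on import.
SUPPORTED_EXTENSIONS = {".csv", ".log", ".json", ".avro", ".xls", ".xlsx"}
MAX_LOCAL_FILE_BYTES = 1 * 1024**3  # 1 GB limit for local uploads

def check_local_upload(path: str) -> None:
    """Pre-flight check mirroring the local-upload rules above."""
    file = Path(path)
    if file.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported format: {file.suffix or path}")
    if file.stat().st_size > MAX_LOCAL_FILE_BYTES:
        raise ValueError("Local files are limited to 1 GB")

# check_local_upload("orders.csv")  # raises if either rule is broken
```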

Transformer Identifies the data that you need to wrangle. Build recipes on samples; recipe changes are immediately applied to your sample. Preview the results before you run them against the dataset at scale.
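
The Transformer workflow (build a recipe against a sample, then run the same steps at scale) can be mimicked outside Trifacta. The sketch below is a plain-Python analogy, not Trifacta code: a "recipe" is just an ordered list of row transformations that you test on a sample and then apply to the full dataset.

```python
from typing import Callable

Row = dict
Step = Callable[[Row], Row]

# A "recipe" here is simply an ordered list of row transformations.
recipe: list[Step] = [
    lambda row: {**row, "name": row["name"].strip()},  # trim whitespace
    lambda row: {**row, "qty": int(row["qty"])},       # fix the datatype
]

def apply_recipe(rows: list[Row], steps: list[Step]) -> list[Row]:
    for step in steps:
        rows = [step(row) for row in rows]
    return rows

data = [{"name": " ann ", "qty": "3"}, {"name": "bob", "qty": "5"}]
preview = apply_recipe(data[:1], recipe)   # iterate quickly on a sample
full = apply_recipe(data, recipe)          # then run at scale
print(preview, full)
```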

Recipe Review and modify the steps of the recipe that you have already created. Click the recipe panel to expand the recipe, which is represented as a series of icons on the right side of the screen. You can navigate forward and backward in the recipe to see the progression. Within the Transformer page, you build the steps of your script against a sample of the dataset. A sample is typically a subset of the entire dataset; for smaller datasets, the sample may be the entire dataset. If you add or remove columns from your dataset, you can optionally generate a new sample for use in the Transformer page. As you build or modify your script, the results of each modification are immediately reflected in the sampled data, so you can rapidly iterate on the steps of your script within the same interface. When you work with data in the Trifacta Transformer, you are nearly always working with a sample of the data. This is not unusual in data processing; sampling is used to speed the iteration cycle. The default sample loaded into the Transformer is pulled from the beginning of the data file (starting with the first byte) until 500 KB or the end of the file is reached, whichever comes first; 500 KB is the default setting. A random sample is also collected from the source. An initial random sample is created for datasets with pre-structured data, or for which some structure can be automatically inferred (via a splitrows transform). For such projects, you will see that the random sample is available for selection. The random sample should be roughly the same size (number of rows) as the first 500 KB.
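
As a rough illustration of the two sampling strategies described above (plain Python, not Trifacta internals): the default sample reads from the first byte up to 500 KB or end of file, and the random sample draws roughly the same number of rows from the whole file.

```python
import random

SAMPLE_BYTES = 500 * 1024  # default head-sample size described above

def head_sample(path: str) -> list[str]:
    """Default sample: first byte up to 500 KB or end of file."""
    with open(path, "rb") as f:
        chunk = f.read(SAMPLE_BYTES)  # stops early at end of file
    return chunk.decode("utf-8", errors="replace").splitlines()

def random_sample(path: str, n_rows: int) -> list[str]:
    """Random sample of roughly the same number of rows."""
    with open(path, encoding="utf-8") as f:
        rows = f.read().splitlines()
    return random.sample(rows, min(n_rows, len(rows)))

# head = head_sample("orders.csv")
# rand = random_sample("orders.csv", n_rows=len(head))
```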

Data Quality Bar Displays a bar that indicates: green for values that match the column's datatype, red for values that do not match the column's datatype, and black for missing values. Clicking on the data quality bar generates suggestions that apply to valid, invalid, or missing records.
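
The three bands of the data quality bar boil down to simple per-column counts. Here is a minimal Python sketch of that idea, with an is_valid predicate standing in for Trifacta's datatype validation; this is an illustration, not how Trifacta computes it internally.

```python
def quality_counts(values, is_valid):
    """Return (valid, invalid, missing) counts, as in the quality bar."""
    valid = invalid = missing = 0
    for v in values:
        if v is None or v == "":
            missing += 1   # black: missing values
        elif is_valid(v):
            valid += 1     # green: matches the column's datatype
        else:
            invalid += 1   # red: does not match the datatype
    return valid, invalid, missing

ages = ["34", "52", "", "abc", None, "19"]
print(quality_counts(ages, lambda v: str(v).isdigit()))  # (3, 1, 2)
```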

Histogram Displays a summary visualization of the data. Hover over the histogram to see values, and select one or more values in the histogram to prompt suggestion cards. Before you can start wrangling, or transforming, data, you must create a dataset in Trifacta. You cannot wrangle data directly in a datasource! The dataset is your fundamental area of work within the Trifacta web application. Datasets are created through the Workspace page. When you create the dataset, you will assign a name and description. A dataset is the marriage of one or more datasources and a Wrangle script. Note that you will select a single source when you create a dataset, but additional sources can be added, using the Join or Union tools, for example. A Wrangle script is a set of transformations to be executed against the datasource. We'll talk more about the Wrangle language and generating scripts in a few minutes. A Trifacta Web Application dataset includes a reference to a datasource, a script, and any number of executions using the script to transform the data in the datasource (see the Create Dataset page). A dataset may be associated with a project; a project is a container for holding one or more datasets and their associated results (see the Workspace page). A script identifies the sequential set of steps that you define to cleanse and transform your data. A script is specific to a dataset. Scripts are written in Wrangle, a domain-specific language for data transformation. Scripts are interpreted by the Trifacta platform and turned into commands that can be executed against the data. Scripts are created using the various visual tools in the Transformer page.
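
The relationships in these notes (a datasource is referenced by a dataset, a dataset owns a script, and a project contains datasets) can be summarized as a toy data model. The classes below are a conceptual Python sketch, not Trifacta's actual object model, and the sample Wrangle step is illustrative only.

```python
from dataclasses import dataclass, field

@dataclass
class Datasource:
    path: str  # reference to imported data; never modified in place

@dataclass
class Dataset:
    name: str
    sources: list          # one source at creation; more via Join/Union
    script: list = field(default_factory=list)  # ordered Wrangle steps

@dataclass
class Project:
    name: str
    datasets: list = field(default_factory=list)  # container of datasets

orders = Dataset("orders", sources=[Datasource("hdfs:///data/orders.csv")])
orders.script.append("settype col: qty type: 'Integer'")  # illustrative step
sales = Project("sales", datasets=[orders])
print(sales)
```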

Data Type To the left of the column header is the column's datatype. Trifacta infers the column's datatype.
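
Trifacta infers each column's datatype automatically. A crude Python approximation of the idea (try candidate types over the sampled values, most specific first) might look like the following; Trifacta's real inference is considerably richer.

```python
def infer_type(values: list[str]) -> str:
    """Guess a column type from sample values, most specific first."""
    non_missing = [v for v in values if v not in ("", None)]
    if not non_missing:
        return "String"
    for name, parse in (("Integer", int), ("Decimal", float)):
        try:
            for v in non_missing:
                parse(v)
            return name
        except ValueError:
            continue
    return "String"

print(infer_type(["3", "7", ""]))  # Integer
print(infer_type(["3.5", "7"]))    # Decimal
print(infer_type(["abc", "7"]))    # String
```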

Column Menus To the right of the column header is the column menu. These menus allow for basic operations and also contextual transformations based on the column's datatype.

Column Browser Panel Open the Column Browser by clicking Columns in the toolbar of the Transformer page. Select one or more columns and perform actions on them, or toggle the display of individual columns in the Trifacta application.

Column Details Provides a high-level overview of the column's data, including: data quality, summary statistics, histograms, and frequency charts.
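
The figures in the Column Details panel are the kind that can be computed per column in a few lines. As a rough Python sketch of such a summary (not Trifacta code):

```python
from collections import Counter
from statistics import mean, median

def column_details(values: list[float]) -> dict:
    """Toy version of a column-details summary: stats plus frequencies."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "frequencies": Counter(values).most_common(3),  # frequency chart
    }

print(column_details([1, 2, 2, 3, 3, 3, 10]))
```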