Lesson 1 – Chapter 1B: Terminology


Lesson 1 – Chapter 1B: Terminology. In this lesson, you will learn important Trifacta terminology: Data Source, Dataset, Wrangle Script, Sample, and Job/Results.

A datasource is a reference to a set of data that has been imported into the system. The source itself is never modified within the application, so it can be used in multiple datasets. In other words, when you use Trifacta to wrangle a source file, the original file is untouched; it can therefore be reused over and over, for example to prepare output in multiple ways. Datasources are created on the Datasources page, or when a new dataset is created. You can add a datasource to your Trifacta instance in several ways: for example, you can use the file browser to locate and select a file in HDFS (the Hadoop Distributed File System), or you can upload a local file from your machine. Note that there is a 1 GB size limit for local files. Trifacta. Confidential & Proprietary.

Terminology: Datasource

A datasource is:
A reference to a set of data that has been imported into the system
Never modified within the application
Usable in multiple datasets

Datasources can be added via:
Selecting a file in HDFS (Hadoop Distributed File System)
Selecting a table in Hive
Uploading a local file (1 GB size limit)
From Job Results

Note: Trifacta Wrangler supports local files only.
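The two key properties above (a datasource is never modified, and it can feed multiple datasets) can be sketched as a toy model in Python. The class and names here are illustrative only, not Trifacta's actual API:

```python
# Toy model of the datasource/dataset relationship: the source data is
# frozen at import time, and every dataset works from a copy of it.

class Datasource:
    """Immutable reference to imported data; wrangling never changes it."""

    def __init__(self, name, rows):
        self.name = name
        self._rows = tuple(tuple(r) for r in rows)  # freeze the data

    @property
    def rows(self):
        return [list(r) for r in self._rows]  # hand out copies only


source = Datasource("sales.csv", [["2019-01-01", "100"], ["2019-01-02", "250"]])

# Two datasets can wrangle the same source independently:
dataset_a = [row + ["region-A"] for row in source.rows]  # add a column
dataset_b = [row[:1] for row in source.rows]             # keep dates only

print(source.rows)  # the original data is untouched
```

Because `rows` returns copies, no transformation applied in a dataset can reach back and alter the imported source, which mirrors the guarantee described above.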

Datasource: Supported Formats

The following file formats are supported: CSV, LOG, JSON, AVRO, GZIP/BZIP, XLS/XLSX, TXT, XML.

Note that if you upload an Excel file with multiple worksheets, each worksheet is imported as a separate datasource.
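The Excel behavior above, one datasource per worksheet, can be sketched like this. This is a toy model: the workbook is represented as a plain dict standing in for a parsed .xlsx file, and all names are made up for illustration:

```python
def split_workbook(workbook: dict) -> dict:
    """Treat each worksheet in a workbook as its own datasource.

    `workbook` maps sheet names to row data, standing in for a parsed
    Excel file; a real import would go through a spreadsheet library.
    """
    return {f"{sheet}.source": rows for sheet, rows in workbook.items()}


workbook = {
    "Q1": [["jan", 10], ["feb", 12]],
    "Q2": [["apr", 9], ["may", 14]],
}
sources = split_workbook(workbook)
print(sorted(sources))  # ['Q1.source', 'Q2.source'] -- two separate sources
```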

Terminology: Datasets

A Dataset must be created before data can be transformed. A Dataset includes a reference to:
One or more Datasources
A Wrangle Script (the sequential set of steps you define to cleanse and transform your data)
Jobs (any number of executions that use the script to transform the data in the datasource)

Before you can start wrangling, or transforming, data, you must create a Dataset in Trifacta; you cannot wrangle data directly in a datasource. The dataset is your fundamental unit of work within the Trifacta web application. Datasets are created through the Workspace page, where you assign each one a name and description. A dataset may also be associated with a project, which is a container for one or more datasets and their associated results.

A Dataset is the marriage of one or more datasources and a Wrangle script. You select a single source when you create a dataset, but additional sources can be added later, using the Join or Union tools, for example.

A script identifies the sequential set of steps you define to cleanse and transform your data, and each script is specific to a dataset. Scripts are written in Wrangle, a domain-specific language for data transformation, and are created using the visual tools on the Transformer page. The Trifacta platform interprets a script and turns it into commands that can be executed against the data.
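The idea of a script as an ordered list of transform steps, replayed in sequence against the data, can be sketched in plain Python. The step names here are invented for illustration; Wrangle has its own syntax:

```python
# Each "step" takes a table (a list of rows) and returns a new table,
# mirroring how a Wrangle script is a sequential set of transforms.

def drop_empty_rows():
    return lambda rows: [r for r in rows if any(cell != "" for cell in r)]

def uppercase_col(index):
    return lambda rows: [
        r[:index] + [r[index].upper()] + r[index + 1:] for r in rows
    ]

script = [drop_empty_rows(), uppercase_col(0)]  # the "wrangle script"

data = [["ca", "100"], ["", ""], ["ny", "250"]]
for step in script:          # replay the steps in order
    data = step(data)

print(data)  # [['CA', '100'], ['NY', '250']]
```

Because the script is just an ordered recipe, the same steps can be replayed later against the full datasource, which is exactly what running a job does.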

Terminology: Sample

Data in the Transformer is a sample, not the entire source (except for small files under 500 KB). The sample can be:
The first 500 KB of the source (default)
A random sample
A new random sample

Within the Transformer page, you build the steps of your script against a sample of the dataset. A sample is typically a subset of the entire dataset; for smaller datasets, the sample may be the entire dataset. If you add or remove columns from your dataset, you can optionally generate a new sample for use in the Transformer page. As you build or modify your script, the results of each change are immediately reflected in the sampled data, so you can rapidly iterate on the steps of your script within the same interface.

When you work with data in the Trifacta Transformer, you are nearly always working with a sample of the data. This is not unusual in data processing: sampling is used to speed the iteration cycle. The default sample loaded into the Transformer is pulled from the beginning of the data file (starting with the first byte) until 500 KB or the end of the file, whichever comes first. A random sample is also collected from the source: an initial random sample is created for datasets with pre-structured data, or for which some structure can be automatically inferred (via a splitrows transform). For such projects, the random sample is available for selection, and it should contain roughly the same number of rows as the first 500 KB.
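The two sampling strategies described above, a head sample taken from the start of the file versus a random sample of comparable size, can be sketched as follows. The byte budget is shrunk for the demo (the real tool uses 500 KB), and the helper names are invented:

```python
import random

def head_sample(lines, byte_budget):
    """Default sample: take lines from the start until the budget is spent."""
    sample, used = [], 0
    for line in lines:
        used += len(line.encode("utf-8"))
        if used > byte_budget and sample:
            break
        sample.append(line)
    return sample

def random_sample(lines, n, seed=0):
    """Random sample of roughly the same number of rows."""
    rng = random.Random(seed)  # fixed seed so the demo is repeatable
    return rng.sample(lines, min(n, len(lines)))


lines = [f"row-{i}" for i in range(100)]
head = head_sample(lines, byte_budget=30)    # tiny budget for the demo
rand = random_sample(lines, n=len(head))     # same row count, random rows
print(len(head) == len(rand))  # True: comparable sample sizes
```

The head sample is cheap (one sequential read) but can miss patterns that only appear later in the file, which is why a random sample of roughly the same size is also offered.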

Terminology: Jobs

A Job is created when you run a Wrangle script "at scale", that is, on the entire dataset rather than the sample. Jobs can be run:
On the Trifacta Server (recommended for datasets smaller than 100 MB)
In Hadoop (recommended for datasets larger than 100 MB)

From the Job Results page, you can:
View and analyze the job results (using column data quality bars and histograms)
Add sample rows back to the Transformer (if sample rows are available)
Download the job results (in CSV, JSON, or Tableau TDE format)

When you are satisfied with the script you have built on the Transformer page, you can execute a job, which applies the scripted steps to the entire dataset. Jobs are queued for execution by the platform. Each time you run a job at scale, the results are stored and can be accessed from the Dataset details page or from the Jobs tab. Once the job completes, you can view the Job Results; the same data analysis tools available in the Transformer, such as column histograms and data quality bars, are available there. It is a good idea to analyze the job results carefully: because you build your script against a sample, it is not unusual for the job results to reveal anomalies that were not present in the sample. If you identify such anomalies, you can select just those rows and add them back to the Transformer to refine your script further. Finally, you can download the results in CSV, JSON, or Tableau TDE format.
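Running "at scale" simply means replaying the same scripted steps over the full dataset instead of the sample, then persisting the results. A sketch, where the export helpers are illustrative stand-ins rather than Trifacta's actual output machinery:

```python
import csv
import io
import json

def run_job(script, full_dataset):
    """Apply every scripted step to the entire dataset, not just a sample."""
    data = full_dataset
    for step in script:
        data = step(data)
    return data

def export_csv(rows):
    """Serialize results as CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def export_json(rows):
    """Serialize results as JSON text."""
    return json.dumps(rows)


# A one-step "script" that drops rows with an empty second column:
script = [lambda rows: [r for r in rows if r[1] != ""]]
results = run_job(script, [["a", "1"], ["b", ""], ["c", "3"]])
print(export_json(results))  # [["a", "1"], ["c", "3"]]
```

Keeping the script separate from the data is what makes this workflow possible: the recipe built interactively on the sample is the same object handed to the batch job.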