Data Analytics using MATLAB and HDF5

Presentation transcript:

Data Analytics using MATLAB and HDF5
Ellen Johnson, Senior Team Lead, MATLAB Toolbox I/O, MathWorks

Overview
  MATLAB support for scientific data
  Big data and data analytics workflows
  Functions and datatypes for data analytics
  Example: FileDatastore for HDF5 data

MATLAB Support for Scientific Data
  Scientific data formats: HDF5, HDF4, HDF-EOS2; NetCDF (with OPeNDAP!); FITS, CDF, BIL, BIP, BSQ
  Image file formats: TIFF, JPEG, HDR, PNG, JPEG2000, and more
  Vector data file formats: ESRI Shapefiles, KML, GPS, and more
  Raster data file formats: GeoTIFF, NITF, USGS and SDTS DEM, NIMA DTED, and more
  Web Map Service (WMS)

MATLAB Support for HDF5
  High-level interface (h5read, h5write, h5disp, h5info):
      h5disp('example.h5','/g4/lat');
      data = h5read('example.h5','/g4/lat');
  Low-level interface (wraps the HDF5 C APIs):
      fid = H5F.open('example.h5');
      dset_id = H5D.open(fid,'/g4/lat');
      data = H5D.read(dset_id);
      H5D.close(dset_id);
      H5F.close(fid);
  Notes: h5disp corresponds to the h5dump command-line tool. Errors can be handled with try/catch. You don't have to recompile your code to experiment with the low-level interface; you can run code as you type it.
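The two interfaces can be compared side by side on a small file. The following is a minimal sketch, not taken from the slides: the temporary file and the sample latitude values are assumptions made so that the code runs without 'example.h5', and the try/catch only illustrates the error-handling note above.

```matlab
% Create a small HDF5 file in a temporary location (hypothetical data).
fname = [tempname '.h5'];
lat = linspace(-90, 90, 19)';
h5create(fname, '/g4/lat', size(lat));
h5write(fname, '/g4/lat', lat);

% High-level interface: one call reads the whole dataset.
hiData = h5read(fname, '/g4/lat');

% Low-level interface: open, read, and close identifiers explicitly.
% The try/catch makes sure the file is closed if a call fails.
fid = H5F.open(fname, 'H5F_ACC_RDONLY', 'H5P_DEFAULT');
try
    dset_id = H5D.open(fid, '/g4/lat');
    loData = H5D.read(dset_id);
    H5D.close(dset_id);
catch err
    H5F.close(fid);
    rethrow(err);
end
H5F.close(fid);

% Both interfaces should return the same values.
sameData = isequal(hiData, loData);
delete(fname);
```

The high-level call is usually enough for reading whole datasets; the low-level calls become useful when finer control over the HDF5 library is needed.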

MATLAB Support for netCDF, including OPeNDAP
  High-level interface (ncdisp, ncread, ncwrite, ncinfo):
      url = 'http://oceanwatch.pifsc.noaa.gov/thredds/dodsC/goes-poes/2day';
      ncdisp(url);
      data = ncread(url,'sst');
  Low-level interface (wraps the netCDF C APIs):
      ncid = netcdf.open(url);
      varid = netcdf.inqVarID(ncid,'sst');
      data = netcdf.getVar(ncid,varid,'double');
      netcdf.close(ncid);
  Note: ncdisp corresponds to the ncdump command-line tool.
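The same pattern can be tried offline against a local file. This is a minimal sketch under stated assumptions: the temporary file, the dimension names 'x' and 'y', and the sample 'sst' values are invented for illustration and do not come from the OPeNDAP server above.

```matlab
% Write a small netCDF file in a temporary location (hypothetical data).
fname = [tempname '.nc'];
sst = [20.5 21.0; 19.8 22.3];
nccreate(fname, 'sst', 'Dimensions', {'x', 2, 'y', 2});
ncwrite(fname, 'sst', sst);

% High-level interface.
hiData = ncread(fname, 'sst');

% Low-level interface: the same read through the netCDF C API wrappers.
ncid = netcdf.open(fname, 'NOWRITE');
varid = netcdf.inqVarID(ncid, 'sst');
loData = netcdf.getVar(ncid, varid, 'double');
netcdf.close(ncid);

sameData = isequal(hiData, loData);
delete(fname);
```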

Big Data and Data Analytics: Why MATLAB?
  1. MATLAB analytics work with business, scientific, and engineering data
       Data: engineering, scientific, and field data; business and transactional data
  2. MATLAB lets domain experts do data science themselves
  3. MATLAB analytics run in embedded systems developed with Model-Based Design
  4. MATLAB analytics deploy to enterprise IT systems

Big Data Workflows in MATLAB
  ACCESS: Access data and collections of files that do not fit in memory
    Datastores: Images, Spreadsheets, SQL, Hadoop (HDFS), Tabular Text, Custom Files
  PROCESS AND ANALYZE: Purpose-built capabilities for domain experts to work with big data locally
    Tall Arrays: Math, Statistics, Visualization, Machine Learning
    GPU Arrays: Matrix Math
    Deep Learning: Image Classification
    Image Processing
  SCALE: Scale to compute clusters and Hadoop/Spark for data stored in HDFS
    Tall Arrays: Math, Stats, Machine Learning on Spark
    Distributed Arrays: Matrix Math on Compute Clusters
    MDCS for EC2: Cloud-based Compute Cluster
    MapReduce and MATLAB API for Spark: Lower-level control

Data Analytics Workflows in MATLAB
  Access and Explore Data: Files, Databases, Sensors
  Preprocess Data: Working with Messy Data, Data Reduction/Transformation, Feature Extraction
  Develop Predictive Models: Model Creation (e.g. Machine Learning), Parameter Optimization, Model Validation
  Integrate Analytics with Systems: Desktop Apps, Enterprise Scale Systems, Embedded Devices and Hardware

Today's Focus: Accessing, Exploring, Preprocessing Data
  Access and Explore Data: Files, Databases, Sensors
  Preprocess Data: Working with Messy Data, Data Reduction/Transformation, Feature Extraction
  Business and Transactional Data
    Repositories: SQL, NoSQL, etc.
    File I/O: Text, Spreadsheet, etc.
    Web Sources: RESTful, JSON, etc.
  Engineering, Scientific and Field Data
    Real-Time Sources: Sensors, GPS, etc.
    File I/O: Image, Scientific Data Formats, Video, Audio, etc.
    Communication Protocols: OPC (OLE for Process Control), CAN (Controller Area Network), etc.

What is a datastore?
  An object representing a collection of data
  [Diagram: a datastore used from serial MATLAB, Parallel Computing Toolbox (PCT) local workers, MDCS, and MATLAB Compiler]

Access Big Data through datastore
  Datastore: easily access large sets of data
    Object designed for accessing data
    Preview data structure and format
    Variety of types for different data sources: TabularText Datastore, Spreadsheet Datastore, Database Datastore, KeyValue Datastore, File Datastore, Image Datastore
    Incrementally read portions of the data
    Use with Parallel Computing tools
  Notes: Datastore provides a straightforward way to access big data that consists of one text file or a collection of text files.
    Step through files a chunk at a time
    Use wildcards to specify all the files in a given directory
    Identify columns to import using column names
    Specify the format for each column of interest
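The points in the notes (wildcards, previewing, selecting columns by name, and setting a format per column) can be sketched as follows. The folder, file names, and table contents are hypothetical and exist only to make the example self-contained.

```matlab
% Write two small CSV files to a temporary folder (hypothetical data).
folder = tempname;
mkdir(folder);
T1 = table({'AA'; 'UA'}, [10; 20], [300; 450], ...
    'VariableNames', {'Carrier', 'DepartureDelay', 'Distance'});
T2 = table({'DL'}, 5, 120, ...
    'VariableNames', {'Carrier', 'DepartureDelay', 'Distance'});
writetable(T1, fullfile(folder, 'part1.csv'));
writetable(T2, fullfile(folder, 'part2.csv'));

% A wildcard picks up every CSV file in the folder.
ds = tabularTextDatastore(fullfile(folder, '*.csv'));
allNames = ds.VariableNames;

% Import only the columns of interest, with an explicit format for each.
ds.SelectedVariableNames = {'Carrier', 'DepartureDelay'};
ds.SelectedFormats = {'%q', '%f'};

% preview shows the first rows without moving the read position.
p = preview(ds);
selectedNames = p.Properties.VariableNames;
nFiles = numel(ds.Files);
```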

When to Use datastore
  Data characteristics: data stored in files supported by datastore
  Compute platform: desktop or cluster
  Analysis characteristics:
    Supports Load, Analyze, Discard workflows
    Incrementally read chunks of data, process within a while loop

Example datastore code
    ds = tabularTextDatastore('c:\airlinedata\*.csv');
    maxDelay = 0;
    while hasdata(ds)
        data = read(ds);
        chunkmax = max(data.DepartureDelay);
        maxDelay = max(maxDelay,chunkmax);
    end

    % or use tall!
    t = tall(ds);
    maxDelay = gather(max(t.DepartureDelay));
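A version of this loop that runs without the airline data might look like the sketch below. The CSV files and delay values are made up, ReadSize is set small only to force several chunks, and the running maximum starts at -Inf (an assumption that differs from the slide's 0) so that an all-negative column would still give the right answer.

```matlab
mapreducer(0);   % evaluate tall arrays in the local MATLAB session

% Write two small CSV files to a temporary folder (hypothetical data).
folder = tempname;
mkdir(folder);
delaysA = [12; -3; 40];
delaysB = [7; 95; 0];
writetable(table(delaysA, 'VariableNames', {'DepartureDelay'}), fullfile(folder, 'a.csv'));
writetable(table(delaysB, 'VariableNames', {'DepartureDelay'}), fullfile(folder, 'b.csv'));

ds = tabularTextDatastore(fullfile(folder, '*.csv'));
ds.ReadSize = 2;   % read at most two rows per call to force several chunks

% Load, analyze, discard: keep only the running maximum in memory.
maxDelay = -Inf;
while hasdata(ds)
    data = read(ds);
    maxDelay = max(maxDelay, max(data.DepartureDelay));
end

% The same computation expressed with a tall array.
reset(ds);
t = tall(ds);
maxDelayTall = gather(max(t.DepartureDelay));
```

Both approaches should agree; the tall version leaves the chunking to MATLAB.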

Datastores: the Key to Tall Arrays
  Data sources: Custom, Databases, Images, …
      ds = datastore(…)
      T = tall(ds)
      ds = datastore('s3://…',…)

"Tall" data types and functions for use with out-of-memory data
  What are Tall Arrays?
    tall data type introduced in …
    Ideal for tabular/columnar data
    One or more rows can fit into memory
    Overall data size is too big to fit into memory
  Access Data: Text, Spreadsheet (Excel), Database (SQL), Images, Custom Reader, Simulink
  Tall Data Types: Table, Timetable, Cell, Numeric, Dates & times, String, Categorical, Cellstr
  Preprocessing: Numeric functions, Summary statistics, String processing, Table wrangling, Missing data handling
  Visualizations: Plot, scatter, Histogram/histogram2, Kernel density plot, Bin-scatter
  Machine Learning: Linear Models, Logistic Regression, Discriminant analysis, Classification Trees, SVM, K-means, PCA, Random data sampling
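A short sketch of how tall arrays behave in practice: operations are queued and only evaluated when gather is called. The table contents and variable names below are hypothetical, and the tall table is built from a small in-memory table purely for illustration; real out-of-memory data would normally come from a datastore.

```matlab
mapreducer(0);   % evaluate tall arrays in the local MATLAB session

% A small in-memory table standing in for out-of-memory data (hypothetical).
x = [1; NaN; 3; 4];
g = categorical({'a'; 'b'; 'a'; 'b'});
t = tall(table(x, g));

% These steps are deferred: nothing is computed yet.
valid = t(~isnan(t.x), :);        % missing data handling: drop rows with NaN
avgTall = mean(valid.x);          % summary statistic, still unevaluated
countTall = sum(~isnan(t.x));     % number of valid rows, still unevaluated
isDeferred = isa(avgTall, 'tall');

% gather triggers the evaluation and returns in-memory results.
[avg, n] = gather(avgTall, countTall);
```

Gathering several results in one call lets MATLAB combine the passes over the data.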

Execution Environments for Tall Arrays
  Process out-of-memory data on your desktop to explore, analyze, gain insights, and develop analytics
    Data sources: local disk, shared folders, databases
    Use Parallel Computing Toolbox for increased performance
  Run on compute clusters, or on Spark + Hadoop (HDFS), for large-scale analysis
    MATLAB Distributed Computing Server, Spark + Hadoop

Example: Working with HDF5 data using FileDatastore
  NASA's Operation IceBridge aircraft missions
    Reference: https://nsidc.org/data/icebridge/campaign_data_summary.html
    Airborne Topographic Mapper LIDAR
    Measures changes in ice surface elevation
  Let's look at the Antarctica Larsen D Ice Sheet datasets
    Larsen D data collected on 10/18/2014 and 11/18/2016
  Create a FileDatastore with a custom file reader
  Read through the collections of files
  Gather information on the datasets

Example: Working with HDF5 data using FileDatastore
  Create a FileDatastore:
      ds = fileDatastore(h5Folder, 'ReadFcn', @h5readall);
  Scale to MapReduce:
    The map function receives chunks of data and outputs intermediate results.
    The reduce function reads the intermediate results and produces a final result.
      mapreducer(0);
      mrOutputFolder = fullfile(pwd, 'output');
      outds = mapreduce(ds, @countMap, @countReduce, 'OutputFolder', mrOutputFolder);
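The slides do not show h5readall, countMap, or countReduce, so the sketch below is not the presenter's code. It swaps in two simplifications, both assumptions: the custom reader is an anonymous function that returns h5info for each file, and a plain while loop replaces mapreduce. The HDF5 files and dataset names are generated on the spot and are hypothetical.

```matlab
% Create two small HDF5 files in a temporary folder (hypothetical data).
folder = tempname;
mkdir(folder);
for k = 1:2
    fname = fullfile(folder, sprintf('file%d.h5', k));
    h5create(fname, '/elevation', [5 1]);
    h5write(fname, '/elevation', rand(5, 1));
    h5create(fname, '/latitude', [5 1]);
    h5write(fname, '/latitude', rand(5, 1));
end

% FileDatastore with a custom reader: here the reader returns file metadata.
ds = fileDatastore(folder, 'ReadFcn', @(f) h5info(f), 'FileExtensions', '.h5');

% Step through the files one at a time and collect per-file information.
names = {};
counts = [];
sizes = [];
while hasdata(ds)
    [h5meta, fileInfo] = read(ds);
    names{end+1, 1} = fileInfo.Filename;          %#ok<SAGROW>
    counts(end+1, 1) = numel(h5meta.Datasets);    %#ok<SAGROW> root-level datasets only
    sizes(end+1, 1) = fileInfo.FileSize;          %#ok<SAGROW>
end

summaryTable = table(names, counts, sizes, ...
    'VariableNames', {'Filename', 'NumberOfDatasets', 'FileSize'});
```

For a large collection, the same per-file work would move into map and reduce functions so that it can scale beyond a single session.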

Example: Working with HDF5 data using FileDatastore
  Read and view the computed data:
      tbl = readall(outds);
      outTable = horzcat(tbl.Key, struct2table([tbl.Value{:}]));
      outTable.Properties.VariableNames{1} = 'Filename'

  >> fileDatastoreDemo
  ********************************
  *      MAPREDUCE PROGRESS      *
  Map   0% Reduce   0%
  Map  10% Reduce   0%
  Map  21% Reduce   0%
  Map  31% Reduce   0%
  Map  42% Reduce   0%
  Map  53% Reduce   0%
  Map  63% Reduce   0%

  Filename                                                                                                                                NumberOfDatasets    FileSize      ErrorDatasets
  '\\mathworks\home\ellenj\iceSheet\n5eil01u.ecs.nsidc.org\ICEBRIDGE\ILATM1B.002\2016.11.18\h5Files\ILATM1B_20161118_162307.ATM6AT6.h5'    19                  1.3913e+07    0
  '\\mathworks\home\ellenj\iceSheet\n5eil01u.ecs.nsidc.org\ICEBRIDGE\ILATM1B.002\2016.11.18\h5Files\ILATM1B_20161118_162801.ATM6AT6.h5'    19                  1.5699e+07    0
  '\\mathworks\home\ellenj\iceSheet\n5eil01u.ecs.nsidc.org\ICEBRIDGE\ILATM1B.002\2016.11.18\h5Files\ILATM1B_20161118_163343.ATM6AT6.h5'    19                  1.6593e+07    0
  '\\mathworks\home\ellenj\iceSheet\n5eil01u.ecs.nsidc.org\ICEBRIDGE\ILATM1B.002\2016.11.18\h5Files\ILATM1B_20161118_163935.ATM6AT6.h5'    19                  1.4693e+07    0
  '\\mathworks\home\ellenj\iceSheet\n5eil01u.ecs.nsidc.org\ICEBRIDGE\ILATM1B.002\2016.11.18\h5Files\ILATM1B_20161118_164516.ATM6AT6.h5'    19                  1.5862e+07    0
  '\\mathworks\home\ellenj\iceSheet\n5eil01u.ecs.nsidc.org\ICEBRIDGE\ILATM1B.002\2016.11.18\h5Files\ILATM1B_20161118_165055.ATM6AT6.h5'    19                  1.6317e+07    0
  '\\mathworks\home\ellenj\iceSheet\n5eil01u.ecs.nsidc.org\ICEBRIDGE\ILATM1B.002\2016.11.18\h5Files\ILATM1B_20161118_165637.ATM6AT6.h5'    19                  1.6681e+07    0
  '\\mathworks\home\ellenj\iceSheet\n5eil01u.ecs.nsidc.org\ICEBRIDGE\ILATM1B.002\2016.11.18\h5Files\ILATM1B_20161118_170223.ATM6AT6.h5'    19                  1.6438e+07    0
  '\\mathworks\home\ellenj\iceSheet\n5eil01u.ecs.nsidc.org\ICEBRIDGE\ILATM1B.002\2016.11.18\h5Files\ILATM1B_20161118_170810.ATM6AT6.h5'    19                  1.6231e+07    0
  '\\mathworks\home\ellenj\iceSheet\n5eil01u.ecs.nsidc.org\ICEBRIDGE\ILATM1B.002\2016.11.18\h5Files\ILATM1B_20161118_171357.ATM6AT6.h5'    19                  1.6502e+07    0

Saving Preprocessed/Intermediate Data: MAT-Files
  Saving preprocessed or intermediate results: in MATLAB, many people use .mat files for this
  MAT-files are binary MATLAB files that store workspace variables
  Version 7.3 MAT-files are based on the HDF5 file format!
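One way to see the HDF5 connection is to save a variable with the -v7.3 flag and then open the resulting file with the HDF5 functions. This is a minimal sketch; the variable name 'elevation' and its random contents are assumptions for illustration.

```matlab
% Save a workspace variable to a version 7.3 MAT-file (hypothetical data).
fname = [tempname '.mat'];
elevation = rand(100, 1);
save(fname, 'elevation', '-v7.3');

% The HDF5 library recognizes the file, and the variable appears as a dataset.
isHdf5 = H5F.is_hdf5(fname);
fileMeta = h5info(fname);
datasetNames = {fileMeta.Datasets.Name};

% Reading the dataset through the HDF5 interface returns the saved values.
back = h5read(fname, '/elevation');
sameValues = isequal(back(:), elevation(:));
delete(fname);
```

The load and matfile functions remain the usual way to read MAT-files; the HDF5 view is mainly useful for inspection or for reading the data from other HDF5-aware tools.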

Thank you! Questions?