September 11, Ian R Brooks Ph.D.

Slides:



Advertisements
Similar presentations
Data in R. General form of data ID numberSexWeightLengthDiseased… 112m … 256f3.61 NA1… 3……………… 4……………… n91m5.1711… NOTE: A DATASET IS NOT A MATRIX!
Advertisements

Management Information Systems, Sixth Edition
BigData Tools Seyyed mohammad Razavi. Outline  Introduction  Hbase  Cassandra  Spark  Acumulo  Blur  MongoDB  Hive  Giraph  Pig.
Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why.
United Nations Economic Commission for Europe Statistical Division NTTS 2015 – Satellite Workshop on Big Data March 9, 2015 The Big Data Project – The.
Copyright © 2006 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill Technology Education Copyright © 2006 by The McGraw-Hill Companies,
“This presentation is for informational purposes only and may not be incorporated into a contract or agreement.”
` tuplejump The data engineering platform. A startup with a vision to simplify data engineering and empower the next generation of data powered miracles!
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
USING MULTIPLE PERSISTENCE LAYERS IN SPARK TO BUILD A SCALABLE PREDICTION ENGINE Richard Williamson
Institute for Personal Robots in Education (IPRE)‏ CSC 170 Computing: Science and Creativity.
INNOV-10 Progress® Event Engine™ Technical Overview Prashant Thumma Principal Software Engineer.
Big Data Analytics Platforms. Our Team NameApplication Viborov MichaelApache Spark Bordeynik YanivApache Storm Abu Jabal FerasHPCC Oun JosephGoogle BigQuery.
Andy Roberts Data Architect
Dato Confidential 1 Danny Bickson Co-Founder. Dato Confidential 2 Successful apps in 2015 must be intelligent Machine learning key to next-gen apps Recommenders.
AZURE MACHINE LEARNING Bringing New Value To Old Data SQL Saturday #
CPSC8985 FA 2015 Team C3 DATA MIGRATION FROM RDBMS TO HADOOP By Naga Sruthi Tiyyagura Monika RallabandiRadhakrishna Nalluri.
A Suite of Products that allow you to Predict Outcomes, Prescribe Actions and Automate Decisions.
Introduction to Mongo DB(NO SQL data Base)
Big Data Analytics and HPC Platforms
Platform as a Service (PaaS)
Connected Infrastructure
Data Platform and Analytics Foundational Training
Big Data is a Big Deal!.
Hadoop and Spark Dynamic Data Models Amila Kottege Software Developer
Platform as a Service (PaaS)
SAS users meeting in Halifax
Big Data Enterprise Patterns
Sushant Ahuja, Cassio Cristovao, Sameep Mohta
Introduction to Spark Streaming for Real Time data analysis
Hadoop and Analytics at CERN IT
Metis Data Science Meetup:
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
The importance of being Connected
Sqoop Mr. Sriram
Big-Data Fundamentals
Connected Infrastructure
Building Analytics At Scale With USQL and C#
A Big Data Cheat Sheet: The Big Pharma Edition
Open Source on .NET A real world use case.
SQOOP.
ICT Database Lesson 1 What is a Database?.
SNOW ONLINE TRAINING IN HYDERABAD
Azure Machine Learning & ML Studio
Operationalize your data lake Accelerate business insight
Designed for Big Data Visual Analytics, Zoomdata Allows Business Users to Quickly Connect, Stream, and Visualize Data in the Microsoft Azure Platform MICROSOFT.
Big Data - in Performance Engineering
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Design engineer deliver.
Ch 4. The Evolution of Analytic Scalability
Data science and machine learning at scale, powered by Jupyter
Overview of big data tools
Introduction to SAP HANA
Big Data Young Lee BUS 550.
Databricks: the new kid on the block
Project Goals Collect and permanently store the data flowing around ONAP system into several Big Data storages, each in different category. Also serve.
relational thoughts on NoSql
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
Agenda Need of Cloud Computing What is Cloud Computing
ETL Patterns in the Cloud with Azure Data Factory
IBM C IBM Big Data Engineer. You want to train yourself to do better in exam or you want to test your preparation in either situation Dumpspedia’s.
The Student’s Guide to Apache Spark
UCLA Health Data Analytics Strategy
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
UNIT 6 RECENT TRENDS.
Continuous Integration and Delivery (CI/CD) in Azure Data Factory
SQL Server 2019 Bringing Apache Spark to SQL Server
The “perfect” data platform
Pig Hive HBase Zookeeper
Presentation transcript:

September 11, 2018 - Ian R Brooks Ph.D. A New Hope: Improving Patient Screening By Applying Predictive Analytics To Electronic Medical Records

Who Is The Data Jedi? Completed MS in Computer Science in 2008. Software Engineer at Flextronics Medical and Peterbilt Motor Company. Completed Ph.D. in Computer Science in 2015. Involved in Big Data and Hadoop since 2015. Joined Hortonworks in 2016 DS SME Lead 2018 Who Is The Data Jedi?

Why Are We Here Today? Commons Problems for IT Large volume of EMR Data that pushes limits of modern architecture Variety of data sources from traditional sources and IoT data sources make it difficult manage Why Are We Here Today?

Commons Problems For Data Science and Engineering Teams All data of their organization’s isn’t available to them Models are easy to build difficult to share leaving them in isolation Deploying pipelines models into their IT infrastructure is easy Accessing and storing model results Why Else Are We Here?

Electronic Medical Records The dataset used for the demo are from EMR Bots at http://www.emrbots.org/ Synthetic EMR data that can used to without worrying about patient privacy concerns Table 1 – Patient information Table 2 – Patient admission Table 3 – Lab results Table 4 – Patient Diagnosis Electronic Medical Records

End To End Work Flow Load data files & explore dataset Sessionalize patient data Feature engineering Pipeline model building Model deployment End To End Work Flow

Load Files and Explore Dataset Load data files into Apache Spark and build a Spark Dataframe, which will include rows and columns Dataframes can be saved as temporary tables and queried used Spark SQL Load Files and Explore Dataset

Sessionalize Patient Data Find all patients ID with and without target condition. Sessonalize all related patient data to patient IDs Patient information Lab results Diagnosis records Merge the two data sets Sessionalize Patient Data

Feature Engineering String Index the categorical variables Select features to send to model Build feature vector Feature Engineering

Pipeline Model Building Apache Spark provides an API to build pipeline models Pipeline models encapsulate all of the transformation steps that are required to build the model’s features Once the model is exported, predictions can be made on raw inputs Pipeline Model Building

Model Deployment Once model has been built, it can be deployed Model deployment options often provide an RESTful interface The working interface can be embedded into an Apache NiFi flow Model predictions can now be automated into a work flow Model Deployment

Reference Architecture

Divisions Of Roles

Final Thoughts Highly scalable storage platform Flexible storage platform for both structured and unstructured datasets Centralized data repository Tools and libraries to build pipeline models Model deployment into an existing IT infrastructure Final Thoughts

No Questions, Only Answers Thanks for coming Y’all!