Introduction to Data Science Section 1 Data Matters 2015 Sponsored by the Odum Institute, RENCI, and NCDS Thomas M. Carsey 1.

Presentation transcript:


Course Materials
I used many sources in preparing for this course:
– Practical Data Science with R, by Zumel and Mount
– Data Mining with R: Learning with Case Studies, by Torgo
– An Introduction to Data Science, Version 3, by Stanton
– Monte Carlo Simulation and Resampling Methods for Social Science, by Carsey and Harden
– Machine Learning with R, by Lantz

Additional Materials
– A Simple Introduction to Data Science, by Burlingame and Nielsen
– Ethics of Big Data, by Davis
– Privacy and Big Data, by Craig and Ludloff
– Doing Data Science: Straight Talk from the Frontline, by O’Neil and Schutt

Learning R
Lots of places to learn more about R:
– All of the sources on the Course Materials slide have R code available
– Comprehensive R Archive Network (CRAN)
– Springer's Use R! textbook series
– The online search tool Rseek
– The RStudio site
– The Odum Institute’s online course

What is Data Science?

– What words come to mind when you think of Data Science?
– What experience do you have with Data Science?
– Why are you taking an Introduction to Data Science class?

The Data Science Revolution
– Data science is exploding in importance and in the attention it receives.
– It’s hard to sort the substance from the hype.
– There is real value in data science, but you should have a purpose or goal in mind first.


The Roots of Data Science
Simple observation, and the recording of those observations, dates back to the most ancient civilizations.
– The Greeks were the first Western civilization to adopt observation and measurement
  – Some call Aristotle the first empirical scientist
– Muslim scholars between the 10th and 14th centuries developed experimentation (Ibn al-Haytham)
– Roger Bacon promoted inductive reasoning (inference)
– Descartes shifted focus to deductive reasoning

What is Data Science?
“How Companies Learn Your Secrets,” NYT, by Charles Duhigg, February 16

What did Target Do?
Mining of data on shopping patterns
– Specific products purchased
– Combination of products purchased
– Combined with demographic and other data
Psychology and neuroscience
– Habits: cue-routine-reward
– When are habits open to change?
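
To make "combination of products purchased" concrete, here is a minimal Python sketch that counts how often two products show up in the same basket. It is only an illustration of the general idea, not the method Target used. The function name `pair_counts` and the basket contents in the usage note are made up for this example.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products.

    `baskets` is a list of lists of product names (hypothetical toy data).
    """
    counts = Counter()
    for basket in baskets:
        # De-duplicate and sort so each pair has one canonical ordering.
        items = sorted(set(basket))
        for pair in combinations(items, 2):
            counts[pair] += 1
    return counts
```

Calling `pair_counts([["lotion", "vitamins", "soap"], ["lotion", "vitamins"], ["soap", "bread"]])` and then `.most_common(1)` on the result surfaces the pair that co-occurs most often. Real analyses would combine counts like these with demographic and other data, as the slide notes.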

Lessons from Target
– Yes, Data Science is about mining data
– There are deeper theoretical issues involved in understanding what you find
– Left out of that long article are most of the critical steps that precede the analysis
– In short, Data Science > data mining

Definition of Data Science
There are many, but most say data science is:
– Broad: broader than any one existing discipline
– Interdisciplinary: computer science, statistics, information science, databases, mathematics
  – Also substantive domains (environmental science, sociology, public health, etc.)
– Applied: focused on extracting knowledge from data to inform decision making
– Focused on the skills needed to collect, manage, store, distribute, analyze, visualize, and reuse data
There are many visual representations of Data Science.

Some definitions link computational, statistical, and substantive expertise

Other definitions focus more on technical skills alone

Still other definitions are so broad as to include nearly everything

There are many “Word Cloud” representations of Data Science as well


Definition of Data Science
The field is immature, cluttered by hype, and unfocused. But key features should include:
– Data across its lifecycle
– Interdisciplinary skills
– Substantive knowledge

Defining Some Terms

MapReduce and Hadoop
– Designed to process large operations quickly
– Distributes the problem across multiple servers
– The Map part filters and sorts data into bins or queues based on some shared characteristic
– The Reduce part then executes some operation on each bin of data
– Results are then reassembled
– It is like parallel processing, but distributed across servers rather than just processors
– Scalable and fault tolerant
– Hadoop is an open-source implementation
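
The map, bin, and reduce steps above can be sketched in a few lines of Python. This is a single-process illustration of the pattern, assuming a word-count task; it is not Hadoop, which spreads the same steps across many servers. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are invented for this sketch.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, value) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle(pairs):
    """Sort the emitted pairs into bins that share a key (here, the word)."""
    bins = defaultdict(list)
    for key, value in pairs:
        bins[key].append(value)
    return bins


def reduce_phase(bins):
    """Reduce: run one operation (here, a sum) on each bin."""
    return {key: sum(values) for key, values in bins.items()}


def word_count(records):
    """Chain the three steps and return the reassembled result."""
    return reduce_phase(shuffle(map_phase(records)))
```

For example, `word_count(["big data is big", "data science uses data"])` counts each word across both records. Because each bin is reduced independently of the others, the bins could in principle be handed to different machines, which is what makes the pattern scale.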

More on MapReduce
Pig
– Software platform used for creating MapReduce programs that run on Hadoop.
Hive
– A data warehouse infrastructure built on top of Hadoop. Used to query, summarize, and analyze data.

Database Management
SQL – Structured Query Language
– A programming language designed for managing relational databases
MySQL
– Open-source relational database management system that uses SQL (used by Wikipedia, Google, Facebook, Twitter, Flickr, YouTube)
NoSQL – (Not Only SQL)
– Used for databases where the data is in some form other than the tabular relations used in relational databases
– Cassandra (Apache)
  – Distributed database management with no single point of failure
  – Scalable with no downtime
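
As a small illustration of what SQL looks like in practice, the sketch below creates a table, loads a few rows, and runs an aggregate query. It uses SQLite through Python's standard `sqlite3` module rather than MySQL, purely so that it runs without a database server. The table name `purchases`, its columns, and the function `top_spenders` are assumptions made for this example.

```python
import sqlite3


def top_spenders(rows, minimum):
    """Total purchases per customer and keep totals of at least `minimum`.

    `rows` is a list of (customer, amount) tuples (hypothetical data).
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, SUM(amount) AS total "
            "FROM purchases "
            "GROUP BY customer "
            "HAVING SUM(amount) >= ? "
            "ORDER BY total DESC",
            (minimum,),
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

The query text itself is standard SQL, so a similar statement should work on other relational systems, although data types and connection details differ from one system to the next.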


Cloud Computing
Standard client-server model where computing operations don’t happen on the local (desktop) machine.
What’s new? Virtualization. You are not connecting to a specific server.
– Servers are virtual
– One server can run multiple virtual machines
– One virtual machine can use multiple servers
This makes the “machine” scalable, movable, and configurable.
It allows selling software, platforms, and even computing infrastructure as a “service”.


Data Mining/Machine Learning
– Machine learning uses computer algorithms to get a machine to learn and adapt to new information.
– Data mining more explicitly focuses on discovering patterns or structure in a given set of data.
– The two terms are often used as synonyms by non-experts without much loss of information.


Web Scraping
– This is a process of collecting information from websites and then organizing it for some sort of analysis.
– Scraping is just about getting the data; the analysis comes later.
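
A minimal sketch of the "getting the data" step, using only Python's standard `html.parser` module: it pulls the text and target of every link out of a page and organizes them as a list of pairs. To stay self-contained it parses an HTML string rather than fetching a live site. The class name `LinkCollector` and the function `scrape_links` are invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def scrape_links(html):
    """Parse an HTML string and return its links as a list of tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Given a string such as `'<a href="/r">Learn R</a>'`, `scrape_links` returns the link text paired with its address. The resulting list is the organized data; any analysis of it would be a separate, later step.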

Programming Tools
– R: statistical programming (object-oriented scripting language)
– Python: a scripting language that supports object-oriented, structured, and functional programming
– SQL: querying and managing relational databases
– SAS: general-purpose data analysis software
– Julia: faster than R and more scalable than Python
– Kafka and Storm: used for real-time streaming analysis