Https://portal.futuregrid.org Data Analytics and its Curricula DELSA Workshop October 9 2012 Chicago Geoffrey Fox Informatics, Computing.

Slides:



Advertisements
Similar presentations
Nokia Technology Institute Natural Partner for Innovation.
Advertisements

Teaching Courses in Scientific Computing 30 September 2010 Roger Bielefeld Director, Advanced Research Computing.
Clouds from FutureGrid’s Perspective April Geoffrey Fox Director, Digital Science Center, Pervasive.
CHAPTER 4 ANALYTICS, DECISION SUPPORT, AND ARTIFICIAL INTELLIGENCE
Curriculum Overview School of Information and Library Science University of North Carolina at Chapel Hill BW.
Curriculum Overview School of Information and Library Science University of North Carolina at Chapel Hill BM.
Corporation For National Research Initiatives NSF SMETE Library Building the SMETE Library: Getting Started William Y. Arms.
Data Mining – Intro.
Master of Arts in Data Science Geoffrey Fox for Data Science Program March
The Indiana University School of Informatics Bobby Schnabel: Dean, Indiana University School of Informatics Presented by Geoffrey Fox: Associate Dean for.
The Indiana University School of Informatics Bobby Schnabel: Dean, Indiana University School of Informatics Presented by Geoffrey Fox: Associate Dean for.
Introduction to Data Science Kamal Al Nasr, Matthew Hayes and Jean-Claude Pedjeu Computer Science and Mathematical Sciences College of Engineering Tennessee.
Master of Arts in Data Science
© Heikki Topi Computing and University Education in Analytics ACM Education Council San Francisco, CA November 2, 2013 Heikki Topi, Bentley University.
Big Data Course Plans at Purdue Ananth Iyer. Big Data/Analytics Coursera course on Big Data by Bill Howe claims that Big Data involves issues of
Data Analytics and its Curricula Microsoft eScience Workshop October Chicago Geoffrey Fox Informatics,
Cyberinfrastructure Supporting Social Science Cyberinfrastructure Workshop October Chicago Geoffrey Fox
Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,
Big Data in the Cloud: Research and Education September PPAM 2013 Warsaw Geoffrey Fox
1 Challenges Facing Modeling and Simulation in HPC Environments Panel remarks ECMS Multiconference HPCS 2008 Nicosia Cyprus June Geoffrey Fox Community.
© Heikki Topi Data Science and Computing Education ACM Education Council Portland, OR September 16-17, 2014 Heikki Topi, Bentley University.
Big Data and Clouds: Challenges and Opportunities NIST January Geoffrey Fox
X-Informatics Introduction: What is Big Data, Data Analytics and X-Informatics? January Geoffrey Fox
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
Data Mining Chun-Hung Chou
I399 1 Research Methods for Informatics and Computing D: Basic Issues Geoffrey Fox Associate Dean for Research.
Data Management Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition.
Last Words COSC Big Data (frameworks and environments to analyze big datasets) has become a hot topic; it is a mixture of data analysis, data mining,
Final Search Terms: Archiving (digital or data) Authentication (data) Conservation (digital or data) Curation (digital or data) Cyberinfrastructure Data.
Science Clouds and FutureGrid’s Perspective June Science Clouds Workshop HPDC 2012 Delft Geoffrey Fox
Scientific Computing Environments ( Distributed Computing in an Exascale era) August Geoffrey Fox
FutureGrid Connection to Comet Testbed and On Ramp as a Service Geoffrey Fox Indiana University Infra structure.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
National Center for Supercomputing Applications Barbara S. Minsker, Ph.D. Associate Professor National Center for Supercomputing Applications and Department.
Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.
SALSASALSASALSASALSA Cloud Panel Session CloudCom 2009 Beijing Jiaotong University Beijing December Geoffrey Fox
The Interplay Between Mathematics/Computation and Analytics Haesun Park Division of Computational Science and Engineering Georgia Institute of Technology.
Resources and Reflections: Using Data in Undergraduate Geosciences Cathy Manduca SERC Carleton College DLESE Annual Meeting 2003.
Computing Research Testbeds as a Service: Supporting large scale Experiments and Testing SC12 Birds of a Feather November.
Community Grids Lab at Pervasive Technology Labs Geoffrey Fox
Remarks on MOOC’s Open Grid Forum BOF July 24 OGF38B at XSEDE13 San Diego Geoffrey Fox Informatics, Computing.
OMIS 694, Big Data Analytics
Big Data to Knowledge Panel SKG 2014 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China August Geoffrey Fox
HPC in the Cloud – Clearing the Mist or Lost in the Fog Panel at SC11 Seattle November Geoffrey Fox
Training Data Scientists DELSA Workshop DW4 May Washington DC Geoffrey Fox Informatics, Computing.
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
Remarks on MOOC’s SC13 Birds of a Feather November Geoffrey Fox Informatics, Computing and Physics.
Directions in eScience Interoperability and Science Clouds June Interoperability in Action – Standards Implementation.
Big Data Open Source Software and Projects ABDS in Summary II: Layer 5 I590 Data Science Curriculum August Geoffrey Fox
1 Seattle University Master’s of Science in Business Analytics Key skills, learning outcomes, and a sample of jobs to apply for, or aim to qualify for,
Copyright © 2016 Pearson Education, Inc. Modern Database Management 12 th Edition Jeff Hoffer, Ramesh Venkataraman, Heikki Topi CHAPTER 11: BIG DATA AND.
FACULTY EXTERNSHIP OPPORTUNITIES IN DATA SCIENCE AND DATA ANALYTICS Facilitated by: FilAm Software Technology, Clark Freeport Zone Ecuiti, San Francisco,
Book web site:
Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit
1 Panel on Merge or Split: Mutual Influence between Big Data and HPC Techniques IEEE International Workshop on High-Performance Big Data Computing In conjunction.
Introducing Precictive Analytics
Computer Information Technology
Brief Intro to Machine Learning CS539
Private Public FG Network NID: Network Impairment Device
Status and Challenges: January 2017
NSF : CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science PI: Geoffrey C. Fox Software: MIDAS HPC-ABDS.
Biology MDS and Clustering Results
Tutorial Overview February 2017
OMIS 665, Big Data Analytics
Scalable Parallel Interoperable Data Analytics Library
4 Education Initiatives: Data Science, Informatics, Computational Science and Intelligent Systems Engineering; What succeeds? National Academies Workshop.
Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data
Clouds from FutureGrid’s Perspective
Department of Intelligent Systems Engineering
Panel on Research Challenges in Big Data
Presentation transcript:

Data Analytics and its Curricula DELSA Workshop October Chicago Geoffrey Fox Informatics, Computing and Physics Indiana University Bloomington

Project 3: Training Data Scientists Lead: Geoffrey Fox (Indiana University, Bloomington, Indiana) Goal: Train new and established scientists to enable more effective use of big data and its cyberinfrastructure. Deliverable: Courses in data enabled science culminating in a certification similar to Microsoft or Cisco certification or existing scientific computing or computational science certificates/curricula. Need to evaluate existing resources such as: UW eScience classes, OGF Grid Computing certificate, and XSEDE HPC University… Possible approach is to focus on particular life science subdomains. 2

Jobs v. Countries 3

McKinsey Institute on Big Data Jobs There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions. 4

Data Analytics Broad Range of Topics from Policy to new algorithms Enables X-Informatics where several X’s defined especially in Life Sciences – Medical, Bio, Chem, Health, Pathology, Astro, Social, Business, Security, Intelligence Informatics defined (more or less) – Could invent Life Style (e.g. IT for Facebook), Radar …. Informatics – Physics Informatics ought to exist but doesn’t Plenty of Jobs and broader range of possibilities than computational science but similar issues – What type of degree (Certificate, track, “real” degree) – What type of program (department, interdisciplinary group supporting education and research program) 5

Computational Science Interdisciplinary field between computer science and applications with primary focus on simulation areas Very successful as a research area – XSEDE and Exascale systems enable Several academic programs but these have been less successful as – No consensus as to curricula and jobs (don’t appoint faculty in computational science; do appoint to DoE labs) – Field relatively small Started around

General Remarks I An immature (exciting) field: No agreement as to what is data analytics and what tools/computers needed – Databases or NOSQL? – Shared repositories or bring computing to data – What is repository architecture? Sources: Data from observation or simulation Different terms: Data analysis, Datamining, Data analytics., machine learning, Information visualization, Data Science Fields: Computer Science, Informatics, Library and Information Science, Statistics, Application Fields including Business Approaches: Big data (cell phone interactions) v. Little data (Ethnography, surveys, interviews) Includes: Security, Provenance, Metadata, Data Management, Curation 7

General Remarks II Tools: Regression analysis; biostatistics; neural nets; bayesian nets; support vector machines; classification; clustering; dimension reduction; artificial intelligence; semantic web Some driving forces: Patient records growing fast and Abstract graphs from net leads to community detection Some data in metric spaces; others very high dimension or none Large Hadron Collider analysis mainly histogramming – all can be done with MapReduce (larger use than MPI) Commercial: Google, Bing largest data analytics in world Time Series: Earthquakes, Tweets, Stock Market (Pattern Informatics) Image Processing from climate simulations to NASA to DoD to Radiology (Radar and Pathology Informatics – same library) Financial decision support; marketing; fraud detection; automatic preference detection (map users to books, films) 8

Data Analytics Futures? PETSc and ScaLAPACK and similar libraries very important in supporting parallel simulations Need equivalent Data Analytics libraries Include datamining (Clustering, SVM, HMM, Bayesian Nets …), image processing, information retrieval including hidden factor analysis (LDA), global inference, dimension reduction – Many libraries/toolkits (R, Matlab) and web sites (BLAST) but typically not aimed at scalable high performance algorithms Should support clouds and HPC; MPI and MapReduce – Iterative MapReduce an interesting runtime; Hadoop has many limitations Need a coordinated Academic Business Government Collaboration to build robust algorithms that scale well – Crosses Science, Business Network Science, Social Science Propose to build community to define & implement SPIDAL or Scalable Parallel Interoperable Data Analytics Library 9

Massive Open Online Courses (MOOC) MOOC’s are very “hot” these days with Udacity and Coursera as start-ups Over 100,000 participants but concept valid at smaller sizes Relevant to Data Science as this is a new field with few courses at most universities Technology to make MOOC’s – Drupal mooc (unclear) – Google Open Source Course Builder is lightweight LMS (learning management system) released September 12 rescuing us from Sakai At least one model is collection of short prerecorded segments (talking head over PowerPoint) 10

I400 X-Informatics (MOOC) General overview of “use of IT” (data analysis) in “all fields” starting with data deluge and pipeline Observation  Data  Information  Knowledge  Wisdom Go through many applications from life/medical science to “finding Higgs” and business informatics Describe cyberinfrastructure needed with visualization, security, provenance, portals, services and workflow Lab sessions built on virtualized infrastructure (appliances) Describe and illustrate key algorithms histograms, clustering, Support Vector Machines, Dimension Reduction, Hidden Markov Models and Image processing 11

12 SchoolProgram On- Campus OnlineDegrees Undergraduate George Mason University Computational and Data Sciences: the combination of applied math, real world CS skills, data acquisition and analysis, and scientific modeling YesNoB.S. Illinois Institute of Technology CS Specialization in Data Science CIS specialization in Data Science B.S. Oxford University Data and Systems Analysis?YesAdv. Diploma Masters Bentley University Marketing Analytics: knowledge and skills that marketing professionals need for a rapidly evolving, data-focused, global business environment. Yes?M.S. Carnegie Mellon MISM Business Intelligence and Data Analytics: an elite set of graduates cross-trained in business process analysis and skilled in predictive modeling, GIS mapping, analytical reporting, segmentation analysis, and data visualization. Yes M.S. 9 courses Carnegie Mellon Very Large Information Systems: train technologists to (a) develop the layers of technology involved in the next generation of massive IS deployments (b) analyze the data these systems generate DePaul University Predictive Analytics: analyze large datasets and develop modeling solutions for decision making, an understanding of the fundamental principles of marketing and CRM Yes?MS. Georgia Southern University Comp Sci with concentration in Data and Know. Systems: covers speech and vision recognition systems, expert systems, data storage systems, and IR systems, such as online search engines NoYesM.S. 30 cr Survey from Howard Rosenbaum SLIS IU

13 Illinois Institute of Technology CS specialization in Data Analytics: intended for learning how to discover patterns in large amounts of data in information systems and how to use these to draw conclusions. Yes?Masters 4 courses Louisiana State University businessanalytics.lsu.edu/ Business Analytics: designed to meet the growing demand for professionals with skills in specialized methods of predictive analytics 36 cr YesNoM.S. 36 cr Michigan State University Business Analytics: courses in business strategy, data mining, applied statistics, project management, marketing technologies, communications and ethics YesNoM.S. North Carolina State University: Institute for Advanced Analytics Analytics: designed to equip individuals to derive insights from a vast quantity and variety of data YesNoM.S.: 30 cr. Northwestern University Predictive Analytics: a comprehensive and applied curriculum exploring data science, IT and business of analytics Yes M.S. New York University Business Analytics: unlocks predictive potential of data analysis to improve financial performance, strategic management and operational efficiency YesNoM.S. 1 yr Stevens Institute of Technology Business Intel. & Analytics: offers the most advanced curriculum available for leveraging quant methods and evidence-based decision making for optimal business performance Yes M.S.: 36 cr. University of Cincinnati Business Analytics: combines operations research and applied stats, using applied math and computer applications, in a business environment YesNoM.S. University of San Francisco Analytics: provides students with skills necessary to develop techniques and processes for data-driven decision-making — the key to effective business strategies YesNoM.S.

14 Certificate Syracuse Data Science: for those with background or experience in science, stats, research, and/or IT interested in interdiscip work managing big data using IT tools Yes?Grad Cert. 5 courses Rice University Big Data Summer Institute: organized to address a growing demand for skills that will help individuals and corporations make sense of huge data sets YesNoCert. Stanford University Data Mining and Applications: introduces important new ideas in data mining and machine learning, explains them in a statistical framework, and describes their applications to business, science, and technology NoYesGrad Cert. University of California San Diego Data Mining: designed to provide individuals in business and scientific communities with the skills necessary to design, build, verify and test predictive data models NoYesGrad Cert. 6 courses University of Washington Data Science: Develop the computer science, mathematics and analytical skills in the context of practical application needed to enter the field of data science Yes Cert. Ph.D George Mason University Computational Sci and Informatics: role of computation in sci, math, and engineering, YesNoPh.D. IU SoIC InformaticsYesNoPh.D

15 FutureGrid Usages Computer Science Applications and understanding Science Clouds Technology Evaluation including XSEDE testing Education and Training IaaS  Hypervisor  Bare Metal  Operating System  Virtual Clusters, Networks PaaS  Cloud e.g. MapReduce  HPC e.g. PETSc, SAGA  Computer Science e.g. Languages, Sensor nets Research Computing aaS  Custom Images  Courses  Consulting  Portals  Archival Storage SaaS  System e.g. SQL, GlobusOnline  Applications e.g. Amber, Blast FutureGrid offers Computing Testbed as a Service FutureGrid Uses Testbed-aaS Tools  Provisioning  Image Management  IaaS Interoperability  IaaS tools  Expt management  Dynamic Network  Devops FutureGrid Uses Testbed-aaS Tools  Provisioning  Image Management  IaaS Interoperability  IaaS tools  Expt management  Dynamic Network  Devops