Applied CyberInfrastructure Concepts, Fall 2015. Nirav Merchant (Bio Computing & iPlant Collaborative), Eric Lyons

Presentation transcript:

1 Applied CyberInfrastructure Concepts, Fall 2015
Nirav Merchant, Bio Computing & iPlant Collaborative
Eric Lyons, Plant Sciences & iPlant Collaborative
University of Arizona
or: “Will Computers Crash Genomics?” (Science, Vol. 331, Feb 2011)

2 “PowerPoint Does Rocket Science--and Better Techniques for Technical Reports”, essay by Edward Tufte

3 Topic Coverage
• Frontiers in Massive Data Analysis (textbook)
• Focus of this course
• What is “Cyberinfrastructure”?
• What is “Big Data”?
• Who is a “Data Scientist”?
• Why should you care about it?
• Project-based learning
• Discussion: resources, conduct, etc.

4 + = Simple Formula

5 The Reality
PERL, Python, Java, Ruby, Fortran, C, C#, C++, R, Matlab, etc.
+ Amazon, Azure, Rackspace, Campus HPC, XSEDE, etc.
+ and lots of glue...

+ = Simple Formula

7

8 Science Paradigms
1. A thousand years ago: science was empirical, describing natural phenomena and observations
2. Last few hundred years: a theoretical branch, using models and generalizations
3. Last few decades: a computational branch, simulating complex phenomena
4. Today: data exploration (eScience), unifying theory, experiment, and simulation
Based on the transcript of a talk given by the late Jim Gray to the National Research Council, Computer Science and Telecommunications Board, in Mountain View, CA, on January 11, 2007

9 The Fourth Paradigm: Data-Intensive Scientific Discovery
• Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.
• The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies.

10 The Discovery Lifecycle. The Fourth Paradigm: Data-Intensive Scientific Discovery

11 Evolution of X-Info
• The evolution of X-Info and Comp-X for each discipline X (e.g., Bio-Informatics, Computational Biology)
• How to codify and represent our knowledge
• The generic problems:
  - How to share it with others
  - Query and Vis tools
  - Building and executing models
  - Integrating data and literature
  - Documenting experiments
  - Curation and long-term preservation
  - Data ingest
  - Managing a petabyte
  - Common schema
  - How to organize it
  - How to reorganize it

12
• Classic paradigm: you produce data, analyze, and interpret it (end to end)
• Conventional paradigm: consortium/centers produce data and you consume it
• New paradigm: consortium/centers have produced data and are creating “cyber infrastructure” to tackle the “grand challenge”

13 ∧ big

14

15 The “V” of big data
• Volume
• Velocity
• Variety
• (Value)
Attributed to Gartner Consulting

16 Big Data
• Extracting meaningful results from vast amounts of data (linked data)
• Big data “information assets” demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
• “Big Data” Is Only the Beginning of Extreme Information Management
• Big Data Technology: All Is Not New
Attributed to Gartner Consulting

17 The transition (Data -> Big data)

18 The hype cycle (2014)

Dealing with the choices (ThoughtWorks)

21 Building Data Science Teams
• Technical expertise: the best data scientists typically have deep expertise in some scientific discipline.
• Curiosity: a desire to go beneath the surface and discover and distill a problem down into a very clear set of hypotheses that can be tested.
• Storytelling: the ability to use data to tell a story and to be able to communicate it effectively.
• Cleverness: the ability to look at a problem in different, creative ways.
D.J. Patil: characterizing data scientist qualities

22 EMC

23 EMC

24 EMC

25 Big Data: Venn Diagram

26 Rise of the “data janitors”

@danariely

28 So what is the course about?
• Provide you with key concepts to work with “Big Data”
• Get familiar with the use of cyberinfrastructure
• Help you build your “tool chest” of simple, easy-to-use resources
• Ability to work as a TEAM
• Having fun with cutting-edge computing infrastructure
• Working with REAL data and infrastructure!
• Take these pragmatic skills to your lab/job

29
• Abstraction: computational thinking (C.T.) is operating in terms of multiple layers of abstraction simultaneously; C.T. is defining the relationships between the layers
• Automation: C.T. is thinking in terms of mechanizing the abstraction layers and their relationships
• Mechanization is possible due to precise and exacting notations and models
• There is some “machine” below (human or computer, virtual or physical)
• They give us the ability and audacity to scale.

30 Focus for this course
• Project-based learning class
• Introduce fundamental concepts, tools and resources, and best practices for effectively managing common tasks associated with analyzing large datasets
• Provide familiarity with cyberinfrastructure (CI) resources available at the University of Arizona campus, the iPlant Collaborative, and NSF XSEDE centers

31 Focus for this course
• Learn how to automate computation (tasks)
• Learn how to utilize distributed compute and storage resources
• Adopt a “large-ish” dataset (and do fun things with it)
• Build your tool chest for working with “Big Data”
• YOU will develop wiki-based documentation of these best practices
• YOU will learn how to effectively collaborate in interdisciplinary team settings
• COMPUTATIONAL THINKING AND DOING!

32 Topics we will emphasize
• Scalable data handling: iRODS
• Distributed workflow management: Makeflow
• Visualization: web based
• Computing platforms: XSEDE, UA Campus, iPlant CI
• Software Carpentry: Git, wiki, etc.
• Stitching all of this together

33 Class logistics
• Grading based on:
  - Assignments (~5)
  - Group projects
  - Class participation
  - Midterm (focused on key concepts and problem solving)
• Graduates vs. undergraduates:
  - Demonstrated application towards novel discovery (hopefully using their data)
  - Mentorship to undergraduates

34 Where/What is the XYZ
• Class documentation is on the iPlant wiki (google “iPlant wiki” and go to ACIC 2015)
• If you don’t find it (search again), then write your own
• Where are the PPTs from class?
• How do I form a group?
• How do I turn in my homework?
• What if my group hates me or I hate them?
• How is this class different than last year?

35 Questions I have been asked so far
• How difficult is this class?
• Do I need a laptop?
• Do I need to know LINUX?
• Do I need to be a sysadmin?
• Do I need to know how to program in X language?
• Will this help me with my X project?
• Will all my jobs run faster (and can I graduate sooner)?
• Can I take this for audit, sit in, sleep?
• Can I bring my X to class?

Pragmatic Cyberinfrastructure (CI)
• Pragmatic*: dealing with things sensibly and realistically in a way that is based on practical rather than theoretical considerations
• Cyberinfrastructure**: research environments that support advanced data analytics, data management, data visualization and other computing and information processing services distributed over the Internet. These capabilities are beyond the scope of a single institution to implement
* Oxford dictionary (from my mac)
** Wikipedia
Knowledgeable people and engaged community are essential components for successful CI

37 Cyberinfrastructure (Wikipedia)
• Research environments that support:
  - advanced data acquisition
  - data storage
  - data management
  - data integration
  - data mining
  - data visualization
  - and other computing and information processing services distributed over the Internet beyond the scope of a single institution

38 “Man is the best computer we can put aboard a spacecraft. And the only one that can be mass produced with unskilled labor.”
-- Wernher von Braun, greatest rocket scientist in history

39