Applied CyberInfrastructure Concepts, Fall 2015. Nirav Merchant, Bio Computing & iPlant Collaborative; Eric Lyons


1 Applied CyberInfrastructure Concepts, Fall 2015. Nirav Merchant (nirav@email.arizona.edu), Bio Computing & iPlant Collaborative; Eric Lyons (ericlyons@email.arizona.edu), Plant Sciences & iPlant Collaborative; University of Arizona. http://goo.gl/p4j3m or https://sites.google.com/site/appliedciconcepts/ Will Computers Crash Genomics? Science Vol 331 Feb 2011

2 PowerPoint Does Rocket Science--and Better Techniques for Technical Reports (essay by Edward Tufte)

3 Topic Coverage
- Frontiers in Massive Data Analysis (textbook)
- Focus of this course
- What is “Cyberinfrastructure”?
- What is “Big Data”?
- Who is a “Data Scientist”?
- Why should you care about it?
- Project-based learning
- Discussion: resources, conduct, etc.

4 Simple Formula: [figure] + [figure] = [figure]

5 The Reality: Perl, Python, Java, Ruby, Fortran, C, C#, C++, R, Matlab, etc. + Amazon, Azure, Rackspace, Campus HPC, XSEDE, etc. + lots of glue…

6 Simple Formula: [figure] + [figure] = [figure]

8 Science Paradigms
1. A thousand years ago: science was empirical, describing natural phenomena through observation
2. Last few hundred years: a theoretical branch, using models and generalizations
3. Last few decades: a computational branch, simulating complex phenomena
4. Today: data exploration (eScience), unifying theory, experiment, and simulation
Based on the transcript of a talk given by the late Jim Gray to the National Research Council – Computer Science and Telecommunications Board in Mountain View, CA, on January 11, 2007

9 The Fourth Paradigm: Data-Intensive Scientific Discovery
- Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.
- The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies.
http://research.microsoft.com/en-us/collaboration/fourthparadigm/

10 The Discovery Lifecycle. Source: The Fourth Paradigm: Data-Intensive Scientific Discovery

11 Evolution of X-Info
- The evolution of X-Info and Comp-X for each discipline X (e.g., Bio-Informatics, Computational Biology)
- How to codify and represent our knowledge
- The generic problems:
  - Data ingest
  - Managing a petabyte
  - Common schema
  - How to organize it
  - How to reorganize it
  - How to share it with others
  - Query and vis tools
  - Building and executing models
  - Integrating data and literature
  - Documenting experiments
  - Curation and long-term preservation
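To give a feel for why "managing a petabyte" appears among the generic problems, here is a minimal back-of-the-envelope sketch in Python. The link speeds are illustrative assumptions (not figures from the slides), the function name `transfer_days` is hypothetical, and the estimate assumes an ideal link with no protocol overhead or contention.

```python
# Rough estimate of how long it takes to move a dataset over a network link.
# All link speeds below are illustrative assumptions, not measurements.

SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # decimal petabyte, in bytes


def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Return the number of days needed to move size_bytes over an ideal link."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for label, rate in [("1 Gb/s", 1e9), ("10 Gb/s", 1e10), ("100 Gb/s", 1e11)]:
        print(f"{label}: {transfer_days(PETABYTE, rate):.1f} days")
```

Under these assumptions, one petabyte takes roughly three months at 1 Gb/s and a little over nine days at 10 Gb/s, which is one reason data at this scale tends to be analyzed where it is stored rather than copied around.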

12
- Classic paradigm: you produce the data, analyze it, and interpret it (end to end)
- Conventional paradigm: consortia/centers produce data and you consume it
- New paradigm: consortia/centers have produced data and are creating “cyber infrastructure” to tackle the “grand challenge”

13 ∧ big

15 The “V” of big data
- Volume
- Velocity
- Variety
- (Value)
Attributed to Gartner Consulting

16 Big Data
- Extracting meaningful results from vast amounts of data (linked data)
- Big data “information assets” demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
- “Big Data” is only the beginning of extreme information management
- Big Data technology is not all new
Attributed to Gartner Consulting

17 The transition (Data -> Big data) http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/

18 The hype cycle (2014)

20 Dealing with the choices (ThoughtWorks) http://www.thoughtworks.com/radar https://assets.thoughtworks.com/assets/technology-radar-may-2015-en.pdf

21 Building Data Science Teams
- Technical expertise: the best data scientists typically have deep expertise in some scientific discipline.
- Curiosity: a desire to go beneath the surface and discover and distill a problem down into a very clear set of hypotheses that can be tested.
- Storytelling: the ability to use data to tell a story and to be able to communicate it effectively.
- Cleverness: the ability to look at a problem in different, creative ways.
D.J. Patil: characterizing data scientist qualities

22 EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/

23 EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/

24 EMC http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/

25 Big Data: Venn Diagram http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

26 Rise of the “data janitors”

27 @jeremyjarvis @danariely

28 So what is the course about?
- Provide you with key concepts to work with “Big Data”
- Get familiar with the use of cyberinfrastructure
- Help you build your “tool chest” of simple, easy-to-use resources
- Ability to work as a TEAM
- Having fun with cutting-edge computing infrastructure
- Working with REAL data and infrastructure!
- Take these pragmatic skills to your lab/job

29
- Abstraction:
  - C.T. (computational thinking) is operating in terms of multiple layers of abstraction simultaneously
  - C.T. is defining the relationships between the layers
- Automation:
  - C.T. is thinking in terms of mechanizing the abstraction layers and their relationships
  - Mechanization is possible due to precise and exacting notations and models
  - There is some “machine” below (human or computer, virtual or physical)
- They give us the ability and audacity to scale.

30 Focus for this course
- Project-based learning class
- Introduce fundamental concepts, tools and resources, and best practices for effectively managing common tasks associated with analyzing large datasets
- Provide familiarity with cyberinfrastructure (CI) resources available at the University of Arizona campus, iPlant Collaborative, and NSF XSEDE centers

31 Focus for this course
- Learn how to automate computation (tasks)
- Learn how to utilize distributed compute and storage resources
- Adopt a “large-ish” dataset (and do fun things with it)
- Build your tool chest for working with “Big Data”
- YOU will develop wiki-based documentation of these best practices
- YOU will learn how to effectively collaborate in interdisciplinary team settings
- COMPUTATIONAL THINKING AND DOING!

32 Topics we will emphasize
- Scalable data handling: iRODS
- Distributed workflow management: Makeflow
- Visualization: web based
- Computing platforms: XSEDE, UA Campus, iPlant CI
- Software Carpentry: Git, wiki, etc.
- Stitching all of this together
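As a rough illustration of what workflow management means in the topic list above, the sketch below orders a set of tasks so that each one runs only after the tasks it depends on. This is plain Python using the standard-library `graphlib` module (Python 3.9+), not Makeflow's own syntax or API; the task names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the list of tasks it depends on.
TASKS = {
    "download": [],
    "checksum": ["download"],
    "align": ["download"],
    "summarize": ["align"],
    "plot": ["summarize"],
}


def run_order(tasks):
    """Return one valid execution order: every task follows its dependencies."""
    return list(TopologicalSorter(tasks).static_order())


if __name__ == "__main__":
    for step, name in enumerate(run_order(TASKS), start=1):
        print(f"step {step}: {name}")
```

A workflow system builds on the same dependency graph but additionally runs independent tasks (here, `checksum` and `align`) in parallel across distributed resources and skips work whose outputs already exist.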

33 Class logistics
- Grading based on:
  - Assignments (~5)
  - Group projects
  - Class participation
  - Midterm (focused on key concepts and problem solving)
- Graduate vs. undergraduate students:
  - Demonstrated application towards novel discovery (hopefully using their data)
  - Mentorship to undergraduates

34 Where/What is the XYZ?
- Class documentation is on the iPlant wiki (google “iPlant wiki” and go to ACIC 2015)
- If you don’t find it (search again), then write your own
- Where are the PPTs from class?
- How do I form a group?
- How do I turn in my homework?
- What if my group hates me or I hate them?
- How is this class different from last year's?

35 Questions I have been asked so far
- How difficult is this class?
- Do I need a laptop?
- Do I need to know Linux?
- Do I need to be a sysadmin?
- Do I need to know how to program in X language?
- Will this help me with my X project?
- Will all my jobs run faster (so I can graduate sooner)?
- Can I take this for audit, sit in, sleep?
- Can I bring my X to class?

36 Pragmatic Cyberinfrastructure (CI)
- Pragmatic*: Dealing with things sensibly and realistically in a way that is based on practical rather than theoretical considerations
- Cyberinfrastructure**: Research environments that support advanced data analytics, data management, data visualization and other computing and information processing services distributed over the Internet. These capabilities are beyond the scope of a single institution to implement
- Knowledgeable people and an engaged community are essential components for successful CI
* Oxford dictionary (from my mac)
** Wikipedia

37 Cyberinfrastructure (Wikipedia)
- Research environments that support:
  - advanced data acquisition
  - data storage
  - data management
  - data integration
  - data mining
  - data visualization
  - and other computing and information processing services distributed over the Internet beyond the scope of a single institution

38 “Man is the best computer we can put aboard a spacecraft. And the only one that can be mass produced with unskilled labor.” -- Wernher von Braun, greatest rocket scientist in history http://earthobservatory.nasa.gov/Features/vonBraun/

