Published by Cora Turner. Modified over 8 years ago.
1
Applied CyberInfrastructure Concepts, Fall 2015
Nirav Merchant (nirav@email.arizona.edu), Bio Computing & iPlant Collaborative
Eric Lyons (ericlyons@email.arizona.edu), Plant Sciences & iPlant Collaborative
University of Arizona
http://goo.gl/p4j3m or https://sites.google.com/site/appliedciconcepts/
Will Computers Crash Genomics? Science Vol 331 Feb 2011
2
PowerPoint Does Rocket Science--and Better Techniques for Technical Reports
Essay by Edward Tufte
3
Topic Coverage
Frontiers in Massive Data Analysis (textbook)
Focus of this course
What is “Cyberinfrastructure”?
What is “Big Data”?
Who is a “Data Scientist”?
Why should you care about it?
Project-based learning
Discussion: resources, conduct, etc.
4
+ = Simple Formula
5
The Reality
PERL Python Java Ruby Fortran C C# C++ R Matlab etc.
+ Amazon Azure Rackspace Campus HPC XSEDE etc.
+ and lots of glue...
6
+ = Simple Formula
7
7
8
Science Paradigms
1. A thousand years ago: science was empirical, describing natural phenomena through observations
2. Last few hundred years: a theoretical branch using models and generalizations
3. Last few decades: a computational branch simulating complex phenomena
4. Today: data exploration (eScience), unifying theory, experiment, and simulation
Based on the transcript of a talk given by the late Jim Gray to the National Research Council – Computer Science and Telecommunications Board in Mountain View, CA, on January 11, 2007
9
The Fourth Paradigm: Data-Intensive Scientific Discovery
Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of eScience such as databases, workflow management, visualization, and cloud computing technologies.
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
10
The Discovery Lifecycle
Source: The Fourth Paradigm: Data-Intensive Scientific Discovery
11
Evolution of X-Info
The evolution of X-Info and Comp-X for each discipline X (e.g., Bio-Informatics, Computational-Biology)
How to codify and represent our knowledge
The Generic Problems:
How to share it with others
Query and Vis tools
Building and executing models
Integrating data and literature
Documenting experiments
Curation and long-term preservation
Data ingest
Managing a petabyte
Common schema
How to organize it
How to reorganize it
12
Classic paradigm: You produce data, analyze, and interpret it (end to end)
Conventional paradigm: Consortium/centers produce data and you consume it
New paradigm: Consortium/centers have produced data and are creating “cyberinfrastructure” to tackle the “grand challenge”
13
∧ big
14
14
15
The “V” of big data
Volume
Velocity
Variety
(Value)
Attributed to Gartner Consulting
16
Big Data
Extracting meaningful results from vast amounts of data (linked data)
Big data “information assets” demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
“Big Data” Is Only the Beginning of Extreme Information Management
Big Data Technology, All Is Not New
Attributed to Gartner Consulting
17
The transition (Data->Big data)
http://hortonworks.com/blog/7-key-drivers-for-the-big-data-market/
18
The hype cycle (2014)
20
Dealing with the choices (ThoughtWorks)
http://www.thoughtworks.com/radar
https://assets.thoughtworks.com/assets/technology-radar-may-2015-en.pdf
21
Building Data Science Teams
Technical expertise: the best data scientists typically have deep expertise in some scientific discipline.
Curiosity: a desire to go beneath the surface and discover and distill a problem down into a very clear set of hypotheses that can be tested.
Storytelling: the ability to use data to tell a story and to be able to communicate it effectively.
Cleverness: the ability to look at a problem in different, creative ways.
D.J. Patil: characterizing data scientist qualities
22
EMC
http://www.r-bloggers.com/emc-survey-differentiates-bi-and-data-science/
25
Big Data: Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
26
Rise of the “data janitors”
27
@jeremyjarvis @danariely
28
So what is the course about?
Provide you with key concepts to work with “Big Data”
Get familiar with the use of cyberinfrastructure
Help you build your “tool chest” of simple, easy-to-use resources
Ability to work as a TEAM
Having fun with cutting-edge computing infrastructure
Working with REAL data and infrastructure!
Take these pragmatic skills to your lab/job
29
Abstraction:
Computational thinking (C.T.) is operating in terms of multiple layers of abstraction simultaneously
C.T. is defining the relationships between the layers
Automation:
C.T. is thinking in terms of mechanizing the abstraction layers and their relationships
Mechanization is possible due to precise and exacting notations and models
There is some “machine” below (human or computer, virtual or physical)
They give us the ability and audacity to scale.
30
Focus for this course
Project-based learning class
Introduce fundamental concepts, tools and resources, and best practices for effectively managing common tasks associated with analyzing large datasets
Provide familiarity with cyberinfrastructure (CI) resources available at the University of Arizona campus, iPlant Collaborative, and NSF XSEDE centers
31
Focus for this course
Learn how to automate computation (tasks)
Learn how to utilize distributed compute and storage resources
Adopt a “large-ish” dataset (and do fun things with it)
Build your tool chest for working with “Big Data”
YOU will develop wiki-based documentation of these best practices
YOU will learn how to effectively collaborate in interdisciplinary team settings
COMPUTATIONAL THINKING AND DOING!
32
Topics we will emphasize
Scalable Data Handling: iRODS
Distributed Workflow Management: Makeflow
Visualization: Web based
Computing platforms: XSEDE, UA Campus, iPlant CI
Software Carpentry: Git, wiki, etc.
Stitching all of this together
33
Class logistics
Grading based on:
Assignments (~5)
Group Projects
Class participation
Midterm (focused on key concepts and problem solving)
Graduate v. undergraduates:
Demonstrated application towards novel discovery (hopefully using their data)
Mentorship to undergraduates
34
Where/What is the XYZ
Class documentation is on the iPlant wiki (google “iPlant wiki” and go to ACIC 2015)
If you don’t find it (search again), then write your own
Where are the PPTs from class?
How do I form a group?
How do I turn in my homework?
What if my group hates me or I hate them?
How is this class different than last year?
35
Questions I have been asked so far
How difficult is this class?
Do I need a laptop?
Do I need to know LINUX?
Do I need to be a sysadmin?
Do I need to know how to program in X language?
Will this help me with my X project?
Will all my jobs run faster (and I can graduate sooner)?
Can I take this for audit, sit in, sleep?
Can I bring my X to class?
36
Pragmatic Cyberinfrastructure (CI)
Pragmatic*: Dealing with things sensibly and realistically in a way that is based on practical rather than theoretical considerations
Cyberinfrastructure**: Research environments that support advanced data analytics, data management, data visualization and other computing and information processing services distributed over the Internet. These capabilities are beyond the scope of a single institution to implement
* Oxford dictionary (from my mac)
** Wikipedia
Knowledgeable people and engaged community are essential components for successful CI
37
Cyberinfrastructure (Wikipedia)
Research environments that support:
advanced data acquisition
data storage
data management
data integration
data mining
data visualization
and other computing and information processing services distributed over the Internet beyond the scope of a single institution
38
Man is the best computer we can put aboard a spacecraft. And the only one that can be mass produced with unskilled labor.
-- Wernher von Braun, greatest rocket scientist in history
http://earthobservatory.nasa.gov/Features/vonBraun/
39
39