NIST Big Data Public Working Group

Presentation transcript:

NIST Big Data Public Working Group
Use Case and Requirements Subgroup
September 30, 2013
Co-chairs: Geoffrey Fox (Indiana University), Joe Paiva (VA), Tsegereda Beyene (Cisco)

Overview
- Initial process with use case collection
- Use cases
- Continuing process with requirement specifications
- Requirements
- Web resources
- Next steps

Working Group Process I
- Iteratively discussed goals of the working group, quickly focusing on use cases
- Based on prototyping of 8 use cases, agreed on August 11, 2013 on a use case template, with minor later clarifications of fields
- The 26 template fields cover the topics of the Reference Architecture
- 51 use cases collected from the end of July through September 7
- Contacted organizations/people, including NIST Big Data and the Research Data Alliance, to solicit use cases; especially strong NASA and DoE response
- In planning the write-up, we realized the value of "English" summaries and pictures, and added these as a separate request: review our summary and add pictures
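To make the template concrete, here is a minimal sketch of a use case record in Python. The field names are illustrative assumptions, not the actual 26 fields of the NIST template; only a handful are shown, populated from the genomics example later in this deck.

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    """Illustrative subset of a 26-field use case template record."""
    title: str
    domain: str                  # one of the 9 application areas
    goals: str
    current_approach: str
    futures: str
    data_volume: str             # e.g. "~40 TB", "PBs"
    data_velocity: str
    data_variety: str
    software: list = field(default_factory=list)
    security_privacy: str = ""

genomics = UseCase(
    title="NIST/Genome in a Bottle Consortium",
    domain="Healthcare and Life Sciences",
    goals="Highly confident characterization of whole human genomes",
    current_approach="~40TB NFS store at NIST; open-source UNIX bioinformatics tools",
    futures="~300GB compressed data/day per sequencer; other 'omics' measurements",
    data_volume="~40 TB (NIST), PBs (NIH/NCBI)",
    data_velocity="~300 GB/day per sequencer",
    data_variety="Multiple sequencing technologies and methods",
    software=["open-source bioinformatics (UNIX-based)"],
)
```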

51 Use Cases in 9 Areas, Part I
- Government Operation (4): National Archives and Records Administration, Census Bureau
- Commercial (8): Finance in the cloud, cloud backup, Mendeley (citations), Netflix, web search, digital materials, cargo shipping (as in UPS)
- Defense (3): Sensors, image surveillance, situation assessment
- Healthcare and Life Sciences (10): Medical records, graph and probabilistic analysis, pathology, bioimaging, genomics, epidemiology, people activity models, biodiversity
- Deep Learning and Social Media (6): Self-driving cars, geolocating images, Twitter, crowdsourcing, network science, NIST benchmark datasets

51 Use Cases in 9 Areas, Part II
- The Ecosystem for Research (4): Metadata, collaboration, language translation, light source experiments
- Astronomy and Physics (5): Sky surveys (including comparison to simulation), Large Hadron Collider at CERN, Belle II accelerator in Japan
- Earth, Environmental and Polar Science (10): Radar scattering in the atmosphere, earthquake, ocean, Earth and environment observation, ice sheet radar scattering, Earth radar mapping, climate simulation datasets, atmospheric turbulence identification, subsurface biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
- Energy (1): Smart grid
- Data volumes vary from TBs to 10s of PBs; platforms range from clouds to HPC

Use Case Template
- Agreed in this form on August 11
- Some clarification on Veracity vs. Data Quality added
- Request for a picture added
- Early use cases did a detailed breakup of the workflow into multiple stages, identifying use case/goals, infrastructure, data sources, data analytics, and security & privacy components

Example Use Case: Summary of Genomics
Application: The NIST/Genome in a Bottle Consortium integrates data from multiple sequencing technologies and methods to develop highly confident characterizations of whole human genomes as reference materials, and develops methods to use these reference materials to assess the performance of any genome sequencing run.
Current Approach: The ~40TB NFS store at NIST is full; there are also PBs of genomics data at NIH/NCBI. Open-source sequencing bioinformatics software from academic groups (UNIX-based) runs on a 72-core cluster at NIST, supplemented by larger systems at collaborators.
Futures: DNA sequencers can generate ~300GB of compressed data per day, a volume that has increased much faster than Moore's Law. Future data could include other 'omics' measurements, which will be even larger than DNA sequencing. Clouds have been explored.
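A back-of-the-envelope check shows why the current store is a bottleneck. The rates below come from the slide; the one-year comparison horizon is an assumption added for illustration.

```python
# Values quoted on the slide; horizon is an illustrative assumption.
gb_per_day = 300                         # compressed output of one DNA sequencer
tb_per_year = gb_per_day * 365 / 1000
print(f"One sequencer: ~{tb_per_year:.0f} TB/year")          # ~110 TB/year

nfs_store_tb = 40                        # the NFS store at NIST, already full
days_to_fill = nfs_store_tb * 1000 / gb_per_day
print(f"Days for one sequencer to fill the 40 TB store: {days_to_fill:.0f}")
# ~133 days -- a single instrument outpaces the current store well within a year
```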

Part of Property Summary Table

Requirements Extraction Process
A two-step process is used for requirements extraction:
1. Extract specific requirements and map them to the reference architecture, based on each application's characteristics:
   - Data sources (data size, file formats, rate of growth, at rest or in motion, etc.)
   - Data lifecycle management (curation, conversion, quality check, pre-analytic processing, etc.)
   - Data transformation (data fusion/mashup, analytics)
   - Capability infrastructure (software tools, platform tools, hardware resources such as storage and networking)
   - Data usage (processed results in text, table, visual, and other formats)
   All architecture components are informed by the goals and use case description; security & privacy has a direct map.
2. Aggregate all specific requirements into high-level generalized requirements that are vendor-neutral and technology-agnostic.
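A minimal sketch of this two-step flow, assuming hypothetical requirement records; the component names follow the reference architecture categories used in this report, and the example requirements are drawn from the ones listed on the next slide.

```python
from collections import defaultdict

# Step 1: specific requirements, each mapped to an architecture component.
specific = [
    ("Data Sources",       "High-volume data transfer to remote batch resource"),
    ("Capability",         "Software needs R, Matlab, Weka, Hadoop"),
    ("Data Sources",       "Support data updates every 15 minutes"),
    ("Security & Privacy", "Anonymization by aggregation"),
    ("Transformation",     "Real-time and batch analytic processing"),
]

# Step 2: group per component, as input to writing vendor-neutral
# generalized requirements for that component.
by_component = defaultdict(list)
for component, requirement in specific:
    by_component[component].append(requirement)

for component, reqs in sorted(by_component.items()):
    print(f"{component}: {len(reqs)} specific requirement(s) to generalize")
```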

Requirements
- 437 requirements were extracted from the use cases by working group members
- This is tentative, as the role and especially the categorization of requirements evolved along with the reference architecture
- Each use case has its own set of specific requirements, for example:
  - Requires high-volume data transfer to a remote batch processing resource
  - Software needs: R, Matlab, Weka, Hadoop
  - Support data updates every 15 minutes
  - Significant privacy issues requiring anonymization by aggregation
  - Requires rich, robust provenance defining complex machine/human processing
  - Both real-time and batch modes needed

Working Group Process II
- Note that the sum over all use cases of the per-case effort, sum_{i=0}^{50} t_i, can be pretty large however small each t_i is; i.e., processing 51 use cases takes a long time even for modest tasks
- A future solution would be to automate the process so that the submitter does most of the post-processing tasks
- The following material is available for each use case:
  - Template with 26 fields
  - "Readable" summary with fields Application, Current Approach, Futures, and sometimes pictures
  - Digest with fields Data Volume, Velocity, Variety (the big 3 V's), Software, Data Analytics
  - Set of specific requirements extracted from each use case, with (sometimes) an explicit tie to the use case
  - Links between specific requirements and general requirements
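As a sketch of the kind of post-processing a submitter-side tool might automate, the function below derives the "digest" fields from a filled-in template. The dictionary keys are assumptions standing in for the actual template field names.

```python
def make_digest(use_case: dict) -> dict:
    """Pull the big-3-V digest out of a template record (field names assumed)."""
    return {
        "Data Volume":    use_case.get("data_volume", "unspecified"),
        "Velocity":       use_case.get("data_velocity", "unspecified"),
        "Variety":        use_case.get("data_variety", "unspecified"),
        "Software":       use_case.get("software", []),
        "Data Analytics": use_case.get("analytics", "unspecified"),
    }

digest = make_digest({"data_volume": "~40 TB", "software": ["Hadoop"]})
print(digest)
```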

35 General Requirements
- These were organized into 7 categories suggested by the components of the reference architecture
- As the specific requirements and reference architecture were only completed in the last days, these could change
- Example: Transformation Provider general requirements (with the number of specific requirements each generalizes):
  - TPR-1: Needs to support diversified compute-intensive, analytic processing and machine learning techniques (38)
  - TPR-2: Needs to support batch and real-time analytic processing (7)
  - TPR-3: Needs to support processing of large, diversified data content and modeling (15)
  - TPR-4: Needs to support processing of data in motion (streaming, fetching new content, tracking, etc.) (6)
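The "(38)", "(7)" counts come from the links between specific and general requirements. A small sketch of that tally, with illustrative link records (the full report has 437):

```python
from collections import Counter

# (specific requirement, general requirement it was generalized into)
links = [
    ("compute-intensive analytic processing", "TPR-1"),
    ("machine learning at scale",             "TPR-1"),
    ("batch and real-time analytics",         "TPR-2"),
    ("streaming data in motion",              "TPR-4"),
    # ... 437 links in the full report
]

counts = Counter(general for _, general in links)
for general_id, n in sorted(counts.items()):
    print(f"{general_id}: generalizes {n} specific requirement(s)")
```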

Size of Process
- The draft use case and requirements report is 264 pages; how much goes on the web and how much into the publication?
- 35 general requirements
- 437 specific requirements: 8.6 per use case, 12.5 per general requirement
- Data Sources: 3 general, 78 specific
- Transformation: 4 general, 60 specific
- Capability: 6 general, 133 specific
- Data Consumer: 6 general, 55 specific
- Security & Privacy: 2 general, 45 specific
- Lifecycle: 9 general, 43 specific
- Other: 5 general, 23 specific
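A quick consistency check of the counts quoted on this slide (all numbers from the slide itself):

```python
general = {"Data Sources": 3, "Transformation": 4, "Capability": 6,
           "Data Consumer": 6, "Security & Privacy": 2, "Lifecycle": 9,
           "Other": 5}
specific = {"Data Sources": 78, "Transformation": 60, "Capability": 133,
            "Data Consumer": 55, "Security & Privacy": 45, "Lifecycle": 43,
            "Other": 23}

assert sum(general.values()) == 35    # category totals match the headline counts
assert sum(specific.values()) == 437

print(f"Specific requirements per use case: {437 / 51:.1f}")           # 8.6
print(f"Specific requirements per general requirement: {437 / 35:.1f}") # 12.5
```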

Significant Web Resources
- Index to all use cases: http://bigdatawg.nist.gov/usecases.php (links to individual submissions and other processed/collected information)
- List of specific requirements versus use case: http://bigdatawg.nist.gov/uc_reqs_summary.php
- List of general requirements versus architecture component: http://bigdatawg.nist.gov/uc_reqs_gen.php
- List of general requirements versus architecture component, with a record of the use cases giving each requirement: http://bigdatawg.nist.gov/uc_reqs_gen_ref.php
- List of architecture components and specific requirements, plus the use cases constraining each component: http://bigdatawg.nist.gov/uc_reqs_gen_detail.php

Next Steps
- Review and clean up the current draft material
- Request clarifications from some submitters
- Evaluate the requirements, especially with the architecture group
- See how particular use cases map into the reference architecture
- If we expect to collect more use cases, decide on a more automated process (less work by the requirements group)
- Set up a web use case upload resource