Data Science for NSF Data Science Workshop 2015 Dr. Brand Niemann Director and Senior Data Scientist/Data Journalist Semantic Community Data Science NSF.

Presentation transcript:

Data Science for NSF Data Science Workshop 2015 Dr. Brand Niemann Director and Senior Data Scientist/Data Journalist Semantic Community Data Science NSF Data Science Workshop 2015 August 24,

2

3 Semantic Community Data Science NSF Data Science Workshop 2015 Workshop Knowledge Base and Data Science Data Publication

Workshop Knowledge Base
Content: Overview; Agenda; Mentors, Observers, Ethnographers & Organizers; Posters; Team Assignments; Team Work Products; GERT PhD Program in Big Data and Data Science at the UW
Results: White Papers (Only 3 and Review Criteria Met?); Interviews (?)
Audit (See Next Slide for Details): Mine, Science, Questions, Publish
4

Data Mining - Science - Questions - Publication Process
Data Mining Process: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
Data Science Process: Data Preparation, Data Ecosystem, Data Story
Data Science Questions: How was the data collected? Where is the data stored? What are the data results? Why should we believe the data results?
Data Science Data Publication: Knowledge Base, Spreadsheet Index, Web & PDF Tables to Spreadsheet, Data Browser, Dynamically Linked Adjacent Visualizations
5

Workshop White Paper Conclusions
Genomic Data Science: Problems regarding the speed, cost and hardware that are required for analyzing and sharing the big genomic data are among the major challenges in Genomic Data Science. On the other hand, the area of genomics provides Data Science with not only great challenges but also great promises.
Big Data: From correlation to causation: For data science and big data analytics to become more useful towards examining causal relations nowadays, I argue that we need to draw on the substantial knowledge base created in the economics and social science fields over the years in order to infer interesting causal effects, as simply analyzing large amounts of data does not necessarily help us make better data-driven decisions.
Shape mapping in genome-wide association studies: Genome-wide association studies (GWAS) interrogate large-scale whole genome to characterize the complex genetic architecture for biological traits. Currently, the major focuses of GWASs are the associations between single-nucleotide polymorphisms (SNPs) and traits such as human diseases.
6

NIH Data Commons
FAIR Principles: Findable, Accessible, Interoperable, Reusable
Cloud: Data, Software, Results
Federal Science Policy: OSTP Public Access to Scientific Data Memo (February 2013)
New Program: Big Data to Knowledge (2013)
New Position: Associate Director of Data Science (2014)
Digital Enterprise (2015): Data Commons, Metadata, Open APIs, Digital Objects, Containers
Federal Big Data Working Group Meetup, August 17, 2015: A NIH – Semantic Medline Data Science Data Publication Commons
7

8 The Commons Framework is:
Discoverability: Search and Find
Open APIs: Data and Tools
Unique IDs: For Digital Research Objects
Containers: For Packaging Applications
Computing Platform: Cloud & HPC
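Two of the Commons Framework elements, Unique IDs and Discoverability, can be illustrated with a minimal sketch. Everything named here (`DigitalObject`, `CommonsRegistry`, the use of random UUIDs as identifiers) is a hypothetical illustration and not the NIH Commons design or API; it only shows the idea of giving each digital research object an ID and making it findable by a search term.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DigitalObject:
    """A hypothetical digital research object: data, software, or a result."""
    title: str
    kind: str  # e.g. "data", "software", "result"
    keywords: List[str] = field(default_factory=list)
    # Assumption: a random UUID stands in for whatever ID scheme a real Commons uses.
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class CommonsRegistry:
    """Toy in-memory registry: register objects by ID, then search and find them."""

    def __init__(self) -> None:
        self._objects: Dict[str, DigitalObject] = {}

    def register(self, obj: DigitalObject) -> str:
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id: str) -> DigitalObject:
        return self._objects[object_id]

    def search(self, term: str) -> List[DigitalObject]:
        """Case-insensitive substring match on title and keywords."""
        term = term.lower()
        return [
            obj for obj in self._objects.values()
            if term in obj.title.lower()
            or any(term in kw.lower() for kw in obj.keywords)
        ]
```

A registry like this would be filled with one entry per data set, tool, or result, so that a search such as `registry.search("genomics")` returns the matching objects together with their IDs.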

OSTP/NSF Data Science Meetup of Meetups
Week of November 2nd:
NSF Data Science/Big Data Principal Investigators (About 300)
NSF Data Hubs (4)
Organizers of Largest Data Science/Big Data Meetups (About 65)
Pipeline for Return on Investment:
PIs put their data, tools, and research results in the Data Hubs.
Data Hubs provide those data, tools, and research results to the world, but especially to the Data Science/Big Data Meetups.
Data Science/Big Data Meetups collaborate with PIs and Data Hubs to increase usage and feedback.
9

We Already Do This!
Semantic Community:
Provides a Community Sandbox that is like a GitHub, Data Hub, Data Commons, etc.: Metadata (MindTouch), Open APIs (MindTouch), Digital Objects (MindTouch), Containers (Spotfire)
Organizes the Federal Big Data Working Group Meetup
Supports Agencies and Programs in Crowdsourcing Their Data Sets
Mentors Data Scientists (Tutorials and MOOCs) and Entrepreneurs (Eastern Foundry)
Federal Big Data Working Group Meetup:
Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies;
Big Data: Supports the Federal Digital Government Strategy which is "treating all content as data", so big data = all your content;
Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products; and
Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like MOOCs (Massive Open Online Courses) now embraced by the White House.
10

The Journey to Data and Meetup 1
Since the three white papers from the NSF Data Science Workshop did not describe any actual work with data sets, I decided to use their content to find a data set. The first reference in the first white paper was GenBank. When I shared that, I got a response from a member of the OSTP/NSF Data Science Meetup of Meetups planning team who works directly with it:
"I do a lot of genomics data stuff (I work for NCBI, which is the largest genomic database in the world [we make GenBank, which is the first citation in the genomics data challenge summary]). I think I might be able to help focus the genomics data challenge a bit more."
11

The Journey to Data and Meetup 2
I responded: I looked at: And found: And wondered where it would be good to start? This is like a Data Commons that Vivien Bonazzi talked about at our last meetup: A NIH – Semantic Medline Data Science Data Publication Commons (Click See All). I could build a searchable index in a spreadsheet and Spotfire with your guidance, like I have done for other NIH data sets.
12

The Journey to Data and Meetup 3
He responded: We also have bigger databases (in terms of data size) like SRA, dbGaP and GEO. Here’s a third party attempt at normalizing the SRA metadata: We also provide a run selector tool for visualization in SRA, if you go to the send to menu. We’ve also done some hackathoning with such data. To come full circle, the RNA_mapping repo here: may be the preamble for a collaboration with the NIH Data Science Data Commons.
13

14 My Data Mining Notes in Notepad That Helped Structure What Follows Next

15 The majority of NCBI data are available for downloading, either directly from the NCBI FTP site or by using software tools to download custom datasets.

NCBI Download: FTP and Aspera 16

17

18 For downloading purposes, please keep in mind that the uncompressed GenBank flatfiles are approximately 735GB (sequence files only); the ASN.1 data are approximately 600GB.
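To put those sizes in perspective, here is a rough back-of-the-envelope sketch of how long such a download might take. The link speeds are assumptions chosen for illustration, GB is taken as 10^9 bytes, and protocol overhead and server-side limits are ignored, so real FTP or Aspera transfers will differ.

```python
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Idealized hours needed to move size_gb gigabytes over a link_mbps link.

    Assumes decimal units (1 GB = 10**9 bytes, 1 Mbps = 10**6 bits per second)
    and a fully saturated link with no protocol overhead.
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / 3600


# Sizes quoted on the slide; the 100 Mbps and 1000 Mbps links are hypothetical.
for label, size_gb in [("GenBank flatfiles", 735), ("ASN.1 data", 600)]:
    for mbps in (100, 1000):
        hours = transfer_hours(size_gb, mbps)
        print(f"{label}: {size_gb} GB at {mbps} Mbps is about {hours:.1f} hours")
```

Under these assumptions the 735 GB of flatfiles need roughly 16 hours on a 100 Mbps link, before any time spent decompressing or parsing.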

19

20

21 This is very complicated big data that requires subject matter expertise and big data science expertise and tools. Is there another way? Yes, and I found it by chance!

22 Somehow I found this page, which has links to Web Site, Table 1, and Supplementary Data. "We believe that our database will contribute to the future establishment of personalized medicine and increase our understanding of genetic factors underlying diseases." So can SemMed generate such a catalog!

23

,758 Records and 19 MB!

25 all_gwas_snp.csv
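A first look at a file such as all_gwas_snp.csv can be sketched as below: report the column names, the record count, and the size on disk. This is a minimal sketch, not the workflow used in the slides; the function name `summarize_csv`, the UTF-8 encoding, and the presence of a header row are all assumptions, and no column names are assumed because the slides do not list them.

```python
import csv
import os
from typing import Dict, Union


def summarize_csv(path: str) -> Dict[str, Union[list, int, float]]:
    """Return the header, number of data records, and size in MB of a CSV file.

    Assumes the first row is a header and the file is UTF-8 encoded.
    """
    size_mb = os.path.getsize(path) / 1_000_000  # decimal megabytes
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # Count non-empty rows after the header.
        records = sum(1 for row in reader if row)
    return {"columns": header, "records": records, "size_mb": size_mb}


if __name__ == "__main__":
    # The path is an assumption; point it at a local copy of the file.
    path = "all_gwas_snp.csv"
    if os.path.exists(path):
        print(summarize_csv(path))
```

The resulting column list is a natural starting point for a spreadsheet index, since each column can become a searchable or filterable field.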

26

VaDE Supplementary Data 27 Next Slide

28

29 These results reveal a novel function of Maitake beta-glucan that enhances the granulopoiesis and mobilization of granulocytes and their progenitors by stimulating G-CSF production. This finding presents opportunities to develop new therapeutic strategies against the immunosuppression caused by chemotherapies in cancer patients. We had beta-Glucan results from Data Science and Semantic Medline at our August 17th Meetups!

30 NSFNIHGeonomic.xlsx

31 Web Player

Conclusions and Recommendations
There are at least three phases and products in genomic data science:
Raw data to a Commons like GenBank with data, software, and results.
Distilled associations for personalized medicine like VarySysDB Disease Edition (VaDE).
Data Science Data Publications for students, researchers, medical doctors, data scientists, and the public.
Tasks in process:
Build a Knowledge Base with searchable index in a wiki, spreadsheet and Spotfire, like I have done for other NIH data sets, that follows the Commons Framework: Discoverability: Search and Find; Open APIs: Data and Tools; Unique IDs: for Digital Research Objects; Containers: For Packaging Applications; and Computing Platform: Cloud & HPC. See Slide 10 for Mapping Our Commons to the NIH Commons.
Build data science data publications of the multiple content types and formats.
Submit this as a White Paper for the NSF Data Science Workshop and Federal Big Data Working Group Meetup.
32