A Primer for Data Methodology in the Cloud: Making Data Governance Work in Hybrid Environments. Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community.


1 A Primer for Data Methodology in the Cloud: Making Data Governance Work in Hybrid Environments. Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community. July 28, 2011

2 Webinar Description
Establishing a foundation for data governance has never been more critical as federal agencies face growing data center consolidation pressures. Many agencies are following the IT trend of breaking their problems into smaller pieces to make a complex problem more solvable. Your agency may be planning to send "some data" and "some applications" to the cloud, but do you have a methodology for optimizing your data once it's spread across a hybrid environment? Join us to learn what you need to do to lay the groundwork for a good data governance program to support your agency's consolidation goals:
– Create views and models of your architecture
– Maintain clear definitions of data, involved applications/systems, and process flows
– Leverage metadata for data governance processes
– Clearly define the integration and interfaces among the various platform tools, and between platform tools and other repositories and vendor tools

3 Speakers
Moderator: Michael Smoyer, President, Digital Government Institute. The moderator will introduce speakers and coordinate logistics and Q&A with the "virtual" attendees.
David Lyle, VP Product Strategy, Office of the CTO, Informatica. Co-author of "Lean Integration: An Integration Factory Approach to Business Agility."
Brand Niemann, Director and Senior Data Scientist, Semantic Community. Author of over 50 Data Science Products in the Cloud for the US EPA and Data.gov.

4 David Lyle
He has co-authored two books; his latest, "Lean Integration: An Integration Factory Approach to Business Agility," was published just last year by Addison-Wesley. The book shows how "Lean" and "Agile" thinking can be applied to information management projects because they all follow a relatively small number of repeating patterns, and taking an assembly-line approach to dealing with these patterns delivers information to the business far faster, with less risk and at lower cost than traditional approaches.
He spoke at DGI's EA Conference about how the acceleration in volumes of data, as well as the acceleration in technological options (cloud, appliances, SOA, etc.), makes this problem (called the "integration hairball" in the book) even worse. With Lean principles (focus on the customer, eliminate waste in processes from the customer's perspective, and use technology to manage this complexity more efficiently), we have a fighting chance, not to make the simple tasks mundane, but to make the seemingly impossible tasks manageable. The goal is to create a better IT world where the "customer/citizen" can self-serve (when appropriate), yet give IT the visibility, oversight, and governance of what the "customer/citizen" is up to.
http://www.linkedin.com/in/davelyle

5 Brand Niemann
Dr. Brand Niemann is the Director and Senior Data Scientist of the Semantic Community. He is the former Senior Enterprise Architect and Data Scientist at the U.S. Environmental Protection Agency and co-led the Federal CIO Council's Semantic Interoperability Community of Practice (SICOP) with Mills Davis from 2003 to 2008. He is currently authoring a series of editorials for Federal Computer Week on his work and recently made Spotfire's Twitter list for his cool visualizations of government data to produce more transparent, open, and collaborative business analytics applications.
– http://semanticommunity.info/A_Gov_2.0_spin_on_archiving_2.0_data
– http://spotfireblog.tibco.com/?p=5328
He is working as a data journalist for AOL Government, due to launch July 11th.
– http://semanticommunity.info/AOL_Government
He is also helping organize the 12th SOA for eGov Conference, October 11th.
– http://semanticommunity.info/Federal_SOA

6 Preface
Thank you for the opportunity to present. Primer (basic), Methodology (real-world example), and Cloud (tools I used).
Real-world example: the EPA Apps for the Environment Challenge, a good place to start and learn, since agency data governance is already in place to build on!
Some metrics: about 50 data products, over 100 Spotfire visualizations, nine data stories for Federal Computer Week this year, and 15 for AOL Government. Google "AOL Government Brand Niemann" to see the three that have been published since the July 13th launch.

7 Overview
Data Center Consolidation Initiative: send agency data to Data.gov and to the Cloud, and close data centers.
– My solution was and is: Put My EPA Desktop in the Cloud in Support of the Open Government Directive and a Data.gov/Semantic
– Published a paper April 19, 2010
Data governance program to support your agency's consolidation goals. My solution was and is:
– Create views and models of your architecture
– Maintain clear definitions of data, involved applications/systems, and process flows
– Leverage metadata for data governance processes
– Clearly define the integration and interfaces among the various platform tools, and between platform tools and other repositories and vendor tools
– Using the EPA Apps for the Environment Challenge

8 EPA Apps for the Environment Challenge
Applications for the challenge must use EPA data and be accessible via the Web or a mobile device. EPA experts will select a winner and runner-up in each of two categories: Best Overall App and Best Student App. In addition, the public will vote for a "People's Choice" winner. Apps will be judged on their usefulness, innovation, and ability to address one or more of EPA Administrator Lisa P. Jackson's seven priorities for EPA's future. Winners will receive recognition from EPA on the agency's website and at an event in Washington, DC in the fall, where they can present their apps to senior EPA officials and other interested parties.
Source: http://www.epa.gov/appsfortheenvironment/

9 EPA Apps for the Environment Challenge
EPA challenges you to find new ways to combine and deliver environmental data in a new app. In the Apps for the Environment challenge, you have free rein to make an app that uses EPA data, addresses one of Administrator Lisa Jackson's Seven Priorities, and is useful to communities or individuals. EPA encourages you to use other environmental and health data too. The winners will be honored at a recognition event in Washington, D.C. this fall, and the winning apps will be publicized on EPA's website.
Source: http://www.epa.gov/appsfortheenvironment/

10 Create views and models of your architecture
http://semanticommunity.info/AOL_Government/EPA_Announces_Apps_for_the_Environment_Challenge#Apps_for_the_Environment
Unstructured-to-structured information view and model. Supports the Sitemaps.org protocol and Schema.org.

11 Maintain clear definitions of data, involved applications/systems and process flows
http://semanticommunity.info/@api/deki/files/13015/=EPAApps.xlsx
Data set inventory and data element dictionary. Work flow for Phases I (Preparation) and II (Applications).

12 Leverage metadata for data governance processes
http://semanticommunity.info/EPA/EPA_Toxic_Release_Inventory_2009#Record_Layout
The EPA TRI 2009 has 99 data elements defined in a 30-page PDF file that was exposed here with well-defined URLs (getting to the Five Stars of Linked Open Data).
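The TRI record layout illustrates a simple but powerful idea: every data element in the dictionary gets its own well-defined, linkable URL. A minimal sketch of that pattern in Python, assuming a wiki-style page with one anchor per element; the element names and the slug rule here are illustrative, not EPA's actual ones:

```python
# Sketch (assumption): give each data element in a flat dictionary a
# stable, addressable URL fragment, as done for the TRI 2009 layout.
# The base URL comes from the slide; everything else is illustrative.

BASE = "http://semanticommunity.info/EPA/EPA_Toxic_Release_Inventory_2009#"

def element_url(name: str) -> str:
    """Build a linkable anchor URL for one data element."""
    slug = name.strip().replace(" ", "_")
    return BASE + slug

elements = ["TRI Facility ID", "Chemical Name", "Total Releases"]
urls = [element_url(e) for e in elements]
print(urls[0])
# http://semanticommunity.info/EPA/EPA_Toxic_Release_Inventory_2009#TRI_Facility_ID
```

Once each of the 99 elements has a URL like this, other data sets and visualizations can cite the definition directly instead of pointing at a 30-page PDF.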

13 And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools
PC Desktop Spotfire: the data sets, data dictionaries, and links to data sources and metadata are integrated here.

14 And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools
Spotfire Web Player. Phase I identifies data quality issues: the Guam Brownfields site is obviously mis-located (see the outlier at the extreme right in the scatter plot below). It should be a negative longitude and have a larger value.
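The Guam outlier is the kind of check that can be automated early in Phase I rather than spotted by eye. A minimal Python sketch of such a longitude sanity check; the site names and coordinates below are invented for illustration, and only the Guam example comes from the slide:

```python
# A minimal sketch of the Phase I data quality check described above:
# flag records whose longitude falls outside the window covered by the
# rest of the data set. For a mostly-CONUS data set longitudes are
# negative (west of Greenwich), so a large positive value stands out.

def flag_longitude_outliers(records, lo=-180.0, hi=0.0):
    """Return records whose longitude is outside the expected window."""
    return [r for r in records if not (lo <= r["lon"] < hi)]

sites = [
    {"name": "Site A", "lon": -77.0},
    {"name": "Site B", "lon": -122.3},
    {"name": "Guam Brownfields", "lon": 144.8},  # the outlier in the scatter plot
]
print(flag_longitude_outliers(sites))
```

A check like this runs against the data set inventory before the data ever reaches a visualization tool, so the fix happens at the source.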

15 And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools
Socrata at Data.gov

16 And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools
Smart Mapping: Automatic Creation of Information Models:
– Spotfire 3.3 Information Services users can automatically generate 1-to-1 mappings of the existing tables and columns in their Data Sources. Just generate a Data Source in Spotfire, then right-click it and select "Create Default Information Model…" This helps a lot when the work has already been done to nicely model and expose tables for business applications such as Spotfire, so the mapping step is more about transparency than transformation. For example, if you use Spotfire Application Data Services, you do the work in ADS to expose Spotfire-ready tables and columns, so a simple transparent mapping of those elements through Spotfire Information Services can now be accomplished in one click. Note that the automated creation will work through nested levels of data objects in the data source you supply.
– The result is a folder structure that matches the catalogs, schemas, etc. that were selected, with a column element for each column and an information link for each table containing those column elements. Procedures get a procedure element and an information link of their own if they return data.
– See next slide.
http://semanticommunity.info/@api/deki/files/10975/=Whats_New_in_Spotfire_3.3.pdf

17 And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools

18 And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools
Semantic Community Workflow:
– Information Architecture of Public Web Pages in Spreadsheets as Linked Open Data.
– Public Reports (Web and PDF) in Wiki as Linked Open Data.
– Desktop and Network Databases in Wiki and Spreadsheets in Linked Open Data Format.
– Spreadsheets in Spotfire as Linked Open Data.
– Spreadsheets in Semantic Insights Research Assistant for Semantic Search, Report Writing, and Ontology Development.

19 Questions and Answers
Now and later:
– Brand Niemann
– Director and Senior Data Scientist
– Semantic Community
– http://semanticommunity.info
– bniemann@cox.net

20 Supplemental Slides
7.1 Semantic Technology Training: Building Knowledge-Centric Systems
– KM 2011
– SemTech 2011
7.2 W3C Government Linked Data Working Group
– Clinical Quality Linked Data on Health.data.gov
– Build Clinical Quality Linked Data on Health.data.gov in the Cloud
– Hospital Compare Downloadable Database Example of "5 Star Government Data"
7.3 Library of Congress Project Recollection and Digital Preservation Initiative
7.4 Elsevier/Tetherless World Health and Life Sciences Hackathon (27-28 June 2011)
– Build TWC in the Cloud
– Build NCI CLASS in the Cloud
– Build the NYC Data Mine Health in the Cloud
– Build SciVerse Apps in the Cloud (IN PROCESS)
7.5 Be Informed (IN PROCESS)

21 7.1 Semantic Technology Training: Building Knowledge-Centric Systems http://semanticommunity.info/FOSE_Institute/Knowledge_Management

22 7.1 Semantic Technology Training: Building Knowledge-Centric Systems http://semanticommunity.info/Semantic_Technology_Conferences

23 7.2 W3C Government Linked Data Working Group
The mission of the Government Linked Data (GLD) Working Group is to provide standards and other information that help governments around the world publish their data as effective and usable Linked Data using Semantic Web technologies. This group will develop standards-track documents and maintain a community website in order to help governments at all levels (from small towns to nations) share their data as high-quality ("five-star") linked data. The Working Group will:
– construct and maintain an online directory of the government linked data community
– maintain a "Cookbook" advice site
– produce Best Practices for Publishing Linked Data
– develop Standard Vocabularies
First face-to-face meeting: June 29-30th, NSF, Arlington, VA.
http://www.w3.org/2011/gld/charter

24 7.2 Open Public Dataset Catalogs Faceted Browser http://semanticommunity.info/Data.gov/An_Open_Data_Public_Dataset_Catalogs_Faceted_Browser

25 7.2 Linked Data Cookbook
Linked Data is an evolving set of techniques for publishing and consuming data on the Web. Learn how Linked Data can turn the Web into a distributed database and how you can participate. In this session, Bernadette Hyland takes the mystery out of Linked Data by summarizing seven steps to prepare your data sets as Linked Data and announce them so others will use them.
– Model without context: there is a process: Identify, Model, Name, Describe, Convert, Publish, and Maintain. I Disagree!
Participants will understand the actual steps to produce high-quality, useful data sets that can be modeled, transformed, documented, and made available on the Linked Data cloud. We'll discuss a recent government agency that did just this in less than 12 weeks. Best practices for data publishing, as well as the "social contract" one makes as a publisher, will be discussed.
– Better to make progress with something than to do nothing because we cannot be comprehensive and complete. I Disagree!
Bernadette oversees strategy for Talis' North American clients. She brings a strong background in commercial and government data management strategies, coupled with expertise in leading high-growth software organizations. Prior to joining Talis, Bernadette was CEO of several profitable Internet companies delivering scalable Web-based solutions for the enterprise, including Zepheira LLC and Tucana Technologies Inc., a pioneer in the emerging semantic technology community.
http://semtech2011.semanticweb.com/sessionPop.cfm?confid=62&proposalid=3822

26 7.2 Linked Data Cookbook
1. Leverage what exists.
– Obtain data extracts (i.e., databases and/or spreadsheets) or create data in a way that can be replicated.
2. Model data without context to allow for reuse and easier merging of data sets.
– With LD, application logic does not drive the data schema, concepts, etc.
3. Look for real-world objects of interest (e.g., people, places, things, locations, etc.) and model them.
– Use common sense to decide whether or not to make a link. I Disagree!
4. Connect data from different sources and authoritative vocabularies (see list of popular vocabularies below).
– Put aside immediate needs of any application. I Disagree!
– Don't think about how an application will use your data. I Disagree!
5. Write a script or process to convert the data set repeatedly.
6. Publish to the Web and announce it! (more details shortly)
7. Maintenance strategy (more details in the social contract at the end).
http://www.slideshare.net/bhylandwood/bernadette-hyland-semtech-2011-west-linked-data-cookbook
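Steps 5 and 6 of the cookbook (write a repeatable conversion script, then publish) can be sketched without any RDF library by emitting N-Triples text directly. The base URI, vocabulary terms, and row data below are illustrative assumptions, not part of the cookbook itself:

```python
# Sketch of a repeatable conversion step: turn one record of a tabular
# extract into N-Triples lines. Rerunning the script on a fresh extract
# regenerates the whole graph, which is what "convert the data set
# repeatedly" asks for.

def row_to_ntriples(base, row_id, row):
    """Convert one record (a dict of field -> value) into N-Triples lines."""
    subject = f"<{base}/resource/{row_id}>"
    lines = []
    for field, value in row.items():
        predicate = f"<{base}/def/{field}>"
        lines.append(f'{subject} {predicate} "{value}" .')
    return lines

triples = row_to_ntriples(
    "http://example.gov", "hospital-393303",
    {"name": "General Hospital", "state": "PA"},
)
for t in triples:
    print(t)
```

The output is plain text that any RDF store can load, so publishing (step 6) is just putting the file behind the URIs it mentions.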

27 7.2 Linked Data Cookbook
Guidelines for merging:
– URIs name the resources we are describing.
– Two people using the same URI are describing the same thing.
– The same URI in two datasets means the same thing.
– Graphs from several different sources can be merged.
– Resources with the same URI are considered identical.
– No limitations on which graphs can be merged.
For a government agency, a data policy is "a must":
– Specify data quality and retention, treatment of data through secondary sources, restrictions on use, frequency of updates, public participation, and applicability of the data policy. I Agree!
http://www.slideshare.net/bhylandwood/bernadette-hyland-semtech-2011-west-linked-data-cookbook
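Because the same URI always denotes the same resource, merging Linked Data graphs reduces to set union over (subject, predicate, object) triples: no schema negotiation, no join keys. A minimal sketch, with illustrative URIs:

```python
# The merge rules above ("the same URI in two datasets means the same
# thing; graphs from several sources can be merged") become simple set
# union when each graph is a set of (subject, predicate, object) tuples.

graph_a = {
    ("<http://example.gov/h/393303>", "<http://example.gov/def/name>", '"General Hospital"'),
    ("<http://example.gov/h/393303>", "<http://example.gov/def/state>", '"PA"'),
}
graph_b = {
    # Same subject URI, so this statement describes the same hospital.
    ("<http://example.gov/h/393303>", "<http://example.gov/def/beds>", '"250"'),
}

merged = graph_a | graph_b          # no limitation on which graphs merge
subjects = {s for s, _, _ in merged}
print(len(merged), len(subjects))   # three statements about one resource
```

This is why the cookbook insists on naming things with URIs first: once two publishers agree on the URI, their data merges for free.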

28 7.2 Linked Data Cookbook http://www.slideshare.net/bhylandwood/bernadette-hyland-semtech-2011-west-linked-data-cookbook

29 7.2 Clinical Quality Linked Data on Health.data.gov http://www.data.gov/communities/node/81/blogs/4920 (see next slide)

30 7.2 Clinical Quality Linked Data on Health.data.gov http://health.data.gov/def/hospital/Hospital

31 7.2 Clinical Quality Linked Data on Health.data.gov http://health.data.gov/doc/hospital/393303.csv

32 7.2 Clinical Quality Linked Data on Health.data.gov http://www.slideshare.net/george.thomas.name/clinical-quality-linked-data-on-healthdatagov

33 7.2 Health data innovation 'at a crawl'
The health care data community should step up its efforts to innovate to help improve the nation's health outcomes and reduce costs, Health and Human Services Secretary Kathleen Sebelius said at the department's second Health Data Initiative Forum on June 9. "Use tools and use data," Sebelius said at the forum, held at the National Institute of Medicine campus in Bethesda, Md. "Do it more, do it better and do it faster." Sebelius said Americans experience a "triple loss" due to having the highest public health care costs, the highest private health care costs, and only mediocre health outcomes. The goal of the conference was to present 45 winning health care IT applications developed with HHS's newly available data sets within the last several months. HHS CTO Todd Park called the event a "Health Data Palooza" that would showcase innovation in health IT.
– PearlDiver Technologies Inc. and Semantic Community were among the finalists!
http://fcw.com/articles/2011/06/09/nation-needs-more-health-data-innovation-sebelius-says-at-forum.aspx

34 PearlDiver Data Engine & Semantic Community Data Visualization
Benjamin Young, PearlDiver Technologies Inc.; Brand Niemann, Semantic Community
Health Data Initiative Forum Submission: Medicare Zombie Hunter

35 7.2 Build Clinical Quality Linked Data on Health.data.gov in the Cloud http://semanticommunity.info/Semantic_Technology_Conferences/Clinical_Quality_Linked_Data_on_Health.data.gov

36 7.2 Build Clinical Quality Linked Data on Health.data.gov in the Cloud http://semanticommunity.info/Semantic_Technology_Conferences/Clinical_Quality_Linked_Data_on_Health.data.gov/Hospital_Compare_Downloadable_Database_Metadata

37 7.2 Build Clinical Quality Linked Data on Health.data.gov in the Cloud (PC Desktop Spotfire)

38 7.2 Build Clinical Quality Linked Data on Health.data.gov in the Cloud (Spotfire Web Player)

39 7.3 Library of Congress Project Recollection and Digital Preservation Initiative
The Library of Congress and MIT are developing a Semantic Web browser (Exhibit, and now Exhibit 3) to do essentially what Spotfire already does!

40 7.3 Library of Congress Project Recollection and Digital Preservation Initiative (PC Desktop Spotfire)

41 7.3 Library of Congress Project Recollection and Digital Preservation Initiative http://semanticommunity.info/Semantic_Technology_Conferences/Library_of_Congress

42 7.3 Library of Congress Project Recollection and Digital Preservation Initiative (Spotfire Web Player: Interoperability Interface!)

43 7.4 Elsevier/Tetherless World Health and Life Sciences Hackathon (27-28 June 2011) http://semanticommunity.info/Build_TWC_in_the_Cloud

44 7.4 NYC Data Web http://knoodl.com/ui/groups/NYC_Homepage

45 7.4 NYC Data Web http://semanticommunity.info/Semantic_Technology_Conferences/NY_Data_Mine/Revelytix
Quote: "Ontology architecture is a new aspect of system architecture and development; to our knowledge it has not been employed anywhere else in DOD."

46 7.4 NYC Data Web http://semanticommunity.info/Semantic_Technology_Conferences/NY_Data_Mine/Revelytix#Dashboard

47 7.4 NYC Data Web (PC Desktop Spotfire)

48 7.4 NYC Data Web (PC Desktop Spotfire)

49 7.5 Be Informed
A recent paper describes the formalism and rationale that Be Informed applies to business process modeling. It explains how and why goal-oriented modeling differs from more conventional business process modeling, which is procedural. In the near term, there is applicability for many government agencies, especially those exploring semantic approaches. For example, Dennis Wisnosky advocates Semantic Web (RDF and OWL) standards for modeling data integration, and a dialect of BPMN for modeling processes. The metaphor for processes is an electronic circuit specification that uses standard building blocks: "We all know what those primitives mean." Previous, costly attempts at business process modeling were failures in part because there was no standard at the primitive level. However, as this paper makes clear, just having unambiguous primitives is only part of what is needed to specify and manage complex and dynamic business processes. Modeling flow in swim lanes is less agile than modeling goals, activities, and pre- and post-conditions.
Source: Mills Davis, Project10x, July 5, 2011.
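The contrast drawn here can be made concrete with a toy engine: instead of a fixed swim-lane sequence, each activity declares pre- and post-conditions over a state, and the engine runs whatever is enabled until the goal holds. The activity names and conditions below are invented for illustration; Be Informed's actual formalism is far richer than this sketch:

```python
# A hedged sketch of goal-oriented process execution: activities fire
# when their preconditions hold, and ordering emerges from the
# conditions rather than from a drawn flow.

activities = [
    {"name": "receive_application", "pre": set(), "post": {"application_received"}},
    {"name": "verify_identity", "pre": {"application_received"}, "post": {"identity_verified"}},
    {"name": "grant_permit", "pre": {"identity_verified"}, "post": {"permit_granted"}},
]

def run(activities, goal):
    """Execute enabled activities until every goal condition holds."""
    state, trace = set(), []
    while goal - state:
        enabled = [a for a in activities
                   if a["pre"] <= state and not (a["post"] <= state)]
        if not enabled:
            raise RuntimeError("goal unreachable from current state")
        act = enabled[0]        # any enabled activity may fire
        state |= act["post"]
        trace.append(act["name"])
    return trace

print(run(activities, {"permit_granted"}))
```

Adding or reordering activities needs no rewiring of a diagram; an activity becomes runnable the moment its preconditions are satisfied, which is the flexibility the paper attributes to this style.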

50 7.5 Be Informed
Source: Specifying Flexible Business Processes using Pre and Post Conditions, Jeroen van Grondelle and Menno Gulpers, Be Informed BV, Apeldoorn, The Netherlands, 13 pp. Fig. 1: Summary of the Meta Model for Capturing Business Processes.

