Semantic Search for NSF Decision Making
Dr. Brand Niemann
Director and Senior Enterprise Architect – Data Scientist, Semantic Community
AOL Government Blogger
April 4,
Overview
– Background
– NITRD Dashboards
– Data.gov Developer Community
– Research.gov Dashboard
– Semantic MedLine
– Some Next Steps
Background
My role has been to implement high-level direction: first at EPA, as their Senior Enterprise Architect and Data Scientist and as lead for several Federal CIO Council activities, and, since leaving government, as Director and Senior Enterprise Architect – Data Scientist of Semantic Community. That direction has been as follows:
Background
Teri Takai (DoD CIO) - Harvard Leadership for a Networked World, Lead Practitioner. I am an Invited Practitioner who mentors students under her direction.
– Social Business Intelligence from Open Government Data
Letitia Long, Director of the National Geospatial-Intelligence Agency. I am the lead for the pilot demonstration for the NCOIC-NGA CRADA at the upcoming 13th SOA for eGov Conference, April 3rd.
– A Quint – Cross Information Sharing and Integration for the Intelligence Community
– Demonstration at the 13th SOA for E-Government Conference, April 3, 2012, at MITRE
Donna Roy, Executive Director of NIEM. She requested that I provide suggestions and demonstrations for evolving NIEM, which I have done twice.
– A Plan for Scaling NIEM to Big Data
– Build The NIEM Information Exchange Clearinghouse In The Cloud
Gus Hunt, CIA CTO. He challenged me to show how to make the CIA World Fact Book more semantic and to work with Digital Reasoning.
– CIA World Fact Book
– Digital Reasoning
Background
Sonny Bhagowalia, David McClure, and Jeanne Holm (Data.gov Program Executive, GSA Associate Administrator, and Data.gov Evangelist, respectively) challenged me to do data science for Data.gov.
– Data.gov
– Data.gov Developers Community Space Launched
Wyatt Kash, Editor in Chief for AOL Government, challenged me to build Shared Services like those Federal CIO Steven VanRoekel is asking for.
– Federal IT Dashboard in Motion and In Memory
Dennis Wisnosky, DoD CTO, and Walt Okon, DoD Senior Architect Engineer, challenged me to Build DoD in the Cloud and Federate It with Other DoD and non-DoD Architectures (e.g. TOGAF).
– Build DoD in the Cloud and Build TOGAF in the Cloud
– Enterprise Information Web for Semantic Interoperability at DoD
Dr. George Strawn, Director of the NSF NITRD and White House OSTP Staff to the CTO (Aneesh Chopra and Todd Park), challenged me to do data science dashboards.
– A NITRD Dashboard (March and April 2011)
– SIRA for Semantic Search (August 10, 2011)
– A Research.gov Dashboard (March 2012)
– Semantic MedLine (In process)
NITRD Dashboards
Note: Also see Build the NITRD Dashboard in the Cloud and Build the R&D Dashboard in the Cloud.
Data.gov Developer Community
Play the role of a data scientist from an agency, use a platform that supports the things below, and build an app that provides semantic search for NSF abstracts so that decision makers can identify future scientific research needs.
My distilled suggestions for the recent excellent Data.gov meeting are:
– Add a data scientist to the Data.gov team to lead a new community of data scientists from the agencies and non-government organizations.
– Ensure that the new data.gov platform supports the sitemap and schema protocols with well-defined URLs for content, faceted search, and big data in memory.
– Encourage the new developer community to build their own data.gov sites and become both publishers and consumers of data, in support of the new data scientist community above.
Note: Invited by Jeanne Holm, Data.gov Evangelist, to give a presentation at the end of April.
Research.gov Dashboard
Build an app that provides semantic search for NSF abstracts that allows decision makers to identify future scientific research needs.
Created a 176 MB Excel file (60,981 rows by 44 columns) for the Spotfire Dashboard.
– Get 2011 data from state tables?
Tried to extract text for Semantic Search with SIRA and Digital Reasoning, but found that the Abstract text is cut off and that URLs are embedded in the Publications and Project Outcomes columns.
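The two data problems noted above (cut-off Abstract text and URLs embedded in the Publications and Project Outcomes columns) could be screened for with a short script before any text is sent to a semantic search tool. The following is only a minimal sketch: the sample rows, the dictionary keys, and the "does not end in closing punctuation" heuristic are illustrative assumptions, not the actual Research.gov export layout.

```python
import re

# Hypothetical sample rows standing in for the Research.gov export.
# The keys ("award_id", "abstract", "publications") are assumed names.
rows = [
    {"award_id": "A1",
     "abstract": "This project studies coastal erosion.",
     "publications": "Smith J. Erosion models. http://example.org/paper1"},
    {"award_id": "A2",
     "abstract": "The investigators will develop new meth",
     "publications": ""},
]

# Matches http/https URLs up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")
CLOSING_PUNCTUATION = (".", "!", "?", ")", '"')


def looks_truncated(text):
    """Heuristic: non-empty text that does not end in closing punctuation."""
    stripped = text.strip()
    return bool(stripped) and not stripped.endswith(CLOSING_PUNCTUATION)


def split_urls(text):
    """Return (text with URLs removed, list of URLs found)."""
    urls = URL_PATTERN.findall(text)
    clean = " ".join(URL_PATTERN.sub(" ", text).split())
    return clean, urls


report = []
for row in rows:
    clean_pubs, pub_urls = split_urls(row["publications"])
    report.append({
        "award_id": row["award_id"],
        "abstract_truncated": looks_truncated(row["abstract"]),
        "publications_text": clean_pubs,
        "publication_urls": pub_urls,
    })

for entry in report:
    print(entry)
```

A heuristic like this only flags candidate rows for review; it cannot recover the missing text, which still has to come from the raw source.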
Research.gov Spending & Results
Download Data Sets
Research.gov Dashboard
Sample of Hand Parsed Text
Note: We will need to get the raw text data to accomplish the objectives of this work.
Semantic MedLine Prototype: Home
Semantic MEDLINE is a prototype Web application that summarizes MEDLINE citations returned by a PubMed search. Natural language processing is used to analyze salient content in titles and abstracts. This information is then presented in a graph that has links to the MEDLINE text processed.
Currently, the results from 35 PubMed searches (including a variety of disorders and drugs) are available to be processed. The 500 most recent citations (from the date of the search) are available for further processing by Semantic MEDLINE.
Begin at the Search tab by selecting a search; then move to the Summarize tab. Choose a summary type to specify the point of view of the summary (Treatment of Disease, Substance Interactions, Diagnosis, or Pharmacogenomics). After selecting the topic of the summary, click the Summarize and Visualize button. The graph appears below. Right click on an edge to display a MEDLINE citation.
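The idea of a summary graph whose edges link back to citations can be illustrated with a small sketch. Everything here is assumed for illustration: the subject-predicate-object tuples, the placeholder citation IDs, and the mapping from summary type to predicate names are hypothetical and do not reproduce how Semantic MEDLINE itself is implemented.

```python
from collections import defaultdict

# Hypothetical extracted statements: (subject, predicate, object, citation id).
# The citation ids are placeholders, not real MEDLINE identifiers.
predications = [
    ("Aspirin", "TREATS", "Headache", "CIT-0001"),
    ("Aspirin", "TREATS", "Headache", "CIT-0002"),
    ("Aspirin", "INTERACTS_WITH", "Warfarin", "CIT-0003"),
    ("Ibuprofen", "TREATS", "Fever", "CIT-0004"),
]

# Assumed mapping from a summary type (point of view) to the predicates kept.
SUMMARY_PREDICATES = {
    "Treatment of Disease": {"TREATS"},
    "Substance Interactions": {"INTERACTS_WITH"},
}


def build_graph(predications, summary_type):
    """Group citations by edge, keeping only predicates for the summary type."""
    wanted = SUMMARY_PREDICATES[summary_type]
    edges = defaultdict(list)
    for subject, predicate, obj, citation in predications:
        if predicate in wanted:
            edges[(subject, predicate, obj)].append(citation)
    return dict(edges)


graph = build_graph(predications, "Treatment of Disease")
for (subject, predicate, obj), citations in graph.items():
    # Each edge carries the citations that support it, which is what
    # "right click on an edge to display a citation" relies on.
    print(f"{subject} -{predicate}-> {obj}: {citations}")
```

Keeping the citation list on each edge is the design point: the graph stays a summary, but every edge can be traced back to the text that produced it.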
Semantic MedLine Prototype: Search
Semantic MedLine Prototype: Summarize
Semantic MedLine
Semantic MedLine Prototype: Knowledgebase
Semantic MedLine: Predication Database
ftp://lhcftp.nlm.nih.gov/outgoing/cgsb/
Note: Large Tar and GZIP files!
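Because the files are large, one option is to read them as a stream instead of unpacking everything to disk first. The sketch below uses only the Python standard library and builds a tiny demo archive so it can run on its own; the archive name, member name, and contents are made up, and the actual layout of the predication files is not assumed.

```python
import io
import os
import tarfile
import tempfile


def iter_lines(archive_path):
    """Yield (member name, line) pairs from a .tar.gz without extracting it."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and links
            raw = tar.extractfile(member)
            with io.TextIOWrapper(raw, encoding="utf-8", errors="replace") as text:
                for line in text:
                    yield member.name, line.rstrip("\n")


# Build a small demo archive (hypothetical content) and stream it back.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "demo.tar.gz")
    payload = b"row one\nrow two\n"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="demo.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    lines = list(iter_lines(archive))

for name, line in lines:
    print(name, line)
```

Processing one line at a time keeps memory use flat regardless of archive size, at the cost of a single sequential pass.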
Semantic MedLine: Data Extraction
Semantic MedLine: Analytics
Web Player
I have questions based on these analytics.
Semantic MedLine: Analytics
Web Player
Semantic MedLine: Analytics
Web Player
Some Next Steps
– We will need to get the raw text data to accomplish the objectives of the work with the Research.gov Abstracts, Project Outcomes, etc.
– We need to extract the large Semantic MedLine Predication Database files for Semantic Search with SIRA and Digital Reasoning.
AOL Government Stories
– Semantic Medline (Pending)
– HPN Health Prize for Health Data Palooza (Pending)
– From Catalyst to Semantic Synthesis - How the IC Finds More Needles in Bigger Haystacks (Pending)
– Challenges and Opportunities in Big Data: Defense Department Bets Big On Big Data
– Semantics and Ontologies for the Intelligence Community Working Toward Standards (Pending)
– Data.gov Developers Community Space Launched - Is Dr. Merkin In the House? (Pending)
– Building Trust Between Cloud Computing Providers and Suppliers
– Health Datapalooza Would Benefit From Real Innovation Investment
– Has NIEM Reached A Choke Point With Big Data
– Put Federal IT Dashboard Into Motion
– Why The Intelligence Community Loves Big Data
– Big Data Science Visualizations Past Present and Future
Challenges and Opportunities in Big Data
My Suggestions
I think it leaves us with a disconnected federal big data program between the science and intelligence communities, with the former considerably behind the latter.
As Professor Jim Hendler, RPI Computer Scientist, commented during the meeting: "Computer scientists like us have to move to the social science side of things to really do big data."
This new White House Initiative needs Todd Park's entrepreneurial spirit, Gus Hunt's experience, and DoD's new money, spent in a coordinated way with the IC and civilian agencies, to make big data across the federal government a reality.
Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA)