Augmenting Data with Semantics for Visualization (Big Data Analytics Competition)
Daniel R. Harris
Center for Clinical and Translational Sciences
Institute of Pharmaceutical Outcomes and Policy
Augmenting Data with Semantics for Visualization
We focus on augmenting data with semantics via concept extraction. Specifically, from the call:
What insights can you provide by analyzing the dataset? Your insights and suggestions are expected to be creative.
Based on this dataset, what are the most common medical and health applications where patent development is occurring?
How frequently are patents being filed with the same title? How would you improve this dataset to better distinguish unique patents with duplicate titles?
What additional data / metadata would you include in this dataset to help researchers more efficiently locate relevant medical and health patents?
What conclusions can you draw from this data? What trends, if any, have formed over the past decade? Where are the trends moving? Consider both health industry and patent filing perspectives.
What anomalies can you find in this data? Is there anything that affects the integrity of the data?
Augmenting Data with Semantics for Visualization
What is the original data set? A curated collection of patents selected from BHI keywords.
How did we augment this data set? We mapped each invention title to UMLS concepts (CUIs) using MetaMap (https://metamap.nlm.nih.gov/).
How is this helpful? Each concept has a semantic type that can drive visualizations, helping us understand the nature of the data and how trends change over time.
A Quick Example
Parentheses mark the span of the title matched for each semantic type; the number after each type counts those spans.
Original: A CARD GAME HAVING CARDS WITH GRAPHIC AND PICTORIAL ILLUSTRATIONS OF GEOGRAPHIC, HISTORICAL AND HEALTH RELATED FACTS
Idea or Concept (1): (A CARD GAME HAVING CARDS WITH GRAPHIC AND PICTORIAL ILLUSTRATIONS OF GEOGRAPHIC, HISTORICAL AND HEALTH RELATED FACTS)
Intellectual Product (3): (((A CARD GAME) HAVING CARDS WITH GRAPHIC AND PICTORIAL ILLUSTRATIONS) OF GEOGRAPHIC, HISTORICAL AND HEALTH RELATED FACTS)
Finding (2): A CARD GAME HAVING CARDS WITH (GRAPHIC AND PICTORIAL ILLUSTRATIONS) OF (GEOGRAPHIC, HISTORICAL AND HEALTH RELATED FACTS)
Qualitative Concept (1): A CARD GAME HAVING CARDS WITH GRAPHIC AND PICTORIAL ILLUSTRATIONS OF (GEOGRAPHIC, HISTORICAL AND HEALTH RELATED) FACTS
Spatial Concept (1): A CARD GAME HAVING CARDS WITH GRAPHIC AND PICTORIAL ILLUSTRATIONS OF (GEOGRAPHIC), HISTORICAL AND HEALTH RELATED FACTS
Each of the next slides is a Tableau visualization produced using the augmented data.
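As a rough sketch of how one augmented record might be held in memory, the example title can be paired with the semantic-type counts listed in this slide. The class name `AugmentedPatent` and its fields are assumptions made for illustration; they are not the author's schema and not MetaMap's output format.

```python
from dataclasses import dataclass, field


@dataclass
class AugmentedPatent:
    """Hypothetical record: an invention title plus semantic-type counts."""
    title: str
    semantic_type_counts: dict = field(default_factory=dict)

    def total_concepts(self):
        # Sum the concept counts across all semantic types.
        return sum(self.semantic_type_counts.values())


# Counts copied from the card-game example in the slide.
example = AugmentedPatent(
    title=("A CARD GAME HAVING CARDS WITH GRAPHIC AND PICTORIAL ILLUSTRATIONS "
           "OF GEOGRAPHIC, HISTORICAL AND HEALTH RELATED FACTS"),
    semantic_type_counts={
        "Idea or Concept": 1,
        "Intellectual Product": 3,
        "Finding": 2,
        "Qualitative Concept": 1,
        "Spatial Concept": 1,
    },
)

print(example.total_concepts())
```

Keeping the counts per semantic type, rather than only a flat list of concepts, is one simple way to make the later aggregation steps a matter of summing dictionary values.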
Aggregation
Concepts provide simple buckets to use when aggregating data.
The chart on the right shows the frequency of concepts extracted from the patent database.
Aggregation
Concept frequencies can be paired with any other facet of information available in the original data set, such as time of filing.
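The bucketing idea can be sketched with the standard library alone. The records below are made-up (filing year, semantic type) pairs used only to show the mechanics; they are not taken from the patent data set, and the tuple layout is an assumption.

```python
from collections import Counter

# Hypothetical (filing_year, semantic_type) pairs, for illustration only.
records = [
    (2012, "Medical Device"),
    (2012, "Pharmacologic Substance"),
    (2013, "Medical Device"),
    (2013, "Medical Device"),
    (2014, "Manufactured Object"),
]

# Bucket by semantic type alone.
by_type = Counter(semantic_type for _, semantic_type in records)

# Pair the semantic type with another facet, here the filing year.
by_year_and_type = Counter(records)

for semantic_type, count in by_type.most_common():
    print(semantic_type, count)
```

Any other facet in the original data could replace the year in the tuple without changing the counting logic.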
Temporal Considerations
We can filter by semantic type.
The chart on the right shows the frequency of patents having the semantic type “Pharmacologic Substance”.
Time is important because we can use previous data points to forecast future data points.
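Filtering by semantic type and counting per year might look like the following minimal sketch. The patent identifiers, years, and type sets are hypothetical placeholders, and the dictionary layout is an assumption rather than the format of the augmented data set.

```python
from collections import Counter

# Hypothetical patents; each may carry several semantic types.
patents = [
    {"id": "P1", "year": 2012, "types": {"Pharmacologic Substance", "Finding"}},
    {"id": "P2", "year": 2012, "types": {"Medical Device"}},
    {"id": "P3", "year": 2013, "types": {"Pharmacologic Substance"}},
    {"id": "P4", "year": 2013, "types": {"Pharmacologic Substance", "Medical Device"}},
]


def yearly_counts(patents, semantic_type):
    """Count patents per year that carry the given semantic type."""
    counts = Counter(p["year"] for p in patents if semantic_type in p["types"])
    # Return the years in chronological order.
    return dict(sorted(counts.items()))


pharma_by_year = yearly_counts(patents, "Pharmacologic Substance")
print(pharma_by_year)
```

The resulting year-to-count mapping is the kind of time series that a forecast or trend line would then be fitted to.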
Trends
Trend lines can work in conjunction with temporal data (linear regression shown).
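For readers who want to see what a linear-regression trend line computes, here is a minimal ordinary least squares sketch in plain Python. The slides use Tableau's built-in trend lines; this is not Tableau's implementation, and the yearly counts are made-up values for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical patent counts per filing year.
years = [2010, 2011, 2012, 2013, 2014]
counts = [12, 15, 14, 18, 21]

slope, intercept = fit_line(years, counts)


def predict(year):
    # Extend the fitted line to a year outside the observed range.
    return slope * year + intercept


print(round(slope, 2), round(predict(2015), 2))
```

Extending the fitted line beyond the last observed year gives a simple forecast, under the strong assumption that the linear trend continues.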
Comparing
Different semantic types experience different trends.
We can visualize them side by side to ask questions that might help us understand how the data fluctuates across time.
We can merge this with forecasting and trend lines.
Stratify Change
We can stratify our data by sections of time to see how our forecasts change when considering only the last X years.
This is helpful when specific policy changes, funding changes, or discoveries impact the filing of patents.
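A small sketch of the stratification idea: fit a trend on the full series, then on only the last X years, and compare the slopes. The counts are invented to show a series whose recent direction differs from its long-run direction; the function names are hypothetical.

```python
def fit_slope(xs, ys):
    """Least squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def last_n_years(series, n):
    """Keep only the n most recent years of a {year: count} mapping."""
    recent = sorted(series)[-n:]
    return {year: series[year] for year in recent}


# Hypothetical yearly counts: a decline followed by a recovery.
series = {2008: 30, 2009: 28, 2010: 25, 2011: 20,
          2012: 18, 2013: 19, 2014: 21, 2015: 24}

full_slope = fit_slope(list(series), list(series.values()))
recent = last_n_years(series, 4)
recent_slope = fit_slope(list(recent), list(recent.values()))

print(round(full_slope, 3), round(recent_slope, 3))
```

In this toy series the full-history slope is negative while the last four years slope upward, which is the kind of change a policy or funding shift could produce.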
Drill Down
We can still leverage specific information about each invention.
For QA purposes, the concepts extracted from each invention title are easily listed and reviewed.
Research question: Can we automate this evaluation?
Landscape Analysis
Peaks and valleys are easily identifiable, but explaining why they occur is a more difficult challenge.
We can overlay annotations corresponding to significant law, policy, or regulatory changes to see the impact of such changes.
Identify Areas of Improvement
Aggregating at a conceptual level exposes both areas that are popular and areas where additional opportunity exists.
Example
Using the last X years of data, we can forecast that the number of medical device patents initially decreases, then remains stable.
We can also see that the number of manufactured objects decreases, with a possible return.
If these manufactured objects are not new medical devices, what are they (IoT, surveillance, communication, etc.)?
Conjunction
We can recover the invention titles given the semantic type.
We can also see how this semantic type coincides with other semantic types.
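Both operations on this slide can be sketched in a few lines: looking up titles by semantic type, and counting which other types appear alongside it. The titles are placeholders and the type assignments are invented for illustration; the function names are hypothetical.

```python
from collections import Counter

# Placeholder titles mapped to hypothetical sets of semantic types.
patents = {
    "TITLE A (PLACEHOLDER)": {"Medical Device", "Manufactured Object"},
    "TITLE B (PLACEHOLDER)": {"Medical Device", "Finding"},
    "TITLE C (PLACEHOLDER)": {"Manufactured Object", "Medical Device", "Finding"},
    "TITLE D (PLACEHOLDER)": {"Pharmacologic Substance"},
}


def titles_with(patents, semantic_type):
    """Recover the invention titles that carry a given semantic type."""
    return sorted(t for t, types in patents.items() if semantic_type in types)


def cooccurring_types(patents, semantic_type):
    """Count the other semantic types found on the same titles."""
    counts = Counter()
    for types in patents.values():
        if semantic_type in types:
            counts.update(types - {semantic_type})
    return counts


device_titles = titles_with(patents, "Medical Device")
device_neighbors = cooccurring_types(patents, "Medical Device")
print(device_titles)
print(device_neighbors.most_common())
```

Counting co-occurrence per title, rather than per concept mention, keeps a title with several concepts of the same type from being counted more than once.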
Conclusions
We focus on augmenting the data set with semantic knowledge extracted from the invention titles.
This additional knowledge gives abstract buckets to compare and contrast patent trends at a higher level.
We give example visualizations to demonstrate the potential expressive power.