1 The Pragmatics of Ontology and Heterogeneous Data Sources The Ins and Outs of CTSAsearch David Eichmann School of Library and Information Science University of Iowa

2 Research Networking
Programmatic support for discovery and use of research and scholarly information regarding people and resources.
Research networking systems are essentially special-purpose institutional knowledge management systems.

3 Representative RN Systems
Profiles (Harvard)
VIVO (VIVO Consortium)
Loki (Iowa)
SciVal Experts (aka Pure – Elsevier)
A number of others

4 Why Bother with VIVO (the ontology)?
Words in a profile are just sequences of characters carrying no meaning
  –Try asking Google Scholar what grant funded a given hit…
With structure and relationship comes meaning, aka semantics
  –Enter the Semantic Web!

5 Connecting the Dots
The real challenge here is translating information that already exists in scattered sources
  –Research networking tools
  –Citation databases (e.g., PubMed)
  –Award databases (e.g., NIH RePORTER)
  –Curated archives (e.g., GenBank)
  –Locked up in text (the research literature)

6 CTSAsearch – version 1
10 SPARQL endpoints
19 institutions
124,945 individuals
Proved challenging for some sites to handle the queries

7 CTSAsearch – version 1
      subclass      |  count
--------------------+---------
 NonFacultyAcademic | 2592383
 FacultyMember      |   26826
 NonAcademic        |   15268
 EmeritusFaculty    |    2134
 EmeritusProfessor  |    2070
 Postdoc            |    1226
 Librarian          |     232
 Student            |      89
 GraduateStudent    |      71
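The table above is psql output, so the counts on the slide were presumably produced in SQL after harvesting. As a rough illustration of what an equivalent per-endpoint aggregation could look like, here is a minimal sketch that pairs a SPARQL 1.1 query with a parser for the standard SPARQL JSON results format. The query text, the use of `rdfs:subClassOf*` under `foaf:Person`, and the helper names `local_name` and `tabulate` are assumptions for illustration, not the harvester described in the talk. Sending the query to an endpoint is deliberately left out.

```python
# Hypothetical sketch: count people per VIVO subclass from one SPARQL endpoint.
# The query and helper names are illustrative assumptions, not CTSAsearch code.

COUNT_BY_SUBCLASS = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?subclass (COUNT(DISTINCT ?person) AS ?count)
WHERE {
  ?person rdf:type ?subclass .
  ?subclass rdfs:subClassOf* foaf:Person .
}
GROUP BY ?subclass
ORDER BY DESC(?count)
"""


def local_name(uri):
    """Strip the namespace from a class URI (text after the last '#' or '/')."""
    return uri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]


def tabulate(results):
    """Turn a SPARQL 1.1 JSON results document into (subclass, count) rows."""
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append((local_name(binding["subclass"]["value"]),
                     int(binding["count"]["value"])))
    return rows


if __name__ == "__main__":
    # Invented response in the SPARQL JSON results layout, for demonstration only.
    sample = {
        "head": {"vars": ["subclass", "count"]},
        "results": {"bindings": [
            {"subclass": {"type": "uri",
                          "value": "http://vivoweb.org/ontology/core#FacultyMember"},
             "count": {"type": "literal", "value": "3"}},
            {"subclass": {"type": "uri",
                          "value": "http://vivoweb.org/ontology/core#Postdoc"},
             "count": {"type": "literal", "value": "1"}},
        ]},
    }
    for name, count in tabulate(sample):
        print(f"{name:<20}| {count}")
```

Running one aggregate query per site keeps the load on each endpoint small, which matters given that some sites struggled to handle the version 1 queries.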

8 CTSAsearch – version 2
10 SPARQL endpoints (19 institutions)
15 VIVO sites
  –Harvested with customized crawler
14 Profile sites
  –Harvested with customized crawler

9 CTSAsearch – version 2
      subclass      |  count
--------------------+---------
 NonFacultyAcademic | 2592885
 FacultyMember      |   55499
 NonAcademic        |   15430
 Student            |   11074
 GraduateStudent    |   10951
 EmeritusFaculty    |    3096
 EmeritusProfessor  |    2072
 Postdoc            |    1410
 Librarian          |     264

10 CTSAsearch – architecture
1 VIVO-based SPARQL harvester
2(!) VIVO-based crawlers
1 Profiles-based crawler
2 Platform-specific HTML crawlers
1 CSV-based loader

11 CTSAsearch – architecture

12 CTSAsearch – current
45,456,417 VIVO-derived triples
48,569,115 Profiles-derived triples

13 Recent Work
Cross-linkage across sites
  –Resolving ‘stubs’
  –Formation of a single ecosystem
Macro concerns
  –Institution-scale analytics
  –Pondering reflection

14 Current “profile”

15 CTSAsearch/Polyglot – version x
Temporary SPARQL endpoint:
  –http://marengo.info-science.uiowa.edu:2020
Shared visualization widgets
  –Intended for embedding in institutional sites
Community-wide sameAs assertions
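Community-wide sameAs assertions amount to grouping profile URIs from different sites into identity clusters. A minimal sketch of that grouping, assuming the assertions arrive as plain (URI, URI) pairs and treating sameAs as symmetric and transitive, is a union-find pass. The function name `same_as_clusters` and the example.org URIs are hypothetical; the talk does not say how CTSAsearch computes its clusters.

```python
from collections import defaultdict


def same_as_clusters(pairs):
    """Group URIs linked by sameAs pairs into clusters using union-find."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps chains short
            uri = parent[uri]
        return uri

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two clusters

    clusters = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])


if __name__ == "__main__":
    # Hypothetical assertions linking one person across three sites.
    assertions = [
        ("http://a.example.org/person/1", "http://b.example.org/profile/17"),
        ("http://b.example.org/profile/17", "http://c.example.org/id/x9"),
        ("http://d.example.org/person/2", "http://e.example.org/person/2"),
    ]
    for cluster in same_as_clusters(assertions):
        print(cluster)
```

Each resulting cluster can then be treated as a single individual when resolving stubs across sites.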

16 Pattuelli’s Spectrum of Relationships (2012) http://www.oclc.org/content/dam/research/grants/reports/2012/pattuelli2012.pdf

17 Pattuelli’s Spectrum of Relationships (2012) RN Tools http://www.oclc.org/content/dam/research/grants/reports/2012/pattuelli2012.pdf

18 Pattuelli’s Spectrum of Relationships (2012) RN Tools Linked In http://www.oclc.org/content/dam/research/grants/reports/2012/pattuelli2012.pdf

19 Pattuelli’s Spectrum of Relationships (2012)
Ontologies used
  –foaf (Friend of a Friend)
  –rel (Relationship)
  –mo (Music)
Echoes of Trigg’s link taxonomy
  –Trigg, R. 1983. Network-Based Approach to Text Handling for the Online Scientific Community. Ph.D. dissertation, Department of Computer Science, University of Maryland, technical report TR-1346

20 Connecting the Dots – Take 2 Figure courtesy of Melissa Haendel, OHSU

21 PubMed Central Open Access
886,172 papers (as of 1/1/15)
423,764 with acknowledgements
994,931 sentences
4,329,972 parses

22 The Simple Cases
PMCID: 3008610
SeqNum: 2
SentNum: 6
Sentence: EK analysed the data.
POS: [EK/NNP, analysed/VBD, the/DT, data/NNS, ./.]
Parse: [S [NP EK/NNP ] [VP analysed/VBD [NP the/DT data/NNS ] ] ./. ]
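To work with parses like the one on this slide, the bracketed string first has to become a tree. The following is a minimal sketch of such a reader, assuming every token (opening `[Label`, `word/TAG` leaf, closing `]`) is separated by whitespace and that no word itself begins with `[`. The names `parse_bracketed` and `leaves` are made up for this example and are not the tooling used in the talk.

```python
def parse_bracketed(text):
    """Read '[S [NP EK/NNP ] ... ]' into nested (label, children) tuples.

    Internal nodes are (label, [children]); leaves are (tag, word).
    """
    tokens = text.split()
    pos = 0

    def node():
        nonlocal pos
        label = tokens[pos][1:]  # drop the leading '['
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos].startswith("["):
                children.append(node())
            else:
                # Split on the last '/', so './.' becomes word '.' with tag '.'.
                word, _, tag = tokens[pos].rpartition("/")
                children.append((tag, word))
                pos += 1
        pos += 1  # consume the closing ']'
        return (label, children)

    return node()


def leaves(tree):
    """Return the (word, tag) pairs of a tree in sentence order."""
    label, body = tree
    if isinstance(body, str):
        return [(body, label)]
    result = []
    for child in body:
        result.extend(leaves(child))
    return result


if __name__ == "__main__":
    parse = "[S [NP EK/NNP ] [VP analysed/VBD [NP the/DT data/NNS ] ] ./. ]"
    tree = parse_bracketed(parse)
    print(tree[0], [child[0] for child in tree[1]])
    print(leaves(tree))
```

Reading the leaves back in order reproduces the POS sequence shown on the slide, which is a quick sanity check on the reader.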

23 And the Not So Simple…
PMCID: 4159542
Sentence: We thank Sheila Harvey, Clinical Trials Unit Manager at ICNARC, and Ruth Canter, Trials Administrator at ICNARC, for their assistance in chasing completed surveys; Dr Kevin Gunning for early advice and project development; Drs Neill K. J. Adhikari and Gordon D. Rubenfeld for feedback and discussion of analysis plan; Dr Chris AKY Chong for his valuable comments on the initial draft of this manuscript; and our Responders: Addenbrooke’s Hospital ( Dr Kevin Gunning ), Airedale General Hospital ( Dr John Scriven ), Alexandra Hospital ( Dr Tracey Leach ), Arrowe Park Hospital ( Dr Lawrence Wilson ), Barnet Hospital ( Dr AH Wolff ), …
(8,245-character-long sentence)

24 Extract Entities/Relationships with Syntactic Queries
[S [NP:Author NN:Author ] [VP NN [NP:Person ] [PP ], [PP ] ] ]
S <1NP:Author <2[VP <1/thank/ <2(NP) <3(PP) ]
  –For the sentence having this pattern, match the object noun phrase and the next prepositional phrase
NP <#2 <1(NNP) <2(NNP)
  –For the noun phrase, extract two proper nouns
PP <#2 <1DT <2(NP)
  –For the prepositional phrase, match the noun phrase
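The query syntax on this slide is specific to the author's tooling, so the sketch below does not try to interpret it. It only shows the same idea in plain Python under stated assumptions: find a sentence whose first child is a noun phrase and whose second child is a verb phrase headed by a form of "thank", then pull the proper nouns from the object noun phrase and the text of the following prepositional phrase. The tree layout, the function `match_thanks`, and the example sentence are all hypothetical.

```python
def is_leaf(node):
    """Leaves are (tag, word); internal nodes are (label, [children])."""
    return isinstance(node[1], str)


def words(node):
    """Return the words under a node in sentence order."""
    if is_leaf(node):
        return [node[1]]
    return [word for child in node[1] for word in words(child)]


def subtrees(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    if not is_leaf(node):
        for child in node[1]:
            yield from subtrees(child)


def match_thanks(tree):
    """Find 'S -> NP, VP(thank NP [PP])' and extract the person and the reason."""
    hits = []
    for node in subtrees(tree):
        if is_leaf(node) or node[0] != "S" or len(node[1]) < 2:
            continue
        subject, predicate = node[1][0], node[1][1]
        if subject[0] != "NP" or predicate[0] != "VP" or is_leaf(predicate):
            continue
        kids = predicate[1]
        if len(kids) < 2 or not is_leaf(kids[0]) or is_leaf(kids[1]):
            continue
        if not kids[0][1].lower().startswith("thank") or kids[1][0] != "NP":
            continue
        # Proper nouns directly under the object NP form the person name.
        names = [leaf[1] for leaf in kids[1][1] if is_leaf(leaf) and leaf[0] == "NNP"]
        reason = ""
        if len(kids) > 2 and kids[2][0] == "PP":
            reason = " ".join(words(kids[2])[1:])  # drop the leading preposition
        hits.append({"person": " ".join(names), "reason": reason})
    return hits


if __name__ == "__main__":
    # Hand-built parse of the invented sentence "We thank Jane Doe for helpful discussions."
    tree = ("S", [
        ("NP", [("PRP", "We")]),
        ("VP", [("VBP", "thank"),
                ("NP", [("NNP", "Jane"), ("NNP", "Doe")]),
                ("PP", [("IN", "for"),
                        ("NP", [("JJ", "helpful"), ("NNS", "discussions")])])]),
        (".", "."),
    ])
    print(match_thanks(tree))
```

A hard-coded matcher like this handles only one sentence shape; a declarative pattern language, as on the slide, lets each new shape be added as data rather than code.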

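25 Person Results Snippet
 ID | Title | First Name | Middle Name | Last Name
----+-------+------------+-------------+-----------
 76 |       | Hans       |             | Matrin
 77 |       | Jeff       |             | Vieira
 78 |       | P.         |             | ZAMORE
 79 | Prof. | Eric       |             | Schon
 80 |       | Carlos     |             | Lois
 81 |       | Andrea     |             | Möll
 82 |       | Elena      |             | Govorkova
 83 |       | K.         | M.          | Pollard
 84 | Dr.   | Michael    |             | Berton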

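26 Relationships for Person 77
  PMCID  |   Category    | PP
---------+---------------+--------------------------------------------------------
 4006053 | Support       | the kind gift of rKSHV.219
 4006053 | Support       | the kind gift of rKSHV.219 and for helpful discussions
 4006053 | Collaboration | helpful discussions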

27 Relationships for Person 79
  PMCID  |   Category    | PP
---------+---------------+------------------------------------
 2801706 | Resource      | the rabbit polyclonal antibody
 2801706 | Resource      | the ECFP and EYFP plasmids
 4013013 | Collaboration | his helpful advice and discussions

28 Category Frequencies
       Category        | Count
-----------------------+--------
 Collaboration         | 47,052
                       | 46,327
 Technique             | 33,598
 Resource              |  8,894
 Support               |  6,836
 Event                 |  3,744
 Project               |    854
 Place Name            |    229
 Publication Component |    210
 Place                 |    186
 Organization          |     93

29 Next Steps
Continue slogging through extraction pattern definition
Define patterns for
  –funding declarations
  –chairs, fellowships, etc.
Merge data into CTSAsearch visualizations
Align current category scheme with Melissa Haendel’s current draft ontology for CASRAI taxonomy and then merge with VIVO-ISF

30 In the Next Year
Joint work with Melissa Haendel (OHSU) on an administrative supplement to OHSU’s CTSA, bridging RNs and NIH’s SciENcv
  –Map SciENcv data model to VIVO-ISF
  –Enable bi-directional data exchange
  –Integrate clinical/trial data sources
  –Integrate SciENcv, ORCID data into CTSAsearch
  –Multi-granularity search and visualization

31 Questions? Email: david-eichmann@uiowa.edu

