Maryann E. Martone, Ph.D., Neuroscience Information Framework, University of California, San Diego


Themes  Computers are now partners with humans in reading the literature Search Summarization Linking Discovery  The scientific paper starts with the materials and methods All observations, claims etc flow from experimental design and materials If authors do not provide this information in the first place, then we can’t use it to improve all of the above  Scientists produce articles for each other, not for computers Not everything you need to interpret the paper is in the paper More information may be there than is in the text

 NIF is an initiative of the NIH Blueprint consortium of institutes. What types of resources (data, tools, materials, services) are available to the neuroscience community? How many are there? What domains do they cover? What domains do they not cover? Where are they? ○ Web sites ○ Databases ○ Literature ○ Supplementary material ○ PDF files ○ Desk drawers Who uses them? Who creates them? How can we find them? How can we make them better in the future? http://neuinfo.org NIF provides a wealth of practical information on data and resource issues in neuroscience

The Neuroscience Information Framework: Discovery and utilization of web-based resources for neuroscience  A portal for finding and using neuroscience resources  A consistent framework for describing resources  Provides simultaneous search of multiple types of information, organized by category  Supported by an expansive ontology for neuroscience  Utilizes advanced technologies to search the “hidden web” http://neuinfo.org UCSD, Yale, Cal Tech, George Mason, Washington Univ Supported by NIH Blueprint Literature 22 mil Data Federation 350 mil Resource Registry 5000

In an ideal information system, we would be able to find…  What is known: “What studies used my monoclonal mouse antibody against actin in humans?” “What phenotypes are associated with each mouse model of Spinal Muscular Atrophy?” “What upregulates SMN1?”  What is not known: Connect information to infer plausible hypotheses ○ Genotype-phenotype ○ Possible drug targets Information gaps

Whither biological information? ∞ What is easily machine processable and accessible What is potentially knowable What is known: Literature, images, human knowledge

CA2: Ion, Brain Part or Gene? BioGrid Allen Brain Atlas Brain Info NIF queries across more than 170 independent databases

Papers are the currency of science  Despite the wealth of data out there (> 2500 databases on-line), the majority of data is still published in papers  But...we write for other humans to consume and information continues to be hard to find Even for humans, however, it is difficult to find and verify basic information about a paper critical for interpretation What is the subject of the study What reagents were used What genes were studied  A lot of information is missing from papers Not all data is available Data is published in papers in forms that are difficult to use

Mining the literature for resources  Resources: Materials, services, tools, data Project 1: Find materials: antibodies and transgenic animals Project 2: Mine supplemental data in papers showing gene expression changes in drug abuse  Purpose Find new resources Track usage of existing resources Link resources to other useful information

Linking resources: Link out broker

Use case: antibodies  Pilot project to use text mining to identify antibodies used in studies: Wanted to pick a project that would be immediately understandable by research scientists  Antibodies are used routinely to identify proteins and other molecules in basic and translational studies  Antibodies are a large source of experimental variability in results: The same antibody can give you very different results; different antibodies to the same protein can give you very different results  Neuroscientists spend a lot of time tracking down antibodies and troubleshooting experiments that use antibodies

Our reagents and methods are not perfect “We note that many of the findings in the literature about neuronal NF-κB are based on data garnered with antibodies that are not selective for the NF-κB subunit proteins p65 and p50. The data urge caution in interpreting studies of neuronal NF-κB activity in the brain.” --Herkenham et al., J Neuroinflammation. 2011; 8: 141.

Antibodies are complex entities  Anti-ChAT antibody: Raised against a portion of choline acetyltransferase; raised in a particular species; is polyclonal or monoclonal; is affinity purified or not; recognizes the target in some species, e.g., human  Reported in materials and methods: Tissue sections were blocked with 5% serum and incubated overnight at 4 °C with the following primary antibodies: anti-ChAT (1:100; Millipore, Billerica, MA), anti-Bax (1:50; Santa Cruz), anti-Bcl-xl (1:50; Cell Signaling), anti-neurofilament 200 kDa (1:200; Millipore)...
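To make the gap between the entity and its description concrete, here is a minimal Python sketch that pulls antibody mentions out of the methods sentence quoted above. The regular expression, the `AntibodyMention` class and the `extract_mentions` function are illustrative assumptions, not the NIF pipeline; the pattern only handles the "anti-X (dilution; vendor)" style used in this one sentence.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Methods sentence quoted on the slide above.
METHODS = (
    "Tissue sections were blocked with 5% serum and incubated overnight at 4 °C "
    "with the following primary antibodies: anti-ChAT (1:100; Millipore, Billerica, MA), "
    "anti-Bax (1:50; Santa Cruz), anti-Bcl-xl (1:50; Cell Signaling), "
    "anti-neurofilament 200 kDa (1:200; Millipore)"
)

# Hypothetical pattern: "anti-<target> (<dilution>; <vendor>[, <location>])".
MENTION = re.compile(
    r"anti-\s?(?P<target>[^()]+?)\s*\((?P<dilution>1:[\d,]+);\s*(?P<vendor>[^)]+)\)"
)


@dataclass
class AntibodyMention:
    target: str
    dilution: str
    vendor: str
    catalog_number: Optional[str] = None  # never stated in this sentence


def extract_mentions(text: str) -> List[AntibodyMention]:
    """Return one record per antibody mention found in the text."""
    mentions = []
    for m in MENTION.finditer(text):
        vendor_name = m["vendor"].split(",")[0].strip()  # drop the vendor location
        mentions.append(AntibodyMention(m["target"].strip(), m["dilution"], vendor_name))
    return mentions


mentions = extract_mentions(METHODS)
for mention in mentions:
    print(mention)
```

Every record comes back with a target, a dilution and a vendor, but none carries a catalog number, species or clonality, so the text alone does not pin down which antibody was used.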

“Find studies that used a rabbit polyclonal antibody against GFAP that recognizes human in immunocytochemistry” Paz et al, J Neurosci, 2010 (AB_310775) NIF Antibody Registry: a database of > 900,000 antibodies

Searching for resources in literature  NIF recently implemented a section-specific search  Semi-automated resource identification pipeline Paul Sternberg, Yuling Li, Cal Tech

Annotation of antibodies Allows annotation of entities and key relationships: Protocol Subject of protocol Links antibodies to a database of antibodies that contains their properties NIF Antibody Registry 900,000 antibodies Unique ID http://antibodyregistry.org http://annotationframework.org/ DOMEO annotation tool: Paolo Ciccarese; Tim Clark, MGH

What studies used my monoclonal mouse antibody against actin in humans?  Midfrontal cortex tissue samples from neurologically unimpaired subjects (n = 9) and from subjects with AD (n = 11) were obtained from the Rapid Autopsy Program  Immunoblot analysis and antibodies  The following antibodies were used for immunoblotting: - actin mAb (1:10,000 dilution, Sigma-Aldrich); -tubulin mAb (1:10,000, Abcam); T46 mAb (specific to tau 404–441, 1:1000, Invitrogen); Tau-5 mAb (human tau 218–225, 1:1000, BD Biosciences) (Porzig et al., 2007); AT8 mAb (phospho-tau Ser199, Ser202, and Thr205, 1:500, Innogenetics); PHF-1 mAb (phospho-tau Ser396 and Ser404, 1:250, gift from P. Davies); 12E8 mAb (phospho-tau Ser262 and Ser356, 1:1000, gift from P. Seubert); NMDA receptors 2A, 2B and 2D goat pAbs (C terminus, 1:1000, Santa Cruz Biotechnology)… Subject is Human; mAb = monoclonal antibody

Tracking down reagents Feng et al., MATH5 controls the acquisition of multiple retinal cell fates, Mol Brain. 2010; 3: 36

Space limitations  Content gets separated in space and time Practices are designed to save space, improve readability and save authors typing  But...electrons are cheap  Cut and paste is cheap Re-examining plagiarism in the age of cut and paste  Autocomplete is cheap Acronyms and abbreviations Are there any unique 3 letter strings  Formats are flexible What the computer sees and what humans see don’t have to be the same thing

Try this Watson! 95 antibodies were identified in 8 articles 52 did not contain enough information to determine the antibody used Some provided details in another paper And another paper, and another... Failed to give species, clonality, vendor, or catalog number But, many provided the location of the vendor because the instructions to authors said to do so

Subject of study  Often not explicit: “patients with AD” = human Type III SMA mice (Smn−/−, SMN2+/−) were produced as previously described (Tsai et al., 2006a).  Official strain nomenclature of animals not designed for search: SMN2Ahmb89 tg/tg ;SMNΔ7 tg/tg :Smn1−/−; no unique identifier assigned  Many lines of transgenics are generated and described within a single paper; difficult to relate individual findings with the correct animal line, but not all are equivalent: Three lines of transgenic mice, M1, M2, and M3, were produced (Fig. 1B). Transgene expression was found in all tissues studied, with widespread high expression in line M1, high expression in brain of line M3, and relatively low expression in brain of line M2 (Fig. 1C). (Ripps et al., PNAS, USA Vol. 92, pp. 689-693, January 1995)

Which mouse did you use?  “Transgenic mice expressing SOD1 G93A (12) were purchased from Jackson Laboratory”12 12 = Gurney ME; et al. 1994. Motor neuron degeneration in mice that express a human Cu,Zn superoxide dismutase mutation [see comments] [published erratum appears in Science 1995 Jul 14;269(5221):149] Science 264(5166):1772-5. Search NIF/Jackson lab for “Gurney SOD” ○ 7 entries for same producer ○ 3 track to the same reference  Gogliotti et al, Biochem Biophys Res Commun. 2010 January 1; 391(1): 517. “Here we report our findings for the SMA mouse model that has been deposited by the Li group from Taiwan. These mice, JAX stock number TJL-005058, are homozygous for the SMN2 transgene, Tg(SMN2)2Hung, and a targeted Smn allele that lacks exon 7, Smn1 tm1Hung.”

Minimal metadata standards (really) for publishing in the 21st century  1) Provide gene accession numbers for all genes referenced in the methods section of a paper, per http://www.ncbi.nlm.nih.gov/gene  2) Identify (i.e., give ID) the species for the subject of a study, and from which each gene product is derived, using the NCBI taxonomy and the strains from the model organism databases for mice, rats, worms, zebrafish and drosophila, employing any existing unique identifiers and correct species-specific nomenclature  3) Provide catalog numbers and vendor information for all reagents and animals described in the methods section of a paper Developed by the Link Animal Model to Human Disease Initiative (LAMHDI) consortium Journal of Comparative Neurology: Requires complete characterization of antibody as stated in instructions to authors; 90% of antibodies had a catalog #; 20% had a lot number after these policies were instituted NIF could automatically identify 80% of these antibodies through matching with NIF Antibody Registry
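The three requirements above amount to a checklist that could be verified automatically at submission time. The following Python sketch shows one way to do that; the `Reagent` and `MethodsMetadata` classes and the `missing_items` function are hypothetical names introduced for illustration, not an existing LAMHDI or NIF tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Reagent:
    name: str
    vendor: Optional[str] = None
    catalog_number: Optional[str] = None


@dataclass
class MethodsMetadata:
    # Gene name -> accession number (None when the authors gave none).
    genes: Dict[str, Optional[str]] = field(default_factory=dict)
    # NCBI Taxonomy ID of the study subject, e.g. "10090" for Mus musculus.
    taxonomy_id: Optional[str] = None
    reagents: List[Reagent] = field(default_factory=list)


def missing_items(meta: MethodsMetadata) -> List[str]:
    """List every place where the three minimal requirements are not met."""
    problems = []
    for gene, accession in meta.genes.items():  # requirement 1
        if not accession:
            problems.append(f"gene {gene}: no accession number")
    if not meta.taxonomy_id:  # requirement 2
        problems.append("subject: no NCBI Taxonomy ID")
    for reagent in meta.reagents:  # requirement 3
        if not reagent.vendor:
            problems.append(f"reagent {reagent.name}: no vendor")
        if not reagent.catalog_number:
            problems.append(f"reagent {reagent.name}: no catalog number")
    return problems


# Illustrative submission: species given, gene accession and catalog number missing.
example = MethodsMetadata(
    genes={"SMN1": None},
    taxonomy_id="10090",
    reagents=[Reagent("anti-ChAT", vendor="Millipore")],
)
report = missing_items(example)
print(report)
```

A journal could run a check of this kind against a structured methods form and return the list to the authors before review.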

Project 2: Extracting data from tables and supplementary material  Challenge: Extract data on gene expression in brain from studies relevant to drug abuse  Workflow: Find articles Extract results from tables Standardize results Load into NIF Drug related gene database: 140 tables from 54 articles Andrea Arnaud-Stagg, Anita Bandrowski

Extracting additional knowledge from supplementary material Gene for tyrosine hydroxylase has increased expression in locus coeruleus of mouse compared to control when given chronic morphine Translations: Upregulated p < 0.05 = increased expression LC = locus coeruleus Probe ID = gene name J Neurosci. 2005 Jun 22;25(25):6005-15.
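The "translations" listed above can be sketched as a small normalization step applied to each extracted table row. This is a minimal illustration, not the curators' actual workflow: the probe identifier `probe_0001`, the lookup tables and the `standardize` function are all hypothetical, and the 0.05 threshold is taken from the slide.

```python
from typing import Dict, Union

# Hypothetical lookup tables; a real pipeline would draw these from
# the array annotation files and an anatomy vocabulary.
PROBE_TO_GENE = {"probe_0001": "tyrosine hydroxylase"}
REGION_ABBREVIATIONS = {"LC": "locus coeruleus"}
DIRECTIONS = {
    "upregulated": "increased expression",
    "downregulated": "decreased expression",
}


def standardize(row: Dict[str, Union[str, float]], alpha: float = 0.05) -> Dict[str, str]:
    """Translate one raw table row into a standardized statement."""
    if float(row["p_value"]) < alpha:
        change = DIRECTIONS.get(str(row["direction"]).lower(), "unclear direction")
    else:
        change = "no significant change"
    probe = str(row["probe_id"])
    region = str(row["region"])
    return {
        # Fall back to the raw value when no translation is known.
        "gene": PROBE_TO_GENE.get(probe, probe),
        "region": REGION_ABBREVIATIONS.get(region, region),
        "change": change,
        "condition": str(row["condition"]),
    }


raw_row = {
    "probe_id": "probe_0001",
    "region": "LC",
    "direction": "Upregulated",
    "p_value": 0.01,  # illustrative value, not taken from the paper
    "condition": "chronic morphine",
}
statement = standardize(raw_row)
print(statement)
```

The fallbacks matter in practice: when a probe ID or abbreviation has no known translation, the raw value is kept rather than guessed.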

Challenges working with tables and supplemental data  Difficult data arrangements ○ PDF, JPG, TXT, CSV, XLS ○ Difficult styles: colors, symbols, data arrangements (results combined into one column, multiple comparisons in one table, legends defining values), unclearly described data (e.g., unclear significance)  Not clear what tables/values represent: nothing in the paper about the supplementary data file, and the table has no heading; probe IDs are given but not gene identifiers  No link from supplemental material back to article; lose provenance  Not all results are accounted for

Is SMN1 affected by drugs of abuse? SMN1 is the gene that is mutated in Spinal Muscular Atrophy, a neurological disease of children

Open world vs closed world assumptions  Closed world assumption: holds that any statement that is not known to be true is false allows an agent to infer, from its lack of knowledge of a statement being true, anything that follows from that statement being false typically applies when a system has complete control over information  Open world assumption: the assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true. limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true the open world assumption applies when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information.
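The difference between the two assumptions shows up as soon as a system is asked about a statement it has no record of. The sketch below is a minimal illustration in Python; the `query` function and the gene names in `reported_increased` are hypothetical and do not come from any particular data set.

```python
from typing import Optional, Set


def query(gene: str, reported_increased: Set[str], closed_world: bool) -> Optional[bool]:
    """Answer "was expression of this gene increased?".

    Returns True, False, or None (meaning "unknown").
    """
    if gene in reported_increased:
        return True
    # Closed world: anything not known to be true is taken to be false.
    # Open world: absence of a statement tells us nothing.
    return False if closed_world else None


# Hypothetical set of genes a paper reports as significantly increased.
reported_increased = {"Th", "Fos"}

print(query("Th", reported_increased, closed_world=True))     # True
print(query("SMN1", reported_increased, closed_world=True))   # False
print(query("SMN1", reported_increased, closed_world=False))  # None
```

The closed-world answer is only justified when the list is known to be complete, which is why reporting results for every measured gene matters.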

 We measured the expression of 9000 genes as a function of chronic cocaine (S1). The 50 genes that showed significantly increased expression (p < 0.01) are shown in Table 2 What about the other 8950 genes? Cannot assume that they were increased, decreased or remained the same (Open world)  We measured the expression of 9000 genes as a function of chronic cocaine (S1). The fold change and p value are given for each gene. The 50 genes that showed significantly increased expression (p < 0.01) are shown in Table 2 (Closed world) Reporting data: Closing the open world

Narrative vs Data publishing  Narrative (Author): Encourage use of minimal standards for key entities in the research paper: subject, protocol, genes, reagents ○ Make it easy to find accession numbers Standard templates for reporting supplemental data? ○ Unlikely although desired Tools for linking inline references to fragments of papers rather than the entire paper  Data (Curators): Structuring data requires expertise Positive and negative results are equally important If data are to be published in supplemental material or in the paper, they should be machine interpretable Ideally, the entire data set should be deposited in a public repository, e.g., the Gene Expression Omnibus (GEO)

Conclusions  Humans are storytellers; it’s fundamental to the way we communicate But these stories are directed to an audience with expertise Scientists know each other’s work; personal networks are very important. The computer isn’t part of this  So...we need to adapt publishing practices to aid automated search and mining of content Partnership between authors, publishers, curators and computer scientists, informaticians... Future of research communications and e-scholarship http://force11.org JOIN US!
