Issues in Managing and Disseminating Changing Information in Biology
Sue Rhee
Carnegie Institution, Department of Plant Biology
Stanford, CA
Information Dissemination Media in Biology
Journals: ~150 years; peer-reviewed; highly referenced; limited size; static
Public Repositories: ~20 years; minimum review; minimum reference; unlimited size; static
Community Databases: ~5 years; curator-review; moderately referenced; unlimited size; dynamic
TAIR: the Arabidopsis Information Resource
A Community Database about Arabidopsis Information
Researchers can search, download, and analyze data via commonly used web browsers and FTP
NSF-funded project ( )
Collaboration between Carnegie (Stanford, CA), NCGR (Santa Fe, NM) and ABRC (Columbus, OH)
Who are the users?
People, Groups, Organism of Interest
Total: 12,300 individuals and 4700 labs working on plant research
Usage Statistics
Monthly:
~5 million files served
~900,000 page views
~29,000 IP addresses
~30 Gb served
What do we do?
1. Capture data generated by large genome projects and individual researchers
 - Read and extract information from the literature, establish contact with large-scale project groups
2. Curate and analyze the information
 - Error checking, making associations, synthesizing summaries, adding quality control filters through a series of standard operating procedures and analysis pipelines
3. Make information accessible to users in intuitive form
 - In-house biologists and user feedback from surveys & workshops
4. Develop data query, analysis, curation, and visualization tools
 - Collaboration between software developers and biologists, iterative process
5. Communicate with the users
 - Data submission, suggestions, error and other problem reports
What is PubSearch?
A web application and database for literature curation
Stores complete literature information
 - References, abstracts, full-text articles (PDF)
Stores biological information
 - Genes, proteins, descriptions
Stores ontologies (GO terms)
Links literature, GO terms, and biological information
Assists manual curation with fast, automatic matching (using a suffix tree indexer)
Is password-protected and easy to set up and use
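The automatic matching step can be illustrated with a small sketch. The slide names suffix tree indexing but gives no detail, so this example swaps in a simpler relative, a naively built suffix array, and is not PubSearch's actual code. The function names (`build_suffix_array`, `find_occurrences`, `match_terms`) and the sample abstract are hypothetical. Matching here is a case-insensitive substring search, so hits would still need curator validation.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start offsets of all suffixes of `text`, sorted lexicographically."""
    # Naive construction; adequate for one abstract, not for a whole corpus.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], term: str) -> list[int]:
    """Return the sorted offsets at which `term` occurs in `text`."""
    if not term:
        return []
    # Binary search for the first suffix whose leading characters are >= term.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(term)] < term:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes that begin with `term` are contiguous from that position.
    hits = []
    while lo < len(suffix_array) and text.startswith(term, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


def match_terms(abstract: str, terms: list[str]) -> dict[str, list[int]]:
    """Map each term found in `abstract` to its offsets (case-insensitive)."""
    text = abstract.lower()
    suffix_array = build_suffix_array(text)
    matches = {}
    for term in terms:
        offsets = find_occurrences(text, suffix_array, term.lower())
        if offsets:
            matches[term] = offsets
    return matches


if __name__ == "__main__":
    # Hypothetical abstract and term list, for illustration only.
    sample = "The AP2 gene controls flower development; ap2 mutants show altered flower development."
    print(match_terms(sample, ["AP2", "flower development", "root hair"]))
```

The index is built once per abstract and reused for every gene name or controlled vocabulary term, which is the property that makes index-based matching attractive when tens of thousands of terms must be checked against each paper.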
PubSearch System Architecture
TAIR Installation Statistics (9/12/03)
20,272 literature references
14,920 research papers with abstracts
8,642 full-text papers (58%)
16,956 controlled vocabulary terms
105,671 hits between terms and articles (2359 terms)
38,010 gene names
29,841 hits between genes and articles (4268 genes)
14,943 hits validated
 - 70% valid, 29% not valid, 0.5% maybe
11,497 manual annotations to 5981 genes from 2113 articles
38 relationship types for gene2term and gene2gene
103 evidence types
Pub* Tools Website:
TAIR Data Size
Type (info stored): size in 1999; size in 2003
Website (general information, help, external sites): 0.7 Gb; 25 Gb
Database (data, external links, definition of database fields): 3 Gb; 20 Gb
FTP directory (large datasets generated from database or external sites): N.D.; 13 Gb
DVD archive (microarray raw data): 0; 1.6 Gb
Current Issues in Community Databases
1. How to maximize connection with public repositories and journals?
2. How to ensure information is up-to-date?
3. How to cross-reference all the information in independent sites?
4. What happens after the funding?
Overlap and Interconnection Between Existing Media: Journals, Public Repositories, Community Databases
Making Connections with Public Repositories
1. Utilizing existing standards
 A. LinkOut
  a. Data capture includes Genbank accession (e.g. seed stock containing an insertion and the insert-site sequence with Genbank accession)
  b. Data downloaded from Genbank by accession, using e-utilities
  c. Data curation/analysis generates additional associations (e.g. the insertion site is used to identify the associated gene and a polymorphism for that gene)
  d. Sequence-associated information sent back to Genbank using the LinkOut XML format
2. Collaborating to make new standards
 A. Plant microarray submission standards with ArrayExpress
 B. MIAME standards for microarrays
  a. Researchers submit microarray data in prefilled Excel sheets
  b. Excel sheets converted into XML and loaded into the TAIR database
  c. Data curation/analysis generates additional associations (e.g. usage of controlled vocabularies)
  d. Data exported into XML and sent to ArrayExpress
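The download-by-accession step can be sketched with NCBI's E-utilities EFetch endpoint. This is a minimal illustration under stated assumptions, not TAIR's pipeline: the helper names (`build_efetch_url`, `fetch_genbank_record`) are hypothetical, the parameter values (`db=nuccore`, `rettype=gb`, `retmode=text`) reflect the present-day EFetch interface rather than whatever was used when the slides were written, and the accession in the usage line is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, db: str = "nuccore", rettype: str = "gb") -> str:
    """Build an EFetch URL that requests one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_genbank_record(accession: str, timeout: float = 30.0) -> str:
    """Download the flat-file record for `accession` (requires network access)."""
    with urlopen(build_efetch_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # "EXAMPLE123" is a placeholder, not a real accession.
    print(build_efetch_url("EXAMPLE123"))
```

Separating URL construction from the network call keeps the request easy to inspect and log before anything is sent, which helps when many accessions are processed in a batch.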
Making Connections with Journals
1. Publication requirement to adhere to existing standards
 A. Stock Accessions
 B. Gene symbol Registry (currently under discussion)
2. Data sharing
 A. Image data for gene expression
 B. Supplementary data (e.g. microarray results)
3. Resource sharing
 A. Publication through community databases?
Keeping Information Up-To-Date
1. In-house curation
 - pro: experience and standard operating procedures can ensure consistency
 - con: becoming difficult to keep up as the amount and complexity of information increases
2. Community involvement
 - pro: expertise and sheer number of the community
 - con: has not worked successfully (no incentive in the current academic reward structure, not considered to be a typical role of a scientist)
3. Others?
Impact Factor of Top Journals
Impact Factor of Top Databases?
Impact of TAIR
The End
People Involved
TAIR-Carnegie: Tanya Berardini, Marga Garcia-Hernandez, Eva Huala, Suparna Mundodi, Leonore Reiser, Julie Tacklind, Iris Xu, Danny Yoo, Peifen Zhang, Nick Moseyko, Brandon Zoekler, Jessie Zhang
TAIR-NCGR: Dan Weems, Neil Miller, Mary Montoya
ABRC: Randy Scholl, Debbie Crist, Emma Knee, Luz Rivero
Information Dissemination Media in Biology
1. Scientific Journals
 - Traditional medium of knowledge dissemination
 - Long history of publishing
 - Recently have moved to electronic publishing
2. Public Repositories
 - Permanent operations for electronic storage and dissemination of basic data
 - Shorter history than journals, about 20 years
 - A good example is NCBI's Genbank
3. Community Databases
 - Information resources that are created, maintained, and improved by the research community
 - Funded by governments, not permanent
 - A few large databases share a similar history with public repositories
 - Recently there has been a radiation of community databases
What is the infrastructure?
Web browser applications; TAIR DB; Data object layer; Application Program Interface; Analysis cluster; FTP Directory; DVD archive
Software Development, Curation, Testing, Staging Environments