CBioC: Massive Collaborative Curation of Biomedical Literature
Chitta Baral, Hasan Davulcu, Anthony Gitter, Graciela Gonzalez, Geeta Joshi-Tope, Mutsumi Nakamura, Prabhdeep Singh, Luis Tari, and Lian Yu

Premise: current status of curation from text
- Our initial focus is on curation of "knowledge" nuggets from biomedical articles.
- There are about 15 million abstracts in PubMed; roughly 3 million were published by US and EU researchers (about 800 articles per day).
- About 300K articles published so far report protein-protein interactions in human, yeast, and mouse.
- By comparison: BIND (in 7 years): 23K; DIP: 3K; MINT: 2.4K.

Premise: the high cost of human curation
- The overwhelming cost of large curation efforts may be unsustainable over long periods.
- BIND: bad news in November 2005. It operated for 7 years, listed over 100 curators and programmers, and received CDN $29 million in 2003, plus other funding.
- Curation efforts of the AFCS have recently stopped.
- Lack of funding for some genome annotation projects.

Premise: summary
- Human curation of text is expensive.
- Human curation of text is not scalable.
- Human curation of text is not sustainable.

Why not resort to computers and do automatic extraction?
- Lessons from the DARPA-funded MUCs (Message Understanding Conferences), run through the 1990s at a cost of tens of millions of dollars:
  - Getting to 60% recall and precision is quick; after that, every 5% improvement takes about a year's work.
  - Even at 90% accuracy for extracting an individual entity, recognizing 4 related entities compounds the error: (0.9)^4 ≈ 0.66 (see the sketch below).
- Lessons from biomedical text extraction:
  - No proper evaluation.
  - It is recognized that recall and precision are not very good even in the "best" systems.
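As a quick check of the compounding argument, here is a minimal sketch in Python; the 0.9 per-entity accuracy and the 4 related entities come from the slide, and the independence assumption is the slide's implicit simplification.

```python
# Probability that a multi-entity relation is extracted correctly, assuming
# each entity is recognized independently with the same per-entity accuracy.
def relation_accuracy(per_entity_accuracy: float, num_entities: int) -> float:
    return per_entity_accuracy ** num_entities

# 90% per-entity accuracy, 4 related entities -> roughly 0.66
print(f"{relation_accuracy(0.9, 4):.2f}")
```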

What do we do?
- How do we curate not only the existing articles, but also future articles?
- This is too important to give up! We need to think of a new way to do it.
- Faster computers, better sequencing technology, and better algorithms came to the rescue of the Human Genome Project.
- Hmm. What resources are we overlooking?

Key idea
- If lots of articles are being written, then a lot of people are writing them and a lot of people are reading them.
- If only we could make these people (the authors and the readers) contribute to the curation effort …
- Especially the readers: the ones who need the curated data!

Mass collaboration has worked in:
- Wikipedia
- Project Gutenberg
- Netflix ratings
- Amazon ratings
- Etc.

Mass collaborative curation: initial hurdles
- Russ Altman mentioned the challenges with respect to the authors: sticking to a format, submitting data.
- An average reader is not normally interested in filling out a blank curation form, and we cannot make an average reader go through curation training.
- So it has to be very different from simply making the existing curation tools available to the masses and expecting them to contribute.

Mass collaborative curation: key initial ideas
- Make it very easy: the user need not remember where (which database, which web page) to put the curated knowledge; the curation opportunity should present itself seamlessly.
- Curation should not be a burden to an average user.
- Make the curated knowledge "thin".
- There should be immediate rewards.
- Do not start with a blank slate.

Realization of the key ideas: a biologist with a gene name
- The biologist goes to PubMed, types the gene name, and clicks on one of the abstracts; the curation panel presents itself automatically.
- Our approach calls for researchers to contribute to the curation of facts as they read and research over the web, but not with a blank slate: no one wants to be the first one!
- Automatic extraction jump-starts the process, and researchers then improve upon the extracted data, "ironing out" inconsistencies through subsequent edits on a massive scale.
- Thin schemas: average users are turned off by traditional wide schemas, so wide schemas need to be broken down (see the sketch below).
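To make the thin-schema idea concrete, here is a minimal sketch in Python. The class and field names are illustrative assumptions, not CBioC's actual schema; the point is that the core record asks for only a handful of values, while the traditional "wide" fields stay optional and can be filled in later.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ThinInteraction:
    # The "thin" core fact: two interactants, a relation, and the source abstract.
    interactant_a: str
    interactant_b: str
    relation: str        # e.g. "phosphorylates", "binds"
    pmid: str            # PubMed ID of the supporting abstract

@dataclass
class InteractionDetails:
    # Optional "wide" fields that the community can fill in later, with more effort.
    organism: Optional[str] = None
    method: Optional[str] = None
    location: Optional[str] = None
    evidence: Optional[str] = None
```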

Case Study with CBioC
- When an abstract is displayed, all of the interactions reported in that abstract are shown.
- The interactions are either automatically extracted in advance by our system or, for brand-new abstracts, extracted at display time, so the data becomes immediately available.
- Researchers then edit the extracted data, add new interactions, vote on the accuracy of the extraction, assign a confidence rating, and read comments from other researchers.
- If one or more of them digs deeper to obtain related information, the effort is not wasted and the rest of the community benefits.

Basic curation with CBioC
- Interactions are corrected, incorrect extractions are "voted down", and extractions are rated on reliability based on the experimental evidence presented by the author.
- It takes only a few seconds to vote on the correctness of an extraction (one plausible way to aggregate such votes is sketched below).
- With little effort from each researcher, information is made available immediately to the whole community.
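The slides do not spell out how "voted down" is decided, so the following is only one plausible aggregation rule (a minimum number of votes, then a simple majority), sketched in Python as an illustration rather than CBioC's actual logic.

```python
def vote_status(yes_votes: int, no_votes: int, min_votes: int = 3) -> str:
    """Classify an extracted interaction from community votes.

    A hypothetical majority rule: the status stays 'pending' until at
    least `min_votes` votes are in, then a simple majority decides.
    """
    if yes_votes + no_votes < min_votes:
        return "pending"
    return "accepted" if yes_votes > no_votes else "rejected"

# Example: 5 yes / 1 no -> accepted; 1 yes / 4 no -> rejected (voted down).
print(vote_status(5, 1), vote_status(1, 4))
```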

With more effort …
Any researcher who wishes to do a bit more can:
- add interactions missed by the extraction system,
- add interactions reported within the full article,
- fill in more fields in the database (such as organism, experimental method, location of the interaction, or supporting evidence).
Added interactions are subject to the community vote, just like the automatically extracted interactions.

Case Study 2: Modifying
A researcher could also modify the reported interactions. For example, consider the following statement in PMID : "PKCalpha but not PKCepsilon phosphorylated the catalytic subunit of the p110alpha/p85alpha PI3K."

Case Study 2: Modifying (continued)
- The automatic extraction system extracted (PKCepsilon, phosphorylates, p110alpha/p85alpha PI3K), an error caused by the grammatical construction of the statement (the toy extractor below reproduces this kind of mistake).
- In this case, the researcher should vote "No" on the accuracy of the extraction; this one cannot really be modified, so it will eventually be "voted down" by enough "No" votes.
- And/or the researcher can click "Modify", edit the interaction, and then rate its reliability based on the evidence presented by the author.
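A deliberately naive pattern-based extractor, written only to show how such an error arises; it is not CBioC's actual extraction system, and the heuristics are invented for the example. Because it ignores the negation cue "but not", it pairs every subject mention with the object and produces the erroneous PKCepsilon triple.

```python
import re

sentence = ("PKCalpha but not PKCepsilon phosphorylated the catalytic "
            "subunit of the p110alpha/p85alpha PI3K")

# Split around the verb and pair every "PKC..." mention on the left with the
# object on the right, ignoring the negation cue "but not".
left, _, right = sentence.partition("phosphorylated")
subjects = re.findall(r"PKC\w+", left)
obj = right.strip().removeprefix("the catalytic subunit of the ")

triples = [(s, "phosphorylates", obj) for s in subjects]
print(triples)
# [('PKCalpha', 'phosphorylates', 'p110alpha/p85alpha PI3K'),
#  ('PKCepsilon', 'phosphorylates', 'p110alpha/p85alpha PI3K')]  <- the error
```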

Addressing challenges
- Use ontologies and some automated tools to address consistency issues.
- To enter data, a user must register.
- Does each voter have equal weight? Trust management (a hypothetical sketch follows below).
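The question of unequal voter weight is left open on the slide. Purely as a hypothetical illustration of what trust management could look like, this sketch weights each vote by a per-user trust score; the scoring scheme is invented for the example.

```python
def weighted_vote(votes):
    """Combine (is_correct, trust) pairs into a score between 0 and 1.

    `trust` is a hypothetical per-user weight (e.g. based on registration
    status or past agreement with the community); equal weights reduce
    this to a plain majority fraction.
    """
    votes = list(votes)
    total = sum(trust for _, trust in votes)
    if total == 0:
        return 0.5  # no information yet
    return sum(trust for is_correct, trust in votes if is_correct) / total

# Two high-trust "Yes" votes outweigh three low-trust "No" votes here (~0.65).
print(weighted_vote([(True, 3.0), (True, 2.5), (False, 1.0), (False, 1.0), (False, 1.0)]))
```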

Summary so far
- The information/curation window pops up automatically.
- Automatic extraction is used as a bootstrap so that no user works from a blank slate.
- Users vote on correctness, make corrections, and add facts.
- Suppose the automatic extraction system has 60% precision and recall: a person will have an easier time discarding the 40% of extractions that are wrong than identifying the 60% of correct entries and entering them from scratch (worked through below)!
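A back-of-the-envelope version of that last claim, assuming 100 true interactions in some set of abstracts and taking the 60% precision and recall at face value; the specific numbers are only for illustration.

```python
true_facts = 100            # assumed number of true interactions in a corpus
precision = recall = 0.60   # the 60% figure quoted on the slide

correct = recall * true_facts    # 60 correct candidates surfaced automatically
shown = correct / precision      # 100 candidates presented in total
wrong = shown - correct          # 40 candidates to vote down
missed = true_facts - correct    # 40 facts still to add by hand

print(f"review {shown:.0f} candidates, discard {wrong:.0f}, add {missed:.0f} by hand "
      f"-- versus entering all {true_facts} facts from scratch")
```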

Very useful byproducts
- Avoids some problems with the existing human curation approach: curators' bias, curators missing things, curators' disagreements, slow access to the newest findings, and researchers at large having little or no control over what gets curated and when.
- A large curated corpus of text gets created, which is very useful for evaluating and improving automated extraction systems.

Other features
- Other abstracts related to a specific interaction are accessible through the "More Articles" link.
- We are in the process of integrating data from publicly available databases.
- All data (raw and processed) will be publicly available.
- We are working on an independent data access and querying engine (illustrated hypothetically below).
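The querying engine is only announced here, not specified. As a purely hypothetical sketch of the kind of access such an engine might offer, this snippet filters a tiny in-memory set of curated interaction records by gene name; the field names and records are invented for the example.

```python
# Hypothetical curated records; fields are illustrative, not CBioC's schema.
records = [
    {"a": "PKCalpha", "relation": "phosphorylates", "b": "p110alpha/p85alpha PI3K", "pmid": "..."},
    {"a": "RAF1", "relation": "binds", "b": "MEK1", "pmid": "..."},
]

def interactions_for(gene):
    """Return all curated interactions that mention the given gene."""
    return [r for r in records if gene in (r["a"], r["b"])]

print(interactions_for("PKCalpha"))
```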

Issues and further challenges
- Works well for certain kinds of knowledge curation (interactions, …), but what about others (genome annotation?)
- Null values.
- Full papers versus abstracts.
- Are thin schemas enough?
- Curating new kinds of knowledge.

Current status, current funding, call for collaboration
- Funded by Arizona State University.
- Second (basic) beta version released.
- Proposals submitted for a fully functional implementation.
- Some collaborations with outside groups are in the works.

Current publications
- Collaborative Curation of Data from Bio-medical Texts and Abstracts and its Integration. Chitta Baral, Hasan Davulcu, Mutsumi Nakamura, Prabhdeep Singh, Lian Yu, and Luis Tari. Proceedings of the 2nd International Workshop on Data Integration in the Life Sciences (DILS'05), San Diego, July 20–22, 2005. Lecture Notes in Computer Science, Springer.
- An initial report; ready to be submitted to a journal.

Group members and advisory board
- Postdocs: Lian Yu and Graciela Gonzalez
- Biomedical expertise: Geeta Joshi-Tope (curation), Mike Berens (signal transduction in oncology)
- Students: Luis Tari, Prabhdeep Singh, Anthony Gitter, Amanda Ziegler
- Advisory board: Gary Bader, Ken Fukuda, Shankar Subramanian

Thanks! Questions?