Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry
Nature Publishing Group, 11/2008
Antony Williams
Building a Structure Centric Community for Chemists
Imagine a time when…
- The internet is searchable by chemical structure and substructure (e.g. Wikipedia, Google Scholar)
- Chemistry articles are indexed and searchable by a free online service
- The web is linked together through the “language of chemistry”
- Publicly funded research data can be shared and discussed in the Open, maybe as ONS?
- Cheminformatics has as much of a public face as bioinformatics
ChemSpider - A Search Engine for Chemists
Questions a chemist might ask…
- What is the melting point of n-butanol?
- What is the chemical structure of Xanax?
- Chemically, what is phenolphthalein?
- What are the stereocenters of cholesterol?
- Where can I find publications about xylene?
- What are the different trade names for Ketoconazole?
- What is the NMR spectrum of Aspirin?
- What are the safety handling issues for Thymol Blue?
ChemSpider can answer all of these questions
What is a Structure? Ask a computer… ask a chemist
Tell Me About Glutathione
Link outs
Links out to KEGG (Kyoto Encyclopedia of Genes and Genomes)
How many names does a compound have?
ChemSpider Data Content
- Over 21.5 million unique chemical structures from ca. 150 data sources
- Online databases: PubChem, DrugBank, KEGG, Wikipedia
- Literature: PubMed, J Het Chem, Nature, RSC, Open Access
- Chemical vendors: over 40 different vendors and growing
- Personal depositions: individual contributions
- Content database vendors
- Analytical data collections
- Patents
- Web scraping
Content is linked back to the original data sources
Other Searches
- What compounds have a mass of 300 +/- 0.001?
- or search a combination of intrinsic/predicted properties
Complex Search
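The mass query on the "Other Searches" slide amounts to a tolerance-window filter, and a complex search layers further property constraints on top of it. A minimal Python sketch of that idea, assuming a small in-memory list of records; the `Compound` class, the function names and all values below are illustrative placeholders, not ChemSpider data or its API:

```python
from dataclasses import dataclass


@dataclass
class Compound:
    name: str                 # hypothetical record label
    monoisotopic_mass: float  # intrinsic property
    logp: float               # stands in for any predicted property


def search_by_mass(compounds, target, tolerance):
    """Return compounds whose mass lies within target +/- tolerance."""
    return [c for c in compounds if abs(c.monoisotopic_mass - target) <= tolerance]


def complex_search(compounds, target, tolerance, max_logp):
    """Combine the mass window with a second property constraint."""
    in_window = search_by_mass(compounds, target, tolerance)
    return [c for c in in_window if c.logp <= max_logp]


# Illustrative placeholder data only.
records = [
    Compound("compound A", 300.0004, 2.1),
    Compound("compound B", 300.0020, 1.5),
    Compound("compound C", 299.9995, 4.8),
]

hits = search_by_mass(records, 300.0, 0.001)
filtered = complex_search(records, 300.0, 0.001, max_logp=3.0)
print([c.name for c in hits])
print([c.name for c in filtered])
```

A production service would presumably run such a query against an indexed database rather than a Python list; the sketch only shows the filtering logic.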
The Quality of Data Online…
- Aggregating data opens up quality issues
- Structure-identifier associations are “dirty”
- Structures are COMMONLY incorrect
- Manual curation of small databases is enough work; what about millions of structures?
- Structures are far from perfect. What is a “correct structure”?
  - Full stereochemistry?
  - Historical timeline of structure?
  - Who is the authority?
Who holds THE Quality Authority?
- Chemical Abstracts Service is the structural authority today employees, world standard in chemistry information
- 101 years of knowledge, process and expertise.
- How can an online, free access system peacefully co-exist with the authority?
Quality is a Major Issue - Search Butanol
OLD / fixed
Wikipedia Chemistry Curation project
- Only ca organic structures, 7000 total structures
- Almost a year of work so far for a team of 6 people
- Many errors removed in the process. Curation process is a daily event for users/depositors
- Slow and torturous process
IUPAC_Name_and_structure
Wikipedia Curation
- Looking for self-consistency across a Wikipedia Page
- Primary key is the article TITLE
- The chemical shown needs to match the title
- Cyclic self-consistency, and decisions must get made
Viagra or Sildenafil
Other issues…
Charges
Sugars: Machine Readable vs Aesthetics (Haworth, Stereo, Fischer)
Wikipedia: Crowdsourcing Chemistry
Thymol Blue on ChemSpider
Data online includes:
- UV-vis spectrum
- Measured experimental properties
- Link to Wikipedia article
- Links to chromatography details
- Multiple identifiers/trade names etc.
- Links to vendors/suppliers/other databases
- Safety information
Differences between ChemSpider/Wikipedia
ChemSpider | Wikipedia
>21 million unique structures | ~5000 organics, 2000 others
Complex queries: Properties, Text, structure/substructure, OA publishers, Data Sources, … | Text
Prediction of properties | No
Analytical Data | No, but links.
Active depositors/curators: 30 | Active editors > 50 (?)
6000 people/day; 1900 registered | ????
Compound monographs linked | Detailed compound monographs
Differences between Wikipedia/ChemSpider
Wikipedia | ChemSpider
Supported by tried and tested MediaWiki platform. | Primarily Microsoft .NET technologies with OS components
Established infrastructure and Wikimedia Foundation Team | “Out of a basement” on three servers and 5 volunteers
Chemistry is a subset of the ‘Pedia | Chemistry is the focus of ‘Spider
GFDL licensing for everything | Mixed “licensing”
Strong team of WP:Chem advocates, curators and admins | Growing team of advocates, curators and users
Worldwide reputation as quality source, good and bad | Growing reputation as focused on quality
Crowd-sourcing Curation
- How to curate data for millions of structures?
- Robot processes can clean up depositions
  - Search for Chloride and check molecular formula for Cl
  - Check for stereochemistry and remove names with stereo
- Provide a simple-to-use platform to curate, annotate and tag data
- Provide curator administration to prevent vandalism (Veropedia)
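The two robot checks named on this slide can be sketched as simple consistency rules between a record's names and its structure. The Python below is a minimal illustration under stated assumptions: the record layout, the `clean_names` helper and the list of stereo descriptors are hypothetical, and the second rule is read here as "flag names carrying stereo descriptors when the structure defines no stereochemistry", which is one interpretation of the slide rather than a description of ChemSpider's actual robots.

```python
import re

# Element symbols: an uppercase letter optionally followed by a lowercase one.
ELEMENT = re.compile(r"[A-Z][a-z]?")

# Assumed stereo descriptors: (R), (S), (E), (Z), (2R,3S), (+), (-), (+/-).
STEREO = re.compile(r"\((?:\d*[RSEZ](?:,\d*[RSEZ])*|\+/-|\+|-)\)")


def elements(formula):
    """Return the set of element symbols found in a molecular formula."""
    return set(ELEMENT.findall(formula))


def clean_names(names, formula, has_stereo):
    """Split names into (kept, flagged) using the two consistency rules."""
    present = elements(formula)
    kept, flagged = [], []
    for name in names:
        if "chloride" in name.lower() and "Cl" not in present:
            flagged.append(name)   # name says chloride, formula has no Cl
        elif not has_stereo and STEREO.search(name):
            flagged.append(name)   # stereo in the name, none in the structure
        else:
            kept.append(name)
    return kept, flagged


# Hypothetical deposition: a structure with formula C4H10O and no stereo defined.
kept, flagged = clean_names(
    ["butan-2-ol", "(R)-butan-2-ol", "butyl chloride"],
    formula="C4H10O",
    has_stereo=False,
)
print(kept)
print(flagged)
```

Flagged names would still go to a curator for a decision; the rules only narrow down where human attention is needed.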
Post Comments
- Anyone can “Post Comments” associated with a structure. To curate data we require login to track
Multi-level Curation and Approval
Crowd-sourcing Chemistry
- Crowd-sourced curation: identify and tag errors, edit names, synonyms, identify records for deprecation
ALSO
- Crowd-sourced deposition: anyone can deposit data (structures, text, images, analytical data)
DailyMed
Quality of Structures
Quality of Structures!!!
Structure-Centric
- We want to search “information” by structure, substructure, similarity of structure
- Specific focus on Open Chemistry at present
- Standard approaches would be:
  - Identify chemical names (“entity extraction”)
  - Convert chemical names to structures and index
- ChemSpider has a validated dictionary of structure-name pairs
- Use name extraction, name-conversion and dictionary look-up. THEN curate.
“Entity Extraction”
- Rule-based recognition of systematic names:
  - Use a lexeme of name fragments
  - Rules for identifying bounds of a name
- Look-up dictionary:
  - Drug Names
  - Trivial Names
  - Numbers: Registry IDs, EINECS/ELINCS
- Massive look-up dictionary of validated identifiers on ChemSpider
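The two routes on this slide, fragment-based recognition of systematic names and dictionary look-up for everything else, can be illustrated with a toy extractor. This is a sketch under heavy assumptions: the fragment list, the two-entry dictionary, the "at least two fragments" rule and the boundary rule (adjacent systematic tokens are merged until punctuation ends the name) are all invented for illustration and are far smaller and cruder than any real system.

```python
import re

# Hypothetical, deliberately tiny list of name fragments.
FRAGMENTS = ["meth", "eth", "prop", "but", "phenyl", "amino", "chloro",
             "di", "tri", "ane", "anol", "anone", "yl"]

# Stand-in for a validated dictionary of trivial and drug names.
DICTIONARY = {"aspirin", "sildenafil"}


def segment(letters):
    """Return a list of fragments spelling `letters` exactly, or None."""
    if not letters:
        return []
    for frag in FRAGMENTS:
        if letters.startswith(frag):
            rest = segment(letters[len(frag):])
            if rest is not None:
                return [frag] + rest
    return None


def is_systematic(token):
    """A token counts as systematic if it splits into two or more fragments."""
    letters = re.sub(r"[\d,()\[\]-]", "", token.lower())  # drop locants, brackets
    parts = segment(letters)
    return parts is not None and len(parts) >= 2


def extract_names(text):
    """Return (name, source) pairs found in running text."""
    found, current = [], []

    def flush():
        if current:
            found.append((" ".join(current), "systematic"))
            current.clear()

    for raw in text.split():
        token = raw.strip(".,;:")
        if is_systematic(token):
            current.append(token)
            if raw[-1] in ".,;:":      # punctuation closes the name
                flush()
        else:
            flush()
            if token.lower() in DICTIONARY:
                found.append((token, "dictionary"))
    flush()
    return found


sample = ("Add (3,4-diaminophenyl)phenyl methanone, wash with "
          "dichloromethane, then compare with aspirin.")
names = extract_names(sample)
for name, source in names:
    print(source, name)
```

Requiring two or more fragments keeps ordinary words such as "but" from matching on their own, which hints at why the bounds rules and curation steps matter.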
Name Recognition
Azo aldehyde 2 was synthesized according to a reported method [17]. To a stirred solution of azo aldehyde 2 (1.08 g, 3.76 mmol) in dry CH2Cl2 (30.00 mL) at 0 °C were successively added (3,4-diaminophenyl)phenyl methanone 1 (0.40 g, 1.88 mmol) and an excess of anhydrous MgSO4 (2.00 g, 16.67 mmol). The resulting mixture was stirred for 6 hours at room temperature [18]. The mixture was filtered and washed with dichloromethane. Then the solvent was evaporated under reduced pressure to give azo Schiff base 3 as a red solid which was recrystallized from ethanol 95% (1.28 g, 91 %)
How Many Chemical Names?
“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”
Chemical names in the passage: drive, success, versed, Karate, in, tartan, dagger, spat, of, commando, advantage, release, recoil, He, aspirin, the
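The passage shows why plain dictionary look-up over-matches: with a large enough name dictionary, everyday words register as chemicals. A minimal Python sketch of that effect, using the words listed on the slide as a stand-in dictionary; the stop list and the `find_names` helper are assumptions added for illustration and are not described in the talk.

```python
import re

# Words the slide lists as chemical names, used here as a stand-in dictionary.
NAME_DICTIONARY = {
    "drive", "success", "versed", "karate", "in", "tartan", "dagger", "spat",
    "of", "commando", "advantage", "release", "recoil", "he", "aspirin", "the",
}

# Hypothetical stop list of very common English words.
STOP_WORDS = {"in", "of", "the", "he"}

PASSAGE = (
    "She had the drive to derive success in any venture and was well versed "
    "in Karate. When the man in the tartan shirt approached her with a dagger "
    "in his hand she spat in his face, took the stance of a commando and took "
    "advantage of his shock to release the dagger from his grip and causing "
    "him to recoil. He went home and took an aspirin after the beating."
)


def find_names(text, dictionary, stop_words=frozenset()):
    """Return dictionary hits in order of first appearance, without repeats."""
    hits = []
    for word in re.findall(r"[A-Za-z]+", text):
        key = word.lower()
        if key in dictionary and key not in stop_words and key not in hits:
            hits.append(key)
    return hits


all_hits = find_names(PASSAGE, NAME_DICTIONARY)
filtered = find_names(PASSAGE, NAME_DICTIONARY, STOP_WORDS)
print(len(all_hits), all_hits)
print(len(filtered), filtered)
```

A stop list removes the most obvious noise, but words like "drive" or "dagger" still need context or a curator to decide whether a chemical is really meant.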
ChemMantis
Chemical Markup And Nomenclature Transformation Integrated System
Making Open Access Articles Searchable: Proof of Concept
- Can we HOST Chemistry Open Access articles on ChemSpider and add value?
- Can we identify chemical names in Open Access articles in a user-friendly manner?
- Can we convert names to structures in Open Access articles, expand ChemSpider and provide structure searching of Open Access chemistry articles?
- Can we provide an environment for chemists to mark up their own articles and crowd-source markup of an archive?
Document markup
- ChemSpider now hosting Open Access articles from MDPI, Molecular Diversity Preservation International
- Hosting the Molbank collection at present
A Standard for Document Markup?
- NLM-DTD: National Library of Medicine Document Type Definition
- Approved markup definitions to apply to journal articles, extended as necessary for our purposes
NLM/DTD markup
Chemistry and Biology
- Menus can be extended as necessary
Document markup
Markup: 3 seconds!
On the fly conversion
Shorthand Formulae Supported
One Click to more Info…
Structure Image Conversion
Two Seconds Later
Not Always Perfect….
A Platform for Markup
- Can we provide a platform for document markup for chemists?
- Workflow:
  - Upload Word docs, RTF files or point to HTML and load
  - Apply entity extraction, convert names to structures, mark up automatically and ask for user participation
  - Publish final version with NLM-DTD markup
  - Deposit all structures on ChemSpider under embargo and wait for article DOI to release
Challenges
- Computer software can generate chemical names better than the majority of chemists
- The majority of chemical names are generated by humans and are incorrect: they convert to the wrong structure or are ambiguous
- One name, multiple structures
Names and Structures
- Dichloroacetone
- Trichloromethylsilane
Ambiguity
Ambiguity in Abbreviations - DPA
Ambiguity in Abbreviations - THF
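One way to handle "one name, multiple structures" is to have name conversion return every candidate and hand ambiguous cases to a curator instead of picking one silently. The Python sketch below assumes a hand-written candidate table; the entries are well-known readings of each name or abbreviation chosen for illustration, not necessarily the ones shown on the slides, and the `resolve` helper is hypothetical.

```python
# Assumed candidate readings for each name or abbreviation (illustrative only).
CANDIDATES = {
    "dichloroacetone": ["1,1-dichloroacetone", "1,3-dichloroacetone"],
    "trichloromethylsilane": ["trichloro(methyl)silane", "(trichloromethyl)silane"],
    "thf": ["tetrahydrofuran", "tetrahydrofolate"],
    "dpa": ["diphenylamine", "dipicolinic acid"],
    "1,3-dichloroacetone": ["1,3-dichloroacetone"],
}


def resolve(name):
    """Return (status, candidates) where status is resolved, ambiguous or unknown."""
    candidates = CANDIDATES.get(name.lower(), [])
    if len(candidates) == 1:
        return "resolved", candidates
    if candidates:
        return "ambiguous", candidates   # needs context or a human decision
    return "unknown", []


for query in ["Dichloroacetone", "1,3-Dichloroacetone", "THF", "xylene"]:
    status, options = resolve(query)
    print(query, status, options)
```

Returning the full candidate list keeps the ambiguity visible, so the surrounding text or a curator can settle it.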
Import is Easy
- Make articles Public/Private (embargo date soon)
- Auto-markup and check by user
IUPAC PAC Articles
Supports Word .DOC, HTML, RTF
Drexel University Documents
Patents
Single configuration file defines entities for markup
Algorithms can be built for certain entities but the majority are dictionaries: vendors, Phys Properties, Analytical
We can extend our system to support your needs based on dictionaries. What does NPG need/not need?
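A configuration-driven markup pass of this kind might look like the Python sketch below: one config lists the entity types, most backed by term dictionaries and one backed by an algorithm. Everything here is assumed for illustration, including the JSON layout, the entity names, the example terms and the choice of CAS Registry Number check-digit validation as the algorithmic entity; the source does not describe the actual file format.

```python
import json
import re

# Hypothetical configuration; in practice this would live in a single file.
CONFIG = json.loads("""
{
  "entities": [
    {"type": "vendor", "method": "dictionary", "terms": ["Example Chemicals Ltd"]},
    {"type": "analytical", "method": "dictionary", "terms": ["NMR", "UV-vis"]},
    {"type": "registry_number", "method": "algorithm", "name": "cas_rn"}
  ]
}
""")


def find_cas_numbers(text):
    """Find CAS-style numbers whose check digit is valid."""
    hits = []
    for match in re.finditer(r"\b(\d{2,7})-(\d{2})-(\d)\b", text):
        digits = match.group(1) + match.group(2)
        # Weight digits 1, 2, 3, ... from the right; the sum mod 10 is the check digit.
        checksum = sum(int(d) * i for i, d in enumerate(reversed(digits), start=1)) % 10
        if checksum == int(match.group(3)):
            hits.append(match.group(0))
    return hits


ALGORITHMS = {"cas_rn": find_cas_numbers}


def mark_up(text, config):
    """Return (entity type, matched text) pairs for every configured entity."""
    annotations = []
    lowered = text.lower()
    for entity in config["entities"]:
        if entity["method"] == "dictionary":
            for term in entity["terms"]:
                if term.lower() in lowered:   # naive substring match
                    annotations.append((entity["type"], term))
        else:
            for hit in ALGORITHMS[entity["name"]](text):
                annotations.append((entity["type"], hit))
    return annotations


text = "The NMR spectrum of aspirin (50-78-2) was recorded; 50-78-3 is not valid."
annotations = mark_up(text, CONFIG)
print(annotations)
```

Adding a new dictionary-backed entity then only requires a new config entry, which matches the slide's point that most extensions are dictionaries rather than code.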
Nature Publications
Entity Balloons
- Structures are the language of chemistry
- Show structures to chemists and search/link from there
Other Dictionaries - Species
- We are considering
  - Bacteria
  - Fungi
  - Enzymes
  - Viruses
  - PDB codes….
Integrations Out to Other Sources
Reactions
Manual Curation is Always Necessary
Text-Indexing and ChemSpider?
- ChemSpider text-indexes almost 500,000 Open Access and Free Access articles
- Collection is growing and more publishers have already agreed. Including theses in the future.
Open Access Literature Search
Conclusions
- The quality of structure-based data online should always be questioned, and that includes ChemSpider
- Data on ChemSpider are being added and curated on a daily basis, but we always need more eyeballs helping
- ChemSpider has a large validated structure-name dictionary
- Chemical name extraction and document markup is very enabling
Oops…