Extracting Recipes from Chemical Academic Papers

Presentation transcript:

Extracting Recipes from Chemical Academic Papers Lei Luo Today, I am going to give a presentation titled “Extracting Recipes from Chemical Academic Papers”. I will talk about what we have done as the domain analytics subteam on the LLNL project.

Extracting Recipes from Chemical Academic Papers Chemicals Extraction Tools Results Comparison Future Work Recipes Extraction Sample Results The final goal is to extract recipes from papers. That includes extracting chemical information, such as chemical names and synthesis parameters like temperature and pH value. It also includes extracting the complete recipes for certain chemicals. For the chemicals extraction part, I will talk about what tools we have explored, what results we have found, and future work. We have not yet done much work on extracting recipes, so I am only going to show some simple prototype results, and I will also talk about where I envision this going and how far we can take it.

Chemicals Extraction Tools Brat ChemTagger ChemDataExtractor For extracting chemicals, we have explored three different tools: Brat, ChemTagger, and ChemDataExtractor.

Chemicals Extraction Brat Web-based tool for text annotation; that is, for adding notes to existing text documents. Needs to define three things: Top level annotation definition. Second level annotation definition. Original text file. Needs manual annotation. Brat is a web-based tool for text annotation; that is, for adding notes to existing text documents. To work with Brat, we need three things: the top level annotation definition, the second level annotation definition, and the original text file. We have to provide all three of them.

Brat Top level annotation This is what the top level annotation file looks like. Here, we need to define the categories we will use to tag our target words. In this case, we have ORG, PER, LOC, and MISC.

Brat Second level annotation The next thing we need to provide is the location of the words we would like to tag and the categories we would like to assign to them. For example, we tag “De Morgen” as an organization, and its location in the text is 989 to 998.
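To make the second level annotation concrete, here is a minimal sketch of reading such a file in Python. It assumes the standard Brat standoff layout for text-bound annotations (an ID, then the category with start and end character offsets, then the covered text, separated by tabs). The ID `T1`, the helper names, and the padded sample document are made up for illustration, and the sketch only handles contiguous spans.

```python
from typing import List, NamedTuple


class TextBound(NamedTuple):
    ann_id: str   # annotation ID, e.g. "T1" (hypothetical here)
    label: str    # category from the top level definition, e.g. "ORG"
    start: int    # start character offset in the original text
    end: int      # end character offset (exclusive)
    text: str     # the tagged words


def parse_ann(ann_text: str) -> List[TextBound]:
    """Parse the text-bound ("T") lines of a Brat standoff annotation file."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip blank lines and other annotation kinds
        ann_id, label_and_offsets, covered = line.split("\t", 2)
        label, start, end = label_and_offsets.split(" ")
        spans.append(TextBound(ann_id, label, int(start), int(end), covered))
    return spans


# One annotation line matching the example from the slide.
SAMPLE_ANN = "T1\tORG 989 998\tDe Morgen\n"
spans = parse_ann(SAMPLE_ANN)

# A stand-in for the original text file: padding so the name starts at 989.
doc = " " * 989 + "De Morgen is a newspaper."
for span in spans:
    # The offsets should select exactly the tagged words in the original text.
    print(span.label, repr(doc[span.start:span.end]))
```

Checking each span against the original text like this is a cheap way to catch offsets that have drifted after the text file was edited.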

Brat Original text file The last thing we need is the original text file.

Brat Result Here is the Brat result. It color-highlights the words we manually tag. The goal of this project is to automatically extract chemicals from texts, so a tool that requires us to pick them out manually is of no use to us.

Chemicals Extraction ChemTagger Phrase-based semantic NLP tool for parsing the language of chemical experiments. Takes a string as input and produces an XML document as output. Uses a combination of OSCAR4, domain-specific regex and English taggers to identify parts-of-speech. The next tool we have used is ChemTagger.

ChemTagger Web-based interface It comes in two versions: a web-based one and Java source code for running it locally. This is the web-based version. In this textbox, we enter the text we want to extract chemicals from, and then click the “Process Text” button.

ChemTagger Web-based interface The result is shown. It color-highlights all the information we might be interested in, such as molecules, temperatures, and quantities.

ChemTagger Local This is the code snippet that uses its Java API. It takes a text string as input and then calls the relevant API to produce the XML output.

ChemTagger Result – XML & Chemicals Here is the XML output. We can see that the root of the tree is the document, followed by sentences and some other tags. We can use the tag <CHEM>, which stands for chemicals, to extract all the chemical entities, along with their properties.
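As a rough sketch of that extraction step, the Python snippet below walks an XML tree and collects the text under every <CHEM> tag. The sample XML is hand-written for illustration: the `Document`, `Sentence`, `QUANTITY`, and `TEMPERATURE` tag names and the nesting are assumptions based on the description above, not actual ChemTagger output, and `extract_chemicals` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; real ChemTagger output may use different tags and nesting.
SAMPLE_XML = """
<Document>
  <Sentence>
    <CHEM>sodium chloride</CHEM>
    <QUANTITY>5 g</QUANTITY>
    <CHEM>water</CHEM>
    <TEMPERATURE>80 C</TEMPERATURE>
  </Sentence>
</Document>
"""


def extract_chemicals(xml_text, tag="CHEM"):
    """Return the text of every element with the given tag, in document order."""
    root = ET.fromstring(xml_text.strip())
    names = []
    for node in root.iter(tag):
        # Join nested text pieces and collapse runs of whitespace.
        names.append(" ".join("".join(node.itertext()).split()))
    return names


chemicals = extract_chemicals(SAMPLE_XML)
print(chemicals)
```

Passing a different `tag` value, such as the assumed `TEMPERATURE`, would pull out the properties the same way.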

Chemicals Extraction ChemDataExtractor Able to automatically extract chemical names, properties, and spectra from scientific papers. Uses machine learning, custom dictionaries, and rule-based parsing grammars. Able to resolve data interdependencies. Extracts data from tables. Another tool we have used is ChemDataExtractor.

ChemDataExtractor Web-based interface It also has a web-based version and an API for Python. Here is one sample result.

ChemDataExtractor Local We can also use its API locally to extract chemical information.

ChemTagger vs ChemDataExtractor Example 1 In the next few slides I am going to show a comparison of results from ChemTagger and ChemDataExtractor. Both tools analyze the same text file, and here are the results they give us. It seems that ChemTagger finds more chemicals, such as water, but it also picks up non-chemical words, such as ADVANCE and pdf2.

ChemTagger vs ChemDataExtractor Example 2 Here is another example. Again, ChemTagger seems to find more chemicals, but some are repeated, such as CZTSeLayer, and some are possibly non-chemicals, such as EQE.

ChemTagger vs ChemDataExtractor Example 3 Here is another result. Again, ChemTagger gives us repeated results and identifies non-chemical names, such as NY, CA, and Inc.

ChemTagger vs ChemDataExtractor Example 4 Here is the last example. Because the text is cleaner than the previous ones, ChemTagger gives cleaner results.

ChemTagger vs ChemDataExtractor Results ChemTagger identifies chemicals and the properties. ChemDataExtractor tags chemicals. ChemTagger gives repetitive chemicals. ChemTagger also tags non-chemicals. ChemDataExtractor seems to be able to handle unclean text better than ChemTagger.

Chemicals Extraction Near Future Work Clean the results and combine. Chemical entities verification. Accuracy assessment. For near future work, I think we need to clean the results by removing repeated entries and non-chemicals, and then combine the results from the two tools. We also need to verify that the words that get picked up are really chemicals. We can verify this against a chemical database, such as PubChem. To assess the tools’ performance, we need to do an accuracy assessment: we can manually annotate some text and compare the annotations with the tools’ results.
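The cleaning, combining, and accuracy steps could look roughly like the Python sketch below. Everything in it is illustrative: the two tool outputs, the stoplist of non-chemicals, and the manually annotated "gold" list are invented stand-ins, the function names are hypothetical, and verification against PubChem is left out because it needs a network lookup.

```python
def clean(names, stoplist):
    """Drop repeats (case-insensitive) and known non-chemicals, keeping order."""
    seen = set()
    kept = []
    for name in names:
        key = name.strip().lower()
        if not key or key in stoplist or key in seen:
            continue
        seen.add(key)
        kept.append(name.strip())
    return kept


def precision_recall_f1(predicted, gold):
    """Compare predicted names with manually annotated names (case-insensitive)."""
    pred = {p.lower() for p in predicted}
    ref = {g.lower() for g in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented outputs standing in for the two tools' results on one paper.
chemtagger_out = ["water", "CZTSeLayer", "CZTSeLayer", "EQE", "NY"]
chemdataextractor_out = ["CZTSe", "Water"]

# Non-chemical words of the kind seen in the comparison slides.
stoplist = {"advance", "pdf2", "eqe", "ny", "ca", "inc"}

combined = clean(chemtagger_out + chemdataextractor_out, stoplist)

# Hypothetical manual annotation of the same paper.
gold = ["water", "CZTSe"]
precision, recall, f1 = precision_recall_f1(combined, gold)
print(combined, round(precision, 2), round(recall, 2), round(f1, 2))
```

A fixed stoplist only removes non-chemicals we have already seen; a database check such as PubChem would be needed to catch new ones.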

Recipes Extraction Sample Recipe We have not looked much into extracting recipes yet. Here is a simple example. This is the code snippet that manipulates the XML output from ChemTagger. It extracts the nodes whose tag is “ActionPhrase” from the parse tree. For each step there is an action, such as one chemical being added to another.
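The snippet on the slide is not reproduced in this transcript, so the following Python sketch only illustrates the idea: pull every “ActionPhrase” node out of the XML and treat each one as a recipe step. The sample XML, the `type` attribute, the `VERB` tag, and the `extract_steps` helper are assumptions made for the example rather than confirmed ChemTagger output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; tag names other than ActionPhrase and CHEM are assumed.
SAMPLE_XML = """
<Document>
  <Sentence>
    <ActionPhrase type="Add"><CHEM>Sodium chloride</CHEM> <VERB>was added to</VERB> <CHEM>water</CHEM></ActionPhrase>
  </Sentence>
  <Sentence>
    <ActionPhrase type="Heat"><CHEM>The solution</CHEM> <VERB>was heated to</VERB> <TEMPERATURE>80 C</TEMPERATURE></ActionPhrase>
  </Sentence>
</Document>
"""


def flatten(node):
    """Join all text under a node and collapse runs of whitespace."""
    return " ".join("".join(node.itertext()).split())


def extract_steps(xml_text):
    """Turn each ActionPhrase node into one numbered recipe step."""
    root = ET.fromstring(xml_text.strip())
    steps = []
    for number, phrase in enumerate(root.iter("ActionPhrase"), start=1):
        steps.append({
            "step": number,
            "action": phrase.get("type", "Unknown"),  # assumed attribute
            "chemicals": [flatten(chem) for chem in phrase.iter("CHEM")],
            "text": flatten(phrase),
        })
    return steps


steps = extract_steps(SAMPLE_XML)
for step in steps:
    print(step["step"], step["action"], step["text"])
```

Keeping the action, the chemicals, and the full phrase text together per step makes it easier to assemble the steps into a complete recipe later.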

Recipes Extraction Future Work More literature review. From a large number of papers we can get many different recipes for making the same chemical. For each paper we can extract chemicals and synthesis parameters. For future work, I think there are a couple of things we can do. First, there are not many papers on extracting recipes, so we need to find more papers to give us more ideas. Second, for a given chemical, we can find many papers from which we can extract chemicals and their synthesis parameters.

Recipes Extraction Future Work Build a database for chemicals. Use data mining to see under which conditions the chemical is more likely to be produced. Use machine learning models by providing examples of synthesis parameters and synthesis outcomes. Then, make predictions.