Procedural Information Extraction from Text: the Materials Informatics Domain Summer Work Review Sneha Gullapalli
CONTENTS: Metadata Based Extractor; Text Feature Analysis; Upgrades to Recipes Webapp; Improvements to Fast Annotator; PDF to Citation Converter Module; Summer Intern Work
Metadata based extractor: The main idea behind the metadata extractor is to use metadata features, such as font size and box height, as signals for identifying and extracting sections. PyMuPDF is a Python binding for MuPDF, "a lightweight PDF and XPS viewer".
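One way the font-size idea could be applied is sketched below: treat the most common font size as body text and flag clearly larger spans as section headings. The `Span` record, the `find_headings` helper, the `min_ratio` threshold, and the sample values are all hypothetical and are not taken from the project's code.

```python
from collections import Counter, namedtuple

# Hypothetical record for one text span; the real extractor's fields may differ.
Span = namedtuple("Span", ["text", "font_size", "box_height"])

def find_headings(spans, min_ratio=1.15):
    """Return the text of spans whose font size clearly exceeds the body size.

    Assumption: the most frequent font size on the page is the body text size.
    """
    if not spans:
        return []
    body_size = Counter(s.font_size for s in spans).most_common(1)[0][0]
    return [s.text for s in spans if s.font_size >= body_size * min_ratio]

# Made-up sample spans, for illustration only.
spans = [
    Span("1. Introduction", 14.0, 16.0),
    Span("Nanowires were grown on a substrate.", 10.0, 12.0),
    Span("The growth time was varied.", 10.0, 12.0),
    Span("2. Experimental", 14.0, 16.0),
    Span("The precursor was heated.", 10.0, 12.0),
]
headings = find_headings(spans)
print(headings)
```

A real extractor would likely combine several such features (font type, box height, position) rather than rely on font size alone.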
CONTD... The PyMuPDF library offers text extraction in the following formats: pure text, HTML, JSON, and XML. General structure of a TextPage.
XML Extraction: Information is available down to the character level. For each span: font type, font size, bounding box, and list of characters.
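The per-span fields listed above could be read from XML output along the following lines. This is a minimal sketch using only the standard library. The sample XML is a hypothetical, simplified layout, since the exact element and attribute names in PyMuPDF's XML output vary between versions. `read_spans` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page XML; real PyMuPDF output differs by version.
SAMPLE_XML = """
<page>
  <block bbox="50 40 300 60">
    <line bbox="50 40 300 60">
      <span font="Times-Bold" size="14" bbox="50 40 120 60">
        <char c="A"/><char c="b"/><char c="s"/>
      </span>
    </line>
  </block>
</page>
"""

def read_spans(xml_text):
    """Collect font, size, bounding box and text for every <span> element."""
    spans = []
    for span in ET.fromstring(xml_text.strip()).iter("span"):
        spans.append({
            "font": span.get("font"),
            "size": float(span.get("size")),
            "bbox": tuple(float(v) for v in span.get("bbox").split()),
            # Join the characters of the span back into a string.
            "text": "".join(ch.get("c", "") for ch in span.iter("char")),
        })
    return spans

spans = read_spans(SAMPLE_XML)
print(spans)
```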
Dynamic section extraction: With the metadata extractor, we are now able to extract sections dynamically instead of in a hardcoded way. However, the ordering of the sections on the webapp still needs to be taken care of. Plain Python dictionaries did not guarantee key order before Python 3.7, so we have looked into "OrderedDict", a dict subclass from Python's collections module that preserves insertion order, to keep the contents ordered in MongoDB as well as on the webapp.
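A minimal sketch of keeping sections in document order with `collections.OrderedDict` follows. The section titles and placeholder texts are hypothetical. How the mapping is then written to MongoDB or rendered by the webapp is not shown here.

```python
from collections import OrderedDict

# Hypothetical (title, text) pairs in the order they appear in a paper.
extracted = [
    ("Abstract", "..."),
    ("Introduction", "..."),
    ("Experimental", "..."),
    ("Results", "..."),
]

# OrderedDict remembers the order in which keys were inserted.
sections = OrderedDict()
for title, text in extracted:
    sections[title] = text

section_order = list(sections.keys())
print(section_order)
```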
Screenshot- showing extracted sections
Text Feature Analysis: In the initial stage, we generated a bag of words from 105 files. It consists of 7633 words, which are used as the vocabulary when generating the tf-idf vectorizers. In parallel, three (3) full batches of 2520 files each were annotated, and a best-of-three annotation was performed. Machine learning algorithms such as Naïve Bayes, Logistic, IB1, and Random Forest were applied; the results follow.
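To illustrate how a fixed bag-of-words vocabulary feeds into tf-idf weighting, here is a small standard-library sketch. It assumes one common tf-idf variant (term count divided by document length, times log(N / df)). The weighting actually used in the project is not stated in the slides, and the documents and vocabulary below are made up.

```python
import math
from collections import Counter

def tfidf(docs, vocabulary):
    """Return one {word: weight} dict per document, restricted to `vocabulary`.

    Assumption: tf is the raw count divided by document length and
    idf is log(N / df); other tf-idf variants weight terms differently.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each vocabulary word.
    df = Counter(w for tokens in tokenized for w in set(tokens) if w in vocabulary)
    vectors = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in vocabulary)
        vectors.append({
            w: (c / len(tokens)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return vectors

# Made-up documents and vocabulary, for illustration only.
docs = [
    "anneal the film at high temperature",
    "deposit the film on the substrate",
    "stir the solution",
]
vocabulary = {"anneal", "film", "deposit", "substrate", "stir", "solution", "the"}
vectors = tfidf(docs, vocabulary)
print(vectors[0])
```

Words that occur in every document (here "the") receive a weight of zero, while words unique to one document receive the highest weights.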
Text Feature Analysis
Text Feature Analysis: To make bag-of-words generation for a full batch more efficient, we are looking into implementing it with MLlib, Spark's machine learning (ML) library. The goal is to make practical machine learning scalable and easy, even for very large batches. This module is currently under study and has yet to be implemented.
Upgrades to Recipes Webapp: Breadcrumbs have been added to the webapp for easy navigation throughout the interface. The breadcrumbs show the current material and morphology and also offer a dropdown that lists all the materials and morphologies.
Upgrades to Recipes Webapp: A "Show Selected images" option has been added to the home page. The user can view all the images related to the selected material and morphology. In this view, the user can click on an image to see the details linked to it, such as its caption. The user can also download the image, and a "Go to paper" link navigates to the paper the image is linked to.
Screenshot - Show Selected images view
Screenshot – Showing image details
Improvements to Fast annotator: The resolution has been improved considerably and the text is now quite readable. Two text boxes have been added, as shown in the screenshot below. One box shows the gazetteer words and the other displays the top tf-idf words of the current PDF. Color highlighting: yellow represents tf-idf words; green represents gazetteer vocabulary.
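The two-color highlighting scheme could be sketched as below. This is only an illustration of the idea, not the annotator's actual code. The `highlight` helper, the use of HTML `<mark>` tags, and the rule that gazetteer words take precedence when a word is in both lists are all assumptions.

```python
import html
import re

def highlight(text, tfidf_words, gazetteer_words):
    """Wrap matching words in <mark> tags: yellow for tf-idf, green for gazetteer.

    Assumption: a word found in both sets is shown as gazetteer (green).
    Both sets are expected to contain lowercase words.
    """
    pieces = []
    # Splitting on a capturing group keeps both the words and the text between them.
    for token in re.split(r"(\w+)", text):
        key = token.lower()
        safe = html.escape(token)
        if key in gazetteer_words:
            pieces.append('<mark style="background:green">' + safe + "</mark>")
        elif key in tfidf_words:
            pieces.append('<mark style="background:yellow">' + safe + "</mark>")
        else:
            pieces.append(safe)
    return "".join(pieces)

marked = highlight("Anneal ZnO at 500 C", {"anneal"}, {"zno"})
print(marked)
```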
Screenshot – fast annotator interface
Pdf to citation converter module: A standalone Java module has been designed to convert a citation into a link that points to the corresponding PDF. Once the sections are extracted, the citations in the reference section are parsed and sent to the Google search API for results. This module needs to be integrated into the current version of the THOF crawler to improve the relevancy of the crawl.
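The parsing step described above might look roughly like the sketch below. It is written in Python for consistency with the other examples, although the actual module is in Java. The "[n]" numbering style, the helper names, and the placeholder references are assumptions, and the call to the search API itself is deliberately left out.

```python
import re
from urllib.parse import quote_plus

def split_references(reference_text):
    """Split a reference section into entries, assuming "[n]" style numbering."""
    entries = re.split(r"\s*\[\d+\]\s*", reference_text)
    return [e.strip() for e in entries if e.strip()]

def to_query(citation):
    """URL-encode one citation so it can be passed as a search query parameter."""
    return quote_plus(citation)

# Placeholder references, invented for illustration only.
refs = (
    "[1] A. Author, Growth of nanowires, J. Example 1 (2010). "
    "[2] B. Author, Thin films, J. Example 2 (2011)."
)
entries = split_references(refs)
queries = [to_query(e) for e in entries]
print(queries)
```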
Summer intern work: During Summer 2017, I interned as a Software Developer at Network Computer Solutions, St George, KS. I worked on designing a robust tablet application, "timeclock", from scratch. The initial prototype was designed using the "Materialize" cards interface. The application was implemented using TypeScript, a REST API, HTML, CSS, Materialize, and MySQL.
Timeclock - components: The application has two main views. i) Clockin view: it has four (4) modules: Clockin, Viewtimesheet, Missed Punch, and Missed Break. ii) Clockout view: it has five (5) modules, including Clockout, Change Job, Change Sublocation, and Change Job and Sublocation.
Screenshots- timeclock app
timeclock: This application is compiled and packaged as an Electron app. It is deployed in client environments with some improvements. Electron is an open-source library developed by GitHub for building cross-platform desktop applications with HTML, CSS, and JavaScript.
THANK YOU