IDA2: Intelligent Discovery of Acronyms and Abbreviations
Adam Mallen, under the advisement of Dr. Craig Struble and Dr. Lenwood Heath
Example Medline Abstract
AIM: In the present 6-month multicentre trial, the outcome of 2 different approaches to non-surgical treatment of chronic periodontitis, both involving the use of a locally delivered controlled-release doxycycline, was evaluated.
MATERIAL AND METHODS: 105 adult patients with moderately advanced chronic periodontitis from 3 centres participated in the trial. Each patient had to present with at least 8 periodontal sites in 2 jaw quadrants with a probing pocket depth (PPD) of >=5 mm and bleeding following pocket probing (BoP), out of which at least 2 sites had to be >=7 mm and a further 2 sites >=6 mm. Following a baseline examination, including assessments of plaque, PPD, clinical attachment level (CAL) and BoP, careful instruction in oral hygiene was given. The patients were then randomly assigned to one of two treatment groups: scaling/root planing (SRP) with local analgesia or debridement (supra- and subgingival ultrasonic instrumentation without analgesia). …
System Outline
I. Build an initial dictionary database using the Schwartz and Hearst abbreviation-finding algorithm.
II. Use this dictionary as a labeled training set to build an abbreviation disambiguation classifier.
III. Use the classifier to predict the expanded forms of ambiguous abbreviations and add them to the dictionary.
IV. Implement a web-based front-end interface for searching and interacting with the dictionary database.
Building the Dictionary
Scan all Medline baseline abstracts (almost 19 million) and find every abbreviation they define.
Use the Schwartz and Hearst algorithm to find abbreviations defined in an abstract in either of two forms:
i. long form '(' short form ')', e.g. clinical attachment level (CAL)
ii. short form '(' long form ')', e.g. CAL (clinical attachment level)
Create a database of the dictionary and the abstracts in which each abbreviation/long form pair has been found.
Create a front-end web interface for searching and interacting with the database.
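The two parenthetical patterns above can be sketched in a few lines of Python. This is a simplified illustration and not the project's Java code: `find_best_long_form` follows the right-to-left character matching described by Schwartz and Hearst, while `extract_pairs` and `is_valid_short_form` are hypothetical helper names that apply only a subset of the published filters.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left inside the candidate text.

    Returns the shortest suffix of `candidate` that covers every character,
    or None when some character cannot be matched.
    """
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if ch.isalnum():
            # Move left until the character matches; the first short-form
            # character must additionally sit at the start of a word.
            while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
                s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
            ):
                l_idx -= 1
            if l_idx < 0:
                return None
            l_idx -= 1
        s_idx -= 1
    # Extend back to the beginning of the word holding the last match.
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def is_valid_short_form(s):
    """Subset of the short-form filters: length, word count, characters."""
    return (
        2 <= len(s) <= 10
        and len(s.split()) <= 2
        and any(c.isalpha() for c in s)
        and s[0].isalnum()
    )


def extract_pairs(text):
    """Return (short_form, long_form) pairs for both parenthetical patterns."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", text):
        inside = match.group(1).strip()
        before = text[: match.start()].split()
        if not inside or not before:
            continue
        if len(inside.split()) > 2:
            # Pattern ii: short form '(' long form ')'
            short_form, candidate = before[-1], inside
        else:
            # Pattern i: long form '(' short form ')'
            short_form = inside
            max_words = min(len(short_form) + 5, len(short_form) * 2)
            candidate = " ".join(before[-max_words:])
        if not is_valid_short_form(short_form):
            continue
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs
```

Calling `extract_pairs` on a sentence containing "clinical attachment level (CAL)" yields the pair `("CAL", "clinical attachment level")`. A production version would also need sentence splitting and the remaining long-form sanity checks.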
Training the Disambiguation Model
Use the abbreviation instances found while building the dictionary as labeled training data.
Extract lexical features and MeSH headings from each abstract to use as training attributes for each abbreviation's long form.
Use a machine learning algorithm (such as a Naïve Bayes classifier, a Support Vector Machine, or a Vector Space Model) to build a classifier for predicting the long forms of ambiguous abbreviations.
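As one possible shape for such a classifier, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing, written from scratch in Python. The class name `NaiveBayesDisambiguator`, the bag-of-words features, and the toy training examples (including "calcaneus" as a second expansion of CAL) are assumptions made for illustration; the slides do not fix the features or the model.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesDisambiguator:
    """Multinomial Naive Bayes over context tokens for ONE abbreviation."""

    def __init__(self):
        self.class_counts = Counter()            # long form -> number of abstracts
        self.word_counts = defaultdict(Counter)  # long form -> token counts
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (context_tokens, long_form) pairs."""
        for tokens, long_form in examples:
            self.class_counts[long_form] += 1
            self.word_counts[long_form].update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        """Return the long form with the highest log posterior."""
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for long_form, n in self.class_counts.items():
            counts = self.word_counts[long_form]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for token in tokens:
                if token in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best, best_score = long_form, score
        return best


# Toy, invented training data: tokens could equally be MeSH headings.
model = NaiveBayesDisambiguator()
model.train([
    (["periodontitis", "probing", "pocket", "plaque"], "clinical attachment level"),
    (["periodontal", "pocket", "bleeding"], "clinical attachment level"),
    (["fracture", "heel", "bone"], "calcaneus"),
    (["bone", "foot", "radiograph"], "calcaneus"),
])
prediction = model.predict(["pocket", "plaque", "bone"])
```

In this sketch one model is trained per ambiguous abbreviation, so the classes are just that abbreviation's known long forms. MeSH headings can be added to the token list without changing the code.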
Progress
Wrote a Java program, built on LingPipe's Medline tutorial code and Dr. Struble's Java implementation of the Schwartz and Hearst algorithm, to parse Medline abstracts, find abbreviation/long form pairs, and add them to the dictionary database.
Used Condor to run this program in parallel on the entire Medline baseline.
Found 1,497,702 unique abbreviation/long form pairs in 4,126,655 abstracts.
Database Schema
Dictionary: Id, short_form, long_form
Abstracts: Id, Pmid
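The two tables can be tried out with an in-memory SQLite database. The sketch below assumes that `Abstracts.Id` refers to `Dictionary.Id`, so each `Abstracts` row links one abbreviation/long form pair to one PubMed ID; the slide does not state this, nor does it name the database engine. The function names and the PMID values are placeholders.

```python
import sqlite3


def create_schema(conn):
    """Create the two tables; keys and constraints are assumptions."""
    conn.executescript(
        """
        CREATE TABLE Dictionary (
            Id INTEGER PRIMARY KEY,
            short_form TEXT NOT NULL,
            long_form TEXT NOT NULL,
            UNIQUE (short_form, long_form)
        );
        CREATE TABLE Abstracts (
            Id INTEGER NOT NULL REFERENCES Dictionary(Id),
            Pmid INTEGER NOT NULL,
            PRIMARY KEY (Id, Pmid)
        );
        """
    )


def add_occurrence(conn, short_form, long_form, pmid):
    """Record that a pair was found in an abstract; return the pair's Id."""
    conn.execute(
        "INSERT OR IGNORE INTO Dictionary (short_form, long_form) VALUES (?, ?)",
        (short_form, long_form),
    )
    (pair_id,) = conn.execute(
        "SELECT Id FROM Dictionary WHERE short_form = ? AND long_form = ?",
        (short_form, long_form),
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO Abstracts (Id, Pmid) VALUES (?, ?)",
        (pair_id, pmid),
    )
    return pair_id


def occurrences_of(conn, short_form):
    """List (long_form, pmid) rows for one abbreviation."""
    return conn.execute(
        """
        SELECT d.long_form, a.Pmid
        FROM Dictionary AS d JOIN Abstracts AS a ON a.Id = d.Id
        WHERE d.short_form = ?
        ORDER BY d.long_form, a.Pmid
        """,
        (short_form,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
create_schema(conn)
first_id = add_occurrence(conn, "CAL", "clinical attachment level", 1)   # placeholder PMIDs
second_id = add_occurrence(conn, "CAL", "clinical attachment level", 2)
add_occurrence(conn, "PPD", "probing pocket depth", 1)
```

Storing each pair once and each pair/PMID link once keeps repeated sightings of the same pair from inflating the dictionary.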
Future Work
Current Work:
Exploring statistics and details of the dataset, such as the number of long forms associated with each abbreviation and how frequently each appears in the abstracts.
Building the web interface for interacting with the database.
Future Work:
Decide on features for model training and write tools for extracting these features and training the classifier.
Find ambiguous abbreviations in Medline abstracts and predict their long forms using the classifier.
Add these entries to the database.
Create a pipeline for doing this automatically as additional Medline abstracts are released.
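The dataset statistics mentioned under current work (long forms per abbreviation and their frequency across abstracts) could be computed along these lines. The record layout `(short_form, long_form, pmid)`, the function names, and the toy records are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def long_form_stats(occurrences):
    """Map each short form to its long forms with abstract counts.

    occurrences: iterable of (short_form, long_form, pmid) records.
    Each (pair, abstract) combination is counted once.
    """
    freq = defaultdict(Counter)
    seen = set()
    for record in occurrences:
        if record in seen:
            continue
        seen.add(record)
        short_form, long_form, _pmid = record
        freq[short_form][long_form] += 1
    return {short: counts.most_common() for short, counts in freq.items()}


def ambiguous_short_forms(stats):
    """Short forms that map to more than one long form."""
    return sorted(short for short, forms in stats.items() if len(forms) > 1)


# Invented toy records; the PMIDs are placeholders.
records = [
    ("CAL", "clinical attachment level", 1),
    ("CAL", "clinical attachment level", 2),
    ("CAL", "clinical attachment level", 2),  # duplicate, ignored
    ("CAL", "calcaneus", 3),
    ("PPD", "probing pocket depth", 1),
]
stats = long_form_stats(records)
```

Abbreviations returned by `ambiguous_short_forms` are the ones that would need the disambiguation classifier.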
Questions?