Medical Data Classifier
Undergraduate project
By: Avikam Agur and Maayan Zehavi
Advisors: Prof. Michael Elhadad and Mr. Tal Baumel
Motivation
word2vec: an algorithm that learns word vectors in which closely related words lie close together. Combined with the outcome of our project, this algorithm will help create a medical text summarizer.
Project Goals
● Create a fast, scalable, highly accurate, machine-learning-based classifier that predicts whether a given document is medical or not.
● Run this classifier in a distributed fashion over a large volume of web content and extract the medical documents.
Building a labeled dataset
Problem: Manually collecting medical and non-medical documents is almost impossible.
Solution: Using Wikipedia’s archive files, we tagged wiki pages based on their category and title.
Result: A decent amount of medical and non-medical data.
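A minimal sketch of what tagging by category and title could look like. The slide does not give the actual rules, so the keyword list and the `tag_page` function below are hypothetical; only the idea of labeling a page from its title and categories comes from the project.

```python
# Hypothetical keyword list; the project's real tagging rules are not given on the slide.
MEDICAL_KEYWORDS = {"disease", "medicine", "syndrome", "anatomy", "drug"}


def tag_page(title, categories):
    """Return 'medical' if the title or any category contains a medical keyword."""
    text = " ".join([title] + list(categories)).lower()
    words = set(text.replace("-", " ").split())
    return "medical" if words & MEDICAL_KEYWORDS else "non-medical"


# Toy pages as (title, categories) pairs, invented for illustration.
pages = [
    ("Marfan syndrome", ["Genetic disorders", "Connective tissue"]),
    ("Eiffel Tower", ["Towers in Paris", "Landmarks in France"]),
    ("Aspirin", ["Nonsteroidal anti-inflammatory drug"]),
]
labels = [tag_page(title, cats) for title, cats in pages]
```

A rule this simple would mislabel many pages; it only illustrates how labels can be derived automatically instead of by hand.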
Training Phase
Data transformation flow: Documents -> Boilerpipe -> Tokenizing -> TF-IDF -> Feature selection
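The tokenizing, TF-IDF, and feature selection steps of this flow can be sketched in plain Python. This is an illustration under assumptions, not the project's code: the tokenizer, the `tf * log(N / df)` weighting variant, and the selection criterion (highest summed TF-IDF weight) are all choices made here, and the Boilerpipe step is assumed to have already produced clean text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def select_features(vectors, k):
    """Keep the k terms with the highest summed weight (an assumed criterion)."""
    totals = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy documents standing in for text already cleaned of boilerplate.
docs = [
    "The patient was given aspirin for the fever.",
    "The tower was built for the fair.",
]
vectors = tf_idf(docs)
features = select_features(vectors, 3)
```

Terms that occur in every document, such as "the", get weight zero here and so never survive feature selection.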
Classifier Evaluation - Measures
Evaluation Phase
Configuration parameters:
● Classification algorithm
● Number of features
● Stemming or not
Each configuration was trained and then tested on a random 5% sample of the tagged dataset.
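The evaluation loop can be sketched as follows. Several things here are assumptions: that the 5% test sample is held out from training, that the score is the F-measure of the medical class, and the algorithm names and feature counts in the grid, which the slide does not list.

```python
import itertools
import random


def split_holdout(items, test_fraction=0.05, seed=0):
    """Shuffle a copy of items and hold out test_fraction of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def f_measure(true_labels, predicted_labels, positive="medical"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical value grid; the slide names the parameters but not their values.
algorithms = ["naive_bayes", "svm"]
feature_counts = [1000, 5000]
stemming = [False, True]
configurations = list(itertools.product(algorithms, feature_counts, stemming))

# Toy data: 200 document indices, and four hand-made label pairs.
train, test = split_holdout(range(200))
score = f_measure(
    ["medical", "medical", "non-medical", "non-medical"],
    ["medical", "non-medical", "medical", "non-medical"],
)
```

Each tuple in `configurations` would be trained on `train` and scored on `test`, which gives one F-measure per configuration to compare.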
Results - Graph
[Graph: average F-measure vs. feature count]
Distributed Programming Phase
● Use the Apache Spark framework.
● Iterate over the ClueWeb web archives (~14 TB) in a master-slave architecture.
● Use the same pipeline as in training to convert each web document to a vector.
● Tag each document and export the IDs of the documents tagged as medical.
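The per-worker part of this phase might look like the sketch below. It is written in plain Python so it runs without a cluster; the function name, the `(doc_id, raw_text)` record shape, the document IDs, and the toy vectorizer and classifier are all hypothetical stand-ins for the trained pipeline.

```python
def tag_partition(records, vectorize, predict):
    """Yield the IDs of the records in one partition that are predicted medical.

    records   -- iterable of (doc_id, raw_text) pairs
    vectorize -- function turning raw text into a feature vector
    predict   -- function turning a feature vector into a label
    """
    for doc_id, raw_text in records:
        if predict(vectorize(raw_text)) == "medical":
            yield doc_id


# Toy stand-ins for the trained pipeline and classifier (hypothetical).
def toy_vectorize(text):
    return set(text.lower().split())


def toy_predict(vector):
    return "medical" if "aspirin" in vector else "non-medical"


# Toy partition; real records would come from the web archives.
partition = [
    ("doc-001", "Aspirin reduces fever"),
    ("doc-002", "The tower is open today"),
]
medical_ids = list(tag_partition(partition, toy_vectorize, toy_predict))
```

With PySpark, a function of this shape could be passed to `RDD.mapPartitions`, so that each worker tags its own share of the archives and only the matching IDs are collected. How the project actually wired this up is not stated on the slide.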
Questions?