Published by Margaret Hancock. Modified over 9 years ago.
1
Medical Data Classifier undergraduate project By: Avikam Agur and Maayan Zehavi Advisors: Prof. Michael Elhadad and Mr. Tal Baumel
2
Motivation word2vec: An algorithm that associates closely related words. Combined with the outcome of our project, this algorithm will help create a medical text summarizer.
3
Project Goals ● Create a fast, scalable, highly accurate, machine-learning-based classifier that predicts whether a given document is medical or not. ● Run this classifier in a distributed fashion over a large amount of web content and extract the medical documents.
4
Building a labeled dataset Problem: Manually collecting medical and non-medical documents is almost impossible. Solution: Using Wikipedia's archive files, we tagged wiki pages based on their category and title. Result: A decent amount of medical and non-medical data.
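The tagging idea can be illustrated with a minimal sketch. The slides do not list the actual rules, so the hint words, the `label_page` function and the sample pages below are all hypothetical.

```python
# Hypothetical hint words; the real tagging rules are not given in the slides.
MEDICAL_HINTS = {"disease", "medicine", "syndrome", "surgery"}

def label_page(title, categories, hints=MEDICAL_HINTS):
    """Tag a wiki page as medical if its title or any category contains a hint word."""
    text = " ".join([title] + list(categories)).lower()
    words = set(text.replace(":", " ").split())
    return "medical" if words & hints else "non-medical"

# Made-up sample pages given as (title, categories).
pages = [
    ("Influenza", ["Category:Infectious disease"]),
    ("Eiffel Tower", ["Category:Towers in Paris"]),
]
labels = [label_page(title, cats) for title, cats in pages]
```

A rule this simple would mislabel many pages; it only shows how category and title text can stand in for manual labeling.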
5
Training Phase Data transformation flow: Documents → Boilerpipe → Tokenizing → TF-IDF → Feature selection
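The tokenizing, TF-IDF and feature selection steps might look roughly like the sketch below. It assumes the text has already been extracted from HTML (the Boilerpipe step is skipped), uses the plain tf × log(N / df) weighting, and ranks features by summed weight as a stand-in criterion; the slides do not say which variants the project used, and all function names are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return vectors

def select_features(vectors, k):
    """Keep the k terms with the highest summed weight (assumed criterion)."""
    totals = Counter()
    for vec in vectors:
        totals.update(vec)
    return [term for term, _ in totals.most_common(k)]

docs = ["The patient was treated for pneumonia.",
        "The tower was built in Paris."]
vectors = tf_idf(docs)
features = select_features(vectors, 3)
```

Terms that occur in every document, such as "the" here, receive a weight of zero and are never selected ahead of more distinctive terms.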
6
Training Phase
7
Classifier Evaluation - Measures
8
Evaluation Phase Configuration Parameters: ● Classification algorithm ● Number of features ● Stemming or not. Each configuration was trained and then tested on a random 5% of the tagged dataset.
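A random 5% holdout and an F-measure computation can be sketched as follows. The function names and the fixed seed are assumptions, and the F-measure here is the standard F1 score for the "medical" class only; the slides do not state how their average F-measure was computed.

```python
import random

def holdout_split(items, test_fraction=0.05, seed=0):
    """Shuffle a copy of items and return (train, test) with a random test subset."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def f_measure(y_true, y_pred, positive="medical"):
    """F1 score: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

train, test = holdout_split(list(range(100)))
score = f_measure(
    ["medical", "medical", "non-medical", "non-medical"],
    ["medical", "non-medical", "medical", "non-medical"],
)
```

Each configuration (algorithm, number of features, stemming) would be trained on `train` and scored on `test` in the same way, so the scores are comparable.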
9
Results - Graph [Figure: average F-measure versus feature count]
10
Distributed Programming Phase ● Use the Apache Spark framework. ● Iterate over the ClueWeb web archives (~14 TB) in a master-slave architecture. ● Use the same training pipeline to convert each web document to a vector. ● Tag each document and export the IDs of the medical-tagged documents.
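The distributed tagging step could be wired up along these lines. This is a sketch under several assumptions: `predict` stands for the trained pipeline plus classifier, `wholeTextFiles` is a simple stand-in reader (ClueWeb's archive format would need its own parser), file paths are used as document IDs, and the application name and function names are hypothetical.

```python
def classify_partition(records, predict):
    """Yield the IDs of documents in one partition that predict() tags as medical."""
    for doc_id, text in records:
        if predict(text) == "medical":
            yield doc_id

def run_on_spark(input_path, output_path, predict):
    """Tag every document under input_path and save the medical document IDs."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="MedicalDocumentFilter")
    docs = sc.wholeTextFiles(input_path)  # RDD of (path, content) pairs
    medical_ids = docs.mapPartitions(
        lambda part: classify_partition(part, predict))
    medical_ids.saveAsTextFile(output_path)
    sc.stop()

# Local check of the per-partition logic with a toy predictor.
def toy_predict(text):
    return "medical" if "patient" in text else "non-medical"

sample = [("doc-a", "the patient recovered"), ("doc-b", "the tower is tall")]
medical = list(classify_partition(sample, toy_predict))
```

Working per partition lets each worker reuse one loaded model for many documents instead of reloading it for every record.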
11
Questions?