Medical Data Classifier
Undergraduate project
By: Avikam Agur and Maayan Zehavi
Advisors: Prof. Michael Elhadad and Mr. Tal Baumel


Motivation
word2vec: an algorithm that maps words to vectors so that closely related words end up near each other. Combined with the outcome of our project, it will help in building a medical text summarizer.

Project Goals
● Create a fast, scalable, highly accurate, machine-learning-based classifier that predicts whether a given document is medical or not.
● Run this classifier in a distributed fashion over a large volume of web content and extract the medical documents.

Building a labeled dataset
● Problem: Manually collecting medical and non-medical documents is almost impossible.
● Solution: Using Wikipedia's archive files, we tagged wiki pages based on their category and title.
● Result: A decent amount of medical and non-medical data.
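The tagging rule could look roughly like the sketch below. It is only an illustration: the keyword lists, the `tag_page` helper, and the decision to leave ambiguous pages unlabeled are assumptions, since the slides do not say which categories or title terms were used.

```python
# Hypothetical keyword lists; the slides do not name the actual categories.
MEDICAL_HINTS = ("medicine", "disease", "medical", "anatomy")
NON_MEDICAL_HINTS = ("sport", "music", "film", "politics")


def tag_page(title, categories):
    """Return 'medical', 'non-medical', or None when the page is ambiguous."""
    text = " ".join([title, *categories]).lower()
    is_medical = any(hint in text for hint in MEDICAL_HINTS)
    is_other = any(hint in text for hint in NON_MEDICAL_HINTS)
    if is_medical and not is_other:
        return "medical"
    if is_other and not is_medical:
        return "non-medical"
    return None  # matches both lists or neither: leave out of the dataset


# Made-up pages in the form (title, list of category names).
pages = [
    ("Influenza", ["Infectious diseases", "Medical conditions"]),
    ("Association football", ["Team sports"]),
    ("Sports medicine", ["Medicine", "Sports science"]),
]
labels = [tag_page(title, cats) for title, cats in pages]
```

Dropping ambiguous pages keeps the labels cleaner at the cost of a smaller dataset, which is one plausible trade-off for an automatically tagged corpus.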

Training Phase
Data transformation flow: Documents -> Boilerpipe -> Tokenizing -> TF-IDF -> Feature selection
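A minimal, dependency-free sketch of the tokenizing, TF-IDF, and feature-selection stages follows. It assumes the text has already been extracted from HTML (the Boilerpipe stage is skipped), and the selection criterion, the difference in mean weight between the two classes, is an assumption because the slides do not name the method that was used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs_tokens):
    """Return one {term: weight} dict per document (relative tf times log idf)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def select_features(vectors, labels, k):
    """Keep the k terms whose mean weight differs most between the classes."""
    sums = {True: defaultdict(float), False: defaultdict(float)}
    n_in_class = {True: 0, False: 0}
    for vec, label in zip(vectors, labels):
        n_in_class[label] += 1
        for term, weight in vec.items():
            sums[label][term] += weight

    def score(term):
        pos = sums[True].get(term, 0.0) / max(n_in_class[True], 1)
        neg = sums[False].get(term, 0.0) / max(n_in_class[False], 1)
        return abs(pos - neg)

    vocab = set(sums[True]) | set(sums[False])
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-score(t), t))[:k]


# Made-up documents; True marks a medical document.
docs = [
    "The patient received a vaccine for influenza.",
    "The team won the football match.",
    "Influenza symptoms include fever and cough.",
]
labels = [True, False, True]
vectors = tf_idf([tokenize(d) for d in docs])
features = select_features(vectors, labels, k=4)
```

On a corpus this small the selected terms are not meaningful; the point is only the shape of the data at each stage, from raw text to token lists to sparse weighted vectors to a reduced vocabulary.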

Training Phase

Classifier Evaluation - Measures

Evaluation Phase
Configuration parameters:
● Classification algorithm
● Number of features
● Stemming or not
Each configuration was trained and then tested on a random 5% of the tagged dataset.
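The evaluation loop can be sketched as below: enumerate the configurations, hold out a random 5% of the tagged data for testing, and score predictions with the F-measure. The parameter values in `grid`, the helper names, and the fixed random seed are assumptions for illustration; the slides list the parameters but not the values that were tried.

```python
import itertools
import random


def split_holdout(items, test_fraction=0.05, seed=0):
    """Shuffle a copy of items and return (train, test) with a random hold-out."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (medical) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical values for the three parameters named on the slide.
grid = {
    "algorithm": ["naive_bayes", "svm"],
    "n_features": [1000, 5000],
    "stemming": [True, False],
}
configurations = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]

# Stand-in for the tagged dataset: 200 document indices.
train, test = split_holdout(range(200))
```

Each entry of `configurations` would then drive one train-and-test run on the same split, so that the F-measure scores are comparable across configurations.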

Results - Graph
[Figure: average F-measure as a function of feature count]

Distributed Programming Phase
● Use the Apache Spark framework.
● Iterate over the ClueWeb web archives (~14 TB) in a master-slave architecture.
● Use the same pipeline as in training to convert each web document to a vector.
● Tag each document and export the IDs of the documents tagged as medical.
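The per-worker step might look like the sketch below. To stay free of a Spark dependency it runs on plain Python lists; `medical_ids` is written as a per-partition generator of the kind that could be passed to PySpark's `RDD.mapPartitions`. The `vectorize` and `is_medical` functions are stand-ins for the trained pipeline and classifier, and the document IDs and texts are made up.

```python
def vectorize(text, vocabulary):
    """Stand-in for the training pipeline: count each vocabulary term."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]


def is_medical(vector):
    """Stand-in classifier (hypothetical rule): any vocabulary hit is medical."""
    return sum(vector) > 0


def medical_ids(partition, vocabulary):
    """Yield the IDs of the documents in one partition tagged as medical."""
    for doc_id, text in partition:
        if is_medical(vectorize(text, vocabulary)):
            yield doc_id


vocabulary = ["vaccine", "influenza", "fever"]

# Two made-up partitions of (document ID, extracted text) pairs.
partitions = [
    [("doc-1", "new influenza vaccine approved"), ("doc-2", "football results")],
    [("doc-3", "stock markets fall"), ("doc-4", "fever and cough remedies")],
]

# Locally, "collecting" the result is just a flattening loop over partitions.
exported = [
    doc_id for part in partitions for doc_id in medical_ids(part, vocabulary)
]
```

Processing a whole partition per call lets a worker load the vocabulary and model once and reuse them for every document it handles, which matters when the input is measured in terabytes.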

Questions?