
Final Presentation Tong Wang

1. Automatic Article Screening in Systematic Review
2. Compression Algorithm on Document Classification

Automatic Article Screening
Review question: Vitamin C for preventing and treating the common cold?
Data set: 17 reference articles, 664 non-reference articles.

Problem Definition
Input: a document d and classes (c1 = reference, c2 = not a reference)
Output: the predicted class of d
Goal: find all articles belonging to c1 (reference)

Build Features
"Bag of words" assumption: the order of words in a document can be neglected.
Preprocessing: tokenization, lemmatization, stop-word removal, and removal of some parts of speech.
One additional step is Named Entity Recognition (NER), which labels sequences of words that are the names of things; it is implemented with a linear-chain Conditional Random Field (CRF).

Build Features
Vector space model: extract the vocabulary over all articles (N = size of the vocabulary). Each document is then represented by a vector (w1, w2, w3, … wN), where the value in each dimension is the frequency of that word in the article.
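The vector space construction above can be sketched in plain Python; the two short documents here are invented for illustration, and real tokenization would include the preprocessing steps (lemmatization, stop-word removal) described earlier.

```python
from collections import Counter

# Invented example documents standing in for the screened articles.
docs = [
    "vitamin c prevents the common cold",
    "vitamin c treats the common cold",
]

# Tokenize (here: a simple whitespace split) and build the vocabulary
# over all documents.
tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})

def tf_vector(tokens, vocab):
    """Return the term-frequency vector of one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vectors = [tf_vector(toks, vocab) for toks in tokenized]
```

Each row of `vectors` is one document's (w1, …, wN) frequency vector over the shared vocabulary.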

Naïve Bayes
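A minimal multinomial Naive Bayes sketch for the two-class screening task (c1 = reference, c2 = not a reference). The tiny training set is invented for illustration; a real run would use the 17 + 664 preprocessed articles.

```python
import math
from collections import Counter, defaultdict

# Invented labeled examples: c1 = reference, c2 = not a reference.
train = [
    ("vitamin c common cold trial", "c1"),
    ("vitamin c cold prevention study", "c1"),
    ("machine learning kernel methods", "c2"),
    ("support vector machine text", "c2"),
]

# Word counts per class and class priors.
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Pick the class maximizing log P(c) + sum_w log P(w|c),
    with add-one (Laplace) smoothing."""
    best, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

The bag-of-words assumption from the feature slide is exactly what licenses the per-word factorization of P(d|c) here.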

Logistic Regression
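A sketch of binary logistic regression trained by gradient descent on the log loss, as it would apply to the screening task (y = 1 for "reference"). The two-dimensional feature vectors are invented term frequencies, not real article data.

```python
import math

# Invented term-frequency features, e.g. counts of two indicator words.
X = [[2.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0]]
y = [1, 1, 0, 0]  # 1 = reference (c1), 0 = not a reference (c2)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Weights and bias, updated by per-example gradient descent on log loss.
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(1000):
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        err = p - yi  # gradient of log loss w.r.t. the logit
        w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
        b -= lr * err

def predict_proba(x):
    """P(y = 1 | x) under the trained model."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```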

Discussion
Define a loss matrix that gives a high penalty to false negatives.
Another way is to use cosine distance to compute similarity between articles: cos(x, y) = (x · y) / (|x| |y|).
Use other probabilistic NLP models, such as LSA or LDA.
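The cosine-similarity alternative mentioned above, cos(x, y) = (x · y) / (|x| |y|), is a one-liner over the term-frequency vectors:

```python
import math

def cosine(x, y):
    """Cosine similarity of two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```

Identical directions give 1.0 and orthogonal vectors (no shared words) give 0.0, so 1 − cos(x, y) can serve as the cosine distance between articles.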

Compression
The basic idea is that data containing patterns that occur with a certain regularity will be compressed more efficiently. Compression is also generally computationally inexpensive.

d(x, y) = c(xy) / (c(x) + c(y))
x: a document; c(x): the size of the compressed file x; xy: the file obtained by concatenating x and y.
d(x, y) − 1/2 ≥ 0
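The distance above can be sketched with any off-the-shelf compressor; here zlib from the Python standard library stands in for whichever compressor the experiments actually used.

```python
import zlib

def c(data: bytes) -> int:
    """Size of the compressed file, using zlib as the compressor."""
    return len(zlib.compress(data))

def d(x: bytes, y: bytes) -> float:
    """Compression distance: c(xy) / (c(x) + c(y))."""
    return c(x + y) / (c(x) + c(y))
```

When x and y share many patterns, the compressor encodes the second half of xy cheaply by reusing the first, so c(xy) stays close to c(x) and d(x, y) approaches its lower bound of 1/2; unrelated documents push d(x, y) toward 1.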

Compression Matrix
        a1         a2         a3   a4 …
b1   d(b1, a1)  d(b1, a2)
b2   d(b2, a1)  d(b2, a2)
b3
b4
…
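Filling in the matrix is a nested loop over the two document groups. This self-contained sketch again uses zlib as a stand-in compressor and placeholder byte strings in place of the real article files.

```python
import zlib

def c(data: bytes) -> int:
    return len(zlib.compress(data))

def d(x: bytes, y: bytes) -> float:
    return c(x + y) / (c(x) + c(y))

# Placeholder documents standing in for the two article groups.
group_a = [b"adhd drug review trial " * 20, b"adhd medication study " * 20]
group_b = [b"kernel machine learning " * 20, b"support vector method " * 20]

# Entry (i, j) of the matrix is d(b_i, a_j).
matrix = [[d(b, a) for a in group_a] for b in group_b]
```

Row i then summarizes how far document b_i is from every document in group a, which is what the ADHD vs. machine-learning comparison on the next slide inspects.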

Experiments
Two groups of drug-review (ADHD) articles and two groups of machine-learning articles; each group has 15 articles.
Intuitively:
d(ADHD, ADHD) < d(ADHD, machine learning)
d(machine learning, machine learning) < d(ADHD, machine learning)

Future Work
More experiments; compare cosine(x, y) and d(x, y).