How To Extend the Training Data

Presentation transcript:

How To Extend the Training Data? Comparison of Two Methods Applied to Training-Intensive Algorithms
Shabnam Sadegharmaki, Oct 2018

Outline
Motivation
  Euler Hermes project at Allianz
  Legal text classification at Sebis
Problem Statement
  Supervised classification challenges
  Learning with minimum supervision
Solution
  Text Data Augmentation
  Semi-Supervised Learning
  Graph-based SSL
Research approach
  Comparison of the two methods
  Across the two datasets
Timeline
Overview
171103 Matthes English Master Slide Deck © sebis

Euler Hermes Project: An Early Warning System
Financial experts read news and signals and grade the companies.
A vast amount of news comes in, and not all of it is critically important.
Phase 1: Filter out the important news about a company to make efficient use of human time and effort.
  Classification of news by criticality
  News items are labeled by financial experts
Phase n: An early warning system

Sebis Project: Legal Text Annotation/Classification
Classification of legal sentences in norms (laws) and clauses (contracts) by semantics and functionality.
A taxonomy comprising 9 different functional classes exists.
Different datasets:
  ~600 sentences from the German BGB concerning tenancy law
  ~600 sentences from German AGB concerning sale-of-goods law
  ~300 sentences from German rental agreements
  ~200 sentences from German purchase agreements


Supervised Classification
Notation: $L$: labeled data; $U_{unseen} = U_{unlabeled}$.
[Diagram with two phases. Training: ML learns a Classifier from $L_{Training}$. Classification: the Classifier is applied to $L_{Test}$ and to $U_{unseen}$.]
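The setup on this slide can be sketched in a few lines. This is only an illustration: it assumes a bag-of-words Naive Bayes classifier as the ML component, which the slides do not name, and the example sentences and the labels `critical` / `uncritical` are made up rather than taken from either dataset.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled):
    """Fit a multinomial Naive Bayes model on (text, label) pairs (L_Training)."""
    word_counts = defaultdict(Counter)  # label -> token counts
    class_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in labeled:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify(model, text):
    """Return the most probable label for a text."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # tokens never seen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, for illustration only.
l_training = [
    ("profit warning issued insolvency", "critical"),
    ("company files for insolvency", "critical"),
    ("new office opened", "uncritical"),
    ("company sponsors local festival", "uncritical"),
]
l_test = [("insolvency warning", "critical"), ("office festival", "uncritical")]
u_unseen = ["profit warning"]

model = train_naive_bayes(l_training)                       # Training
accuracy = sum(classify(model, t) == y for t, y in l_test) / len(l_test)
predictions = [classify(model, t) for t in u_unseen]        # Classification
print(accuracy, predictions)
```

With only a handful of labeled sentences the model covers very little vocabulary, which is the scarcity problem the rest of the talk addresses.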

The Challenge: How to extend the labeled data?
Labeled data: the more, the better.
However, labeled data is expensive and scarce.
On the other hand, there is a vast amount of unlabeled data.
Machine learning techniques with minimal supervision


Two Approaches
1. Text Data Augmentation (still no use of unlabeled data)
   Training: ML learns a Classifier from $L_{Training}$ and $L_{Augmented}$; $L_{Test}$ is used for testing.
   Classification: the Classifier is applied to $U_{unseen}$.
2. Semi-Supervised Learning
   Training: ML learns from $L_{Training}$ and $U_{Unlabeled}$; $L_{Test}$ is used for testing.

1. Text Data Augmentation
Add other variants of a text to the training data with the same label.
The idea comes from the image processing research area, but it cannot be applied directly to text because the order of the words matters.
Applied to text data for the first time by X. Sun & J. He [1].
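As a rough illustration of "adding variants with the same label", the sketch below uses synonym replacement, which keeps the word order intact. This is an assumed, generic technique and not necessarily the method of [1]; the `SYNONYMS` table and the `augment` function are hypothetical names introduced only for this example.

```python
import random

# Hypothetical synonym table; a real system would need a
# language-specific resource (for example a thesaurus).
SYNONYMS = {
    "good": ["great", "fine"],
    "bad": ["poor"],
    "hotel": ["inn"],
}

def augment(text, label, synonyms=SYNONYMS, n_variants=2, seed=0):
    """Return up to n_variants (variant, label) pairs for one labeled text.

    Each variant replaces exactly one word with a synonym, so the word
    order of the original sentence is preserved.
    """
    rng = random.Random(seed)
    tokens = text.split()
    replaceable = [i for i, tok in enumerate(tokens) if tok.lower() in synonyms]
    if not replaceable:
        return []
    variants = set()
    for _ in range(n_variants * 5):  # bounded number of attempts
        i = rng.choice(replaceable)
        new_tokens = tokens.copy()
        new_tokens[i] = rng.choice(synonyms[tokens[i].lower()])
        variants.add(" ".join(new_tokens))
        if len(variants) == n_variants:
            break
    # Every variant inherits the label of the original text.
    return [(variant, label) for variant in sorted(variants)]

augmented = augment("the hotel was good", "positive")
for variant, label in augmented:
    print(label, "|", variant)
```

The augmented pairs would simply be appended to the labeled training set before training; no unlabeled data is involved.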

1. Text Data Augmentation: Experiment in [1]
Dataset: hotel on-line evaluation dataset
Task: Chinese sentiment analysis
Models used: SVM, CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory), LSTM+CNN
[1] X. Sun and J. He, "A novel approach to generate a large scale of supervised data for short text sentiment analysis," Multimedia Tools and Applications, pp. 1–21, 2018.

1. Text Data Augmentation: Results
The augmentation increased the performance.
It was also compared with a GAN.


2. Semi-Supervised Learning
Approaches: generative models, self-training, co-training, graph-based, active learning.
Graph-based:
  Graph: nodes are both labeled and unlabeled examples; edges reflect the similarity of examples.
  Classification: label propagation
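The graph-based idea can be sketched with a basic label propagation loop. The code below is a minimal sketch under stated assumptions: a fully connected graph with Gaussian similarity as edge weight, clamped labeled nodes, and `-1` marking unlabeled nodes. The function name, the `sigma` parameter and the toy data are illustrative choices, not details from the slides.

```python
import numpy as np

def label_propagation(X, y, sigma=1.0, n_iter=200):
    """Propagate labels over a similarity graph.

    X: feature vectors, one row per node (labeled and unlabeled).
    y: class index per node, -1 for unlabeled nodes.
    Returns a predicted class for every node.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labeled_idx = np.flatnonzero(y >= 0)
    classes = np.unique(y[labeled_idx])

    # Edge weights: Gaussian similarity between every pair of examples.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    T = W / W.sum(axis=1, keepdims=True)  # row-normalized transition matrix

    # Label matrix: one-hot rows for labeled nodes, zeros for unlabeled ones.
    F = np.zeros((len(y), len(classes)))
    F[labeled_idx, np.searchsorted(classes, y[labeled_idx])] = 1.0
    clamp = F[labeled_idx].copy()

    for _ in range(n_iter):
        F = T @ F                # each node takes the weighted labels of its neighbors
        F[labeled_idx] = clamp   # labeled nodes keep their known labels
    return classes[F.argmax(axis=1)]

# Toy example: two well-separated clusters, one labeled node in each.
X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
y = [0, -1, -1, 1, -1, -1]
pred = label_propagation(X, y)
print(pred)
```

For text, the rows of `X` would be some vector representation of the sentences or news items; the choice of representation and similarity measure is left open here.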



Research Approach
Datasets:
  Financial news dataset (in German, provided by Allianz)
  Law and contract dataset (in German, provided by the chair)
Methods:
  Text augmentation
  Graph-based SSL
Steps:
  Research possible solutions for text data augmentation
  Implement a supervised learning approach suited to the dataset as the baseline for the comparison
  Implement the two text augmentation methods
  Analyze and compare the results of both methods
  Analyze and compare the results between datasets


Timeline
Guided Research = 300 h
  Research: 80 hours, until end of Oct
  Implementation: 120 hours, until 21st Dec
  Analysis of the results: 60 hours, until 15th Jan
  Document & Presentation: 40 hours, Feb

Guided Research Overview
Motivation: The amount of labeled training data is limited and costly to produce.
Idea: Extend the training data by machine learning.
Scope: Compare two text data augmentation approaches on two datasets and investigate the effects on model performance.
Planned duration: Oct 18 – Feb 1st
Supervision: Jointly by AZ (Basil Komboz) and TUM (Ingo Glaser, Prof. Matthes)
Datasets:
  Financial news dataset (in German, provided by Allianz)
  Law and contract dataset (in German, provided by the chair)
Methods:
  Text augmentation
  Graph-based SSL

References
[1] Sun, X., & He, J. (2018). A novel approach to generate a large scale of supervised data for short text sentiment analysis. Multimedia Tools and Applications, 1-21.
[2] Ravi, S., & Diao, Q. (2016, May). Large scale distributed semi-supervised learning using streaming approximation. In Artificial Intelligence and Statistics (pp. 519-528).
[3] Hussain, A., & Cambria, E. (2018). Semi-supervised learning for big social data analysis. Neurocomputing, 275, 1662-1673.
[4] Shams, R. (2014). Semi-supervised Classification for Natural Language Processing. arXiv preprint arXiv:1409.7612.
[5] Zhu, X. (2006). Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2(3), 4.
[6] Goyal, P., & Ferrara, E. (2018). Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151, 78-94.
[7] Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855-864). ACM.

Thank you. Questions?