Master Thesis Kick-Off: Interactive Analysis of a Corpus of General Terms and Conditions for Variability Modeling David Koller 23.11.2018.

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

Search Engines and Information Retrieval
IR & Metadata. Metadata Didn’t we already talk about this? We discussed what metadata is and its types –Data about data –Descriptive metadata is external.
Automatically Annotating Web Pages Using Google Rich Snippets 11th Dutch-Belgian Information Retrieval Workshop (DIR 2011) February 4, 2011 Frederik Hogenboom.
Sentence Classifier for Helpdesk s Anthony 6 June 2006 Supervisors: Dr. Yuval Marom Dr. David Albrecht.
Software Engineering for Business Information Systems (sebis) Department of Informatics Technische Universität München, Germany wwwmatthes.in.tum.de A.
Chapter 5: Information Retrieval and Web Search
Yoonjung Choi.  The Knowledge Discovery in Databases (KDD) is concerned with the development of methods and techniques for making sense of data.  One.
CCSA 221 Programming in C CHAPTER 2 SOME FUNDAMENTALS 1 ALHANOUF ALAMR.
SVMLight SVMLight is an implementation of Support Vector Machine (SVM) in C. Download source from :
Search Engines and Information Retrieval Chapter 1.
IMSS005 Computer Science Seminar
Chapter 6: Information Retrieval and Web Search
Developing a Research Proposal. I. Writing a research proposal A short paragraph A formal, multipage report.
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
Medical Information Retrieval: eEvidence System By Zhao Jin Mar
1 Masters Thesis Presentation By Debotosh Dey AUTOMATIC CONSTRUCTION OF HASHTAGS HIERARCHIES UNIVERSITAT ROVIRA I VIRGILI Tarragona, June 2015 Supervised.
Comparing Document Segmentation for Passage Retrieval in Question Answering Jorg Tiedemann University of Groningen presented by: Moy’awiah Al-Shannaq
Site Technology TOI Fest Q Celebration From Keyword-based Search to Semantic Search, How Big Data Enables That?
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais.
A SCRIPT FOR ARCHIVING DIGITAL RESEARCH DATA IMPROVING ACCURACY AND EFFICIENCY IN THE DATAVERSE NETWORK ABSTRACT SUMMARY Rachel Carriere, Thu-Mai Christian,
Bike Sharing Demand Prediction PRESENTED BY:- AKSHAY PATIL 14MCB1031 RESEARCH FACILITATOR: PROF. BVANSS PRABHAKAR RAO M.TECH 1 ST YEAR RBL.
IR Homework #2 By J. H. Wang May 9, Programming Exercise #2: Text Classification Goal: to classify each document into predefined categories Input:
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
Making Sense of Large Volumes of Unstructured Responses K. M. P. N. Jayathilaka Department of Statistics University of Colombo.
Introduction to Machine Learning, its potential usage in network area,
Avoiding Redundancy in the Management of Technical Documentation and Models: Requirements Analysis and Prototypical Implementation for Enterprise Architecture.
Name: Sushmita Laila Khan Affiliation: Georgia Southern University
Master thesis: Automatic Extraction of Design Decision Relationships from a Task Management System Kick-Off Matthias Ruppel, 8th of May 2017, Munich.
 Corpus Formation [CFT]  Web Pages Annotation [Web Annotator]  Web sites detection [NEACrawler]  Web pages collection [NEAC]  IE Remote.
<Student’s name>
CIS 376 Bruce R. Maxim UM-Dearborn
<Student’s name>
Preface to the special issue on context-aware recommender systems
Predicting Enterprise Application Performance Measures through Time-series Forecasting Daniel Elsner, 21st August 2017, Scientific advisor: Pouya Aleatrati.
By Dr. Abdulrahman H. Altalhi
FHIR BULK DATA API April 2018
Multimedia Information Retrieval
Identification of Cross-Blockchain Transactions: A Feasibility Study
Classifying enterprises by economic activity
Lecture 12: Data Wrangling
CIS 210 Systems Analysis and Development
TED Talks – A Predictive Analysis Using Classification Algorithms
Label Propagation for Tax Law Thesaurus Extension
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Using Smart Contracts for Digital Services: A Feasibility Study based on Service Level Agreements Stephan Zumkeller, 20th August 2018, Scientific advisors:
Rob Cotterill & Helen Jones DNV Consulting, UK
An API Morphology Analysis of Different Organizations in the API Economy Joan Disho, , Master Thesis Proposal.
Tutorial for WEKA Heejun Kim June 19, 2018.
iSRD Spam Review Detection with Imbalanced Data Distributions
Classification and Prediction
CSE 635 Multimedia Information Retrieval
Text Mining & Natural Language Processing
Chapter 5: Information Retrieval and Web Search
Grid Based Data Integration with Automatic Wrapper Generation
Resource Recommendation for AAN
Text Mining & Natural Language Processing
Intro to Machine Learning
Temple Ready within an Hour of Collection Capture
How To Extend the Training Data
Sentiment Analysis In Student Learning Experience By Obinna Obeleagu
Sentiment Analysis In Student Learning Experience By Obinna Obeleagu
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John.
Hierarchical, Perceptron-like Learning for OBIE
Master Thesis: Imputation of missing Product Information using Deep Learning A Use Case on Amazon Product Catalogue Aamna Najmi, , Kickoff Advisor:
Azure Machine Learning
Yingze Wang and Shi-Kuo Chang University of Pittsburgh
Master Thesis Kick-Off: Extraction of Legal Term Definitions from German Statutory Texts and Court Decisions Fabian Thomas,
Bachelor’s Thesis Kick-Off: Empirical Task Analysis of Data Protection Management Michael Vilser
Presentation transcript:

Master Thesis Kick-Off: Interactive Analysis of a Corpus of General Terms and Conditions for Variability Modeling David Koller 23.11.2018

Key Facts Title: Interactive Analysis of a Corpus of General Terms and Conditions for Variability Modeling Author: B.Sc. David Koller Advisor: M.Sc. Ingo Glaser Supervisor: Prof. Dr. Florian Matthes Start: October 15th, 2018 End: April 15th , 2019 181123 David Koller Master's Thesis KickOff © sebis

Outline Introduction Motivation Problem Statement Research Questions Interactive Machine Learning Introduction Data Collection and Preprocessing Text Classification Approach Workflow Timetable 181123 David Koller Master's Thesis KickOff © sebis

Motivation Problem: No standardized Terms & Conditions Question of legality of certain clauses Solution: Classifying clauses from Terms & Conditions Identifying universally needes clauses and dependencies Recognizing unlawful clauses Goal: Creating a Variability Model of Terms & Conditions containing dependencies, necessary and unlawful clauses 181123 David Koller Master's Thesis KickOff © sebis

Interactive Machine Learning Problem Statement T&C Online-Shop 1 Cluster 1 T&C Online-Shop 2 Cluster 2 Variability Model Cluster 3 T&C Online-Shop 3 …. T&C Online-Shop N Interactive Machine Learning Classified Clauses Interactive Process Variability Model Dataset 181123 David Koller Master's Thesis KickOff © sebis

Interactive Machine Learning Problem Statement T&C Online-Shop 1 Kang K.C., Lee H. (2013) Cluster 1 T&C Online-Shop 2 Cluster 2 Variability Model Cluster 3 T&C Online-Shop 3 …. T&C Online-Shop N Interactive Machine Learning Classified Clauses Interactive Process Variability Model Dataset 181123 David Koller Master's Thesis KickOff © sebis

Outline Introduction Motivation Problem Statement Research Questions Interactive Machine Learning Introduction Data Collection and Preprocessing Text Classification Approach Workflow Timetable 181123 David Koller Master's Thesis KickOff © sebis

Research Questions How can interactive Machine Learning support clustering clauses from general Terms & Conditions? What are suitable classes and how can we classify clauses by their semantic context? How does a prototypical implementation enabling the semantic text matching of clauses in Terms & Conditions look like? How should the User Interface for the interaction between software and human expert look like? What is Variability Modeling and how can it help in observing relations between / lawfulness of clauses? 181123 David Koller Master's Thesis KickOff © sebis

Analyzing current interactive Machine Learning (iML) solutions to identify possible approach What is interactive Machine Learning? “Machine learning with a human in the learning loop, observing the result of learning and providing input meant to improve the learning outcome.”(1) „Es handelt sich dabei um Algorithmen, die mit – teils menschlichen – Agenten interagieren und durch diese Interaktion ihr Lernverhalten optimieren können.“ (2) Why interactive ML? Interactive Machine Learning shows increase in both speed of learning and higher accuracy compared to other approaches. Solution with purely unsupervised approaches seems unfeasable 181123 David Koller Master's Thesis KickOff © sebis

Machine Learning Model Analyzing current interactive Machine Learning (iML) solutions to identify possible approach Example: Starting with unsupervised learning Human expert gives feedback on quality of predictions Model incorporates feedback for new predictions Improved Predictions after incorporating human feedback Predictions Unlabled and Preproccesed Data Machine Learning Model Feedback 181123 David Koller Master's Thesis KickOff © sebis

Building a usable Dataset: Data Collection Creating .csv File containing general information and URLs for Terms and Conditions of partnered Online Shops Crawling idealo.de for Online Shops Based on a crawler from Daniel Braun Extracting the <body> of HTML form collected URLs and creating .txt File for each Online Shop (unfortunately containing irrelevant Information) Extracting AGBs from Shops Manually removing irrelevant data, adding placeholders (e.g. for address of Online Shops), and general formatting Manual Cleanup of collected Text Important No labeled Dataset available 181123 David Koller Master's Thesis KickOff © sebis

Building a usable Dataset: Data Preprocessing Terms and Conditions Die Lieferzeit beträgt, sofern nicht beim Angebot anders angegeben, 3-5 Tage. Tokenization ['Die', 'Lieferzeit', 'beträgt', ',', 'sofern', 'nicht', 'beim', 'Angebot', 'anders', 'angegeben', ',', '3', '-', '5', 'Tage', '.'] Special Character Removal ['Die', 'Lieferzeit', 'beträgt', 'sofern', 'nicht', 'beim', 'Angebot', 'anders', 'angegeben', '3', '5', 'Tage'] Stopwords Removal ['Die', 'Lieferzeit', 'beträgt', 'sofern', 'Angebot', 'anders', 'angegeben', '3', '5', 'Tage'] Lemmatization [‚der', 'Lieferzeit', 'betragen', 'sofern', 'Angebot', 'anders', 'angeben', '3', '5', ‚tagen'] Word Embedding Algorithm Word2Vec, Doc2Vec etc. 181123 David Koller Master's Thesis KickOff © sebis

Identification of suitable classes in context of Terms & Conditions Two possible levels of abstraction for classes: Paragraph Level: Each class contains all clauses semantically belonging to one paragraph Much easier since paragraph title's contain valuable descriptors Few, but big Clusters (Total number of classes easier to predict: one for each type of Paragraph) Possible Classes: “§1 Grundlgende Bestimmung”, “§2 Zustandekommen des Vertrages” etc. Clauses Level: Each class only contains clauses that are semantically the same Very hard, since wording in Terms & Conditions differs greatly, which makes finer classification difficult Many, but more precise Clusters (Total number hard to predict: “how many types of clauses exist?”, Overlap) Possible classes: “Vertragsgegenstand”, “Vertragsabschluss” AGBs mit abweichenden Formulierungen und Sortierungen „Vertragsgegenstand“ rechts eigenständiger Punkt Source: https://website-tutor.com/agb-muster/ 181123 David Koller Master's Thesis KickOff © sebis

Identification of suitable classes in context of Terms & Conditions Vertragsgegenstand Vertragsabschluss AGBs mit abweichenden Formulierungen und Sortierungen „Vertragsgegenstand“ rechts eigenständiger Punkt Paragraph Clause Sources: https://website-tutor.com/agb-muster/ https://www.juraforum.de/muster-vorlagen/agb-online-shop 181123 David Koller Master's Thesis KickOff © sebis

Outline Introduction Motivation Problem Statement Research Questions Interactive Machine Learning Introduction Data Collection and Preprocessing Text Classification Approach Workflow Timetable 181123 David Koller Master's Thesis KickOff © sebis

Design & Implementation Workflow Literature Research Interactive Machine Learning NLP and Semantic Text Matching Variability Modeling Literature Research Evaluation Accuracy, Recall? Evaluation of clustering difficult without labels Problem: Human feedback to predictions is subjective. The quality of the results will depend on the quality of the expert interacting with the system. Evaluation Design & Implementation Design & Implementation Data collection and preparation Word Embeddings and Clustering Implementation in Python User interface for interactive Machine Learning 181123 David Koller Master's Thesis KickOff © sebis

Timetable October November December January February March April Literature Research Data Collection and Preparation Design & Implementation Evaluation Variability Modeling Writing Review Start Date Kickoff Submission 181123 David Koller Master's Thesis KickOff © sebis

B. Sc. David Koller 17132 matthes@in.tum.de

References Kang K.C., Lee H. (2013) Variability Modeling. In: Capilla R., Bosch J., Kang KC. (eds) Systems and Software Variability Management. Springer, Berlin, Heidelberg http://iml.media.mit.edu/ Gesellschaft für Informatik: https://gi.de/informatiklexikon/interactive-machine-learning-iml/ https://website-tutor.com/agb-muster/ https://www.juraforum.de/muster-vorlagen/agb-online-shop 181123 David Koller Master's Thesis KickOff © sebis