Unsupervised Machine Learning: Clustering Assignment

Assignment
Investigate how the unsupervised k-means clustering algorithm performs on a text classification task. Change the input parameters, then compare the k-means clusters side by side with the human-annotated results.

Assignment: About the Data
The tool takes a set of unlabeled business articles (published by Knowledge@Wharton) and groups them with the k-means clustering algorithm, using term-frequency features (tf-idf weighting; see Background Information on Tool). For comparison, the tool also shows the same articles with their human-annotated labels. Subject matter experts have classified these documents into five main topics: finance, management, marketing, public policy, and technology.

Link to Assignment
Click the Link to Platform button or use the following link to access the assignment directly: https://wrds-classroom.wharton.upenn.edu/unsupervised-machine-learning-text-classification/

Assignment: Select Parameters
1. Select the Number of Clusters for the k-means algorithm.
2. Select the total Number of Documents.
3. Click the Run Process button.

Assignment: Compare Results
Compare the machine learning clusters with the human-annotated topics (shown in greyscale). One view shows the business articles classified by main topic by human annotators; the other shows the same set of business articles grouped into clusters by unsupervised machine learning.
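One way to make the side-by-side comparison concrete is a contingency table of clusters against topics, plus a purity score. The sketch below is only an illustration: the eight labels are made up for the example, and the helper names `contingency` and `purity` are hypothetical, not part of the tool.

```python
from collections import Counter

# Made-up labels for eight documents, for illustration only.
human_topics = ["finance", "finance", "marketing", "marketing",
                "technology", "technology", "finance", "marketing"]
cluster_ids = [0, 0, 1, 1, 2, 1, 0, 2]


def contingency(clusters, topics):
    """Count how many documents of each topic fall in each cluster."""
    table = {}
    for cluster, topic in zip(clusters, topics):
        table.setdefault(cluster, Counter())[topic] += 1
    return table


def purity(clusters, topics):
    """Fraction of documents that carry their cluster's most common topic."""
    table = contingency(clusters, topics)
    majority_total = sum(counts.most_common(1)[0][1] for counts in table.values())
    return majority_total / len(topics)


table = contingency(cluster_ids, human_topics)
for cluster in sorted(table):
    print(cluster, dict(table[cluster]))
print("purity:", purity(cluster_ids, human_topics))
```

A purity of 1.0 would mean every cluster contains a single human-annotated topic. Lower values indicate clusters that mix topics, which is the typical outcome in this assignment.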

Assignment: Rerun the Process
Note that even with the same parameters selected, the results change when you rerun the process. This is because the k-means algorithm chooses its initial centroids with a random component on each run, and the solution it converges to depends on that starting point, so different starts can produce different clusters.
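The effect of the starting centroids can be seen on a toy problem. The sketch below is a plain implementation of Lloyd's k-means iteration written for illustration (it is not the tool's scikit-learn code), run on four made-up 2-D points from two fixed starting positions instead of random ones, so the outcome is repeatable.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm) from the given starting centroids."""
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            dists = [sq_dist(p, c) for c in centroids]
            groups[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its group.
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Inertia: total squared distance from each point to its nearest centroid.
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Four made-up points forming a wide rectangle: two natural clusters.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

good_centroids, good_inertia = lloyd(points, [(0, 0), (4, 0)])
poor_centroids, poor_inertia = lloyd(points, [(0, 0), (0, 1)])
print(good_centroids, good_inertia)
print(poor_centroids, poor_inertia)
```

Both runs converge, but to different answers: the first start splits the points into left and right pairs, while the second gets stuck splitting them into bottom and top pairs with a much larger inertia. Random initialization makes the same thing happen, unpredictably, from run to run.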

Assignment: Review the Terms
The results let you examine the 10 highest-weighted words in each cluster. These words are worth analyzing because the clusters typically do not match the human-annotated classification of the documents exactly.

Assignment: Assessment
Are you able to discern any patterns from the clusters that the k-means algorithm generates? (Hint: note geographic terms.)
What have you learned from this side-by-side comparison between unsupervised machine learning clustering and human-annotated classification?

Background Information on Tool
To create our unsupervised machine learning text classification visualization tool, we used scikit-learn, a free, open-source machine learning library for the Python programming language. Documents were weighted with tf-idf. We used k-means for clustering, with k-means++ as the initialization method for centroid selection.
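The combination described here (tf-idf weighting, k-means clustering, k-means++ initialization) can be sketched with scikit-learn roughly as follows. This is not the tool's actual code: the function name `cluster_documents`, its parameters, the default `n_init=10`, and the sample texts are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def cluster_documents(texts, n_clusters, n_init=10, seed=None):
    """Cluster raw texts with tf-idf features and k-means (k-means++ init).

    Leaving seed as None gives a different random initialization on each call;
    passing an integer makes the result repeatable.
    """
    pipeline = make_pipeline(
        TfidfVectorizer(),
        KMeans(n_clusters=n_clusters, init="k-means++",
               n_init=n_init, random_state=seed),
    )
    # fit_predict returns one cluster index per input text.
    return pipeline.fit_predict(texts)


# Made-up sample texts, for illustration only.
sample = [
    "bank loan interest rate credit",
    "interest rate bond credit bank",
    "brand advertising consumer campaign",
    "consumer brand campaign advertising retail",
]
labels = cluster_documents(sample, n_clusters=2, seed=0)
print(labels)
```

The two user-facing parameters in the assignment map naturally onto this sketch: the Number of Clusters corresponds to `n_clusters`, and the Number of Documents to the length of `texts`.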