TEAM 2 EMERGING INFORMATION TECHNOLOGIES I

TEAM MEMBERS: Lisa Ellrodt, Tonya Fields, Ion Freeman, Ashley Haigler and Suzanna Schmeelk (Pace University, USA)
COURSE: DCS-860A
PROFESSORS: Dr. Charles Tappert and Dr. Tilak Agerwala
DATE: SEPTEMBER 29, 2018

Emerging Information Technology - Machine Learning
“Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.” (The SAS Institute)
Some examples:
  Netflix: suggests new shows based on our viewing habits
  Pandora/Apple Music/Google Play Music: suggest new music based on our listening habits
  Waze: optimizes the driving experience based on known road data
  Amazon: offers recommendations based on our shopping habits

Machine Learning Video
IBM Watson Machine Learning: Example of Machine Learning: and Softbank Robotics, Inc

PAPER TOPIC: BACKGROUND
Fall 2017: set out to characterize the DPS dissertations produced in the program
  Worked to hand-classify the papers; found that standardization was challenging
  Resolved by using machine learning to cluster the dissertation abstracts
  Original approach: TF-IDF (Term Frequency - Inverse Document Frequency)
Spring 2018: tried 4-5 different algorithms in Weka and IBM BlueMix to see differences
Fall 2018: plan to clean the data, as described in the following pages
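To make the TF-IDF idea concrete, here is a minimal sketch of one common variant: term frequency normalized by document length, multiplied by the natural log of (number of documents / number of documents containing the term). The function name and the sample abstracts are hypothetical and are not taken from the study. The tools named above may use a different weighting.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already-tokenized abstracts (illustration only).
abstracts = [
    ["neural", "network", "security"],
    ["network", "security", "policy"],
    ["biometric", "keystroke", "security"],
]
weights = tf_idf(abstracts)
```

In this sample, "security" occurs in every abstract, so its weight is zero, while a term found in only one abstract, such as "neural", gets the largest weight. That is why very common words contribute little to the clustering.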

WORK TO DATE – Publishing Teamwork Research
Fall 2017: paper was published at Pace Research Day
Spring 2018: after Pace Research Day we improved the thrust of the paper; it was accepted into IEEE-FIE 2018 (48th Annual IEEE Frontiers in Education), San Jose, CA
Summer 2018: our paper included the work from IBM and Weka; it is under review at IEEE-ICMLA 2018 (17th IEEE International Conference on Machine Learning and Applications)

DISTANT FUTURE PLANS DETAILS (Which Could be Dissertation)
First, refine the methodology on the dissertation database
Apply the method to other databases to show that it generalizes
Categorize a variety of research papers
For example, apply it to 5-10 years of articles from several journals

PLANS FOR FALL 2018
Redo the first study
Focus on data preparation, which is usually the most tedious and important step in data analytics
Data preparation steps:
  Tokenization – separate words from the text
  Case folding – reduce all letters to lowercase
  Lemmatization – find dictionary base forms
  Stemming (e.g., Porter’s stemming algorithm) – similar to lemmatization, but no dictionary is required
  Eliminate stop words – such as the, a, of, and
  Eliminate domain-specific stop words – words not useful for the study, in this case “study”, “dissertation”, etc.
Other considerations:
  Emphasis: add the words from the dissertation titles to the abstract words, doubling or tripling the title words, to increase the bag of words from each dissertation
  Outliers: removing them might improve cluster distributions
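The title-emphasis idea can be sketched as below. This is an illustration under stated assumptions, not the team's implementation: the function name, the `title_weight` parameter and the sample tokens are hypothetical, and the inputs are assumed to be tokenized and cleaned already.

```python
from collections import Counter

def bag_of_words(abstract_tokens, title_tokens, title_weight=2):
    """Count abstract words once each and title words title_weight times each."""
    bag = Counter(abstract_tokens)
    for token in title_tokens:
        bag[token] += title_weight
    return bag

# Hypothetical tokens (illustration only); title words are tripled here.
bag = bag_of_words(
    abstract_tokens=["cloud", "security", "model"],
    title_tokens=["cloud", "security"],
    title_weight=3,
)
```

With `title_weight=3`, a word that appears once in the abstract and once in the title ends up with a count of 4, while a word found only in the abstract keeps a count of 1.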

Background Information on Data Preparation
Chapter 9 in Related slides at

Tokenization
Breaks up a sequence of strings into pieces called tokens:
  Words
  Keywords
  Phrases
  Symbols
  Other elements
Tokens can be individual words, phrases or whole sentences
Tokens can be made up of only alpha characters, alphanumeric characters or numeric characters
Tokens are separated by whitespace, punctuation marks or line breaks
Whitespace and punctuation marks may or may not be included, depending on the need
All characters within a contiguous string are part of the token
In the process of tokenization, characters such as punctuation marks are usually discarded
The tokens become the input for processes such as parsing and text mining
Tokenization is used in computer science and plays a large part in the process of analytics
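A minimal tokenizer along these lines can be written with Python's standard `re` module. The pattern below is one possible choice (runs of ASCII letters and digits). Real tokenizers handle apostrophes, hyphens and non-English text more carefully, and the sample sentence is hypothetical.

```python
import re

# A token is a contiguous run of letters or digits; whitespace and
# punctuation act as separators and are discarded.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")

def tokenize(text):
    """Split text into alphanumeric tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Clustering DPS dissertations: a 2018 study.")
# tokens == ['Clustering', 'DPS', 'dissertations', 'a', '2018', 'study']
```

Note that this simple pattern splits a word such as "Porter's" into two tokens, "Porter" and "s". That is one reason the choice of separators depends on the need.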

Case folding – reduce all letters to lowercase
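As a small sketch, case folding a list of tokens in Python might look like this (the function name and sample tokens are hypothetical):

```python
def fold_case(tokens):
    """Lowercase every token so that 'Machine' and 'machine' match."""
    return [token.lower() for token in tokens]

folded = fold_case(["Machine", "LEARNING", "machine"])
# folded == ['machine', 'learning', 'machine']
```

For text that is not plain English, Python's built-in `str.casefold()` is a more aggressive alternative to `str.lower()`. For example, it maps the German "ß" to "ss".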

Lemmatization – find dictionary base forms
Determines the dictionary base form (lemma) of each word
Notes about lemmatization:
  Lemmatize takes a part-of-speech parameter, "pos"
  If it is not supplied, the default is "noun"
  This means that an attempt will be made to find the closest noun, which can create trouble for you. Must watch out for this!
Show video on below?
Reference: nltk-tutorial/
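The effect of the default part of speech can be illustrated with a toy lemmatizer. This is not NLTK: the lookup table and function below are hypothetical stand-ins that only mimic a `lemmatize(word, pos)` call with a noun default, so the example runs without any dictionary data.

```python
# Toy lookup table standing in for a real dictionary (illustrative entries).
# Keys are (word, part of speech): "n" = noun, "v" = verb, "a" = adjective.
LEMMAS = {
    ("mice", "n"): "mouse",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("running", "v"): "run",
    ("better", "a"): "good",
}

def lemmatize(word, pos="n"):
    """Return the base form for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)

print(lemmatize("mice"))          # mouse
print(lemmatize("running"))       # running (looked up as a noun, left unchanged)
print(lemmatize("running", "v"))  # run
print(lemmatize("better", "a"))   # good
```

The second call shows the pitfall: without an explicit part of speech, a verb form is looked up as a noun and passes through unchanged.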

Stemming (e.g., Porter’s stemming algorithm) - similar to lemmatization but dictionary not required
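To show how rule-based stemming works without a dictionary, here is a sketch of Step 1a only from Porter's original 1980 algorithm, which handles plural endings. The full algorithm has several more steps and conditions that are not shown, and the function name is hypothetical.

```python
def porter_step_1a(word):
    """Apply Step 1a of Porter's stemmer: strip or rewrite plural suffixes."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word        # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # S    -> ""   (cats -> cat)
    return word

stems = [porter_step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]]
# stems == ['caress', 'poni', 'caress', 'cat']
```

The stem "poni" is not a dictionary word. That is the practical difference from lemmatization: a stemmer only needs related word forms to collapse to the same string.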

Eliminate stop words, such as the, a, of, and
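A minimal sketch of stop-word removal follows. The stop list here contains only the four words named on the slide, whereas real stop lists are much longer. The function name and sample tokens are hypothetical, and the tokens are assumed to be lowercased already.

```python
# Tiny stop list using only the examples from the slide.
STOP_WORDS = {"the", "a", "of", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [token for token in tokens if token not in stop_words]

kept = remove_stop_words(["the", "clustering", "of", "abstracts"])
# kept == ['clustering', 'abstracts']
```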

FUTURE WORK FOR THIS SEMESTER
Eliminate domain-specific stop words that are not useful for the study – in this case “study”, “dissertation”, etc.
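One possible way to find candidate domain-specific stop words, offered here as an assumption rather than the team's method, is to flag terms that occur in most of the documents. The function name, the 0.8 threshold and the sample documents are hypothetical, and any candidate list would still need a manual review.

```python
from collections import Counter

def frequent_terms(docs, min_fraction=0.8):
    """Return terms that appear in at least min_fraction of the documents."""
    # Count each term once per document it appears in.
    df = Counter(term for doc in docs for term in set(doc))
    cutoff = min_fraction * len(docs)
    return {term for term, n in df.items() if n >= cutoff}

# Hypothetical tokenized abstracts (illustration only).
docs = [
    ["dissertation", "study", "cloud", "security"],
    ["dissertation", "study", "biometric", "keystroke"],
    ["dissertation", "agile", "software", "study"],
]
domain_stop_words = frequent_terms(docs)
cleaned = [[t for t in doc if t not in domain_stop_words] for doc in docs]
```

In this sample, "dissertation" and "study" appear in every document, so they are flagged and removed, leaving only the words that distinguish one abstract from another.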

Action Items
Good news! We’ve started our paper already!
Team:
  Lisa Ellrodt
  Tonya Fields
  Ion Freeman
  Ashley Haigler
  Suzanna Schmeelk

Conclusion
Need a new venue for publication
Research questions:
  Will we see differences in clustering based on data preparation?
  TF-IDF, IBM, Spark, PyTorch, TensorFlow
  Identify new clustering algorithms
  Hyperparameters
