TEAM 2 EMERGING INFORMATION TECHNOLOGIES I


TEAM MEMBERS: Lisa Ellrodt, Tonya Fields, Ion Freeman, Ashley Haigler and Suzanna Schmeelk (Pace University, USA)
COURSE: DCS-860A
PROFESSORS: Dr. Charles Tappert and Dr. Tilak Agerwala
DATE: SEPTEMBER 29, 2018

Emerging Information Technology - Machine Learning
“Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.” (The SAS Institute, https://www.sas.com/en_us/insights/analytics/machine-learning.html)
Some examples:
- Netflix: suggests new shows based on our viewing habits
- Pandora/Apple Music/Google Play Music: suggest new music based on our listening habits
- Waze: optimizes the driving experience based on known road data
- Amazon: offers recommendations based on our shopping habits

Machine Learning Video
- IBM Watson Machine Learning: https://youtu.be/5kMDIBpxi_k
- Example of Machine Learning: https://youtu.be/eZGSsLq28vY
https://developer.ibm.com/clouddataservices/docs/ibm-watson-machine-learning/ and Softbank Robotics, Inc.

PAPER TOPIC: BACKGROUND
- Fall 2017: we set out to characterize the DPS dissertations produced in the program.
- We first tried to classify the papers by hand and found that standardization was challenging.
- We resolved this by using machine learning to cluster the dissertation abstracts.
- Original approach: TF-IDF (Term Frequency - Inverse Document Frequency).
- Spring 2018: we tried 4-5 different algorithms in Weka and IBM Bluemix to compare the results.
- Fall 2018: we plan to clean the data as described in the following pages.
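As a rough illustration of the TF-IDF weighting named in the background, the sketch below computes one common variant by hand: term frequency normalized by document length, multiplied by log(N / df). The function name `tf_idf` and the sample abstracts are hypothetical, and the team's actual tooling may use a different variant.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Other smoothing and normalization choices are equally common.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-tokenized abstracts (not real data).
abstracts = [
    ["cloud", "security", "model"],
    ["cloud", "learning", "model"],
    ["biometric", "security", "model"],
]
weights = tf_idf(abstracts)
```

A term that appears in every document ("model" here) gets weight zero, which is why words that are ubiquitous in a corpus contribute little to clustering under this scheme.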

WORK TO DATE: Publishing Teamwork Research
- Fall 2017: the paper was published at Pace Research Day.
- Spring 2018: after Pace Research Day we improved the thrust of the paper; it was accepted into IEEE-FIE 2018 (48th Annual IEEE Frontiers in Education), San Jose, CA.
- Summer 2018: our paper included the work with IBM and Weka. It is under review at IEEE-ICMLA 2018 (17th IEEE International Conference on Machine Learning and Applications).

DISTANT FUTURE PLANS DETAILS (Which Could Be a Dissertation)
- First, refine the methodology on the dissertation database.
- Apply the method to other databases to show that it generalizes.
- Categorize a variety of research papers.
- Example: apply the method to 5-10 years of articles from several journals.

PLANS FOR FALL 2018
- Redo the first study.
- Focus on data preparation, which is usually the most tedious and important step in data analytics.
Data preparation steps:
- Tokenization: separate words from the text
- Case folding: reduce all letters to lowercase
- Lemmatization: find dictionary base forms
- Stemming (e.g., Porter’s stemming algorithm): similar to lemmatization, but no dictionary is required
- Eliminate stop words, such as the, a, of, and
- Eliminate domain-specific stop words that are not useful for the study, in this case “study”, “dissertation”, etc.
Other considerations:
- Emphasis: add the words from the dissertation titles to the abstract words, doubling or tripling the title words, to enlarge the bag of words for each dissertation.
- Outliers: removing them might improve cluster distributions.
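The title-emphasis idea can be sketched as a weighted bag of words: abstract tokens are counted once, and each title token is added a configurable number of extra times. The function name `bag_of_words`, the `title_weight` parameter and the sample tokens are hypothetical; this is one possible reading of "doubling or tripling", not the team's implementation.

```python
from collections import Counter


def bag_of_words(abstract_tokens, title_tokens, title_weight=2):
    """Count abstract tokens once and add each title token `title_weight` times."""
    bag = Counter(abstract_tokens)
    for token in title_tokens:
        bag[token] += title_weight
    return bag


# Hypothetical tokens for a single dissertation.
bag = bag_of_words(
    abstract_tokens=["keystroke", "biometrics", "improve", "authentication"],
    title_tokens=["keystroke", "authentication"],
    title_weight=2,
)
```

Setting `title_weight=3` would triple the title words instead; the best value would have to be judged by its effect on the resulting clusters.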

Background Information on Data Preparation Chapter 9 in http://csis.pace.edu/~ctappert/cs816-17fall/books/2015DataScience&BigDataAnalytics.pdf Related slides at http://csis.pace.edu/~ctappert/cs816-17fall/slides/datascience09.pptx

Tokenization
- Breaks up a sequence of characters into pieces called tokens: words, keywords, phrases, symbols and other elements
- Tokens can be individual words, phrases or whole sentences
- A token may consist of alphabetic characters only, alphanumeric characters, or numeric characters only
- Tokens are separated by whitespace, punctuation marks or line breaks
- Whitespace and punctuation marks may or may not be kept, depending on the need; punctuation is often discarded
- All characters within a contiguous string are part of the token
- The tokens become the input for processes such as parsing and text mining
- Tokenization is widely used in computer science and plays a large part in analytics
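A minimal tokenizer along these lines can be written with a regular expression. The sketch assumes that a token is any contiguous run of ASCII letters or digits and that punctuation and whitespace are discarded; the pattern and the sample sentence are illustrative choices.

```python
import re

# A token is a contiguous run of ASCII letters or digits (an assumption;
# other definitions keep hyphens, apostrophes or non-ASCII letters).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Split text into alphanumeric tokens, discarding punctuation and whitespace."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Clustering 2018 abstracts: TF-IDF, then k-means!")
# tokens == ["Clustering", "2018", "abstracts", "TF", "IDF", "then", "k", "means"]
```

Note that this definition splits hyphenated terms such as "TF-IDF" into two tokens, which may or may not be desirable for a given study.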


Case folding – reduce all letters to lowercase
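Case folding is a one-line transformation in most languages. The sketch below lowercases each token so that "Machine" and "machine" count as the same word; the function name `fold_case` and the sample tokens are hypothetical.

```python
def fold_case(tokens):
    """Lowercase every token so that case variants map to the same term."""
    return [token.lower() for token in tokens]


folded = fold_case(["Machine", "LEARNING", "Weka", "TF"])
# folded == ["machine", "learning", "weka", "tf"]
```

For text that is not purely English, Python's `str.casefold()` is a more aggressive alternative to `str.lower()`.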

Lemmatization: find dictionary base forms
- Determines the dictionary base form (lemma) of each word
Notes about lemmatization:
- In NLTK, lemmatize takes a part-of-speech parameter, "pos". If it is not supplied, the default is "noun".
- This means that an attempt will be made to find the closest noun, which can cause trouble. Watch out for this!
Reference: https://pythonprogramming.net/lemmatizing-nltk-tutorial/
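The part-of-speech pitfall can be shown without installing anything. The sketch below is a toy stand-in, not NLTK: it looks words up in a tiny hand-written table keyed by (word, pos) and defaults `pos` to noun, mirroring the default behavior described above. The table contents and the function are hypothetical.

```python
# Toy lemma table keyed by (word, part of speech). A real lemmatizer
# consults a full dictionary such as WordNet; this table is illustrative only.
LEMMAS = {
    ("geese", "n"): "goose",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("running", "v"): "run",
    ("better", "a"): "good",
}


def lemmatize(word, pos="n"):
    """Return the base form for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)


as_noun = lemmatize("running")           # no noun entry, so unchanged
as_verb = lemmatize("running", pos="v")  # verb entry found
```

Here `as_noun` stays "running" while `as_verb` becomes "run", so passing the correct part of speech matters when verbs and adjectives should be reduced as well.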


Stemming (e.g., Porter’s stemming algorithm) - similar to lemmatization but dictionary not required
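To show how a stemmer works from suffix rules alone, the sketch below implements only step 1a of Porter's algorithm (the plural-handling rules). It is deliberately incomplete: the full algorithm has several further steps, and the function name `porter_step_1a` is a label chosen for this example.

```python
def porter_step_1a(word):
    """Apply Porter's step 1a suffix rules; the first matching rule wins."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS, e.g. caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # IES -> I, e.g. ponies -> poni
    if word.endswith("ss"):
        return word       # SS -> SS, e.g. caress -> caress
    if word.endswith("s"):
        return word[:-1]  # S -> (nothing), e.g. cats -> cat
    return word


stems = [porter_step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]]
# stems == ["caress", "poni", "caress", "cat"]
```

The output "poni" illustrates the contrast with lemmatization: a stem need not be a dictionary word, since no dictionary is consulted.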

Eliminate stop words, such as the, a, of, and

FUTURE WORK FOR THIS SEMESTER
- Eliminate domain-specific stop words that are not useful for the study, in this case “study”, “dissertation”, etc.
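Stop-word removal, general or domain-specific, amounts to filtering tokens against a set. The sketch below uses only the example words given in the slides; real stop lists are much longer, and the names `STOP_WORDS`, `DOMAIN_STOP_WORDS` and `remove_stop_words` are hypothetical.

```python
# Example words taken from the slides; real stop lists are much longer.
STOP_WORDS = {"the", "a", "of", "and"}
DOMAIN_STOP_WORDS = {"study", "dissertation"}


def remove_stop_words(tokens, extra=frozenset()):
    """Drop general stop words plus any extra (e.g. domain-specific) ones."""
    blocked = STOP_WORDS | set(extra)
    return [token for token in tokens if token not in blocked]


tokens = ["the", "dissertation", "presents", "a", "study",
          "of", "keystroke", "biometrics"]
general_only = remove_stop_words(tokens)
with_domain = remove_stop_words(tokens, DOMAIN_STOP_WORDS)
```

With the general list alone, "dissertation" and "study" survive; adding the domain-specific set leaves only "presents", "keystroke" and "biometrics". The filter assumes the tokens have already been case-folded.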

Action Items
- Good news! We’ve started our paper already!
- Team: Lisa Ellrodt, Tonya Fields, Ion Freeman, Ashley Haigler, Suzanna Schmeelk

Conclusion
- Need a new venue for publication
Research questions:
- Will we see differences in clustering based on data preparation?
- TF-IDF, IBM, Spark, PyTorch, TensorFlow
- Identify new clustering algorithms
- Hyperparameters