Detecting Cyberbullying using Latent Semantic Indexing(LSI)

Slides:



Advertisements
Similar presentations
Suleyman Cetintas 1, Monica Rogati 2, Luo Si 1, Yi Fang 1 Identifying Similar People in Professional Social Networks with Discriminative Probabilistic.
Advertisements

Labeling and Annotation
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
Adapted from and Reproduced with permission from BESTEAMS Learning Styles In and Around Team Work: What’s Your Learning Pattern?
1 Latent Semantic Indexing Jieping Ye Department of Computer Science & Engineering Arizona State University
Retrieval Evaluation: Precision and Recall. Introduction Evaluation of implementations in computer science often is in terms of time and space complexity.
TextMOLE: Text Mining Operations Library and Environment Daniel B. Waegel and April Kontostathis, Ph.D. Ursinus College Collegeville PA.
Retrieval Evaluation. Introduction Evaluation of implementations in computer science often is in terms of time and space complexity. With large document.
Chapter 5: Information Retrieval and Web Search
Systems Engineer An engineer who specializes in the implementation of production systems This material is based upon work supported by the National Science.
Beyond datasets: Learning in a fully-labeled real world Thesis proposal Alexander Sorokin.
Improving Automatic Abbreviation Expansion within Source Code to Aid in Program Search Tools Zak Fry.
Abstract A software development life cycle can be divided into requirements elicitation, specification, design, implementation, testing, and maintenance.
Analyzing and Evaluating Query Reformulation Strategies in Web Search Logs ReporterHsan-Yu Lin.
Christopher Harris Informatics Program The University of Iowa Workshop on Crowdsourcing for Search and Data Mining (CSDM 2011) Hong Kong, Feb. 9, 2011.
Engineering Design By Brian Nettleton This material is based upon work supported by the National Science Foundation under Grant No Any opinions,
Chinese Firewall Update Leif Guillermo, Veronika Strnadova This material is based upon work supported by the National Science Foundation under Grant No.
Social Networking Algorithms related sections to read in Networked Life: 2.1,
Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Satoshi Oyama Takashi Kokubo Toru lshida 國立雲林科技大學 National Yunlin.
 Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University.
Querying Structured Text in an XML Database By Xuemei Luo.
AUTOMATION IN MANUFACTURING 1 of 12 MADE IN FLORIDA - INDUSTRY TOURS.
Planning a search strategy.  A search strategy may be broadly defined as a conscious approach to decision making to solve a problem or achieve an objective.
Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. A Web 2.0-based collaborative annotation system for enhancing knowledge sharing in collaborative learning.
Chapter 6: Information Retrieval and Web Search
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Extracting meaningful labels for WEBSOM text archives Advisor.
Computer Aided Design By Brian Nettleton This material is based upon work supported by the National Science Foundation under Grant No Any opinions,
This material is based upon work supported by the National Science Foundation under Grant No. ANT Any opinions, findings, and conclusions or recommendations.
Query Expansion By: Sean McGettrick. What is Query Expansion? Query Expansion is the term given when a search engine adding search terms to a user’s weighted.
IDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF ).
Techniques for Collaboration in Text Filtering 1 Ian Soboroff Department of Computer Science and Electrical Engineering University of Maryland, Baltimore.
Advanced Semantics and Search Beyond Tag Clouds and Taxonomies Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services.
Lesson Title: Social Implications of RFID Dale R. Thompson Computer Science and Computer Engineering Dept. University of Arkansas
Stephanie Long Science Live, Director Science Museum of Minnesota
A Novel Visualization Model for Web Search Results Nguyen T, and Zhang J IEEE Transactions on Visualization and Computer Graphics PAWS Meeting Presented.
Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.
QUALITY MEASURES: METROLOGY MADE IN FLORIDA - INDUSTRY TOURS 1 of 12.
Intelligent Database Systems Lab Presenter : JIAN-REN CHEN Authors : Wen Zhang, Taketoshi Yoshida, Xijin Tang 2011.ESWA A comparative study of TF*IDF,
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Product Project Prototype Standards 2P, Q 9F, H This material is based upon work supported by the National Science Foundation under Grant No
BACKGROUND The Web is a global information resource Web users that seek information vary, culturally and ethnically Users of different cultural backgrounds.
Term Weighting approaches in automatic text retrieval. Presented by Ehsan.
Robust Requirements Tracing Via Internet Tech:Improving an IV&V Technique SAS 2004July 20, 2004 Alex Dekhtyar Jane Hayes Senthil Sundaram Ganapathy Chidambaram.
Intelligent Database Systems Lab Presenter: CHANG, SHIH-JIE Authors: Tao Liu, Zheng Chen, Benyu Zhang, Wei-ying Ma, Gongyi Wu 2004.ICDM. Improving Text.
Prototyping Brian Nettleton This material is based upon work supported by the National Science Foundation under Grant No Any opinions, findings.
 Wind Power TEAK – Traveling Engineering Activity Kits Partial support for the TEAK Project was provided by the National Science Foundation's Course,
Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon -Smit Shilu.
Session 5: How Search Engines Work. Focusing Questions How do search engines work? Is one search engine better than another?
The Diary of Anne Frank The Great Debate.
Best pTree organization? level-1 gives te, tf (term level)
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Text Based Information Retrieval
Technological Design VS Engineering Design
The Binary Number System
Discussion and Conclusion
Clustering Algorithms for Noun Phrase Coreference Resolution
Title of Poster Site Visit 2017 Introduction Results
People Who Did the Study Universities they are affiliated with
Title of session For Event Plus Presenters 12/5/2018.
Supporting Material for the Biodiversity Teaching Experiment
CSE 635 Multimedia Information Retrieval
A Modular Approach to Document Indexing and Semantic Search
Chapter 5: Information Retrieval and Web Search
Title of Poster Site Visit 2018 Introduction Results
This material is based upon work supported by the National Science Foundation under Grant #XXXXXX. Any opinions, findings, and conclusions or recommendations.
Precision and Recall Reminder:
Precision and Recall.
Project Title: I. Research Overview and Outcome
Presentation transcript:

Detecting Cyberbullying using Latent Semantic Indexing(LSI) Jacob Bigelow, April Edwards, Lynne Edwards Ursinus College

Motivation for using LSI Latent Semantic Indexing is thought to bring out the latent semantics amongst a corpus of texts Breaks a term by document matrix down and reduces the sparseness adding values that represent relationships between words w=qAk

Our Dataset Data gathered from 18,554 users on Formspring.me 13,159 posts Amazon Mechanical Turk used for HIT (human intelligence tasks) Needed workers to label posts as cyberbullying due to computers inability to 848 identified bullying posts

Methods for Pruning Started with 40,000 unique terms Normalize emoticons Remove punctuation Remove one character words Then deal with spelling issues :), (:, :], :D, etc. smileyface :(, :[, etc. frownyface ;) winkyface <3 heart :p, :P, etc. tongueoutface

Spellchecking Check against list of commonly misspelled internet terms ex. Lol, idk, hahah, wht Using Language Tools, open source spell checking software wht what u you Okie okay

Results for precision Initial simple query: “you dirty, ugly, piece of shit. I hope you die” First match found: “Q: Hi asslee asshole :D A: Hi anonymous” Precision measured as number of true positives divided by n For top ranks our precision is 7-9 times the baseline

Recall vs. Precision Large decline after recall of .1 Leads us to believe our system pushes some cyberbullying posts to the top but still missing quite a few

Conclusion We’ve developed a system to detect cyberbullying in short posts littered with spelling errors, abbreviations, and odd punctuation. Do not need bullying terms database Uses LSI to find posts by highlighting relationships between terms Preliminary results look promising

Future research Adding different weights to terms More extensive pruning Testing on other domains Work with larger data set

Acknowledgements NSF acknowledgement This material is based upon work supported in part by the National Science Foundation under Grant Nos. 0916152 and 1421896. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. My advisor, Dr. April Edwards