Breaking News: Exploring Israeli News Bias using Simple Textual Analysis
Yuval Pinter, Shuki Tausig, Oren Persico

Motivation / Hypotheses
Media is biased
Israeli media is super-biased
Machine learning detects bias
Headlines could be enough
"Headlines are journalism in its purest form" (Simon Jenkins, 1992; quoted in Hebrew on the slide)
Which is more significant: class bias or agenda bias?
The idea: classify the news outlet using basic features
Most of the "agenda bias" part will have to wait
No prior work AFAIK; closest field: authorship attribution

Data
General news sites only
Homepage headlines only
Scraped at 15-minute intervals
July 2014 – May 2015
Most experiments use the February 2015 data
Data and extraction code are available: github.com/yuvalpinter/MediaAnalysis

Data Samples
Nov 23, 15:00 [image]
Feb 15, 15:30 [image]

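Text Processing
Consecutive appearance de-duping
Tokenization (including lemmatization and affix deletion) using hspell (Har'el and Kenigsberg)
Mostly good, sometimes not so much:
הפרלמנט הירדני עמד דקת דומייה לזכר המחבלים ("The Jordanian parliament stood for a minute of silence in memory of the terrorists") => פרלמנט ירדן עימד דקה דומייה זכר מחבל ("parliament Jordan laid-out minute silence memory terrorist"; only "stood" gets the wrong lemma) (NRG, 20/11/2014, 0:15)
רעידת אדמה קטלנית כאלף נהרגו בנפאל: "שעות קריטיות" ("Deadly earthquake, about a thousand killed in Nepal: 'critical hours'") => רעידה דימה קטלוניה אילף נהרג נפאל שעה קריטי ("quake imagined Catalonia tamed killed Nepal hour critical"; "earth", "deadly" and "about a thousand" get wrong lemmas) (Mako, 25/4/2015, 19:30)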
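A minimal sketch of what consecutive-appearance de-duping could look like, assuming each scrape yields (outlet, timestamp, headline) records. That record layout and the function name are assumptions made for illustration, not the format or code used in the repository.

```python
from itertools import groupby
from typing import Iterable, List, Tuple

# (outlet, ISO timestamp, headline). This layout is an assumption of the sketch.
Snapshot = Tuple[str, str, str]


def dedupe_consecutive(snapshots: Iterable[Snapshot]) -> List[Snapshot]:
    """Keep only the first snapshot of each run of identical headlines.

    Input must be sorted by outlet and then by time. A headline that comes
    back after a different one is kept again, because only consecutive
    repeats are collapsed.
    """
    kept = []
    for _, run in groupby(snapshots, key=lambda s: (s[0], s[2])):
        kept.append(next(run))  # first record of the run
    return kept


scrapes = [
    ("ynet", "2015-02-15T15:00", "Headline A"),
    ("ynet", "2015-02-15T15:15", "Headline A"),
    ("ynet", "2015-02-15T15:30", "Headline B"),
    ("ynet", "2015-02-15T15:45", "Headline A"),
]
unique = dedupe_consecutive(scrapes)
```

With 15-minute scraping, a headline that stays on the homepage for hours would otherwise be counted once per scrape, so collapsing runs keeps one training example per appearance.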

Features
Form: character length, word count, word length (average/min/median/max), punctuation token count
Lexicon: quantile word/lemma frequencies (average/min/median/max), from Wordlists (Hermit Dave) and Israblog (Linzen 2009)
Morphology: affix letters
Word features: probably the media cycle
Features and extraction code are available
http://www.the7eye.org.il/50916
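The form features listed above can be sketched as follows. This is an illustration only: it splits on whitespace rather than using hspell, and the function names and the rule that a "punctuation token" is a token made up entirely of Unicode punctuation are assumptions of the sketch.

```python
import statistics
import unicodedata
from typing import Dict


def is_punctuation(token: str) -> bool:
    """True if every character is in a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def form_features(headline: str) -> Dict[str, float]:
    """Compute length-based features for one headline."""
    tokens = headline.split()  # whitespace split, not hspell
    words = [t for t in tokens if not is_punctuation(t)]
    lengths = [len(w) for w in words] or [0]  # avoid errors on empty input
    return {
        "char_length": len(headline),
        "word_count": len(words),
        "word_len_avg": statistics.mean(lengths),
        "word_len_min": min(lengths),
        "word_len_median": statistics.median(lengths),
        "word_len_max": max(lengths),
        "punct_token_count": len(tokens) - len(words),
    }


features = form_features("Deadly quake in Nepal : critical hours")
```

None of these features look at what the headline says, which is why they are candidates for capturing an outlet's style rather than its agenda.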

Setup & Results
7 classes, 1785 headlines (all of February)
Weka's Random Forest
Accuracy: 45.4% with 10 trees, 49.5% with 50 trees
Most significant features: number of words, average word length, average position in word frequency table
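One plausible reading of "average position in word frequency table" is the mean rank of a headline's tokens in a frequency-sorted word list. The sketch below assumes that reading. The function names are hypothetical, and so is the handling of out-of-list tokens, which the slides do not describe.

```python
from typing import Dict, Iterable, List


def build_rank_table(words_by_frequency: Iterable[str]) -> Dict[str, int]:
    """Map each word to its 1-based position in a most-frequent-first list."""
    table: Dict[str, int] = {}
    for position, word in enumerate(words_by_frequency, start=1):
        table.setdefault(word, position)  # keep the first (best) position
    return table


def average_rank(tokens: List[str], table: Dict[str, int]) -> float:
    """Mean frequency-table position of the tokens in one headline.

    Tokens missing from the table get position len(table) + 1. This is an
    assumption of the sketch, not something stated in the talk.
    """
    if not tokens:
        return 0.0
    unknown = len(table) + 1
    return sum(table.get(t, unknown) for t in tokens) / len(tokens)


table = build_rank_table(["the", "of", "in", "minister", "earthquake"])
score = average_rank(["the", "earthquake", "nepal"], table)
```

A higher value means the headline leans on rarer vocabulary, which may be part of why this feature helps separate outlets.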

Feature Example
Character length [chart; axis: character count]

Pairwise Setup
Binary classifier accuracy (%) for each pair of outlets (higher = easier to classify = less similar):
72.3 88 92.1 73.4 78.5 76.5 84.5 91.8 75.8 78.1 77.9 72.9 86.7 79 79.4 88.9 74 78.2 69.4 64.9 58.6
Class over agenda:
Mako, Walla, NRG form a cluster ("online ethos")
Ha'aretz and Ma'ariv relatively unique (newspaper-derived)
Israel Hayom resembles tabloid competitor ynet most, more than agenda-sharing NRG
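The pairwise setup can be sketched as a loop over all outlet pairs, training and scoring one binary classifier per pair. In the sketch below the `evaluate` callable stands in for whatever classifier is used (the talk used Weka's Random Forest). The `majority_baseline` placeholder and the toy feature vectors are assumptions for illustration and are not real data.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (feature vector, outlet label)


def pairwise_accuracies(
    data: List[Example],
    evaluate: Callable[[List[Example]], float],
) -> Dict[Tuple[str, str], float]:
    """Score a binary task for every unordered pair of outlets."""
    outlets = sorted({label for _, label in data})
    results = {}
    for a, b in combinations(outlets, 2):
        subset = [ex for ex in data if ex[1] in (a, b)]
        results[(a, b)] = evaluate(subset)
    return results


def majority_baseline(subset: List[Example]) -> float:
    """Placeholder scorer: accuracy of always guessing the larger class."""
    counts = Counter(label for _, label in subset)
    return counts.most_common(1)[0][1] / len(subset)


# The seven outlets named on the slide, with made-up feature vectors.
names = ["Mako", "Walla", "NRG", "Ha'aretz", "Ma'ariv", "Israel Hayom", "ynet"]
toy_data = [([float(i), 1.0], name) for i, name in enumerate(names) for _ in range(2)]
scores = pairwise_accuracies(toy_data, majority_baseline)
```

Seven outlets give 21 unordered pairs, which matches the 21 accuracy values on the slide.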

Future Work
Better content ("agenda") features: topic models? Sentiment?
Some weird phenomena to be ironed out
Alternating headlines: dedup based on recent k
Very similar headlines: merge or use edit distance
Location-sensitive features
Headlines starting with נתניהו ("Netanyahu"): roughly balanced
Headlines starting with רה"מ ("PM", the Hebrew abbreviation for Prime Minister): 50% in Israel Hayom, another 25% in NRG
More text: main leads / other headlines
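The two de-duplication ideas above (compare against the recent k headlines, and treat near-identical headlines as duplicates via edit distance) could be combined roughly as follows. This is a sketch of proposed future work, not existing code from the project. The function names, the window size `k` and the threshold `max_dist` are all illustrative assumptions.

```python
from collections import deque
from typing import Deque, Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def dedupe_recent(headlines: Iterable[str], k: int = 3, max_dist: int = 2) -> List[str]:
    """Drop a headline if it is within max_dist edits of any of the last k kept."""
    recent: Deque[str] = deque(maxlen=k)
    kept = []
    for headline in headlines:
        if any(edit_distance(headline, r) <= max_dist for r in recent):
            continue  # repeat or near-repeat of a recently kept headline
        kept.append(headline)
        recent.append(headline)
    return kept


stream = [
    "Budget vote delayed",
    "Storm hits the north",
    "Budget vote delayed",    # alternating repeat
    "Storm hits the north!",  # near-identical variant
    "Strike at the port",
]
cleaned = dedupe_recent(stream)
```

Keeping only kept headlines in the window means an alternating A, B, A, B sequence collapses to A, B, which consecutive-only de-duping would miss.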

Thanks!
github.com/yuvalpinter/MediaAnalysis