Published by Patience Dorsey. Modified over 6 years ago.
1
Breaking News: Exploring Israeli News Bias Using Simple Textual Analysis
Yuval Pinter, Shuki Tausig, Oren Persico
2
Motivation / Hypotheses
Media is biased
Israeli media is super-biased
Machine learning detects bias
Headlines could be enough: "Headlines are journalism in its purest form" (Simon Jenkins, 1992; translated from the Hebrew on the slide)
Which is more significant: class bias or agenda bias?
The idea: classify the news outlet using basic features
Most of the “agenda bias” part will have to wait
No prior work, as far as I know; the closest field is authorship attribution
3
Data
General news sites only
Homepage headlines only
Scraped at 15-minute intervals, July 2014 to May 2015
Most experiments use the February data
Data and extraction code are available: github.com/yuvalpinter/MediaAnalysis
4
Data Samples
[Homepage snapshots: Nov 23, 15:00 and Feb 15, 15:30]
5
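Text Processing
Consecutive-appearance de-duplication
Tokenization (including lemmatization and affix deletion) using hspell (Har’el and Kenigsberg)
Mostly good, sometimes not so much:
הפרלמנט הירדני עמד דקת דומייה לזכר המחבלים => פרלמנט ירדן עימד דקה דומייה זכר מחבל (NRG, 20/11/2014, 0:15)
English: "The Jordanian parliament stood for a minute of silence in memory of the terrorists". The lemmatizer turned עמד ("stood") into עימד ("typeset").
רעידת אדמה קטלנית כאלף נהרגו בנפאל: "שעות קריטיות" => רעידה דימה קטלוניה אילף נהרג נפאל שעה קריטי (Mako, 25/4/2015, 19:30)
English: "Deadly earthquake, about a thousand killed in Nepal: 'critical hours'". The lemmatizer turned אדמה ("earth") into דימה ("imagined"), קטלנית ("deadly") into קטלוניה ("Catalonia"), and כאלף ("about a thousand") into אילף ("tamed").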
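Consecutive-appearance de-duplication can be sketched as follows. This is a minimal illustration, not the authors' extraction code: it assumes each outlet's scrape is a time-ordered list of `(timestamp, headline)` pairs, and the function name and sample data are hypothetical.

```python
from itertools import groupby

def dedupe_consecutive(snapshots):
    """Keep only the first sighting of each run of identical headlines.

    `snapshots` is assumed to be a time-ordered list of
    (timestamp, headline) pairs for a single outlet.
    """
    kept = []
    # groupby() groups *adjacent* items with the same key, which is
    # exactly the "consecutive appearance" notion.
    for _, run in groupby(snapshots, key=lambda pair: pair[1]):
        kept.append(next(run))  # first (timestamp, headline) of the run
    return kept

# Hypothetical scrape at 15-minute intervals
snapshots = [
    ("15:00", "Headline A"),
    ("15:15", "Headline A"),
    ("15:30", "Headline B"),
    ("15:45", "Headline A"),
]
deduped = dedupe_consecutive(snapshots)
print(deduped)
```

Because only adjacent repeats are collapsed, a headline that returns after a different one (as "Headline A" does at 15:45) is kept as a new entry.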
6
Features
Form: character length, word count, word length (average/min/median/max), punctuation token count
Lexicon: quantile word/lemma frequencies (average/min/median/max). Sources: Wordlists (Hermit Dave), Israblog (Linzen 2009)
Morphology: affix letters
Word features: probably the media cycle
Features and extraction code are available
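The "Form" features could be computed along these lines. This is a rough sketch under stated assumptions, not the authors' extractor: it splits on whitespace instead of using hspell tokens, and it counts ASCII punctuation characters as an approximation of the punctuation token count. The function and key names are hypothetical.

```python
import statistics
import string

def form_features(headline):
    """Compute simple surface-form features for one headline."""
    # Assumption: whitespace tokenization stands in for a real tokenizer.
    words = [tok.strip(string.punctuation) for tok in headline.split()]
    words = [w for w in words if w]
    lengths = [len(w) for w in words] or [0]  # guard against empty input
    return {
        "char_length": len(headline),
        "word_count": len(words),
        "word_len_avg": statistics.mean(lengths),
        "word_len_min": min(lengths),
        "word_len_median": statistics.median(lengths),
        "word_len_max": max(lengths),
        # Approximation: counts ASCII punctuation characters only.
        "punct_count": sum(ch in string.punctuation for ch in headline),
    }

features = form_features("Quake in Nepal: critical hours")
print(features)
```

Hebrew headlines work the same way, since `len()` counts Unicode characters rather than bytes.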
7
Setup & Results
7 classes, 1785 headlines (all of February)
Weka’s Random Forest
Accuracy: 45.4% with 10 trees, 49.5% with 50 trees
Most significant features: number of words, average word length, average position in the word frequency table
8
Feature Example: Character length
[Figure: character count]
9
Pairwise Setup
Binary classifier accuracy (%) for each of the 21 outlet pairs (higher = easier to classify = less similar). The pair labels were not preserved:
72.3, 88, 92.1, 73.4, 78.5, 76.5, 84.5, 91.8, 75.8, 78.1, 77.9, 72.9, 86.7, 79, 79.4, 88.9, 74, 78.2, 69.4, 64.9, 58.6
Class over agenda:
Mako, Walla, and NRG form a cluster (“online ethos”)
Ha’aretz and Ma’ariv are relatively unique (newspaper-derived)
Israel Hayom most resembles its tabloid competitor ynet, more than the agenda-sharing NRG
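The pairwise setup amounts to one binary classification task per pair of outlets. The sketch below shows only that loop. The outlet list is taken from the names mentioned on this slide and is assumed to match the 7 classes; `majority_baseline` is a hypothetical placeholder for training and scoring a real classifier (the talk used Weka's Random Forest), included so the example runs on its own.

```python
from collections import Counter
from itertools import combinations

# Assumption: the 7 classes are the outlets named on the slide.
OUTLETS = ["Ha'aretz", "Israel Hayom", "Ma'ariv", "Mako", "NRG", "Walla", "ynet"]

def majority_baseline(labeled):
    """Placeholder scorer: accuracy of always guessing the larger class."""
    counts = Counter(label for _, label in labeled)
    return max(counts.values()) / len(labeled)

def pairwise_scores(headlines_by_outlet, evaluate=majority_baseline):
    """Score one binary task per unordered pair of outlets."""
    scores = {}
    for a, b in combinations(sorted(headlines_by_outlet), 2):
        labeled = [(h, a) for h in headlines_by_outlet[a]]
        labeled += [(h, b) for h in headlines_by_outlet[b]]
        scores[(a, b)] = evaluate(labeled)
    return scores

# Toy data: 4 made-up headlines per outlet
toy = {name: [f"{name} headline {i}" for i in range(4)] for name in OUTLETS}
scores = pairwise_scores(toy)
print(len(scores))  # 7 outlets give 7 * 6 / 2 = 21 pairs
```

Swapping `evaluate` for a function that cross-validates a classifier on the labeled pairs would yield an accuracy matrix of the kind reported above.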
10
Future Work
Better content (“agenda”) features: topic models? Sentiment?
Some weird phenomena to be ironed out:
Alternating headlines: de-dupe based on the most recent k
Very similar headlines: merge, or use edit distance
Location-sensitive features:
Headlines starting with "Netanyahu" (נתניהו): roughly balanced across outlets
Headlines starting with "PM" (רה"מ): 50% in Israel Hayom, another 25% in NRG
More text: main leads / other headlines
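The two de-duplication ideas (a window of the most recent k headlines, and edit distance for near-identical headlines) could be combined roughly as below. This is a sketch of one possible approach, not something the talk implemented; the defaults `k=3` and `max_distance=0` are arbitrary illustrative values.

```python
from collections import deque

def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def dedupe_recent(headlines, k=3, max_distance=0):
    """Drop a headline if it is within `max_distance` edits of any of
    the last `k` headlines that were kept."""
    recent = deque(maxlen=k)
    kept = []
    for headline in headlines:
        if any(edit_distance(headline, r) <= max_distance for r in recent):
            continue
        kept.append(headline)
        recent.append(headline)
    return kept

alternating = dedupe_recent(["A", "B", "A", "B", "C"], k=2)
near_dupes = dedupe_recent(["PM visits Paris", "PM visits Paris."], max_distance=1)
print(alternating, near_dupes)
```

With `max_distance=0` only exact repeats inside the window are dropped, which handles alternating headlines; raising it merges headlines that differ by a few characters.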
11
Thanks!
github.com/yuvalpinter/MediaAnalysis