Extraction of Relevant Evidence for Factual Claim Validation
Deep Learning Research & Application Center
31 October 2017
Claire Li
Claim Validation
- Verify claims with respect to the evidence
  - Measurement of relatedness
  - Measurement of reliability
- News database for reliability assessment
- Meta search engine
Claim Validation Approaches
The problem statement
- Input: a claim and a set of articles of facts
- Output: the truth label of the claim (true, mostly true, mostly false/barely true, false)
Approaches: measurement of relatedness and reliability
- Semantic-similarity based, for repeated and paraphrased claims
  - Calculate the semantic similarity between the given claim and the already fact-checked claims, and return the label by K-nearest neighbors (a minimal sketch follows this slide)
- Deep learning model, for novel claim validation
  - For a claim with associated evidence, learn support/deny features from the related evidence, and use the learned features to verify a new claim-evidence pair
  - Extract the related evidence for a claim based on semantic similarity models
  - Construct the claim-evidence training corpus by extending the LIAR dataset's evidence, which includes metadata such as URL, speaker, etc.
- Search a knowledge base or the Google search engine, for world-knowledge claims (e.g. population, GDP rate)
  - Wolfram Alpha search API
  - Wikipedia – calculation needed
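Below is a minimal sketch of the K-nearest-neighbor lookup over already fact-checked claims. TF-IDF cosine similarity stands in here for the semantic similarity measures discussed later (Semilar, word2vec); the function name, the toy claims and the labels are illustrative, not part of the original system.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def knn_label(new_claim, checked_claims, checked_labels, k=5):
    """Return the majority truth label among the k fact-checked claims most similar to new_claim."""
    vectorizer = TfidfVectorizer()
    checked_vecs = vectorizer.fit_transform(checked_claims)
    query_vec = vectorizer.transform([new_claim])
    sims = cosine_similarity(query_vec, checked_vecs).ravel()
    top_k = sims.argsort()[::-1][:k]
    return Counter(checked_labels[i] for i in top_k).most_common(1)[0][0]

# Toy usage: a paraphrase of a previously checked claim inherits its label.
claims = ["The state added 1 million jobs last year.",
          "Unemployment doubled under the governor.",
          "The governor cut school funding by 20 percent."]
labels = ["mostly true", "false", "barely true"]
print(knn_label("Last year the state created a million new jobs.", claims, labels, k=1))
```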
Extract the Relevant Evidence for a Claim Based on Semantic Similarity Models
- Get true claims (true: 1,683; mostly true: 1,966) and false claims (mostly false/barely true: 1,657; false: 1,998) from the LIAR dataset
- Relevant evidence retrieval (see the sketch after this slide):
  - For each claim, use the Google search engine / a meta search engine to get the top-20 HTML pages and extract the textual content using BoilerPipe
  - For each document's textual content, measure the semantic similarity between the sentences in the relevant documents and the given claim
- Semilar – an open-source platform for similarity at the document/paragraph/sentence level
  - English only
  - Possible for Chinese by developing LSA/LDA models using the Semilar API
- Word2vec-based word embeddings for a semantic space of either similarity or relatedness [2]
  - Achieved by learning from both a general corpus and a specialized thesaurus
- From scratch: create (subject-verb-object) triplets for the sentences in the document using the Stanford parser, create a triplet for the claim, and measure the similarity between these triplets, https://github.com/SilentFlame/Fact-Checker
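A rough sketch of the retrieval-and-ranking step: given a claim and the URLs of the top search results, extract the article text and rank sentences by similarity to the claim. Averaged word2vec vectors stand in for the Semilar similarity models; the search step is omitted, and the word-vector file name and helper names are assumptions for illustration (the python-boilerpipe wrapper and gensim are one plausible tool choice, not necessarily the ones used in the original pipeline).

```python
import numpy as np
import nltk                                        # needs nltk.download('punkt') once
from boilerpipe.extract import Extractor          # python-boilerpipe wrapper around BoilerPipe
from gensim.models import KeyedVectors

# Assumed pre-trained vectors; any word2vec-format file would do.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def sent_vec(sentence):
    """Average the word2vec vectors of the in-vocabulary tokens of a sentence."""
    words = [w for w in nltk.word_tokenize(sentence.lower()) if w in kv]
    return np.mean([kv[w] for w in words], axis=0) if words else np.zeros(kv.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def relevant_sentences(claim, urls, top_n=5):
    """Rank all sentences from the retrieved pages by similarity to the claim."""
    claim_v = sent_vec(claim)
    scored = []
    for url in urls:
        text = Extractor(extractor="ArticleExtractor", url=url).getText()
        for sent in nltk.sent_tokenize(text):
            scored.append((cosine(claim_v, sent_vec(sent)), sent, url))
    return sorted(scored, reverse=True)[:top_n]
```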
Meta Search Engine
- Consults several search engines and combines their answers
- Dogpile
  - Accepts a customizable list of search engines, directories and specialty search sites
  - Winner of the Best Meta Search Engine award from Search Engine Watch for 2003
- SurfWax
  - Using the "SiteSnaps" feature, you can preview any page in the results and see where your terms appear in the document
  - Allows results or documents to be saved for future use
- Vivisimo
  - Not only pulls back matching responses from major search engines, but also automatically organizes the pages into categories
- Metacrawlers
  - News searching is also offered
Semilar API
- Similarity calculation methods:
  - WordNet 3.0 based
  - Latent Semantic Analysis (LSA) model
  - Latent Dirichlet Allocation (LDA) model
  - N-gram overlap, BLEU, Meteor
  - Pointwise Mutual Information (PMI)
  - Syntactic-dependency-based methods
  - Optimized methods based on Quadratic Assignment
- Semantic similarity & semantic relatedness granularities:
  - Word-to-word
  - Sentence-to-sentence (demo): LSA-space based, or built up from a word-to-word model (see the sketch after this slide)
  - Paragraph-to-paragraph
  - Document-to-document: LDA model trained on the whole of Wikipedia and the TASA corpus
  - Word-to-sentence, paragraph-to-document
  - Combinations of the above
- Corpora: the TASA corpus and Wikipedia
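A minimal illustration of building a sentence-to-sentence score on top of a word-to-word measure, in the spirit of the Semilar granularities listed above. WordNet path similarity via NLTK replaces the Semilar API here, and the greedy best-match aggregation is one simple choice, not Semilar's exact method.

```python
from nltk.corpus import wordnet as wn      # needs nltk.download('wordnet') once

def word_sim(w1, w2):
    """Best WordNet path similarity over all synset pairs of the two words."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def sentence_sim(tokens_a, tokens_b):
    """Average, over the words of A, of each word's best match in B (greedy alignment)."""
    if not tokens_a or not tokens_b:
        return 0.0
    return sum(max(word_sim(a, b) for b in tokens_b) for a in tokens_a) / len(tokens_a)

print(sentence_sim(["belong", "union", "member"], ["membership", "union", "join"]))
```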
Sentence-to-Sentence Similarity Demo
Claim: "I belong to the AFL-CIO", said Rick Perry.
- "I did not belong to the AFL-CIO", said Rick Perry. – 0.78
- "I am a member of the AFL-CIO", said Rick Perry. – 0.89
- Rick Perry's AFL-CIO Membership. – 0.66
Measurement of Evidence Reliability
- OpenSources: a curated resource for assessing online information sources, tagged as fake, satire, bias, conspiracy, rumor, state, hate, clickbait, unreliable or reliable
- Title/domain analysis
  - ".wordpress" or ".com.co" appearing in the title/domain can be a warning sign
  - E.g. 70news.wordpress.com: questionable; aceflashman.wordpress.com: satire; deadlyclear.wordpress.com: bias, fake
- "About Us" analysis
  - Check for a Wikipedia page with citations
- Source analysis
  - Does the website mention/link to studies, sources, quotes?
- Writing style analysis (see the sketch after this slide)
  - Frequent use of ALL CAPS
  - Whether the language is free of emotional/encouraging phrasing, e.g. phrases like "WOW!", "Please", etc.
- Social media analysis
  - Check for attention-grabbing and encouragement of likes/click-throughs/shares
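A hedged sketch of the writing-style checks above. The thresholds, the phrase list and the returned flags are illustrative guesses, not values specified in the slides.

```python
import re

EMOTIVE_PHRASES = ["wow!", "please share", "you won't believe", "shocking"]   # illustrative list

def style_flags(text):
    """Return simple writing-style signals that may indicate an unreliable source."""
    tokens = re.findall(r"[A-Za-z']+", text)
    caps = [t for t in tokens if len(t) > 2 and t.isupper()]
    caps_ratio = len(caps) / max(len(tokens), 1)
    lowered = text.lower()
    emotive_hits = sum(phrase in lowered for phrase in EMOTIVE_PHRASES)
    return {
        "all_caps_ratio": caps_ratio,      # frequent ALL CAPS is a warning sign
        "emotive_hits": emotive_hits,      # emotional/encouraging phrases ("WOW!", "Please")
        "exclamations": text.count("!"),
        "suspicious": caps_ratio > 0.1 or emotive_hits > 0,   # assumed threshold
    }

print(style_flags("WOW! You won't BELIEVE what happened. Please share!!!"))
```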
Measurement of Evidence Reliability (cont.)
- The site is professional (domain analysis)
  - Professional sites include .edu/.gov/.mil/.museum/.aero
- The site is published and copyrighted
- Look on the webpage to see whether the website has a sponsor or affiliation
- Check for bias
  - Are there advertisements on the page?
- Check the publication date and the last-updated date
- Check whether an author list is presented
- Compile a list of content-farm sites
checkSource() – Scoring Evidence Reliability
- Compile and rank the most reliable and least reliable websites from news databases, OpenSources, etc. (see the sketch after this slide)
  - PolitiFact
  - Channel 4
  - OpenSources: http://www.opensources.co/
  - mediabiasfactcheck.com: left bias, left-center bias, least biased, right-center bias, right bias, pro-science, satire, pseudoscience, questionable
    - Right-center bias: these media sources are slightly to moderately conservative in bias
    - Left-center bias: these media sources have a slight to moderate liberal bias
- Compile the trust levels for news (Checklist by Ideological Group)
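A minimal sketch of checkSource(): look up the evidence page's domain in curated source lists (OpenSources, mediabiasfactcheck.com, a compiled content-farm list) and map the tags onto a reliability score. The example entries, the tag-to-score mapping and the fallback rules are assumptions for illustration only.

```python
from urllib.parse import urlparse

SOURCE_TAGS = {                      # compiled from the curated lists; entries are illustrative
    "politifact.com": "reliable",
    "70news.wordpress.com": "fake",
    "theonion.com": "satire",
}
TAG_SCORES = {"reliable": 1.0, "least biased": 0.9, "bias": 0.5,
              "clickbait": 0.3, "satire": 0.2, "fake": 0.0}
TRUSTED_TLDS = (".edu", ".gov", ".mil", ".museum", ".aero")

def check_source(url):
    """Score the reliability of an evidence page's source in [0, 1]."""
    domain = urlparse(url).netloc.lower()
    if domain in SOURCE_TAGS:
        return TAG_SCORES.get(SOURCE_TAGS[domain], 0.5)
    if domain.endswith(TRUSTED_TLDS):
        return 1.0        # "professional" domains from the previous slide
    if ".wordpress." in domain or domain.endswith(".com.co"):
        return 0.2        # title/domain warning signs from the earlier slide
    return 0.5            # unknown source: neutral prior

print(check_source("https://70news.wordpress.com/some-story"))
```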
PageRank Algorithm & HITS
- If a web page i contains a hyperlink to web page j, then j is relevant to page i
- If many pages link to j, j is important
- If j has only one backlink, but it comes from a credible site k such as .gov/.edu/www.google.com/www.wikipedia.org, then k asserts that j is reliable
- Credible HITS (Hyperlink-Induced Topic Search)
  - To rate a web page ni, calculate its hub/authority scores, credible hub/authority scores, and incredible hub/authority scores
  - Update ni's authority score to be the sum of the hub scores of each node that points to ni
  - Update ni's hub score to be the sum of the authority scores of each node that ni points to (see the sketch after this slide)
  - A page's real/fake authority (from incoming hub scores) represents a page that is linked by many different credible/incredible hubs
    - Enhanced by how many of those are credible sites, as scored by checkSource()
    - And how many are incredible sites, as scored by checkSource()
  - A page's real/fake hub (from outgoing authority scores) represents a page that points to many other credible/incredible pages
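A compact sketch of the standard HITS hub/authority updates described above, run for a fixed number of iterations over an adjacency list {page: [pages it links to]}. The credibility-weighted ("real/fake") variant is sketched after the next slide; this block is plain HITS only.

```python
def hits(links, iterations=20):
    """Return (hub, authority) scores for every page in the link graph."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority(p) = sum of the hub scores of the pages linking to p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        # hub(p) = sum of the authority scores of the pages p links to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # normalise so the scores stay bounded
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"a.com": ["b.com", "c.com"], "b.com": ["c.com"], "c.com": ["a.com"]})
```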
Real/Fake HITS – Scoring the Evidence
- Given a claim ci as the query, let n1, n2, …, nj be the top-j web pages returned by the search engine
  - n1, n2, …, nj are called the root set
- Construct the base set n1, n2, …, nk, k >= j, iteratively from each ni in the root set by augmenting the root set with all the web pages that each ni links to (ni's outlinks) and some of the pages that link to ni (backlinks)
  - Use the Mozscape API to find ni's backlinks together with their PageRank scores
- Run checkSource() to get a score for each ni, for i in [1, k]
- Return real/fake HITS scores for each node ni in the root set (see the sketch after this slide)
- Extract evidence ei from the top-j web pages, tied to the real/fake HITS scores
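One possible reading of the real/fake scoring: run plain HITS for the hub scores, then split each page's authority into a "real" part and a "fake" part by weighting every incoming hub contribution with the linking page's checkSource() score (or its complement). This weighting scheme is an assumption about how the combination is intended, not a specification from the slides; the toy graph and the constant source score are illustrative.

```python
def real_fake_authority(links, hub, source_score):
    """Split each page's authority into 'real' and 'fake' parts, weighting every incoming
    hub contribution by the linking page's credibility score (or its complement)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    real, fake = {}, {}
    for p in pages:
        backlinks = [q for q in links if p in links[q]]
        real[p] = sum(hub[q] * source_score(q) for q in backlinks)
        fake[p] = sum(hub[q] * (1.0 - source_score(q)) for q in backlinks)
    return real, fake

graph = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"], "c.com": ["a.com"]}
hub = {"a.com": 0.8, "b.com": 0.5, "c.com": 0.3}              # e.g. hub scores from the HITS sketch
real, fake = real_fake_authority(graph, hub, lambda p: 0.9)   # 0.9 stands in for check_source(p)
print(real, fake)
```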
MozTrust
- Moz tools are free tools for link building and analysis, keyword research, webpage performance, local listing audits, etc.
- MozTrust measures a form of link equity that is tied to the "trustworthiness" of the linking website
  - Link equity (also known as PageRank) is the number of incoming links (backlinks) pointing to any given page on the target website
- MozTrust is determined by calculating the link "distance" between a given page and a "seed" site, i.e. a specific, known trust source (website) on the Internet
  - The closer you are linked to a trusted website, the more trust you have (see the sketch after this slide)
- MozTrust is scored on a logarithmic scale between 0 and 10; a higher score generally means more, and more trustworthy, incoming links
- Domain-level MozTrust is like regular MozTrust, but instead of being calculated between web pages, it is calculated between entire domains
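A toy illustration of the "link distance from a seed site" idea behind MozTrust: trust decays with the shortest link distance from a small set of trusted seed domains and is mapped onto a 0-10 scale. Moz's actual computation and scale are proprietary; the decay factor and the seed set here are made up for the example.

```python
from collections import deque

def link_distance_trust(links, seeds, max_score=10.0, decay=0.5):
    """Breadth-first search from the seed sites; trust = max_score * decay ** distance."""
    trust = {s: max_score for s in seeds}
    queue = deque((s, 0) for s in seeds)
    while queue:
        page, dist = queue.popleft()
        for target in links.get(page, []):
            if target not in trust:
                trust[target] = max_score * decay ** (dist + 1)
                queue.append((target, dist + 1))
    return trust

print(link_distance_trust({"wikipedia.org": ["a.com"], "a.com": ["b.com"]}, {"wikipedia.org"}))
```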
Related Works
- Web credibility: features exploration and credibility prediction. European Conference on Information Retrieval, Springer (2013), pp. 557-568
- Predicting webpage credibility using linguistic features. Proceedings of the 23rd International Conference on World Wide Web, ACM (2014), pp. 1135-1140
- Building trusted social media communities: a research roadmap for promoting credible content. Roles, Trust, and Reputation in Social Media Knowledge Markets, Springer (2015), pp. 35-43
- Understanding and predicting Web content credibility using the Content Credibility Corpus. Information Processing and Management 53 (2017)