Simply Visualizing Politics: voxPolitico Adrian Besimi, Visar Shehu Contemporary Sciences and Technologies South East European University www.seeu.edu.mk.

Slides:



Advertisements
Similar presentations
Improved TF-IDF Ranker
Advertisements

Use of open data by academia and start-ups Adrian Besimi 11/26/2014 Adrian Besimi, Use of Open Data by academia and start-ups 1.
© Tan,Steinbach, Kumar Introduction to Data Mining 8/05/ Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan,
Doncho Minkov Telerik School Academy schoolacademy.telerik.com Technical Trainer
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
Aki Hecht Seminar in Databases (236826) January 2009
© Prentice Hall1 DATA MINING TECHNIQUES Introductory and Advanced Topics Eamonn Keogh (some slides adapted from) Margaret Dunham Dr. M.H.Dunham, Data Mining,
Text Retrieval and Spreadsheets Class 4 LBSC 690 Information Technology.
Views Chapter 12. What Are Views? A virtual table that comprises the fields of one or more tables in the database It is a virtual table since it does.
LSDS-IR’08, October 30, Peer-to-Peer Similarity Search over Widely Distributed Document Collections Christos Doulkeridis 1, Kjetil Nørvåg 2, Michalis.
Innovations in Justice Information Sharing Strategies and Best Practices February 2007 Melissa R. Johnson, CCA Communications Director, International Association.
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
Introduction to Data Mining Engineering Group in ACL.
Knowledge Science & Engineering Institute, Beijing Normal University, Analyzing Transcripts of Online Asynchronous.
2008 History and Social Science Standards of Learning: Using Student Engagement To Support Active Learning and Assessment January 2013.
Towards Automatic Structured Web Data Extraction System Tomas Grigalis, 2nd year PhD student Scientific supervisor: prof. habil. dr. Antanas Čenys.
SharePoint 2010 Business Intelligence Module 6: Analysis Services.
Introduction The large amount of traffic nowadays in Internet comes from social video streams. Internet Service Providers can significantly enhance local.
An Integrated Approach to Extracting Ontological Structures from Folksonomies Huairen Lin, Joseph Davis, Ying Zhou ESWC 2009 Hyewon Lim October 9 th, 2009.
Custom driven scientific information extraction from digital libraries using integrated text mining services Betim Çiço, Adrian Besimi, Visar Shehu 14th.
Week 1: Find resources, Summarize, paraphrase, thesis, and outline Week 2: Research and Write, incorporate evidence and transitions (1/2 done) Week 3:
Learning Object Metadata Mining Masoud Makrehchi Supervisor: Prof. Mohamed Kamel.
WJEC Applied ICT Spreadsheet Skills 1.Introduction to Financial Modelling Definition A model is a program which has been developed to copy the way.
Visualization, analysis and mining of geo- spatial information in educational data sets using web-based tools Aniruddha Desai |Winter 2013 Presentation.
Funded by: European Commission – 6th Framework Project Reference: IST WP 2: Learning Web-service Domain Ontologies Miha Grčar Jožef Stefan.
Chapter 1 Introduction to Data Mining
Innovations in Justice Information Sharing Strategies and Best Practices November 30, 2006 Lisa M. Palmieri, CCA-Supervisory Intelligence Analyst President,
Introduction to Web Mining Spring What is data mining? Data mining is extraction of useful patterns from data sources, e.g., databases, texts, web,
Automatically Extracting Data Records from Web Pages Presenter: Dheerendranath Mundluru
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Incident Threading for News Passages (CIKM 09) Speaker: Yi-lin,Hsu Advisor: Dr. Koh, Jia-ling. Date:2010/06/14.
Audio and Video Chris McConnell Department of Radio-TV-Film November 30, 2006.
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
The ISI Web of Knowledge nce/training/wok/#tab3.
Toward A Session-Based Search Engine Smitha Sriram, Xuehua Shen, ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Outline Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms Author: Monika Henzinger Presenter: Chao Yan.
Microsoft Access is a database program to manage sort retrieve group filter for certain records.
Using to Save Lives Or, Using Digg to find interesting events. Presented by: Luis Zaman, Amir Khakpour, and John Felix.
What Is Text Mining? Also known as Text Data Mining Process of examining large collections of unstructured textual resources in order to generate new.
Semantic web Bootstrapping & Annotation Hassan Sayyadi Semantic web research laboratory Computer department Sharif university of.
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
1 Introduction to Data Mining C hapter 1. 2 Chapter 1 Outline Chapter 1 Outline – Background –Information is Power –Knowledge is Power –Data Mining.
LogTree: A Framework for Generating System Events from Raw Textual Logs Liang Tang and Tao Li School of Computing and Information Sciences Florida International.
Nuhi BESIMI, Adrian BESIMI, Visar SHEHU
Copyright © 2001, SAS Institute Inc. All rights reserved. Data Mining Methods: Applications, Problems and Opportunities in the Public Sector John Stultz,
Data analytics and mash-up Real time analytics of employment data Team Shadowfax 1/25/2016 CMPE Class Project 0.
HEMANTH GOKAVARAPU SANTHOSH KUMAR SAMINATHAN Frequent Word Combinations Mining and Indexing on HBase.
TRANS: T ransportation R esearch A nalysis using N LP Technique S Hyoungtae Cho, Melissa Egan, Ferhan Ture Final Presentation December 9, 2009.
Description and exemplification use of a Data Dictionary. A data dictionary is a catalogue of all data items in a system. The data dictionary stores details.
Using the Internet for academic purposes Your Logo Birkbeck Library.
Reports. Reports display information retrieved from a database in an attractive printed format. Reports can be created directly from tables, but More.
Frompo is a Next Generation Curated Search Engine. Frompo has a community of users who come together and curate search results to help improve.
Chapter 14 – Association Rules and Collaborative Filtering © Galit Shmueli and Peter Bruce 2016 Data Mining for Business Analytics (3rd ed.) Shmueli, Bruce.
Seminar on seminar on Presented By L.Nageswara Rao 09MA1A0546. Under the guidance of Ms.Y.Sushma(M.Tech) asst.prof.
Data Mining and Text Mining. The Standard Data Mining process.
Using Google Scholar Ronald Wirtz, Ph.D.Calvin T. Ryan LibraryDec Finding Scholarly Information With A Popular Search Engine Tool.
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
Open data in the Parliament
Presentation by: ABHISHEK KAMAT ABHISHEK MADHUSUDHAN SUYAMEENDRA WADKI
DATA MINING © Prentice Hall.
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms By Monika Henzinger Presented.
Mining the Data Charu C. Aggarwal, ChengXiang Zhai
Scholastica C Ukwoma, Ph.D
Martin Rajman, EPFL Switzerland & Martin Vesely, CERN Switzerland
Title of poster... M. Author1, D. Author2, M. Author3
Title of ePoster... M. Author1, D. Author2, M. Author3
Presentation transcript:

Simply Visualizing Politics: voxPolitico Adrian Besimi, Visar Shehu Contemporary Sciences and Technologies South East European University |

Outline Introduction The Problem vs. The idea and the student involvement Web crawling for data sources / documents The problems and the Data structure Data Mining algorithms The future Conclusion

Introduction A project sponsored by Global Integrity Fund, Washington D.C. Selected among 370 project ideas $ in seed money to test our idea With the main aim to create a future platform (open source)

“We are a society that is data rich, but information poor.” -Robert H. Waterman

The problem vs. The idea Unstructured Data: information that does not have a pre-defined data model The Macedonian Parliament has transcripts of all sessions as e-documents (PDF, DOC) Huge amount of documents with lots of anomalies The main idea: Organizing all speeches of the Parliament since 1991 the first session, until nowadays into a new data structure. Apply Data Mining algorithms to retrieve different conclusions and different results Visualize them!

Student Involvement How to involve undergraduate students in “real world” projects? Not all students are happy to work on real projects. Challenging students sometimes works!

Representatives: Crawling Assembly Party Representative

Representatives: Mapping Assembly Party Representative

The sources Started with analyzing the structure of a single document. Different document types (PDF, new PDF, Word) Since 2006 the documents are in word format. The structure was not well defined. Manually analyzing the documents and identify the elements that would make the process of organization easier

Sample document FORMATS: PDF WORD 95 WORD 97 WORD 2003 WORD 2007

Sample document: Pre-processing Date Representative Speech Document

Sample document: Mapping Date Representative Speech Document

The Database

The problems Different typo’s, problems with delimiter: Some documents contain “FirstName LastName” format, some “LastName FirstName” Some documents contain “:” to determine new speech, some do not Sometimes there are no spaces between phrases Sometimes a phrase ends with brackets “)” Governmental representatives do have speeches which we record, but ignore in the display of the results

The structured data ASSEMBLY REPRESENTATIVE – SPEECH SPEECH PHRASES – WORDS – FREQUENT WORDS ASSEMBLY DOCUMENTS – SPEECHES SPEECH PHRASES – WORDS

Data Mining Two main algorithms: Cosine Similarity: to calculating the similarity between two vectors -> generated a similarity matrix for each two representatives based on their speeches Apriori algorithm for generating frequent item sets of words for particular representative

Data Mining: Cosine Similarity -A large similarity matrix, around records -95% is pre-calculated, before the launch -No need to modify, except when new documents are submitted -Example: On the fly, find the MP similarity with other MPs, current or from the past

Data Mining: Frequent Item sets generation using Apriori Algorithm -Find the most common words (eliminating the outliers) -Create combinations of the words that meet a certain minimum support -Finally, create the vector of frequent words used by MPs. -Example: Find the most common phrase used by this or another MP

Visualization Visualizing the results was the overall goal of the project! Using WebGL, D3.JS and Google Charts Visualize general statistics: Number of speeches, most used words, comparing two MPs, displaying the similarity between representatives, find the most common phrases used by representatives, time series of data (words used), etc.

Next steps Include translation on-the-fly using Google or BING Include voting by MPs for different Laws Try to link sessions of the parliament with particular events?

Conclusion Challenging students may result in good project Even documents without a predefined structure can be used for data mining if one can know how to “read” them Structured data gathered from text can be used for various statistical calculations If these data are put on the web and visualized appropriately, they can be easily used by any one without the need to read all the hundred of thousand of speeches.

Thank You! Any questions? Contacts: |