Fidel Castro: Information Retrieval for 37 years of Socialist Oratory


Yevgeni Berzak, Carsten Ehrler, Michal Richter & Todd Shore
Project Seminar: Text Mining/NLP for Historical Documents
Caroline Sporleder & Michael Schreiber
Computational Linguistics & Phonetics, Saarland University
19.06.2018
http://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/02.Trinidad_%2859%29.JPG/800px-02.Trinidad_%2859%29.JPG

Overview
Introduction
Goals
Methods
Conclusion
http://lanic.utexas.edu/project/castro/image/fidel3.jpg

1) Introduction
U of Texas DB: “includes speeches, interviews, etc by Fidel Castro from 1959 – 1996”
Organised by date, annotated with basic metadata: date, location, document type (e.g. 'speech', 'interview')
Keyword search
http://www.nndb.com/people/118/000023049/fidel-castro-sm.jpg

1) Introduction
Issues:
Keyword-based search is unsophisticated and not designed for historians or intelligent searching
Relative lack of structure within documents makes standard methods of retrieving document information (e.g. categorisation, summarisation) difficult
No explicit links between documents in the corpus

2) Goals
Index corpus by named entities
Calculate similarity of documents to each other based on named entities and lexical similarity
Implement metadata-driven document retrieval system
Provide linking between results of retrieval system
Represent similarity of documents based on one type of entity (e.g. persons or locations) or many
Search for documents related to designated named-entity keyword(s), input by user
Interactive visualisation of results

3) Methods
Download data, extract text
Extract metadata from header
  Date, location, document type (e.g. “speech”, “interview”)
Generate database
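The extraction and database steps above could be sketched roughly as follows. This is only an illustration: the header layout (a `-FIELD-` marker line followed by its value), the sample document, the field names, and the table schema are all assumptions, not details given on the slides.

```python
import sqlite3

# Hypothetical header layout and made-up sample content, for illustration only.
SAMPLE_DOCUMENT = """-DATE-
19600101
-PLACE-
Some City
-DOCUMENT_TYPE-
SPEECH
-TEXT-
First line of the body.
Second line.
"""

FIELDS = ("DATE", "PLACE", "DOCUMENT_TYPE", "TEXT")


def parse_document(raw):
    """Split a raw document into a dict keyed by (assumed) header field name."""
    record = {}
    current = None
    for line in raw.splitlines():
        marker = line.strip()
        if marker.startswith("-") and marker.endswith("-") and marker.strip("-") in FIELDS:
            current = marker.strip("-")
            record[current] = []
        elif current is not None:
            record[current].append(line)
    return {key: "\n".join(lines).strip() for key, lines in record.items()}


def build_database(documents):
    """Load parsed documents into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, date TEXT, location TEXT, doc_type TEXT, body TEXT)"
    )
    for raw in documents:
        rec = parse_document(raw)
        conn.execute(
            "INSERT INTO documents (date, location, doc_type, body) VALUES (?, ?, ?, ?)",
            (rec.get("DATE"), rec.get("PLACE"), rec.get("DOCUMENT_TYPE"), rec.get("TEXT")),
        )
    conn.commit()
    return conn
```

With the metadata in a table, a query such as `SELECT id FROM documents WHERE doc_type = 'SPEECH' ORDER BY date` gives the date-ordered, type-filtered view that the metadata-driven retrieval goal describes.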

3) Methods
Stanford Named Entity Recognizer
The visits to these places have made unforgettable impressions on us. Murmansk, for example, and Moscow, Volgograd, Uzbekistan, Tashkent, and Samarkand--in each of these places we have seen the efforts by the Soviet people, and how the Soviet workers, led by the Communist Party, are creating great things. We have seen regions separated by great distances...


3) Methods
Stanford Named Entity Recognizer
Fidel Castro Fidel CASTRO Dr. Fidel Castro Castro Fidel CASTRO Dr. Castro FidelCastro Dr Fidel Castro

3) Methods
Calculate NE similarities with p-spectrum string kernels
Calculate overall lexical similarity of documents
Fidel Dr. Fidel Allende 1 0.8 Allendre
NOTE: Mock data
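A p-spectrum string kernel compares two strings by counting the contiguous substrings of length p that they share. A minimal sketch follows, assuming p = 2, case folding, and cosine-style normalisation so that identical strings score 1. None of these settings is stated on the slides, and the function names are made up for this example.

```python
from collections import Counter
from math import sqrt


def spectrum(text, p):
    """Count every contiguous substring of length p in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def p_spectrum_kernel(a, b, p=2):
    """Unnormalised p-spectrum kernel: dot product of the two p-gram count vectors."""
    spec_a, spec_b = spectrum(a, p), spectrum(b, p)
    return sum(count * spec_b[gram] for gram, count in spec_a.items())


def ne_similarity(a, b, p=2):
    """Kernel value normalised to [0, 1]; case is ignored (an assumption)."""
    a, b = a.lower(), b.lower()
    denom = sqrt(p_spectrum_kernel(a, a, p) * p_spectrum_kernel(b, b, p))
    return p_spectrum_kernel(a, b, p) / denom if denom else 0.0
```

Under these assumptions, spelling variants such as "Allende" and "Allendre" share most of their bigrams and score high, while unrelated names such as "Fidel" and "Allende" score low. Strings shorter than p have an empty spectrum and score 0.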

3) Methods
NE counts include not only NEs actually in document, but also equivalent NEs
NE query mechanism includes all NEs similar to query, not only exact string matches
For each pair of documents, similarity measurement is produced: either for named entities or lexical similarity
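The idea of counting and querying equivalent NEs rather than exact strings can be illustrated with the sketch below. The bigram similarity is a stand-in measure, the 0.7 threshold for treating two NEs as equivalent is an arbitrary assumption, and the function names and ranking by count are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigram_similarity(a, b):
    """Normalised 2-spectrum similarity, used here as a stand-in NE measure."""
    grams_a = Counter(a.lower()[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b.lower()[i:i + 2] for i in range(len(b) - 1))
    dot = sum(n * grams_b[g] for g, n in grams_a.items())
    norm = sqrt(sum(n * n for n in grams_a.values()) * sum(n * n for n in grams_b.values()))
    return dot / norm if norm else 0.0


def expand_query(query, vocabulary, threshold=0.7):
    """Return every known NE whose similarity to the query meets the threshold."""
    return {ne for ne in vocabulary if bigram_similarity(query, ne) >= threshold}


def expanded_count(query, doc_entities, threshold=0.7):
    """Count mentions of the query NE and of its equivalents in one document."""
    counts = Counter(doc_entities)
    return sum(counts[ne] for ne in expand_query(query, counts, threshold))


def retrieve(query, corpus, threshold=0.7):
    """Rank document ids by expanded NE count, dropping documents with no match."""
    scored = {doc_id: expanded_count(query, ents, threshold) for doc_id, ents in corpus.items()}
    return sorted((d for d, s in scored.items() if s > 0), key=lambda d: -scored[d])
```

Here `corpus` is assumed to map a document id to the list of NE strings found in that document. A query for "Fidel Castro" then also matches documents that only contain "Dr. Fidel Castro" or "Fidel CASTRO".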

3) Methods
Interface: Interactive GUI, allowing for visualisation of custom queries based on different combinations of similarity measures
Graphical representation
  Node: Document
  Edge: Similarity measure
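The node-and-edge representation can be sketched as a weighted adjacency structure. The example below assumes each document is reduced to a vector of NE counts, uses cosine similarity as the edge weight, and keeps only edges above an arbitrary threshold. These choices are illustrative assumptions and do not describe the GUI on the slides.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(n * b[k] for k, n in a.items())
    norm = sqrt(sum(n * n for n in a.values()) * sum(n * n for n in b.values()))
    return dot / norm if norm else 0.0


def build_graph(doc_vectors, threshold=0.3):
    """Return an adjacency dict: node = document id, weighted edge = similarity."""
    graph = {doc_id: {} for doc_id in doc_vectors}
    for left, right in combinations(doc_vectors, 2):
        weight = cosine(doc_vectors[left], doc_vectors[right])
        if weight >= threshold:  # the threshold is an assumption, not from the slides
            graph[left][right] = weight
            graph[right][left] = weight
    return graph


def neighbours(graph, doc_id):
    """List documents linked to doc_id, most similar first."""
    return sorted(graph[doc_id], key=graph[doc_id].get, reverse=True)
```

Swapping the vectors (for example location NEs only, person NEs only, or all word counts) changes which edges appear, which is one way to read "different combinations of similarity measures".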


4) Conclusion
Representing similarity information allows more to be inferred than from metadata alone (e.g. “topics”, correlation with other variables such as time and location)
NE similarity information can be used to find related documents
Guided exploration of information, tailored to a (historical) researcher's interests and based on similarities, is possible and effective

Works cited
Castro Speech Database. Retrieved 1 Mar 2010, from University of Texas at Austin, LANIC website: http://lanic.utexas.edu/la/cb/cuba/castro.html.
Stanford Named Entity Recognizer. Retrieved 4 Mar 2010, from Stanford University, The Stanford NLP (Natural Language Processing) Group website: http://nlp.stanford.edu/software/CRF-NER.shtml.
http://upload.wikimedia.org/wikipedia/commons/5/55/Trinidad_(Kuba)_02.jpg

1) Introduction
“Castro Warns Against Complacency”
“Speaking here tonight I am presented with one of the most difficult obligations in this long struggle which began on Nov. 30, 1956, in Santiago. The people are listening, the revolutionaries are listening, and the soldiers whose destinies are in other hands are listening also. this is a decisive moment in our history: The tyranny has been overthrown, but there is still much to be done. Let us not fool ourselves into believing that the future will be easy; perhaps everything will be more difficult in the future.”
–Fidel Castro, 9 Jan, 1959
http://lanic.utexas.edu/project/castro/db/1959/19590109.html