Published by Gwendolyn Day. Modified over 6 years ago.
Fidel Castro: Information Retrieval for 37 years of Socialist Oratory
Yevgeni Berzak, Carsten Ehrler, Michal Richter & Todd Shore
Project Seminar: Text Mining/NLP for Historical Documents
Caroline Sporleder & Michael Schreiber
Computational Linguistics & Phonetics, Saarland University
Berzak, Ehrler, Richter & Shore
Overview
- Introduction
- Goals
- Methods
- Conclusion
19/06/2018 Berzak, Ehrler, Richter & Shore
1) Introduction
- University of Texas database: “includes speeches, interviews, etc by Fidel Castro from 1959 – 1996”
- Organised by date and annotated with basic metadata: date, location, document type (e.g. 'speech', 'interview')
- Keyword search
1) Introduction
Issues:
- Keyword-based search is unsophisticated and not designed specifically for historians or intelligent searching
- Relative lack of structure within documents makes standard methods of retrieving document information (e.g. categorisation, summarisation) difficult
- No explicit link between documents in the corpus
2) Goals
- Index corpus by named entities
- Calculate similarity of documents to each other based on named entities and lexical similarity
- Implement metadata-driven document retrieval system
- Provide linking between results of retrieval system
- Represent similarity of documents based on one type of entity (e.g. persons or locations) or many
- Search for documents related to designated named-entity keyword(s), input by user
- Interactive visualisation of results
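The first goal, indexing the corpus by named entities, can be illustrated with a small inverted index. This is only a sketch under assumptions: the slides do not describe the project's actual data structures, and the names `build_ne_index`, `search` and the document IDs are hypothetical.

```python
from collections import defaultdict


def build_ne_index(doc_entities):
    """Map each named entity to the set of documents that mention it.

    doc_entities: dict of document ID -> iterable of entity strings
    (hypothetical input format, not taken from the slides).
    """
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for ne in entities:
            index[ne].add(doc_id)
    return index


def search(index, keywords):
    """Return the documents that mention every keyword (exact match only)."""
    sets = [index.get(k, set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


# Hypothetical toy corpus
docs = {
    "speech_a": ["Moscow", "Tashkent"],
    "speech_b": ["Moscow", "Santiago"],
}
index = build_ne_index(docs)
print(search(index, ["Moscow"]))
```

Exact string matching is the simplest possible lookup; the Methods slides go further and match similar entity strings as well.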
3) Methods
- Download data, extract text
- Extract metadata from header: date, location, document type (e.g. “speech”, “interview”)
- Generate database
3) Methods
Stanford Named Entity Recognizer
The visits to these places have made unforgettable impressions on us. Murmansk, for example, and Moscow, Volgograd, Uzbekistan, Tashkent, and Samarkand--in each of these places we have seen the efforts by the Soviet people, and how the Soviet workers, led by the Communist Party, are creating great things. We have seen regions separated by great distances...
3) Methods
Stanford Named Entity Recognizer
- Fidel Castro
- Fidel CASTRO
- Dr. Fidel Castro
- Castro
- Fidel CASTRO
- Dr. Castro
- FidelCastro
- Dr Fidel Castro
3) Methods
- Calculate NE similarities with p-spectrum string kernels
- Calculate overall lexical similarity of documents
[Table: pairwise NE similarity over the entities Fidel, Dr. Fidel, Allende, Allendre; surviving cell values: 1, 0.8. NOTE: Mock data]
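For readers unfamiliar with the technique, a p-spectrum string kernel compares two strings by counting the contiguous substrings of length p that they share. The sketch below is a minimal illustration, not the project's code: the choice p = 2, the cosine-style normalisation, and the function names are assumptions.

```python
from collections import Counter
from math import sqrt


def spectrum(s, p):
    """Count every contiguous substring of length p in s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))


def p_spectrum_kernel(s, t, p=2):
    """Sum, over shared length-p substrings, of the product of their counts."""
    cs, ct = spectrum(s, p), spectrum(t, p)
    return sum(n * ct[u] for u, n in cs.items())


def normalised_kernel(s, t, p=2):
    """Scale the kernel into [0, 1] so that identical strings score 1.0.

    The normalisation is an assumption; the slides do not say how
    scores were scaled.
    """
    kss = p_spectrum_kernel(s, s, p)
    ktt = p_spectrum_kernel(t, t, p)
    if kss == 0 or ktt == 0:
        return 0.0  # at least one string is shorter than p
    return p_spectrum_kernel(s, t, p) / sqrt(kss * ktt)


print(normalised_kernel("Allende", "Allendre"))  # high: near-identical spellings
print(normalised_kernel("Fidel", "Castro"))      # 0.0: no shared bigrams
```

With this kind of measure, spelling variants of one name score close to 1 while unrelated names score near 0, which is the pattern the mock table suggests.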
3) Methods
- NE counts include not only NEs actually in the document, but also equivalent NEs
- NE query mechanism includes all NEs similar to the query, not only exact string matches
- For each pair of documents, a similarity measurement is produced: either for named entities or lexical similarity
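The idea of counting equivalent NEs and expanding a query to similar NEs might look roughly like the following sketch. Everything here is illustrative: the lower-cased bigram similarity, the 0.6 threshold, and the names `expand_query` and `equivalent_count` are assumptions rather than details from the slides.

```python
from collections import Counter
from math import sqrt


def bigram_similarity(a, b):
    """Normalised 2-spectrum kernel on lower-cased strings (assumed measure)."""
    ca = Counter(a.lower()[i:i + 2] for i in range(len(a) - 1))
    cb = Counter(b.lower()[i:i + 2] for i in range(len(b) - 1))
    dot = sum(n * cb[g] for g, n in ca.items())
    norm = sqrt(sum(n * n for n in ca.values()) * sum(n * n for n in cb.values()))
    return dot / norm if norm else 0.0


def expand_query(query, vocabulary, threshold=0.6):
    """Return every known NE whose similarity to the query meets the threshold."""
    return {ne for ne in vocabulary if bigram_similarity(query, ne) >= threshold}


def equivalent_count(query, doc_entities, vocabulary, threshold=0.6):
    """Count mentions in one document of the query or any equivalent NE."""
    matches = expand_query(query, vocabulary, threshold)
    return sum(1 for ne in doc_entities if ne in matches)


# Hypothetical vocabulary and document
vocab = {"Fidel Castro", "Fidel CASTRO", "Dr. Fidel Castro", "Allende"}
doc = ["Fidel CASTRO", "Allende", "Dr. Fidel Castro"]
print(equivalent_count("Fidel Castro", doc, vocab))  # 2
```

The threshold trades recall for precision: a lower value merges more spelling variants but risks conflating distinct people.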
3) Methods
Interface:
- Interactive GUI, allowing for visualisation of custom queries based on different combinations of similarity measures
- Graphical representation
- Node: Document
- Edge: Similarity measure
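The node/edge representation can be sketched as a weighted graph held in nested dictionaries. This is a hedged illustration only: Jaccard overlap of entity sets stands in for the similarity measures, which the slides do not specify in detail, and `build_graph`, the threshold, and the document IDs are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of entities two documents have in common (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(doc_entities, threshold=0.2):
    """Nodes are documents; an edge links two documents whose similarity
    meets the threshold, weighted by that similarity."""
    graph = {doc: {} for doc in doc_entities}
    for d1, d2 in combinations(doc_entities, 2):
        weight = jaccard(doc_entities[d1], doc_entities[d2])
        if weight >= threshold:
            graph[d1][d2] = weight
            graph[d2][d1] = weight
    return graph


# Hypothetical documents and their entity sets
docs = {
    "doc_a": {"Moscow", "Tashkent"},
    "doc_b": {"Moscow", "Havana"},
    "doc_c": {"Santiago"},
}
for node, edges in build_graph(docs).items():
    print(node, edges)
```

Restricting the entity sets to one type (e.g. only locations) before building the graph would give the per-type views mentioned in the goals.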
4) Conclusion
- Representing similarity information allows more to be inferred than metadata alone does (e.g. “topics”, correlation to other variables such as time and location)
- NE similarity information can be used to find related documents
- Guided exploration of information, tailored to a (historical) researcher's interests and based on similarities, is possible and effective
Works cited
- Castro Speech Database. Retrieved 1 Mar 2010, from University of Texas at Austin, LANIC website.
- Stanford Named Entity Recognizer. Retrieved 4 Mar 2010, from Stanford University, The Stanford NLP (Natural Language Processing) Group website.
1) Introduction
“Castro Warns Against Complacency”
“Speaking here tonight I am presented with one of the most difficult obligations in this long struggle which began on Nov. 30, 1956, in Santiago. The people are listening, the revolutionaries are listening, and the soldiers whose destinies are in other hands are listening also. This is a decisive moment in our history: The tyranny has been overthrown, but there is still much to be done. Let us not fool ourselves into believing that the future will be easy; perhaps everything will be more difficult in the future.”
–Fidel Castro, 9 Jan, 1959