SaariStory: A framework to represent the medieval history of Saarland Michael Barz, Jonas Hempel, Cornelius Leidinger, Mainack Mondal Course supervisor:

Slides:



Advertisements
Similar presentations
Database Management Using Microsoft Access Xinhua Chen, Ph.D. Chinese Association of Professionals in Science and Technology March 23, 2003.
Advertisements

Chapter 5: Introduction to Information Retrieval
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.
A review on “Answering Relationship Queries on the Web” Bhushan Pendharkar ASU ID
Final Project of Information Retrieval and Extraction by d 吳蕙如.
Information Retrieval in Practice
Search Engines and Information Retrieval
Xyleme A Dynamic Warehouse for XML Data of the Web.
Parametric search and zone weighting Lecture 6. Recap of lecture 4 Query expansion Index construction.
Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Chapter 5: Information Retrieval and Web Search
TopicTrend By: Jovian Lin Discover Emerging and Novel Research Topics.
Overview of Search Engines
Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University.
An Application of Graphs: Search Engines (most material adapted from slides by Peter Lee) Slides by Laurie Hiyakumoto.
Siemens Big Data Analysis GROUP 3: MARIO MASSAD, MATTHEW TOSCHI, TYLER TRUONG.
Page 1 ISMT E-120 Introduction to Microsoft Access & Relational Databases The Influence of Software and Hardware Technologies on Business Productivity.
State of Connecticut Core-CT Project Query 4 hrs Updated 1/21/2011.
Information Retrieval in Practice
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
CEDROM-SNi’s DITA- based Project From Analysis to Delivery By France Baril Documentation Architect.
Webpage Understanding: an Integrated Approach
Tag-based Social Interest Discovery
Introduction The large amount of traffic nowadays in Internet comes from social video streams. Internet Service Providers can significantly enhance local.
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.
Databases C HAPTER Chapter 10: Databases2 Databases and Structured Fields  A database is a collection of information –Typically stored as computer.
Search Engines and Information Retrieval Chapter 1.
Online Autonomous Citation Management for CiteSeer CSE598B Course Project By Huajing Li.
Miscellaneous Excel Combining Excel and Access. – Importing, exporting and linking Parsing and manipulating data. 1.
Start the slide show by clicking on the "Slide Show" option in the above menu and choose "View Show”. or – hit the F5 Key.
Database Management 9. course. Execution of queries.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Query Routing in Peer-to-Peer Web Search Engine Speaker: Pavel Serdyukov Supervisors: Gerhard Weikum Christian Zimmer Matthias Bender International Max.
Pete Bohman Adam Kunk.  ChronoSearch: A System for Extracting a Chronological Timeline ChronoChrono.
1 Data Warehouses BUAD/American University Data Warehouses.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
Talk Schedule Question Answering from Bryan Klimt July 28, 2005.
Personalizing Web Search using Long Term Browsing History Nicolaas Matthijs, Cambridge Filip Radlinski, Microsoft In Proceedings of WSDM
Data Integration Hanna Zhong Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009.
MedKAT Medical Knowledge Analysis Tool December 2009.
NETWORK VISUALIZATION ABHISHEK KUMAR (2011CS50272)
CS5604: Final Presentation ProjOpenDSA: Log Support Victoria Suwardiman Anand Swaminathan Shiyi Wei Department of Computer Science, Virginia Tech December.
CSC 405: Web Application Engineering II Course Preliminaries Course Objectives Course Objectives Students’ Learning Outcomes Students’ Learning Outcomes.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
1 CS 8803 AIAD (Spring 2008) Project Group#22 Ajay Choudhari, Avik Sinharoy, Min Zhang, Mohit Jain Smart Seek.
CS520 Web Programming Full Text Search Chengyu Sun California State University, Los Angeles.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Information Retrieval in Practice
Language Identification and Part-of-Speech Tagging
Search Engine Optimization
Search Engine Architecture
 Corpus Formation [CFT]  Web Pages Annotation [Web Annotator]  Web sites detection [NEACrawler]  Web pages collection [NEAC]  IE Remote.
Text Based Information Retrieval
Editing Your Website on SharePoint 2013
Measuring Sustainability Reporting using Web Scraping and Natural Language Processing Alessandra Sozzi
Databases.
Lecture 12: Data Wrangling
CSE 635 Multimedia Information Retrieval
Boolean and Vector Space Retrieval Models
Building Topic/Trend Detection System based on Slow Intelligence
Information Retrieval and Web Design
Information system analysis and design
Presentation transcript:

SaariStory: A framework to represent the medieval history of Saarland Michael Barz, Jonas Hempel, Cornelius Leidinger, Mainack Mondal Course supervisor: Dr. Caroline Sporleder Text Mining for Historical Documents WS 2011/12

History of Saarland: Motivation Medieval History of Saarland Query: Which records talked about financial matters involving Peter in the year 600 to 1300

SaariStory Enabling keyword based search Answering complex queries Providing topic based search Showing temporal changes in the number of results for a time independent query

POS tagger Noun extractor DB of the text indexed with person, location, dates and topics Topic extractor Workflow of SaariStory PDF Search result annotated with the topics QUERY Preprocessing text Graphical User Interface for querying Tokeni zer

Description of the data Two parts of the data Data block: Chronologically sorted records of events Index block: index for all these data blocks with alphabetically sorted keywords

Components of a data block timestamp data

Characteristics of the data blocks Number of words200,646 Number of lines15,021 Unique data blocks1,490 Pages612

Components of a index block keywords Dates connecting to index block

Characteristics of the index blocks Number of words86,485 Number of lines10,803 Unique index blocks934 Pages277

POS tagger Noun extractor DB of the text indexed with person, location, dates and topics Topic extractor SaariStory PDF Search result annotated with the topics QUERY Preprocessing text Graphical User Interface for querying Tokeni zer

Preprocessing of the Data block Basic strategy: Convert pdf to text using nitro pdf Parse text to separate the data blocks Problem: How to separate data blocks? New lines do not indicate starting of data blocks Distinguish between start of a page and start of a data block

Preprocessing of the Data block Solution: Regular expression Each data block starts with a date: yyyy-mm-dd, yyyy-mm, yyyy/mm, yyyy Use them to search “Regest” and “Druck” too

Preprocessing of the Data block Data structure to present the processed data

Preprocessing of the Index block Problem: How to separate index blocks? The only way to separate them is to use the fact: titles are in bold text The bold annotation is lost in pdf to text conversion

Preprocessing of the Index block Solution: pdf doc html The bold annotations are preserved Use regular expression to search for the tag Took care of broken lines for line breaks

Preprocessing of the index block Data structure to present the processed index data

POS tagger Noun extractor DB of the text indexed with person, location, dates and topics Topic extractor SaariStory PDF Search result annotated with the topics QUERY Preprocessing text Graphical User Interface for querying Tokeni zer

Tokenizer and POS tagger Need to process data from each data block We use openNLP for this purpose Open source Easy to use Pre-trained for German

Noun Extraction Straightforward to get from POS tagged text ~6000 nouns from 1490 data blocks Problem: 22 minutes for only 6000 nouns

Noun Extraction Solution: Run the POS tagger concurrently over data blocks Feed only the tokens which could be nouns we use the optimized version to have better quality of nouns MethodTime (seconds) Speed up Sequential Concurrent x Concurrent with optimization x

POS tagger Noun extractor DB of the text indexed with person, location, dates and topics Topic extractor SaariStory PDF Search result annotated with the topics QUERY Preprocessing text Graphical User Interface for querying Tokeni zer

Topic extraction We use Latent Dirichlet allocation (LDA) Proposed by Biel et al. in 2002 Assumption: Each document is a mixture of small number of topics Needed to set two parameters: Used trial and error to get meaningful topics

Topic extraction Using LDA, try 1: Use blindly Some very frequent no so meaningful words E.g. Saarbrücken They are in every topic and every data block Every topic is assigned to every data block

Topic extraction Using LDA, try 2: Remove some nouns We remove the 15 most frequent nouns Covered 24% of all nouns Problem: some of these nouns may be relevant to some documents

Topic extraction Using LDA, try 3: Only keep essential nouns Calculate tf-idf score for word w in a data block d tf w,d = #of occurrences of w in d idf w = tf-idf w,d = tf w,d x idf w If tf-idf w,d < 3.0 we remove w from d Use LDA on the modified data block

Topic extraction We use LDA on the modified data blocks after using tf-idf score LDA identifies 7 meaningful topics with 10 words each Bekanntmachung, Besitz, Finanzen, Vereinbarungen, Familie, Schulden, Recht

POS tagger Noun extractor DB of the text indexed with person, location, dates and topics Topic extractor SaariStory PDF Search result annotated with the topics QUERY Preprocessing text Graphical User Interface for querying Tokeni zer

Putting the data into database Type of queries we want to answer Enabling keyword based search Answering complex queries Providing topic based search Showing temporal changes in the number of results for a time independent query

Database schema SELECT d.*, GROUP_CONCAT(DISTINCT t.topic_name) AS topic_names FROM (`data` AS d LEFT OUTER JOIN `data_topics` AS dt ON d.id = dt.data_id) LEFT OUTER JOIN `topics` AS t ON dt.topic_id = t.id WHERE d.startDate >= ' ' AND d.endDate <= ' ' AND dt.topic_id = 1 OR dt.topic_id = 6 AND d.data_block LIKE '%Keyword%' GROUP BY d.id ORDER BY d.startDate

Database settings We use MySQL database First we used a web based sql provider Not always live Too slow in times of high load We set up a local database Database# Rows filledTime to fill the DB (seconds) Average time for query (seconds) Remote25,2744,80010 Local25,

POS tagger Noun extractor DB of the text indexed with person, location, dates and topics Topic extractor SaariStory PDF Search result annotated with the topics QUERY Preprocessing text Graphical User Interface for querying Tokeni zer

Graphical user interface

Evaluation Precision and recall is 100% for keyword based random queries We compared manual results with results from our system The topic detection works! The following text is labeled as “Familie” and “Finanzen”

Future work Improving our pre-processing step to include more corner cases Extract and save footnotes from the text Possibility to add more data blocks to the database Taking care of line breaks in data blocks

POS tagger Noun extractor DB of the text indexed with person, location, dates and topics Topic extractor SaariStory PDF Search result annotated with the topics QUERY Preprocessing text Graphical User Interface for querying Tokeni zer

Back up slides

SaariStory: Database design User name locationdateNAME_ID PeterSaarbruecken6011 PaulSaarbruecken6992 DATA_IDdateData block from text TOPIC_IDtopic 1CRIME 2PUNISHMENT NAME_IDDATA_ID DATA_IDTOPIC_ID 12 23

Design challenge-1: stripping data from the pdf Text = data-block + index of places/names How to parse data from data blocks in pdf Convert the data into text using an online tool Parse it to get data blocks using regular expression How to parse data from the index The bold ones are the keywords Convert the index into html and then parse

Design challenge-2: Extracting topics