Processing of large document collections Part 1 (Introduction) Helena Ahonen-Myka Spring 2006.

Slides:



Advertisements
Similar presentations
Understanding American Citizenship
Advertisements

Metadata in Carrot II Current metadata –TF.IDF for both documents and collections –Full-text index –Metadata are transferred between different nodes Potential.
WEB MINING. Why IR ? Research & Fun
Processing of large document collections Part 8 (Information extraction) Helena Ahonen-Myka Spring 2005.
SEARCHING QUESTION AND ANSWER ARCHIVES Dr. Jiwoon Jeon Presented by CHARANYA VENKATESH KUMAR.
Processing of large document collections Part 6 (Text summarization: discourse- based approaches) Helena Ahonen-Myka Spring 2006.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Information Retrieval in Practice
Search Engines and Information Retrieval
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
SLIDE 1IS 202 – FALL 2004 Lecture 13: Midterm Review Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am -
ITCS 6010 Natural Language Understanding. Natural Language Processing What is it? Studies the problems inherent in the processing and manipulation of.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dörre, Peter Gerstl, and Roland Seiffert Presented By: Jake Happs,
1 Today  Tools (Yves)  Efficient Web Browsing on Hand Held Devices (Shrenik)  Web Page Summarization using Click- through Data (Kathy)  On the Summarization.
Information Retrieval
Citances and What should our UI look like? Marti Hearst SIMS, UC Berkeley Supported by NSF DBI and a gift from Genentech.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
Overview of Search Engines
Processing of large document collections Part 1 (Introduction, text representation, text categorization) Helena Ahonen-Myka Spring 2005.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 1 21 July 2005.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Information extraction from text Spring 2003, Part 1 Helena Ahonen-Myka.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
Processing of large document collections Part 3 (Evaluation of text classifiers, applications of text categorization) Helena Ahonen-Myka Spring 2005.
Search Engines and Information Retrieval Chapter 1.
Processing of large document collections Part 10 (Information extraction: multilingual IE, IE from web, IE from semi-structured data) Helena Ahonen-Myka.
Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.
CSC1401: Introductory Programming Steve Cooper
AELDP ACADEMIC READING. Questions Do you have any questions about academic reading?
Statistics are ubiquitous “Statistics are generated today about nearly every activity on the planet. Never before have we had so much statistical information.
Using Text Mining and Natural Language Processing for Health Care Claims Processing Cihan ÜNAL
Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Processing of large document collections Part 7 (Text summarization: multi- document summarization, knowledge- rich approaches, current topics) Helena.
Subject (Exam) Review WSTA 2015 Trevor Cohn. Exam Structure Worth 50 marks Parts: – A: short answer [14] – B: method questions [18] – C: algorithm questions.
©2003 Paula Matuszek CSC 9010: Text Mining Applications Document Summarization Dr. Paula Matuszek (610)
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Text Based Information Retrieval Text Based Information Retrieval H02C8A H02C8B Marie-Francine Moens Karl Gyllstrom Katholieke Universiteit Leuven.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
STATISTICS Topic A: Understanding Distributions. What is Statistics all about? Statistics is about using data to answer questions. In this module, the.
Processing of large document collections Part 6 (Text summarization: discourse- based approaches) Helena Ahonen-Myka Spring 2005.
Artificial Intelligence Research Center Pereslavl-Zalessky, Russia Program Systems Institute, RAS.
LOGO A comparison of two web-based document management systems ShaoxinYu Columbia University March 31, 2009.
Near Real-Time Verification At The Forecast Systems Laboratory: An Operational Perspective Michael P. Kay (CIRES/FSL/NOAA) Jennifer L. Mahoney (FSL/NOAA)
Information Retrieval
UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16.
Text Based Information Retrieval H02C8A Marie-Francine Moens Karl Gyllstrom Katholieke Universiteit Leuven Study points: 4 Language: English Periodicity:
HOW CAN WRITING BE INCLUDED EFFECTIVELY IN THE SOCIAL STUDIES CURRICULUM?
G042 - Lecture 09 Commencing Task A Mr C Johnston ICT Teacher
Web Advanced Learning Technologies WebALT EDC Mika Seppälä.
King Faisal University جامعة الملك فيصل Deanship of E-Learning and Distance Education عمادة التعلم الإلكتروني والتعليم عن بعد [ ] 1 جامعة الملك فيصل عمادة.
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
LB160 (Professional Communication Skills For Business Studies)
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
Information Retrieval in Practice
Slides Template for Module 3 Contextual details needed to make data meaningful to others CC BY-NC.
Academic year: 2017/18 – winter semester
Lecture 1: Introduction and the Boolean Model Information Retrieval
Proposal for Term Project
Text Based Information Retrieval
Welcome to [Course Title]
Based on Menu Information
Robust Semantics, Information Extraction, and Information Retrieval
Introduction to Information Retrieval
Statistical n-gram David ling.
Content Augmentation for Mixed-Mode News Broadcasts Mike Dowman
Information Retrieval and Web Design
Presentation transcript:

Processing of large document collections Part 1 (Introduction) Helena Ahonen-Myka Spring 2006

2 1. Introduction course organization introduction to the topic –applications –methods learning goals schedule

3 Organization of the course lectures (Helena Ahonen-Myka) –Tue 12-14, Thu B222 – (no lectures and 18.4.) exercise sessions (Miro Lehtonen) –Wed B222 – (no exercises 19.4.) exam: Thu 4.5. at points: exam 50 pts, exercises 10 pts –required: ~30 pts (= 1)

4 Course material slides on the course web page also other material available on the page –handouts used in the class (sample documents etc.) –original articles

5 Large document collections What is a document? –“a document records a message from people to people” (Wilkinson et al., Document Computing, 1998) each document has content, structure, and metadata (context) –in this course, we concentrate on content –particularly: textual content

6 Large document collections large? –some person may have written a document, but it is not possible later to process the document manually -> automatic processing is needed –large w.r.t to the capacity of a device (e.g. a mobile phone) collection? –documents somehow similar -> automatic processing is possible

7 Applications text categorization text summarization information extraction question answering text compression text indexing and retrieval machine translation …

8 Text categorization given a predefined set of categories and a set of documents label each document with one or more categories

9 Text summarization ”Process of distilling the most important information from a source to produce an abridged version for a particular user or task” (Mani & Maybury, Advances in Automatic Text Summarization, 1999)

10 Example A Spanish priest was charged here today with attempting to murder the Pope. Juan Fernandez Krohn, aged 32, was arrested after a man armed with a bayonet approached the Pope while he was saying prayers at Fatima on Wednesday night. According to the police, Fernandez told the investigators today that he trained for the past six months for the assault. He was alleged to have claimed the Pope ’looked furious’ on hearing the priest’s criticism of his handling of the church’s affairs. If found quilty, the Spaniard faces a prison sentence of years.

11 Example summary could be, e.g. –“A Spanish priest is charged after an unsuccessful murder attempt on the Pope” or a set of phrases: –a Spanish priest was charged –attempting to murder the Pope –he trained for the assault –Pope furious on hearing priest´s criticisms

12 Information extraction ”Information extraction involves the creation of a structured representation (such as a database) of selected information drawn from the text” (Grishman, Information Extraction: Techniques and Challenges, 1997)

13 Example: terrorist events 19 March - A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb - allegedly detonated by urban guerrilla commandos - blew up a power tower in the northwestern part of San Salvador at 0650 (1250 GMT).

14 Example: terrorist events Incident typebombing DateMarch 19 LocationEl Salvador: San Salvador (city) Perpetratorurban guerilla commandos Physical targetpower tower Human target- Effect on physical targetdestroyed Effect on human targetno injury or death Instrumentbomb

15 Example: terrorist events a document collection is given for each document, decide if the document is about terrorist event for each terrorist event, determine –type of attack –date –location, etc. = fill in a template (~database record)

16 Question answering systems the user asks a question in a natural language the question answering system finds answers from a document collection, e.g. from a collection of newspaper stories

17 Example question: –When did Chuck Yeager break the sonic barrier? a text fragment in the collection: –“For many, seeing Chuck Yeager – who made his historic supersonic flight Oct. 14, 1947 – was the highlight of this year’s show, in which…” answer: Oct. 14, 1947

18 Methods typically several methods (from several research fields) are combined in each application –statistics (or simply counting frequencies…) –machine learning –knowledge-based methods –linguistic methods –algorithmics

19 Learning goals learn to recognize components of applications/processes learn to recognize which (kind of) methods could be used in each component learn to implement some methods (meta)learn to control learning processes (What do I know? What should I know to solve this problem?)

20 Schedule –text representation, text categorization, term selection –text summarization –information extraction –question answering systems –closing