IR Project. 黃楹芸, 孫怡明 (student IDs 90522017, 90522026)

Presentation transcript:

1 IR Project 黃楹芸 孫怡明

2 Reference Collections
The TREC Collection
- Built under the TIPSTER program
- Documents from all sub-collections are tagged with SGML to allow easy parsing.
- FBIS (Foreign Broadcast Information Service)
  - Size: 470 MB
  - Number of documents: 130,471
  - Words per document (median): 322
  - Words per document (mean): 543.6

3 Document Parsing: sample document
<DOC>
FBIS3-50
"cr "
<HEADER>
23 March 1994
Article Type:FBIS
Document Type:FOREIGN MEDIA NOTE--FB PN
JAPAN
JAPAN: SPOTLIGHT ON JAPAN ASSOCIATION OF DEFENSE INDUSTRY
</HEADER>
<TEXT>
The Japan Association of Defense Industry (JADI), existing in its present form since 1988 and tracing its origin back to 1951, is an industry association under the supervision of the Ministry of International Trade and Industry (MITI) and the Japan Defense Agency (JDA). JADI promotes the development of Japanese defense technology and equipment, monitors foreign technology, lobbies on behalf of its corporate members for government defense spending, and cooperates with the government on export controls.
(AUTHOR: MERCADO. QUESTIONS AND/OR COMMENTS, PLEASE CALL CHIEF,
</TEXT>
</DOC>
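The sample document above can be split into a document ID and a body with a few regular expressions. This is a minimal sketch, not the project's actual parser: it assumes the ID sits inside a `<DOCNO>` element, as in the distributed TREC files (that tag is not visible in the flattened slide sample), and the function name `parse_trec_docs` is hypothetical.

```python
import re

# Non-greedy patterns; DOTALL lets "." span the newlines inside each element.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)  # assumed tag
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def parse_trec_docs(raw):
    """Yield (doc_id, text) pairs from a string holding one or more <DOC> blocks."""
    for doc_match in DOC_RE.finditer(raw):
        body = doc_match.group(1)
        docno = DOCNO_RE.search(body)
        text = TEXT_RE.search(body)
        if docno is None or text is None:
            continue  # skip malformed records rather than guessing
        # Collapse runs of whitespace so later tokenization sees clean text.
        yield docno.group(1), " ".join(text.group(1).split())


# Shortened, hand-made sample in the same shape as the slide's document.
sample = """<DOC>
<DOCNO> FBIS3-50 </DOCNO>
<HEADER> 23 March 1994 </HEADER>
<TEXT>
The Japan Association of Defense Industry (JADI) promotes
the development of Japanese defense technology.
</TEXT>
</DOC>"""

docs = list(parse_trec_docs(sample))
```

A regex scan is enough here because the collection's SGML is flat and regular; a deeply nested format would call for a real parser.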

4 Document Parsing
Process each document:
- Extract the document ID
- Segment the text into tokens
  - In our case, separate the text by white-spaces and newlines
- Case conversion (make all tokens lowercase)
- Discard stopwords and other non-content words (e.g. numbers)
- Word stemming
- Count term frequencies, record positions
- Update indices
Write out the index to file in alphabetical order, from a to z
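The parsing steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the project's code (which was written in Java): the stopword list is a tiny made-up one, `naive_stem` is a placeholder suffix stripper rather than whichever stemmer the authors used, trimming surrounding punctuation is an added step the slide does not mention, and the output line format is hypothetical.

```python
import io
from collections import defaultdict

# Tiny illustrative stopword list; the slides do not say which list was used.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def naive_stem(token):
    """Placeholder stemmer: strips a few common suffixes (not Porter)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_document(doc_id, text, index):
    """Add one document to index, shaped as term -> {doc_id: [positions]}."""
    for position, raw in enumerate(text.split()):  # split on spaces and newlines
        token = raw.lower().strip(".,;:()\"'")      # case conversion, trim punctuation
        if not token or token in STOPWORDS or not token.isalpha():
            continue                                # drop stopwords, numbers, etc.
        index[naive_stem(token)][doc_id].append(position)


def write_index(index, out):
    """Write one line per term to a text stream, terms in alphabetical order."""
    for term in sorted(index):
        postings = " ".join(
            f"{doc}:{len(pos)}:{','.join(map(str, pos))}"  # doc:term_freq:positions
            for doc, pos in sorted(index[term].items())
        )
        out.write(f"{term}\t{postings}\n")


index = defaultdict(lambda: defaultdict(list))
index_document(
    "FBIS3-50",
    "JADI promotes the development of defense technology in 1994",
    index,
)

buffer = io.StringIO()  # stands in for the index file
write_index(index, buffer)
lines = buffer.getvalue().splitlines()
```

The term frequency is simply the length of each position list, so both values come from a single pass over the tokens.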

5 Project Introduction
Platform
- a. CPU: Celeron 450 MHz
- b. RAM: 256 MB
- c. Operating system: Win 2000 Server
- d. Implementation: Java + JDBC
- e. Data storage: SQL Server 2000
Indexing method used
- Inverted indexing
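One way an inverted index can live in a relational database is a postings table keyed by term and document. The slides do not show the project's schema, so the table layout, column names, and sample rows below are all hypothetical, and an in-memory SQLite database stands in for SQL Server 2000 so the sketch runs anywhere.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server 2000; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE postings (
           term      TEXT    NOT NULL,
           doc_id    TEXT    NOT NULL,
           term_freq INTEGER NOT NULL,
           positions TEXT    NOT NULL,   -- comma-separated token offsets
           PRIMARY KEY (term, doc_id)
       )"""
)

# Made-up sample rows, for illustration only.
rows = [
    ("defense", "FBIS3-50", 2, "5,17"),
    ("defense", "FBIS3-51", 1, "3"),
    ("mobilize", "FBIS3-51", 1, "9"),
]
conn.executemany("INSERT INTO postings VALUES (?, ?, ?, ?)", rows)
conn.commit()


def search(term):
    """Return (doc_id, term_freq) for each document containing term, most frequent first."""
    cursor = conn.execute(
        "SELECT doc_id, term_freq FROM postings WHERE term = ? "
        "ORDER BY term_freq DESC, doc_id",
        (term.lower(),),
    )
    return cursor.fetchall()


hits = search("Defense")
```

Putting `term` first in the primary key means a lookup by term can use the key's index instead of scanning the whole table, which matters once the postings table grows to the size of a full collection.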

6 System Architecture

7 Implementation
Our User Interface
Indexing Time
- 120 to 140 sec per file
- Total: ~16 hours
Searching Time
- “Information” records: ~15 sec
- “mobilize” (866 records): ~3 sec
Index File
- 850 MB