Design a full-text search engine for a website based on Lucene


Design a full-text search engine for a website based on Lucene
Chinese Students and Scholars Association
Presented by: Lijia Li, Yingyu Wu, Xiao Zhu

Outline: Introduction; Our goal; System architecture; Conclusion and future work; Demo

Introduction: As networks have developed, the amount of information on the Internet has grown explosively, which makes it harder to find the information one is looking for. Search engines have brought great convenience to people looking for information, and the Internet has become an indispensable tool.

Our goal: In this project, our goal is to implement a full-text retrieval engine based on Lucene.

Full-text retrieval engine: A full-text search engine indexes and searches on the basis of the entire text of each document.
Features:
(1) An unstructured index-file database: the index files can store data files of different formats.
(2) Flexible retrieval methods: string retrieval and several other retrieval methods are supported.
(3) Support for natural-language retrieval.
(4) Retrieval efficiency: a search consults only the index and does not need to scan entire documents.
Once the full-text indexing has been done, the result can be reused as many times as needed over a long period.

System Architecture: The search engine provides a search service to users. Our search engine has two main parts: online and offline.

[Figure: system architecture. Online part: Users, User Interface (enter keyword), Analyzer, Search module, Index File, Result sorting. Offline part: Crawler (request webpage from the website), Website database, Index module, Index File.]

Why Lucene?
(1) The index file format is independent of the application platform.
(2) It uses an inverted index.
(3) It has an object-oriented system architecture.
(4) Chinese analyzers are available (SmartChineseAnalyzer, IKAnalyzer).
(5) It implements a powerful set of query types (RangeQuery, FuzzyQuery, ...).
(6) It is open source.
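The inverted index mentioned on this slide is what lets a search consult the index instead of scanning every document. The sketch below shows the idea in plain Java. It is not Lucene's implementation: the class name `ToyInvertedIndex` and its simple lowercase, split-on-punctuation tokenizer are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of document ids that contain it.
// Illustration only; Lucene's real index format is far more elaborate.
class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Lowercase and split on anything that is not a letter or digit (a stand-in for an analyzer).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Record every term of the document under the document's id.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Return the ids of documents containing the term, in ascending order.
    List<Integer> lookup(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Lucene is a full-text search library");
        index.add(2, "A web crawler fetches pages");
        index.add(3, "Full-text search needs an index");
        System.out.println(index.lookup("search"));
    }
}
```

A lookup touches only the posting list of the queried term, so its cost does not depend on the total length of the indexed documents.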

Web Crawler
[Figure: architecture of web crawler. Components: collection of start URLs, URL analysis, get robots.txt, analysis of robots.txt, unprocessed URL queue, page fetch module (Internet), page analysis module (extract links), page database.]
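The crawler architecture includes a robots.txt analysis step. The slides do not say how it is implemented, so the following is only a simplified sketch: the hypothetical class `ToyRobotsRules` honours just the `Disallow` lines of the `User-agent: *` group and treats each rule as a plain path prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified robots.txt handling: only "Disallow" lines in the "User-agent: *" group are honoured.
// Real parsers also handle Allow rules, wildcards, and groups that name several agents.
class ToyRobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    ToyRobotsRules(String robotsTxt) {
        boolean inStarGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.trim();
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash).trim();   // strip trailing comments
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                                // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inStarGroup = value.equals("*");
            } else if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyRobotsRules rules = new ToyRobotsRules("User-agent: *\nDisallow: /admin/\n");
        System.out.println(rules.isAllowed("/admin/login"));
        System.out.println(rules.isAllowed("/news/index.html"));
    }
}
```

A crawler would run such a check on each URL before handing it to the page fetch module.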

Work flow of web crawler:
1. Put the initial URLs into the unprocessed URL queue.
2. Get a URL from the head of the queue.
3. Download the page at that URL.
4. Extract the hyperlinks from the downloaded page.
5. Add the extracted hyperlinks to the unprocessed URL queue.
6. Check whether the unprocessed URL queue is empty. If it is, the program terminates; otherwise step 2 is executed again.
7. Loop.
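The work flow above can be sketched as a queue-driven loop. To stay runnable without network access, this sketch replaces page download and link extraction with an in-memory map from URL to outgoing links. The class `ToyCrawler`, the map, and the `seen` set are assumptions; the set of already queued URLs is not in the slides and is added so that cyclic links cannot keep the queue from emptying.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Queue-driven crawl loop over an in-memory "web" (URL -> outgoing links).
// The map stands in for real page downloads and hyperlink extraction.
class ToyCrawler {
    static List<String> crawl(List<String> startUrls, Map<String, List<String>> web) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();               // assumption: never queue a URL twice
        for (String url : startUrls) {                    // step 1: seed the queue
            if (seen.add(url)) {
                queue.add(url);
            }
        }
        List<String> fetched = new ArrayList<>();
        while (!queue.isEmpty()) {                        // step 6: stop when the queue is empty
            String url = queue.poll();                    // step 2: take the head of the queue
            List<String> links = web.get(url);            // steps 3-4: "download" and extract links
            if (links == null) {
                continue;                                 // page does not exist in this toy web
            }
            fetched.add(url);
            for (String link : links) {                   // step 5: enqueue newly found links
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("/", Arrays.asList("/news", "/about"));
        web.put("/news", Arrays.asList("/", "/news/1"));
        web.put("/about", Collections.<String>emptyList());
        web.put("/news/1", Arrays.asList("/missing"));
        System.out.println(crawl(Arrays.asList("/"), web));
    }
}
```

Because the queue is first-in, first-out, pages are visited breadth-first from the start URLs.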

Index work flow
[Figure: flow chart of indexing. Boxes: a set of documents to be indexed; Whether indexed? (yes / no); Date of index earlier than the creation date? (yes / no); Determine the type of document; Whether a parser for the same type exists; Call the corresponding document parser to parse the document; Read and analyze the document; Parse document; Build index file.]
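Two decisions in this flow chart lend themselves to a short sketch: whether a document needs to be (re)indexed, and which parser handles its type. The class `ToyIndexPlanner`, the parser names, and the choice of the file extension as the "type" are hypothetical and not taken from the project.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch of the two decisions in the indexing flow chart.
class ToyIndexPlanner {
    // Hypothetical registry: file extension -> name of the parser that handles it.
    private static final Map<String, String> PARSERS = new HashMap<>();
    static {
        PARSERS.put("html", "HtmlParser");
        PARSERS.put("txt", "PlainTextParser");
        PARSERS.put("pdf", "PdfParser");
    }

    // Index a document if it was never indexed (null), or if its index entry is older than the document.
    static boolean needsIndexing(Long indexedAtMillis, long documentTimeMillis) {
        return indexedAtMillis == null || indexedAtMillis < documentTimeMillis;
    }

    // Return the parser name registered for the file's extension, or null if there is none.
    static String parserFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null;
        }
        return PARSERS.get(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(needsIndexing(null, 1000L));
        System.out.println(parserFor("index.html"));
    }
}
```

A document with no registered parser would simply be skipped in this sketch; the slides do not say what the project does in that case.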

Document indexing steps:
1. Create an IndexWriter instance: IndexWriter writer = new IndexWriter(indexPath, analyzer, boolean, maxFieldLength);
2. Create a Document record: Document doc = new Document();
3. Add Field objects to the Document record: doc.add(new Field(string, tokenstream));
4. Write the Document record to the index: writer.addDocument(doc);
5. Close the IndexWriter object to end indexing: writer.close();

Flow chart of searching:
Start
1. Accept the search string from the user.
2. QueryParser analyzes the search string and outputs a Query object.
3. Set up the Searcher.
4. The IndexSearcher object searches the index file for related documents.
5. Output the related documents.
End
Example: user input “大连理工 计算机” (Dalian University of Technology, computer) or “america ohio”. After QueryParser: “大连理工” AND “计算机”, “america” AND “ohio”.
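The example shows each whitespace-separated keyword becoming a required term. A minimal sketch of that behaviour in plain Java follows. It does not call Lucene: the class `ToyAndSearch` and the hand-built posting sets are assumptions. Note that Lucene's classic QueryParser combines terms with OR unless its default operator is set to AND, so the project presumably configured it that way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Turns "america ohio" into "america" AND "ohio" and evaluates it by intersecting posting sets.
class ToyAndSearch {
    // Split the user input on whitespace and lowercase each keyword.
    static List<String> parseTerms(String input) {
        List<String> terms = new ArrayList<>();
        for (String t : input.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Render the parsed terms the way the slide writes the query.
    static String toQueryString(List<String> terms) {
        StringBuilder sb = new StringBuilder();
        for (String t : terms) {
            if (sb.length() > 0) {
                sb.append(" AND ");
            }
            sb.append('"').append(t).append('"');
        }
        return sb.toString();
    }

    // Return ids of documents that contain every term (empty if any term is unknown).
    static List<Integer> search(Map<String, TreeSet<Integer>> index, String input) {
        TreeSet<Integer> result = null;
        for (String term : parseTerms(input)) {
            TreeSet<Integer> ids = index.get(term);
            if (ids == null) {
                return new ArrayList<>();
            }
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        if (result == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        Map<String, TreeSet<Integer>> index = new HashMap<>();
        index.put("america", new TreeSet<>(Arrays.asList(1, 2, 4)));
        index.put("ohio", new TreeSet<>(Arrays.asList(2, 3, 4)));
        System.out.println(toQueryString(parseTerms("america ohio")));
        System.out.println(search(index, "america ohio"));
    }
}
```

Chinese input such as “大连理工 计算机” would additionally need a Chinese analyzer to segment each keyword, which this sketch leaves out.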

Highlight search keywords:
1. Get the position of the search keyword in the text.
2. Get the text fragment around the keyword according to that position.
3. Use HTML and CSS attributes to highlight the search keyword.
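These three steps can be sketched in a few lines of plain Java. The class `ToyHighlighter`, the `context` parameter (characters kept on each side of the match), and the CSS class name `highlight` are assumptions for illustration; the project may well have used a different approach, such as Lucene's own highlighter.

```java
// Finds the first occurrence of a keyword, cuts a fragment around it, and wraps the match in a span.
class ToyHighlighter {
    // Escape the characters that would otherwise be read as HTML markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Step 1: position of the keyword, ignoring case; -1 if it does not occur.
    static int indexOfIgnoreCase(String text, String keyword) {
        for (int i = 0; i + keyword.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, keyword, 0, keyword.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns an HTML fragment with the keyword highlighted, or null if the keyword is absent.
    static String highlight(String text, String keyword, int context) {
        if (keyword.isEmpty()) {
            return null;
        }
        int pos = indexOfIgnoreCase(text, keyword);
        if (pos < 0) {
            return null;
        }
        int matchEnd = pos + keyword.length();
        int start = Math.max(0, pos - context);               // step 2: fragment bounds
        int end = Math.min(text.length(), matchEnd + context);
        return escape(text.substring(start, pos))             // step 3: HTML markup styled by CSS
                + "<span class=\"highlight\">" + escape(text.substring(pos, matchEnd)) + "</span>"
                + escape(text.substring(matchEnd, end));
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene is a full-text search library", "SEARCH", 5));
    }
}
```

A style rule for the assumed class, for example a yellow background on `.highlight`, would then make the keyword stand out on the results page.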

Conclusion and future work: Through this project we learned how to use a web crawler and Lucene to implement a full-text search engine. Future work: working on Hadoop. Thank you!