1 Language Specific Crawler for Myanmar Web Pages Pann Yu Mon Management and Information System Engineering Department Nagaoka University of Technology,

Presentation transcript:

1 Language Specific Crawler for Myanmar Web Pages
Pann Yu Mon
Management and Information System Engineering Department, Nagaoka University of Technology, Japan
10th July 2008

2 Outline
1. Introduction
2. Design of Crawler
3. Evaluation
4. Conclusions
5. Limitations
6. Themes for Doctoral Study

3 Introduction
- Internet users are 0.1% of the population.
- Little Myanmar-language content is found on the Web.
- No search engine is available for the Myanmar language.
Country | Population | # of Internet users | Internet users (%)
Myanmar (.mm) | 52,373,958 | 63, |

4 Challenges for Language Specific Crawler (LSC) for Myanmar
- Multiple encodings are used.
- Myanmar pages are sparsely scattered over the entire Web.
- The crawler must collect as many pages as possible with limited time and computing resources.
(Diagram: Myanmar pages scattered among non-Myanmar pages.)

5 Language Specific Search Engine: Basic Architecture
(Architecture diagram. Components: WWW, language specific crawler (crawler and language identification), page repository, parser, indexer, corpus/lexicon, query engine, ranking engine. Inputs and outputs: query, results.)

6 Objectives
- To propose a Language Specific Crawler (LSC) that collects as many Web pages written in the target language as possible, independent of domain.
- To efficiently collect Myanmar Web pages, which can then be indexed, sorted, and finally used in a search engine.

7 2. Design of Crawler (cont.)
Challenges:
- Multiple encodings are used.
- Myanmar pages are sparsely scattered over the entire Web.
- The crawler must collect as many pages as possible with limited time and computing resources.
Design of Crawler:
- Automatic Language Identification (LI) that can handle multiple encodings
- Language-based tracing of links
- Choice of seed URLs
- Multi-threaded crawling
- Robots.txt exclusion
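The slides do not show how robots exclusion was implemented. As a minimal sketch in Python, assuming the crawler has already downloaded a site's robots.txt, the standard library's `urllib.robotparser` can decide whether a URL may be fetched. The helper name `make_robot_checker`, the user-agent string, and the sample rules are illustrative assumptions, not part of the original crawler.

```python
from urllib.robotparser import RobotFileParser


def make_robot_checker(robots_txt: str, user_agent: str = "lsc-sketch"):
    """Return a function that says whether a URL may be fetched.

    `robots_txt` is the already-downloaded text of a site's robots.txt;
    the user-agent name is a made-up placeholder.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Hypothetical rules: everything under /private/ is off limits.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = make_robot_checker(SAMPLE_ROBOTS)
print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/page.html"))
```

A crawler would call `allowed(url)` before every download and skip the URL when it returns `False`.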

8 Crawling Process
1. Extract URLs
2. Language identification
3. Save into database
(Diagram: URLs are fetched from the World Wide Web ("Get URLs") and pages are passed to the language identifier.)
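The three steps can be sketched as a breadth-first loop. This is only one plausible reading of the process, in Python: links are followed only from pages identified as the target language, up to a depth limit. The in-memory `TOY_WEB`, the `is_myanmar` check (a Unicode-block test standing in for the real language identifier, so it would miss non-Unicode encodings), and all function names are assumptions made for illustration.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("မြန်မာ", ["a", "b"]),
    "a": ("english page", ["c"]),
    "b": ("မြန်မာစာ", ["d"]),
    "c": ("english page", []),
    "d": ("မြန်မာ", []),
}


def is_myanmar(text: str) -> bool:
    """Placeholder identifier: any character in the Myanmar Unicode block."""
    return any("\u1000" <= ch <= "\u109f" for ch in text)


def crawl(seeds, fetch, is_target, max_depth):
    """Breadth-first crawl that saves and expands only target-language pages."""
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)
    saved = {}
    while queue:
        url, depth = queue.popleft()
        text, links = fetch(url)          # step 1: get the page and its URLs
        if not is_target(text):           # step 2: language identification
            continue                      # neither saved nor followed
        saved[url] = text                 # step 3: save (a dict stands in for the database)
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return saved


saved = crawl(["seed"], TOY_WEB.__getitem__, is_myanmar, max_depth=6)
print(sorted(saved))
```

In this toy run page `c` is never fetched, because its only parent `a` was not identified as Myanmar.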

9 Multi-threaded Crawler
- A single crawling loop takes a large amount of time.
- Multi-threading can provide a reasonable speed-up and makes efficient use of the available bandwidth.
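Downloads are dominated by waiting on the network, which is why threads help. A minimal Python sketch using the standard `concurrent.futures.ThreadPoolExecutor` follows; the `fetch` function only simulates latency with `time.sleep`, and the worker count and URL names are arbitrary assumptions rather than the settings used in the experiment.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> str:
    """Stand-in for an HTTP download: sleeps to simulate network latency."""
    time.sleep(0.05)
    return f"<html>{url}</html>"


def fetch_all(urls, workers: int = 8):
    """Download all URLs concurrently and return a URL -> page mapping."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its page.
        return dict(zip(urls, pool.map(fetch, urls)))


urls = [f"u{i}" for i in range(10)]
pages = fetch_all(urls)
print(len(pages))
```

While one thread waits for a response, the others keep the connection busy, so total time approaches that of the slowest batch rather than the sum of all downloads.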

10 Language Identification (cont.)
G2LI is an n-gram based language identification algorithm for Web documents.
Advantages:
- Requires few computing resources.
- Needs only a small training set (5 to 20 KB of text is enough).
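The slides do not describe G2LI's internals, so the following Python sketch is not G2LI itself. It shows the general idea behind n-gram identification under simple assumptions: build a byte-trigram frequency profile per language from a small training text, then pick the language whose profile is most similar (cosine similarity) to the document's profile. Working on bytes rather than characters is a choice made here so that differently encoded text yields different profiles; the training strings are tiny placeholders.

```python
import math
from collections import Counter


def ngram_profile(data: bytes, n: int = 3) -> Counter:
    """Count every run of n consecutive bytes in the data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def identify(data: bytes, profiles: dict, n: int = 3) -> str:
    """Return the label of the training profile closest to the document."""
    doc = ngram_profile(data, n)
    return max(profiles, key=lambda label: cosine(doc, profiles[label]))


# Placeholder training texts; real profiles would use 5 to 20 KB per language.
profiles = {
    "myanmar-unicode": ngram_profile("မြန်မာဘာသာ မြန်မာစာ".encode("utf-8")),
    "english": ngram_profile(b"the quick brown fox jumps over the lazy dog"),
}

print(identify("မြန်မာ".encode("utf-8"), profiles))
print(identify(b"the dog", profiles))
```

To cover several Myanmar encodings, one profile per font encoding could be trained and all of them mapped to the same language label.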

11 Various Myanmar Fonts and Encodings
Font Name | Encoding Scheme
BIT | Partial Unicode
CE Classic | Graphic Encoding
Myanmar1 | Unicode
Myanmar2 | Unicode
MyaZedi | Partial Unicode
MyMyanmar | Partial Unicode
Popular | Graphic Encoding
Wininwa | Graphic Encoding
Zawgyi-One | Partial Unicode

12 Database Design (cont.)
- Save URLs in a CSV file.
- Save page contents in a Derby database.
Tables:
URL: ID, URL
CONTENT: ID, ParentURL, URL, Level, Content (xxx…)
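As a rough sketch of the two tables, the Python example below uses the standard `sqlite3` module with an in-memory database. This swaps in SQLite for the Derby database named on the slide and stores the URL list in a table rather than a CSV file, purely to keep the example self-contained; column types and the helper names are assumptions.

```python
import sqlite3


def create_store(conn: sqlite3.Connection) -> None:
    """Create the URL and CONTENT tables with the columns listed on the slide."""
    conn.executescript(
        """
        CREATE TABLE url (
            id  INTEGER PRIMARY KEY,
            url TEXT UNIQUE NOT NULL
        );
        CREATE TABLE content (
            id         INTEGER PRIMARY KEY,
            parent_url TEXT,
            url        TEXT NOT NULL,
            level      INTEGER NOT NULL,
            content    TEXT
        );
        """
    )


def save_page(conn, parent_url, url, level, content) -> None:
    """Record the URL once and store the page with its crawl depth."""
    conn.execute("INSERT OR IGNORE INTO url (url) VALUES (?)", (url,))
    conn.execute(
        "INSERT INTO content (parent_url, url, level, content) VALUES (?, ?, ?, ?)",
        (parent_url, url, level, content),
    )


conn = sqlite3.connect(":memory:")
create_store(conn)
save_page(conn, None, "http://example.com/", 0, "xxx")
save_page(conn, "http://example.com/", "http://example.com/a", 1, "xxx")
print(conn.execute("SELECT url, level FROM content ORDER BY level").fetchall())
```

Storing `parent_url` and `level` with each page makes it possible to reconstruct the crawl path and to enforce the depth limit.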

13 3. Evaluation
A) Evaluation of the Language Identification (G2LI)
B) Evaluation of crawling efficiency by means of precision and recall
C) Evaluation of the crawling coverage

14 A) Evaluation of Language Identifier
G2LI's Guessing | Verified: Myanmar | Verified: Non-Myanmar | Total
Identified as Myanmar | 763 (92%) [87%] | 37 (8%) | 800 (100%)
Identified as Non-Myanmar | 106 [13%] | |
Total | 869 [100%] | |

15 Accuracy Rate and Error Rate
Accuracy rate: ( )/2000 = 93%
Error rate: (37+106)/2000 = 7%
T = downloaded pages, X = relevant sites, Y = retrieved sites
(Diagram: sets X and Y inside T.)
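The two rates can be recomputed from the figures on the slides. In this Python sketch the count of correctly rejected non-Myanmar pages is not given in the transcript; it is derived here as an assumption from T = 2000 minus the three reported cells (763, 37, 106). The function name is illustrative.

```python
def accuracy_and_error(tp: int, fp: int, fn: int, total: int):
    """Return (true negatives, accuracy rate, error rate).

    tp: Myanmar pages identified as Myanmar
    fp: non-Myanmar pages identified as Myanmar
    fn: Myanmar pages identified as non-Myanmar
    The true-negative count is derived, not taken from the slides.
    """
    tn = total - tp - fp - fn
    accuracy = (tp + tn) / total
    error = (fp + fn) / total
    return tn, accuracy, error


tn, accuracy, error = accuracy_and_error(tp=763, fp=37, fn=106, total=2000)
print(tn, f"{accuracy:.2%}", f"{error:.2%}")
```

Under that assumption the rates round to the 93% and 7% shown on the slide.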

16 Misclassified Cases
1) Relevant but not retrieved:
- Bilingual pages written in both Myanmar and English.
- Web pages that use numeric character references, e.g. (&#4156, &#4153).
2) Retrieved but not relevant:
- The misclassified pages are all English Web pages.
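Pages written with numeric character references contain only ASCII in their raw bytes, which may be why an n-gram identifier misses them. One possible mitigation, offered as a suggestion rather than something the slides propose, is to decode the references before identification. The Python sketch below uses the standard `html.unescape`; the helper name is an assumption.

```python
import html


def decode_character_references(markup: str) -> str:
    """Turn references such as &#4156; into the characters they denote."""
    return html.unescape(markup)


sample = "&#4156;&#4153;"
decoded = decode_character_references(sample)

# Both decoded characters fall in the Myanmar Unicode block (U+1000 to U+109F).
print([f"U+{ord(ch):04X}" for ch in decoded])
```

After decoding, the text can be passed to the language identifier like any other Unicode page.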

17 B) Precision and Recall
Precision:
- The ability to retrieve top-ranked documents that are mostly relevant.
Recall:
- The ability of the search to find all of the relevant items in the entire Web space.
Where X = relevant documents and Y = retrieved documents.
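The formulas themselves are not preserved in the transcript. Assuming the standard definitions, precision = |X ∩ Y| / |Y| and recall = |X ∩ Y| / |X|, a small Python sketch with sets looks like this; the page identifiers are made up.

```python
def precision_recall(relevant: set, retrieved: set):
    """Standard set-based precision and recall.

    relevant  (X): pages that really are in the target language
    retrieved (Y): pages the crawler collected
    """
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


X = {"p1", "p2", "p3", "p4"}   # hypothetical relevant pages
Y = {"p3", "p4", "p5"}         # hypothetical retrieved pages
precision, recall = precision_recall(X, Y)
print(precision, recall)
```

Here two of the three retrieved pages are relevant, and two of the four relevant pages were found.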

18 How to estimate the total number of Web pages
X = the estimated total number of Myanmar pages on the Web
A = pages containing the first keyword
B = pages containing the second keyword
(Diagram: overlapping sets A and B.)

19 Total numbers of URLs returned by Google for each keyword
Keyword | Number of URLs
(Day) | 68,500
(But) | 41,000
(Human Being) | 117,000
(Now) | 31,500
(Myanmar) | 56,500
(He) | 46,600
Total | 361,100
Experiment period: 25th June 2008 to 27th June 2008.

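20 Estimated X
First keyword | Second keyword | A | B | Both | Estimated X
Day | But | 68,500 | 45,200 | 13,700 | 205,000
Day | Human | 68,500 | 120,000 | 14,200 | 564,401
Day | Now | 68,500 | 35,300 | 11,800 | 182,860
: | : | : | : | : | :
Now | He | 31,500 | 46,600 | 10,000 | 140,805
Myanmar | He | 56,500 | 46,600 | 11,200 | 225,496
Total | | | | | 4,905,169
Average of 15 keyword pairs | | | | | 327,011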
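The estimation formula is not preserved in the transcript. A reasonable assumption is the capture-recapture style estimate X = A × B / (pages containing both keywords). With the per-keyword counts reported for Day (68,500) and But (41,000) and the overlap of 13,700, this reproduces the 205,000 in the first row; the same holds for Day and Now with 31,500. The Python sketch below checks this and the reported average; the function name is illustrative.

```python
def estimate_total(a: int, b: int, both: int) -> float:
    """Estimate population size from two samples and their overlap.

    a, b: pages returned for the first and second keyword
    both: pages returned for the two keywords together
    Assumed formula: X = a * b / both.
    """
    return a * b / both


day_but = estimate_total(68_500, 41_000, 13_700)
day_now = estimate_total(68_500, 31_500, 11_800)

# Average over the 15 keyword pairs, using the total reported on the slide.
average = 4_905_169 / 15

print(round(day_but), round(day_now), round(average))
```

The intuition is that if the two keywords occur independently, the share of A-pages that also contain B approximates the share of all Myanmar pages that contain B.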

21 Precision and Recall of crawling Entertainment site case

22 Precision and Recall of crawling Blog site case

23 Precision and Recall of crawling News site case

24 C) Crawling Coverage
Crawling parameters:
- Seed URLs: 35
- Level of depth: 6
- Crawling time: 2 weeks
- CPU: 2.40 GHz
- Memory: 1 GB
- Internet connection: 100 Mbit per second
Domain | Number of pages collected
.mm | 3,555 [1.1%]
.com | 276,554 [83.2%]
Other gTLDs | 52,245 [15.7%]
Total | 332,354 [100.0%]

25 Distribution of estimated total number of Myanmar pages
Estimated average: 327,011
Collected: 332,354

26 4. Conclusion
- The proposed crawler design proved to work as an LSC for the Myanmar language.
- The LSC can download Myanmar pages on the Web at a satisfactory level.
- The proposed LSC can be used as part of a Myanmar search engine.

27 5. Limitations of LSC
- How to reach isolated Myanmar pages (choice of seed URLs, etc.)
- Misidentification by the Language Identifier (in particular, bilingual pages in English and Myanmar need to be collected)
- The speed of the LSC needs to be improved

28 6. Themes for doctoral study
1. Lexicon
2. Indexing
3. Code conversion (Transcoding)
4. Stop words removal
5. Stemming algorithm

29 Language Specific Search Engine: Basic Architecture
(Architecture diagram, repeated. Components: WWW, language specific crawler (crawler and language identification), page repository, parser, indexer, corpus/lexicon, query engine, ranking engine. Inputs and outputs: query, results.)

30 1. Lexicon
- "Lexicon" is also used as a synonym for a dictionary or encyclopedic dictionary.
- In linguistics, the lexicon of a language is its vocabulary, including its words and expressions.
(Diagram labels: daily newspaper, Web pages, URLs, dictionary, lexicon.)

31 2. Indexing
Indexing is the process of recording, for each keyword, which documents of a corpus contain it.
(Diagram: the indexer reads Page 1 to Page N from the database and stores, for each lexicon entry ID, the Web pages in which it occurs.)
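A minimal Python sketch of such an index maps each lexicon word to the set of page IDs that contain it. The toy pages, the lexicon, and the whitespace tokenizer are assumptions; Myanmar text is not separated by spaces, so a real indexer would need a lexicon-driven segmentation step in place of `split()`.

```python
from collections import defaultdict


def build_index(pages: dict, lexicon: set) -> dict:
    """Map every lexicon word to the IDs of the pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.split():      # placeholder tokenizer
            if word in lexicon:
                index[word].add(page_id)
    return index


pages = {
    1: "myanmar web crawler",
    2: "web search engine",
    3: "myanmar search",
}
lexicon = {"myanmar", "web", "search"}

index = build_index(pages, lexicon)
print({word: sorted(ids) for word, ids in sorted(index.items())})
```

A query for a keyword then becomes a dictionary lookup instead of a scan over every stored page.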

32 3. Code Conversion
(Diagram labels: lexicon encoded in Unicode; Web page contents in Unicode or non-Unicode; transcoding; client; server.)

33 4. Stop Words Removal
- Stop words are defined as non-information-bearing words.
- Myanmar sentences can be tokenized by eliminating stop words.
Example: computer (N), students (N), useful (Adj)

34 Stop-words list: English vs. Myanmar
1. Subject personal pronouns
   English: I, you, he, she, it, we, you, they
   Myanmar: uRefawmf? uRefr? ig? usKyf? uREkfyf? usaemf?
2. Object personal pronouns
3. Reflexive personal pronouns
4. Relative pronouns
5. Possessive pronouns and adjectives
6. Indefinite pronouns and adjectives
7. Demonstrative pronouns and adjectives
8. Conjunctions
9. Questions
10. Other (pronouns, prepositions)
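Stop-word removal amounts to filtering tokens against a fixed list. The Python sketch below uses the English subject pronouns from the first category as a stand-in stop list; the sample sentence and function name are assumptions, and a Myanmar version would use the corresponding Myanmar word list.

```python
# English subject personal pronouns from category 1 of the list.
STOP_WORDS = {"i", "you", "he", "she", "it", "we", "they"}


def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


tokens = ["We", "crawl", "it", "daily"]
print(remove_stop_words(tokens))
```

Only the information-bearing tokens remain and would be passed on to the indexer.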

35 5. Stemming
- A stemming algorithm is a conflation procedure: it reduces all words with the same root to a single root.
- A stem is the portion of a word that is left after the removal of its affixes (i.e., prefixes and suffixes).
- e.g., connect is the stem for the variants connected, connecting, and connections
- e.g., is the stem for the variants, and
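The English example can be reproduced with a naive suffix stripper. This Python sketch is not the Porter algorithm or any published stemmer; the suffix list and the minimum stem length are assumptions chosen only to cover the "connect" example.

```python
# Longest suffixes first, so "ions" is tried before "ion" and "s".
SUFFIXES = ("ions", "ing", "ion", "ed", "s")


def stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for variant in ("connected", "connecting", "connections", "connect"):
    print(variant, "->", stem(variant))
```

All four variants map to the same stem, so a query for any of them would match pages containing the others.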

36 Thank you! Any questions?