Published by Charla Barrett. Modified over 8 years ago.
1
1 Language Specific Crawler for Myanmar Web Pages
Pann Yu Mon
Management and Information System Engineering Department
Nagaoka University of Technology, Japan
10th July 2008
2
2 Outline
1. Introduction
2. Design of Crawler
3. Evaluation
4. Conclusions
5. Limitations
6. Themes for Doctoral Study
3
3 1. Introduction
- Internet users are about 0.1% of the population
- Little Myanmar-language content is found on the Web
- No search engine is available for the Myanmar language

Country         Population    Internet users   Internet users (%)
Myanmar (.mm)   52,373,958    63,700           0.123
4
4 Challenges for a Language Specific Crawler (LSC) for Myanmar
- Multiple encodings are in use
- Myanmar pages are sparsely scattered over the entire Web
- Collect as many pages as possible with limited time and computer resources
Diagram labels: Myanmar pages, non-Myanmar pages.
5
5 Language Specific Search Engine: Basic Architecture
Diagram components: WWW; language specific crawler (crawler, language identification); page repository; parser; indexer; corpus/lexicon; ranking engine; query engine; query; results.
6
6 Objectives
- To propose a Language Specific Crawler (LSC) that collects as many web pages written in the target language as possible, independent of domain.
- To efficiently collect Myanmar web pages, which can then be indexed and sorted and finally used in a search engine.
7
7 2. Design of Crawler (cont.)
Challenges
- Multiple encodings are in use
- Myanmar pages are sparsely scattered over the entire Web
- Collect as many pages as possible with limited time and computer resources
Design of Crawler
- Automatic Language Identification (LI) capable of handling multiple encodings
- Language-based tracing of links
- Choice of seed URLs
- Multi-threaded crawling
- Robots.txt exclusion
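The robots exclusion item can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch of the general technique, not the crawler described in the slides: the robots.txt content, the `lsc-bot` agent name and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from each site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, agent: str = "lsc-bot") -> bool:
    """Return True if the (hypothetical) agent may fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
print(allowed(parser, "http://example.com/index.html"))
print(allowed(parser, "http://example.com/private/a.html"))
```

A crawler would typically cache one parser per host and consult it before queuing each extracted link.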
8
8 Crawling Process
Diagram components: World Wide Web; Get URLs; Language Identifier.
1. Extract URLs
2. Language identification
3. Saving into database
9
9 Multi-threaded Crawler
- A single crawling loop takes a large amount of time.
- Multi-threading can provide a reasonable speed-up and efficient use of the available bandwidth.
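The benefit of multi-threading for a download-bound loop can be sketched with a thread pool. The `fetch` function below is a stand-in that sleeps instead of touching the network, and the URLs and worker count are hypothetical; the slides do not say how the actual crawler manages its threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> str:
    """Stand-in for a page download; sleeps to mimic network latency."""
    time.sleep(0.05)
    return f"<html>content of {url}</html>"


def crawl_batch(urls: list[str], workers: int = 8) -> dict[str, str]:
    """Download a batch of URLs concurrently and map each URL to its content."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))


urls = [f"http://example.com/page{i}" for i in range(16)]
start = time.perf_counter()
pages = crawl_batch(urls)
elapsed = time.perf_counter() - start
print(f"{len(pages)} pages in {elapsed:.2f}s")
```

Because the threads spend most of their time waiting on I/O, several downloads overlap and the batch finishes sooner than a sequential loop over the same URLs would.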
10
10 Language Identification (cont.)
G2LI is an n-gram based language identification algorithm for Web documents.
Advantages
- Requires few computing resources.
- Needs only a small training set (5 to 20 KB of text is enough).
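To make the n-gram idea concrete, here is a minimal sketch of byte-bigram language identification using cosine similarity between frequency profiles. It is not the G2LI algorithm itself, whose details are not given in the slides. The training strings are tiny hypothetical samples, far smaller than the 5 to 20 KB mentioned above.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 2) -> Counter:
    """Count byte n-grams of the UTF-8 encoded text."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def identify(text: str, profiles: dict[str, Counter]) -> str:
    """Return the language whose training profile is closest to the text."""
    sample = ngram_profile(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Hypothetical, very small training samples (assumptions for illustration only).
profiles = {
    "myanmar": ngram_profile("မြန်မာဘာသာစကား မြန်မာနိုင်ငံ"),
    "english": ngram_profile("the quick brown fox jumps over the lazy dog and this is english text"),
}

print(identify("မြန်မာ", profiles))
print(identify("english words", profiles))
```

Working on bytes rather than decoded characters means one profile can be trained per encoding, which is one plausible way to cope with pages that use several Myanmar encodings.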
11
11 Various Myanmar Fonts and Encodings
Font Name     Encoding Scheme
BIT           Partial Unicode
CE Classic    Graphic Encoding
Myanmar1      Unicode
Myanmar2      Unicode
MyaZedi       Partial Unicode
MyMyanmar     Partial Unicode
Popular       Graphic Encoding
Wininwa       Graphic Encoding
Zawgyi-One    Partial Unicode
12
12 Database Design (cont.)
- Save URLs in a CSV file
- Save page content in a Derby database

URL
ID   URL
1    http://www.google.com

CONTENT
ID   ParentURL                     URL                                   Level   Content
1    http://www.google.com         http://www.google.com                 0       xxx…
1    http://www.google.com         http://www.google.com/mail            1       xxx…
1    http://www.google.com/mail    http://www.google.com/mail/signout    2       xxx…
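The two tables can be sketched as a schema. The example below uses Python's built-in `sqlite3` as a stand-in for the database named on the slide, with the column names taken from the slide. The column types, and the reading of the CONTENT `ID` as a reference to the seed URL's ID, are assumptions.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE url (
        id  INTEGER PRIMARY KEY,
        url TEXT NOT NULL
    );
    CREATE TABLE content (
        id         INTEGER NOT NULL REFERENCES url(id),  -- assumed: seed URL id
        parent_url TEXT NOT NULL,
        url        TEXT NOT NULL,
        level      INTEGER NOT NULL,                     -- link depth from the seed
        content    TEXT
    );
    """
)

conn.execute("INSERT INTO url VALUES (1, 'http://www.google.com')")
rows = [
    (1, "http://www.google.com", "http://www.google.com", 0, "xxx"),
    (1, "http://www.google.com", "http://www.google.com/mail", 1, "xxx"),
    (1, "http://www.google.com/mail", "http://www.google.com/mail/signout", 2, "xxx"),
]
conn.executemany("INSERT INTO content VALUES (?, ?, ?, ?, ?)", rows)

page_count = conn.execute("SELECT COUNT(*) FROM content").fetchone()[0]
deepest = conn.execute("SELECT MAX(level) FROM content WHERE id = 1").fetchone()[0]
print(page_count, deepest)
```

Storing the level with each page makes it straightforward to stop following links once a depth limit is reached.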
13
13 3. Evaluation
A) Evaluation of the Language Identification (G2LI)
B) Evaluation of crawling efficiency by means of precision and recall
C) Evaluation of the crawling coverage
14
14 A) Evaluation of Language Identifier
G2LI's guess (rows) versus verified language (columns)

                            Myanmar            Non-Myanmar   Total
Identified as Myanmar       763 (92%) [87%]    37 (8%)       800 (100%)
Identified as Non-Myanmar   106 [13%]          1094          1200
Total                       869 [100%]         1131          2000
15
15 Accuracy Rate and Error Rate
Accuracy rate = (763 + 1094) / 2000 = 93%
Error rate = (37 + 106) / 2000 = 7%
Venn diagram labels: T = downloaded pages, X = relevant sites, Y = retrieved sites.
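The two rates can be recomputed directly from the counts reported for the language identifier. The variable names below are illustrative and do not come from the slides.

```python
# Counts from the language identifier evaluation.
identified_myanmar_correct = 763       # identified as Myanmar, verified Myanmar
identified_myanmar_wrong = 37          # identified as Myanmar, verified non-Myanmar
identified_other_wrong = 106           # identified as non-Myanmar, verified Myanmar
identified_other_correct = 1094        # identified as non-Myanmar, verified non-Myanmar

total = (
    identified_myanmar_correct
    + identified_myanmar_wrong
    + identified_other_wrong
    + identified_other_correct
)

accuracy = (identified_myanmar_correct + identified_other_correct) / total
error_rate = (identified_myanmar_wrong + identified_other_wrong) / total

print(f"accuracy = {accuracy:.4f}, error rate = {error_rate:.4f}")
```

The two values sum to 1, since every one of the 2000 pages is either correctly or incorrectly identified.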
16
16 Misclassified Cases
1) Relevant but not retrieved:
- Bilingual pages written in Myanmar and English.
- Web pages using numeric character references, e.g. (ြ, ္).
2) Retrieved but not relevant:
- The misclassified pages are all English Web pages.
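One possible mitigation for the numeric character reference case, not stated in the slides, is to decode the references before language identification. The sketch below uses Python's standard `html.unescape`. The two references and the `myanmar_ratio` helper are illustrative assumptions.

```python
import html


def decode_character_references(markup: str) -> str:
    """Turn references such as &#4156; into the characters they stand for."""
    return html.unescape(markup)


def myanmar_ratio(text: str) -> float:
    """Share of non-space characters that fall in the Myanmar Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u1000" <= c <= "\u109f" for c in chars) / len(chars)


raw = "&#4156;&#4153;"  # decimal references for U+103C and U+1039
decoded = decode_character_references(raw)

print(myanmar_ratio(raw), myanmar_ratio(decoded))
```

Before decoding, the page text consists only of ASCII characters, which is consistent with such pages being mistaken for non-Myanmar content.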
17
17 B) Precision and Recall
Precision: the ability to retrieve top-ranked documents that are mostly relevant.
  Precision = |X ∩ Y| / |Y|
Recall: the ability of the search to find all of the relevant items in the entire Web space.
  Recall = |X ∩ Y| / |X|
where X = relevant documents and Y = retrieved documents.
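With X as the set of relevant documents and Y as the set of retrieved documents, both measures reduce to set operations. The sketch below uses the standard definitions. The document identifiers are hypothetical.

```python
def precision_recall(relevant: set[str], retrieved: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a set of relevant and retrieved documents."""
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"a", "b", "c", "d"}   # X: pages that really are in the target language
retrieved = {"b", "c", "e"}       # Y: pages the crawler kept

precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

Here two of the three retrieved pages are relevant, and two of the four relevant pages were retrieved.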
18
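18 How to estimate the total number of Web pages
Venn diagram labels: A = pages returned for the first keyword, B = pages returned for the second keyword.
X = (|A| × |B|) / |A ∩ B|
where X = the estimated number of total Myanmar pages on the Web, A = first keyword, B = second keyword.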
19
19 Total number of URLs returned by Google for each keyword
Keyword          Number of URLs
(Day)            68,500
(But)            41,000
(Human Being)    117,000
(Now)            31,500
(Myanmar)        56,500
(He)             46,600
Total            361,100
Experiment period: 25th June 2008 to 27th June 2008.
20
20 Estimated X
First keyword   Second keyword   URLs (first)   URLs (second)   URLs (both)   Estimated X
Day             But              68,500         45,200          13,700        205,000
Day             Human            68,500         120,000         14,200        564,401
Day             Now              68,500         35,300          11,800        182,860
:               :                :              :               :             :
Now             He               31,500         46,600          10,000        140,805
Myanmar         He               56,500         46,600          11,200        225,496
Total                                                                         4,905,169
Average of 15 keyword-pair combinations                                       327,011
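The estimate can be sketched as a capture-recapture style calculation, assuming X is the product of the two keyword counts divided by the count of pages containing both. With the per-keyword counts for Day (68,500) and But (41,000) and an overlap of 13,700, this assumption reproduces the 205,000 shown for that pair. The function name is illustrative.

```python
def estimate_total(first_count: int, second_count: int, both_count: int) -> float:
    """Estimate the total population from two samples and their overlap."""
    return first_count * second_count / both_count


# Day / But pair: per-keyword counts and the count of pages containing both.
x_day_but = estimate_total(68_500, 41_000, 13_700)

# Average over the 15 keyword pairs, using the reported sum of estimates.
average_x = 4_905_169 / 15

print(round(x_day_but), round(average_x))
```

Averaging over many keyword pairs dampens the effect of any single pair whose keywords are unusually correlated.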
21
21 Precision and Recall of crawling Entertainment site case
22
22 Precision and Recall of crawling Blog site case
23
23 Precision and Recall of crawling News site case
24
24 C) Crawling Coverage
Crawling parameters
Seed URLs              35
Level of depth         6
Crawling time          2 weeks
CPU                    2.40 GHz
Memory                 1 GB
Internet connection    100 Mbit/s

Domain         Number of pages collected
.mm            3,555 [1.1%]
.com           276,554 [83.2%]
Other gTLDs    52,245 [15.7%]
Total          332,354 [100.0%]
25
25 Distribution of estimated total number of Myanmar pages
Estimated average: 327,011
Collected: 332,354
26
26 4. Conclusion
- The proposed crawler design proved to work as an LSC for the Myanmar language
- The LSC can download Myanmar pages on the Web at a satisfactory level
- The proposed LSC can be used as part of a Myanmar search engine
27
27 5. Limitations of LSC
- How to reach isolated Myanmar pages (choice of seed URLs, etc.)
- Misidentification by the Language Identifier (in particular, bilingual pages in English and Myanmar need to be collected)
- The speed of the LSC needs to be improved
28
28 6. Themes for doctoral study
1. Lexicon
2. Indexing
3. Code conversion (Transcoding)
4. Stop words removal
5. Stemming algorithm
29
29 Language Specific Search Engine: Basic Architecture
Diagram components: WWW; language specific crawler (crawler, language identification); page repository; parser; indexer; corpus/lexicon; ranking engine; query engine; query; results.
30
30 1. Lexicon
- Lexicon is also a synonym for dictionary or encyclopedic dictionary.
- In linguistics, the lexicon of a language is its vocabulary, including its words and expressions.
Diagram labels: daily newspaper; Web pages; URLs; dictionary; lexicon.
31
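31 2. Indexing
Indexing is the process of recording, for each keyword, the documents of a corpus in which it occurs.
Diagram: pages in the database (Page 1, Page 2, Page 3, ..., Page N) pass through the indexer, which produces a table mapping each lexicon ID to Web pages (e.g. ID 1: pages 2, 3; ID 2: page 8; ID 3: page 6).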
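The mapping from lexicon entries to pages is commonly implemented as an inverted index. The sketch below is a minimal version under simplifying assumptions: pages are plain strings, words are separated by whitespace (Myanmar text would need word segmentation first), and the page contents and lexicon are hypothetical.

```python
from collections import defaultdict


def build_index(pages: dict[int, str], lexicon: set[str]) -> dict[str, list[int]]:
    """Map each lexicon word to the sorted list of page IDs that contain it."""
    index: dict[str, list[int]] = defaultdict(list)
    for page_id in sorted(pages):
        # Only words that appear in the lexicon are indexed.
        for word in sorted(set(pages[page_id].split()) & lexicon):
            index[word].append(page_id)
    return dict(index)


pages = {
    1: "crawler language",
    2: "language index",
    3: "index crawler language",
}
lexicon = {"crawler", "language", "index"}

index = build_index(pages, lexicon)
print(index["crawler"])
```

At query time the search engine looks up each query word in the index instead of scanning every stored page.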
32
32 3. Code Conversion
Diagram labels: lexicon encoded in Unicode; Web page (contents) in Unicode or non-Unicode; transcoding; client; server.
33
33 4. Stop Words Removal
- Stop words are defined as non-information-bearing words.
- Myanmar sentences can be tokenized by eliminating stop words.
Example tokens: computer (N), students (N), useful (Adj)
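The idea of tokenizing by eliminating stop words can be sketched on text written without spaces. The example below uses an unspaced English string as a toy stand-in for a Myanmar sentence. The stop-word set is hypothetical, and the plain substring matching would also split inside content words that happen to contain a stop word.

```python
import re


def split_on_stop_words(text: str, stop_words: set[str]) -> list[str]:
    """Cut unspaced text at every stop word and return the remaining pieces."""
    if not stop_words:
        return [text] if text else []
    # Longest stop words first, so a longer stop word wins over its own prefix.
    ordered = sorted(stop_words, key=len, reverse=True)
    pattern = "|".join(re.escape(word) for word in ordered)
    return [piece for piece in re.split(pattern, text) if piece]


tokens = split_on_stop_words("computerisusefulforstudents", {"is", "for"})
print(tokens)
```

A practical tokenizer would need a way to tell a stop word from an identical character sequence inside a longer word, for example by checking candidate tokens against the lexicon.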
34
34 Stop-words list: English vs. Myanmar
1. Subject personal pronouns
   English: I, you, he, she, it, we, you, they
   Myanmar: uRefawmf? uRefr? ig? usKyf? uREkfyf? usaemf?
2. Object personal pronouns
3. Reflexive personal pronouns
4. Relative pronouns
5. Possessive pronouns and adjectives
6. Indefinite pronouns and adjectives
7. Demonstrative pronouns and adjectives
8. Conjunctions
9. Questions
10. Other (pronouns, prepositions)
35
35 5. Stemming
- A stemming algorithm is a conflation procedure that reduces all words with the same root to a single root.
- A stem is the portion of a word that is left after the removal of its affixes (i.e., prefixes and suffixes).
- e.g., connect is the stem for the variants connected, connecting, and connections
- e.g., is the stem for the variants, and
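The English example can be reproduced with a toy suffix-stripping stemmer. The suffix list and the minimum stem length are assumptions made for illustration. This is not a complete stemming algorithm and says nothing about how Myanmar affixes would be handled.

```python
# Suffixes are tried in order, so longer ones come before their shorter endings.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


variants = ["connect", "connected", "connecting", "connections"]
stems = {stem(word) for word in variants}
print(stems)
```

All four variants conflate to the same stem, so a query for any one of them can match pages containing the others.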
36
36 Thank you! Any questions?