

It has been stated that Portuguese is the sixth language in the world in terms of native speakers and the fourth most used in Internet interaction, and that native speakers of Portuguese constitute 3% of the Internet population. Our focus is on a finer characterisation of this increasingly large community of users in terms of sheer size, user population, content distribution, parallel content and search engine coverage.

Size in pages and words

A. Queries with several of the most frequent grammatical items in Portuguese
We tried 7 queries of between 5 and 10 words each and obtained 6 queries that roughly represent Portuguese. We present the results for two of them:
1. The one that returned the smallest number of pages: que AND o AND e AND do AND da AND em AND um AND para
2. The one that returned the largest number of pages: de AND a AND o AND do AND da AND e

B. Common words
We used 3 queries with content words that are common in Portuguese.

C. Low frequency words
We performed queries with 8 low frequency words: austero, arrumação, cara-dura, cara(s)-de-pau, cegonha(s), guarda-no(c)turno, gato-pingado, matrimônio / matrimónio.

The aim of experiments B and C was to estimate the size of the Web in Portuguese, using the relative frequencies of the words in Portuguese corpora as estimators of the percentage of the Web covered. The final estimation remains to be done, however: while the frequency of grammatical words increases linearly with text size, the picture is different for content words, which tend to appear in bursts. Words like cegonha or arrumação very probably appear more than once in texts about these subjects, so their relative frequency has to be corrected by a factor that models this (for example 1/2.5). This factor is probably lexically dependent as well: it may vary with part of speech and with the words themselves.

D. Words belonging both to English and Portuguese
We made 7 queries on one search engine, using its language facility: Israel, Shakespeare, Timor, legal, Portugal, Jorge Amado, Eça de Queiroz. It returned from 4.61 to 46.55 times more documents in English than in Portuguese, except for the last two, where the picture was reversed.

E. Corresponding distinct words in English and Portuguese
We then input to the same engine 17 pairs of queries: perigoso*/dangerous, fidelidade*/fidelity*, cavalo*/horse*, universidade*/university*, and one for each month (Janeiro/January, …). The search engine returned from 3.18 to [...] times more documents in English than in Portuguese.

F. Reproduction of Grefenstette's estimation method
Replicating Grefenstette's estimation method with Portuguese words (com, uma, os, não, ao, mas, muito, seu, são, eu, foi, você, ele, pela, quando, pode, brasil, seus, um) on Altavista, we got 5,090,230,228 words in early November.

Search engine coverage and evolution

Our experiments showed that Alltheweb has the biggest database of indexed pages in Portuguese. Google comes second, followed by Altavista. Based on query 2 and on the information that Alltheweb searches 2,095,568,809 pages and Google 3,000,000,000, the percentage of Portuguese content in the indexed pages of these search engines can be estimated as 0.99% and 0.14% respectively. Comparing the size of the Portuguese portion of Altavista in 2000 with that in December 2002, we observe that it grew 5.5 times.

Parallel content

Following Resnik, we looked for pages containing at least one hyperlink where English appears in the text or URL associated with the link, and at least one such link for Portuguese. We got 4,246,916 pages. But with these queries we also found dictionaries, automatic translators, language courses and pointers to products (books) in other languages.
So we looked for pages in Portuguese containing at least one hyperlink where expression A appears in the text associated with the link (and the other way around for pages in English). This way we found 300,303 pages in Portuguese that have a parallel version in English and 105,365 pages in English that have a parallel version in Portuguese.
Expression A: english version or version in english or this page in english or this homepage in english or versão em inglês or versão inglesa or esta página em inglês.

Diana Santos, Rachel Aires
Measuring the Web in Portuguese
Linguateca, SINTEF Tele og Data, Pb 124, Blindern, NO-0314 Oslo, Norway

Content distribution (subject, size and type)

We investigated the content distribution in 5 subject areas: technical texts, news, cooking, dating, and sales on the Web. We got 131,226 technical pages and 69,431 pages with recipes. We don't yet have a definitive number for the three remaining subjects, because for them we used many query expressions that are not necessarily independent. An interesting research question is how to experimentally assign weights to the different expressions that we found to be relevant for identifying genre or content.

Remaining work

These are just preliminary results, which require further processing as well as confirmation and independent verification of the estimation clues. We intend to run more experiments, and to do so on a regular basis. We would also like to answer the question "Who accesses Web content in Portuguese?" Two hints on how this might be done: count how many references to the Web in Portuguese there are in other domains; and, if link statistics are made available by public multilingual sites, count how many times Portuguese pages are accessed instead of the corresponding pages in other languages. We intend to explore the content distribution further by investigating documents in Portuguese about other subjects, for example health, and to compare it with the general (all languages) content distribution.
We also want to investigate the coverage, strengths and weaknesses of Portuguese-dedicated search engines (todobr.com.br, tumba.pt, etc.) compared to general ones.
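
The size estimate that experiments B and C aim at can be sketched in a few lines of Python. This is a minimal sketch under one reading of the method described under "Size in pages and words", not the authors' implementation. It assumes that a search engine reports page hits for a word, that the word's relative frequency comes from a reference corpus, and that dividing the hits by the burst-corrected frequency approximates the number of words indexed. The function name and the sample figures are hypothetical; only the 1/2.5 correction factor is taken from the text.

```python
def estimate_web_words(page_hits, relative_frequency, burst_factor=1 / 2.5):
    """Estimate the number of Portuguese words behind a search engine index.

    page_hits: pages returned for the word (hypothetical input).
    relative_frequency: occurrences of the word per corpus word.
    burst_factor: correction for content words that occur several times
        in the same page; 1/2.5 is the example value given in the text.
    """
    if page_hits < 0:
        raise ValueError("page_hits must be non-negative")
    if relative_frequency <= 0 or burst_factor <= 0:
        raise ValueError("relative_frequency and burst_factor must be positive")
    # A page is counted once however often the word occurs in it, so the
    # per-word frequency is scaled down before being used as an estimator.
    corrected_frequency = relative_frequency * burst_factor
    return page_hits / corrected_frequency


# Hypothetical figures, for illustration only: a word with 50,000 page hits
# and a corpus frequency of 2 occurrences per million words.
estimate = estimate_web_words(page_hits=50_000, relative_frequency=2e-6)
print(f"{estimate:.3e} words")
```

Passing burst_factor=1 disables the correction. Since the text expects the factor to be lexically dependent, a per-word value could be passed in place of the single default.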
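
The coverage percentages given under "Search engine coverage and evolution" are a ratio of Portuguese hits to index size, and that arithmetic can be sketched in Python as follows. The index sizes and percentages are the ones quoted in the text. The hit counts for query 2 are not given there, so the code only back-computes the counts that those percentages imply; the function names are illustrative.

```python
def portuguese_share(portuguese_pages, indexed_pages):
    """Return the percentage of an index made up of Portuguese pages."""
    if indexed_pages <= 0:
        raise ValueError("indexed_pages must be positive")
    return 100.0 * portuguese_pages / indexed_pages


def implied_pages(share_percent, indexed_pages):
    """Invert portuguese_share: pages implied by a reported percentage."""
    return round(share_percent / 100.0 * indexed_pages)


# Index sizes and percentages as quoted in the text.
engines = {
    "Alltheweb": (2_095_568_809, 0.99),
    "Google": (3_000_000_000, 0.14),
}

for name, (indexed, share) in engines.items():
    pages = implied_pages(share, indexed)  # derived, not a reported figure
    print(f"{name}: about {pages:,} Portuguese pages "
          f"({portuguese_share(pages, indexed):.2f}%)")
```

Because the published percentages are rounded to two decimals, the implied page counts are only approximate.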