

Defining Metrics to Automate the Quantitative Analysis of Textual Information within a Web Page

by Professor Vasile Avram, PhD, Informatics in Economy Department, Academy of Economic Studies, Bucharest, Romania

1 Search Engines

1st. Collect information (keywords, URL, content, links in/out, etc.);
2nd. Analyze the collected information:
- ranked
- indexed
3rd. Store in the database (compressed?)

[Figure: search engine components. Labels: Crawler (follow links), Spider (find pages), Web Pages, Downloads Pages, Indexer (Analyze Information), Database, Results Engine, Cataloged Information, Search request, Search results.]
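The collect, analyze and store steps on this slide can be illustrated with a toy example. The sketch below is not the architecture of any real search engine: the PAGES dictionary stands in for the web, and the function names crawl, build_index and search are hypothetical names chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.html": ("search engines index web pages", ["b.html"]),
    "b.html": ("spiders download pages and follow links", ["a.html", "c.html"]),
    "c.html": ("the indexer analyzes pages", []),
}

def crawl(start):
    """Collect: follow links from `start` and return (url, text) pairs."""
    seen, queue, collected = set(), [start], []
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        text, links = PAGES[url]
        collected.append((url, text))
        queue.extend(links)
    return collected

def build_index(collected):
    """Analyze and store: map each keyword to the URLs that contain it."""
    index = defaultdict(set)
    for url, text in collected:
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)
    return index

def search(index, term):
    """Results engine: answer a search request from the stored index."""
    return sorted(index.get(term.lower(), set()))

index = build_index(crawl("a.html"))
print(search(index, "links"))
```

A real engine would also rank the matches and persist the index; both are left out here to keep the three steps visible.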

2 Ranking and SEO

Common page ranking criteria:
- Location: the position of the keyword within the page;
- Frequency: how often the search term appears on the page;
- Links: the type and number of links on a web page;
- Click-throughs: the number of click-throughs the site receives compared with the click-throughs received by the other pages shown in the page ranking.
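Two of these criteria, location and frequency, can be measured from the page text alone. The sketch below is only an illustration of that idea; the function name keyword_signals and the choice of word-level tokens are assumptions, and links and click-throughs are omitted because they need data from outside the page.

```python
import re

def keyword_signals(text, keyword):
    """Return simple location and frequency signals for one keyword."""
    words = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    positions = [i for i, word in enumerate(words) if word == target]
    return {
        # Location: index of the first occurrence (None if absent).
        "first_position": positions[0] if positions else None,
        # Frequency: share of the page's words that are the keyword.
        "frequency": len(positions) / len(words) if words else 0.0,
        "count": len(positions),
    }

print(keyword_signals("SEO tips: learn SEO today", "seo"))
```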

3 Hiding information by exploiting CSS features

Figure 1 Aspect of a web page with CSS enabled (left) and CSS disabled (right)

Figure 2 The source of the web page

Figure 3 What a spider sees in the page
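The gap between what a user sees and what a spider sees can be reproduced with a small example. The sketch below uses Python's standard html.parser module; the TextSpider class, the spider_text helper and the SAMPLE page are made up for illustration and are not the author's spider. Because the parser ignores CSS, text inside an element styled with display:none is still extracted.

```python
from html.parser import HTMLParser

class TextSpider(HTMLParser):
    """Collects text nodes the way a CSS-unaware spider would."""
    SKIP = {"script", "style"}  # their content is code, not page text

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def spider_text(html):
    """Return all text in the markup, whether or not CSS would hide it."""
    parser = TextSpider()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical page: the second paragraph is hidden from users by CSS.
SAMPLE = """<html><head><style>.h{display:none}</style></head>
<body><p>Visible offer</p><p class="h">cheap cheap cheap keywords</p></body></html>"""

print(spider_text(SAMPLE))
```

A browser with CSS enabled would render only "Visible offer", while the extracted text also contains the hidden keywords.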

4 Determining the effective amount of text information (EATI) within a web page

Figure 4 A snapshot of the page (IECapt; [6])

The Effective Amount of Text Information (EATI) is the ratio of the amount of text information obtained by applying optical character recognition (OCR) to the snapshot of the web page (figure 4), denoted ATIOCR, to the text information extracted by the spider (figure 3), denoted TIES:

$$\mathrm{EATI} = \frac{\mathrm{ATIOCR}}{\mathrm{TIES}} \qquad (1)$$

The value of the ratio can be:
- less than 1: the page contains hidden information, in inverse proportion to the value of the metric (the smaller the metric, the larger the amount of hidden text);
- equal to 1: the ideal case, in which what is shown is what is contained;
- greater than 1: the page shows extra text information, which signals that the page has images containing text that, in most cases, is not considered when ranking. The larger the value, the more extra text there is.
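The ratio in formula (1) and its three cases can be sketched in a few lines. The slides do not say how the "amount of text" is measured, so the word count used here is an assumption, as are the function names and the tolerance parameter, which is added only because OCR output is rarely exact.

```python
def amount_of_text(text):
    """Assumed measure of text amount: number of whitespace-separated words."""
    return len(text.split())

def eati(ocr_text, spider_text):
    """EATI = ATIOCR / TIES, with both amounts measured in words."""
    ties = amount_of_text(spider_text)
    if ties == 0:
        raise ValueError("the spider extracted no text; EATI is undefined")
    return amount_of_text(ocr_text) / ties

def interpret(value, tolerance=0.05):
    """Map an EATI value to the three cases (tolerance is an assumption)."""
    if value < 1 - tolerance:
        return "hidden text"
    if value > 1 + tolerance:
        return "extra text in images"
    return "shown matches contained"

value = eati("Visible offer", "Visible offer cheap cheap cheap keywords")
print(round(value, 2), interpret(value))
```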

The working procedure used to evaluate the metric involves the following three steps and the corresponding types of tools:
1st. Use a spider to extract the text information within a web page and determine the TIES value required in formula (1). The spider we built is based on the theory in [4] and the libraries available at [6], plus our own functions to clean up the extracted text;
2nd. Use a snapshot application that can be called from within a robot body to take a snapshot of the page used in step one and save it in an image format;
3rd. Apply an OCR tool (here, ReadIRIS Pro 11) to the image saved in the previous step and obtain the recognized text required to determine ATIOCR in (1).
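The three steps can be wired together as a small driver in which each tool is passed in as a function. This is a sketch only: evaluate_eati and its parameters are hypothetical names, the stub tools below replace a real spider, snapshot program and OCR engine, and the word count as the measure of text amount is an assumption.

```python
from typing import Callable

def evaluate_eati(url: str,
                  extract_text: Callable[[str], str],
                  take_snapshot: Callable[[str], str],
                  run_ocr: Callable[[str], str]) -> float:
    """Run the three steps and return ATIOCR / TIES (amounts in words)."""
    spider_text = extract_text(url)    # step 1: text seen by the spider
    image_path = take_snapshot(url)    # step 2: snapshot saved as an image
    ocr_text = run_ocr(image_path)     # step 3: text recognized by OCR
    ties = len(spider_text.split())
    atiocr = len(ocr_text.split())
    if ties == 0:
        raise ValueError("the spider extracted no text; EATI is undefined")
    return atiocr / ties

# Stub tools standing in for the real ones (illustrative values only).
result = evaluate_eati(
    "http://example.com/",
    extract_text=lambda url: "Visible offer cheap cheap",
    take_snapshot=lambda url: "snapshot.png",
    run_ocr=lambda path: "Visible offer",
)
print(result)
```

Passing the tools as arguments keeps the procedure independent of any particular spider, snapshot utility or OCR product.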

A. Determining the textual information contained by graphic elements (TIG) metric

The procedure used to determine the textual information contained by graphic elements (denoted TIG) within a web page is:
1st. Use a spider to extract the graphic elements (images, pictures, shapes, etc.) together with their positional coordinates, and recompose a working web page of the same size as the original that contains only those graphic elements, positioned at their proper coordinates;
2nd. Use a snapshot application that can be called from within a robot body to take a snapshot of the working page built in step one and save it in an image format accepted as input by the OCR tool;
3rd. Apply an OCR tool to the image saved in the previous step and obtain the recognized text required to determine the value of the textual information contained by graphic elements (TIG).
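The recomposition in the first step can be sketched as generating a page that holds only the images, absolutely positioned. This assumes the graphic elements and their pixel coordinates have already been obtained by some other means; the function name build_graphics_only_page and the tuple layout (src, left, top, width, height) are hypothetical.

```python
from html import escape

def build_graphics_only_page(page_width, page_height, graphics):
    """Build HTML with only the given images at their original positions.

    graphics: iterable of (src, left, top, width, height), in pixels.
    """
    images = []
    for src, left, top, width, height in graphics:
        images.append(
            '<img src="{}" style="position:absolute;left:{}px;top:{}px;'
            'width:{}px;height:{}px">'.format(
                escape(src), left, top, width, height)
        )
    return (
        "<html><body style=\"margin:0\">"
        "<div style=\"position:relative;width:{}px;height:{}px\">{}</div>"
        "</body></html>"
    ).format(page_width, page_height, "".join(images))

page = build_graphics_only_page(
    800, 600,
    [("logo.png", 10, 20, 120, 40), ("banner.jpg", 0, 100, 800, 150)],
)
print(page)
```

The resulting page can then be passed to the snapshot and OCR steps in place of the original page.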

B. Determining the quantity of textual information shown to the user (QTISU) metric

[Formula (2)]

C. Determining the text information shown to the user (TISU) from tags metric

[Formula (3)]

- TISU = 100 ⇒ what is shown is what was extracted by the spider (no hidden information is used);
- TISU < 100 ⇒ part of the textual information contained by tags is hidden. The lower the value, the more textual information is hidden.

D. The percent of textual information revealed by graphic elements to the user (TIRGU) metric

[Formula (4)]

- TIRGU = 100 ⇒ the entire text information shown to the user is contained only by the graphic elements;
- TIRGU < 100 ⇒ the value is the percentage of the textual information revealed to the user by graphic elements. The lower the value, the more of the shown textual information comes from tags.
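Formulas (2) to (4) are not legible in this transcript, so the sketch below does not reproduce them. It shows one possible reading that agrees with the stated interpretations of TISU and TIRGU, under two loudly stated assumptions: the text shown from tags is taken as ATIOCR minus TIG, and all amounts use the same unit (for example, words). The function names are hypothetical.

```python
def tisu(atiocr, tig, ties):
    """ASSUMED reading: percent of spider-extracted text shown from tags.

    Text shown from tags is assumed to be ATIOCR - TIG.
    """
    if ties == 0:
        raise ValueError("TIES is zero; TISU is undefined")
    return 100 * (atiocr - tig) / ties

def tirgu(atiocr, tig):
    """ASSUMED reading: percent of the shown text that comes from graphics."""
    if atiocr == 0:
        raise ValueError("ATIOCR is zero; TIRGU is undefined")
    return 100 * tig / atiocr

# Illustrative amounts only: 80 words shown, 20 of them inside images,
# 100 words extracted by the spider.
print(tisu(atiocr=80, tig=20, ties=100), tirgu(atiocr=80, tig=20))
```

Under this reading, TISU reaches 100 when the tag text shown equals what the spider extracted, and TIRGU reaches 100 when all the shown text sits in graphic elements, which matches the two boundary cases described on the slides.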

5 Conclusions

References
[1] Jorge Cardoso (ed.), Semantic Web Services: Theory, Tools and Applications, IGI Global, © 2007, Books24x7.
[2] Vasile Avram, "Effective Amount of Text Information (EATI) in a Web Page – A Proposal for a New Metric and Method to Determine", Proceedings of the 9th International Conference on Informatics in Economy, May 2009, Editura Economică, ISBN , pp
[3] Jerri L. Ledford, SEO Search Engine Optimization Bible, Wiley Publishing, 2008.
[4] Google, Hidden text and links, Webmaster Tools,
[5] Michael Schrenk, Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL, No Starch Press, Books24x7.

[6] – Open Source PHP libraries for robots development, http://
[7] P.J. Deitel, H.M. Deitel, Internet and World Wide Web How to Program, fourth edition, Prentice Hall, 2008, pages
[8] World Wide Web Consortium, The Specification of Standards for HTML, XHTML, CSS, XML:
[9] Vasile Avram, Internet Technologies for Business: Documents and Websites-structure and description languages,
[10] Yahoo! Search Content Quality Guidelines,
[11] SEO tools-Search Engine marketing,