Defining Metrics to Automate the Quantitative Analysis of Textual Information within a Web Page
by Professor Vasile AVRAM, PhD
Informatics in Economy Department, Academy of Economic Studies – Bucharest, ROMANIA
1 Search Engines
1st. Collect information (keywords, URL, content, links in/out, etc.);
2nd. Analyze the collected information: it is ranked and indexed;
3rd. Store it in the database (compressed?).
[Diagram: Search Engine components. Crawler (follows links), Spider (finds and downloads Web pages), Indexer (analyzes information), Database (cataloged information), Results Engine (search request, search results).]
2 Ranking and SEO
Common page ranking criteria:
- Location – the position of the keyword on the page;
- Frequency – how often the search term appears on the page;
- Links – the type and number of links on a web page;
- Click-throughs – the number of click-throughs the site receives compared with the click-throughs of the other pages shown in the ranking.
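As an illustration of the location and frequency criteria only, the following minimal sketch counts how often a keyword occurs in a page's text and where it first appears. The function name `keyword_stats`, the word-based tokenization, and the `density` field are assumptions made for this example; real search engines weigh these signals in ways not described here.

```python
import re


def keyword_stats(text: str, keyword: str) -> dict:
    """Return frequency, first word position and density of a keyword.

    Hypothetical helper: words are taken to be runs of letters/digits,
    and matching is case-insensitive.
    """
    words = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    # Word indexes (0-based) at which the keyword occurs.
    positions = [i for i, word in enumerate(words) if word == target]
    return {
        "frequency": len(positions),
        "first_position": positions[0] if positions else None,
        "density": len(positions) / len(words) if words else 0.0,
    }


if __name__ == "__main__":
    print(keyword_stats("SEO tips: learn SEO fast", "seo"))
```

A lower `first_position` means the keyword appears earlier in the text, which corresponds to the location criterion; `frequency` corresponds to the frequency criterion.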
3 Hiding information by exploiting CSS features
Figure 1 Aspect of a webpage with CSS enabled (left) and CSS disabled (right)
Figure 2 The source of the web page
Figure 3 What a spider sees in the page
4 Determining the effective amount of text information (EATI) within a web page
Figure 4 A snapshot of the page (IECapt; [6])
The Effective Amount of Text Information (EATI) is determined as the ratio of the amount of text information obtained by applying optical character recognition (OCR) to the snapshot of the web page (figure 4), denoted ATIOCR, to the text information extracted by the spider (figure 3), denoted TIES:
$$\mathrm{EATI} = \frac{\mathrm{ATIOCR}}{\mathrm{TIES}} \qquad (1)$$
The value of the ratio can be:
- less than 1, in which case the page contains hidden information in inverse proportion to the value of the metric (the smaller the metric, the larger the amount of hidden text);
- equal to 1, the ideal case, in which what is shown is what is contained;
- greater than 1, in which case there is extra text information, signaling that the page has images containing text which, in most cases, is not considered when ranking. The larger the value, the more extra text there is.
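The ratio and its three cases can be sketched as follows. This is only an illustration: the slides do not say how the "amount of text information" is measured, so the sketch assumes a word count, and it adds a hypothetical `tolerance` band around 1 because OCR output is rarely an exact match. The function names are invented for this example.

```python
def text_amount(text: str) -> int:
    """Amount of text, measured here as a word count (an assumption)."""
    return len(text.split())


def eati(ocr_text: str, spider_text: str) -> float:
    """EATI = ATIOCR / TIES, with both amounts measured by text_amount."""
    ties = text_amount(spider_text)
    if ties == 0:
        raise ValueError("spider extracted no text; EATI is undefined")
    return text_amount(ocr_text) / ties


def interpret(value: float, tolerance: float = 0.05) -> str:
    """Map an EATI value to one of the three cases.

    The tolerance is a hypothetical allowance for OCR noise, not part
    of the metric's definition.
    """
    if value < 1 - tolerance:
        return "hidden text likely"
    if value > 1 + tolerance:
        return "extra text in images likely"
    return "shown text matches contained text"


if __name__ == "__main__":
    shown = "Welcome to our shop"               # text recognized by OCR
    contained = shown + " cheap cheap cheap deals"  # text seen by the spider
    value = eati(shown, contained)
    print(value, interpret(value))
```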
The working procedure used to evaluate the metric involves the following three steps and the corresponding types of tools:
1st. Use a spider to extract the text information within a webpage and determine the TIES value required in formula (1). The spider we built is based on the theory in [4], the libraries available at [6], and our own functions to clean up the extracted text;
2nd. Use a snapshot application that can be called from within a robot body to take a snapshot of the page involved in step one and save it in an image format;
3rd. Apply an OCR tool (here, ReadIRIS Pro 11) to the image saved in the previous step and obtain the recognized text required to determine ATIOCR in (1).
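The first step can be illustrated with a minimal sketch using only the Python standard library. It is not the spider described above (which is built on PHP libraries); the class and function names are hypothetical. It collects the text of a page regardless of CSS, skipping only `script` and `style` content, which is why text hidden with a rule such as `display:none` still reaches the spider. The snapshot and OCR steps depend on external tools and are not sketched here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of a page, ignoring script and style bodies."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside script/style.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the page text as a spider would see it (CSS is not applied)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head><body>"
        "<p>Visible</p><p style=\"display:none\">hidden words</p>"
        "<script>var x = 1;</script></body></html>"
    )
    print(extract_text(page))
```

The length of the returned string, or its word count, could then serve as a TIES value, depending on how the amount of text is measured.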
A. Determining the textual information contained by graphic elements (TIG) metric
The procedure used to determine the textual information contained by graphic elements (denoted TIG) within a web page is:
1st. Use a spider to extract the graphic elements (images, pictures, shapes, etc.) together with their positional coordinates, and recompose a working web page of the same size as the original that contains only those graphic elements, placed at their proper coordinates;
2nd. Use a snapshot application that can be called from within a robot body to take a snapshot of the working page built in step one and save it in an image format accepted as input by the OCR tool;
3rd. Apply an OCR tool to the image saved in the previous step and obtain the recognized text required to determine the value of TIG.
B. Determining the quantity of textual information shown to the user (QTISU) metric
(2)
C. Determining the text information shown to the user from tags (TISU) metric
(3)
- TISU = 100: what is shown is what the spider extracted (no hidden information is used);
- TISU < 100: part of the textual information contained in tags is hidden; the lower the value, the more textual information is hidden.
D. The percentage of textual information revealed by graphic elements to the user (TIRGU) metric
(4)
- TIRGU = 100: the entire text information shown to the user is contained only in the graphic elements;
- TIRGU < 100: the percentage of the textual information shown to the user that is revealed by graphic elements; the lower the value, the more of the shown textual information comes from tags.
5 Conclusions
References
[1] Jorge Cardoso (ed.), Semantic Web Services: Theory, Tools and Applications, IGI Global, 2007, Books24x7.
[2] Vasile Avram, “Effective Amount of Text Information (EATI) in a Web Page – A Proposal for a New Metric and Method to Determine”, Proceedings of the 9th International Conference on Informatics in Economy, May 2009, Editura Economică, ISBN , pp
[3] Jerri L. Ledford, SEO: Search Engine Optimization Bible, Wiley Publishing, 2008.
[4] Google, Hidden text and links, Webmaster Tools,
[5] Michael Schrenk, Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL, No Starch Press, Books24x7.
[6] – Open Source PHP libraries for robots development, http://
[7] P.J. Deitel, H.M. Deitel, Internet and World Wide Web: How to Program, fourth edition, Prentice Hall, 2008, pages
[8] World Wide Web Consortium, The Specification of Standards for HTML, XHTML, CSS, XML:
[9] Vasile Avram, Internet Technologies for Business: Documents and Websites – structure and description languages,
[10] Yahoo! Search Content Quality Guidelines,
[11] SEO tools – Search Engine marketing,