
Presentation on theme: "Accessing the Hidden Web Hidden Web vs. Surface Web Surface Web (Static or Visible Web): Accessible to the conventional search engines via hyperlinks."— Presentation transcript:

1 Accessing the Hidden Web
Hidden Web vs. Surface Web
Surface Web (Static or Visible Web): accessible to conventional search engines via hyperlinks.
Hidden Web (Dynamic or Invisible Web): web pages that require HTML forms to access the information, and dynamically generated web pages such as interactive, responsive Ajax applications (for instance, Google Maps).

2 Obstacles in Accessing the Hidden Web
Certain properties of Ajax applications (client-side dynamism): client-side execution, state changes and navigation, a dynamic representational model, and clickables.
Access interface using HTML forms (input dynamism): the form filters the information, which is passed to the server for processing; the server then generates the answer page.
Issue of scale: comprehensive coverage of the hidden web is not possible due to its enormous size.

3 CRAWLJAX
An approach to improve search-engine discoverability of Ajax applications. After the Ajax application has been built, a linked multi-page mirror site (fully accessible to search engines) is generated. The application is analyzed dynamically by actually running it: div and span elements in Ajax applications might have clickables attached to them, and detecting whether such an element is clickable by inspecting the code alone is very difficult.

4 State Flow Graph
Root: the initial state, "Index".
Vertex set: the set of runtime states: "Money Base", "Previous RR", "Current Currency Ratio", etc.
Edge set: an edge between v1 and v2 represents a clickable and states that v2 is reachable from v1.
[Diagram: a state flow graph with root "Index" connected to the states "Money Base", "Federal Updation Policy", "Current Currency Ratio", "Previous RR", and "Current Reserve Ratio".]
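The state flow graph described above can be sketched as a small data structure. This is an illustrative sketch, not CRAWLJAX's actual implementation; the class and method names are assumptions, and the states come from the slide's diagram.

```python
# Sketch of a state-flow graph: vertices are runtime DOM states, and each
# edge records the clickable that leads from one state to another.
class StateFlowGraph:
    def __init__(self, root):
        self.root = root
        self.states = {root}
        self.edges = {}  # (source, target) -> clickable that triggers it

    def add_state(self, state):
        self.states.add(state)

    def add_edge(self, source, target, clickable):
        # An edge (v1, v2) states that v2 is reachable from v1
        # by firing the given clickable.
        self.edges[(source, target)] = clickable

    def reachable_from(self, state):
        return [t for (s, t) in self.edges if s == state]

# The graph from the slide: "Index" is the root (initial) state.
g = StateFlowGraph("Index")
for s in ["Money Base", "Previous RR", "Current Currency Ratio"]:
    g.add_state(s)
    g.add_edge("Index", s, clickable="link:" + s)
```

During crawling, each newly discovered DOM state would be added as a vertex and connected to the state from which its clickable was fired.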

5 Working of CRAWLJAX
The URL of the website, along with prospective click elements (a set of HTML tags), is the input. A robot simulates real user clicks on the embedded browser to fire the events and actions attached to candidate clickables, and a state flow graph is built. The new state flow graph becomes the input for the crawl procedure, which is called recursively: the distance between the current DOM and the previous DOM is computed, and the state flow graph is updated accordingly. The links in the nodes of the graph are established by replacing each clickable with a hypertext link. HTML string representations of all DOM objects are generated and uploaded to the server, and the original Ajax site is then linked to the mirror site to provide the first doorway for the search engines.
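The distance check in the crawl step can be illustrated with a stand-in metric. The slides do not specify which distance CRAWLJAX uses, so this sketch substitutes difflib's similarity ratio; the threshold value and the DOM strings are hypothetical.

```python
# Illustrative stand-in for the DOM-distance check: the DOM before and after
# a click are compared, and a new state is recorded only when the difference
# exceeds a threshold t.
import difflib

def distance(dom_before, dom_after):
    # 0.0 = identical documents, 1.0 = completely different.
    sm = difflib.SequenceMatcher(None, dom_before, dom_after)
    return 1.0 - sm.ratio()

THRESHOLD = 0.1  # hypothetical value for the threshold 't' in the algorithm

before = "<div><span>Index</span></div>"
after = "<div><span>Money Base</span><table>...</table></div>"
is_new_state = distance(before, after) > THRESHOLD
```

If the distance stays below the threshold, the click is treated as not having produced a new state and no vertex is added.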

6 CRAWLJAX Architecture
[Diagram: Robot, CrawlJax Controller, State Machine, Site Map Generator, Mirror Site Generator, Linker, UI, Ajax Engine, DOM, and DOM-to-HTML Transformer; the controller drives browser events and clicks, receives delta updates of the DOM, and outputs a Site Map XML and multi-page HTML.]

1. procedure START(url, Set tags)
     browser = initBrowser(url)
     robot = initRobot()
     sm = initStateMachine()
     crawl(sm, tags)
     linkupAndSaveAsHTML(sm)
     generateSitemap(sm)
   end procedure

2. procedure CRAWL(StateMachine sm, Set tags)
     cs = sm.getCurrentState()
     deltaupdate = diff(cs.getDom(), browser.getDom())
     Set C = getCandidateClickables(deltaupdate, tags)
     for c in C do
       robot.click(c)
       dom = browser.getDom()
       if distance(cs.getDom(), dom) > t then
         ns = State(c, dom)
         sm.addState(ns)
         sm.changeState(ns)
         crawl(sm, tags)
         sm.changeState(cs)
         if browser.history.canBack then
           browser.history.goBack()
         else
           browser.reload()
           clickThroughTo(cs)
         end if
       end if
     end for
   end procedure

7 Evaluation of CRAWLJAX Features
High number of Ajax server calls: hot nodes need to be identified and hot calls minimized.
State explosion handling mechanism: configuration options such as the similarity threshold, maximum crawling time, and maximum depth level can be specified.
Redundant clickables: two clickables are generated rather than one; the span is the actual clickable, not the div.
Mouse-over clickables ignored: certain tags have mouse-over clickables that are ignored.
Dependence on the back implementation: unique IDs are assigned to elements so the crawler can navigate its way back to the original state.

8 SmartCrawl
An approach to access the web hidden behind forms, based on generating (name, value) pairs for input elements such as text boxes. A form is represented as F = {U, (N1, V1), (N2, V2), ..., (Nn, Vn)}. It handles combo boxes, radio buttons, check boxes, text fields, etc.
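The form representation above can be sketched directly. This is a minimal illustration of the F = {U, (N1, V1), ..., (Nn, Vn)} model, where U is taken to be the form's URL; the class name, field names, and values are assumptions for the example.

```python
# Minimal sketch of SmartCrawl's form model: the form's URL plus
# (name, value) pairs for its input elements.
from dataclasses import dataclass, field

@dataclass
class FormModel:
    url: str                                   # U: the form's URL
    pairs: list = field(default_factory=list)  # [(N1, V1), ..., (Nn, Vn)]

    def assign(self, name, value):
        self.pairs.append((name, value))

f = FormModel("http://example.com/search")
f.assign("author", "Smith")    # text field
f.assign("category", "books")  # combo box selection
f.assign("in_stock", "on")     # checkbox
```

One such pair is generated per input element, which lets the crawler treat text fields, combo boxes, radio buttons, and checkboxes uniformly at submission time.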

9 Working of SmartCrawl
Finding forms: pages containing forms are indexed.
Query generation: queries are generated from the information collected from the forms.
Visiting the results: label values are injected into the form and the query is submitted; HTTP requests using the GET and POST methods submit the parameter values.
Searching the stored pages: after a user issues a search, the indexed web pages are consulted and the pages with a high probability of answering the user's search are returned.
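The submission step above can be sketched for the GET case: the generated (name, value) pairs are urlencoded into the query string of the form's action URL. The URL and field values here are illustrative, not from the SmartCrawl paper.

```python
# Sketch of submitting a filled form with GET: the (name, value) pairs
# are urlencoded and appended to the action URL as a query string.
from urllib.parse import urlencode

def build_get_request(action_url, pairs):
    return action_url + "?" + urlencode(pairs)

url = build_get_request("http://example.com/search",
                        [("author", "Smith"), ("category", "books")])
```

For POST submissions, the same encoded pairs would instead be sent in the request body.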

10 SmartCrawl Architecture
[Diagram: components include the Form Parser, Warehouse, Crawler, Downloader, Form Result Indexer, Form Inquirer, and Document Seeker, connected through a Document List, Form List, Categories, Lexicon, Word Match, URL List, URL Queue, Query List, and New Query store.]

11 Label Extraction Algorithm
The form extractor looks for nodes that represent forms. These are converted to a hierarchical table representation (e.g., a form contains a checkbox, and a checkbox contains options). A first pass over the generated table checks what lies to the left of each field; if it is a label, it is associated with the field. A second pass looks for labels one cell above those fields whose labels were not found in the first pass. For checkboxes, the labels of the option values are extracted from the right.
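The two passes above can be sketched over a simple grid. This is an illustrative reconstruction under the assumption that the hierarchical table can be treated as a 2D grid of cells; the function name, grid layout, and field positions are made up for the example, and the checkbox right-side rule is omitted.

```python
# Sketch of the two-pass label search: pass 1 checks the cell to the left
# of each field, pass 2 checks the cell above for fields still unlabeled.
def extract_labels(grid, fields):
    """grid[row][col] holds a label string or None; fields maps a
    field name to its (row, col) position in the grid."""
    labels = {}
    for name, (r, c) in fields.items():        # first pass: left neighbor
        if c > 0 and grid[r][c - 1]:
            labels[name] = grid[r][c - 1]
    for name, (r, c) in fields.items():        # second pass: cell above
        if name not in labels and r > 0 and grid[r - 1][c]:
            labels[name] = grid[r - 1][c]
    return labels

grid = [["Author:", None, "Title:"],
        [None,      None, None]]
fields = {"author": (0, 1),   # label found to its left in pass 1
          "title": (1, 2)}    # label found above it in pass 2
labels = extract_labels(grid, fields)
```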

12 Evaluation of SmartCrawl Features
Incomplete extraction of labels: the label extraction algorithm only accesses elements in the tag.
Incomplete form extraction: the API used to extract the DOM tree (NekoHTML) has no support for malformed HTML.
Architecture compatibility: its strategies are easier to implement in current search engines to gain performance and scalability.
Quality of indexed data: there is no analysis of the "result pages".
Slow searching and indexing: unsophisticated structures are used for index storage.

13 HiWE: Hidden Web Exposer
Addresses the issue of input dynamism.
[Diagram: the LVS Table, URL List, LVS Manager, Form Analyzer, Crawl Manager, Parser, Form Processor, and Response Analyzer interact with the WWW and external data sources through submission, response, and feedback flows.]

14 Working of HiWE
Form analysis: an internal representation is built in the form of (element, domain) pairs.
Value assignment and submission: "best values" are generated for the input fields (from explicit initialization, built-in categories, and crawling experience) and the form is submitted.
Response analysis: the response page is analyzed to check its relevance.
Response navigation: hypertext links in the response page are followed recursively up to a pre-specified depth threshold.
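The (element, domain) representation and the value-assignment step can be sketched as follows. This is an illustrative sketch, not HiWE's actual code: the function name, the rule of picking the first candidate, and the example fields and LVS entries are assumptions.

```python
# Sketch of HiWE's form model: each input element is paired with a domain
# of candidate values. Finite domains (e.g. selection lists) enumerate
# their options; free-text fields have an open-ended domain and are
# filled from the LVS (Label Value Set) table instead.
def best_value(element, domain, lvs_table):
    if domain:  # finite domain: pick one of the listed options
        return domain[0]
    # open-ended domain: fall back to values the LVS table
    # associates with this element's label
    return lvs_table.get(element, [None])[0]

form = [("state", ["CA", "NY", "TX"]),  # finite domain
        ("company", [])]                # open-ended: use the LVS table
lvs = {"company": ["IBM", "Intel"]}
assignment = {e: best_value(e, d, lvs) for e, d in form}
```

In HiWE, the LVS table entries (and hence these fallback values) are refined over time using explicit initialization, built-in categories, and crawling experience.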

15 Evaluation of HiWE Features
Sophisticated label extraction algorithm: relies on visual adjacency and partial page layout.
Response analysis: the result pages are analyzed to filter out erroneous pages.
Efficient ranking function: fuzzy, average, and probabilistic methods are used.
Partial inputs in the form model ignored: certain forms are skipped, such as those with a low number of input elements or no matching entries in the LVS table.
Effective form filling using crawl history: the finite and infinite value sets in the LVS table are populated based on past values.
Task-specific initialization: helps HiWE avoid topic drift.

16 Comparisons
Kinds of applications crawled: CRAWLJAX automatically clicks; SmartCrawl and HiWE automatically fill forms.
Topic drift: CRAWLJAX blindly clicks the clickable elements; HiWE follows a task-specific approach; SmartCrawl initially fills in default values for the fields but then fills out other combinations of values.
Different label extraction algorithms: HiWE relies on visual adjacency; SmartCrawl generates the DOM tree (hierarchical table structure).

17 Comparisons contd.
Performance (relevant pages in the result set): HiWE benefits from a clever ranking function, crawler input, and response analysis; CRAWLJAX's performance depends on the kind of clickables discovered; SmartCrawl suffers from low-performance data structures, a naive label extraction algorithm, no analysis of response pages, etc.
Speed of execution: HiWE and SmartCrawl execute faster than CRAWLJAX; the presence of hot calls makes CRAWLJAX slower.
Maintenance of IDs: an overhead of CRAWLJAX, required to implement the back functionality; HiWE and SmartCrawl have no such requirement.

