Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014.



What is Web Mining? Discovering desired and useful information from the World Wide Web

Exploiting Geographical Location Information of Web Pages. Orkut, Junghoo, Hector, Luis. Department of Computer Science, Stanford University, Stanford, CA; Department of Computer Science, Columbia University, New York (December 27, 2008)

“Proof of Concept” using mapping databases. Ways of exploiting geographical information from the Internet: (1) Improve search engines, for example by not showing results that are irrelevant to the query. (2) Identify the “globality” of resources: by using hyperlinks and exploiting information about the web sites involved, it can be estimated how global a web entity is.

Problems in exploiting the geographical location information of entities: How can geographical information be computed? How can this information be exploited?

Computing geographical information: (1) Information extraction: automatically analyze web pages to extract geographic entities such as area codes or zip codes. (2) Network IP address analysis: focus on the location of the sites that host the web pages.

Exploiting the information using databases. Site Mapper: built from a database that has the phone numbers of the network administrators of all Class A and Class B domains. From this database, the area code of each domain administrator was extracted and used to build a Site-Mapper table with area code information for IP addresses belonging to Class A and Class B addresses.

Area Mapper: maps cities and townships to a given area code. In some cases, an entire state (e.g., Montana) corresponds to one area code. In other cases, a big city has multiple area codes (e.g., Los Angeles). Scripts then convert this data into a table that maintains, for each area code, the corresponding set of cities/counties.

Zip-Code Mapper: maps each zip code to a range of longitudes and latitudes.
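The three lookup tables described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table contents, the IP prefix, the area code "555", the zip code "00000", and the function names are all made up for the example and do not come from the original databases.

```python
# Hypothetical, made-up table contents; the real tables were built from
# external databases that are not reproduced here.
SITE_MAPPER = {"172.16": "555"}                        # network prefix -> area code
AREA_MAPPER = {"555": ["Example City", "Sample Town"]}  # area code -> cities/counties
ZIP_MAPPER = {"00000": ((10.0, 10.2), (20.0, 20.4))}    # zip -> (lat range, lon range)


def network_prefix(ip):
    """Return the classful network part of an IPv4 address (Class A or B only)."""
    octets = ip.split(".")
    first = int(octets[0])
    if first < 128:            # Class A: the first octet identifies the network
        return octets[0]
    if first < 192:            # Class B: the first two octets identify the network
        return ".".join(octets[:2])
    return None                # other classes are not covered by the table


def locate_ip(ip):
    """Map an IP address to an area code and its candidate cities, if known."""
    area_code = SITE_MAPPER.get(network_prefix(ip))
    if area_code is None:
        return None
    return {"area_code": area_code, "cities": AREA_MAPPER.get(area_code, [])}


def zip_center(zip_code):
    """Return the midpoint (lat, lon) of the range stored for a zip code."""
    (lat_lo, lat_hi), (lon_lo, lon_hi) = ZIP_MAPPER[zip_code]
    return ((lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2)
```

For example, `locate_ip("172.16.5.9")` returns the made-up area code and city list, while an address outside Class A and Class B returns `None`.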

[Figure: Graphical interface of the proof-of-concept prototype. Labels: States, Cities, Refresh, Zoom, Map, URL, City, Zip code, Area Code, IP; search input and search output.]

Geospatial Data Mining on the Web: Discovering Locations of Emergency Service Facilities (2012). Wenwen Li, Michael F. Goodchild, Richard L. Church, and Bin Zhou. GeoDa Center for Geospatial Analysis and Computation, School of Geographical Sciences and Urban Planning, Arizona State University, Tempe, AZ; Department of Geography, University of California, Santa Barbara, CA; Institute of Oceanographic Instrumentation, Shandong Academy of Sciences, Qingdao, Shandong, China.

[Figure: Google search result for a fire station compared with its actual location. Labels: Actual location, Google result.]

Process of a Web Crawler: A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider or an automatic indexer.
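A minimal sketch of the crawling loop may help here. It assumes a breadth-first strategy and a caller-supplied `fetch` function (both are choices made for this illustration, not details from the slides), and it runs against a small in-memory "web" under the reserved `example.test` domain so that it needs no network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page could not be retrieved
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory pages standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}
order = crawl(["http://example.test/"], FAKE_WEB.get)
```

A real crawler would also need politeness delays, robots.txt handling, and error handling, all of which are omitted from this sketch.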

Defining New Class Address Structure

Form of street address for identifying target web pages

Cont.
d1: Distance between p and the location of the foremost digit in the number block closest to (and before) location p.
d2: Distance between p and the location of the last digit of the first number that appears (for detecting a 5-digit ZIP code), or the last digit of the second number after p if the token distance of the first and second number block equals
r1: Regular expression [1-9][0-9]*[\\s\\r\\n\\t]*([a-zA-Z0-9\\.]+[\\s\\r\\n\\t])+
r2: Regular expression "cityPattern"[\\s\\r\\n\\t,]?+("statePattern")?+[\\s\\r\\n\\t,]*\\d{5}(-\\d{4})*
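To make the two patterns concrete, here is a rough Python adaptation. It is not the authors' implementation: the city and state alternatives are hypothetical stand-ins for "cityPattern" and "statePattern", the possessive quantifiers (`?+`) are replaced with ordinary ones, the separator between city and state is loosened to accept a comma followed by a space, and the pairing rule (take the last r1 match before each r2 match) is a simplification that ignores the distances d1 and d2.

```python
import re

# Hypothetical stand-ins for the source's "cityPattern" and "statePattern".
CITY_PATTERN = r"(?:Santa Barbara|Goleta|Tempe)"
STATE_PATTERN = r"(?:CA|AZ)"

# r1 (adapted): a street number followed by one or more street-name tokens,
# each token ending in whitespace.
R1 = re.compile(r"[1-9][0-9]*\s*(?:[a-zA-Z0-9.]+\s)+")

# r2 (adapted): city, optional state, then a 5-digit ZIP code with an
# optional 4-digit extension.
R2 = re.compile(
    CITY_PATTERN + r"[\s,]*(?:" + STATE_PATTERN + r")?[\s,]*\d{5}(?:-\d{4})?"
)


def find_addresses(text):
    """Return (street, city_state_zip) pairs found in text."""
    results = []
    for tail in R2.finditer(text):
        # Look for street-number blocks in the text before the city match
        # and keep the one closest to it.
        streets = list(R1.finditer(text[: tail.start()]))
        if streets:
            results.append((streets[-1].group().strip(), tail.group()))
    return results


sample = "Fire Station: 121 W Carrillo St Santa Barbara, CA 93101"
found = find_addresses(sample)
```

Because r1 accepts any run of whitespace-separated tokens after a number, free text such as "Station 1 is at 121 Main St" would be over-matched; that is presumably where the distance measures and decision rules come in.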

Decision rules for desired addresses, derived from training data and based on semantic information. [Figure labels: "Station + Num"; keyword "Station"; "fire station" in the web page title.]

Architecture of the Proposed Cyber Miner: The input is a set of seed web URLs and the output is the target addresses.

Search Results of Cyber Miner: Locations of all fire stations obtained by Cyber Miner from the address database.

Web-based geographic search engine for location-aware search in Singapore. Flora S. Tsai, School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore, 2010.

Geo search: The search engine can search for location-specific information in Singapore-based Web sites. Users can view their search locations on a satellite map instead of the two-dimensional maps currently used in street directories. The Web-based search engine can search for locations by area name, building name, group of landmark types, business name, and business category. Furthermore, users can supply their current coordinates as a parameter so that the search engine returns results in order of distance from the user’s current location.

Google Earth: Google Earth is used for the search.

Keyhole Markup Language Keyhole Markup Language (KML) is a file format used to display geographic data in an earth browser such as Google Earth, Google Maps and Google Maps for mobile.
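As an illustration of the format, the sketch below builds a minimal KML document containing one Placemark using only the Python standard library. The function name and the sample name and coordinates are hypothetical; the namespace is the standard KML 2.2 one, and KML lists coordinates as longitude, latitude, altitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def placemark_kml(name, lat, lon):
    """Return a KML document (as a string) with a single point Placemark."""
    ET.register_namespace("", KML_NS)   # emit KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML coordinate order is longitude,latitude,altitude.
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")


# Made-up sample point, used only to exercise the function.
kml_text = placemark_kml("Example landmark", 1.3, 103.8)
```

Saving `kml_text` to a file with a `.kml` extension should give something an earth browser can open.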

Street directory: Useful for mobile phones only; it is also a web map service that merges with Google Earth.

Global Positioning System: Google Earth allows tracks and waypoints to be downloaded from GPS devices and creates KML files for the downloaded waypoints and tracks.

Design

Design Cont. BusinessAreaAddress, where the address is stored without the postal code; BusinessAreaPostal, where the postal code is stored; Area, where the keywords of the area are stored, e.g. Causeway Point; General Area, where the General Area of the location is stored, e.g. Yishun.
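The four fields listed above could be held in a single table, sketched here with SQLite. Only the four column names come from the slide; the table name `business`, the extra `Name` column, and the sample row are assumptions added for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one made-up business entry."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE business ("
        " Name TEXT,"                  # hypothetical extra column
        " BusinessAreaAddress TEXT,"   # address without the postal code
        " BusinessAreaPostal TEXT,"    # postal code
        " Area TEXT,"                  # area keywords
        " GeneralArea TEXT)"           # general area of the location
    )
    # Made-up row; only the general area name appears in the slides.
    conn.execute(
        "INSERT INTO business VALUES (?, ?, ?, ?, ?)",
        ("Example Shop", "1 Example Road", "000000", "Example Mall", "Yishun"),
    )
    return conn


def search_general_area(conn, general_area):
    """Return (name, address, postal code) for entries in a general area."""
    rows = conn.execute(
        "SELECT Name, BusinessAreaAddress, BusinessAreaPostal"
        " FROM business WHERE GeneralArea = ?",
        (general_area,),
    )
    return rows.fetchall()
```

Storing the address and postal code in separate columns, as the slide describes, lets either one be searched independently.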

Algorithms: The Haversine formula is used here for faster processing. Consider two points on a sphere of radius R with latitudes $\phi_1$ and $\phi_2$, latitude separation $\Delta\phi = \phi_1 - \phi_2$, and longitude separation $\Delta\lambda$, where all angles are in radians. The distance d between the two points is related to their locations by the formula: $h = \operatorname{haversin}(\Delta\phi) + \cos(\phi_1)\cos(\phi_2)\operatorname{haversin}(\Delta\lambda)$ (1)

Algorithms Cont. Let h denote $\operatorname{haversin}(d/R)$ as given above. Then d can be solved for either by applying the inverse haversine (if available) or by using the arcsine (inverse sine) function: $d = R\,\operatorname{haversin}^{-1}(h) = 2R\arcsin(\sqrt{h})$ (2) This formula is only an approximation when applied to Earth, because Earth is not a perfect sphere: its radius R is smaller at the poles than at the equator. Owing to this slight ellipticity, the error is about 0.1%, depending on the location, assuming that the geometric mean of R is used. The output of the formula is the distance between two coordinates.
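Formulas (1) and (2) translate almost directly into code. The sketch below is one way to write them in Python, together with a helper that orders results by distance from the user. The Earth radius of 6371.0 km is a commonly used mean value chosen for this example, not the value from the slides, and the function names and sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; not taken from the slides


def haversin(theta):
    """haversin(theta) = sin^2(theta / 2), with theta in radians."""
    return math.sin(theta / 2) ** 2


def haversine_distance(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi1 - phi2
    d_lambda = math.radians(lon1 - lon2)
    # Formula (1)
    h = haversin(d_phi) + math.cos(phi1) * math.cos(phi2) * haversin(d_lambda)
    # Formula (2); min() guards against rounding slightly above 1
    return 2 * radius * math.asin(math.sqrt(min(1.0, h)))


def sort_by_distance(user, places):
    """Sort (name, lat, lon) tuples by distance from user = (lat, lon)."""
    return sorted(
        places,
        key=lambda p: haversine_distance(user[0], user[1], p[1], p[2]),
    )


# Made-up coordinates, used only to exercise the functions.
user_location = (1.30, 103.80)
places = [("Far shop", 1.40, 103.80), ("Near shop", 1.31, 103.80)]
ranked = sort_by_distance(user_location, places)
```

Sorting on the computed distance is what allows results to be returned in order of distance from the user's current location.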

Result: The database from which these results are taken contains 1652 entries in the following categories: Apparel, Bank, Cinema, Department Store, Duty Free Shop, Electronics, F&B (food and beverage), Fast Food, Food Court, Furniture, Health and Beauty, Minimart, Musical Instruments, Restaurant, Snack Bar, Sports, Stationery, Seafood, and Supermarket. The landmark types searched for are Buildings, Roads, MRT Stations, Schools, and Shopping Centres. A General Area searched under Advanced groups various roads into one big area, e.g. Tanjong Katong and Haig Road are both grouped under the Katong area.

[Figure: Simple search, showing input and output.]

Advanced search

Thank you for your patience!