Intro to Web Development


Lecture 13: Intro to Web Development

Web Development in Python

In the next few lectures, we’ll be discussing web development in Python. Python can be used to create a full-stack web application, or as a scripting language used in conjunction with other web technologies. Well-known websites that utilize Python include:
- Reddit (major codebase in Python)
- Dropbox (uses the Twisted networking engine)
- Instagram (uses Django)
- Google (Python-based crawlers)
- YouTube

Web Development in Python

In order to cover all of the aspects of front- and back-end web development, we will build and serve a website from scratch over the next few lectures. We will be covering:
- Crawling and Scraping
- Templating
- Frameworks
- Databases
- WSGI Servers

Getting Started

First, let’s introduce the website concept. We will be building a website that allows travelers to quickly look up information, statistics, and important advisories for countries they may be visiting. Our data source will be the CIA World Factbook, a rather large repository of information. Much of it is not useful for the average traveler, so we will pull out the relevant parts. However, the Factbook is updated constantly, so we need to be able to grab the most recent information. Before we can build our website, we need to understand how to grab the data. We will put together a Python module to grab a few pieces of data, which we can always extend later as we add more features to our website.

Crawling and Scraping

The process of gathering information from web pages is known as crawling and scraping.
- Crawling: the process of recursively finding links within a page and fetching the corresponding linked pages, given some set of root pages to start from.
- Scraping: the process of extracting information from a web page or document.

Note: crawling and scraping are a legal grey area. “Good” crawlers obey the robots.txt file accessible from the root of any website, which dictates the allowable actions that a web robot may take on that website. It’s not clear whether disobeying the directives of robots.txt is grounds for legal action.
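The recursive link-following that defines crawling can be sketched without any networking. This is a minimal illustration, not the lecture's code: the toy FAKE_WEB dictionary stands in for downloading a page and extracting its links.

```python
from collections import deque

# A toy "web": each page maps to the links it contains.
# In a real crawler, this lookup would be an HTTP fetch plus link extraction.
FAKE_WEB = {
    "root": ["a", "b"],
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

def crawl(roots, fetch_links):
    """Breadth-first crawl: visit each reachable page exactly once."""
    visited = set()
    queue = deque(roots)
    order = []
    while queue:
        page = queue.popleft()
        if page in visited:
            continue            # skip pages we've already fetched
        visited.add(page)
        order.append(page)
        queue.extend(fetch_links(page))
    return order

print(crawl(["root"], FAKE_WEB.get))  # ['root', 'a', 'b', 'c']
```

The visited set is what keeps a crawler from looping forever on pages that link to each other.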

Crawling and Scraping

Before we begin, since we’re interested in gathering information from the CIA World Factbook, we should check out https://www.cia.gov/robots.txt -- but it doesn’t exist! Let’s check out craigslist.org/robots.txt as an example:

User-agent: *          (applies to all robots)
Disallow: /reply       (directories that are prohibited)
Disallow: /fb/
Disallow: /suggest
Disallow: /flag
Disallow: /mf
Disallow: /eaf
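Python's standard library can interpret these rules for us. A minimal sketch using Python 3's urllib.robotparser (in Python 2 the module is named robotparser), with the Craigslist rules above supplied as text so no network access is needed:

```python
from urllib.robotparser import RobotFileParser

# The Craigslist rules from above, as a literal string instead of a fetch.
robots_txt = """\
User-agent: *
Disallow: /reply
Disallow: /fb/
Disallow: /suggest
Disallow: /flag
Disallow: /mf
Disallow: /eaf
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler checks each URL before fetching it.
print(parser.can_fetch("*", "https://craigslist.org/reply"))   # False (disallowed)
print(parser.can_fetch("*", "https://craigslist.org/search"))  # True (allowed)
```

Normally you would call set_url("https://craigslist.org/robots.txt") and read() to fetch the rules directly.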

Scraping

Let’s talk about the objectives of our scraper.
- Input: a country name (e.g. “Norway”) or a keyword (e.g. “Population”).
- Returns: a dictionary. If the input was a country name, the dictionary contains relevant information about that country. If the input was a keyword, the dictionary contains results for every country.

To start, we will support every country in the CIA World Factbook, and we will support the keywords Area, Population, Capital, Languages, and Currency. While we’re not technically doing any web development today, we’re creating a script which we can call from our website when a user requests information.

Scraping

Let’s write some basic code to get started. We’ll be calling get_info() from our website, but we include some code for initializing search_term when the module is run by itself. The very first thing we need to do is access the main page and figure out whether we’re looking for country or keyword data.

def get_info(search_term):
    pass

if __name__ == "__main__":
    search_term = raw_input("Please enter a country or keyword: ")
    results = get_info(search_term)

Scraping

Take a look at the main page of the Factbook and take a minute to tour the entries of a handful of countries: https://www.cia.gov/library/publications/the-world-factbook/

Now view the source of these pages. Notice that the country names are available from a drop-down list, which links to each country’s corresponding page.

Scraping

If the search term provided is found in this drop-down list, we only need to navigate to its corresponding page and pull the information on that country. If the search term is one of the approved keywords, we need to navigate to every one of these pages and pull the corresponding information. Note that the approved keywords must be mapped to the actual keywords used on the website. Otherwise, we have to report an error.

keywords = {"Area": "Area", "Population": "Population", "Capital": "Capital",
            "Languages": "Languages", "Currency": "Exchange rates"}

Scraping

requests.get(url) returns a Response object with all of the important info about the page, including its source.

factbook_scraper.py:

from __future__ import print_function
import requests

url = "https://www.cia.gov/library/publications/the-world-factbook/"
keywords = {"Area": "Area", "Population": "Population", "Capital": "Capital",
            "Languages": "Languages", "Currency": "Exchange rates"}

def get_info(search_term):
    main_page = requests.get(url)
    if search_term in keywords:
        return get_keyword(search_term, main_page)
    else:
        # Look for country name on main page!
        pass

Scraping

To search the country names, we’ll need to know a bit about the source. After viewing the source of the page, we find the following:

<div id="cntrySelect">
  <form action="#" method="GET">
    <select name="selecter_links" class="selecter_links">
      <option value="">Please select a country to view</option>
      <option value="geos/xx.html"> World </option>
      <option value="geos/af.html"> Afghanistan </option>
      <option value="geos/ax.html"> Akrotiri </option>
      <option value="geos/al.html"> Albania </option>
      <option value="geos/ag.html"> Algeria </option>
      …

Scraping

The structure of a link to a country page is very systematic. We can take advantage of this to pull out the information we need.

<div id="cntrySelect">
  <form action="#" method="GET">
    <select name="selecter_links" class="selecter_links">
      <option value="">Please select a country to view</option>
      <option value="geos/xx.html"> World </option>
      <option value="geos/af.html"> Afghanistan </option>
      <option value="geos/ax.html"> Akrotiri </option>
      <option value="geos/al.html"> Albania </option>
      <option value="geos/ag.html"> Algeria </option>
      …
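That systematic structure is exactly what a regular expression can exploit. A self-contained sketch, run against a copy of the drop-down markup rather than the live page:

```python
import re

# An excerpt of the drop-down markup from the Factbook's main page.
html = """
<option value="">Please select a country to view</option>
<option value="geos/xx.html"> World </option>
<option value="geos/af.html"> Afghanistan </option>
<option value="geos/al.html"> Albania </option>
"""

# Capture (relative link, country name) pairs. [^"]+ requires at least one
# character, so the empty-valued "Please select" entry is skipped.
pairs = re.findall(r'<option value="([^"]+)"> (.*?) </option>', html)
print(pairs)
# [('geos/xx.html', 'World'), ('geos/af.html', 'Afghanistan'), ('geos/al.html', 'Albania')]
```

Each captured link is relative, so it gets appended to the base URL before fetching.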

Scraping

We will use regular expressions to pull out the link associated with the search term, but only if it can be found as a country name.

from __future__ import print_function
import requests, re, sys

def get_info(search_term):
    main_page = requests.get(url)
    if search_term in keywords:
        return get_keyword(search_term, main_page)
    else:
        link = re.search(r'<option value="([^"]+)"> ' + search_term.capitalize()
                         + r' </option>', main_page.text)
        if link:
            # Country link found!
            pass
        else:
            return None

Scraping

Now, we build the URL of the country’s page, if it exists, and request that URL to be passed into our get_country function.

from __future__ import print_function
import requests, re, sys

def get_info(search_term):
    main_page = requests.get(url)
    if search_term in keywords:
        return get_keyword(search_term, main_page)
    else:
        link = re.search(r'<option value="([^"]+)"> ' + search_term.capitalize()
                         + r' </option>', main_page.text)
        if link:
            country_page = requests.get(url + link.group(1))
            return get_country(country_page)
        else:
            return None

Scraping

Now, how does the data appear on a country’s page? Essentially, the keyword appears within a div tag with a class like “category sas_light”, while the data associated with the keyword appears in the subsequent div tag with class “category_data”. This is a little inconvenient because the data is not nested within some identifying header or attribute; it simply appears after the corresponding keyword identifier.

<div id='field' class='category sas_light' style='padding-left:5px;'>
  <a href='../docs/notesanddefs.html?fieldkey=2119&term=Population'>Population:</a>
  <a href='../fields/2119.html#af'><img src='../graphics/field_listing_on.gif'></a>
</div>
<div class=category_data>32,564,342 (July 2015 est.)</div>

Scraping

Now, how does the data appear on a country’s page? We can try to use regular expressions here to describe the data pattern.

<div id='field' class='category sas_light' style='padding-left:5px;'>
  <a href='../docs/notesanddefs.html?fieldkey=2119&term=Population'>Population:</a>
  <a href='../fields/2119.html#af'><img src='../graphics/field_listing_on.gif'></a>
</div>
<div class=category_data>32,564,342 (July 2015 est.)</div>

keyword + r":.*?<div class=category_data>(.+?)</div>"
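Applying that pattern to the markup above (copied into a string here, so the sketch runs without fetching the page):

```python
import re

# The country-page excerpt from the slide, as a literal string.
page = """
<div id='field' class='category sas_light' style='padding-left:5px;'>
<a href='../docs/notesanddefs.html?fieldkey=2119&term=Population'>Population:</a>
<a href='../fields/2119.html#af'><img src='../graphics/field_listing_on.gif'></a>
</div>
<div class=category_data>32,564,342 (July 2015 est.)</div>
"""

keyword = "Population"
# re.S (DOTALL) lets .*? span the newlines between "Population:" and the data div.
match = re.search(keyword + r":.*?<div class=category_data>(.+?)</div>", page, re.S)
print(match.group(1))  # 32,564,342 (July 2015 est.)
```

Without re.S the pattern would fail here, since . does not match newlines by default.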

Scraping

Now, unfortunately, not all of the data appears in a uniform fashion. We’ll have to uniquely access some of it based on what we’re looking for.

def get_country(country_page):
    results = []
    for keyword in keywords.values():
        data = get_country_by_keyword(country_page, keyword)
        if data:
            results.append((keyword, data))
    return results

Scraping

Take a look at factbook_scraper.py to see what we really have to do to grab this data. It’s a bit more complicated!

def get_country_by_keyword(country_page, keyword):
    data = re.search(keyword + r":.*?<div class=category_data>(.+?)</div>",
                     country_page.text, re.S)
    if data:
        return data.group(1)
    return None

Scraping

We’ve taken care of scraping data for a single country. Now what about scraping all of the data for a specific keyword?

def get_keyword(search_term, main_page):
    results = []
    keyword = keywords[search_term]
    link_pairs = re.findall(r'<option value="([^"]+)"> (.*?) </option>', main_page.text)
    for link, country in link_pairs:
        country_page = requests.get(url + link)
        data = get_country_by_keyword(country_page, keyword)
        results.append((country, data))
    return results

Scraping the Factbook

The entire contents of factbook_scraper.py are posted next to the lecture slides. Let’s give it a try! Now we have a script which takes in a single search term and returns some information about either a single country or all of the countries in the world. Next lecture, we’ll start building a website around this script.

Crawling and Scraping

Note that while we used regular expressions and requests in this example, there are a number of ways to perform crawling and scraping, since it is such a common use of Python. Some of the modules below are more powerful than the tools we learned this lecture, but have a steeper learning curve.
- Scrapy
- BeautifulSoup
- RoboBrowser
- lxml
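To give a taste of why real HTML parsers are more robust than regular expressions, here is the same category_data extraction done with Python 3's standard-library html.parser. This is an illustrative sketch only (the third-party libraries above offer much more convenient APIs), and it assumes flat, non-nested div markup:

```python
from html.parser import HTMLParser

class CategoryDataParser(HTMLParser):
    """Collect the text inside every <div class=category_data> (flat markup assumed)."""
    def __init__(self):
        super().__init__()
        self.in_data = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # html.parser handles unquoted attribute values like class=category_data.
        if tag == "div" and dict(attrs).get("class") == "category_data":
            self.in_data = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_data = False

    def handle_data(self, data):
        if self.in_data:
            self.values[-1] += data

parser = CategoryDataParser()
parser.feed("<div class=category_data>32,564,342 (July 2015 est.)</div>"
            "<div class=other>ignored</div>")
print(parser.values)  # ['32,564,342 (July 2015 est.)']
```

Unlike the regex approach, a parser is not thrown off by attribute ordering, extra whitespace, or quoting differences in the markup.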