CSCE 590 Web Scraping – Information Extraction
HW
Topics
- Yahoo sign-in
- Information Retrieval
March 16, 2017
Yahoo – two-step login
- Create a Yahoo email account with:
  - your USC login as the account name (yourLogin@yahoo.com)
  - your login as the password, so I can test your ability to log in
- Write a utility function “dump_html(page, tag)” that uses the BeautifulSoup prettify function to format “page” and writes the result to the file “output_” + tag
- Use Selenium to log in to your Yahoo account and dump the page
- Use Scrapy to scrape the table (see next slide) from http://finance.yahoo.com/quote/FB?p=FB and export it to CSV
- Use Scrapy to extract the same information for FB, XOM, STX, NFLX, and AMZN (start_requests builds the URLs from the company symbols and yields them)
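A minimal sketch of the dump_html utility described above. It assumes “page” is an HTML string and picks the built-in "html.parser" backend; both are assumptions, as is returning the file name, which the assignment does not ask for.

```python
from bs4 import BeautifulSoup


def dump_html(page, tag):
    """Prettify `page` (assumed to be an HTML string) and write it to 'output_' + tag."""
    # "html.parser" ships with Python, so no extra parser install is needed.
    soup = BeautifulSoup(page, "html.parser")
    filename = "output_" + tag
    with open(filename, "w", encoding="utf-8") as f:
        f.write(soup.prettify())
    # Returning the file name is a convenience added for this sketch.
    return filename
```

With Selenium, the string passed as “page” would typically be driver.page_source, read after the login step has completed.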
Facebook information from http://finance.yahoo.com/quote/FB?p=FB
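The quote URL above follows a simple pattern per ticker symbol, so the URLs for the other companies can be generated. A rough sketch, assuming the same pattern holds for every symbol; quote_urls, BASE_URL, and SYMBOLS are hypothetical names chosen for illustration.

```python
from urllib.parse import urlencode

# Assumed pattern, taken from the FB example URL.
BASE_URL = "http://finance.yahoo.com/quote/"
SYMBOLS = ["FB", "XOM", "STX", "NFLX", "AMZN"]


def quote_urls(symbols):
    """Yield one quote-page URL per ticker symbol, following the FB pattern."""
    for symbol in symbols:
        # The symbol appears twice: in the path and as the "p" query parameter.
        yield BASE_URL + symbol + "?" + urlencode({"p": symbol})
```

In a Scrapy spider, start_requests could loop over quote_urls(SYMBOLS) and yield scrapy.Request(url, callback=self.parse) for each URL.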
IR project 1: Scrapy - Where is Coach K?
- Subject: Duke coach Mike Krzyzewski
- By hand, use Google to find three starting URLs
- Open the pages (parse) and verify that “Krzyzewski” appears on the page
- Find the date/year
- Find the location
- Save in a CSV table with the URL
IR project 2
- Automate the first step, i.e., have start_requests call Google, using semantic comparison to page = … to rank the top three
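The verify, find-year, and save steps of project 1 might be sketched as follows. This assumes the page text has already been extracted as a plain string, treats any four-digit number starting with 19 or 20 as a year, and leaves the location blank because finding it needs a separate approach. extract_row and rows_to_csv are hypothetical helper names.

```python
import csv
import io
import re

# Crude assumption: a "year" is any standalone 4-digit number in 1900-2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_row(url, text, name="Krzyzewski"):
    """Return a CSV-ready dict for the page, or None if `name` is not on it."""
    if name not in text:
        return None
    years = sorted(set(YEAR_RE.findall(text)))
    # Location is left empty in this sketch; it is not extracted here.
    return {"url": url, "years": " ".join(years), "location": ""}


def rows_to_csv(rows):
    """Serialize the row dicts to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["url", "years", "location"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Pages that return None fail the “Krzyzewski” check and can simply be skipped before writing the table.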