CPSC 8985 Fall 2015
P10: Web Crawler
Mike Schmidt
Overview
A web crawler is a script or piece of code that goes out onto the Internet and pulls information or data. A web crawler can be trained to look only for certain information. This data can then be saved into a database, a process known as web scraping. Analysis can be performed on the stored data to show trends or similarities between data sets.
Architecture
Java – the language in which the business objects and data access objects are written
Jsoup – the Java library the application uses to pull HTML elements from web pages
MongoDB – a NoSQL database the application uses to store information collected from the web
MongoDB
The name Mongo comes from the word "humongous," as MongoDB provides a solution for storing massive amounts of data. MongoDB is a NoSQL database that stores information in a JSON-like format, using document objects. Mongo databases can be spread across multiple servers, which makes them a good fit for large amounts of data that need to be accessed in a timely manner.
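The sketch below shows how a JSON-like document can be built and inserted with the synchronous MongoDB Java driver. The connection string, database name ("crawler"), collection name ("weather"), and field values are all hypothetical, and the driver API shown here may be newer than the one the original project used.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class MongoInsertExample {
    public static void main(String[] args) {
        // Connect to a local MongoDB instance on the default port.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("crawler");                 // hypothetical database name
            MongoCollection<Document> weather = db.getCollection("weather");  // hypothetical collection name

            // MongoDB stores JSON-like documents; the Document class builds one field by field.
            Document reading = new Document("city", "St. Louis")
                    .append("tempF", 72)
                    .append("conditions", "Partly Cloudy");
            weather.insertOne(reading);
        }
    }
}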
Jsoup
The Jsoup Java library is used to parse web pages into elements using HTML tags and attributes. Jsoup breaks pages down using CSS- and jQuery-like selector methods. Scraped Jsoup elements can easily be added to a document object, which is then sent to the MongoDB server.
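A minimal sketch of that parsing step is shown below, assuming the org.jsoup:jsoup dependency. The URL and CSS selectors are placeholders; the real selectors depend on the markup of the page being scraped.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Fetch the page and parse it into a DOM-like Document.
        Document page = Jsoup.connect("https://example.com/scores").get();

        // Select elements with a CSS-style selector, much like jQuery.
        Elements rows = page.select("table.scores tr");
        for (Element row : rows) {
            String home = row.select("td.home").text();
            String away = row.select("td.away").text();
            System.out.println(home + " vs " + away);
        }
    }
}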
Scraped Data
This application scrapes data live from the Internet (weather, sports scores, and movie listings). The collected data is stored in a Mongo database, where analysis can be performed. Scraping data allows users to pull information from multiple sources and aggregate it into one central location.
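Once the scraped documents are in MongoDB, they can be queried for simple analysis. The sketch below reuses the hypothetical "crawler" database and "weather" collection from the earlier insert example; the filter field and threshold are illustrative only.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.gt;

public class MongoQueryExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> weather =
                    client.getDatabase("crawler").getCollection("weather");  // hypothetical names

            // Find all stored readings warmer than 70°F and print them.
            for (Document doc : weather.find(gt("tempF", 70))) {
                System.out.println(doc.getString("city") + ": " + doc.get("tempF"));
            }
        }
    }
}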
Live Demo of Application