WANG Weiwei, Department of Automation, Xiamen University
The Design and Implementation of a Crawler Framework Based on Scrapy
2016Fall.01
Group of Intelligent Analysis & Recommendation System, Xiamen University
September 19, 2016
CONTENTS
1. Which sites can be crawled
2. The Framework of a Crawler
3. Data processing and application
4. Open Source Code
5. Our Code
6. Distributed Crawls
7. Avoiding getting banned
8. Papers and Research
PART ONE: Which sites can be crawled
1. Which sites can be crawled
All kinds of sites can be crawled. The real question is which sites are worth crawling…
PART TWO: The Framework of a Crawler
2. The Framework of a Crawler
Scrapy: A Fast and Powerful Scraping and Web Crawling Framework
PART THREE: Data processing and application
3. Data processing and application
Content and text analysis, industry analysis, social media monitoring
- News websites
- Shopping sites
- Social networks, like: Weibo, public WeChat accounts, Facebook, Twitter…
PART FOUR: Open Source Code
4. Open Source Code
Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
4. Open Source Code
- WeChat public account crawler
- Douban Books crawler
- Zhihu crawler
- Bilibili user crawler
- Sina Weibo crawler
- Distributed novel-download crawler
- CNKI (China National Knowledge Infrastructure) crawler
- Lianjia crawler
- JD.com crawler
- QQ group crawler
- WooYun crawler
PART FIVE: Our Code
5. Our Code
- Based on Scrapy
- Encapsulation
- Provides an API
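The "Encapsulation / Provides an API" idea can be sketched as a single entry-point function that hides Scrapy's machinery behind one call. `run_spider` and its parameters are hypothetical names for illustration only, not the group's actual API:

```python
def run_spider(spider_cls, settings=None):
    """Hypothetical facade: run one Scrapy spider class to completion.

    Wraps scrapy.crawler.CrawlerProcess so callers never touch Scrapy
    directly; `settings` is an optional dict of Scrapy settings.
    """
    # Import inside the function so this facade module can still be
    # loaded in environments where Scrapy is not installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings or {})
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl finishes
```

Callers then need only `run_spider(MySpider, {"DOWNLOAD_DELAY": 2})`, which is the kind of narrow surface an encapsulating API aims for.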
5. Our Code
WORKFLOW
5. Our Code
What to do next on our framework?
- JavaScript rendering
- Simulated user login
- Cookie handling
- Proxy servers
- Redis
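Several of these to-do items map onto Scrapy configuration; a hedged settings.py fragment (the values are illustrative, and the slide does not specify how the group will implement each item) might look like:

```python
# settings.py fragment (illustrative values)

# Cookie handling: Scrapy keeps per-session cookies when enabled
COOKIES_ENABLED = True

# Proxy servers: Scrapy ships an HttpProxyMiddleware that honours
# request.meta["proxy"], set from a spider or a custom middleware
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}

# Simulated user login usually starts from a FormRequest in the spider;
# JavaScript-heavy pages need an external renderer (e.g. a headless
# browser). Neither is a plain setting, so they are only noted here.
```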
PART SIX: Distributed Crawls
6. Distributed Crawls
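One common way to make Scrapy crawls distributed (an assumption here, since the slide does not name a tool) is the third-party scrapy-redis extension, which moves the request queue and duplicate filter into a shared Redis instance so several workers can cooperate on one crawl:

```python
# settings.py fragment for scrapy-redis (illustrative; requires the
# third-party scrapy-redis package and a reachable Redis server)

# Store the scheduling queue in Redis so all workers share it
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate request fingerprints across workers via Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so a crawl can be paused and resumed
SCHEDULER_PERSIST = True

# Location of the shared Redis instance (placeholder host/port)
REDIS_URL = "redis://localhost:6379"
```

With this in place, starting the same spider on several machines pointed at the same Redis URL spreads the requests across them.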
PART SEVEN: Avoiding getting banned
7. Avoiding getting banned
- Rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them).
- Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour.
- Use download delays (2 or higher); see the DOWNLOAD_DELAY setting.
- If possible, use Google cache to fetch pages instead of hitting the sites directly.
- Use a pool of rotating IPs, for example the free Tor project or paid services like ProxyMesh.
- Use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera.
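The first tip, rotating user agents, can be sketched without any framework: keep a pool of common browser strings and pick one per request. The strings below are illustrative examples; in Scrapy this logic would live in a downloader middleware that sets the User-Agent header on each outgoing request:

```python
import random

# A small pool of well-known browser user-agent strings (illustrative)
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/45.0",
]


def pick_user_agent(pool=USER_AGENT_POOL):
    """Return a random user agent from the pool for the next request."""
    return random.choice(pool)
```

A real deployment would use a much larger pool so that no single string dominates the request stream.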
PART EIGHT: Papers and Research
8. Papers and Research
- Crawler technology
- Data mining
Thanks for Listening
Q & A