The Design and Implementation of a Crawler Framework Based on Scrapy
WANG Weiwei, Department of Automation, Xiamen University
Group of Intelligent Analysis & Recommendation System, Xiamen University
2016Fall.01, September 19, 2016
CONTENT
01 Which sites can be crawled
02 The Framework of Crawler
03 Data processing and application
04 Open Source Code
05 Our Code
06 Distributed Crawls
07 Avoiding getting banned
08 Papers and Research
01 Which sites can be crawled PART ONE
1. Which sites can be crawled
All kinds of sites can be crawled; the real question is which sites are worth crawling.
02 The Framework of Crawler PART TWO
2. The Framework of Crawler
Scrapy (https://scrapy.org/): a fast and powerful scraping and web crawling framework.
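To make the framework concrete, here is a minimal Scrapy spider sketch. The spider name, the seed URL, and the CSS selectors are illustrative assumptions, not part of the presented framework; only the scrapy.Spider API itself comes from Scrapy.

import scrapy

class NewsTitleSpider(scrapy.Spider):
    # Hypothetical example spider: collects link texts and targets from one page.
    name = "news_titles"                        # unique name used by "scrapy crawl news_titles"
    start_urls = ["http://news.sina.com.cn/"]   # illustrative seed page

    def parse(self, response):
        # Yield one structured item per link on the page.
        for link in response.css("a"):
            yield {
                "title": link.css("::text").get(),
                "url": link.css("::attr(href)").get(),
            }

Saved as news_titles.py, this can be run without a full Scrapy project via "scrapy runspider news_titles.py -o titles.json", which writes the extracted items as JSON.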
03 Data processing and application PART THREE
3. Data processing and application
Applications: content and text analysis, industry analysis, social media monitoring.
News websites, e.g. http://news.sina.com.cn/, http://news.163.com/, http://news.qq.com/ ...
Shopping sites, e.g. http://www.jd.com/, https://www.taobao.com/, http://www.yhd.com/ ...
Social networks, e.g. Weibo, WeChat public accounts, Facebook, Twitter ...
04 Open Source Code PART FOUR
4. Open Source Code Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
4. Open Source Code
WeChat public account crawler: https://github.com/hexcola/wcspider
Douban Books crawler: https://github.com/lanbing510/DouBanSpider
Zhihu crawler: https://github.com/LiuRoy/zhihu_spider
Bilibili user crawler: https://github.com/airingursb/bilibili-user
Sina Weibo crawler: https://github.com/LiuXingMing/SinaSpider
Distributed novel-download crawler: https://github.com/gnemoug/distribute_crawler
CNKI crawler: https://github.com/yanzhou/CnkiSpider
Lianjia crawler: https://github.com/lanbing510/LianJiaSpider
JD.com crawler: https://github.com/taizilongxu/scrapy_jingdong
QQ Groups crawler: https://github.com/caspartse/QQ-Groups-Spider
WooYun crawler: https://github.com/hanc00l/wooyun_public
05 Our Code PART FIVE
5. Our Code
- Based on Scrapy
- Encapsulation
- Provides an API
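As a rough illustration of the "encapsulation / provide API" idea, a Scrapy-based framework can hide crawl setup behind a single function. The function run_spider and its parameters are hypothetical, not the actual API of the presented framework; CrawlerProcess is standard Scrapy API.

from scrapy.crawler import CrawlerProcess

def run_spider(spider_cls, settings=None, **spider_kwargs):
    # Hypothetical wrapper API: start one spider with the given settings
    # and keyword arguments, and block until the crawl finishes.
    process = CrawlerProcess(settings or {})
    process.crawl(spider_cls, **spider_kwargs)
    process.start()

# Example usage (NewsTitleSpider is the sketch shown earlier):
# run_spider(NewsTitleSpider, settings={"DOWNLOAD_DELAY": 2})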
5. Our Code: Workflow
5. Our Code: What to do next on our framework?
- JavaScript
- Simulated user login
- Cookies
- Proxy server
- Redis
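For "simulated user login", Scrapy's FormRequest.from_response is a common building block. Everything site-specific in the sketch below (URLs, form field names, the success check) is a hypothetical illustration, not part of the presented framework.

import scrapy

class LoginSpider(scrapy.Spider):
    name = "login_demo"
    start_urls = ["https://example.com/login"]   # hypothetical login page

    def parse(self, response):
        # Fill in the login form found on the page and submit it;
        # Scrapy keeps the resulting session cookies for later requests.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Crude, site-specific success check before crawling protected pages.
        if "logout" in response.text.lower():
            yield scrapy.Request("https://example.com/protected",
                                 callback=self.parse_protected)

    def parse_protected(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}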
06 Distributed Crawls PART SIX
6. Distributed Crawls
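One simple way to distribute a crawl, along the lines suggested in Scrapy's documentation on distributed crawls, is to partition the URLs so that each worker fetches only its own share; shared-queue approaches such as scrapy-redis build on the Redis item mentioned on the previous slide. The worker constants and seed list below are illustrative assumptions, not the scheme used in the presentation.

import hashlib

NUM_WORKERS = 4      # hypothetical number of crawler machines
WORKER_ID = 0        # 0 .. NUM_WORKERS-1, different on each machine

def belongs_to_this_worker(url: str) -> bool:
    # Hash the URL and keep it only if it maps to this worker's slot,
    # so every URL is fetched by exactly one machine.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_WORKERS == WORKER_ID

seed_urls = ["http://news.sina.com.cn/", "http://news.163.com/", "http://news.qq.com/"]
start_urls = [u for u in seed_urls if belongs_to_this_worker(u)]   # feed these to the spider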
07 Avoiding getting banned PART SEVEN
7. Avoiding getting banned
- Rotate your user agent from a pool of well-known browser user agents (search around to get a list of them).
- Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour.
- Use download delays (2 or higher); see the DOWNLOAD_DELAY setting.
- If possible, use the Google cache to fetch pages instead of hitting the sites directly.
- Use a pool of rotating IPs, for example the free Tor project or paid services like ProxyMesh.
- Use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera.
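Below is a sketch of how several of these tips map onto Scrapy configuration. The user-agent strings, the "myproject" module path, and the RotateUserAgentMiddleware class are illustrative assumptions; COOKIES_ENABLED, DOWNLOAD_DELAY, and DOWNLOADER_MIDDLEWARES are standard Scrapy settings.

import random

USER_AGENTS = [
    # Hypothetical pool; in practice use a list of real, current browser user agents.
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
]

class RotateUserAgentMiddleware:
    # Downloader middleware that assigns a random user agent to every request.
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)

# settings.py
# COOKIES_ENABLED = False   # some sites use cookies to spot bot behaviour
# DOWNLOAD_DELAY = 2        # seconds to wait between requests
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RotateUserAgentMiddleware": 400,   # "myproject" is hypothetical
# }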
08 Papers and Research PART EIGHT
8. Papers and Research
- Crawler Technology
- Data Mining
Thanks for Listening
Q & A