WANG Weiwei, Department of Automation, Xiamen University. The Design and Implementation of a Crawler Framework Based on Scrapy (基于 Scrapy 的爬虫框架设计与实现)


1 WANG Weiwei, Department of Automation, Xiamen University. The Design and Implementation of a Crawler Framework Based on Scrapy. 2016 Fall.01, Group of Intelligent Analysis & Recommendation System, Xiamen University. September 19, 2016

2 CONTENT
01 Which sites can be crawled
02 The Framework of Crawler
03 Data processing and application
04 Open Source Code

3 CONTENT
05 Our Code
06 Distributed Crawls
07 Avoiding getting banned
08 Papers and Research

4 01 Which sites can be crawled PART ONE

5 1. Which sites can be crawled: all kinds of sites. But which sites are worth crawling for us?

6 02 The Framework of Crawler PART TWO

7 2. The Framework of Crawler: Scrapy (https://scrapy.org/), a fast and powerful scraping and web crawling framework

8 03 Data processing and application PART THREE

9 3. Data processing and application: Content and Text Analysis, Industry Analysis, Social Media Monitoring
News websites, like: http://news.sina.com.cn/, http://news.163.com/, http://news.qq.com/ ……
Shopping sites, like: http://www.jd.com/, https://www.taobao.com/, http://www.yhd.com/ ……
Social networks, like: Weibo, public WeChat accounts, Facebook, Twitter ……
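As a sketch of the "Content and Text Analysis" application above: once pages have been crawled, even a simple word-frequency pass over the collected items is useful. The `"title"` field and the sample data below are hypothetical, standing in for whatever a crawler pipeline actually stores.

```python
from collections import Counter
import re

def top_keywords(items, n=3):
    """Count the most frequent words across crawled items.

    `items` is assumed to be a list of dicts with a "title" field,
    as a crawler pipeline might collect them (field name is hypothetical).
    """
    words = []
    for item in items:
        words.extend(re.findall(r"\w+", item["title"].lower()))
    return Counter(words).most_common(n)

# Hypothetical crawled news items, for illustration only.
news = [
    {"title": "Stock market rises"},
    {"title": "Market report: tech stocks"},
    {"title": "Weather update"},
]
print(top_keywords(news, 2))
```

In practice this step would run downstream of the crawler, e.g. in a Scrapy item pipeline or over an exported JSON lines file.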

10 04 Open Source Code PART FOUR

11 4. Open Source Code Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

12 4. Open Source Code
- WeChat public account crawler: https://github.com/hexcola/wcspider
- Douban Books crawler: https://github.com/lanbing510/DouBanSpider
- Zhihu crawler: https://github.com/LiuRoy/zhihu_spider
- Bilibili user crawler: https://github.com/airingursb/bilibili-user
- Sina Weibo crawler: https://github.com/LiuXingMing/SinaSpider
- Distributed novel-download crawler: https://github.com/gnemoug/distribute_crawler
- CNKI crawler: https://github.com/yanzhou/CnkiSpider
- Lianjia crawler: https://github.com/lanbing510/LianJiaSpider
- JD.com crawler: https://github.com/taizilongxu/scrapy_jingdong
- QQ Groups crawler: https://github.com/caspartse/QQ-Groups-Spider
- WooYun crawler: https://github.com/hanc00l/wooyun_public

13 05 Our Code PART FIVE

14 5. Our Code
- Based on Scrapy
- Encapsulation
- Provides an API
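A hypothetical sketch of what such an encapsulated API might look like. The class and method names here are invented for illustration and do not reflect the group's actual code; a real implementation would hand the collected jobs and settings to `scrapy.crawler.CrawlerProcess`.

```python
class CrawlerAPI:
    """Hypothetical facade that hides Scrapy behind a small API."""

    def __init__(self, download_delay=2.0):
        # Settings that would be passed to scrapy.crawler.CrawlerProcess.
        self.settings = {"DOWNLOAD_DELAY": download_delay}
        self.jobs = []

    def add_job(self, start_url):
        """Queue a start URL for crawling."""
        self.jobs.append(start_url)
        return self

    def run(self):
        # In a real wrapper this would build spiders from self.jobs and
        # call CrawlerProcess(self.settings).crawl(...).start().
        return {"jobs": list(self.jobs), "settings": dict(self.settings)}

api = CrawlerAPI(download_delay=1.0)
api.add_job("https://news.sina.com.cn/")
print(api.run())
```

The point of the facade is that callers configure and launch crawls without touching Scrapy's spider classes or settings machinery directly.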

15 5. Our Code: Workflow

16 5. Our Code: What to do next in our framework?
- JavaScript rendering
- Simulated user login
- Cookie handling
- Proxy servers
- Redis

17 06 Distributed Crawls PART SIX

18 6. Distributed Crawls
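Distributed crawling typically means workers on several machines sharing one URL frontier and one de-duplication set, commonly stored in Redis (this is the approach of the scrapy-redis project). The in-memory class below stands in for the Redis structures so the sketch is self-contained; a real deployment would use a Redis list (LPUSH/RPOP) and a Redis set (SADD) instead.

```python
from collections import deque

class SharedFrontier:
    """In-memory stand-in for a Redis-backed URL queue with de-duplication.

    In a real distributed setup `queue` would be a Redis list shared by
    all workers and `seen` a Redis set used to filter duplicate URLs.
    """

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def push(self, url):
        # Only enqueue URLs never seen before (like SADD returning 1).
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def pop(self):
        # Each worker process would call this to claim the next URL.
        return self.queue.popleft() if self.queue else None

frontier = SharedFrontier()
for url in ["http://a.com/", "http://b.com/", "http://a.com/"]:
    frontier.push(url)
print(len(frontier.queue))  # 2: the duplicate was filtered out
```

Because both the queue and the seen-set live in one shared store, any number of crawler workers can pull from the frontier without fetching the same page twice.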

19 07 Avoiding getting banned PART SEVEN

20 7. Avoiding getting banned
- Rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them).
- Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour.
- Use download delays (2 or higher); see the DOWNLOAD_DELAY setting.
- If possible, use Google cache to fetch pages, instead of hitting the sites directly.
- Use a pool of rotating IPs, for example the free Tor project or paid services like ProxyMesh.
- Use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages; one example of such downloaders is Crawlera.
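The first three tips map directly onto Scrapy settings. A minimal sketch, with a small illustrative (not exhaustive or current) user-agent pool:

```python
import random

# Small illustrative pool; in practice keep a longer, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
]

def anti_ban_settings():
    """Scrapy settings implementing the first three tips above."""
    return {
        "USER_AGENT": random.choice(USER_AGENTS),  # rotate user agents
        "COOKIES_ENABLED": False,                  # disable cookies
        "DOWNLOAD_DELAY": 2,                       # wait between requests
    }

print(anti_ban_settings())
```

Note that picking one user agent per run, as here, is only a crude form of rotation; per-request rotation is usually done in a downloader middleware that rewrites the `User-Agent` header in `process_request`.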

21 08 Papers and Research PART EIGHT

22 8. Papers and Research
- Crawler Technology
- Data Mining

23 Thanks for Listening. Q & A

