The Design and Implementation of a Crawler Framework Based on Scrapy
WANG Weiwei, Department of Automation, Xiamen University
Group of Intelligent Analysis & Recommendation System, Xiamen University
2016Fall.01, September 19, 2016
CONTENT
01 Which sites can be crawled
02 The Framework of Crawler
03 Data processing and application
04 Open Source Code
05 Our Code
06 Distributed Crawls
07 Avoiding getting banned
08 Papers and Research
01 Which sites can be crawled PART ONE
1. Which sites can be crawled
All kinds of sites can be crawled; the real question is which sites are worth crawling.
02 The Framework of Crawler PART TWO
2. The Framework of Crawler
Scrapy (https://scrapy.org/): a fast and powerful scraping and web crawling framework.
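To make the framework concrete, here is a minimal Scrapy spider sketch. The spider name, the seed URL, and the CSS selectors are illustrative assumptions, not part of the presented framework; only the scrapy.Spider API itself comes from Scrapy.

import scrapy

class NewsTitleSpider(scrapy.Spider):
    # Hypothetical example spider: collects link texts and targets from one page.
    name = "news_titles"                        # unique name used by "scrapy crawl news_titles"
    start_urls = ["http://news.sina.com.cn/"]   # illustrative seed page

    def parse(self, response):
        # Yield one structured item per link on the page.
        for link in response.css("a"):
            yield {
                "title": link.css("::text").get(),
                "url": link.css("::attr(href)").get(),
            }

Saved as news_titles.py, this can be run without a full Scrapy project via "scrapy runspider news_titles.py -o titles.json", which writes the extracted items as JSON.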
03 Data processing and application PART THREE
3. Data processing and application
Applications: content and text analysis, industry analysis, social media monitoring.
News websites, e.g. http://news.sina.com.cn/, http://news.163.com/, http://news.qq.com/ ...
Shopping sites, e.g. http://www.jd.com/, https://www.taobao.com/, http://www.yhd.com/ ...
Social networks, e.g. Weibo, WeChat public accounts, Facebook, Twitter ...
04 Open Source Code PART FOUR
4. Open Source Code Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
4. Open Source Code
WeChat public account crawler: https://github.com/hexcola/wcspider
Douban Books crawler: https://github.com/lanbing510/DouBanSpider
Zhihu crawler: https://github.com/LiuRoy/zhihu_spider
Bilibili user crawler: https://github.com/airingursb/bilibili-user
Sina Weibo crawler: https://github.com/LiuXingMing/SinaSpider
Distributed novel-download crawler: https://github.com/gnemoug/distribute_crawler
CNKI crawler: https://github.com/yanzhou/CnkiSpider
Lianjia crawler: https://github.com/lanbing510/LianJiaSpider
JD.com crawler: https://github.com/taizilongxu/scrapy_jingdong
QQ Groups crawler: https://github.com/caspartse/QQ-Groups-Spider
WooYun crawler: https://github.com/hanc00l/wooyun_public
05 Our Code PART FIVE
5. Our Code
- Based on Scrapy
- Encapsulation
- Provides an API
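As a rough illustration of the "encapsulation / provide API" idea, a Scrapy-based framework can hide crawl setup behind a single function. The function run_spider and its parameters are hypothetical, not the actual API of the presented framework; CrawlerProcess is standard Scrapy API.

from scrapy.crawler import CrawlerProcess

def run_spider(spider_cls, settings=None, **spider_kwargs):
    # Hypothetical wrapper API: start one spider with the given settings
    # and keyword arguments, and block until the crawl finishes.
    process = CrawlerProcess(settings or {})
    process.crawl(spider_cls, **spider_kwargs)
    process.start()

# Example usage (NewsTitleSpider is the sketch shown earlier):
# run_spider(NewsTitleSpider, settings={"DOWNLOAD_DELAY": 2})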
5. Our Code: Workflow
5. Our Code: What to do next on our framework?
- JavaScript
- Simulated user login
- Cookies
- Proxy server
- Redis
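For "simulated user login", Scrapy's FormRequest.from_response is a common building block. Everything site-specific in the sketch below (URLs, form field names, the success check) is a hypothetical illustration, not part of the presented framework.

import scrapy

class LoginSpider(scrapy.Spider):
    name = "login_demo"
    start_urls = ["https://example.com/login"]   # hypothetical login page

    def parse(self, response):
        # Fill in the login form found on the page and submit it;
        # Scrapy keeps the resulting session cookies for later requests.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Crude, site-specific success check before crawling protected pages.
        if "logout" in response.text.lower():
            yield scrapy.Request("https://example.com/protected",
                                 callback=self.parse_protected)

    def parse_protected(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}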
06 Distributed Crawls PART SIX
6. Distributed Crawls
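One simple way to distribute a crawl, along the lines suggested in Scrapy's documentation on distributed crawls, is to partition the URLs so that each worker fetches only its own share; shared-queue approaches such as scrapy-redis build on the Redis item mentioned on the previous slide. The worker constants and seed list below are illustrative assumptions, not the scheme used in the presentation.

import hashlib

NUM_WORKERS = 4      # hypothetical number of crawler machines
WORKER_ID = 0        # 0 .. NUM_WORKERS-1, different on each machine

def belongs_to_this_worker(url: str) -> bool:
    # Hash the URL and keep it only if it maps to this worker's slot,
    # so every URL is fetched by exactly one machine.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_WORKERS == WORKER_ID

seed_urls = ["http://news.sina.com.cn/", "http://news.163.com/", "http://news.qq.com/"]
start_urls = [u for u in seed_urls if belongs_to_this_worker(u)]   # feed these to the spider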
07 Avoiding getting banned PART SEVEN
7. Avoiding getting banned
- Rotate your user agent from a pool of well-known browser user agents (search around to get a list of them).
- Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour.
- Use download delays (2 or higher); see the DOWNLOAD_DELAY setting.
- If possible, use the Google cache to fetch pages instead of hitting the sites directly.
- Use a pool of rotating IPs, for example the free Tor project or paid services like ProxyMesh.
- Use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera.
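Below is a sketch of how several of these tips map onto Scrapy configuration. The user-agent strings, the "myproject" module path, and the RotateUserAgentMiddleware class are illustrative assumptions; COOKIES_ENABLED, DOWNLOAD_DELAY, and DOWNLOADER_MIDDLEWARES are standard Scrapy settings.

import random

USER_AGENTS = [
    # Hypothetical pool; in practice use a list of real, current browser user agents.
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
]

class RotateUserAgentMiddleware:
    # Downloader middleware that assigns a random user agent to every request.
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)

# settings.py
# COOKIES_ENABLED = False   # some sites use cookies to spot bot behaviour
# DOWNLOAD_DELAY = 2        # seconds to wait between requests
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RotateUserAgentMiddleware": 400,   # "myproject" is hypothetical
# }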
08 Papers and Research PART EIGHT
8. Papers and Research
- Crawler Technology
- Data Mining
Thanks for Listening
Q & A