Published by Meredith Sharp. Modified over 8 years ago.
1
RANKING THE INTERNETZ WITH MESHWORK
Justin Cano
Insight Data Engineering Fellow
2
Motivation
  The internet is huge
  How does your page rank amongst others in your mesh?
  What is the reach of your website?
  Which pages are affecting your page rank?
Data Source
  Common Crawl Organization
  More than 7 years of web page data, over 500 TB
  CC April 2015 web corpus: ~168 TB
  Processed ~445 GB for project
  Readily available in S3
3
Meshwork – your mesh in a network http://www.jcano.me/meshwork
4
Pipeline
  Data from S3 (source of truth)
  REST
5
Data
  Raw (WARC format)
  Extraction
  Edge List
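The extraction step on this slide (raw page data in, edge list out) can be illustrated with a minimal sketch. It is not the project's actual code: it assumes the HTML payload of a WARC response record has already been read into a string, it treats each host as one vertex, and it drops links that stay on the same host. The names `LinkCollector` and `extract_edges` are hypothetical, chosen for this example only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_edges(page_url, html):
    """Return sorted, de-duplicated (source_host, target_host) edges for one page."""
    collector = LinkCollector()
    collector.feed(html)
    source = urlparse(page_url).netloc
    edges = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL before taking the host.
        target = urlparse(urljoin(page_url, href)).netloc
        # Assumption of this sketch: keep only links that leave the source host.
        if target and target != source:
            edges.add((source, target))
    return sorted(edges)
```

Running a function like this over every page in the processed crawl and concatenating the results would give one edge list for the whole corpus. Whether the original pipeline keyed edges by host or by full URL is not stated on the slide.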
6
Data Flow
  Link edge data
  (vertexId, pageRank)
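The data flow on this slide turns link edge data into (vertexId, pageRank) pairs. As a rough illustration of that computation, here is a single-machine power-iteration sketch in plain Python. It is not the Spark job from the talk: the function name `page_rank`, the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from vertices with no out-links are all assumptions of this example.

```python
def page_rank(edges, damping=0.85, iterations=20):
    """Return {vertex_id: rank} for a list of (source, target) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    # Start from a uniform distribution over all vertices.
    ranks = {v: 1.0 / n for v in vertices}

    for _ in range(iterations):
        # Rank held by vertices with no out-links is spread evenly (assumption).
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        new_ranks = {
            v: (1.0 - damping) / n + damping * dangling / n for v in vertices
        }
        # Each vertex passes a damped, equal share of its rank to its targets.
        for src in vertices:
            targets = out_links[src]
            if targets:
                share = damping * ranks[src] / len(targets)
                for dst in targets:
                    new_ranks[dst] += share
        ranks = new_ranks

    return ranks
```

With this normalization the ranks always sum to 1, so each value can be read as a share of the total. Distributed implementations may normalize differently, so absolute values from this sketch should not be compared directly with output from another system.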
7
Scaling up Page Rank job… spark-submit --class pageRank...
8
About Me
Justin Cano
UC Riverside, BS Computer Engineering
Previous work experience
  Software Engineer @
  Embedded Systems Developer @
  Software Engineer Intern @
Hobbies
  I like building things! Hardware, software
  Learning and using new technologies
  Moviegoer
  Outdoor activities: biking, snowboarding
Interests: design, app dev
Favorite TV Shows: Futurama & The Daily Show