Published by Tyler Walsh. Modified over 9 years ago.
1
AfterCollege Self-Service Scrape Configuration & Posting Utility
Kai Hu, Haiyan Wu
May 14, 2009 @ Harney 235
2
Presentation Outline
- Background & Motivation
- Goals
- Design Challenges
- Implementation Details
- Project Demonstration
- Future Extensions
11/1/2015 2 AfterCollege Scrape Utility
3
AfterCollege Background
- Customized career network for colleges & professional organizations across the country
- Goal: Create a better way for job-seeking students and alumni to connect with the right employer
4
What's Already There?
- Manually created configuration files
- Crawler that runs periodically
- Job feed outputs to be posted online
[Diagram labels: Staff, config.xml, Black Widow crawler, jobFeed.xml, AfterCollege website]
5
Limitations
- Scalability
- Expensive to maintain
- Requires technical knowledge
- Supports only GET requests
- Unable to handle dynamic websites
6
Design Overview
- GUI Tool that assists staff through the configuration process
- Web Proxy that captures user activities
- New Crawler that uses both DOM & String Pattern matching
[Diagram labels: GUI Tool, Web Proxy, JSON files, config file, New Crawler, jobFeed.xml]
7
Design Overview
[Diagram labels: config file, job feed]
8
Advantages
- Eases the pain: non-technical staff can also configure the system
- Makes the configuration process more straightforward and easier to understand
- Less expensive to maintain: takes less than 10 minutes to reconfigure
- Supports POST
- Can be extended to support more complicated websites
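The POST support listed above could be sketched roughly as follows. This is only an illustration in Python (the project's proxy was built on Java servlets and Apache HttpClient). The `CapturedRequest` record, its field names, and the example URLs are hypothetical, and the request is only constructed here, never sent.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode
from urllib.request import Request


@dataclass
class CapturedRequest:
    """Hypothetical record of one form submission seen by the proxy."""
    url: str
    method: str = "GET"
    form: dict = field(default_factory=dict)


def build_request(captured: CapturedRequest) -> Request:
    """Rebuild a urllib Request from a captured GET or POST."""
    if captured.method.upper() == "POST":
        # POST carries the form fields in the request body.
        body = urlencode(captured.form).encode("ascii")
        return Request(
            captured.url,
            data=body,
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            method="POST",
        )
    # GET carries the form fields in the query string instead.
    query = urlencode(captured.form)
    url = f"{captured.url}?{query}" if query else captured.url
    return Request(url, method="GET")


req = build_request(
    CapturedRequest("http://example.com/jobs/search", "POST", {"keyword": "engineer"})
)
print(req.get_method(), req.data)
```

Recording the method alongside the form fields is what would let a crawler replay a search form that a GET-only configuration cannot express.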
9
Challenges
- Come up with an easy-to-follow user interface
- Build a web proxy from scratch
- Distinguish patterns based on selected text
- Develop a crawler algorithm that handles job information residing on different pages
- Deal with tricky JavaScript
- Deal with embedded HTML pages
- Test crawling accuracy
10
Design Decisions
- Firefox Plugin vs. Web Proxy
  - Integration with back-end
  - Ability to add functionality
- Dojo vs. YUI
  - Fade-In/Out, Drag & Drop
  - Deals with different browsers
  - Documentation
- XML vs. JSON
  - Simplicity & efficiency of parsing
  - Availability of wrapper methods in YUI
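To illustrate the parsing-simplicity argument for JSON, here is a minimal sketch of loading a JSON configuration. The slides do not show the real schema, so every key below (`startUrl`, `method`, `fields`, `patternType`) is an assumption made up for this example, and Python stands in for the project's own JavaScript and Java code.

```python
import json

# Hypothetical configuration; the real schema is not shown in the slides.
CONFIG_JSON = """
{
  "startUrl": "http://example.com/careers",
  "method": "POST",
  "fields": [
    {"name": "title", "patternType": "dom"},
    {"name": "location", "patternType": "string"}
  ]
}
"""

# One call turns the text into plain dicts and lists; no tree walking needed.
config = json.loads(CONFIG_JSON)
field_names = [f["name"] for f in config["fields"]]
print(config["method"], field_names)
```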
11
Implementation Details
- GUI Tool
  - JavaScript inserted into each page
  - YUI as the JavaScript toolkit for the user interface
  - AJAX for communication with the web proxy
- Web Proxy
  - Java Servlet
  - Jetty as web/app server
  - Apache HttpClient
- Crawler
  - Regular expressions for pattern matching
  - Scrapes jobs on a per-page, per-field basis
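The crawler's per-page, per-field scraping with regular expressions might look something like the sketch below. It is written in Python for brevity rather than the project's Java, and the page names, field names, patterns, and sample HTML are all hypothetical.

```python
import re

# Hypothetical configuration: for each page type, one regex per field.
PAGE_PATTERNS = {
    "listing": {
        "title": re.compile(r'<a class="job"[^>]*>(.*?)</a>', re.S),
    },
    "detail": {
        "location": re.compile(r"Location:\s*<b>(.*?)</b>", re.S),
        "description": re.compile(r'<div id="desc">(.*?)</div>', re.S),
    },
}


def scrape(pages: dict) -> dict:
    """Apply each page's field patterns to that page's HTML.

    `pages` maps a page name to its HTML source. Fields whose pattern
    does not match are left out of the result.
    """
    job = {}
    for page_name, patterns in PAGE_PATTERNS.items():
        html = pages.get(page_name, "")
        for field_name, pattern in patterns.items():
            match = pattern.search(html)
            if match:
                job[field_name] = match.group(1).strip()
    return job


pages = {
    "listing": '<a class="job" href="/j/1">Software Engineer</a>',
    "detail": 'Location: <b>San Francisco</b><div id="desc">Build crawlers.</div>',
}
job = scrape(pages)
print(job)
```

Keying the patterns by page lets one job record be assembled from fields that live on different pages, which is one of the challenges the deck lists.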
12
Implementation Details
Add customized JavaScript to rendered HTML pages
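The step named on this slide, adding customized JavaScript to rendered HTML pages, could be approximated as shown here. This is a sketch under assumptions: the helper script path is hypothetical, the slides do not say where in the page the project inserted its script, and Python is used in place of the Java servlet.

```python
import re

# Hypothetical URL of the helper script served by the proxy.
HELPER_SRC = "/proxy-static/config-tool.js"


def inject_script(html: str, src: str = HELPER_SRC) -> str:
    """Insert a <script> tag just before the closing </body> tag.

    If the page has no </body>, the tag is appended at the end.
    """
    tag = f'<script type="text/javascript" src="{src}"></script>'
    pattern = re.compile(r"</body\s*>", re.IGNORECASE)
    if pattern.search(html):
        # A function replacement keeps any backslashes in `tag` literal;
        # count=1 touches only the first closing tag.
        return pattern.sub(lambda m: tag + m.group(0), html, count=1)
    return html + tag


page = "<html><body><h1>Jobs</h1></BODY></html>"
print(inject_script(page))
```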
13
Implementation Details
Rendered HTML source code
14
Implementation Details
Output content
15
Implementation Details
16
Implementation Details
17
Implementation Details
DOM Pattern
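The slide's figure is not recoverable, so the following is only a guess at what a DOM pattern could be: the chain of tags leading to the text a staff member selected. The `PathFinder` class and `dom_path` function are hypothetical names, and the sketch uses Python's standard `html.parser` instead of the browser DOM the GUI tool would have had available.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathFinder(HTMLParser):
    """Record the tag path to the first text node containing `target`."""

    def __init__(self, target: str):
        super().__init__()
        self.target = target
        self.stack = []
        self.path = None

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.path is None and self.target in data:
            self.path = "/".join(self.stack)


def dom_path(html: str, selected_text: str):
    """Return a slash-separated tag path, or None if the text is absent."""
    finder = PathFinder(selected_text)
    finder.feed(html)
    return finder.path


html = (
    "<html><body><table><tr><td>Title</td>"
    "<td>Software Engineer</td></tr></table></body></html>"
)
print(dom_path(html, "Software Engineer"))
```

A path like this does not distinguish sibling cells, so a practical pattern would likely also need a position index or attribute values.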
18
Implementation Details
String Pattern
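One plausible reading of a string pattern, given the deck's challenge of distinguishing patterns based on selected text, is a regular expression built from the characters surrounding a selection. The sketch below assumes that reading; `string_pattern`, its `context` parameter, and the sample HTML are hypothetical and are not taken from the project.

```python
import re


def string_pattern(html: str, selected: str, context: int = 12):
    """Build a regex capturing whatever sits between the text immediately
    before and after the selection. Returns None if the selection is absent.
    """
    start = html.find(selected)
    if start == -1:
        return None
    end = start + len(selected)
    prefix = html[max(0, start - context):start]
    suffix = html[end:end + context]
    # Escape the literal context so characters like "<" or "." match as-is.
    return re.compile(re.escape(prefix) + r"(.*?)" + re.escape(suffix), re.S)


sample = "<p>Location: <b>San Francisco</b></p>"
pattern = string_pattern(sample, "San Francisco")

# The pattern learned on one job page is then applied to another.
other = "<p>Location: <b>Oakland</b></p>"
print(pattern.search(other).group(1))
```

The `context` width is a trade-off: too little context matches unrelated text, while too much ties the pattern to details that differ between pages.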
19
Implementation Details
20
Project Demonstration
21
Future Extensions
- Pagination
  - Add support for crawling multiple pages
- Tricky JavaScript
  - Find a solution to prevent redirection to a different domain
- Embedded Pages
  - Add functionality to get the HTML content of embedded pages
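The pagination extension is described only as a goal, so the following is a speculative sketch rather than anything the project implemented. It follows "Next" links until none remain, a page repeats, or a page limit is reached. The `NEXT_LINK` pattern, the `fetch` callable, and the in-memory `SITE` are assumptions that keep the example runnable without network access.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern for a "Next" link; real sites vary widely.
NEXT_LINK = re.compile(r'<a[^>]*href="([^"]+)"[^>]*>\s*Next\s*</a>', re.I)


def crawl_pages(start_url, fetch, max_pages=50):
    """Follow "Next" links from start_url, returning visited URLs in order.

    `fetch` is any callable mapping a URL to its HTML source.
    """
    visited = []
    url = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        match = NEXT_LINK.search(fetch(url))
        # Resolve relative links against the current page's URL.
        url = urljoin(url, match.group(1)) if match else None
    return visited


# In-memory stand-in for a paginated job listing.
SITE = {
    "http://example.com/jobs?page=1": '<a href="?page=2">Next</a>',
    "http://example.com/jobs?page=2": '<a href="/jobs?page=3">Next</a>',
    "http://example.com/jobs?page=3": "<p>End of results</p>",
}
print(crawl_pages("http://example.com/jobs?page=1", SITE.__getitem__))
```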
22
Resources
- Course Instructor
  - Dr. Jeff Buckwalter
- Sponsor
  - Steve Girolami, Perry Lee, & Saan Saeteurn
- Source code control system
  - Subversion
- Wiki Site
  - Knowledge share, work log, resource portal - http://cs690.wikispaces.com/
- Google group
  - Discussion and information exchange medium - http://groups.google.com/group/desidae
23
Questions?
24
Thank you!