AfterCollege Self-Service Scrape Configuration & Posting Utility Kai Hu Haiyan Wu May 14, Harney 235.


1 AfterCollege Self-Service Scrape Configuration & Posting Utility. Kai Hu, Haiyan Wu. May 14, 2009 @ Harney 235

2 Presentation Outline Background & Motivation Goals Design Challenges Implementation Details Project Demonstration Future Extensions 11/1/2015 2 AfterCollege Scrape Utility

3 AfterCollege Background: Customized career network for colleges & professional organizations across the country. Goal: Create a better way for job-seeking students and alumni to connect with the right employer.

4 What's Already There? Manually created configuration files; Crawler that runs periodically; Job feed outputs to be posted online. Diagram labels: Staff, config.xml, Black Widow crawler, jobFeed.xml, AfterCollege website.

5 Limitations: Scalability; Expensive to maintain; Requires technical knowledge; Supports only GET requests; Unable to handle dynamic websites.

6 Design Overview: GUI Tool that assists staff through the configuration process; Web Proxy that captures user activities; New Crawler that uses both DOM & String Pattern matching. Diagram labels: GUI Tool, Web Proxy, New Crawler, config file, JSON files, jobFeed.xml.

7 Design Overview (diagram labels: config file, job feed)

8 Advantages: Eases the pain, since non-technical staff can also configure the system; Makes the configuration process more straightforward and easier to understand; Less expensive to maintain, taking less than 10 minutes to reconfigure; Supports POST; Possibility of extension to support more complicated websites.
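
The move from GET-only crawling to POST support can be illustrated with a minimal sketch. It assumes the proxy has recorded a user action as a small dictionary; the keys `url`, `method`, and `params` and the example URL are hypothetical and not taken from the project. The sketch only builds the request object and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(captured):
    """Turn a captured user action (hypothetical format) into a Request."""
    encoded = urlencode(captured["params"])
    if captured["method"].upper() == "POST":
        # POST: form parameters travel in the request body.
        return Request(captured["url"], data=encoded.encode("utf-8"), method="POST")
    # GET: parameters are appended to the URL as a query string.
    url = captured["url"] + ("?" + encoded if encoded else "")
    return Request(url, method="GET")


# Hypothetical captured search-form submission.
captured_action = {
    "url": "http://example.com/jobs/search",
    "method": "POST",
    "params": {"keyword": "engineer", "page": "1"},
}
post_request = build_request(captured_action)
get_request = build_request(dict(captured_action, method="GET"))
```

A crawler would pass the resulting object to an HTTP client to replay the action; the project itself used Apache HttpClient in Java for that step.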

9 Challenges: Come up with an easy-to-follow user interface; Build a web proxy from scratch; Distinguish patterns based on selected texts; Develop a crawler algorithm that handles job information residing on different pages; Deal with tricky JavaScript; Deal with embedded HTML pages; Test crawling accuracy.

10 Design Decisions. Firefox Plugin vs. Web Proxy: Integration with back-end; Ability to add functionalities. Dojo vs. YUI: Fade-In/Out, Drag & Drop; Deals with different browsers; Documentation. XML vs. JSON: Simplicity & efficiency of parsing; Availability of wrapper methods in YUI.
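
To illustrate the parsing simplicity cited for JSON, here is a minimal sketch of loading a scrape configuration. The slides do not show the actual file layout, so every key name (`startUrl`, `method`, `fields`, `patternType`, and so on) is an assumption made for illustration.

```python
import json

# Hypothetical configuration; the real project's schema is not shown in the slides.
CONFIG_TEXT = """
{
  "startUrl": "http://example.com/jobs",
  "method": "POST",
  "fields": [
    {"name": "title", "patternType": "string", "prefix": "<h2>", "suffix": "</h2>"},
    {"name": "location", "patternType": "dom", "path": "html/body/div/span"}
  ]
}
"""


def load_config(text):
    """Parse the JSON text and check that each field uses a known pattern type."""
    config = json.loads(text)
    for field in config["fields"]:
        if field["patternType"] not in ("string", "dom"):
            raise ValueError("unknown pattern type: %r" % field["patternType"])
    return config


config = load_config(CONFIG_TEXT)
field_names = [field["name"] for field in config["fields"]]
```

One call to `json.loads` yields plain dictionaries and lists, with no separate tree-walking step.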

11 Implementation Details. GUI Tool: JS inserted into each page; YUI for user interface, as JS toolkit; AJAX for communication with web proxy. Web Proxy: Java Servlet; Jetty as web/app server; Apache HttpClient. Crawler: Regular Expressions for Pattern Match; Scrapes jobs on a per-page, per-field basis.
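
The per-page, per-field scraping idea can be sketched as follows. This is an illustration under assumptions, not the project's crawler: the rule format (a field name mapped to a page name and a regular expression with one capture group) and the sample pages are hypothetical.

```python
import re


def scrape_job(pages, field_rules):
    """Build one job record by applying each field's regex to its own page.

    pages: dict mapping a page name to that page's HTML text.
    field_rules: dict mapping a field name to (page name, regex with one group).
    """
    job = {}
    for field, (page_name, pattern) in field_rules.items():
        match = re.search(pattern, pages.get(page_name, ""), re.DOTALL)
        # A field that does not match is kept as None instead of failing the job.
        job[field] = match.group(1).strip() if match else None
    return job


# Hypothetical pages: the title is on a listing page, the rest on a detail page.
sample_pages = {
    "list": '<a href="/job/1">Software Engineer</a>',
    "detail": '<div id="desc">Build crawlers.</div><span class="loc"> San Francisco </span>',
}
sample_rules = {
    "title": ("list", r'<a href="[^"]*">(.*?)</a>'),
    "description": ("detail", r'<div id="desc">(.*?)</div>'),
    "location": ("detail", r'<span class="loc">(.*?)</span>'),
    "salary": ("detail", r'<span class="pay">(.*?)</span>'),
}
sample_job = scrape_job(sample_pages, sample_rules)
```

Keeping the page name next to each pattern is one way to handle job information that is spread across different pages.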

12 Implementation Details: Add customized JavaScript to rendered HTML pages
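
A proxy could add its script to a fetched page roughly as in the sketch below. It is a simplified illustration: the function name `inject_script` and the script path `/tool.js` are hypothetical, and the project's proxy was a Java Servlet rather than Python.

```python
import html
import re


def inject_script(page_html, script_url):
    """Insert a script tag just before the last closing body tag of the page."""
    tag = '<script type="text/javascript" src="%s"></script>' % html.escape(
        script_url, quote=True
    )
    last = None
    # Find the last </body>, case-insensitively, in case the tag appears in text.
    for last in re.finditer(r"</body\s*>", page_html, flags=re.IGNORECASE):
        pass
    if last is None:
        # No closing body tag: append the script at the end as a fallback.
        return page_html + tag
    return page_html[: last.start()] + tag + page_html[last.start():]


injected = inject_script("<html><body><p>Hi</p></body></html>", "/tool.js")
```

Placing the tag at the end of the body lets the original page content load before the added tool script runs.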

13 Implementation Details: Rendered HTML source code

14 Implementation Details: Output content

15 Implementation Details

16 Implementation Details

17 Implementation Details: DOM Pattern
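
The slide content for the DOM pattern is not preserved in this transcript, so the following is only a sketch of one plausible reading: identify a field by the chain of tags leading to it and collect the text of every element on that path. The path syntax (`html/body/div/span`) and the class name are assumptions.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag and so must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathExtractor(HTMLParser):
    """Collect text found at an exact tag path such as 'html/body/div/span'."""

    def __init__(self, target_path):
        super().__init__()
        self.target = target_path.split("/")
        self.stack = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack == self.target:
            self.matches.append(text)


def extract_by_dom_path(page_html, path):
    parser = PathExtractor(path)
    parser.feed(page_html)
    parser.close()
    return parser.matches


sample_page = (
    "<html><body>"
    "<div><h2>Software Engineer</h2><span>San Francisco</span></div>"
    "<div><h2>Analyst</h2><span>Oakland</span></div>"
    "</body></html>"
)
locations = extract_by_dom_path(sample_page, "html/body/div/span")
```

A path-based pattern matches every repeated listing block with the same structure, which is why it suits pages that render many jobs from one template.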

18 Implementation Details: String Pattern
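
The slide content for the string pattern is likewise missing here. One plausible reading, sketched below under that assumption, derives a regular expression from the characters immediately before and after the text a user selected. The function name and the `context` parameter are hypothetical.

```python
import re


def build_string_pattern(page_html, selected, context=5):
    """Build a regex from the text surrounding the first occurrence of `selected`.

    Raises ValueError if `selected` does not occur in the page.
    """
    start = page_html.index(selected)
    end = start + len(selected)
    prefix = page_html[max(0, start - context):start]
    suffix = page_html[end:end + context]
    # The surrounding text is escaped so it matches literally; the selection
    # itself becomes a non-greedy capture group.
    return re.compile(re.escape(prefix) + r"(.*?)" + re.escape(suffix), re.DOTALL)


sample_listing = '<li class="job">Software Engineer</li><li class="job">Data Analyst</li>'
title_pattern = build_string_pattern(sample_listing, "Software Engineer", context=5)
titles = title_pattern.findall(sample_listing)
```

The amount of context is a trade-off: a short prefix and suffix generalize to other items on the page, while a longer one is more specific but may only match the item that was selected.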

19 Implementation Details

20 Project Demonstration

21 Future Extensions. Pagination: Add support for crawling multiple pages. Tricky JavaScript: Find a solution to prevent redirection to a different domain. Embedded Pages: Add functionality to get the HTML content of embedded pages.

22 Resources. Course Instructor: Dr. Jeff Buckwalter. Sponsor: Steve Girolami, Perry Lee, & Saan Saeteurn. Source code control system: Subversion. Wiki site (knowledge share, work log, resource portal): http://cs690.wikispaces.com/ Google group (discussion and information exchange medium): http://groups.google.com/group/desidae

23 Questions?

24 Thank you!

