Abe Lederman, President and CTO Deep Web Technologies, Inc. ScienceEducation.gov Meeting National Academy of Sciences, March 18, 2009 A Look at the Technology.


1 Abe Lederman, President and CTO Deep Web Technologies, Inc. ScienceEducation.gov Meeting National Academy of Sciences, March 18, 2009 A Look at the Technology Under the Hood

2 Content Integration Technologies for ScienceEducation.gov: Crawling and Indexing (part of Science.gov, E-Print Network); Federated Search (Science.gov, WorldWideScience.org). ScienceEducation.gov needs to integrate content from a variety of websites and databases, which requires custom tools that other search engines are unable to provide.

3 Drawing on the Experience of the E-Print Network: a gateway to 30,000 websites and databases worldwide, containing over 5 million e-prints in basic and applied sciences.

4 Drawing on the Experience of the E-Print Network: Initially developed in 2001; Crawls and indexes 30,000 websites; Uses sophisticated filters to ensure that only quality e-prints are included in the Network; Contains a full-text index of over 1.5 million e-prints; Uses an Admin Tool to manage websites in the E-Print Network

5

6 What is Federated Search? Federated Search is an application or service that allows a user to submit a search in parallel to multiple, distributed information sources and retrieve aggregated, ranked and de-duped results.
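The definition on this slide can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the two source functions, their canned results, the `score` field, and the rule of ranking by score and de-duping by URL are hypothetical and are not taken from the Deep Web Technologies product.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sources: each takes a query and returns a list of hits.
# A real source would call a remote search interface instead.
def search_source_a(query):
    return [
        {"url": "http://example.gov/a1", "title": "Ocean currents", "score": 0.9},
        {"url": "http://example.gov/shared", "title": "Oil spills", "score": 0.6},
    ]

def search_source_b(query):
    return [
        {"url": "http://example.gov/shared", "title": "Oil spills", "score": 0.8},
        {"url": "http://example.gov/b1", "title": "Tides", "score": 0.4},
    ]

def federated_search(query, sources):
    # Submit the same query to every source in parallel.
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        result_lists = list(pool.map(lambda source: source(query), sources))

    # Aggregate and de-dupe: keep the best-scoring hit for each URL.
    best = {}
    for results in result_lists:
        for hit in results:
            kept = best.get(hit["url"])
            if kept is None or hit["score"] > kept["score"]:
                best[hit["url"]] = hit

    # Rank the merged hits, highest score first.
    return sorted(best.values(), key=lambda hit: hit["score"], reverse=True)

hits = federated_search("ocean", [search_source_a, search_source_b])
for hit in hits:
    print(hit["score"], hit["url"])
```

Scores from independent sources are rarely comparable in practice, so a real system would need some normalization step before ranking; the sketch sidesteps that by assuming a shared scale.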

7 In Other Words… One Search, Many Sources: DOD, EPA, NASA, FDA, NIH, DOE, NSF, Other Agencies

8 Assembling the ScienceEducation.gov Search Engine - Part I: Education Experts; Assemble Starting URLs

9 Assembling the ScienceEducation.gov Search Engine - Part II: Starting URLs; Crawl Websites; Filter Bad URLs and Remove Duplicates; Build Index; Assign Learning Levels; ScienceEducation.gov Index
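The crawl and index-building stages named on this slide might look roughly like the sketch below. It is only an illustration under stated assumptions: `FAKE_WEB` is a hypothetical in-memory stand-in for real HTTP fetching, the crawl is a plain breadth-first traversal, and the index is a simple word-to-URL inverted index. None of this is claimed to be how the ScienceEducation.gov crawler works.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.gov/": (
        "science education home",
        ["http://example.gov/lessons", "http://example.gov/contact"],
    ),
    "http://example.gov/lessons": (
        "lesson plans for ocean science",
        ["http://example.gov/"],
    ),
    "http://example.gov/contact": ("contact us", []),
}

def crawl(start_urls, fetch):
    """Breadth-first crawl from the starting URLs; returns {url: text}."""
    seen = set(start_urls)
    queue = deque(start_urls)
    pages = {}
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        pages[url] = text
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Inverted index: word -> set of URLs whose text contains it."""
    index = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(url)
    return index

pages = crawl(["http://example.gov/"], FAKE_WEB.__getitem__)
index = build_index(pages)
print(sorted(index["science"]))
```

Passing `fetch` in as a parameter keeps the traversal logic testable without network access; a real fetcher would also need politeness delays, robots.txt handling, and error recovery.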

10 Challenges Ahead: Determining what sites to crawl; Filtering undesirable URLs; Assigning appropriate learning level to content; Categorizing content

11 To Crawl or Not To Crawl? [Figure labels: "Would miss these"; "Don't crawl these pages"; "Will crawl these"]

12 Filtering Undesirable URLs: All Crawled URLs; Filter; Good URLs. Undesirable: Calendar, Contact, Feedback, Housing, Registration, Survey
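One simple way to realize a filter like the one on this slide is keyword matching on the URL path. The sketch below assumes that approach; the slide does not say how the real filter decides, so the case-insensitive substring rule, the function names, and the example URLs are all hypothetical. Only the six keywords come from the slide.

```python
import re
from urllib.parse import urlparse

# Keywords listed on the slide; matching them as path substrings is an assumption.
UNDESIRABLE = ("calendar", "contact", "feedback", "housing", "registration", "survey")
PATTERN = re.compile("|".join(re.escape(word) for word in UNDESIRABLE), re.IGNORECASE)

def is_good_url(url):
    """True when no undesirable keyword appears in the URL path."""
    path = urlparse(url).path
    return PATTERN.search(path) is None

def filter_urls(urls):
    """Keep only the URLs that pass the keyword filter, preserving order."""
    return [url for url in urls if is_good_url(url)]

crawled = [
    "http://example.gov/lessons/volcanoes.html",
    "http://example.gov/Contact.html",
    "http://example.gov/events/registration?id=3",
    "http://example.gov/survey/2009",
    "http://example.gov/energy/solar.html",
]
good = filter_urls(crawled)
print(good)
```

Substring matching is blunt: it would also reject a legitimate page whose path happens to contain one of the keywords, so a production filter would likely need per-site exceptions.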

13 Removing Duplicate Web Pages URL: http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/education_threats.html DUP: http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/ocean_planet_book_threats.html TITLE: Ocean Planet: Threats SNIPPET: Threats to the health of the oceans Oil spills account for only about five percent of the oil entering the oceans The Coast Guard estimates that for United States waters sewage treatment plants discharge twice as much oil each year as tanker spills Each year industrial household cleaning gardening and automotive products pollute water About 65 000 chemicals are used commercially in the United States today with about 1 000 new ones added each year Only about 300 have been extensively tested for toxicity It is estimated that medical waste that washed up onto Long Island and New Jersey beaches in the summer of 1988 cost as much as 3 billion in lost revenue from tourism and recreation.
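A case like this one, where two URLs serve the same text, can be caught by fingerprinting page content. The following is a minimal sketch under that assumption: it hashes whitespace- and punctuation-normalized text with SHA-256 and keeps the first URL seen for each fingerprint. The slide does not describe the actual method, and this approach only detects exact matches after normalization, not near-duplicates. The page texts below are shortened stand-ins, not the real page contents.

```python
import hashlib
import re

def fingerprint(text):
    """Hash of the lowercased word sequence, ignoring punctuation and spacing."""
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def remove_duplicates(pages):
    """pages: iterable of (url, text).

    Returns (unique, dups): unique keeps the first page per fingerprint,
    dups maps each dropped URL to the URL that was kept instead.
    """
    first_seen = {}
    unique, dups = [], {}
    for url, text in pages:
        key = fingerprint(text)
        if key in first_seen:
            dups[url] = first_seen[key]
        else:
            first_seen[key] = url
            unique.append((url, text))
    return unique, dups

BASE = "http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/"
pages = [
    (BASE + "education_threats.html", "Threats to the health of the oceans."),
    (BASE + "ocean_planet_book_threats.html", "Threats to the  health of the OCEANS"),
]
unique, dups = remove_duplicates(pages)
print(dups)
```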

14 Learning Level Stratification

15 Categorizing Content: Audience: Student or Teacher; Grade Level: K-3, 4-6, 7-9, 10-12, College; Content Type: Interactive Activities, Lesson Plans, Reference Materials, Science Fair Projects, Videos; Subject Area: Chemistry, Computer Science, Energy, Life Sciences, Mathematics, Physics
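The four facets on this slide map naturally onto a small record type with a controlled vocabulary per field. The sketch below is one possible representation: the facet values are copied from the slide, while the `Resource` class, the `select` helper, and the example URLs are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Allowed values per facet, as listed on the slide.
FACETS = {
    "audience": {"Student", "Teacher"},
    "grade_level": {"K-3", "4-6", "7-9", "10-12", "College"},
    "content_type": {
        "Interactive Activities", "Lesson Plans", "Reference Materials",
        "Science Fair Projects", "Videos",
    },
    "subject_area": {
        "Chemistry", "Computer Science", "Energy",
        "Life Sciences", "Mathematics", "Physics",
    },
}

@dataclass(frozen=True)
class Resource:
    url: str
    audience: str
    grade_level: str
    content_type: str
    subject_area: str

    def __post_init__(self):
        # Reject any value outside the controlled vocabulary.
        for facet, allowed in FACETS.items():
            value = getattr(self, facet)
            if value not in allowed:
                raise ValueError(f"{value!r} is not a valid {facet}")

def select(resources, **wanted):
    """Return the resources whose facets equal every requested value."""
    return [
        r for r in resources
        if all(getattr(r, facet) == value for facet, value in wanted.items())
    ]

catalog = [
    Resource("http://example.gov/volcano-lab", "Student", "4-6", "Interactive Activities", "Energy"),
    Resource("http://example.gov/cell-lesson", "Teacher", "7-9", "Lesson Plans", "Life Sciences"),
    Resource("http://example.gov/solar-video", "Student", "4-6", "Videos", "Energy"),
]
matches = select(catalog, audience="Student", subject_area="Energy")
print([r.url for r in matches])
```

This sketch gives each resource exactly one value per facet; content that spans several grade levels or subjects would need set-valued fields instead.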

16 A Look at the Technology Under the Hood Thank you! Abe Lederman abe@deepwebtech.com www.deepwebtech.com

