Published by Clifford Lester. Modified over 9 years ago.
1
A Look at the Technology Under the Hood
Abe Lederman, President and CTO, Deep Web Technologies, Inc.
ScienceEducation.gov Meeting, National Academy of Sciences, March 18, 2009
2
Content Integration Technologies for ScienceEducation.gov
Crawling and Indexing (part of Science.gov, E-Print Network)
Federated Search (Science.gov, WorldWideScience.org)
ScienceEducation.gov needs to integrate content from a variety of websites and databases, which requires custom tools that other search engines are unable to provide.
3
Drawing on the Experience of the E-Print Network
Gateway to 30,000 websites and databases worldwide, containing over 5 million e-prints in basic and applied sciences.
4
Drawing on the Experience of the E-Print Network
Initially developed in 2001
Crawls and indexes 30,000 websites
Uses sophisticated filters to ensure that only quality e-prints are included in the Network
Contains a full-text index of over 1.5 million e-prints
Uses an Admin Tool to manage websites in the E-Print Network
6
What is Federated Search?
Federated Search is an application or service that allows a user to submit a search in parallel to multiple, distributed information sources and retrieve aggregated, ranked and de-duplicated results.
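The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the Deep Web Technologies implementation: the `Result` record, the `Source` callable type and the `federated_search` function are hypothetical names, results are assumed to carry relevance scores that are comparable across sources, and duplicates are assumed to share a URL.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Result:
    title: str
    url: str
    score: float  # relevance reported by the source (assumed comparable across sources)
    source: str


# Hypothetical: a "source" is any callable that takes a query and returns results.
Source = Callable[[str], List[Result]]


def federated_search(query: str, sources: Dict[str, Source]) -> List[Result]:
    # Submit the same query to every source in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = [pool.submit(search, query) for search in sources.values()]
        batches = [f.result() for f in futures]

    # Aggregate and de-duplicate by URL, keeping the best-scoring copy.
    best: Dict[str, Result] = {}
    for batch in batches:
        for r in batch:
            key = r.url.rstrip("/")
            if key not in best or r.score > best[key].score:
                best[key] = r

    # Rank the merged list, highest score first.
    return sorted(best.values(), key=lambda r: r.score, reverse=True)
```

In practice each source would wrap a remote search interface, and merging scores from different engines usually needs some normalization; both are left out of this sketch.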
7
In Other Words… One Search, Many Sources
Search: DOD, EPA, NASA, FDA, NIH, DOE, NSF, Other Agencies
8
Assembling the ScienceEducation.gov Search Engine - Part I
Education Experts
Assemble Starting URLs
9
Assembling the ScienceEducation.gov Search Engine - Part II
Starting URLs
Crawl Websites
Filter Bad URLs and Remove Duplicates
Build Index
Assign Learning Levels
ScienceEducation.gov Index
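The stages on this slide can be read as a simple pipeline. The sketch below only shows how the stages might be chained, in the order the slide lists them; `build_search_index`, the `Page` record and every stage callable passed in are hypothetical placeholders, not the actual ScienceEducation.gov tooling.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical page record, e.g. {"url": "...", "text": "..."}.
Page = Dict[str, str]


def build_search_index(
    starting_urls: Iterable[str],
    crawl: Callable[[Iterable[str]], List[Page]],
    is_good_url: Callable[[str], bool],
    remove_duplicates: Callable[[List[Page]], List[Page]],
    assign_level: Callable[[Page], str],
) -> Dict[str, Page]:
    pages = crawl(starting_urls)                          # Crawl Websites
    pages = [p for p in pages if is_good_url(p["url"])]   # Filter Bad URLs
    pages = remove_duplicates(pages)                      # Remove Duplicates
    index = {p["url"]: dict(p) for p in pages}            # Build Index
    for record in index.values():                         # Assign Learning Levels
        record["level"] = assign_level(record)
    return index
```

Passing each stage in as a function keeps the ordering visible while leaving the crawler, filters and level classifier open to replacement.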
10
Challenges Ahead
Determining what sites to crawl
Filtering undesirable URLs
Assigning appropriate learning level to content
Categorizing content
11
To Crawl or Not To Crawl?
Would miss these
Don’t crawl these pages
Will crawl these
12
Filtering Undesirable URLs
All Crawled URLs
Filter
Good URLs
Calendar, Contact, Feedback, Housing, Registration, Survey
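One plausible way to apply the keyword list from this slide is a case-insensitive match against each URL's path and query string. The keywords come from the slide; the matching rule and the names `UNDESIRABLE`, `is_good_url` and `filter_urls` are assumptions made for illustration.

```python
import re
from typing import Iterable, List
from urllib.parse import urlparse

# Keywords taken from the slide; matching them in the URL path is an assumption.
UNDESIRABLE = ("calendar", "contact", "feedback", "housing", "registration", "survey")
_PATTERN = re.compile("|".join(UNDESIRABLE), re.IGNORECASE)


def is_good_url(url: str) -> bool:
    # Only the path and query are checked, so a host name such as
    # contact.example.gov does not disqualify every page on that host.
    parsed = urlparse(url)
    return _PATTERN.search(parsed.path + "?" + parsed.query) is None


def filter_urls(urls: Iterable[str]) -> List[str]:
    # Keep the "good" URLs in their original order.
    return [u for u in urls if is_good_url(u)]
```

Plain substring matching is deliberately crude: a path such as `/warehousing/` would also be dropped, so a production filter would likely match on whole path segments or use per-site rules.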
13
Removing Duplicate Web Pages
URL: http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/education_threats.html
DUP: http://seawifs.gsfc.nasa.gov/OCEAN_PLANET/HTML/ocean_planet_book_threats.html
TITLE: Ocean Planet: Threats
SNIPPET: Threats to the health of the oceans Oil spills account for only about five percent of the oil entering the oceans The Coast Guard estimates that for United States waters sewage treatment plants discharge twice as much oil each year as tanker spills Each year industrial household cleaning gardening and automotive products pollute water About 65 000 chemicals are used commercially in the United States today with about 1 000 new ones added each year Only about 300 have been extensively tested for toxicity It is estimated that medical waste that washed up onto Long Island and New Jersey beaches in the summer of 1988 cost as much as 3 billion in lost revenue from tourism and recreation.
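The slide shows two URLs serving the same page text. The slides do not say how duplicates are detected; a common baseline, sketched here as an assumption, is to hash a normalized copy of each page's text and treat pages with equal hashes as duplicates. The names `fingerprint` and `remove_duplicates` are hypothetical.

```python
import hashlib
import re
from typing import Dict, Iterable, List, Tuple


def fingerprint(text: str) -> str:
    # Normalize: lowercase, drop punctuation, collapse whitespace.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()


def remove_duplicates(pages: Iterable[Tuple[str, str]]) -> Tuple[List[str], Dict[str, str]]:
    """Take (url, text) pairs; return kept URLs and a map of duplicate URL -> kept URL."""
    first_seen: Dict[str, str] = {}  # fingerprint -> first URL seen with that text
    kept: List[str] = []
    dups: Dict[str, str] = {}
    for url, text in pages:
        fp = fingerprint(text)
        if fp in first_seen:
            dups[url] = first_seen[fp]
        else:
            first_seen[fp] = url
            kept.append(url)
    return kept, dups
```

This catches only pages whose text is identical after normalization. Pages that differ in navigation or footer text would need a near-duplicate technique such as shingling, which is outside this sketch.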
14
Learning Level Stratification
15
Categorizing Content
Audience: Student or Teacher
Grade Level: K-3, 4-6, 7-9, 10-12, College
Content Type: Interactive Activities, Lesson Plans, Reference Materials, Science Fair Projects, Videos
Subject Area: Chemistry, Computer Science, Energy, Life Sciences, Mathematics, Physics
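The four facets on this slide map naturally onto a small controlled vocabulary. The sketch below uses the facet values from the slide; the `Resource` record, the `facet_filter` helper and the choice of exactly one value per facet are assumptions for illustration, since the slides do not describe the underlying data model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List

# Facet values as listed on the slide.
FACETS: Dict[str, FrozenSet[str]] = {
    "audience": frozenset({"Student", "Teacher"}),
    "grade_level": frozenset({"K-3", "4-6", "7-9", "10-12", "College"}),
    "content_type": frozenset({
        "Interactive Activities", "Lesson Plans", "Reference Materials",
        "Science Fair Projects", "Videos",
    }),
    "subject_area": frozenset({
        "Chemistry", "Computer Science", "Energy",
        "Life Sciences", "Mathematics", "Physics",
    }),
}


@dataclass(frozen=True)
class Resource:
    url: str
    audience: str
    grade_level: str
    content_type: str
    subject_area: str

    def __post_init__(self) -> None:
        # Reject values outside the controlled vocabulary.
        for facet, allowed in FACETS.items():
            value = getattr(self, facet)
            if value not in allowed:
                raise ValueError(f"{value!r} is not a valid {facet}")


def facet_filter(resources: Iterable[Resource], **wanted: str) -> List[Resource]:
    # Keep resources that match every requested facet value.
    return [r for r in resources if all(getattr(r, k) == v for k, v in wanted.items())]
```

For example, `facet_filter(resources, audience="Teacher", grade_level="4-6")` would narrow a result list to teacher materials for grades 4-6.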
16
A Look at the Technology Under the Hood
Thank you!
Abe Lederman
abe@deepwebtech.com
www.deepwebtech.com