Search Bootstrapping
How / Where to get started
Crawling
Start with Nutch
– http://nutch.apache.org/
Index directly to Solr
– http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
Create a seed list from the DMOZ RDF dump
– http://www.dmoz.org/rdf.html
– http://wiki.apache.org/nutch/NutchTutorial
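As a rough illustration of the seed-list step, the sketch below pulls URLs out of a DMOZ RDF content dump and writes one per line, which is the plain-text layout a crawler seed file normally uses. It assumes each listed site appears in the dump as an `<ExternalPage about="URL">` element; the function names, the file paths, and the `every_nth` sampling parameter are hypothetical and are not part of Nutch.

```python
import html
import re
from typing import Iterable, Iterator

# Assumption: each listed site appears as <ExternalPage about="URL"> in the dump.
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]+)"')


def seed_urls(lines: Iterable[str], every_nth: int = 1) -> Iterator[str]:
    """Yield every `every_nth`-th URL found in ExternalPage elements."""
    seen = 0
    for line in lines:
        for match in EXTERNAL_PAGE.finditer(line):
            if seen % every_nth == 0:
                # Attribute values are XML-escaped (e.g. &amp;), so unescape them.
                yield html.unescape(match.group(1))
            seen += 1


def write_seed_file(rdf_path: str, seed_path: str, every_nth: int = 1000) -> int:
    """Write a sampled seed list, one URL per line; return how many were written."""
    count = 0
    # errors="replace" tolerates stray invalid bytes in a large dump.
    with open(rdf_path, encoding="utf-8", errors="replace") as src, \
            open(seed_path, "w", encoding="utf-8") as dst:
        for url in seed_urls(src, every_nth):
            dst.write(url + "\n")
            count += 1
    return count
```

Sampling every Nth entry keeps a first crawl small; a call such as `write_seed_file("content.rdf.u8", "urls/seed.txt")` (both paths hypothetical) would produce a file to hand to the crawler's inject step.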
Understanding Content
Entity Extraction
– LingPipe http://alias-i.com/lingpipe/
– OpenNLP http://incubator.apache.org/opennlp/
Entity Identification / Taxonomies
– Freebase http://www.freebase.com/
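To make the idea of entity extraction against a taxonomy concrete, here is a toy dictionary (gazetteer) lookup. It is a deliberately simplified stand-in, not the statistical models that LingPipe or OpenNLP provide, and the gazetteer entries, type names, and function name are made up for illustration; in practice the name-to-type mapping might be derived from a taxonomy source such as Freebase.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical gazetteer: lower-cased surface form -> entity type.
GAZETTEER: Dict[str, str] = {
    "apache nutch": "Software",
    "solr": "Software",
    "san francisco": "Location",
}


def find_entities(text: str, gazetteer: Dict[str, str] = GAZETTEER) -> List[Tuple[str, str, int]]:
    """Return (matched text, entity type, start offset) for each gazetteer hit."""
    if not gazetteer:
        return []
    # Try longer names first so a multi-word name wins over a shorter one it contains.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.group(1), gazetteer[m.group(1).lower()], m.start())
        for m in pattern.finditer(text)
    ]
```

A lookup like this only finds names it already knows and cannot disambiguate them by context, which is the gap the trained models in the libraries above are meant to fill.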
Some Additional Links
Basic Web Page Parser
– https://github.com/pjaol/Webcrawler
Example of OpenNLP usage
– https://github.com/pjaol/entity_extractor