How to Use LucidWorks Search Sagnik Ray Choudhury Sagnik@psu.edu
Installation and Search Components. Access control. Crawling: Aperture crawler (web, filesystem, Amazon S3 bucket). Information extraction: Aperture parser. Indexing: Lucene. Ranking. Result interface: standard/Flare interface. lucidworks IST 441 PSU
Start Page http://ist441.ist.psu.edu:8988
Access Control: Admin Panel. Admin screen: log in here (username admin, password admin).
Admin Dashboard: user control, collections.
Adding Users. If you use a local installation: creating users is optional. If you use the server installation: create a new user with admin privileges, then delete the admin account. Do not use PSU/IST credentials. Creating a new user. Deleting admin.
Crawling: Step 1. Add a new collection with the default template.
Crawling: Choosing a Data Source. Click on the new collection. Note the index size and the number of documents. Add a new data source (web site).
Crawling: Parameter Selection. Name, URL, crawl depth. Constrain to: allow crawling within the site or outside the site. Include paths: the particular set of pages you wish to crawl. Exclude paths: file types or pages you do not want to crawl. This is a small-scale, single-threaded crawler; for better performance, Nutch can be integrated. http://docs.lucidworks.com/display/help/Create+a+New+Web+Site+Data+Source
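How crawl depth, include paths and exclude paths interact can be sketched in a few lines of Python. This is not LucidWorks code. It assumes the paths are regular expressions matched against the full URL, that exclude rules win over include rules, and that a negative depth means no limit. The function name should_crawl and the example.edu URLs are hypothetical.

```python
import re


def should_crawl(url, include_paths, exclude_paths, depth, max_depth):
    """Decide whether a URL passes the crawl filters (illustrative sketch only)."""
    # Assumption: a negative max_depth means "no depth limit".
    if max_depth >= 0 and depth > max_depth:
        return False
    # Assumption: exclude patterns take precedence over include patterns.
    if any(re.search(pattern, url) for pattern in exclude_paths):
        return False
    # With include patterns present, the URL must match at least one of them.
    if include_paths:
        return any(re.search(pattern, url) for pattern in include_paths)
    # No include patterns: everything that was not excluded is allowed.
    return True
```

For example, with the include path /courses/ and the exclude path \.pdf$, an HTML page under /courses/ is crawled, while a PDF in the same folder is skipped.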
Starting the Crawling Process. Click Create to move to the crawl-job screen. Start crawling (you can also add a schedule to crawl periodically). You can add another website by going back to the collection page (slide 8).
Information Extraction and Indexing. Information is extracted from the crawled web pages. Default: Aperture parser. Fallback: Apache Tika. Extracted information: author, full text, date, etc. http://docs.lucidworks.com/display/lweug/Overview+of+Crawling (field mapping section). Information extraction and indexing run simultaneously with the crawling. You need to do a “hard commit” to ensure that the index is up to date. To learn more about the index, go to the Solr page for the collection.
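The hard commit can also be requested over HTTP, since the index is served by Solr and its update handler accepts a commit=true parameter. The sketch below only builds and, optionally, sends that request. The base URL and collection name are placeholders, not values from these slides; use the Solr address shown on your collection's Solr page. The helper names build_commit_url and hard_commit are hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_commit_url(solr_base, collection):
    """Build the update-handler URL that asks Solr for a hard commit."""
    query = urlencode({"commit": "true"})
    return f"{solr_base.rstrip('/')}/{quote(collection)}/update?{query}"


def hard_commit(solr_base, collection, timeout=10):
    """Send the commit request and return the HTTP status code."""
    # Assumption: the Solr endpoint is reachable without authentication.
    with urlopen(build_commit_url(solr_base, collection), timeout=timeout) as response:
        return response.status
```

Calling build_commit_url("http://localhost:8983/solr", "my_collection") gives a URL ending in /my_collection/update?commit=true. Here the host, port and collection name are examples only.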
Searching. Default interface: click on the “tools” link on the top panel.
Searching: Flare Interface. The “Apps” page links to the starting point for the Flare interface. For advanced searching and statistics, click on your collection.
Conclusion. Basic crawling, indexing and searching using LucidWorks. Simple to use, but does not offer much flexibility. Things to try: incorporating new crawlers; changing the information extraction process; changing the indexing schema and ranking functions. Questions?