Kelly Boccia Abi Natarajan Konstantin Livitski Senthil Anand Subbanan Meyyappan 1.


1 Kelly Boccia Abi Natarajan Konstantin Livitski Senthil Anand Subbanan Meyyappan

2 Agenda
Business Requirements
• Client Overview
• Business Problem
• Business Goal
• Solution and Scope
Technical Specification
• System Context
• Architecture Overview
• Components & Modules
• Security Model
• Document indexing
• Search Explained
Implementation Plan
• Resource & Costs
• Development Environment
• Production Environment
• Success Criteria
Prototype
Q&A

3 Multi-National Manufacturing & Sales Corporation
Business Growth - Multiple Applications - Multiple Repositories
Business Problem

4 Business Goal
Organize Intellectual Capital and Assets
• Accessibility - Connect knowledge workers securely to relevant information
• Productivity - Increase productivity and reduce re-work by leveraging knowledge and expertise
Client Overview

5 Solution Enterprise Knowledge Management Platform

6 System Context

7 Components & Modules

8 Architectural Overview

9 Security Model
• Integrated with GLOCO's existing security infrastructure
• Any access requires authentication
• To follow a link in search results, a user may need additional authorization for repository access

10 Document indexing
• A document is anything that a search result can point at
• Documents are external to the search engine
• Documents include text and metadata
• Lucene sees each document as a set of named fields

11 How search works
• Lucene sees each document as a set of named fields
• A record is created for each document to store some fields
  o URL is usually a stored field
• The main index is keyed by search term (i.e. inverted)
  o Typical text fields are tokenized, filtered, and stemmed into terms
  o Indexed fields may be discarded after processing
  o For each term, a list of document IDs is stored to help locate records
  o Also stores frequency and proximity
• Search involves retrieval of document IDs by term, and stored fields by the document ID
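The indexing and search steps on this slide can be illustrated with a toy inverted index. This is a minimal sketch in plain Python and not Lucene's API: the names `analyze`, `ToyIndex`, and `STOP_WORDS`, the crude "strip a trailing s" stemmer, and the example URLs are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed stop-word list; a real analyzer would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def analyze(text):
    """Tokenize, filter stop words, and crudely stem (strip a trailing 's')."""
    terms = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOP_WORDS:
            continue
        if len(token) > 3 and token.endswith("s"):
            token = token[:-1]
        terms.append(token)
    return terms


class ToyIndex:
    def __init__(self):
        self.stored = []                    # doc ID -> stored fields (the record)
        self.postings = defaultdict(dict)   # term -> {doc ID: [term positions]}

    def add(self, url, body):
        doc_id = len(self.stored)
        self.stored.append({"url": url})    # only the URL is stored; the body is discarded
        for position, term in enumerate(analyze(body)):
            # Positions give proximity; the list length gives frequency.
            self.postings[term].setdefault(doc_id, []).append(position)
        return doc_id

    def search(self, query):
        """Return stored fields of documents that contain every query term."""
        terms = analyze(query)
        if not terms:
            return []
        ids = None
        for term in terms:
            found = set(self.postings.get(term, {}))
            ids = found if ids is None else ids & found
        return [self.stored[i] for i in sorted(ids)]


index = ToyIndex()
index.add("http://repo.example/a", "Blood glucose meters for testing")
index.add("http://repo.example/b", "Glucose levels in the blood")
print([hit["url"] for hit in index.search("blood glucose")])
```

The query text goes through the same `analyze` step as the documents, which is why 'Blood Glucose' and 'blood glucose' find the same records.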

12 Resource / Cost Plan
• 21 weeks total effort
• 13-member team including GLOCO and Innova
• Innova supports full SDLC with phases
  o Solution Outline, High Level Design, Detailed Design
  o Build / Test / Deploy and Post Production Support

13 SLATES - Development Environment
• Developer workstation to host virtual images
• Developer workstation to share development search servers
• Fully configured environment for unit testing and development

14 SLATES - QA / Test and Production
• Sticky load balancer to remember the serving Tomcat
• Each search server to hold multiple instances
• Shared / cached network storage to share the index
• Similar configuration for both QA and Production environments
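The "sticky" behaviour of the load balancer can be sketched as a session table: the first request of a session is assigned a Tomcat instance, and later requests from the same session return to it. This is only an illustration under assumptions; the class `StickyBalancer`, the round-robin first assignment, and the server names are hypothetical, and the slides do not say how the real balancer tracks sessions.

```python
import itertools


class StickyBalancer:
    def __init__(self, servers):
        self._next_server = itertools.cycle(servers)  # round-robin for new sessions
        self._assigned = {}                           # session ID -> serving instance

    def route(self, session_id):
        """Return the instance for this session, assigning one on first contact."""
        if session_id not in self._assigned:
            self._assigned[session_id] = next(self._next_server)
        return self._assigned[session_id]


# Hypothetical instance names, for illustration only.
balancer = StickyBalancer(["tomcat-1", "tomcat-2"])
for session in ["user1", "user2", "user1"]:
    print(session, "->", balancer.route(session))
```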

15 Success Criteria and Benchmarks
The most important project success criteria are:
• 10% time and resource savings on certain R&D activities
• 75% positive feedback on user surveys
• 50% of the target user group are actively using the system
• 5% of available documents have user-defined tags

16 User 1 searches for the keyword ‘Blood Glucose’

17 User 1 gets back the results with the keyword ‘blood glucose’

18 User 1 adds tag ‘diabetes’ to a result

19 Tag ‘diabetes’ is immediately available for searching

20 User 2 searches for keyword ‘diabetes’

21 User 2 gets back a result for keyword ‘diabetes’

22 User 2 clicks on keyword ‘bp testing’ in the tag cloud

23 User 2 gets more results for keyword ‘bp testing’

24 Thank you!
Innova would like to thank:
• Zoya Kinstler
• Jeff Parker
• Basem Naseim
• Valar Jayaprakash
• Classmates
Harvard University Extension School

25 Questions?

26 Reference Slides

27

28 Index Growth
• Index size is a percentage of the document corpus size
• Maintenance trade-off:
  o Expensive segment merges - load all segments, write a new one
  o Fragmented index is expensive to query - must read all segments
• Lucene index segments are write-once - helps with concurrency
• Updates are done as a delete followed by a re-add. Updates should be batched
  o Direct tagging is inefficient
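The trade-off on this slide can be seen in a toy model of write-once segments. The sketch below is an assumption-laden illustration, not Lucene code: `SegmentedIndex` and its methods are hypothetical names, a segment is modelled as a dictionary that is never changed after it is written, and deletions are tracked in a separate set per segment.

```python
class SegmentedIndex:
    def __init__(self):
        self.segments = []   # list of (docs, deleted); docs is never modified once written
        self.buffer = {}     # pending updates, not yet visible to readers

    def update(self, doc_id, text):
        """Queue a new version of a document (the re-add half of an update)."""
        self.buffer[doc_id] = text

    def flush(self):
        """Mark old versions deleted and write all pending updates as one new segment."""
        if not self.buffer:
            return
        for docs, deleted in self.segments:
            deleted.update(doc_id for doc_id in self.buffer if doc_id in docs)
        self.segments.append((dict(self.buffer), set()))
        self.buffer = {}

    def merge(self):
        """Read every segment and write a single new one with only live documents."""
        live = {}
        for docs, deleted in self.segments:
            for doc_id, text in docs.items():
                if doc_id not in deleted:
                    live[doc_id] = text
        self.segments = [(live, set())]

    def get(self, doc_id):
        """Look a document up; a fragmented index means more segments to check."""
        for docs, deleted in self.segments:
            if doc_id in docs and doc_id not in deleted:
                return docs[doc_id]
        return None


# Tagging one document three times, flushing after every tag (unbatched).
unbatched = SegmentedIndex()
for tag in ("diabetes", "glucose", "bp testing"):
    unbatched.update("doc-1", "report " + tag)
    unbatched.flush()

# The same three updates, flushed once (batched).
batched = SegmentedIndex()
for tag in ("diabetes", "glucose", "bp testing"):
    batched.update("doc-1", "report " + tag)
batched.flush()

print(len(unbatched.segments), len(batched.segments))
```

In this model every unbatched tag adds a segment that later reads must visit and a later merge must rewrite, which is one way to read the slide's remark that direct tagging is inefficient.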

29 Scalability
(Source: Mark Miller, "Scaling Lucene and Solr", Lucid Imagination, 2010)
• Query volume is scaled by replication
• Index size and indexing load are scaled by sharding
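The two scaling axes can be sketched together in a few lines. This is a toy model and not Solr's distributed search: the classes `Shard` and `Cluster`, the CRC32-based routing, the round-robin choice of replica, and the document IDs are assumptions for illustration.

```python
import itertools
import zlib


class Shard:
    """One slice of the index, held as several identical replicas."""

    def __init__(self, replica_count):
        self.replicas = [dict() for _ in range(replica_count)]
        self._turn = itertools.cycle(range(replica_count))

    def add(self, doc_id, text):
        for replica in self.replicas:      # every replica receives every write
            replica[doc_id] = text

    def search(self, word):
        replica = self.replicas[next(self._turn)]   # spread queries across replicas
        return [d for d, text in replica.items() if word in text.lower().split()]


class Cluster:
    def __init__(self, shard_count, replica_count):
        self.shards = [Shard(replica_count) for _ in range(shard_count)]

    def add(self, doc_id, text):
        # Sharding: each document lives in exactly one shard, chosen by a stable hash.
        shard = self.shards[zlib.crc32(doc_id.encode()) % len(self.shards)]
        shard.add(doc_id, text)

    def search(self, word):
        # A query fans out to every shard and the partial results are combined.
        hits = []
        for shard in self.shards:
            hits.extend(shard.search(word))
        return sorted(hits)


cluster = Cluster(shard_count=2, replica_count=2)
cluster.add("doc-1", "blood glucose meter")
cluster.add("doc-2", "bp testing guide")
cluster.add("doc-3", "glucose tablets")
print(cluster.search("glucose"))
```

Adding replicas leaves each shard's size unchanged but gives queries more copies to read from; adding shards shrinks each slice and divides the indexing work.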

30 Phase 1 - Work Breakdown Chart
• 21 weeks total effort
• 13-member team including GLOCO and Innova
• Innova supports full SDLC with phases - Solution Outline, High Level Design, Detailed Design, Build / Test / Deploy and Post Production Support

31 Use Case - Search and Tag

32 Hardware / Software - Detailed Configuration

33 Interface Specification

