Presentation is loading. Please wait.

Presentation is loading. Please wait.

Content Analytics - Gaining Insight from Your Content with NOSQL.

Similar presentations


Presentation on theme: "Content Analytics - Gaining Insight from Your Content with NOSQL."— Presentation transcript:

1 Content Analytics - Gaining Insight from Your Content with NOSQL

2 Hey, I just met you...And this is crazy. But here's my info...So connect with me, maybe? Kyle Adams Americas Support Team Lead, Alfresco Twitter: @kylefadams LinkedIn: www.linkedin.com/in/kylefernandadamswww.linkedin.com/in/kylefernandadams Email Me: kyle.adams@alfresco.comkyle.adams@alfresco.com

3 Carly Rae Jepsen Anyone?...I apologize in advance

4 Agenda The NOSQL Universe Overview of Graph Databases and Neo4j Alfresco & Neo4j Use Cases

5 What is Content Analytics Applying Business Intelligence and Analytics to ECM Reporting != Analytics Identifying trends, patterns, and correlations Tell me what I don’t know!

6 Why NOSQL?

7 Driving Trends - Data Size Data size is increasing exponentially year after year

8 Driving Trends - Connectivity of Data Web 2.0 Flat Docs Hypertext “Web 3.0” Web 1.0.0 Feeds Hypertext Folksonomies Hypertext Tagging Hypertext Wikis Hypertext UGC Hypertext Blogs RDFa Hypertext Ontologies Hypertext GGG Hypertext

9 Driving Trend: Semi-Structured Information Content is becoming more unique We can blame generation Y for this! Before: Job Title - Software Engineer After: Job Title - ZOMG Awesome Core Repository Developer

10 RDBMS Performance Curve

11 NOSQL Database Types Column Family GraphDocument Key-Value

12 Key-Value Stores Most based on Dynamo white paper Dynamo: Amazon’s Highly Available Key Value Store (2007) Data Model Global key-value mapping Massively scalable HashMap Highly Fault Tolerant Examples Riak, Redis, Voldemort

13 Key Value Stores: Strengths and Weaknesses Strengths Simple data model Horizontally scalable Weaknesses Simple data model Poor at handling complex data

14 Column Family Based on Google’s BigTable white paper BigTable: Distributed Storage System for Structured Data (2006) Tuple of K-V where the key maps to a set of columns Map Reduce for querying and processing Examples Cassandra, HBase, Hypertable

15 Column Family: Strengths and Weaknesses Strengths Data model supports semi-structured data Naturally indexed (columns) Horizontally scalable Weaknesses Does not handle interconnected data well

16 Document-oriented Databases Data Model Collection of documents Document is a key-value collection Index-centric Examples MongoDB, CouchDB, Lotus Notes?

17 Document-oriented Databases: Strengths and Weaknesses Strengths Simple, but powerful data model Good scalability Weaknesses Does not handle interconnected data well Querying is limited to keys and indexes MapReduce for large queries

18 Graph Databases Data Model Nodes with properties Named relationships with properties Examples Neo4j, Sones GraphDB, OrientDB, InfiniteGraph, AllegroGraph

19 Graph Databases: Strengths and Weaknesses Strengths Extremely powerful data model Performant when querying interconnected data Easily to query Weaknesses Sharding Rewiring your brain

20 Complexity vs Size

21 Typical Use Cases for Graph Databases Recommendations Business Intelligence Social Computing Master Data Management Geospatial Genealogy Time Series Data Web Analytics Bioinformatics Indexing RDBMS

22 Maturity of Data Models Most NOSQL: ~6 years Relational: 42 years Graph Theory: 276 years

23 Leonhard Euler Inventor of Graph Theory (1736) Swiss mathematician The original hipster

24 Graph Data Model Person City Event

25 Graph Data Model Person City Event Is Attending Hosted In Is Located In Rated

26 Graph Data Model Person City Event Is Attending Hosted In Is Located In firstName:kyle lastName:adams name:DevCon name:San Jose country:USA Rated score: 11 out of 10 comment: Amazing!!!

27 Neo4j (...finally)

28 What is Neo4j? Leading Open Source graph database Embeddable and Server ACID compliant White board friendly Stable Has been in 24/7 operation since 2003

29 More Reasons Why Neo4j is Great High performance graph operations Traverse 1,000,000+ relationships/sec on commodity hardware 32 billion nodes & relationships per Neo4j instance 64 billion properties per Neo4j instance Small footprint Standalone server is ~65mb

30 If NOSQL stands for Not Only SQL,....then how do we execute queries?!

31

32

33

34 Traversals!

35 Social Network Performance MySQL vs Neo4j

36 Social Network Performance First rule of fight club: Run a friends of friends query Second rule of fight club: 1,000 Users Third rule of fight club: Average of 50 friends per user Fourth rule of fight club: Limit the depth of 5 Fifth rule of fight club: Intel i7 commodity laptop w/8GB RAM The Experiment: Round 1

37 Social Network Performance RDBMS Schema

38 Social Network Performance select distinct uf3.* from t_user_friend uf1 inner join t_user_friend uf2 on uf1.user_1 = uf2.user_2 inner join t_user_friend uf3 on uf2.user_1 = uf3.user_2 where uf1.user_1 = ? SQL: Friends of friends at depth 3

39 Social Network Performance MySQL Results: Round 1- 1,000 Users Depth Execution Time (sec) Records Returned 20.028~900 30.213~999 410.273~999 592,613.150~999

40 Social Network Performance Social Graph

41 Social Network Performance Neo4j Traversal API TraversalDescription traversalDescription = Traversal.description().relationships("IS_FRIEND_OF",Direction.OUTGOING).evaluator(Evaluators.atDepth(2)).uniqueness(Uniqueness.NODE_GLOBAL); Iterable nodes = traversalDescription.traverse(nodeById).nodes();

42 Social Network Performance Neo4j Results: Round 1- 1,000 Users Depth Execution Time (sec) Records Returned 20.04~900 30.06~999 40.07~999 50.07~999

43 Social Network Performance First rule of fight club: Run a friends of friends query Second rule of fight club: 1,000,000 Users Third rule of fight club: Average of 50 friends per user Fourth rule of fight club: Limit the depth of 5 Fifth rule of fight club: Intel i7 commodity laptop w/8GB RAM The Experiment: Round 2

44 Social Network Performance MySQL Results: Round 1- 1,000,000 Users Depth Execution Time (sec) Records Returned 20.016~2,500 330.267~125,000 4 1,543.505 ~600,00 5 Did not finish after an hour N/A

45 Social Network Performance Neo4j Results: Round 1- 1,00,000 Users Depth Execution Time (sec) Records Returned 20.010~2,500 30.168~110,000 41.359~600,000 52.132~800,000

46 Social Network Performance Why is RDBMS performance horrible? To find all friends on depth 5,MySQL will create Cartesian product on t_user_friend table 5 times Resulting in 50,000^5 records return All except 1,000 are discarded Neo4j will simply traverse through the nodes in the database until there are no more nodes in which to traverse

47 Social Network Performance The power of traversals Graphs data structures are localized Count all of the people around you Adding more people in the room may only slightly impact your performance to count your neighbors

48 Alfresco + Neo4j Integration

49 CMIS Web Scripts Spring Data Alfresco Insight CMIS: Document & Folders Web Scripts: Users & Groups Spring Data: Persist to Neo4j OpenNLP: Parse text and leave relevant words for analysis

50 Alfresco + Neo4j Use Cases

51 Retail Analytics: ACME Mullet Group ‘Murican Mullets Euro Mullets Mullets 4 Musicians

52 CMIS Web Scripts Alfresco Insight Spring Data Creates Social ContentFollows Retail Analytics: ACME Mullet Group

53 Healthcare Analytics: Chronic Disease Management Congestive Heart Failure (CHF)

54 CHF patients have the highest number of 30-day readmissions in the United States 1 in 5 patients suffer from preventable readmissions Preventable 30-day readmissions account for ~$17 billion in total Medicare costs Under the Patient Protection and Affordable Care Act Centers for Medicare & Medicaid services will be penalizing providers for high rates of preventable 30-day readmissions Healthcare Analytics: Chronic Disease Management

55 Use comparative analytics to compare graphs of patients with CHF and family members with CHF Preventative Care via Identifying indicators for CHF Problem: structured data only represent ~20% of total available data (i.e. data from EMR’s) Solution: Import unstructured data such as nurses notes and lab results and provide the ability to analyze such data Healthcare Analytics: Chronic Disease Management

56 Alfresco + Neo4j Integration Progress

57 CMIS integration Web Script integration Users and Groups Alfresco + Neo4j: What have I Completed to Date

58 CMIS Integration Web Script Integration Workflow Auditing Activities UI Graph Visualization Charting Activiti integration Create a process instance to analyze data if pattern X exists in the graph Alfresco + Neo4j: What’s Next

59 https://github.com/kylefernandadams/alfresco-insight Soon to be on GitHub

60 Neo Technology Neo4j and Spring Data Communities Authors of Neo4j in Action Credits

61 Questions?!


Download ppt "Content Analytics - Gaining Insight from Your Content with NOSQL."

Similar presentations


Ads by Google