Recommendation Engines & Accumulo - Sqrrl


1 Recommendation Engines & Accumulo - Sqrrl
Data Science Group, May 21, 2013

2 Agenda
6:15 - 6:30 More data trumps better algorithms, by Michael Walker
6:30 - 7:30 Recommendation Engines, by Tom Rampley
7:30 - 8:30 Accumulo - Sqrrl, by John Dougherty
8:30 - 9:30 Network at Old Chicago at 14th and Market

3 Data Science Group
New Sponsors: Cloudera, O'Reilly Media

6 More data is better
Even if less exact or messier

11 One sensor = strict accuracy
Multiple sensors = less accurate & messy
More data points = greater value
Aggregate = more comprehensive picture
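
The sensor comparison above can be sketched as a small simulation. This is a minimal illustration, not part of the talk: it assumes each sensor's error is independent, zero-mean Gaussian noise, and the names and values (TRUE_TEMP, the sigma values, the sensor counts) are hypothetical. Under that assumption the error of the average shrinks as sigma / sqrt(n), so many messy readings can together beat one precise reading.

```python
import math
import random


def standard_error(sigma, n):
    """Standard deviation of the mean of n independent readings with noise sigma."""
    return sigma / math.sqrt(n)


def simulate_mean(true_value, sigma, n, rng):
    """Average n simulated readings that carry zero-mean Gaussian noise."""
    readings = [rng.gauss(true_value, sigma) for _ in range(n)]
    return sum(readings) / n


rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_TEMP = 20.0          # hypothetical true value being measured

# One precise sensor (assumed noise 0.1) versus 1000 cheap ones (assumed noise 2.0).
one_precise = simulate_mean(TRUE_TEMP, sigma=0.1, n=1, rng=rng)
many_cheap = simulate_mean(TRUE_TEMP, sigma=2.0, n=1000, rng=rng)

print(f"one precise sensor   : {one_precise:.3f}")
print(f"mean of 1000 sensors : {many_cheap:.3f}")
print(f"expected error, 1 precise sensor : {standard_error(0.1, 1):.3f}")
print(f"expected error, 1000 cheap ones  : {standard_error(2.0, 1000):.3f}")
```

The benefit depends on the independence assumption: a bias shared by every sensor does not average out, no matter how many readings are collected.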

17 Increase frequency of sensor readings
One measure per min = accurate
100 readings per second = less accurate
> volume vs. exactitude
Accept messiness to get scale

20 Sacrifice accuracy in return for knowing general trend
Big data = probabilistic (not precise)
Good yet has problems
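
The trade of per-reading accuracy for a reliable general trend can be sketched with a least-squares fit. This is an illustrative assumption, not the speaker's method: the drift rate (TRUE_SLOPE), the noise level (NOISE), and the sampling rate of 100 readings per second over one minute are all hypothetical values.

```python
import random


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(7)   # fixed seed so the run is repeatable
TRUE_SLOPE = 0.05        # hypothetical drift per second
NOISE = 1.0              # hypothetical per-reading noise, large relative to the drift

# 100 readings per second for 60 seconds: each reading is messy on its own.
times = [i / 100.0 for i in range(6000)]
readings = [TRUE_SLOPE * t + rng.gauss(0.0, NOISE) for t in times]

estimated_slope = least_squares_slope(times, readings)
print(f"true trend      : {TRUE_SLOPE:.4f} per second")
print(f"estimated trend : {estimated_slope:.4f} per second")
```

No single reading is trustworthy here, yet the fitted trend is recovered closely because the noise is assumed to be independent from reading to reading.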

27 Internet of Things
"Data Science" means the scientific study of the creation, manipulation and transformation of data to create meaning.

28 Internet of Things
"Data Scientist" means a professional who uses scientific methods to liberate and create meaning from raw data.

29 Internet of Things
"Big Data" means large data sets that have different properties from small data sets and that require special data science methods to differentiate signal from noise and extract meaning, as well as special compute systems and power.

32 Data Science
"Signal" means a meaningful interpretation of data based on science that may be transformed into scientific evidence and knowledge.

33 Data Science
"Noise" means a competing interpretation of data not grounded in science that may not be considered scientific evidence. Yet noise may be manipulated into a form of knowledge (what does not work).

37 Machine Learning
Field of study that gives computers the ability to learn without being explicitly programmed.

41 Algorithms
A process or set of rules to be followed in calculations or other problem-solving operations to achieve a goal, especially a mathematical rule or procedure used to compute a desired result, produce the answer to a question, or reach the solution to a problem in a finite number of steps.

46 More data trumps better algorithms
Microsoft Word Grammar Checker
Improve algorithms
New techniques
New features

49 More data trumps better algorithms
Feed more data into existing methods
Most machine-learning algorithms: one million words or less
Experiment: 10 million to 1 billion words

50 More data trumps better algorithms
Results: algorithms improved dramatically
The simple algorithm that was the worst performer with half a million words performed better than all the others with 1 billion words
The algorithm that worked best with half a million words performed worst with 1 billion words
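
The idea of feeding more data into an existing method can be illustrated with a toy simulation. It does not reproduce the grammar-checker experiment described above, and every name and number in it (NUM_CONTEXTS, LABEL_NOISE, the training sizes) is a hypothetical choice. The method is deliberately simple: it memorizes the majority label seen for each context and falls back to the overall majority for contexts it has never seen.

```python
import random
from collections import Counter, defaultdict

NUM_CONTEXTS = 1000   # hypothetical number of distinct contexts
LABEL_NOISE = 0.1     # fraction of examples whose label is flipped (messy data)


def make_examples(n, true_labels, rng):
    """Draw n (context, label) pairs, flipping a share of the labels at random."""
    examples = []
    for _ in range(n):
        context = rng.randrange(NUM_CONTEXTS)
        label = true_labels[context]
        if rng.random() < LABEL_NOISE:
            label = 1 - label
        examples.append((context, label))
    return examples


def train(examples):
    """Memorize the majority label per context plus an overall fallback label."""
    counts = defaultdict(Counter)
    overall = Counter()
    for context, label in examples:
        counts[context][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]
    table = {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}
    return table, default


def accuracy(model, examples):
    """Share of examples whose label the model predicts correctly."""
    table, default = model
    hits = sum(1 for c, y in examples if table.get(c, default) == y)
    return hits / len(examples)


rng = random.Random(0)   # fixed seed so the run is repeatable
true_labels = [rng.randrange(2) for _ in range(NUM_CONTEXTS)]
test_set = make_examples(5000, true_labels, rng)

# Same method every time; only the amount of training data changes.
results = {}
for size in (100, 1000, 10000, 100000):
    model = train(make_examples(size, true_labels, rng))
    results[size] = accuracy(model, test_set)
    print(f"{size:>7} training examples -> accuracy {results[size]:.3f}")
```

The method never changes between runs. In this toy setting, accuracy improves only because more data covers more contexts and outvotes the flipped labels, and it levels off near the ceiling set by the label noise.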

51 More data trumps better algorithms
Conclusions:
More trumps less
More trumps smarter (not always)

52 More data trumps better algorithms
Tradeoff between spending time and money on algorithm development versus spending it on data development

53 More data trumps better algorithms
Google language translation
1 billion words
1 trillion words
Larger yet messier data set - entire internet
