Beyond Machine Learning - What Is Hidden In Your Data

Presentation transcript:

Beyond Machine Learning - What Is Hidden In Your Data?
Marvin Weinstein (Mweinstein@QuantumInsightsLLC.com)

Not Big - Complex
The problem with BIG data is not that it is big; it is that it is complex, and also that it is noisy and full of artifacts, it contains irrelevant data, we don’t know how to model it, and it is dense.
Complex: the data is unstructured, comes from many sources, and we don’t know how to model it. Noisy and full of artifacts: this usually implies the need to clean the data, which is dangerous and introduces bias. Queries only find what we are looking for, so queries introduce bias. Queries and hypotheses can’t find hidden, unexpected surprises.

A Paradigm Shift
Dynamic Quantum Clustering (DQC) provides an unbiased view of data and discovers hidden information without knowing in advance that there is something to be found. DQC is data agnostic, unbiased, free of false positives, and visual, and it maintains contact with the original data.
Data agnostic: the analyst doesn’t need subject matter expertise before exploring the data. Unbiased: no assumptions are made in advance, and DQC can vet the data for usefulness. Robust: it works with raw data, no cleaning necessary. Sensitive: it can find small outliers. Ayasdi, among others, finds structure but has trouble relating it to the original data.

How Can DQC Be Of Benefit To Your Business?
Cleaning data is unnecessary. No hypothesis generation is needed. No domain knowledge is required. DQC can validate data. DQC can identify structures hidden in data that we don’t suspect and wouldn’t know how to model.
Cleaning data is hard, time-consuming and dangerous. Not having to clean your data makes complex projects doable and gets analysts up to speed much faster at much lower cost. Hypothesis generation takes a lot of time and guesswork. It is hard and dangerous, in that an incorrect hypothesis that sort of works can negatively affect future analyses. Not having to form a working hypothesis before exploring the data lets the data speak for itself. It saves time and money and produces unexpected insights. No domain knowledge means that an analyst doesn’t need a lot of time to get up to speed before taking an initial look at the data. An unbiased approach that can tell you whether you are measuring the right things is a huge benefit. Knowing whether what you are measuring and storing is important matters: you could be building up a big data warehouse that contains no actionable information. The value of being able to reveal hidden information that you don’t know is there and wouldn’t know how to look for is obvious. It is also important that the identity of the data points in a structure is never lost: all patterns are immediately translatable back to a subset of the original data.
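The last point, that every pattern can be traced back to a subset of the original data, can be pictured with a small sketch. This is not DQC code: the greedy grouping rule, the `radius` parameter, the record names and the final positions are all hypothetical, chosen only to show that keeping each point's row index is enough to recover the original records in a structure.

```python
import math

def group_by_final_position(final_points, radius):
    """Greedily group point indices whose final positions lie within
    `radius` of the first member of an existing group."""
    groups = []  # list of (representative_point, [original indices])
    for idx, point in enumerate(final_points):
        for representative, members in groups:
            if math.dist(representative, point) <= radius:
                members.append(idx)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append((point, [idx]))
    return [members for _, members in groups]

# Hypothetical original records and their (made-up) positions after evolution.
records = ["player_a", "player_b", "player_c", "player_d"]
final_points = [(0.0, 0.0), (5.0, 5.0), (0.01, 0.0), (5.0, 5.02)]

groups = group_by_final_position(final_points, radius=0.1)
# Because indices are preserved, each group maps straight back to records.
subsets = [[records[i] for i in group] for group in groups]
print(subsets)
```

The point of the design is that the evolution only moves coordinates; the row order never changes, so index `i` in the final frame is still record `i` in the source table.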

DQC Works Across All Domains
DQC has succeeded in dealing with data from: on-line gaming (player segmentation); X-ray chemistry; genomics (Alzheimer’s data); proteomics (Glycoporin/Aquaporin data); homeland security (search for contraband nuclear material); hyperspectral data.

What Does Dense Data Look Like?
Panels: X-ray chemistry; on-line gaming; homeland security/agriculture; cosmology.

What Does DQC Reveal?
Panels: Before DQC; After DQC (35).
This data is from the Sloan Digital Sky Survey. The points are galaxies and the coordinates are real spatial coordinates, i.e. the real angles on the sky and the distance to the galaxy (redshift). On the left is the original 3-dimensional data; on the right is the movie generated by DQC evolution (35 frames), showing the galaxies being attracted to the nearest region of high density. The animation reveals the existence of the filaments and voids that one reads about in the NY Times.
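The idea of points being attracted to the nearest region of high density can be illustrated with a much simpler stand-in. The sketch below is not DQC: it uses plain Gaussian mean shift, a classical density-ascent method, and the kernel width `sigma`, the toy two-blob data and all function names are assumptions made for illustration. Only the count of 35 frames is taken from the slide.

```python
import math
import random

def mean_shift_step(points, originals, sigma):
    """Move each point to the Gaussian-weighted mean of the original data,
    which pulls it toward the nearest dense region."""
    moved = []
    for p in points:
        weights = [math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2))
                   for q in originals]
        total = sum(weights)
        moved.append(tuple(
            sum(w * q[d] for w, q in zip(weights, originals)) / total
            for d in range(len(p))
        ))
    return moved

def evolve(originals, sigma=0.5, n_frames=35):
    """Return the list of frames; frame 0 is the raw data."""
    frames = [list(originals)]
    for _ in range(n_frames):
        frames.append(mean_shift_step(frames[-1], originals, sigma))
    return frames

def spread(points):
    """Largest pairwise distance within a set of points."""
    return max(math.dist(p, q) for p in points for q in points)

# Hypothetical toy data: two noisy blobs of 20 points each.
rng = random.Random(0)
data = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(20)]
data += [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(20)]

frames = evolve(data)
# The first blob contracts toward its density peak over the frames.
print(spread(frames[0][:20]), spread(frames[-1][:20]))
```

Plotting each entry of `frames` in turn would give a movie of the kind described on the slide, with the caveat that mean shift and DQC follow different dynamics.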

Hyperspectral Example After SVD
Panels: Before DQC; After DQC (35).
Each data point represents one of ~600,000 spectra (the strength of reflected light from the quarry at each of ~600,000 pixels). On the left is one view of this dense data. On the right is the movie created by DQC evolution, revealing the complex structure hidden in this data. Note that after the rapid initial changes things slow down, and the complex final structure evolves very slowly. There is no problem deciding that the evolution has essentially come to an end. Each shape in the final structure is important to the final interpretation of the data.
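One way to make "the evolution has essentially come to an end" concrete is to track how far the points move between consecutive frames. The sketch below is an assumption, not the criterion used in the talk: the tolerance `tol`, the function names and the toy frames (points that close half of their remaining gap to a fixed target each frame) are all hypothetical.

```python
import math

def max_displacement(prev_frame, next_frame):
    """Largest distance any single point moved between two frames."""
    return max(math.dist(p, q) for p, q in zip(prev_frame, next_frame))

def first_settled_frame(frames, tol=1e-3):
    """Index of the first frame in which every point moved less than
    `tol` since the previous frame, or None if that never happens."""
    for i in range(1, len(frames)):
        if max_displacement(frames[i - 1], frames[i]) < tol:
            return i
    return None

# Hypothetical toy evolution: fast early motion that slows down over time.
targets = [(0.0, 0.0), (4.0, 4.0)]
frames = [[(1.0, 0.0), (4.0, 3.0)]]
for _ in range(35):
    frames.append([
        tuple(t + 0.5 * (c - t) for c, t in zip(point, target))
        for point, target in zip(frames[-1], targets)
    ])

settled = first_settled_frame(frames, tol=1e-3)
print(settled)
```

A per-frame displacement curve of this kind shows the same qualitative behavior the slide describes: large steps at first, then a long tail of very small ones.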

Some Examples of Hidden Structure
Panels: Thread, string, segment; Simple Cluster Structure.

What These Structures Mean
Colors come from individual threads. Note the dark blue on the right. This funny distribution, which creates lines on the ground, corresponds to one thread in the final structure. The fact that the ground is striated is shown in the black and white picture too. That this is a single material, as indicated by the unique spectral signature, is a surprise.

How Can DQC Benefit You?
It can save time and money and produce better insights by: avoiding the need to clean the data; avoiding time spent generating hypotheses before getting started; identifying the important set of features to use in later analysis with conventional tools; validating that your data contains the information you need; finding hidden information that you wouldn’t know to look for.