Lecture 2: Measures and Data Collection/Cleaning CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier.

Homework 2 Presentations on Biomedical Data Science Due: Next Week on Tuesday? Reorganizing Groups.

Measurements Measurements have inherent assumptions Measurements are often stated very informally – Formalize our measures!

Measurements Measure theory is a bit like grammar, many people communicate clearly without worrying about all the details, but the details do exist and for good reasons. - Maya Gupta, University of Washington

The Problem of Measures Physical intuition for the measure of length: given a body E, the measure of this body, m(E), might be the sum of its components, or points. Let’s take two bodies on the real number line – Body A is the line A = [0, 1] – Body B is the line B = [0, 2] Which is “longer”?

The Problem of Measures Physical intuition for the measure of length: given a body E, the measure of this body, m(E), might be the sum of its components, or points. Let’s take two bodies on the natural number line – Body A is the line A = [0, 1] – Body B is the line B = [0, 2] Which is “longer”?

Solving the Problem of Measures What does it mean for some body (or subset) to be measurable? If a set E is measurable, how does one define its measure? What properties or axioms does measure (or the concept of measurability) obey?

Measure Theory Before we can measure anything we need something to measure! Let’s define a measurable space – A measurable space is a collection of events B, and the set of all outcomes, Ω, also called the sample space.

Events and Sample Spaces Each event, F, is a set containing zero or more outcomes. – Each outcome can be viewed as a realization of an event. The real world can be viewed as a player in a game that makes some move, selecting a single outcome from Ω: – All events in B that contain the selected outcome are said to “have occurred”.

Events and Sample Space Take a deck of 52 cards + 2 jokers Draw a single card from the deck. Sample space: 54 element set, each card is a possible outcome. An event is any subset of the sample space, including a singleton set, or the empty set.

Events and Sample Space Potential events: – “Red and black at the same time without being a joker” – (0 elements) – “The 5 of hearts” – (1 element) – “A king” – (4 elements) – “A face card” – (12 elements) – “A card” – (54 elements)
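
The card example above can be sketched directly with Python sets. This is only an illustration: the names `omega`, `events`, and `occurred`, and the labels used for the two jokers, are assumptions made here and do not come from the slides.

```python
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]

# Sample space: 52 standard cards plus 2 jokers (the joker labels are made up).
omega = frozenset(product(RANKS, SUITS)) | {("joker", "first"), ("joker", "second")}

# Events are subsets of the sample space.
red = frozenset(c for c in omega if c[1] in ("hearts", "diamonds"))
black = frozenset(c for c in omega if c[1] in ("clubs", "spades"))
events = {
    "red and black at the same time": red & black,
    "the 5 of hearts": frozenset({("5", "hearts")}),
    "a king": frozenset(c for c in omega if c[0] == "K"),
    "a face card": frozenset(c for c in omega if c[0] in ("J", "Q", "K")),
    "a card": omega,
}

for name, event in events.items():
    assert event <= omega  # every event is a subset of the sample space
    print(f"{name}: {len(event)} elements")


def occurred(outcome):
    """Names of all events that contain the drawn outcome."""
    return sorted(name for name, event in events.items() if outcome in event)


# Drawing the king of spades makes three of the five events occur.
print(occurred(("K", "spades")))
```

Drawing one card selects a single outcome, and every event containing that outcome is said to have occurred, which is what `occurred` reports.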

Forming an Algebra on B and Ω In order to define measures on B, we need to make sure it has certain properties, those of a σ-algebra. A σ-algebra is a special kind of collection of subsets that is closed under countable-fold set operations (complement, union of countably many sets, and intersection of countably many sets). “Vanilla” algebras are closed only under finite set operations.
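
For a finite sample space the closure properties can be checked by brute force. The sketch below is an illustration under that assumption (with finitely many sets, closure under pairwise unions already gives closure under countable unions, and closure under intersections follows from complements and unions). The helper names `power_set` and `is_sigma_algebra` are hypothetical.

```python
from itertools import combinations


def power_set(omega):
    """All subsets of a finite set, as frozensets."""
    items = list(omega)
    return {
        frozenset(c)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    }


def is_sigma_algebra(omega, collection):
    """Check the closure properties on a finite collection of subsets."""
    omega = frozenset(omega)
    if omega not in collection:
        return False
    for a in collection:
        if omega - a not in collection:  # closed under complement
            return False
        for b in collection:
            if a | b not in collection:  # closed under union
                return False
    return True


omega = {1, 2, 3}
full = power_set(omega)                                      # largest choice
trivial = {frozenset(), frozenset(omega)}                    # smallest choice
not_closed = {frozenset(), frozenset({1}), frozenset(omega)} # {2, 3} is missing

print(is_sigma_algebra(omega, full))
print(is_sigma_algebra(omega, trivial))
print(is_sigma_algebra(omega, not_closed))
```

The third collection fails because it contains {1} but not its complement {2, 3}.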

Countable Sets Countable sets are those that are finite or have the same cardinality as the natural numbers. Quick refresher: Prove that the integers and the natural numbers have the same cardinality.
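
One standard way to approach the refresher is to exhibit an explicit bijection that interleaves the non-negative and negative integers: 0, -1, 1, -2, 2, and so on. The sketch below assumes the natural numbers start at 0; the function names are hypothetical, and checking a finite range only illustrates the argument rather than proving it.

```python
def nat_to_int(n):
    """Map 0, 1, 2, 3, 4, ... to 0, -1, 1, -2, 2, ..."""
    if n % 2 == 0:
        return n // 2
    return -((n + 1) // 2)


def int_to_nat(z):
    """Inverse map: non-negative integers go to evens, negatives to odds."""
    if z >= 0:
        return 2 * z
    return -2 * z - 1


first_values = [nat_to_int(n) for n in range(7)]
print(first_values)

# Round trips in both directions on a finite range.
assert all(int_to_nat(nat_to_int(n)) == n for n in range(1000))
assert all(nat_to_int(int_to_nat(z)) == z for z in range(-500, 500))
```

Because each map undoes the other, every integer is hit exactly once, which is what equal cardinality requires.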

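σ-algebra If we have a σ-algebra B on our sample space Ω, then: – $\Omega \in B$ – If $A \in B$, then its complement $A^c \in B$ – If $A_1, A_2, \ldots \in B$, then $\bigcup_{i=1}^{\infty} A_i \in B$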

Measures A measure µ takes a set A from a measurable collection of sets B and returns the measure of A, which is some nonnegative real number (possibly infinite). Formally: $\mu : B \to [0, \infty]$

Example Measure Let’s define a measure of “Volume”. The triple $(\Omega, B, \mu)$ combines a measurable space and a measure; the triple is called a measure space. The measure must satisfy two properties: – Nonnegativity: $\mu(A) \ge 0$ for all $A \in B$ – Countable additivity: if $A_i \in B$ are disjoint sets for i = 1, 2, …, then the measure of the union of the $A_i$ is equal to the sum of the measures of the $A_i$: $\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)$

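Example Measure Does the ordinary concept of volume satisfy these two properties? – Nonnegativity: $\mu(A) \ge 0$ for all $A \in B$ – Countable additivity: if $A_i \in B$ are disjoint sets for i = 1, 2, …, then the measure of the union of the $A_i$ is equal to the sum of the measures of the $A_i$: $\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)$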
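
One way to explore the question above is to take length as a one-dimensional stand-in for volume and compare the length of a union of intervals with the sum of the individual lengths. The sketch below is an illustration only: `union_length` is a hypothetical helper, intervals are taken as half-open pairs (a, b) with a ≤ b, and only finitely many intervals are checked.

```python
def union_length(intervals):
    """Length of the union of half-open intervals [a, b), merging overlaps."""
    total = 0.0
    start, end = None, None
    for a, b in sorted(intervals):
        if end is None or a > end:
            # Gap before this interval: close out the previous merged piece.
            if end is not None:
                total += end - start
            start, end = a, b
        else:
            # Overlaps or touches the current piece: extend it.
            end = max(end, b)
    if end is not None:
        total += end - start
    return total


disjoint = [(0.0, 1.0), (2.0, 3.5), (5.0, 5.25)]
overlapping = [(0.0, 2.0), (1.0, 3.0)]

# Nonnegativity: every individual length is >= 0.
assert all(b - a >= 0 for a, b in disjoint + overlapping)

# Additivity holds for the disjoint pieces ...
print(union_length(disjoint), sum(b - a for a, b in disjoint))

# ... but not when the pieces overlap, which is why disjointness is required.
print(union_length(overlapping), sum(b - a for a, b in overlapping))
```

The overlapping case shows why the additivity property is stated only for disjoint sets: the shared part [1, 2) would otherwise be counted twice.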

Two Special Kinds of Measures Signed measure – can be negative. Probability measure – the measure of a probability space $(\Omega, B, P)$. – A probability measure, P, has the normal properties of a measure, but it is also normalized such that: $P(\Omega) = 1$
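
As a small illustration of a probability measure, the sketch below puts the uniform measure on the 54-card deck used earlier in the lecture and checks the normalization condition, taken here to be the usual P(Ω) = 1. The function name `P`, the joker labels, and the choice of a uniform measure are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]

# Sample space: 52 cards plus 2 jokers (joker labels are made up).
omega = frozenset(product(RANKS, SUITS)) | {("joker", "first"), ("joker", "second")}


def P(event):
    """Uniform probability measure: each card is equally likely."""
    return Fraction(len(event & omega), len(omega))


kings = frozenset(c for c in omega if c[0] == "K")
queens = frozenset(c for c in omega if c[0] == "Q")

print(P(omega))        # normalization: the whole sample space
print(P(frozenset()))  # the empty event
print(P(kings))        # 4 of the 54 outcomes

# Additivity on disjoint events.
assert kings.isdisjoint(queens)
assert P(kings | queens) == P(kings) + P(queens)
```

Using `Fraction` keeps the probabilities exact, so the additivity check is not affected by floating-point rounding.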

Sets of Measure Zero A set of measure zero is some set $A \in B$ such that $\mu(A) = 0$. For a probability measure, any set of measure zero has probability zero, so it almost surely does not occur. – It can thus be ignored when stating things about the collection of sets B.

Borel Sets A common σ-algebra is the Borel σ-algebra. A Borel set is an element of a Borel σ-algebra. – Almost any set you can describe on the real line is a Borel set, for example, the unit line segment [0,1], the set of irrational numbers, etc. – The Borel σ-algebra on the real line is the smallest σ-algebra that includes the open subsets of the real line.

Borel Sets For some space X, the collection of all Borel sets on X forms a σ-algebra known as the Borel algebra (or Borel σ-algebra) on X. Important! Why? Any measure defined on the open sets of a space, or on the closed sets of a space, must also be defined on all Borel sets of that space.

Borel Sets Borel sets are powerful because if you know what a probability measure does on every interval, then you know what it does on all the Borel sets. Allows us to define equivalence of measures.

Borel Sets Let’s say we have two measures, say $\mu_1$ and $\mu_2$. To show they are equivalent (that is, they assign the same value to every measurable set) we just need to show that: – They are equivalent on all intervals. For probability measures, it then follows that they are equivalent for all Borel sets, and hence over the measurable space. Example: Given probability distributions A and B with equal cumulative distribution functions, the probability distributions must also be equal.
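
The link between cumulative distribution functions and intervals is that P(a < X ≤ b) = F(b) − F(a), so two distributions with the same CDF assign the same probability to every such interval. The sketch below illustrates this with two different-looking formulas for the standard normal CDF; the choice of the normal distribution and all function names are assumptions made for the example.

```python
import math


def cdf_a(x):
    """Standard normal CDF written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf_b(x):
    """The same CDF written via the symmetry F(x) = 1 - F(-x)."""
    return 1.0 - cdf_a(-x)


def interval_probability(cdf, a, b):
    """P(a < X <= b) = F(b) - F(a)."""
    return cdf(b) - cdf(a)


intervals = [(-1.0, 1.0), (0.0, 2.5), (-3.0, -0.5)]
for a, b in intervals:
    p_a = interval_probability(cdf_a, a, b)
    p_b = interval_probability(cdf_b, a, b)
    # Equal CDFs give equal interval probabilities (up to rounding error).
    assert math.isclose(p_a, p_b, abs_tol=1e-12)
    print(f"P({a} < X <= {b}) is about {p_a:.4f}")
```

Checking a handful of intervals is of course not a proof; the slide's point is that agreement on all intervals is enough to conclude agreement on all Borel sets.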

Measure Theory and Data Science Data Science is about working with, and deriving observations or features from data. Features are effectively measures of some sort, but often not for the underlying space of interest. Important to realize the limitations of measurable spaces for metrics of interest, and what can and cannot be measured.

Example Bearcats Elementary School had 300 students in their 5th grade class. 77% of them graduated to middle school, 12% failed their mathematics Standards of Learning, and 11% failed their reading Standards of Learning. The new class of 1st graders had interventions in mathematics and grammar; their graduation rates improved to 88%, with 7% failing mathematics and 5% failing reading. What can we infer? How does measure theory relate?
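
As a starting point for the discussion, the percentages for the 5th grade class can be turned into head counts. The sketch below is only bookkeeping under stated assumptions: it treats the three categories as disjoint and exhaustive, which the example does not actually say, and it leaves the 1st grade class as percentages because its size is not given.

```python
cohort_size = 300  # 5th grade class size, from the example

# Percentages from the example, in whole percent.
fifth_grade = {"graduated": 77, "failed_math": 12, "failed_reading": 11}
first_grade = {"graduated": 88, "failed_math": 7, "failed_reading": 5}

# Integer arithmetic avoids floating-point rounding in the counts.
counts = {name: cohort_size * pct // 100 for name, pct in fifth_grade.items()}
print(counts)

# ASSUMPTION: if the categories were disjoint and exhaustive, additivity
# would require the percentages to sum to 100 and the counts to sum to 300.
print(sum(fifth_grade.values()), sum(counts.values()))
print(sum(first_grade.values()))
```

Both sets of percentages sum to 100, which is consistent with disjoint categories but does not establish it; a student who failed both subjects, for instance, would have to be counted somewhere, and the example does not say where.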

Measure Theory: Further Reading – M. Capinski and E. Kopp, “Measure, Integral, and Probability”, Springer Undergraduate Mathematics Series, 2004. – S. I. Resnick, “A Probability Path”, Birkhäuser. – A. Gut, “Probability: A Graduate Course”, Springer. – R. M. Gray, “Entropy and Information Theory”, Springer Verlag, 1990 (available free online).

The Data Science Pipeline – Metric identification – Data collection – Data exploration and summary statistics – Feature generation – Feature importance testing – Modeling – Validation
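
A toy sketch of the first few pipeline stages (collection, cleaning, summary statistics, and feature generation) is given below. Everything in it is hypothetical: the raw readings are invented for illustration, and the stage functions are one possible arrangement, not the course's tooling.

```python
import statistics

# Hypothetical raw readings as they might arrive from data collection.
raw = ["12.5", "n/a", "14.0", "", "13.5"]


def clean(records):
    """Keep only the records that parse as numbers."""
    cleaned = []
    for record in records:
        try:
            cleaned.append(float(record))
        except ValueError:
            continue  # drop unparseable entries such as "n/a" or ""
    return cleaned


def summarize(values):
    """Summary statistics for the exploration stage."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
    }


def z_scores(values, summary):
    """A simple generated feature: each value standardized by mean and stdev."""
    return [(v - summary["mean"]) / summary["stdev"] for v in values]


values = clean(raw)
summary = summarize(values)
features = z_scores(values, summary)

print(values)
print(summary["n"], round(summary["mean"], 3))
print([round(z, 3) for z in features])
```

Each stage takes the previous stage's output as input, which is the property that makes a pipeline like this easy to automate and rerun.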

Automating the Data Pipeline Drake – Like make for data.

Getting your environments set for Data Science Over the next few weeks we will be introducing the projects and getting started with data science projects. Need to get the right tools installed!

Anaconda Grab the free distribution – Helps you maintain the appropriate Python distributions.

IPython/Jupyter Interactive Python with documentation features. Installs easily with Anaconda – cs.org/en/latest/install.html

Markdown – Markdown Syntax – Markdown Basics

Compute Lab Compute Server Minerva – Each group will get an account on Minerva with space and compute power for their project – Cloud-based Ubuntu server, similar to AWS, but private and secure.

For next time – No homework this week; work on HWK 3 presentations. – Work with Jupyter examples on Minerva once accounts are set up. – Learn Markdown Basics. – No class Thursday.