Foundations of Data Science: Mathematics

Raphy Coifman, Depts. of Mathematics & of Computer Science, Yale University
Eric Kolaczyk, Dept. of Mathematics & Statistics, Boston University

View @ 10K ft
It is common to group the key players of data science into “computational” sciences (e.g., computer science, engineering, and statistics; see previous talks) and “domain” sciences (e.g., genomics, neuroscience, text analysis, etc.). Mathematics plays a critical, but sometimes differentiated, role in support of both (often in the sense of “infrastructure”).

Mathematical Infrastructure: General
The “computational” side has traditionally been supported by, e.g., linear algebra, numerical analysis, and graph theory, as well as by supporting aspects of statistics, signal processing, etc.

Math Infrastructure: Domain
Support for the “domain” side is frequently domain- or problem-specific. Most representative examples come from the physical sciences. Mathematics provides shortcuts and enabling tools for processing and information extraction, based on models from mathematical physics. Ideas about representation and data features are mostly based on the tools of mathematical analysis.

Math Infrastructure: Domain (cont)
Example: linear(ized) inverse problems in seismology/geophysics (Abel transform and related), medical imaging (Radon transform, etc.), radar, and weather. Note: these domains all had “big” data long before it became common elsewhere.

Cartoon Version of Data Science Goals
Massive data → knowledge, by way of self-organization / clustering, dimension reduction / compression, modeling, etc. Mathematics can contribute both theoretical models/structure and a corresponding “calculus”.

Looking Back: Core Mathematical Activities
The following core activities were essential in the past for modeling our world and, with some adaptation, are expected to remain essential in current and future environments. Linear algebra: the basics, high-dimensional linear algebra, and effective numerical analysis. Analysis: PDEs (both stochastic and deterministic), harmonic analysis and approximation theory, functional analysis.

Core Mathematical Activities (cont)
Geometries: Riemannian geometry, metric geometry, differential geometry, some topology. Optimization: dynamic programming, convex optimization, relaxation methods. Etc.

Key Point
In the current and future environment for data science, we are lacking theoretical models as well as a related calculus.

Illustration: Linear Algebra
Linear algebra (more precisely, linear algebra in high dimensions) is the main computational toolbox for all data analysis. Superficially, it has changed little in recent memory. In reality, however, numerical linear algebra in very large dimensions requires the integration of all the classical fields mentioned (i.e., analysis, geometry, optimization, probability, etc.).

Extending Linear Algebra
Meeting the challenges of linear algebra in, e.g., high-dimensional discretized function spaces requires a good deal of understanding of analysis, randomization, and stochastic processes, because very high-dimensional linear algebra requires the reorganization of massive matrices. Extensions to tensor or multilinear algebra are an essential ingredient of combined-source processing.

Extending Linear Algebra: Beyond Linear
The moment we exit the linear regime, all expectations crumble. Even the simplest bilinear transformations, such as the pointwise product of two functions, are discontinuous in the weak topology. This can cause major disruptions due to noise, which usually cancels in the linear regime but builds up in nonlinear regimes. This is not an academic concern: serious, deep mathematical analysis is required to understand empirical observations and structures in nonlinear regimes.
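A standard textbook illustration of this discontinuity, added here as a worked example and not taken from the slides: the oscillating sequence f_n(x) = sin(nx) on [0, 2π] (an assumed choice) converges weakly to zero by the Riemann-Lebesgue lemma, yet its pointwise product with itself converges weakly to 1/2, not to 0 · 0.

```latex
% Assumed example (not from the slides): f_n(x) = \sin(nx) on [0, 2\pi].
\[
  \int_0^{2\pi} \sin(nx)\,\varphi(x)\,dx \;\longrightarrow\; 0
  \quad \text{for every } \varphi \in L^2(0,2\pi),
  \qquad \text{so } f_n \rightharpoonup 0 .
\]
\[
  f_n(x)\, f_n(x) = \sin^2(nx) = \tfrac{1}{2} - \tfrac{1}{2}\cos(2nx)
  \;\rightharpoonup\; \tfrac{1}{2} \;\neq\; 0 = 0 \cdot 0 .
\]
```

Read f_n as zero-mean oscillatory noise: any linear averaging washes it out, while the quadratic operation leaves a systematic offset of 1/2 that does not average away.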

Mathematical Conceptualization of Modern Data Science Challenge
A cloud of points in high-dimensional Euclidean space (a database) needs to be characterized or modeled. The density of the points could be estimated, or the geometry or configuration of the points described (manifold learning, graphs/networks, probabilistic generative models, metric spaces, etc.). Various functions on the data, such as low-dimensional embeddings, classifiers, or “features”, may need to be regressed or “learned”.

The Main Point wrt Data Science Education
This activity involves a blend of all the (sub)fields listed above. Importantly, the corresponding mathematical abstractions, usually taught in the mathematics curriculum as independent courses (e.g., differential geometry, partial differential equations, probability and stochastic processes, dynamical systems, optimization, Fourier analysis, etc.), all come to life and blend in many instances of the modern setting, providing insights in all directions. This will require a blended, integrative curriculum in which all these tools are explained and jointly motivated.

Illustration: Comparing Two Random Samples
A fundamental problem in statistics is to assess whether two sets of observations are random samples (a) from the same probability distribution or (b) from different distributions. A canonical solution is, e.g., the two-sample Kolmogorov-Smirnov test. Fundamentally, this requires the creation of histogram bins. The main challenge in high dimensions is to match the bins to the geometry of the data, to its precision, and to the purpose for which the two populations are being compared.
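For concreteness, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic in one dimension, using only the Python standard library. In this classical setting the “bins” are the half-lines (-∞, t], and the statistic is the largest gap between the two empirical distribution functions. The function name `ks_two_sample_statistic` is an illustrative choice, and the sketch returns only the statistic, not a p-value.

```python
from bisect import bisect_right


def ks_two_sample_statistic(xs, ys):
    """Return max over t of |F_x(t) - F_y(t)| for the two empirical CDFs."""
    xs_sorted = sorted(xs)
    ys_sorted = sorted(ys)
    n, m = len(xs_sorted), len(ys_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    d = 0.0
    # Both empirical CDFs are right-continuous step functions that jump only
    # at observed values, so the largest gap occurs at an observed value.
    for t in xs_sorted + ys_sorted:
        fx = bisect_right(xs_sorted, t) / n  # fraction of xs that are <= t
        fy = bisect_right(ys_sorted, t) / m  # fraction of ys that are <= t
        d = max(d, abs(fx - fy))
    return d


# Two small samples that overlap on half of their values.
print(ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

The ordering of the real line is what makes these bins canonical. In higher dimensions there is no such ordering, which is the difficulty the slide points to.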

Illustration (cont)
Currently, soft bins are created (implicitly) through kernel functions or through the use of various smoothness classes (e.g., Lipschitz functions). Unfortunately, these heuristic methods are not informed by the intrinsic geometry and statistics of the data. Attempts to use deep nets to define appropriate binning might work better when driven by statistical and analytic tools. At this point, kernel methods and smoothness-class methods assume conventional geometries whether they fit or not. What is required is an appropriate analytic approximation theory, governed by statistics as well as by filtering needs.
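One common kernel-based instance of such soft binning is the maximum mean discrepancy (MMD) with a Gaussian kernel. It is named here as an illustration and is not a method attributed to the slides. The sketch below computes the biased (V-statistic) estimate of the squared MMD using only the standard library; the function names and the default `bandwidth` are assumptions made for the example.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel between two points given as equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples of points."""
    n, m = len(xs), len(ys)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    # Average kernel similarity within each sample and across the two samples.
    kxx = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in xs) / (n * n)
    kyy = sum(gaussian_kernel(a, b, bandwidth) for a in ys for b in ys) / (m * m)
    kxy = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy


# Hypothetical toy samples in the plane: one near the origin, one shifted away.
near = [(0.0, 0.0), (0.1, -0.1), (-0.2, 0.1)]
far = [(3.0, 3.0), (3.1, 2.9), (2.8, 3.2)]
print(mmd_squared(near, near))  # zero up to rounding for identical samples
print(mmd_squared(near, far))   # clearly positive for well-separated samples
```

Note that the isotropic Euclidean kernel and its single fixed bandwidth are imposed on the data from outside. That is exactly the “conventional geometry” assumption criticized above.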

Summary / Discussion
Mathematics provides a conceptual framework, a language, and a “calculus” for data science. The traditional framework has been found to extend well beyond what might have been expected during its development, but we appear to be reaching a point where additional fundamental development will prove crucial.

Summary / Discussion
How best to evolve the mathematics curriculum, both foundational and interdisciplinary, to meet data science needs? How can we better foster integrative teaching/learning of topics from across traditionally diverse areas, both within mathematics (e.g., linear algebra, geometry, analysis) and across the mathematical sciences (e.g., mathematics, statistics, probability)? How best to facilitate a more integrated development of theoretical error bounds/guarantees and computational/data-analysis advances?
