Random Walks for Data Analysis

Presentation on theme: "Random Walks for Data Analysis"— Presentation transcript:

1 Random Walks for Data Analysis
Dima Volchenkov (Bielefeld University) Discrete and Continuous Models in the Theory of Networks

4 Classes of tasks
Data come to us in the form of data tables: binary relations. Classes of tasks: data interpretation; data validation & network stability analysis; data modeling.

5 Data interpretation
Only local information is available at a time; there is no global, intuitive geometric structure (binary relations/comparisons instead of geometry). Intuitive idea: the data may “live” on some geometric manifold, so we need a manifold-learning strategy. Data geometrization.

8 Example: Data interpretation
Linnaeus, Systema Naturæ (1735). The Linnaean classes for plants: Classis 1. Monandria: flowers with 1 stamen; Classis 2. Diandria: flowers with 2 stamens; Classis 3. Triandria: flowers with 3 stamens; Classis 4. Tetrandria: flowers with 4 stamens; Classis 5. Pentandria: flowers with 5 stamens; Classis 6. Hexandria: flowers with 6 stamens; … etc. Data classification/judgment is always based on introducing equivalence relations on the set of walks over the database of attributes. Nature as a data-network; Lamarck, Darwin: theory of evolution.

9 Data validation & network stability analysis
Do the data have an “internal logic” that could help to select proper values? Is there an “internal network dynamics”? Can the structure cause changes in itself? Braess's paradox: adding extra capacity to a network can in some cases reduce overall performance.

10 Data modeling
Algorithm of doing data science by a “bad” physicist: Step 1: assign a dynamical variable to each node/data entry/entity. Step 2: write down a “Schrödinger equation” from the latest physics paper. Step 3: upload the new paper to arXiv. In reality, the system is rather complex: apparent units/nodes are not “natural”, and there are too many degrees of freedom for any reasonable equation; only a few main traits can be modeled. Collective variables → complexity reduction.

11 Equivalence partitions of walks => random walks
The data classification is always based on introducing equivalence relations on the set of walks over the database. Examples: Rx: walks of a given length n starting at the same node x are equivalent. Ry: walks of a given length n ending at the same node y are equivalent. Rx ∧ Ry: walks of a given length n between the nodes x and y are equivalent.

12 Example of equivalence partitions over databases …
Astrology has been dated to the 3rd millennium BCE: people born on the same day are supposed to inherit the same or a similar personality…

13 Equivalence partitions of walks => random walks
The data classification is always based on introducing equivalence relations on the set of walks over the database. Given an equivalence relation on the set of walks and a utility function for each equivalence class, we can always normalize the function to a probability: all “equivalent” walks are equiprobable. Partition into equivalence classes of walks → a utility function for each equivalence class → a random-walk transition operator between equivalence classes.

14 We proceed in three steps:
Step 0: Given an equivalence relation between paths, any transition can be characterized by the probability to belong to an equivalence class. Different equivalence relations → different equivalence classes → different probabilities. Step 1: “Probabilistic graph theory”: nodes of a graph, subgraphs (sets of nodes), and the whole graph are described by probability distributions and characteristic times w.r.t. different Markov chains. Step 2: “Geometrization of data manifolds”: establish geometric relations between those probability distributions whenever possible; 1. coarse-graining/reduction of networks & databases → data analysis, sensitivity to assorted data variations; 2. Monge-Kantorovich-type problems, optimal transport → distances between distributions.

15 An example of equivalence relation:
Step 0: A variety of random walks at different scales. An example of an equivalence relation, Rx: walks of a given length n starting at the same node x are equivalent. Equiprobable walks: the nearest-neighbor random walk. Stochastic normalization.

18 An example of equivalence relation:
Step 0: A variety of random walks at different scales. Rx: walks of a given length n starting at the same node x are equivalent. Equiprobable walks; stochastic normalization; probability of an n-walk; “structure learning”.
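The stochastic normalization behind the nearest-neighbor random walk can be sketched in a few lines. A minimal illustration; the 4-node adjacency matrix is an assumption for the example, not taken from the slides:

```python
import numpy as np

# Hypothetical 4-node undirected graph (assumed for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# Stochastic normalization: divide each row of A by the node degree.
# The result T = D^{-1} A is the nearest-neighbor random-walk operator:
# all walks leaving a node are equiprobable.
deg = A.sum(axis=1)
T = A / deg[:, None]

assert np.allclose(T.sum(axis=1), 1.0)  # every row is a probability law
print(T[0])  # node 0 hops to each of its two neighbors with probability 1/2
```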

19 What is a neighbourhood?
Who are my neighbours in a given classification? 1. Neighbours are next to me… 2. Neighbours are 2 steps apart from me… n. Neighbours are n steps apart from me… My neighbours are those whom I can visit equiprobably (w.r.t. a chosen equivalence of paths)…

21 A variety of random walks at different scales
Step 0: A variety of random walks at different scales. Rx: walks of a given length n starting at the same node x are equivalent. Equiprobable walks; stochastic matrices. Left eigenvectors (μ = 1) give centrality measures: the “stationary distribution” of the nearest-neighbor RW.
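The stationary distribution as a left eigenvector can be computed directly. A sketch on the same assumed 4-node toy graph; for the nearest-neighbor walk the result is just the normalized degree sequence:

```python
import numpy as np

# Same hypothetical 4-node graph as before (assumed for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)

# The stationary distribution is the left eigenvector of T for the
# eigenvalue mu = 1 (the largest one), normalized to sum to one.
vals, vecs = np.linalg.eig(T.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()

# For the nearest-neighbor walk this centrality measure equals the
# degree distribution: pi_i = deg(i) / sum_j deg(j).
assert np.allclose(pi, A.sum(axis=1) / A.sum())
print(pi)
```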

22 Random walks of different scales
Time is introduced as powers of transition matrices
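Introducing time as powers of the transition matrix is easy to see numerically. A sketch on the assumed 4-node toy graph from the earlier examples; each row of T^t converges to the stationary distribution as t grows:

```python
import numpy as np

# Hypothetical 4-node graph (assumed for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)

# "Time" enters only through powers of the transition matrix:
# (T^t)_{ij} is the probability of going from i to j in exactly t steps.
for t in (1, 4, 16, 64):
    row0 = np.linalg.matrix_power(T, t)[0]
    print(t, np.round(row0, 4))

# As t grows, every row approaches the stationary distribution,
# here the degree distribution [0.2, 0.3, 0.3, 0.2].
```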

41 Random walks of different scales
Time is introduced as powers of transition matrices. Still far from the stationary distribution! The stationary distribution is already reached! Defect-insensitive. Low-centrality (defect) repelling.

42 Random walks for different equivalence relations
Nearest-neighbor RW vs. “maximal-entropy” RW (J. K. Ochab, Z. Burda).
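The “maximal-entropy” walk of Ochab and Burda reweights steps by the principal eigenvector of the adjacency matrix, making all trajectories of a given length equiprobable. A minimal sketch on an assumed 4-node toy graph:

```python
import numpy as np

# Hypothetical 4-node graph (assumed for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# Maximal-entropy RW: with lam, psi the Perron eigenvalue/eigenvector
# of A itself, the transition probabilities are
#     T_ij = (A_ij / lam) * psi_j / psi_i,
# so that all paths of a given length are equiprobable.
vals, vecs = np.linalg.eigh(A)
lam, psi = vals[-1], np.abs(vecs[:, -1])   # Perron eigenpair
T_merw = (A / lam) * psi[None, :] / psi[:, None]

assert np.allclose(T_merw.sum(axis=1), 1.0)   # row-stochastic
pi_merw = psi**2 / (psi**2).sum()             # its stationary distribution
print(np.round(pi_merw, 4))
```

Unlike the nearest-neighbor walk, whose stationary law follows the degrees, the maximal-entropy walk concentrates its stationary measure on nodes with large principal-eigenvector components.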

52 Step 1: “Probabilistic graph theory”
As soon as we define an equivalence relation… In classical graph theory, the shortest-path distance is insensitive to the structure of the graph. Here, the distance is “a Feynman path integral”, sensitive to the global structure of the graph. Systems of weights are related to each other in a geometric fashion.

53 Step 1: “Probabilistic graph theory”
As soon as we define an equivalence relation… Node, time scale: Tr T is the probability that the RW stays at the initial node in 1 step; |det T| is the probability that the RW revisits the initial node in N steps; probabilistic graph invariants = the t-step recurrence probabilities quantifying the chance to return in t steps; centrality measures (stationary distributions); return times to a node; random target time. Subgraph (a subset of nodes), graph: “wave functions” (Slater determinants) of transients (traversing nodes and subgraphs within the characteristic scales) return probability amplitudes whose modulus squared represents the probability density over the subgraphs; return times to the subgraphs within transients = 1/Pr{…}; mixing times over subgraphs (times until the Markov chain is “close” to the steady-state distribution).

55 Recurrence probabilities as principal invariants of the graph
The Cayley–Hamilton theorem: the roots μm are the eigenvalues of T, and {Ik}, k = 1, …, N, are its principal invariants, with I0 = 1. Kolmogorov–Chapman equation: |Ik| are the k-step recurrence probabilities quantifying the chance to return in k steps; |I1| = |Tr T| is the probability that a random walker stays at a node in one time step, and |IN| = |det T| expresses the probability that the random walk revisits an initial node in N steps.
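The principal invariants are just the coefficients of the characteristic polynomial of T, so they can be read off numerically. A sketch on the assumed 4-node toy graph used earlier:

```python
import numpy as np

# Hypothetical 4-node graph (assumed for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)

# np.poly returns the coefficients of det(x*I - T):
#   x^N - I_1 x^{N-1} + I_2 x^{N-2} - ... + (-1)^N I_N,
# so the invariants are recovered with alternating signs.
c = np.poly(T)
N = len(T)
I = [(-1) ** k * c[k] for k in range(N + 1)]   # I[0] = 1, ..., I[N] = det T

assert np.isclose(I[0], 1.0)
assert np.isclose(I[1], np.trace(T))        # I_1 = Tr T
assert np.isclose(I[N], np.linalg.det(T))   # I_N = det T
print(np.round(I, 4))
```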

58 Analogy with fermionic systems
The determinants of minors of the kth order of Ψ define an orthonormal basis in the

59 Analogy with fermionic systems
The squares of these determinants define the probability distributions over the ordered sets of k indexes: satisfying the natural normalization condition,

60 Analogy with fermionic systems
The squares of these determinants define probability distributions over the ordered sets of k indexes (satisfying the natural normalization condition) and describe currents of random walkers. The simplest example is the stationary distribution of random walks:

63 Transport problems of the Monge-Kantorovich type
Step 2: “Geometrization of data manifolds”. As soon as we get probability distributions… Given T, the operator L ≡ 1 − T acts on distributions; the Green function, given by Drazin's generalized inverse, is the natural way to relate two distributions within the diffusion process. Given two distributions x, y over the set of nodes, we can define a scalar product, the (squared) norm of a vector, an angle, and the Euclidean distance. “First-passage transportation” from x to y: W(x→y), W(y→x).
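For an ergodic chain, Drazin's generalized inverse of L = 1 − T can be computed from the fundamental matrix Z = (1 − T + Π)⁻¹, where Π carries the stationary distribution in every row (Meyer's construction). A sketch on the assumed 4-node toy graph:

```python
import numpy as np

# Hypothetical 4-node graph (assumed for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)
pi = A.sum(axis=1) / A.sum()     # stationary distribution of the NN walk
N = len(T)

# Drazin (group) inverse of L = 1 - T via the fundamental matrix:
#   Z = (1 - T + Pi)^{-1},  L^D = Z - Pi,  Pi = ones * pi.
Pi = np.tile(pi, (N, 1))
Z = np.linalg.inv(np.eye(N) - T + Pi)
LD = Z - Pi

# Defining properties of the generalized inverse:
L = np.eye(N) - T
assert np.allclose(L @ LD @ L, L)
assert np.allclose(LD @ L @ LD, LD)
assert np.allclose(L @ LD, LD @ L)
```

Scalar products, norms, angles, and Euclidean distances between distributions can then be built from the bilinear form x·L^D·y.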

64 Transport problems of the Monge-Kantorovich type
Step 2: “Geometrization of data manifolds”. Given two distributions x, y over the set of nodes, the scalar product defined through Drazin's generalized inverse of L ≡ 1 − T yields the (squared) norm of a vector, an angle, and the Euclidean distance. Applications: (mean) first-passage time; commute time; electric potential; effective-resistance distance; tax-assessed land price in cities; musical diatonic scale degree; musical tonality scale.

66 Example 1: Nearest-neighbor random walks on undirected graphs
The spectral representation of the (mean) first-passage time: the expected number of steps required to reach node i for the first time, starting from a node chosen at random among all nodes of the graph according to the stationary distribution π. The commute time: the expected number of steps required for a random walker starting at i ∈ V to visit j ∈ V and then return to i.
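Both quantities fall out of the fundamental matrix of the chain. A minimal sketch on the assumed 4-node toy graph, using the standard identity m_ij = (Z_jj − Z_ij)/π_j for mean first-passage times:

```python
import numpy as np

# Hypothetical 4-node graph (assumed for illustration).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)
pi = A.sum(axis=1) / A.sum()
N = len(T)

Z = np.linalg.inv(np.eye(N) - T + np.tile(pi, (N, 1)))  # fundamental matrix

# Mean first-passage times: m_ij = (Z_jj - Z_ij) / pi_j  (diagonal = 0).
M = (np.diag(Z)[None, :] - Z) / pi[None, :]

# Commute time K_ij = m_ij + m_ji is symmetric although M is not.
K = M + M.T
assert np.allclose(K, K.T)

# "Random target time" (Kemeny's constant): sum_j pi_j m_ij is the
# same whatever the starting node i.
target = (M * pi[None, :]).sum(axis=1)
assert np.allclose(target, target[0])
print(np.round(M, 3))
```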

67 Example 2: First-passage times in cities
Manhattan, 2005; Neubeckum, Germany, 2012. Tax-assessed value of land ($) vs. (mean) first-passage time. Some places in urban environments are easily accessible, others are not; well-accessible places are more favorable to the public, while isolated places are either abandoned or misused. In the long run, inequality in accessibility results in disparity of land prices: the more isolated a place is, the lower its price. Over time, structural isolation causes social isolation, as the host society occupies the structural focus of the urban environment, while the guest society typically resides on the outskirts, where land is relatively cheap.

68 Around The City of Big Apple
(Mean) first-passage times in the city graph of Manhattan. Public places (Federal Hall, Times Square, SoHo) belong to the city core, with first-passage times of 10-100 steps; East Village lies at 500-1,000 steps; Bowery and East Harlem fall into the city-decay zone at 5,000-10,000 steps (“slum”).

70 Example 3: Electric Resistance Networks, Resistance distance
An electrical network is considered as an interconnection of resistors; the currents are described by Kirchhoff's circuit law. Given an electric current from a to b of amount 1 A, the effective resistance of the network is the potential difference between a and b. The effective resistance admits a spectral representation.
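The resistance distance can be computed from the pseudoinverse of the graph Laplacian. A sketch on the assumed 4-node toy graph, treating every edge as a 1-ohm resistor:

```python
import numpy as np

# Hypothetical 4-node graph (assumed for illustration); each edge
# is a 1-ohm resistor.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# Effective resistance via the Moore-Penrose pseudoinverse of the
# graph Laplacian L = D - A:  R_ab = (e_a - e_b)^T L^+ (e_a - e_b).
L = np.diag(A.sum(axis=1)) - A
Lp = np.linalg.pinv(L)

def resistance(a: int, b: int) -> float:
    e = np.zeros(len(A))
    e[a], e[b] = 1.0, -1.0
    return float(e @ Lp @ e)

# Sanity check: by symmetry the link 1-2 carries no current between
# nodes 0 and 3, leaving two parallel 2-ohm paths, so R = 1 ohm.
assert np.isclose(resistance(0, 3), 1.0)
```

The commute time of the nearest-neighbor walk and the resistance distance agree up to the factor 2|E| (here 2|E| = 10), which is why both appear in the list of applications above.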

71 Impedance networks: The two-point impedance and LC resonances

72 Resonances

74 (Complexity reduction) PCA Based on Geodesics
Small data variations give rise to small changes of the eigenvectors (rotations) and eigenvalues of the symmetric transition operator, so we can consider the image of the database as a “probabilistic manifold” in PR^(N−1). Geodesics on the sphere are great circles. PCA is performed in the tangent space, then the “principal directions” are projected onto geodesics. The result is an ordered sum of assorted data variations.

75 Geodesic paths of language evolution
Levenshtein's distance (edit distance) is a measure of the similarity between two strings: the number of deletions, insertions, or substitutions required to transform one string into the other. MILCH → MILK. The normalized edit distance between the orthographic realizations of two words can be interpreted as the probability of a mismatch between two characters picked from the words at random.
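The edit distance has a standard dynamic-programming implementation; a short sketch, applied to the MILCH/MILK pair from the slide:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of deletions, insertions, and substitutions
    transforming string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

# German "MILCH" vs. English "MILK": substitute C -> K, delete H.
d = levenshtein("MILCH", "MILK")
print(d, d / max(len("MILCH"), len("MILK")))   # 2 0.4
```

Dividing by the length of the longer word gives the normalized edit distance used in the slides as a character-mismatch probability.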

76 The four well-separated monophyletic spines represent the four biggest traditional IE language groups: Romance & Celtic, Germanic, Balto-Slavic, and Indo-Iranian; The Greek, Romance, Celtic, and Germanic languages form a class characterized by approximately the same azimuth angle (belong to one plane); The Indo-Iranian, Balto-Slavic, Armenian, and Albanian languages form another class, with respect to the zenith angle.

77 The systematic sound correspondences between the Swadesh words across the different languages perfectly coincide with the well-known centum-satem isogloss of the IE family (reflecting the IE numeral ‘100’), related to the evolution of the phonetically unstable palatovelar order.

78 The normal probability plots fit the distances r of the language points from the ‘center of mass’ to univariate normality. The data points were ranked and then plotted against their expected values under normality, so that departures from linearity signify departures from normality.

79 The univariate normal distribution is closely related to the time evolution of a mass-density function under homogeneous diffusion in one dimension, in which the mean value μ is interpreted as the coordinate of the point where all mass was initially concentrated, and the variance σ² ∝ t grows linearly with time. The values of the variance σ² give a statistically consistent estimate of the age of each language group. Anchor events: the last Celtic migration (to the Balkans and Asia Minor) (300 BC); the division of the Roman Empire (500 AD); the migration of Germanic tribes to the Danube River (100 AD); the establishment of the Avar Khaganate (590 AD), overspreading the Slavic people who did the bulk of the fighting across Europe.

80 From the time–variance ratio we can retrieve the probable dates for:
The break-up of the Proto-Indo-Iranian continuum: by 2,400 BC (the migration from the early Andronovo archaeological horizon; Bryant, 2001). The end of common Balto-Slavic history: before 1,400 BC (the archaeological dating of the Trzciniec-Komarov culture). The separation of Indo-Aryans from Indo-Iranians: before 400 BC (probably as a result of the Aryan migration across India to Ceylon, as early as 483 BC; Mcleod, 2002). The division of the Persian polity into a number of Iranian tribes: before 400 BC (after the end of the Greco-Persian wars; Green, 1996).

81 Proto-Indo-Europeans?
The Kurgan scenario postulates the IE origin among the people of the “Kurgan culture” (early 4th millennium BC) in the Pontic steppe (Gimbutas, 1982). The Anatolian hypothesis suggests an origin in Neolithic Anatolia and associates the expansion with the Neolithic agricultural revolution in the 8th and 6th millennia BC (Renfrew, 1987). The graphical test checking three-variate normality of the distribution of the distances of the five proto-languages from a statistically determined central point extends the notion of the normal probability plot; the χ²-distribution is used to test the goodness of fit of the observed distribution, with departures from three-variate normality indicated by departures from linearity. The previously determined time–variance ratio then dates the initial break-up of the Proto-Indo-Europeans back to 7,400 BC, pointing at the early Neolithic date.

82 In search of Polynesian origins
The component probe for a sample of 50 Austronesian (AU) languages immediately uncovers both the Formosan (F) and the Malayo-Polynesian (MP) branches of the language family.

84 Mystery of the Tower of Babel
Nonliterate languages evolve exponentially fast without extensive contacts with the remaining population. Isolation does not preserve a nonliterate language!

85 I dress like everyone, I don’t care, I dress like no other
It is believed that fashion refers to a distinctive and often habitual trend in the style with which a person dresses. These trends differ across cultures and places, even within Europe (London, Wrocław). Is it possible to assess these trends quantitatively and to identify a personal strategy of self-representation in individuals?

86 Appearance assessment
Appearance is encoded by a string of attributes (symbols): 28 attributes for men, including age; 47 attributes for women, including age.

88 Appearance edit distance (Spot the Difference Game)
A string metric measures distance as the number of operations required to transform a string encoding one appearance into another. Appearance edit distance = 8

89 The matrix of appearance edit distances for 28 women in London
Analyzed databases: Wrocław — 85 men, 113 women; London — 14 men, 28 women.

90 Phylogenetic (neighbor joining) trees for human appearance in Wrocław
The matrix of appearance edit distances can be visualized by unrooted phylogenetic trees that illustrate the relatedness of appearances. Goodness of fit to the distance matrix: 62.2% and 78.7%. The simple relation of ancestry basic to a tree structure cannot grasp the complex, distinctive, and often habitual trends in style!

91 Phylogenetic (neighbor joining) trees for human appearance in London
The matrix of appearance edit distances can be visualized by unrooted phylogenetic trees that illustrate the relatedness of appearances. Goodness of fit to the distance matrix: 85.3% and 89.5%. The simple relation of ancestry basic to a tree structure cannot grasp the complex, distinctive, and often habitual trends in style!

92 Geometrization of appearance by random walks
The appearance data for 113 women of Wrocław shown in the coordinates of 3 main style traits calculated over the edit distance data matrix. The matrix of appearance edit distances can be considered as an adjacency matrix of a complete graph with weighted edges.

93 Geometrization of appearance data
The appearance data for 113 women of Wrocław shown in the coordinates of the 2 main style traits (the 1st and 2nd principal components, spanning the main data trend) calculated over the edit-distance data matrix.

94 THREE statistically different types of appearance
The appearance data for 113 women of Wrocław, in the coordinates of the 2 main style traits calculated over the edit-distance data matrix, show the distribution of points along the linear trend: there are THREE statistically different types of appearance.

95 THREE statistically different types of appearance
Distribution of points along the linear trend: “I dress like everyone” (Gaussian statistics); “I don't care how I look” (Maxwell-Boltzmann statistics); “I dress like no other” (Fermi-Dirac statistics).

96 I dress like everyone, I don’t care, I dress like no other
Women are more likely to follow a common style, to dress like everyone else.

98 Traps and landmarks: recurrence time vs. first-passage time
Landmarks, “guiding structures”: reached quickly, seldom revisited. Traps, “confusing environments”: can take long to reach, but are often revisited.

99 Musical Dice Game (*): the relations between notes in (*) are described in terms of probabilities and expected numbers of random steps rather than physical time. Thus the actual length N of a composition is formally taken as N → ∞, that is, as long as you keep rolling the dice.

100 F. Liszt, Consolation No. 1
J.S. Bach, Prelude BWV 999; R. Wagner, Das Rheingold (Entrance of the Gods); W.A. Mozart, Eine kleine Nachtmusik.

101 A “guiding structure”: Tonality scales in Western music
Increase of harmonic interval / first-passage time. The recurrence time vs. the first-passage time over 804 compositions of 29 Western composers.

102 Network geometry at different scales
First-passage time vs. the scale of the RW: either the node belongs to a network “core”, consolidating with other central nodes, or the node belongs to a “cluster”, loosely connected with the rest of the network.

103 Possible analogy with Ricci flows
First-passage time vs. the scale of the RW: “densification” of a network of “positive curvature” corresponds to a “contraction” of the “probabilistic manifold”, while a network of “negative curvature” undergoes a “collapse”.

104 Ricci flows and photo resolution

105 Intelligibility of a network/database
First-passage time: a property of a node w.r.t. the global structure. Recurrence time: a property of a node w.r.t. the local structure.

106 Intelligibility of a network/database
As n → ∞, the first-passage time approaches the recurrence time: after enough learning, any structure becomes intelligible! Recurrence time: a property of a node w.r.t. the local structure. First-passage time: a property of a node w.r.t. the global structure.

107 References
D. Volchenkov, Ph. Blanchard, “Introduction to Random Walks on Graphs and Databases”, Springer Series in Synergetics, Vol. 10, Berlin/Heidelberg (2011). D. Volchenkov, Ph. Blanchard, “Mathematical Analysis of Urban Spatial Networks”, Springer Series Understanding Complex Systems, Berlin/Heidelberg, 181 pages (2009).

