Random Walks for Data Analysis Dima Volchenkov (Bielefeld University) Discrete and Continuous Models in the Theory of Networks.

Data come to us in the form of data tables and binary relations.
Classes of tasks:
1. Data interpretation;
2. Data validation & Network stability analysis;
3. Data modeling.

Data interpretation
Only local information is available at a time. There is no global, intuitive geometric structure: we have binary relations and comparisons instead of geometry.
Intuitive idea: the data may “live” on some geometric manifold, so we need a manifold learning strategy (data geometrization).

Example: Data interpretation. Nature as a data network
Linnaeus, Systema Naturæ (1735). The Linnaean classes for plants:
Classis 1. Monandria: flowers with 1 stamen
Classis 2. Diandria: flowers with 2 stamens
Classis 3. Triandria: flowers with 3 stamens
Classis 4. Tetrandria: flowers with 4 stamens
Classis 5. Pentandria: flowers with 5 stamens
Classis 6. Hexandria: flowers with 6 stamens
… etc.
Data classification (judgment) is always based on the introduction of equivalence relations on the set of walks over the database.
Theory of evolution: Lamarck, Darwin.

Data validation & Network stability analysis
Braess's paradox: adding extra capacity to a network can in some cases reduce overall performance.
Does the data have an “internal logic” that could help to select proper values? Is there an “internal network dynamics”? Can the structure cause changes in itself?

Data modeling
Algorithm of doing data science by a physicist:
Apparent units/nodes are not “natural”;
Too many degrees of freedom for any reasonable equation;
Only a few main traits can be modeled.
The system is rather complex: collective variables, complexity reduction.

The data classification is always based on the introduction of equivalence relations on the set of walks over the database. Equivalence partitions of walks => random walks.
Examples:
R_x: walks of the given length n starting at the same node x are equivalent;
R_y: walks of the given length n ending at the same node y are equivalent;
R_x ∩ R_y: walks of the given length n between the nodes x and y are equivalent.
Given an equivalence relation on the set of walks and a suitable function on the walks, we can always normalize the function to be a probability function: all “equivalent” walks are equiprobable. The ingredients are:
a partition into equivalence classes of walks;
the utility function for each equivalence class;
a random walk transition operator between equivalence classes.

A classification: all “equivalent” walks are equiprobable
On equiprobable configurations, with $P_{\mathrm{config}} = p_1 \cdot p_2 \cdot p_3 \cdot p_4 \cdot p_5$ and $P_1 = P_2 = \dots = P_N$: Maxwell–Boltzmann statistics (Maxwell–Boltzmann distribution), Bose–Einstein statistics, …
Gibrat’s Law: the probability of a new occurrence is proportional to the number of times it has occurred previously. Pareto-Lévy distributions, “fat tails”, …

We proceed in three steps:
Step 0: Given an equivalence relation between paths, any transition can be characterized by a probability to belong to an equivalence class. Different equivalence relations → different equivalence classes → different probabilities.
Step 1: “Probabilistic graph theory”. Nodes of a graph, subgraphs (sets of nodes) of the graph, and the whole graph are described by probability distributions and characteristic times w.r.t. different Markov chains.
Step 2: “Geometrization of Data Manifolds”. Establish geometric relations between those probability distributions whenever possible:
1. Coarse-graining/reduction of networks and databases → data analysis; sensitivity to assorted data variations;
2. Monge-Kantorovich type problems, optimal transport → distances between distributions.

Step 0. A variety of random walks at different scales
An example of an equivalence relation, R_x: walks of the given length n starting at the same node x are equivalent.
Equiprobable walks: the nearest neighbor random walks.
Stochastic normalization at each walk length n gives the probability of an n-walk, and hence a different random walk at each scale (“structure learning”).

What is a neighbourhood? Who are my neighbours in a given classification?
1. Neighbours are next to me…
2. Neighbours are 2 steps apart from me…
n. Neighbours are n steps apart from me…
My neighbours are those whom I can visit equiprobably (w.r.t. a chosen equivalence of paths)…

Step 0. A variety of random walks at different scales
An example of an equivalence relation, R_x: walks of the given length n starting at the same node x are equivalent.
Equiprobable walks are described by stochastic matrices.
Centrality measures: the left eigenvectors (μ = 1) of these stochastic matrices, for example the “stationary distribution” of the nearest neighbor RW.
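A minimal sketch of the stochastic normalization described above, assuming the transition matrix at scale n is the row-normalized n-th power of the adjacency matrix. The function names and the example graph (a triangle with one pendant node) are illustrative and not taken from the slides.

```python
import numpy as np

def n_walk_transition(A, n=1):
    """Row-normalize A^n: all n-walks starting at the same node become equiprobable."""
    # (A^n)[i, j] counts the walks of length n from node i to node j.
    walks = np.linalg.matrix_power(np.asarray(A, dtype=float), n)
    return walks / walks.sum(axis=1, keepdims=True)

def stationary_distribution(T):
    """Left eigenvector of T for the eigenvalue closest to 1, scaled to sum to 1."""
    # Right eigenvectors of T^T are left eigenvectors of T.
    eigenvalues, vectors = np.linalg.eig(T.T)
    k = np.argmin(np.abs(eigenvalues - 1.0))
    pi = np.real(vectors[:, k])
    return pi / pi.sum()

# Hypothetical example: triangle 0-1-2 with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

T1 = n_walk_transition(A, n=1)   # nearest neighbor random walk
T2 = n_walk_transition(A, n=2)   # equiprobable 2-walks
pi1 = stationary_distribution(T1)
degree_share = A.sum(axis=1) / A.sum()
print(pi1, degree_share)
```

For the nearest neighbor walk on an undirected graph the stationary distribution equals each node's share of the total degree. On this example the row-normalized A^2 is not the same matrix as the square of T1, so the walks at scale 2 are a genuinely different random walk.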

Random walks of different scales
Time is introduced as powers of transition matrices.
Stationary distribution is already reached: low centrality (defect) repelling.
Still far from the stationary distribution: defect insensitive.
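The role of time as a matrix power can be sketched numerically. The example below assumes a nearest neighbor walk on a small hypothetical graph and measures how far the t-step distribution from one start node is from the stationary distribution, using the total variation distance (a choice made here, not stated on the slide).

```python
import numpy as np

# Hypothetical example: triangle 0-1-2 with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
T = A / deg[:, None]          # nearest neighbor random walk
pi = deg / deg.sum()          # its stationary distribution

def distance_to_stationary(T, pi, start, t):
    """Total variation distance between the t-step distribution from `start` and pi."""
    row = np.linalg.matrix_power(T, t)[start]
    return 0.5 * np.abs(row - pi).sum()

steps = (1, 4, 16, 64)
distances = [distance_to_stationary(T, pi, start=3, t=t) for t in steps]
for t, d in zip(steps, distances):
    print(t, d)
```

Small powers keep the walker close to its starting neighborhood, while large powers forget the start and reproduce the stationary distribution.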

Random walks for different equivalence relations
Nearest neighbor RW vs. “maximal entropy” RW (J. K. Ochab, Z. Burda).
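A minimal sketch contrasting the two walks, assuming the standard construction of the maximal entropy random walk from the leading eigenpair (λ, ψ) of the adjacency matrix: P_ij = A_ij ψ_j / (λ ψ_i), with stationary distribution ψ_i². The function name and example graph are illustrative.

```python
import numpy as np

def merw_transition(A):
    """Maximal entropy random walk on a connected undirected graph."""
    A = np.asarray(A, dtype=float)
    eigenvalues, eigenvectors = np.linalg.eigh(A)   # ascending order, A is symmetric
    lam = eigenvalues[-1]                           # largest eigenvalue
    psi = np.abs(eigenvectors[:, -1])               # Perron vector; abs only fixes the sign
    P = A * psi[None, :] / (lam * psi[:, None])     # P[i, j] = A[i, j] * psi[j] / (lam * psi[i])
    return P, psi ** 2                              # psi has unit norm, so psi**2 sums to 1

# Hypothetical example: triangle 0-1-2 with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

P_merw, pi_merw = merw_transition(A)
pi_nearest = A.sum(axis=1) / A.sum()   # stationary distribution of the nearest neighbor RW
print(pi_merw, pi_nearest)
```

The nearest neighbor walk makes all one-step moves from a node equiprobable, whereas the maximal entropy walk makes all walks of a given length between the same two endpoints equiprobable, which is why the two stationary distributions differ on irregular graphs.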

Step 1: “Probabilistic graph theory”
As soon as we define an equivalence relation …
The shortest-path distance is insensitive to the structure of the graph.
The distance as “a Feynman path integral” is sensitive to the global structure of the graph.
Systems of weights are related to each other in a geometric fashion.

Step 1: “Probabilistic graph theory”
As soon as we define an equivalence relation …
Objects, ordered by time scale: node, subgraph (a subset of nodes), graph.
Centrality measures (stationary distributions).
Return times to a node.
Random target time.
“Wave functions” (Slater determinants) of transients (traversing nodes and subgraphs within the characteristic scales) return the probability amplitudes whose modulus squared represents the probability density over the subgraphs.
Return times to the subgraphs within transients = 1/Pr{ … }.
Mixing times over subgraphs (times until the Markov chain is “close” to the steady state distribution).
Probabilistic graph invariants = the t-step recurrence probabilities quantifying the chance to return in t steps:
Tr T: the probability that the RW stays at the initial node in 1 step;
| det T |: the probability that the RW revisits the initial node in N steps.

Recurrence probabilities as principal invariants of the graph
By the Cayley–Hamilton theorem, T satisfies its own characteristic equation, whose roots μ are the eigenvalues of T and whose coefficients $\{I_k\}_{k=1}^{N}$ are its principal invariants, with $I_0 = 1$. The powers of T entering that equation are multi-step transition matrices, by the Kolmogorov-Chapman equation.
$|I_1| = \mathrm{Tr}\,T$ is the probability that a random walker stays at a node in one time step.
$|I_N| = |\det T|$ expresses the probability that the random walks revisit an initial node in N steps.
$|I_k|$ are the k-step recurrence probabilities quantifying the chance to return in k steps.
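The statements about principal invariants can be checked numerically. This is a sketch under two assumptions made here: the sign convention det(μ·1 − T) = Σ_k (−1)^k I_k μ^(N−k), and a lazy walk (half of the probability stays put) so that Tr T is not trivially zero. Neither is confirmed by the slides.

```python
import numpy as np

# Hypothetical example: triangle 0-1-2 with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
# Lazy nearest neighbor walk (assumption): stay with probability 1/2.
T = 0.5 * (np.eye(len(A)) + A / deg[:, None])
N = T.shape[0]

# np.poly on a square matrix returns the coefficients c_0..c_N of det(mu*1 - T), c_0 = 1.
coeffs = np.poly(T)
# Principal invariants in the assumed convention: I_k = (-1)^k c_k.
invariants = np.array([(-1) ** k * coeffs[k] for k in range(N + 1)])

# Cayley-Hamilton: substituting T into its own characteristic polynomial gives zero.
residual = sum(coeffs[k] * np.linalg.matrix_power(T, N - k) for k in range(N + 1))

print(invariants)
print(np.trace(T), np.linalg.det(T))
print(np.abs(residual).max())
```

The first invariant reproduces the trace, the last one the determinant, and the residual of the matrix polynomial vanishes up to rounding.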

Analogy with fermionic systems
The determinants of minors of the k-th order of Ψ define an orthonormal basis in the …
The squares of these determinants define the probability distributions over the ordered sets of k indices, satisfying the natural normalization condition. They describe currents of random walkers.
The simplest example is the stationary distribution of random walks.

Step 2: “Geometrization of Data Manifolds”
As soon as we get probability distributions …
Given T, let L ≡ 1 − T; both are linear operators acting on distributions. The Green function is the natural way to find the relation between two distributions within the diffusion process; it is obtained as Drazin’s generalized inverse.
Given two distributions x, y over the set of nodes, we can define a scalar product, the (squared) norm of a vector, an angle, and the Euclidean distance.
Transport problems of the Monge-Kantorovich type: “first-passage transportation” from x to y, with W(x→y) ≠ W(y→x).
Interpretations of the resulting quantities: (mean) first-passage time; commute time; electric potential; effective resistance distance; tax assessment land price in cities; musical diatonic scale degree; musical tonality scale; …

Example 1: Nearest-neighbor random walks on undirected graphs
The commute time: the expected number of steps required for a random walker starting at i ∈ V to visit j ∈ V and then to return to i.
The spectral representation of the (mean) first-passage time: the expected number of steps required to reach the node i for the first time, starting from a node chosen randomly among all nodes of the graph according to the stationary distribution π.
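A minimal sketch of both quantities, assuming the standard spectral formulas for a reversible chain written in terms of the eigenvalues μ_s and orthonormal eigenvectors ψ_s of the symmetrized operator D^(-1/2) A D^(-1/2): the first-passage time to i is Σ_{s≥2} ψ_{s,i}² / ((1 − μ_s) ψ_{1,i}²), and the commute time between i and j is Σ_{s≥2} (ψ_{s,i}/ψ_{1,i} − ψ_{s,j}/ψ_{1,j})² / (1 − μ_s). The function name is illustrative.

```python
import numpy as np

def spectral_times(A):
    """First-passage times (from a stationary start) and commute times.

    A is the adjacency matrix of a connected undirected graph.
    """
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    pi = deg / deg.sum()                       # stationary distribution; psi_1 = sqrt(pi)
    T_sym = A / np.sqrt(np.outer(deg, deg))    # D^(-1/2) A D^(-1/2)
    mu, psi = np.linalg.eigh(T_sym)            # ascending; the last eigenvalue is mu_1 = 1
    weights = 1.0 / (1.0 - mu[:-1])            # skip mu_1 = 1
    scaled = psi[:, :-1] / np.sqrt(pi)[:, None]         # psi_{s,i} / psi_{1,i}
    first_passage = (scaled ** 2) @ weights
    diff = scaled[:, None, :] - scaled[None, :, :]
    commute = (diff ** 2) @ weights
    return first_passage, commute

# Hypothetical example: the path graph 0 - 1 - 2.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
first_passage, commute = spectral_times(path)
print(first_passage)
print(commute)
```

On the path graph the middle node is reached fastest, and the commute time between the two end nodes is twice that between an end node and the middle.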

Example 2: First-passage times in cities
Some places in urban environments are easily accessible, others are not; easily accessible places are more attractive to the public, while isolated places are either abandoned or misused. In the long run, inequality in accessibility results in a disparity of land prices: the more isolated a place is, the lower its price. Over time, structural isolation causes social isolation, as the host society occupies the structural focus of the urban environment, while the guest society typically resides on the outskirts, where land is relatively cheap.
Figures: (mean) first-passage time vs. tax assessment value of land ($), Manhattan, 2005; Neubeckum, Germany, 2012.

(Mean) first-passage times in the city graph of Manhattan. Labeled places: Federal Hall, Times Square, SoHo, East Village, Bowery, East Harlem.

Where could we make jogging trails?

Figure: first-passage time (expected number of random steps) for the numbered places in the neighborhood, ranging from 11.70 to 35.73.

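Mobility weights: adults 1.0, children 1.5, pensioners 0.5.

House                            Adults  Pensioners  Children  Share
Carlmeyerstr. 001                    10           0         0  0.019
Carlmeyerstr. 002                    10           0         0  0.019
Carlmeyerstr. 003                    22          10         0  0.05
Carlmeyerstr. 004                     9          15         0  0.03
Carlmeyerstr. 005                    14           3         0  0.029
Carlmeyerstr. 006                    33           0         7  0.081
Carlmeyerstr. 007                    18           0        12  0.067
Carlmeyerstr. 008                    34           0        15  0.106
Carlmeyerstr. 009                    46           0         6  0.104
Carlmeyerstr. 010                    22           0         0  0.041
Carlmeyerstr. 011                    16           7         0  0.036
Carlmeyerstr. 012                    16           0         2  0.036
Carlmeyerstr. 013                    63           0         8  0.14
Carlmeyerstr. 014                    29           0         3  0.062
Carlmeyerstr. 015                    19           5         3  0.049
Unbekannt_haus (unknown house)       10           0         0  0.019
haus_115                             60           0         0  0.112
Total (527 residents)               431          40        56  1
Mobility-weighted (sum 535)         431          20        84

The last column has no header in the source; its values are consistent with the mobility-weighted residents of each house divided by the weighted total.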

Figure: the most isolated places and the most integrated places. Typical directions of movement are indicated by the blue arrows.

Figure: the physically shortest path; the path for meeting as many people as possible; the path for meeting as few people as possible.
Concept: arrange seats for sport along the less used paths in the neighborhood.
Reasons:
1. To split the public (business and needs) activity and the private (sport) activity;
2. To prevent a social misuse of isolated places in the neighborhood.

Example 3: Electric Resistance Networks, Resistance distance
An electrical network is considered as an interconnection of resistors. The currents are described by the Kirchhoff circuit law.
Given an electric current from a to b of amount 1 A, the effective resistance of a network is the potential difference between a and b.
The effective resistance allows for a spectral representation.
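A minimal sketch of the resistance distance, assuming the standard expression through the Moore-Penrose pseudoinverse G of the graph Laplacian, R_ab = G_aa + G_bb − 2 G_ab, which is the potential difference produced by a unit current from a to b. The function name and the two test networks are illustrative.

```python
import numpy as np

def effective_resistance(C):
    """All-pairs effective resistance of a connected resistor network.

    C[i, j] is the conductance (1 / resistance) of the resistor between i and j,
    or 0 if the two nodes are not directly connected.
    """
    C = np.asarray(C, dtype=float)
    laplacian = np.diag(C.sum(axis=1)) - C      # Kirchhoff (Laplacian) matrix
    G = np.linalg.pinv(laplacian)               # pseudoinverse: the Laplacian is singular
    d = np.diag(G)
    return d[:, None] + d[None, :] - 2.0 * G

# Hypothetical examples with unit resistors.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
triangle = np.array([[0, 1, 1],
                     [1, 0, 1],
                     [1, 1, 0]])

R_path = effective_resistance(path)
R_triangle = effective_resistance(triangle)
print(R_path[0, 2], R_triangle[0, 1])
```

The path reproduces two unit resistors in series, and the triangle reproduces one unit resistor in parallel with two in series.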

Impedance networks: The two-point impedance and LC resonances

Resonances

(Complexity reduction) PCA based on geodesics
Small data variations give rise to small changes in the eigenvectors (rotations) and eigenvalues of the symmetric transition operator, so that we can consider the image of the database as a “probabilistic manifold” in $P\mathbb{R}^{N-1}$. Geodesics on the sphere are great circles. PCA is performed in the tangent space, then the “principal directions” are projected onto geodesics. The result is an ordered sum of assorted data variations.

Geodesic paths of language evolution
Levenshtein’s distance (edit distance) is a measure of the similarity between two strings: the number of deletions, insertions, or substitutions required to transform one string into another.
MILCHK = MILK
The normalized edit distance between the orthographic realizations of two words can be interpreted as the probability of mismatch between two characters picked from the words at random.
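A minimal sketch of the edit distance using the usual dynamic-programming recurrence. Two assumptions are made here: the normalization divides by the length of the longer word, and the word pair MILCH / MILK is used as the illustration. The slide states neither explicitly.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of deletions, insertions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]                           # deleting the first i characters of a
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

distance = levenshtein("MILCH", "MILK")
share = normalized_edit_distance("MILCH", "MILK")
print(distance, share)
```

With this normalization the result lies between 0 (identical words) and 1 (no character in common at any aligned position), which is what allows the probabilistic reading of the distance.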

1. The four well-separated monophyletic spines represent the four biggest traditional IE language groups: Romance & Celtic, Germanic, Balto-Slavic, and Indo-Iranian;
2. The Greek, Romance, Celtic, and Germanic languages form a class characterized by approximately the same azimuth angle (they belong to one plane);
3. The Indo-Iranian, Balto-Slavic, Armenian, and Albanian languages form another class, with respect to the zenith angle.

The systematic sound correspondences between the Swadesh words across the different languages coincide perfectly with the well-known centum-satem isogloss of the IE family (reflecting the IE numeral ‘100’), related to the evolution of the phonetically unstable palatovelar order.

The normal probability plots fitting the distances r of language points from the ‘center of mass’ to univariate normality. The data points were ranked and then plotted against their expected values under normality, so that departures from linearity signify departures from normality.

The univariate normal distribution is closely related to the time evolution of a mass-density function under homogeneous diffusion in one dimension, in which the mean value μ is interpreted as the coordinate of the point where all mass was initially concentrated, and the variance σ² ∝ t grows linearly with time.
Anchor events:
1. the last Celtic migration (to the Balkans and Asia Minor) (300 BC),
2. the division of the Roman Empire (500 AD),
3. the migration of German tribes to the Danube River (100 AD),
4. the establishment of the Avars Khaganate (590 AD) overspreading Slavic people who did the bulk of the fighting across Europe.
The values of the variance σ² give a statistically consistent estimate of age for each language group.
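The link between diffusion and the normal distribution can be written out explicitly. This is a sketch assuming a constant diffusion coefficient D, a symbol introduced here and not used on the slide:

```latex
\frac{\partial \rho}{\partial t} = D\,\frac{\partial^{2} \rho}{\partial x^{2}},
\qquad
\rho(x,0) = \delta(x-\mu)
\;\Longrightarrow\;
\rho(x,t) = \frac{1}{\sqrt{4\pi D t}}\,
\exp\!\left(-\frac{(x-\mu)^{2}}{4 D t}\right),
\qquad
\sigma^{2} = 2 D t .
```

Under this reading, an anchor event of known age fixes the ratio σ²/t, and the variance measured for another language group then gives its age as t = σ²/(2D).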

From the time–variance ratio we can retrieve the probable dates for:
The break-up of the Proto-Indo-Iranian continuum: by 2,400 BC. The migration from the early Andronovo archaeological horizon (Bryant, 2001).
The end of common Balto-Slavic history: before 1,400 BC. The archaeological dating of the Trzciniec-Komarov culture.
The separation of Indo-Aryans from Indo-Iranians. Probably a result of Aryan migration across India to Ceylon, as early as 483 BC (McLeod, 2002).
The division of the Persian polity into a number of Iranian tribes, after the end of the Greco-Persian wars (Green, 1996): before 400 BC.

The Kurgan scenario postulates the IE origin among the people of the “Kurgan culture” (early 4th millennium BC) in the Pontic steppe (Gimbutas, 1982).
The Anatolian hypothesis suggests the origin in Neolithic Anatolia and associates the expansion with the Neolithic agricultural revolution in the 8th and 6th millennia BC (Renfrew, 1987). (Figure: einkorn wheat.)
The graphical test to check three-variate normality of the distribution of the distances of the five proto-languages from a statistically determined central point is presented by extending the notion of the normal probability plot. The χ-square distribution is used to test for goodness of fit of the observed distribution: departures from three-variate normality are indicated by departures from linearity.
The use of the previously determined time–variance ratio then dates the initial break-up of the Proto-Indo-Europeans back to 7,400 BC, pointing at the early Neolithic date.

An interaction sphere encompassing the whole region had existed by 550 AD, pretty well before 600–1200 AD, while descendants from Melanesia settled in the distant apices of the Polynesian triangle, as evidenced by archaeological records (Kirch, 2000; Anderson and Sinoto, 2002; Hurles et al., 2003).

Mystery of the Tower of Babel
Nonliterate languages evolve EXPONENTIALLY FAST without extensive contacts with the remaining population. Isolation does not preserve a nonliterate language!
Languages spoken in the islands of East Polynesia and those of the Atayal language groups seem to have evolved without extensive contacts with Melanesian populations, perhaps because of a rapid movement of the ancestors of the Polynesians from South-East Asia, as suggested by the ‘express train’ model (Diamond, 1988), consistent with multiple lines of evidence for comparatively reduced genetic variation among human groups in Remote …
(Figure: headhunters.)

Recurrence time vs. first-passage time: traps and landmarks
Traps, “confusing environments”: can take long to reach, but are often revisited.
Landmarks, “guiding structures”: reached first, seldom revisited.

Musical Dice Game (*)
The relations between notes in (*) are described in terms of probabilities and expected numbers of random steps rather than by physical time. Thus the actual length N of a composition is formally put N → ∞, or as long as you keep rolling the dice.
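A minimal sketch of a dice game of this kind, assuming that notes are the states of a Markov chain whose transition probabilities are estimated by counting consecutive note pairs. The short melody, the cyclic closure of the sequence, and all function names are hypothetical choices made here; the recurrence time of a note is taken as 1/π (Kac's lemma).

```python
import numpy as np

def transition_matrix(sequence):
    """Estimate note-to-note transition probabilities from a list of notes."""
    states = sorted(set(sequence))
    index = {s: k for k, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    # Assumption: close the melody into a cycle so every note has a successor.
    for a, b in zip(sequence, sequence[1:] + sequence[:1]):
        counts[index[a], index[b]] += 1
    return states, counts / counts.sum(axis=1, keepdims=True)

def roll_the_dice(states, T, start, length, seed=0):
    """Generate `length` notes by repeatedly sampling the next note from T."""
    rng = np.random.default_rng(seed)
    k = states.index(start)
    notes = [start]
    for _ in range(length - 1):
        k = rng.choice(len(states), p=T[k])
        notes.append(states[k])
    return notes

melody = ["C", "E", "G", "E", "C", "G", "C", "E"]   # hypothetical input
states, T = transition_matrix(melody)

# Stationary distribution: left eigenvector of T for the eigenvalue 1.
eigenvalues, vectors = np.linalg.eig(T.T)
pi = np.real(vectors[:, np.argmin(np.abs(eigenvalues - 1.0))])
pi = pi / pi.sum()
recurrence_time = 1.0 / pi                          # expected steps between visits to a note

generated = roll_the_dice(states, T, start="C", length=16)
print(dict(zip(states, recurrence_time)))
print(generated)
```

The generated sequence can be made as long as desired, which matches the idea that the length of a composition is only limited by how long the dice keep rolling.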

Examples: F. Liszt, Consolation No. 1; J. S. Bach, Prelude BWV 999; W. A. Mozart, Eine kleine Nachtmusik; R. Wagner, Das Rheingold (Entrance of the Gods).

A “guiding structure”: tonality scales in Western music
Increase of harmonic interval / first-passage time.
Figure: the recurrence time vs. the first-passage time over 804 compositions of 29 Western composers.

Network geometry at different scales
As the scale of the RW changes, a node either belongs to a network “core”, consolidating with other central nodes, or belongs to a “cluster”, loosely connected with the rest of the network.

Possible analogy with Ricci flows (first-passage time vs. scale of the RW)
“Densification” of the network of “positive curvature”.
“Contraction” of a “probabilistic manifold”.
A “collapse” of the network of “negative curvature”.

Ricci flows and photo resolution

Intelligibility of a network/database
First-passage time: property of a node w.r.t. a global structure.
Recurrence time: property of a node w.r.t. a local structure.
Increase of harmonic interval / first-passage time.
n → ∞: after enough learning, any structure becomes intelligible!

References
D.V., Ph. Blanchard, “Introduction to Random Walks on Graphs and Databases”, Springer Series in Synergetics, Vol. 10, Berlin / Heidelberg, ISBN 978-3-642-19591-4 (2011).
D.V., Ph. Blanchard, “Mathematical Analysis of Urban Spatial Networks”, Springer Series Understanding Complex Systems, Berlin / Heidelberg, ISBN 978-3-540-87828-5, 181 pages (2009).
