Introduction to Normalizing Flows
Basic Normalizing Flow
A normalizing flow is similar to a VAE in that we try to build up $p(X)$ by starting from a simple known distribution $p(Z)$. $X$ and $Z$ are of the same dimensionality.
The most basic idea of a flow connecting a complex $X$ and a simple $Z$ is to use a bijective function $f$ with its inverse $f^{-1}$ such that:
- $X = f(Z)$
- $Z = f^{-1}(X)$
- $p(X) = p_Z(f^{-1}(X))\left\lvert \det(\mathbf{J}_{f^{-1}}) \right\rvert$, where the term on the right is the absolute value of the determinant of the Jacobian of $f^{-1}(X)$ w.r.t. $X$, and $p_Z$ uses a subscript $Z$ to distinguish it from the distribution of $X$.
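As a quick worked example (my own, not from the original papers), consider the 1-D affine flow $X = f(Z) = aZ + b$ with $a \ne 0$. Then $f^{-1}(X) = (X - b)/a$, the Jacobian is the scalar $1/a$, and the formula gives
\[p(X) = p_Z\left(\frac{X - b}{a}\right)\frac{1}{|a|},\]
which is exactly the familiar change-of-variables rule for a linear transformation of a random variable.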
It is natural to introduce a chain structure between $X$ and $Z$:
- $X = f_1(f_0(Z))$
- $Z = f_0^{-1}(f_1^{-1}(X))$
- $p(X) = p_Z(f_0^{-1}(f_1^{-1}(X))) \left\lvert \det(\mathbf{J}_{f_0^{-1}}) \right\rvert \left\lvert \det(\mathbf{J}_{f_1^{-1}}) \right\rvert$
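Taking logarithms turns the product of Jacobian determinants into a sum, which is the form actually used during training: for a chain $X = f_K \circ \cdots \circ f_1(Z)$,
\[\log p(X) = \log p_Z(Z) + \sum_{k=1}^{K} \log \left\lvert \det(\mathbf{J}_{f_k^{-1}}) \right\rvert.\]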
Training Process
Since most of the time $X$ is what we observe, with an empirical distribution consisting of $N$ data points $\{X_1, X_2, \cdots, X_N\}$, we can simply minimize the negative log-likelihood:
\[\text{argmin}_f \sum_{i = 1}^N -\log (p(X_i)) = \text{argmin}_f \sum_{i = 1}^N -\log \left(p_Z(f^{-1}(X_i))\left\lvert \det(\mathbf{J}_{f^{-1}}(X_i)) \right\rvert\right)\]
Commonly Used Flows
Planar Flow
\[X=f(Z)= Z+ \mathbf{u} h(\mathbf{w}^TZ+b)\]
$\mathbf{u}, \mathbf{w}$ and $b$ are parameters. Additional constraints are required on $\mathbf{u}, \mathbf{w}, b$ and $h$ to guarantee that the function is bijective. For example, it is intuitive that $h$ needs to be bijective, like $h = \tanh$.
The Jacobian of $f^{-1}(X)$ is not obvious, as it depends on $h$, so the analytic form of $p(X)$ is not easy to compute.
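That said, the determinant of the *forward* Jacobian does have a closed form via the matrix determinant lemma, $\det(I + \mathbf{u}h'\mathbf{w}^T) = 1 + h'\,\mathbf{u}^T\mathbf{w}$. A minimal PyTorch sketch, assuming $h = \tanh$ (variable names are mine, and note this sketch does not enforce the constraint $\mathbf{u}^T\mathbf{w} \ge -1$ needed for invertibility):

```python
import torch

class PlanarFlow(torch.nn.Module):
    """x = z + u * tanh(w^T z + b); log|det J| via the matrix determinant lemma."""
    def __init__(self, dim):
        super().__init__()
        self.u = torch.nn.Parameter(torch.randn(dim) * 0.1)
        self.w = torch.nn.Parameter(torch.randn(dim) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(1))

    def forward(self, z):                       # z: (batch, dim)
        a = z @ self.w + self.b                 # (batch,)
        x = z + self.u * torch.tanh(a).unsqueeze(-1)
        h_prime = 1 - torch.tanh(a) ** 2        # tanh'(a)
        logdet = torch.log(torch.abs(1 + h_prime * (self.u @ self.w)))
        return x, logdet
```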
Nonlinear Independent Components Estimation (NICE)
This method requires that $X$ can be split into two disjoint parts, $X_1$ and $X_2$, i.e. $X = [X_1, X_2]$. Same assumption for $Z$.
Forward mapping $f: Z \rightarrow X$
- $X_1 = Z_1$
- $X_2 = Z_2 + m_\theta(Z_1)$, where $m_\theta$ is a neural network

Inverse mapping $f^{-1}: X \rightarrow Z$
- $Z_1 = X_1$
- $Z_2 = X_2 - m_\theta(X_1)$

The inverse mapping of $Z_2$ can be obtained simply by replacing $Z_1$ in the forward mapping of $X_2$ with $X_1$ and moving $m_\theta(Z_1)$ (equivalent to $m_\theta(X_1)$) to the LHS.
The Jacobian of the inverse mapping is lower triangular, so its determinant is simply the product of the elements on the diagonal, i.e.
\[\frac{\partial f^{-1}(X)}{\partial X} = \begin{pmatrix} \frac{\partial Z_1}{\partial X_1} & \frac{\partial Z_1}{\partial X_2} \\ \frac{\partial Z_2}{\partial X_1} & \frac{\partial Z_2}{\partial X_2} \end{pmatrix} =\begin{pmatrix} I & 0 \\ \frac{\partial (-m_\theta(X_1))}{\partial X_1} & I \end{pmatrix}\]
\[\det\begin{pmatrix} I & 0 \\ \frac{\partial (-m_\theta(X_1))}{\partial X_1} & I \end{pmatrix} = 1\]
Therefore, this defines a volume-preserving transformation.
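A minimal PyTorch sketch of such an additive coupling layer (the architecture of $m_\theta$ here is an arbitrary placeholder, not the one from the paper, and the input dimension is assumed even):

```python
import torch

class AdditiveCoupling(torch.nn.Module):
    """NICE additive coupling: x1 = z1, x2 = z2 + m(z1); log|det J| = 0."""
    def __init__(self, dim_half, hidden=128):
        super().__init__()
        self.m = torch.nn.Sequential(
            torch.nn.Linear(dim_half, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, dim_half))

    def forward(self, z):                        # Z -> X
        z1, z2 = z.chunk(2, dim=-1)
        return torch.cat([z1, z2 + self.m(z1)], dim=-1)

    def inverse(self, x):                        # X -> Z
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1, x2 - self.m(x1)], dim=-1)
```

Since the log-determinant vanishes, training reduces to maximizing $\log p_Z(f^{-1}(X_i))$ over the data, which is exactly the negative log-likelihood objective above with the Jacobian term dropped.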
Continuous Normalizing Flows
Continuous Normalizing Flows solve the problems of selecting proper transition functions and the high computational complexity of Jacobians. The idea originates from a discrete, sequence-based transition:
\[X_{t+1} = X_t + f(X_t, \theta_t)\]
where $f$ is time-sensitive, with the parameter $\theta_t$ changing over time. From the perspective of flows, this sequence of transitions provides a natural way of connecting $X$ and $Z$: $X$ is $X_T$ at the end of the sequence and $Z$ is $X_0$ at the beginning, with finitely many discrete intermediate states $X_t, t \in [1, T-1] \cap \mathbb{N}^+$. Each step is a residual update, which is bijective under mild conditions on $f$ (e.g., a Lipschitz constraint).
We then ask what happens if the sequence becomes continuous. Since the discrete transition describes the change at time $t$ in a discrete manner, i.e. $X_{t+1} - X_t = \Delta X_{t} = f(X_t, \theta_t)$, the continuous version is: \(\frac{\partial X_t}{\partial t} = f(X_t, t, \theta)\) which is an ordinary differential equation (ODE). Starting from the input layer $X_0$ ($Z$), we can define the output layer $X_T$ to be the solution of this ODE initial value problem at some time $T$, computed empirically by some black-box ODE solver:
\[X_t = X_0 + \int_0^t f(X_s, s, \theta)\,ds = \text{ODESolve}(X_0, f, 0, t, \theta)\]
Computation of Jacobian
With the ODE defining the continuous mapping from $X_t$ to $X_{t+1}$, the next question is how the probability changes from $t$ to $t+1$. In the discrete case, we have
$$\log (p(X_{t+1})) - \log(p(X_t)) = \log \left( \left\lvert \det(\mathbf{J}_{F_t^{-1}}) \right\rvert \right),$$
where $F_t$ is the mapping function at time $t$: $X_{t+1} = X_{t} + \int_t^{t+1} f(X_s, s, \theta)\,ds = \text{ODESolve}(X_t, f, t, t+1, \theta)$.
The paper proves that in the continuous case,
\[\frac{\partial \log(p(X_t))}{\partial t} = -\text{tr}\left(\frac{\partial f}{\partial X_t}\right)\]
with the same $f$ defined in the ODE $\frac{\partial X_t}{\partial t} = f(X_t, t, \theta)$. ($f$ has the same dimensionality as $X_t$, so its first derivative w.r.t. $X_t$ is a square matrix, and this theorem says we only need to compute the diagonal elements and sum them up.)
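To make the two coupled ODEs concrete, here is a minimal sketch (all names are mine) that integrates both with a fixed-step Euler scheme, computing the exact trace with autograd; a real implementation would use an adaptive solver, as discussed below:

```python
import torch

def trace_df_dx(fx, x):
    """Exact trace of the Jacobian df/dx: one autograd pass per dimension.
    Costs O(d) backward passes, which is fine for small d."""
    trace = 0.0
    for i in range(x.shape[-1]):
        trace = trace + torch.autograd.grad(
            fx[..., i].sum(), x, create_graph=True)[0][..., i]
    return trace

def euler_cnf(x0, dynamics, t0=0.0, t1=1.0, steps=100):
    """Jointly integrate dx/dt = f(x, t) and dlogp/dt = -tr(df/dx).
    `dynamics(x, t)` is any callable returning f(x, t)."""
    x = x0.detach().clone().requires_grad_(True)
    delta_logp = torch.zeros(x0.shape[0])
    dt = (t1 - t0) / steps
    for k in range(steps):
        t = t0 + k * dt
        fx = dynamics(x, t)
        delta_logp = delta_logp - dt * trace_df_dx(fx, x)
        x = x + dt * fx
    return x, delta_logp  # log p(x_T) = log p_Z(x_0) + delta_logp
```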
An example given by the authors is a continuous version of the Planar Flow, where the function $\mathbf{u}h(\cdot)$ no longer serves as a direct mapping from $X_t$ to $X_{t+1}$, but instead describes the dynamics (the gradient) at $X_t$:
\[\frac{\partial X_t}{\partial t} = \mathbf{u}h(\mathbf{w}^TX_t + b)\]
\[\frac{\partial \log(p(X_t))}{\partial t} = -\mathbf{u}^T\frac{\partial h}{\partial X_t}\]
The two ODEs then give a way of sampling a complex $X$ ($X_t$) from a simple random variable $Z$ ($X_0$) via the first ODE, and of estimating its density via the second ODE, assuming we have a good black-box ODE solver.
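Plugging this into the sketch above, a hypothetical planar dynamics module and a sampling call might look like:

```python
import torch

class PlanarDynamics(torch.nn.Module):
    """dx/dt = u * tanh(w^T x + b); parameters are shared across time."""
    def __init__(self, dim):
        super().__init__()
        self.u = torch.nn.Parameter(torch.randn(dim) * 0.1)
        self.w = torch.nn.Parameter(torch.randn(dim) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(1))

    def forward(self, x, t):  # t is unused: the dynamics are time-invariant
        return self.u * torch.tanh(x @ self.w + self.b).unsqueeze(-1)

dynamics = PlanarDynamics(dim=2)
z = torch.randn(256, 2)                   # samples from the base p_Z
x, delta_logp = euler_cnf(z, dynamics)    # euler_cnf from the sketch above
```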
The authors do two additional things with the two ODEs.
- Because the trace function is linear, we can easily sum multiple dynamics $f_n$ in the ODE: $\frac{\partial X_t}{\partial t} = \sum_n f_n(X_t)$, which gives $\frac{\partial \log(p(X_t))}{\partial t} = -\sum_n \text{tr}\left(\frac{\partial f_n}{\partial X_t}\right)$.
- Also by utilizing the linearity, we can specify the role of $t$ in $f$ in a gating-mechanism manner: $\frac{\partial X_t}{\partial t} = \sum_n \sigma_n(t) f_n(X_t)$, where each $f_n$ is now independent of $t$ and $\sigma_n(t)$ is a learned gate.
Backpropagation
A vanilla way of computing the gradients w.r.t. the parameters of $f$ is expensive. The authors propose using the adjoint sensitivity method, which computes gradients by solving a second, augmented ODE backwards in time, and is applicable to all ODE solvers.
Suppose we have a scalar loss function $L$ on the output $X_{t_1} = \text{ODESolve}(X_{t_0}, f, t_0, t_1, \theta)$, written as \(L(\text{ODESolve}(X_{t_0}, f, t_0, t_1, \theta))\). The target is $\frac{\partial L}{\partial \theta}$.
We first work on the adjoint \(\mathbf{a}(t) = \frac{\partial L}{\partial X_{t}}\). Its dynamics are given by another ODE, which can be thought of as the instantaneous analog of the chain rule:
\[\frac{\partial \mathbf{a}(t)}{\partial t} = -\mathbf{a}(t)^T\frac{\partial f(X_t, t, \theta)}{\partial X_{t}}\]
We can then compute $\mathbf{a}(t)$ by another call to an ODE solver. This solver must run backwards, starting from the initial value $\mathbf{a}(t_1)$. Because we need to know $X_t$ when computing this gradient at time $t$, we also need to recover $X_t$ backwards from $t_1$, together with the backward computation of $\mathbf{a}(t)$.
Finally we use $\mathbf{a}(t)$ to compute $\frac{\partial L}{\partial \theta}$:
\[\frac{\partial L}{\partial \theta} = -\int_{t_0}^{t_1}\mathbf{a}(t)^T\frac{\partial f(X_t, t, \theta)}{\partial \theta}\,dt\]
which is another ODE:
\[\frac{\partial}{\partial t}\frac{\partial L}{\partial \theta} = -\mathbf{a}(t)^T\frac{\partial f(X_t, t, \theta)}{\partial \theta}\]
so we call the ODE solver once more.
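In practice these solver calls are packaged together; for instance, the torchdiffeq library released with the paper exposes the adjoint method as a drop-in `odeint`. A minimal usage sketch (the MLP architecture here is an arbitrary placeholder):

```python
import torch
from torchdiffeq import odeint_adjoint as odeint  # github.com/rtqichen/torchdiffeq

class ODEFunc(torch.nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.Tanh(),
            torch.nn.Linear(hidden, dim))

    def forward(self, t, x):      # torchdiffeq expects the signature f(t, x)
        return self.net(x)

func = ODEFunc(dim=2)
x0 = torch.randn(128, 2)
t = torch.tensor([0.0, 1.0])
xT = odeint(func, x0, t)[-1]      # solve the IVP; returns states at each t
loss = xT.pow(2).mean()
loss.backward()                    # gradients computed via the adjoint ODE
```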
Riemannian Continuous Normalizing Flows
This work relies heavily on an understanding of topology, so I add an additional background section.
In this work, flows are defined via vector fields on manifolds and computed as the solution to the associated ordinary differential equation (ODE). Intuitively, this method operates by first parametrizing a vector field on the manifold with a neural network, then sampling particles from a base distribution, and finally approximating their flow along the vector field using a numerical solver. The high-level idea is: for the ODE of the dynamics \(\frac{\partial X_t}{\partial t} = f(X_t, t, \theta)\), we now consider $f(X_t, t, \theta)$ as a vector field describing the velocity at $X_t$. Consider the temporal evolution of a particle $X_t$ on a $d$-dimensional manifold $\mathcal{M}$, whose velocity is given by a vector field $f(X_t, t, \theta)$. Intuitively, $f(X_t, t, \theta)$ indicates the direction and speed along which the particle is moving on the manifold's surface. Classic examples of such vector fields include weathercocks giving wind direction and compasses pointing toward the magnetic north pole of the earth.
Geometric Background
Mostly based on this
Manifold
It is a topological space that can be covered by a collection of open subsets, $\mathbb{U}_\alpha$, where each $\mathbb{U}_\alpha$ is isomorphic to some "standard model", e.g., some open subset of Euclidean space, $\mathbb{R}^n$. (An isomorphism is a structure-preserving mapping between two structures of the same type that can be reversed by an inverse mapping.)
Chart
Given a topological space, $\mathcal{M}$, a chart (or local coordinate map) is a pair, $(\mathbb{U}, \varphi)$, where $\mathbb{U}$ is an open subset of $\mathcal{M}$ and $\varphi: \mathbb{U} \rightarrow \mathbb{\Omega}$ is a homeomorphism onto an open subset, $\mathbb{\Omega} = \varphi(\mathbb{U})$, of $\mathbb{R}^{n_\varphi}$ (for some $n_\varphi \ge 1$).
(A homeomorphism, from Greek ὅμοιος (homoios) 'similar, same' and μορφή (morphē) 'shape, form', named by Henri Poincaré, also called a topological isomorphism or bicontinuous function, is a bijective and continuous function between topological spaces that has a continuous inverse function.)
A more intuitive way of describing a chart is found here: A chart for a topological space $\mathcal{M}$ (also called a coordinate chart, coordinate patch, coordinate map, or local frame) is a homeomorphism $\varphi$ from an open subset $\mathbb{U}$ of $\mathcal{M}$ to an open subset of a Euclidean space. The chart is traditionally recorded as the ordered pair $(\mathbb{U}, \varphi)$.
For a point $p \in \mathbb{U}$, we obtain its local coordinates by first mapping $p$ to Euclidean space via $\varphi$, i.e. $\varphi(p)$, then projecting onto the $n_\varphi$ coordinates of this Euclidean space. The coordinates can be written as $(\text{Proj}(\varphi(p))_1, \text{Proj}(\varphi(p))_2, \cdots, \text{Proj}(\varphi(p))_{n_\varphi})$.
Atlas
Not that god in Greek myth.
An atlas of $\mathcal{M}$ consists of individual charts that, roughly speaking, describe individual regions of the manifold. Formally, it is an indexed family $\{(\mathbb{U}_\alpha, \varphi_\alpha): \alpha \in A\}$, for some index set $A$, which covers $\mathcal{M}$, i.e. $\cup_\alpha \mathbb{U}_\alpha = \mathcal{M}$.
Transition Map
A transition map provides a way of comparing two charts of an atlas. To make this comparison, we consider the composition of one chart with the inverse of the other. Suppose $(\mathbb{U}_\alpha, \varphi_\alpha)$ and $(\mathbb{U}_\beta, \varphi_\beta)$ are two charts for $\mathcal{M}$ such that $\mathbb{U}_\alpha \cap \mathbb{U}_\beta$ is not empty. The transition map $\tau_{\alpha, \beta}: \varphi_\alpha(\mathbb{U}_\alpha \cap \mathbb{U}_\beta) \rightarrow \varphi_\beta(\mathbb{U}_\alpha \cap \mathbb{U}_\beta)$ (mapping the $\varphi_\alpha$-based Euclidean image of the overlap to the $\varphi_\beta$-based Euclidean image of the overlap) is defined as $\tau_{\alpha, \beta} = \varphi_\beta \circ \varphi_\alpha^{-1}$. With $\tau_{\alpha, \beta}$ you can directly jump from the $\varphi_\alpha$-based Euclidean space to the $\varphi_\beta$-based Euclidean space on the overlap of the two charts. Note that we can define in the same way a jump from $\varphi_\beta$ to $\varphi_\alpha$ as $\tau_{\beta, \alpha}: \varphi_\beta(\mathbb{U}_\alpha \cap \mathbb{U}_\beta) \rightarrow \varphi_\alpha(\mathbb{U}_\alpha \cap \mathbb{U}_\beta)$, $\tau_{\beta, \alpha} = \varphi_\alpha \circ \varphi_\beta^{-1}$, so transition maps always come in pairs.
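A standard concrete example (my own addition): cover the unit circle $S^1$ with two angle charts, $\mathbb{U}_1 = S^1 \setminus \{(-1, 0)\}$ with $\varphi_1(p) = \theta \in (-\pi, \pi)$, and $\mathbb{U}_2 = S^1 \setminus \{(1, 0)\}$ with $\varphi_2(p) = \theta \in (0, 2\pi)$. On the overlap (the circle minus both points), the transition map is
\[\tau_{1,2}(\theta) = \varphi_2(\varphi_1^{-1}(\theta)) = \begin{cases} \theta & \theta \in (0, \pi) \\ \theta + 2\pi & \theta \in (-\pi, 0) \end{cases}\]
which is smooth on each connected component of the overlap, so these two charts form a $C^\infty$ atlas of $S^1$.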
$C^k$ n-atlas
A family of charts satisfying the following:
- Covers the whole mainfold $\mathcal{M}$, $\cup_\alpha \mathbb{U}_\alpha = \mathcal{M}$
- Every chart maps to a Euclidean space of the same dimension $n$: $\varphi_i(\mathbb{U}_i) \subseteq \mathbb{R}^n$ for all $i$
- Whenever $\mathbb{U}_i \cap \mathbb{U}_j \ne \emptyset$, the transition map $\tau_{i, j}$ is a $C^k$-diffeomorphism ($k$ times continuously differentiable, with a $C^k$ inverse)
Compatible
A chart $(\mathbb{U}, \varphi)$ is compatible with a $C^k$ $n$-atlas $\mathcal{A}$ iff for every chart $(\mathbb{U}_i, \varphi_i)$ of $\mathcal{A}$ with $\mathbb{U}_i \cap \mathbb{U} \ne \emptyset$, the transition maps between the two are both $C^k$-diffeomorphisms.
Two atlases are $C^k$ compatible iff every chart of one is compatible with the other atlas. This is equivalent to saying that the union of the two $C^k$ atlases is still a $C^k$ atlas.
An atlas $\mathcal{A}$ is maximal if it contains all charts compatible with it. The definition of a maximal atlas is needed so that two manifolds with different but $C^k$-compatible atlases will not be considered different manifolds. A maximal $C^k$ atlas is what we call a $C^k$ differentiable structure. source
$C^k$ manifold of dimension n
A manifold $\mathcal{M}$ of dimension $n$ with a maximal $C^k$ atlas on it.
Curve
In analytic geometry, a curve is a continuous map from a one-dimensional space (an interval) to an $n$-dimensional space, here $\gamma: (-\epsilon, \epsilon) \rightarrow \mathcal{M}$. Namely, one can think of a curve in a topological space as a "path" traced out continuously, where the trace is indexed by time. A differentiable curve is a differentiable manifold of dimension one.
Tangent Space
For a $d$-dimensional manifold $\mathcal{M}$, at every point $p$ on $\mathcal{M}$ there is a tangent space $\mathcal{T}_p \mathcal{M}$. Suppose we have two curves $\gamma_1: (-\epsilon_1, \epsilon_1) \rightarrow \mathcal{M}$ and $\gamma_2: (-\epsilon_2, \epsilon_2) \rightarrow \mathcal{M}$ that both pass through the point $p$, i.e. $\gamma_1(0) = \gamma_2(0) = p$. If there is a chart $(\mathbb{U}, \varphi)$ containing $p$ such that $\frac{d (\varphi \circ \gamma_1)}{dt} \big|_{t=0} = \frac{d (\varphi \circ \gamma_2)}{dt} \big|_{t=0}$ (the derivatives in the mapped Euclidean space w.r.t. curve time at $t = 0$ are identical), then we say the two curves belong to the same equivalence class. We call these equivalence classes tangent vectors, and all the tangent vectors at $p$ form the tangent space.
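As a sanity check, take $\mathcal{M} = \mathbb{R}^n$ with the identity chart $\varphi = \text{id}$. Two curves through $p$ are then equivalent iff $\gamma_1'(0) = \gamma_2'(0)$, so each tangent vector is identified with an ordinary velocity vector and $\mathcal{T}_p \mathbb{R}^n \cong \mathbb{R}^n$.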