Among deep learning practitioners, *Kullback-Leibler divergence* (KL divergence) is perhaps best known for its role in training variational autoencoders (VAEs). To learn an informative latent space, we don’t just optimize for good reconstruction. Rather, we also impose a prior on the latent distribution, and aim to keep the learned distribution close to that prior – often, by minimizing KL divergence.

In this role, KL divergence acts like a watchdog; it is a constraining, regularizing factor, and if anthropomorphized, would seem stern and severe. If we leave it at that, however, we’ve seen just one side of its character, and are missing out on its complement, a picture of playfulness, adventure, and curiosity. In this post, we’ll take a look at that other side.

While inspired by a series of tweets by Simon DeDeo, enumerating applications of KL divergence in a vast number of disciplines, we don’t aspire to provide a comprehensive write-up here – as mentioned in the initial tweet, the topic could easily fill a whole semester of study.

The much more modest goals of this post, then, are

- to quickly recap the role of KL divergence in training VAEs, and mention similar-in-character applications;
- to illustrate that more playful, adventurous “other side” of its character; and
- in a not-so-entertaining, but – hopefully – useful manner, differentiate KL divergence from related concepts such as cross entropy, mutual information, or free energy.

First, though, we start with a definition and some terminology.

## KL divergence in a nutshell

KL divergence is the expected value of the logarithmic difference in probabilities according to two distributions, \(p\) and \(q\). Here it is in its discrete-probabilities variant:

\[\begin{equation}
D_{KL}(p||q) = \sum\limits_{x} p(x) \log\left(\frac{p(x)}{q(x)}\right)
\tag{1}
\end{equation}\]

Notably, it is asymmetric; that is, \(D_{KL}(p||q)\) is not the same as \(D_{KL}(q||p)\). (Which is why it is a *divergence*, not a *distance*.) This aspect will play an important role in section 2, dedicated to the “other side.”
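To make the asymmetry tangible, here is a minimal sketch of equation (1) in plain Python. The function name `kl_divergence`, the choice of base-2 logarithms, and the two example distributions are assumptions made for illustration only.

```python
import math


def kl_divergence(p, q, base=2.0):
    """Discrete D_KL(p || q) as in eq. (1); p and q are aligned probability lists."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # convention: a term with p(x) = 0 contributes nothing
        if q_x == 0.0:
            return math.inf  # q rules out something that p allows
        total += p_x * math.log(p_x / q_x, base)
    return total


# Two made-up distributions over three outcomes (illustrative values only).
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

kl_pq = kl_divergence(p, q)  # expectation taken under p
kl_qp = kl_divergence(q, p)  # expectation taken under q

print(f"D_KL(p||q) = {kl_pq:.3f} bits")
print(f"D_KL(q||p) = {kl_qp:.3f} bits")
```

Swapping the arguments changes the result, while comparing a distribution with itself yields zero.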

To stress this asymmetry, KL divergence is sometimes called *relative information* (as in “information of \(p\) relative to \(q\)”), or *information gain*. We agree with one of our sources that because of its universality and importance, KL divergence would probably have deserved a more informative name; such as, precisely, *information gain*. (Which is less ambiguous pronunciation-wise, as well.)

## KL divergence, “villain”

In many machine learning algorithms, KL divergence appears in the context of *variational inference*. Often, for realistic data, exact computation of the posterior distribution is infeasible. Thus, some form of approximation is required. In variational inference, the true posterior \(p^*\) is approximated by a simpler distribution, \(q\), from some tractable family.

To ensure we have a good approximation, we minimize – in theory, at least – the KL divergence of \(q\) relative to \(p^*\), thus replacing inference by optimization.

In practice, again for reasons of intractability, the KL divergence minimized is that of \(q\) relative to an unnormalized distribution \(\widetilde{p}\)

\[\begin{equation}
J(q) = D_{KL}(q||\widetilde{p})
\tag{2}
\end{equation}\]

where \(\widetilde{p}\) is the joint distribution of parameters and data:

\[\begin{equation}
\widetilde{p}(\mathbf{x}) = p(\mathbf{x}, \mathcal{D}) = p^*(\mathbf{x}) p(\mathcal{D})
\tag{3}
\end{equation}\]

and \(p^*\) is the true posterior:

\[\begin{equation}
p^*(\mathbf{x}) = p(\mathbf{x}|\mathcal{D})
\tag{4}
\end{equation}\]

Equivalent to that formulation (eq. (2)) – for a derivation see (Murphy 2012) – is this, which shows the optimization objective to be an upper bound on the negative log-likelihood (NLL):

\[\begin{equation}
J(q) = D_{KL}(q||p^*) - \log p(\mathcal{D})
\tag{5}
\end{equation}\]
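For readers who would like to see the step from equation (2) to equation (5) spelled out, here is a short sketch in the discrete case. It only uses the factorization in equation (3) and the fact that \(q\) sums to one; the full treatment is in (Murphy 2012).

\[
\begin{aligned}
J(q) = D_{KL}(q||\widetilde{p})
&= \sum\limits_{\mathbf{x}} q(\mathbf{x}) \log\left(\frac{q(\mathbf{x})}{p^*(\mathbf{x})\, p(\mathcal{D})}\right) \\
&= \sum\limits_{\mathbf{x}} q(\mathbf{x}) \log\left(\frac{q(\mathbf{x})}{p^*(\mathbf{x})}\right) - \log p(\mathcal{D}) \sum\limits_{\mathbf{x}} q(\mathbf{x}) \\
&= D_{KL}(q||p^*) - \log p(\mathcal{D})
\end{aligned}
\]

Since \(D_{KL}(q||p^*) \geq 0\), it follows that \(J(q) \geq -\log p(\mathcal{D})\), which is the upper-bound property mentioned above.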

Yet another formulation – again, see (Murphy 2012) for details – is the one we actually use when training (e.g.) VAEs. This one corresponds to the expected NLL plus the KL divergence between the approximation \(q\) and the imposed *prior* \(p\):

\[\begin{equation}
J(q) = D_{KL}(q||p) + E_q[- \log p(\mathcal{D}|\mathbf{x})]
\tag{6}
\end{equation}\]

Negated, this formulation is also called the *ELBO*, for *evidence lower bound*. In the VAE post cited above, the ELBO was written

\[\begin{equation}
ELBO = E[\log p(x|z)] - KL(q(z)||p(z))
\tag{7}
\end{equation}\]

with \(z\) denoting the latent variables (\(q(z)\) being the approximation, \(p(z)\) the prior, often a multivariate normal).
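As an illustration of the KL term in the ELBO, the sketch below assumes a specific and common setup that the post does not spell out: the approximation \(q(z)\) is a Gaussian with diagonal covariance, and the prior \(p(z)\) is a standard normal. Under those assumptions the KL divergence has a well-known closed form. The function name and the log-variance parameterization are choices made here for illustration.

```python
import math


def kl_diag_gaussian_to_std_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Assumes a diagonal-Gaussian approximation and a standard-normal prior.
    Per dimension the closed form is 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


# The approximation equals the prior: no penalty.
same_as_prior = kl_diag_gaussian_to_std_normal([0.0, 0.0], [0.0, 0.0])

# Shifting one latent mean by 1 (unit variance) costs 0.5 nats.
shifted_mean = kl_diag_gaussian_to_std_normal([1.0, 0.0], [0.0, 0.0])

print(same_as_prior, shifted_mean)
```

The further the approximate posterior moves away from the prior, in mean or in variance, the larger this penalty gets, which is the "watchdog" behavior described earlier.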

### Beyond VAEs

Generalizing this “conservative” action pattern of KL divergence beyond VAEs, we can say that it expresses the quality of approximations. An important area where approximation takes place is (lossy) *compression*. KL divergence provides a way to quantify how much information is lost when we compress data.

Summing up, in these and similar applications, KL divergence is “bad” – although we don’t want it to be zero (or else, why bother using the algorithm?), we certainly want to keep it low. So now, let’s see the other side.

## KL divergence, good guy

In a second category of applications, KL divergence is not something to be minimized. In these domains, KL divergence is indicative of surprise, disagreement, exploratory behavior, or learning: This truly is the perspective of *information gain*.

### Surprise

One domain where *surprise*, not information per se, governs behavior is perception. For example, eyetracking studies (e.g., (Itti and Baldi 2005)) showed that surprise, as measured by KL divergence, was a better predictor of visual attention than information, measured by entropy. While these studies seem to have popularized the expression “Bayesian surprise,” this compound is – I think – not the most informative one, as neither part adds much information to the other. In Bayesian updating, the magnitude of the difference between prior and posterior reflects the degree of *surprise* brought about by the data – surprise is an integral part of the concept.

Thus, with KL divergence linked to surprise, and surprise rooted in the fundamental process of Bayesian updating, a process that could be used to describe the course of life itself, KL divergence itself becomes fundamental. We could get tempted to see it everywhere. Accordingly, it has been used in many fields to quantify unidirectional divergence.

For example, (Zanardo 2017) have applied it in trading, measuring how much a person disagrees with the market belief. Higher disagreement then corresponds to higher expected gains from betting against the market.

Closer to the area of deep learning, it is used in intrinsically motivated reinforcement learning (e.g., (Sun, Gomez, and Schmidhuber 2011)), where an optimal policy should maximize the long-term information gain. This is possible because like entropy, KL divergence is additive.
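One concrete sense of additivity, picked here as an illustration rather than taken from the cited paper, is that for independent components the KL divergence between the joint distributions equals the sum of the per-component divergences. A minimal numerical check, with made-up distributions and an illustrative helper name:

```python
import math
from itertools import product


def kl_divergence(p, q):
    """Discrete D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


# Two independent components, each with its own p and q (illustrative values).
p1, q1 = [0.7, 0.3], [0.5, 0.5]
p2, q2 = [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]

# Joint distributions of the independent pair, flattened in the same order.
p_joint = [a * b for a, b in product(p1, p2)]
q_joint = [a * b for a, b in product(q1, q2)]

kl_joint = kl_divergence(p_joint, q_joint)
kl_sum = kl_divergence(p1, q1) + kl_divergence(p2, q2)

print(f"joint: {kl_joint:.6f} bits, sum of parts: {kl_sum:.6f} bits")
```

Both numbers agree up to floating-point error, so information gained about independent things simply adds up.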

Although its asymmetry is relevant whether you use KL divergence for regularization (section 1) or surprise (this section), it becomes especially evident when used for learning and surprise.

### Asymmetry in action

Looking again at the KL formula

\[\begin{equation}
D_{KL}(p||q) = \sum\limits_{x} p(x) \log\left(\frac{p(x)}{q(x)}\right)
\tag{1}
\end{equation}\]

the roles of \(p\) and \(q\) are fundamentally different. For one, the expectation is computed over the first distribution (\(p\) in (1)). This aspect is important because the “order” (the respective roles) of \(p\) and \(q\) may have to be chosen according to tractability (which distribution can we average over).

Secondly, the fraction inside the \(\log\) means that if \(q\) is ever zero at a point where \(p\) isn’t, the KL divergence will “blow up.” What this means for distribution estimation in general is nicely detailed in Murphy (2012). In the context of surprise, it means that if I learn something I used to think had probability zero, I will be “infinitely surprised.”

To avoid infinite surprise, we can make sure our prior probability is never zero. But even then, the interesting thing is that how much information we gain in any one event depends on *how much information I had before*. Let’s see a simple example.

Assume that in my current understanding of the world, black swans probably don’t exist, but they could … maybe 1 percent of swans are black. Put differently, my prior belief of a swan, should I encounter one, being black is \(q = 0.01\).

Now in fact I *do* encounter one, and it’s black.

The information I’ve gained, using base-2 logarithms (so the result is in bits) and the convention \(0 \log 0 = 0\), is:

\[\begin{equation}
l(p,q) = 0 \cdot \log\left(\frac{0}{0.99}\right) + 1 \cdot \log\left(\frac{1}{0.01}\right) = 6.6 \text{ bits}
\tag{8}
\end{equation}\]

Conversely, suppose I’d been much more undecided before; say I’d have thought the odds were 50:50.

On seeing a black swan, I get a lot less information:

\[\begin{equation}
l(p,q) = 0 \cdot \log\left(\frac{0}{0.5}\right) + 1 \cdot \log\left(\frac{1}{0.5}\right) = 1 \text{ bit}
\tag{9}
\end{equation}\]
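The two swan calculations, plus the "infinite surprise" case of a prior that is exactly zero, can be reproduced with a few lines of Python. This is a sketch: the function name and the variable names are illustrative, and base-2 logarithms are assumed so that results come out in bits.

```python
import math


def information_gain_bits(prior_black):
    """D_KL(posterior || prior) in bits after observing one black swan."""
    posterior = [0.0, 1.0]                    # [not black, black]: now certain
    prior = [1.0 - prior_black, prior_black]  # belief before the observation
    gain = 0.0
    for post, pri in zip(posterior, prior):
        if post == 0.0:
            continue         # convention: 0 * log(0 / q) = 0
        if pri == 0.0:
            return math.inf  # "infinite surprise"
        gain += post * math.log2(post / pri)
    return gain


skeptic = information_gain_bits(0.01)   # thought black swans were very unlikely
undecided = information_gain_bits(0.5)  # thought the odds were 50:50
dogmatist = information_gain_bits(0.0)  # thought black swans were impossible

print(round(skeptic, 2), undecided, dogmatist)  # 6.64 1.0 inf
```

The same observation teaches the skeptic far more than the undecided observer, and the dogmatist's surprise has no finite value at all.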

This view of KL divergence, in terms of surprise and learning, is inspiring – it could lead one to seeing it in action everywhere. However, we still have the third and final task to handle: quickly compare KL divergence to other concepts in the area.

### Entropy

It all starts with entropy, or *uncertainty*, or *information*, as formulated by Claude Shannon.

Entropy is the negative average log probability of a distribution:

\[\begin{equation}
H(X) = - \sum\limits_{i=1}^n p(x_i) \log(p(x_i))
\tag{10}
\end{equation}\]

As nicely described in (DeDeo 2016), this formulation was chosen to satisfy four criteria, one of which is what we commonly picture as its “essence,” and one of which is especially interesting.

As to the former, if there are \(n\) possible states, entropy is maximal when all states are equiprobable. E.g., for a coin flip, uncertainty is highest when the probability of heads is 0.5.

The latter has to do with *coarse-graining*, a change in “resolution” of the state space. Say we have 16 possible states, but we don’t really care at that level of detail. We do care about 3 individual states, but all the rest are basically the same to us. Then entropy decomposes additively; total (fine-grained) entropy is the entropy of the coarse-grained space, plus the entropy within the “lumped-together” group, weighted by that group’s total probability.
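The coarse-graining property can be checked numerically for the 16-state example. The sketch below uses made-up, non-uniform probabilities; which 3 states are kept individually, and all variable names, are arbitrary choices for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


# 16 fine-grained states with arbitrary, non-uniform probabilities.
weights = list(range(1, 17))
fine = [w / sum(weights) for w in weights]

# Keep the first 3 states as they are, lump the remaining 13 into one group.
individual, rest = fine[:3], fine[3:]
p_group = sum(rest)
coarse = individual + [p_group]
within_group = [x / p_group for x in rest]  # distribution inside the group

h_fine = entropy(fine)
h_decomposed = entropy(coarse) + p_group * entropy(within_group)

print(f"fine-grained: {h_fine:.6f} bits, decomposed: {h_decomposed:.6f} bits")
print(f"uniform maximum for 16 states: {entropy([1 / 16] * 16):.1f} bits")
```

The two entropy values agree, and both stay below the 4 bits reached when all 16 states are equiprobable.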

Subjectively, entropy reflects our uncertainty whether an event will happen. Interestingly though, it exists in the physical world as well: For example, when ice melts, it becomes more uncertain where individual particles are. As reported by (DeDeo 2016), the number of bits released when one gram of ice melts is about 100 billion terabytes!

As fascinating as it is, information per se may, in many cases, not be the best means of characterizing human behavior. Going back to the eyetracking example, it is completely intuitive that people look at surprising parts of images, not at white noise areas, which are the maximum you could get in terms of entropy.

As a deep learning practitioner, you’ve probably been waiting for the point at which we’d mention *cross entropy* – the most commonly used loss function in classification.

### Cross entropy

The cross entropy between distributions \(p\) and \(q\) is the entropy of \(p\) plus the KL divergence of \(p\) relative to \(q\). If you’ve ever implemented your own classification network, you probably recognize the sum on the very right:

\[\begin{equation}
H(p,q) = H(p) + D_{KL}(p||q) = - \sum p \log(q)
\tag{11}
\end{equation}\]

In information theory-speak, \(H(p,q)\) is the expected message length per datum when \(q\) is assumed but \(p\) is true.

Closer to the world of machine learning, for fixed \(p\), minimizing cross entropy is equivalent to minimizing KL divergence.
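The identity in equation (11) is easy to verify numerically. The following sketch uses made-up distributions and illustrative helper names; it also shows the special case of a one-hot target, as used in classification, where the entropy of \(p\) is zero and cross entropy and KL divergence coincide.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(a * math.log2(a) for a in p if a > 0)


def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p log q in bits; assumes q > 0 wherever p > 0."""
    return -sum(a * math.log2(b) for a, b in zip(p, q) if a > 0)


# General case: cross entropy = entropy + KL divergence.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)

# Classification case: one-hot target, predicted class probabilities.
target = [0.0, 1.0, 0.0]
predicted = [0.2, 0.7, 0.1]
ce_onehot = cross_entropy(target, predicted)
kl_onehot = kl_divergence(target, predicted)

print(f"{lhs:.6f} {rhs:.6f}")
print(f"{ce_onehot:.6f} {kl_onehot:.6f}")
```

Because \(H(p)\) does not depend on \(q\), the two objectives differ only by a constant when \(p\) is fixed, so they share the same minimizer.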

### Mutual information

Another extremely important quantity, used in many contexts and applications, is *mutual information*. Again citing DeDeo, “you can think of it as the most general form of correlation coefficient that you can measure.”

With two variables \(X\) and \(Y\), we can ask: How much uncertainty about \(X\) remains once we learn an individual \(y\), \(Y=y\)? Averaged over all \(y\), this is the *conditional entropy*:

\[\begin{equation}
H(X|Y) = \sum\limits_{i} P(y_i) H(X|y_i)
\tag{12}
\end{equation}\]

Now mutual information is entropy minus conditional entropy:

\[\begin{equation}
I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
\tag{13}
\end{equation}\]

This quantity – as required for a measure representing something like correlation – is symmetric: If two variables \(X\) and \(Y\) are related, the amount of information \(X\) gives you about \(Y\) is equal to that \(Y\) gives you about \(X\).
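To see the symmetry of equation (13) at work, and to connect mutual information back to KL divergence, here is a small sketch on a made-up 2x2 joint distribution. The conditional entropy is computed as the probability-weighted average of the entropies of \(X\) given each \(y\). The last line uses a standard identity not stated in the post: mutual information equals the KL divergence of the joint distribution relative to the product of its marginals. All names and numbers are illustrative.

```python
import math


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(a * math.log2(a) for a in p if a > 0)


def conditional_entropy(table):
    """H(rows | columns): average over columns of the entropy within a column."""
    total = 0.0
    for column in zip(*table):
        p_col = sum(column)
        if p_col > 0:
            total += p_col * entropy([v / p_col for v in column])
    return total


# Joint distribution P(X = i, Y = j); rows index X, columns index Y.
joint = [[0.3, 0.1],
         [0.2, 0.4]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]
transposed = [list(col) for col in zip(*joint)]

mi_via_x = entropy(p_x) - conditional_entropy(joint)       # H(X) - H(X|Y)
mi_via_y = entropy(p_y) - conditional_entropy(transposed)  # H(Y) - H(Y|X)

# KL divergence of the joint relative to the product of the marginals.
mi_via_kl = sum(
    joint[i][j] * math.log2(joint[i][j] / (p_x[i] * p_y[j]))
    for i in range(2) for j in range(2) if joint[i][j] > 0
)

print(f"{mi_via_x:.6f} {mi_via_y:.6f} {mi_via_kl:.6f}")
```

All three routes give the same number, and it would be zero if \(X\) and \(Y\) were independent.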

KL divergence is part of a family of divergences, called *f-divergences*, used to measure directed difference between probability distributions. Let’s also quickly look at another information-theoretic measure that, unlike those, is a *distance*.

### Jensen-Shannon distance

In math, a *distance*, or *metric*, besides being non-negative has to satisfy two other criteria: It must be symmetric, and it must obey the triangle inequality.

Both criteria are met by the *Jensen-Shannon distance*. With \(m\) a mixture distribution:

\[\begin{equation}
m_i = \frac{1}{2}(p_i + q_i)
\tag{14}
\end{equation}\]

the Jensen-Shannon divergence is an average of KL divergences, one of \(p\) relative to \(m\), the other of \(q\) relative to \(m\):

\[\begin{equation}
JSD = \frac{1}{2}(KL(p||m) + KL(q||m))
\tag{15}
\end{equation}\]

The Jensen-Shannon distance proper is the square root of this divergence.

This would be an ideal candidate to use were we interested in (undirected) distance between, not directed surprise caused by, distributions.
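A minimal sketch of this measure follows, using the common definition in which the Jensen-Shannon divergence averages the KL divergences of \(p\) and of \(q\) relative to the mixture \(m\), and the distance is its square root. Function names and example distributions are illustrative.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Average KL divergence of p and q relative to their mixture m."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Square root of the JS divergence; this is the quantity that is a metric."""
    return math.sqrt(max(0.0, js_divergence(p, q)))


p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]
r = [0.1, 0.1, 0.8]

d_pq, d_qp = js_distance(p, q), js_distance(q, p)
d_pr, d_qr = js_distance(p, r), js_distance(q, r)

print(f"d(p,q) = {d_pq:.4f}, d(q,p) = {d_qp:.4f}")
print(f"d(p,r) = {d_pr:.4f} <= d(p,q) + d(q,r) = {d_pq + d_qr:.4f}")
```

Since the mixture \(m\) is positive wherever \(p\) or \(q\) is, this measure never "blows up"; with base-2 logarithms the divergence is at most 1 bit.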

Finally, let’s wrap up with a last term, restricting ourselves to a quick glimpse at something whole books could be written about.

### (Variational) Free Energy

Reading papers on variational inference, you’re pretty likely to hear people talking not “just” about KL divergence and/or the *ELBO* (which as soon as you know what it stands for, is just what it is), but also, something mysteriously called *free energy* (or: *variational free energy*, in that context).

For practical purposes, it suffices to know that *variational free energy* is the negative of the ELBO, that is, corresponds to equation (2). But for those interested, there is *free energy* as a central concept in thermodynamics.

In this post, we’re mainly interested in how concepts are related to KL divergence, and for this, we follow the characterization John Baez gives in his aforementioned talk.

*Free* energy, that is, energy in useful form, is the expected energy minus temperature times entropy:

\[\begin{equation}
F = \langle E \rangle - T H
\tag{16}
\end{equation}\]

Then, the extra free energy of a system \(Q\) – compared to a system in equilibrium \(P\) – is proportional to their KL divergence, that is, the information of \(Q\) relative to \(P\):

\[\begin{equation}
F(Q) - F(P) = k T \, KL(q||p)
\tag{17}
\end{equation}\]
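Here is a sketch of why equation (17) holds, under assumptions that are not spelled out in the post: the equilibrium system \(P\) follows a Boltzmann distribution over states with energies \(E_i\), entropy is measured in thermodynamic units with natural logarithms, and \(Z\) denotes the normalizing constant.

\[
\begin{aligned}
p_i &= \frac{e^{-E_i / kT}}{Z}, \qquad H(q) = -k \sum\limits_i q_i \ln q_i \\
F(P) &= \sum\limits_i p_i E_i + kT \sum\limits_i p_i \ln p_i = -kT \ln Z \\
kT \, KL(q||p) &= kT \sum\limits_i q_i \ln q_i - kT \sum\limits_i q_i \ln p_i \\
&= -T H(q) + \sum\limits_i q_i E_i + kT \ln Z \\
&= F(Q) - F(P)
\end{aligned}
\]

Because KL divergence is non-negative, this also says that no distribution \(q\) can have lower free energy than the equilibrium one.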

Speaking of free energy, there’s also the – not uncontroversial – free energy principle posited in neuroscience. But at some point, we have to stop, and we do it here.

## Conclusion

Wrapping up, this post has tried to do three things: Having in mind a reader with background mainly in deep learning, start with the “habitual” use in training variational autoencoders; then show the – probably less familiar – “other side”; and finally, provide a synopsis of related terms and their applications.

If you’re interested in digging deeper into the many various applications, in a range of different fields, there is no better place to start than the Twitter thread, mentioned above, that gave rise to this post. Thanks for reading!

DeDeo, Simon. 2016. “Information Theory for Intelligent People.”

Friston, Karl. 2010. “The Free-Energy Principle: A Unified Brain Theory?” *Nature Reviews. Neuroscience* 11 (February): 127–38. https://doi.org/10.1038/nrn2787.

Itti, Laurent, and Pierre Baldi. 2005. “Bayesian Surprise Attracts Human Attention.” In *Advances in Neural Information Processing Systems 18 [Neural Information Processing Systems, NIPS 2005, December 5-8, 2005, Vancouver, British Columbia, Canada]*, 547–54. http://papers.nips.cc/paper/2822-bayesian-surprise-attracts-human-attention.

Murphy, Kevin. 2012. *Machine Learning: A Probabilistic Perspective*. MIT Press.

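Sun, Yi, Faustino Gomez, and Jürgen Schmidhuber. 2011. “Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments.” *CoRR* abs/1103.5708. http://arxiv.org/abs/1103.5708.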

Zanardo, Enrico. 2017. “HOW TO MEASURE DISAGREEMENT ?” In.