What Is Jensen–Shannon Divergence?
Jensen–Shannon divergence (JSD) is a symmetrized and smoothed version of Kullback–Leibler divergence (KLD). Unlike KLD, JSD is symmetric (JSD(P||Q) = JSD(Q||P)) and always finite. Its square root, known as the Jensen–Shannon distance, is a true metric. JSD is defined as:
$$JSD(P \parallel Q) = \frac{1}{2}D_{KL}(P \parallel M) + \frac{1}{2}D_{KL}(Q \parallel M)$$
where $M = \frac{1}{2}(P + Q)$ is the mixture distribution.
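With a base-2 logarithm, JSD is bounded between 0 and 1 (between 0 and ln 2 with the natural logarithm). In base 2 it is also bounded above by the total variation distance:
$$JSD(P \parallel Q) \leq TV(P, Q) = \frac{1}{2}\lVert P - Q \rVert_1$$
where TV is the total variation distance, that is, half the L1 distance between P and Q.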
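To make the finiteness and the base-2 bound concrete, here is a minimal sketch that compares KLD and JSD on two distributions with disjoint support. The helper names `kl_bits` and `jsd_bits` are illustrative and not part of any library.

```python
import numpy as np

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; inf if q is 0 somewhere p is positive."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p == 0 contribute 0 by convention
    if np.any(q[mask] == 0):
        return float("inf")
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd_bits(p, q):
    """JSD in bits, computed directly from the definition with M = (P + Q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# Disjoint support: KLD blows up, JSD reaches its maximum of 1 bit
P = [1.0, 0.0]
Q = [0.0, 1.0]
print(kl_bits(P, Q))   # inf
print(jsd_bits(P, Q))  # 1.0
```

Because the mixture M is positive wherever P or Q is, both KL terms inside JSD stay finite even when P and Q share no support.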
Why Developers Should Care
JSD is used in:
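- Generative Adversarial Networks (GANs): In the original GAN paper (Goodfellow et al., 2014), the generator's objective under an optimal discriminator equals 2·JSD minus the constant log 4, so training implicitly minimizes JSD between the data and model distributions.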
- Natural Language Processing: Measuring topic similarity, document clustering, and word sense disambiguation.
- Bioinformatics: Comparing genome sequences and protein structures.
- Quantum Information: Quantum JSD (QJSD) measures distinguishability of quantum states.
Code Example: Computing JSD in Python
Using SciPy's scipy.spatial.distance.jensenshannon:
```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Two discrete probability distributions
P = np.array([0.2, 0.3, 0.5])
Q = np.array([0.4, 0.1, 0.5])

# Compute Jensen-Shannon distance (square root of divergence)
js_dist = jensenshannon(P, Q, base=2)
print(f"Jensen-Shannon distance: {js_dist:.4f}")  # Output: 0.2495

# The divergence is the square of the distance
js_div = js_dist ** 2
print(f"Jensen-Shannon divergence: {js_div:.4f}")  # Output: 0.0623
```
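For a custom implementation:

```python
import numpy as np
from scipy.stats import entropy

def js_divergence(P, Q, base=2):
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    M = (P + Q) / 2  # mixture distribution
    # entropy(p, q) returns the KL divergence D_KL(p || q)
    return (entropy(P, M, base=base) + entropy(Q, M, base=base)) / 2
```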
Properties and Variants
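- Geometric JSD: A variant that uses the geometric mean instead of the arithmetic mean for the mixture, which yields a closed form for Gaussians (Nielsen, 2019).
- Quantum JSD: Defined for density matrices using von Neumann entropy in place of Shannon entropy. It coincides with the Holevo information of an equal-weight ensemble of the two states.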
- Generalized JSD: For more than two distributions with weights:
$$JSD_{\pi_1,\ldots,\pi_n}(P_1,\ldots,P_n) = H\left(\sum_{i=1}^n \pi_i P_i\right) - \sum_{i=1}^n \pi_i H(P_i)$$
where $H$ is Shannon entropy and the weights $\pi_i$ are non-negative and sum to 1.
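The generalized formula translates almost directly into code. The sketch below is one possible implementation; the function name `generalized_jsd` and its defaults (uniform weights, base 2) are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import entropy

def generalized_jsd(dists, weights=None, base=2):
    """H(sum_i pi_i P_i) - sum_i pi_i H(P_i) for the rows P_i of `dists`.

    Assumes each row is a probability vector and that `weights`
    (if given) are non-negative and sum to 1.
    """
    dists = np.asarray(dists, dtype=float)  # shape (n, k)
    n = dists.shape[0]
    if weights is None:
        weights = np.full(n, 1.0 / n)       # default: uniform weights
    weights = np.asarray(weights, dtype=float)

    mixture = weights @ dists               # weighted mixture distribution
    h_mixture = entropy(mixture, base=base)
    h_each = np.array([entropy(p, base=base) for p in dists])
    return float(h_mixture - weights @ h_each)

# Two distributions with equal weights reduce to the ordinary JSD
print(generalized_jsd([[0.2, 0.3, 0.5], [0.4, 0.1, 0.5]]))

# Three point masses on different outcomes reach the maximum, log2(3) bits
print(generalized_jsd(np.eye(3)))
```

With n distributions and uniform weights the value is bounded by log n in the chosen base, which is why the second example returns log2(3).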
Relation to Mutual Information
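JSD equals the mutual information I(X; Z) between a sample X drawn from the mixture M and a binary indicator variable Z that records whether X came from P or Q, with each source chosen with probability 1/2. Because Z carries at most one bit of information, this interpretation also explains why JSD never exceeds 1 in base 2.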
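This identity is easy to verify numerically. The following sketch builds the joint distribution of (Z, X) for two example distributions (the values of P and Q are arbitrary) and compares the resulting mutual information with SciPy's result.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

P = np.array([0.2, 0.3, 0.5])
Q = np.array([0.4, 0.1, 0.5])

# joint[z, x] = Pr(Z = z, X = x), with Z a fair coin choosing P (z=0) or Q (z=1)
joint = 0.5 * np.vstack([P, Q])
p_z = joint.sum(axis=1, keepdims=True)  # marginal of Z: [0.5, 0.5]
p_x = joint.sum(axis=0, keepdims=True)  # marginal of X: the mixture M
product = p_z * p_x                     # independent reference distribution

# Mutual information I(X; Z) in bits, skipping zero-probability cells
nonzero = joint > 0
mi_bits = float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / product[nonzero])))

# SciPy returns the distance, so square it to get the divergence
jsd_bits = float(jensenshannon(P, Q, base=2) ** 2)

print(np.isclose(mi_bits, jsd_bits))  # True
```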
Applications in Machine Learning
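- GANs: With an optimal discriminator, the GAN objective measures JSD between the real and generated data distributions. In practice the discriminator is never exactly optimal, so the loss only approximates JSD. In addition, JSD saturates at its maximum when the two distributions barely overlap, which leaves the generator with little gradient signal; this is why the original paper suggests a non-saturating generator loss.
- Clustering: The JSD centroid minimizes the average JSD to a set of distributions. It has no closed form and can be computed iteratively with the Concave-Convex Procedure (CCCP).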
- Feature Selection: JSD can rank features by how much they separate classes.
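The link between the GAN objective and JSD can be checked on a toy discrete example. The sketch below assumes the optimal discriminator from Goodfellow et al. (2014), D*(x) = p_data(x) / (p_data(x) + p_gen(x)), and uses arbitrary example distributions; it is a numerical sanity check and not a training procedure.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Toy "data" and "generator" distributions over three outcomes (illustrative values)
p_data = np.array([0.2, 0.3, 0.5])
p_gen = np.array([0.4, 0.1, 0.5])

# Optimal discriminator for fixed p_data and p_gen
d_star = p_data / (p_data + p_gen)

# GAN value function evaluated at the optimal discriminator (natural log)
value = np.sum(p_data * np.log(d_star)) + np.sum(p_gen * np.log(1.0 - d_star))

# JSD in nats: jensenshannon uses the natural log by default and returns the distance
jsd_nats = jensenshannon(p_data, p_gen) ** 2

# The value function equals -log(4) + 2 * JSD
print(np.isclose(value, -np.log(4) + 2 * jsd_nats))  # True
```

When the generator matches the data exactly, JSD is 0 and the value function reaches its minimum of -log 4.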
Benchmarks and Tools
- Ruby: a gem for JSD calculation is available.
- Python: SciPy's jensenshannon function.
- R: the statcomp library includes JSD.
- THOTH: Python package for efficient information-theoretic estimation.
Limitations
- JSD is not a metric in its divergence form; only its square root is a metric.
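- For continuous distributions, JSD generally has no closed form (even between two Gaussians), so it is usually estimated numerically, for example by Monte Carlo sampling.
- The upper bound (1 in base 2, ln 2 in nats) holds for continuous distributions as well. As a consequence, JSD saturates at that maximum whenever the two distributions have disjoint support, no matter how far apart they are.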
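For continuous distributions whose densities can be evaluated and sampled, a simple Monte Carlo estimator averages the log-density ratios from the definition. The sketch below assumes frozen `scipy.stats` distributions; the function name `jsd_monte_carlo`, the sample size, and the seed are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def jsd_monte_carlo(p, q, n_samples=100_000, seed=0):
    """Estimate JSD(p || q) in bits for two frozen scipy.stats distributions."""
    rng = np.random.default_rng(seed)
    x_p = p.rvs(size=n_samples, random_state=rng)  # samples from p
    x_q = q.rvs(size=n_samples, random_state=rng)  # samples from q

    def log_m(x):
        # Log-density of the mixture M = (p + q) / 2, computed stably
        return np.logaddexp(p.logpdf(x), q.logpdf(x)) - np.log(2.0)

    kl_pm = np.mean(p.logpdf(x_p) - log_m(x_p))    # estimates D_KL(p || M) in nats
    kl_qm = np.mean(q.logpdf(x_q) - log_m(x_q))    # estimates D_KL(q || M) in nats
    return float(0.5 * (kl_pm + kl_qm) / np.log(2.0))  # convert nats to bits

estimate = jsd_monte_carlo(norm(0, 1), norm(1, 1))
print(f"Estimated JSD: {estimate:.3f} bits")
```

The estimate carries sampling noise that shrinks roughly with the square root of `n_samples`, so small differences between two estimates should be interpreted with care.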
Conclusion
Jensen–Shannon divergence is a robust tool for comparing probability distributions, essential in modern ML and data science. Its symmetry and boundedness make it preferable to KLD in many contexts. Start using it today with SciPy's one-liner.
References
- Lin, J. (1991). "Divergence measures based on the Shannon entropy". IEEE Trans. Info. Theory.
- Goodfellow et al. (2014). "Generative Adversarial Nets".
- Nielsen, F. (2019). "On the Jensen-Shannon symmetrization". Entropy.
- SciPy documentation: scipy.spatial.distance.jensenshannon.



