Simulated Annealing: From Physics to Optimization
— blog, optimization, algorithms, mathematics, physics — 13 min read
Introduction
The idea of Simulated Annealing (SA) has always fascinated me. The concept behind it, borrowed from a physical process known for perhaps millennia, is quite simple: cool slowly, and the system settles into a low-energy, near-optimal state. Unfortunately, I never had the chance to delve into the mathematics or physics behind this algorithm.
I want this blog post to give a glimpse of the math and physics that support the intuition. I also prepared a project post that applies the algorithm to the Ising model: a classical but fascinating application.
But why does this work? What guarantees do we have? This article explores the mathematical foundations of Simulated Annealing, showing how it emerges from fundamental principles of statistical mechanics and how theoretical results guarantee its convergence under certain conditions.
We will start by understanding the Boltzmann distribution, which inspired SA. Then we will take a deep dive into the Metropolis algorithm, the core of SA. Finally, we will explore the theorem that guarantees the algorithm's convergence.
The Boltzmann Distribution
Let's start from the building blocks. At the heart of Simulated Annealing lies the Boltzmann distribution, which describes the probability (denoted as $P$) of finding a physical system in a state with energy $E$ at temperature $T$:
$$P(E) \propto e^{-E/T}$$

This distribution is fundamental to both statistical mechanics and optimization. What does it tell us?
- High temperature: All states are roughly equally likely → system explores freely
- Low temperature: Low-energy states are much more likely → system concentrates near minima
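To make this concrete, here is a minimal Python sketch (the state energies are hypothetical, chosen purely for illustration) that normalizes Boltzmann weights at a high and a low temperature:

```python
import numpy as np

def boltzmann_probs(energies, T):
    """Normalized Boltzmann probabilities: P(E) ∝ exp(-E/T)."""
    weights = np.exp(-np.asarray(energies) / T)
    return weights / weights.sum()

energies = [0.0, 1.0, 2.0, 5.0]  # hypothetical state energies

print(boltzmann_probs(energies, T=100.0))  # high T: nearly uniform
print(boltzmann_probs(energies, T=0.1))    # low T: almost all mass on E = 0
```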
But why is this the correct distribution? There are several ways to derive it, each providing different insights. We'll focus on the Detailed Balance at Equilibrium Approach, which directly connects to how the Metropolis algorithm works.
Detailed Balance at Equilibrium Approach (Deep Dive)
At thermal equilibrium, a system must satisfy detailed balance. This means that the rate of transitions from state $i$ to $j$ equals the rate from $j$ to $i$:
$$P(i) \cdot w(i \to j) = P(j) \cdot w(j \to i)$$

where $w(i \to j)$ is the transition rate from state $i$ to state $j$.
For thermal systems, transition rates typically satisfy Arrhenius-like behavior:
$$\frac{w(i \to j)}{w(j \to i)} = e^{-(E_j - E_i)/T}$$

Wait, what is Arrhenius-like behavior?
The Arrhenius equation (named after Svante Arrhenius, 1889) describes how reaction/transition rates depend on temperature and energy barriers. It's both theoretically motivated and experimentally validated.
The original Arrhenius equation for chemical reactions is:
$$k = A e^{-E_a / (RT)}$$

where $k$ is the reaction rate constant, $A$ is a pre-exponential factor, $E_a$ is the activation energy (energy barrier), $R$ is the gas constant, and $T$ is temperature.
Physical Interpretation:
- To transition from state $i$ to state $j$, the system must overcome an energy barrier
- The probability of having enough thermal energy to overcome the barrier is proportional to $e^{-\Delta E/T}$
- Higher temperature → more thermal energy → higher transition rate
- Larger energy barrier → less likely to overcome → lower transition rate
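As a quick numerical illustration (hypothetical numbers, in units where the Boltzmann constant is 1, matching the rest of the post), the same barrier that barely matters at high temperature nearly freezes transitions at low temperature:

```python
import math

def rate(delta_E, T, A=1.0):
    """Arrhenius-like rate: A * exp(-ΔE/T), in units where k_B = 1."""
    return A * math.exp(-delta_E / T)

barrier = 2.0                 # hypothetical energy barrier
print(rate(barrier, T=10.0))  # ≈ 0.82: high T, barrier barely matters
print(rate(barrier, T=0.5))   # ≈ 0.018: low T, transitions are rare
```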
The Result: From detailed balance and Arrhenius-like behavior, we get:
$$P(i) \cdot w(i \to j) = P(j) \cdot w(j \to i)$$

Substituting the Arrhenius relation $w(j \to i) = w(i \to j) \cdot e^{(E_j - E_i)/T}$:

$$P(i) \cdot w(i \to j) = P(j) \cdot w(i \to j) \cdot e^{(E_j - E_i)/T}$$

Rearranging and simplifying:
$$\frac{P(i)}{P(j)} = e^{-(E_i - E_j)/T}$$

This immediately leads to the Boltzmann distribution!
$$P(i) \propto e^{-E_i/T}$$

The Metropolis Algorithm: Foundation of Simulated Annealing
Why Do We Need It?
Before diving into the mathematics, let's understand the problem the Metropolis algorithm solves.
The Problem: Two Naive Approaches Fail
Approach 1: Greedy Optimization (Always Go Downhill)
- Rule: Only accept moves that decrease energy
- Problem: Gets stuck in local minima! Once you're in a valley, you can never escape to find a deeper valley.
Approach 2: Random Walk (Accept Everything)
- Rule: Accept all proposed moves
- Problem: Never converges! You just wander around randomly, never settling into low-energy states.
The Solution: Probabilistic Acceptance
The Metropolis algorithm solves this by probabilistically accepting uphill moves. The key insight:
At high temperature, we explore widely. At low temperature, we exploit what we've found.
This balance between exploration and exploitation is what makes SA powerful.
The Metropolis Algorithm: How It Works
Now that we understand the Boltzmann distribution, we can see how the Metropolis algorithm samples from it.
The algorithm works by constructing a Markov chain: a sequence of states where each new state depends only on the current one. Under the right conditions, this chain converges to the Boltzmann distribution (its stationary distribution), which is exactly what we want.
The Metropolis algorithm has two key components:
1. Proposal Mechanism: $q(x'|x)$
The proposal probability $q(x'|x)$ is the probability of proposing state $x'$ when we're currently in state $x$. This is how we explore the state space.
Key Properties:
- Symmetry: $q(x'|x) = q(x|x')$ (equal probability to propose $x'$ from $x$ as $x$ from $x'$)
- Normalization: $\sum_{x'} q(x'|x) = 1$ (must define a valid probability distribution)
- Accessibility: For any two states, there should be a path of proposals connecting them (ensures ergodicity)
Examples of symmetric proposals:
- Gaussian random walk: $x' = x + \mathcal{N}(0, \sigma^2)$ → symmetric because the Gaussian is symmetric (see the sketch below)
- Uniform random move: $x' = x + \text{Uniform}(-\delta, \delta)$ → symmetric
- Swap two elements: In the traveling salesman problem, swapping cities $i$ and $j$ → symmetric (swapping back is equally likely)
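The first two proposals are easy to write down; here is a minimal sketch (the function names are my own, not from any library):

```python
import random

def gaussian_proposal(x, sigma=1.0):
    """Gaussian random walk: symmetric because N(0, σ²) is symmetric about 0."""
    return x + random.gauss(0.0, sigma)

def uniform_proposal(x, delta=0.5):
    """Uniform random move: the step distribution is centered at 0."""
    return x + random.uniform(-delta, delta)
```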
2. Acceptance Rule: $A(x \to x') = \min(1, e^{-\Delta E/T})$
The acceptance probability determines whether we accept the proposed move:
$$A(x \to x') = \begin{cases} 1 & \text{if } \Delta E \leq 0 \text{ (downhill move)} \\ e^{-\Delta E/T} & \text{if } \Delta E > 0 \text{ (uphill move)} \end{cases}$$

where $\Delta E = E(x') - E(x)$ is the energy change.
Intuition:
- Downhill moves ($\Delta E < 0$): Always accept (we want to go to lower energy)
- Uphill moves ($\Delta E > 0$): Accept with probability $e^{-\Delta E/T}$
  - Small energy increase → high acceptance probability → allows exploration
  - Large energy increase → low acceptance probability → avoids bad moves
  - As $T$ decreases, uphill moves become less likely → transitions to exploitation
Temperature's Role:
- High $T$: $e^{-\Delta E/T} \approx 1$ for most moves → high acceptance → exploration
- Low $T$: $e^{-\Delta E/T} \approx 0$ for uphill moves → low acceptance → exploitation
- $T \to 0$: Only downhill moves accepted → pure greedy optimization (see the sketch below)
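Putting the proposal and acceptance rule together, here is a minimal sketch of one Metropolis step, applied to a hypothetical double-well energy (the function names and the test problem are placeholders of my own, not from the original references):

```python
import math
import random

def metropolis_step(x, energy, propose, T):
    """One Metropolis update: propose x', accept with probability min(1, exp(-ΔE/T))."""
    x_new = propose(x)
    delta_E = energy(x_new) - energy(x)
    if delta_E <= 0 or random.random() < math.exp(-delta_E / T):
        return x_new  # accept: always for downhill, probabilistically for uphill
    return x          # reject: stay at the current state

# Example: sample the double-well E(x) = (x² - 1)² at a fixed temperature
energy = lambda x: (x * x - 1.0) ** 2
propose = lambda x: x + random.gauss(0.0, 0.5)

x = 3.0
for _ in range(10_000):
    x = metropolis_step(x, energy, propose, T=0.2)
print(x)  # typically ends up near one of the minima x = ±1
```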
Detailed Balance: The Mathematical Guarantee
For the Metropolis algorithm to converge to the Boltzmann distribution, it must satisfy detailed balance:
$$\pi(x) P(x \to x') = \pi(x') P(x' \to x)$$

where:
- $\pi(x) = \frac{1}{Z} e^{-E(x)/T}$ is the target distribution (Boltzmann)
- $P(x \to x') = q(x'|x) \cdot A(x \to x')$ is the transition probability
Why Detailed Balance Matters: If detailed balance holds, then $\pi(x)$ is automatically the stationary distribution. This means the Markov chain will eventually converge to sampling from the Boltzmann distribution, regardless of the starting state.
Proof Sketch
We need to show that the Metropolis algorithm satisfies detailed balance. The proof considers two cases:
Case 1: Downhill Move ($\Delta E \leq 0$, i.e., $E(x') \leq E(x)$)
When moving to a lower (or equal) energy state:
- $A(x \to x') = 1$ (always accept downhill moves)
- $A(x' \to x) = \min(1, e^{-(E(x) - E(x'))/T}) = \min(1, e^{+\Delta E/T})$
Since $\Delta E \leq 0$, we have $e^{+\Delta E/T} \leq 1$, so:
$A(x' \to x) = e^{+\Delta E/T} = e^{-(E(x) - E(x'))/T}$
Checking detailed balance:
Left side: $\pi(x) P(x \to x') = \pi(x) \cdot q(x'|x) \cdot 1 = \pi(x) q(x'|x)$
Right side: $\pi(x') P(x' \to x) = \pi(x') \cdot q(x|x') \cdot e^{+\Delta E/T}$
Since $q(x'|x) = q(x|x')$ (symmetry) and $\pi(x') = \pi(x) e^{-\Delta E/T}$ (from Boltzmann distribution), we get:
Right side: $\pi(x) e^{-\Delta E/T} \cdot q(x'|x) \cdot e^{+\Delta E/T} = \pi(x) q(x'|x)$
Therefore: $\pi(x) P(x \to x') = \pi(x') P(x' \to x)$ ✔
Case 2: Uphill Move ($\Delta E > 0$, i.e., $E(x') > E(x)$)
When moving to a higher energy state:
- $A(x \to x') = e^{-\Delta E/T}$
- $A(x' \to x) = 1$ (moving back is downhill)
Checking detailed balance:
Left side: $\pi(x) P(x \to x') = \pi(x) \cdot q(x'|x) \cdot e^{-\Delta E/T}$
Right side: $\pi(x') P(x' \to x) = \pi(x') \cdot q(x|x') \cdot 1 = \pi(x') q(x'|x)$
Since $\pi(x') = \pi(x) e^{-\Delta E/T}$ and $q(x'|x) = q(x|x')$, we get:
Right side: $\pi(x) e^{-\Delta E/T} \cdot q(x'|x) = \pi(x) q(x'|x) e^{-\Delta E/T}$
Therefore: $\pi(x) P(x \to x') = \pi(x') P(x' \to x)$ ✔
Conclusion: The Metropolis algorithm satisfies detailed balance for both downhill and uphill moves, ensuring convergence to the Boltzmann distribution!
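We can also check detailed balance numerically on a toy system (three states with hypothetical energies and uniform symmetric proposals; this example is my own, not from the references):

```python
import itertools
import math

energies = {0: 0.0, 1: 1.0, 2: 2.5}  # hypothetical three-state system
T = 1.0
q = 0.5  # propose either of the two other states with equal probability

Z = sum(math.exp(-E / T) for E in energies.values())

def pi(i):
    """Boltzmann probability of state i."""
    return math.exp(-energies[i] / T) / Z

def P(i, j):
    """Metropolis transition probability i -> j (for i != j)."""
    A = min(1.0, math.exp(-(energies[j] - energies[i]) / T))
    return q * A

for i, j in itertools.permutations(energies, 2):
    assert math.isclose(pi(i) * P(i, j), pi(j) * P(j, i)), (i, j)
print("Detailed balance holds for every pair of states.")
```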
Simulated Annealing: From Fixed Temperature to Optimization
The Key Idea
The Metropolis algorithm at fixed temperature samples from the Boltzmann distribution at that temperature. This is useful for understanding equilibrium properties in physics, but not for optimization.
Simulated Annealing makes a crucial modification: decrease the temperature over time. This transforms the algorithm from a sampling method into an optimization method.
SA as Non-Stationary Markov Chain: Since the temperature changes with iteration $k$, the acceptance probabilities change over time. This makes SA a time-inhomogeneous (non-stationary) Markov process, unlike fixed-temperature Metropolis, which is time-homogeneous.
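In code, the change is tiny: the Metropolis loop stays the same, but $T$ now shrinks every iteration. Here is a minimal sketch (with a geometric cooling rule standing in for the schedule; schedules are discussed below, and all names and the test problem are my own):

```python
import math
import random

def simulated_annealing(energy, propose, x0, T0=10.0, alpha=0.999, n_steps=50_000):
    """Metropolis steps with a temperature that decreases over time."""
    x, T = x0, T0
    best, best_E = x0, energy(x0)
    for _ in range(n_steps):
        x_new = propose(x)
        delta_E = energy(x_new) - energy(x)
        if delta_E <= 0 or random.random() < math.exp(-delta_E / T):
            x = x_new
            if energy(x) < best_E:
                best, best_E = x, energy(x)
        T *= alpha  # the only change from fixed-temperature Metropolis
    return best

# Hypothetical test problem: a 1-D energy with many local minima
energy = lambda x: 0.1 * x * x + math.sin(5.0 * x)
propose = lambda x: x + random.gauss(0.0, 0.5)
print(simulated_annealing(energy, propose, x0=8.0))
```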
The Convergence Theorem
There's a beautiful theoretical result about Simulated Annealing, first rigorously proved by Hajek (1988):
If we cool infinitely slowly (i.e., $T(k) \to 0$ as $k \to \infty$ in a specific way), SA is guaranteed to find the global minimum with probability 1.
Formal Statement (Hajek's Theorem, 1988)
Under certain conditions on the energy landscape (the function that we want to optimize) and proposal mechanism, if the temperature schedule satisfies:
$$T(k) \geq \frac{C}{\log(k+1)}$$

for a sufficiently large constant $C$ (depending on the energy landscape), then:
$$\lim_{k \to \infty} P(\text{SA finds global minimum}) = 1$$

Key conditions:
- Irreducibility: Every state is reachable from every other state (via a sequence of proposals)
- Aperiodicity: The Markov chain doesn't get stuck in cycles
- Connectivity: The energy landscape has a "path" from any state to the global minimum
- Bounded energy differences: Energy changes $\Delta E$ are bounded (or have bounded variance)
The constant $C$ must satisfy $C \geq \Delta E_{\max}$, where $\Delta E_{\max}$ is the largest energy barrier in the landscape (Hajek's precise condition is that $C$ be at least the depth of the deepest local minimum that is not global).
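To get a feel for how slowly this schedule cools, here is a short sketch (the value of $C$ is an arbitrary choice of mine) that evaluates $T(k) = C/\log(k+1)$ and inverts it to see how many iterations a target temperature costs:

```python
import math

C = 10.0  # hypothetical landscape-dependent constant

def T_log(k):
    """Hajek's logarithmic schedule: T(k) = C / log(k + 1)."""
    return C / math.log(k + 1)

for k in (10, 10**3, 10**6, 10**9):
    print(k, T_log(k))  # temperature decays painfully slowly with k

def k_needed(T):
    """Invert the schedule: iterations needed to cool down to T."""
    return math.exp(C / T) - 1

print(k_needed(1.0))  # ≈ 2.2e4 iterations
print(k_needed(0.1))  # ≈ 2.7e43 iterations: infeasible in practice
```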
Why Does This Work? (Proof Sketch)
The proof relies on understanding mixing times and escape probabilities. Here's the intuitive argument:
The Core Idea: Escape from Local Minima
The fundamental challenge in optimization is escaping from local minima. At temperature $T$, the probability of escaping a local minimum with energy barrier $\Delta E$ is approximately:
$$P_{\text{escape}}(T) \propto e^{-\Delta E/T}$$

Key insight: As $T \to 0$, this probability goes to zero exponentially fast. If we cool too fast, we'll get trapped before we can escape!
The Mixing Time Argument
At each temperature $T$, the system needs time to reach equilibrium (the Boltzmann distribution). This time is called the mixing time $\tau_{\text{mix}}(T)$.
For many systems, the mixing time scales as:
$$\tau_{\text{mix}}(T) \propto e^{\Delta E_{\max}/T}$$

where $\Delta E_{\max}$ is the largest energy barrier in the landscape.
Why exponential? To escape the highest barrier, we need to wait long enough for a rare thermal fluctuation. The probability of such a fluctuation is $e^{-\Delta E_{\max}/T}$, so we need $\sim e^{\Delta E_{\max}/T}$ attempts on average.
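One extra step makes the "attempts" argument explicit: waiting for an event of probability $p$ is a geometric random variable with mean $1/p$, so

$$\mathbb{E}[\text{attempts}] = \frac{1}{e^{-\Delta E_{\max}/T}} = e^{\Delta E_{\max}/T}$$

which is exactly the mixing-time scaling above.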
How the Logarithmic Schedule Ensures Equilibrium
For the logarithmic schedule $T(k) = C/\log(k+1)$, we want to know the number of iterations $\Delta k$ spent in a small temperature range around $T$. Using the linear approximation (since $\Delta T$ is small), we get:
$$\Delta k \approx \left|\frac{dk}{dT}\right| \cdot \Delta T$$

To find $\frac{dk}{dT}$, invert the schedule: $\log(k+1) = C/T$, so $k(T) = e^{C/T} - 1$ and therefore

$$\frac{dk}{dT} = e^{C/T} \cdot \left(-\frac{C}{T^2}\right) = -\frac{C}{T^2} e^{C/T}$$

This gives us:

$$\Delta k \propto \frac{C}{T^2} e^{C/T}$$

For small $T$ (low temperatures), the exponential term $e^{C/T}$ dominates, so:

$$\Delta k \propto e^{C/T}$$

But remember, we need $\tau_{\text{mix}}(T) \propto e^{\Delta E_{\max}/T}$ iterations.
How does the logarithmic schedule achieve this? The key is that we need to choose $C$ large enough:

$$C \geq \Delta E_{\max}$$

When this holds, the logarithmic schedule ensures we spend exponentially many iterations at each temperature level, which is exactly what we need to overcome the exponential mixing time!
Physical Interpretation
This result beautifully mirrors physical annealing:
- In real annealing, you cool metal slowly to avoid defects (trapped states)
- The cooling rate must be slower than the relaxation time of the material
- SA's logarithmic schedule ensures we cool slower than the "relaxation time" (mixing time) of the optimization problem
The Theory vs. Practice trade-off:
Logarithmic schedules are too slow for most applications. The requirement $T(k) \geq C/\log(k+1)$ means:
- To reach $T = 0.1$ from $T_0 = 10$ with $C = 10$, we need $k \approx e^{100} \approx 10^{43}$ iterations!
- This is computationally infeasible
In practice, we use faster schedules that sacrifice theoretical guarantees for computational efficiency.
Practical Cooling Schedules
Since logarithmic schedules are impractical, several finite-time schedules are commonly used:
1. Geometric Schedule: $T_{k+1} = \alpha T_k$
- Pros: Simple, widely used, good performance in practice
- Cons: No theoretical guarantee
- Typical values: $\alpha = 0.90$ to $0.99$ (higher = slower cooling)
- Use: Most common choice for practical applications
2. Adaptive Schedule: Adjusts based on acceptance rate
- Pros: Adapts to problem characteristics, can be more efficient
- Cons: More complex, requires tuning
- Idea: Maintain target acceptance rate (e.g., 40%) by adjusting cooling rate
- Use: When problem characteristics are unknown or vary
Note: The logarithmic schedule (discussed in detail above) provides theoretical guarantees but is computationally infeasible for practical applications.
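As a rough sketch of the two practical schedules above (the 40% target and the step factors are illustrative choices of mine, not prescriptions):

```python
def geometric_schedule(T0=10.0, alpha=0.95, n_steps=1000):
    """Geometric cooling: yield T_{k+1} = alpha * T_k."""
    T = T0
    for _ in range(n_steps):
        yield T
        T *= alpha

def adaptive_cooling(T, acceptance_rate, target=0.4):
    """Adaptive cooling: cool faster when too many moves are accepted."""
    if acceptance_rate > target:
        return T * 0.90   # well above target: cool more aggressively
    return T * 0.99       # near or below target: cool gently
```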
Conclusion
Simulated Annealing bridges the gap between physics and optimization, showing how principles from statistical mechanics can solve hard optimization problems.
Key Insights:
- The Metropolis Algorithm: Probabilistic acceptance based on the Boltzmann distribution allows balancing exploration and exploitation
- Detailed Balance: The mathematical guarantee that ensures convergence to the target distribution
- The Convergence Theorem: Infinitely slow cooling guarantees finding the global minimum, though practical schedules sacrifice this guarantee for efficiency
- Physical Intuition: SA mimics real annealing, where slow cooling allows systems to find low-energy configurations
References
- Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6), 1087-1092. — Original Metropolis algorithm
- Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671-680. — Original Simulated Annealing paper
- Hajek, B. (1988). Cooling schedules for optimal annealing. Mathematics of Operations Research, 13(2), 311-329. — Convergence theorem
- Arrhenius, S. (1889). Über die Reaktionsgeschwindigkeit bei der Inversion von Rohrzucker durch Säuren. Zeitschrift für Physikalische Chemie, 4, 226-248. — Original Arrhenius equation
- Chandler, D. (1987). Introduction to Modern Statistical Mechanics. Oxford University Press. — Statistical mechanics background
- Binder, K., & Heermann, D. W. (2010). Monte Carlo Simulation in Statistical Physics: An Introduction. Springer. — Monte Carlo methods