Simulated Annealing: From Physics to Optimization

05.01.2026 — blog, optimization, algorithms, mathematics, physics — 9 min read

Introduction

The idea of Simulated Annealing (SA) has always fascinated me. I first heard about annealing in physics classes, then later saw the same idea appear in optimization and algorithms. That bridge between physical intuition and computation felt powerful, but I did not fully understand why the method works.

This post is my attempt to close that gap in a readable way.

The goal is simple: if you are comfortable with basic graduate-level math (probability, calculus, linear algebra), you should finish with an intuitive and mathematical "aha" moment.

I also wrote a companion project post where SA is applied to the Ising model.

In this article, we follow one clean line:

Why naive optimization strategies fail
How the Metropolis acceptance rule fixes that
Why this rule is tied to the Boltzmann distribution
Why slow cooling gives convergence guarantees (in theory)

1) Why Simulated Annealing Exists

Before formulas, we need the optimization problem SA is designed to solve.

Two Naive Strategies (and Why They Fail)

Strategy A: Greedy descent (always go downhill)

Rule: accept only moves that reduce energy/cost
Failure mode: local minima trap you forever

Strategy B: Random walk (accept everything)

Rule: accept every proposed move
Failure mode: endless wandering, no concentration near minima

Simulated Annealing sits between these extremes: it accepts bad moves sometimes, but in a controlled way.

The Core Intuition

At high temperature, accept many moves to explore.
At low temperature, become selective and exploit good regions.

This exploration-to-exploitation transition is the heart of SA.

Checkpoint

Greedy is too rigid.
Random walk is too noisy.
SA uses temperature to interpolate between them.

2) Metropolis Rule: The Engine Inside SA

The engine is the Metropolis algorithm.

At iteration $k$ , with current state $x$ :

Propose a candidate $x'$ from proposal distribution $q(x' \mid x)$ .
Compute energy change $\Delta E = E(x') - E(x)$ .
Accept with probability

A(x \to x') = \min\left(1, e^{-\Delta E/T}\right)

Equivalent piecewise form:

A(x \to x') = \begin{cases}
1 & \text{if } \Delta E \leq 0 \\
e^{-\Delta E/T} & \text{if } \Delta E > 0
\end{cases}

What This Means

If $\Delta E \leq 0$ : always accept (downhill)
If $\Delta E > 0$ : sometimes accept (uphill), with probability controlled by $T$

Temperature behavior:

High $T$ : many uphill moves accepted -> exploration
Low $T$ : uphill moves suppressed -> exploitation
$T \to 0$ : behavior approaches greedy descent

About the Proposal $q(x' \mid x)$

To keep the theory clean, we usually assume:

Symmetry: $q(x' \mid x) = q(x \mid x')$
Reachability: any state can eventually be reached (irreducibility/ergodicity intuition)

Checkpoint

Metropolis gives a principled "sometimes uphill" rule.
Temperature is not decoration; it directly controls acceptance.
Proposal quality affects practical performance.

3) Why Boltzmann Appears

Now we connect the rule above to physics.

The Boltzmann distribution assigns probability to state energy at temperature $T$ :

\pi(x) \propto e^{-E(x)/T}

Interpretation:

High $T$ : probabilities are flatter -> broad exploration
Low $T$ : low-energy states dominate -> concentration near minima

The important claim is: with symmetric proposals and the Metropolis acceptance rule, this Boltzmann distribution is the stationary target at fixed temperature.

Detailed Balance (Core Argument)

Let transition probability be:

P(x \to x') = q(x' \mid x)\,A(x \to x')

We want:

\pi(x)P(x \to x') = \pi(x')P(x' \to x)

for $\pi(x) \propto e^{-E(x)/T}$ .

Metropolis acceptance is chosen exactly so this identity holds (for downhill and uphill cases). Once detailed balance holds, $\pi$ is stationary.

So the acceptance rule is not ad hoc: it is mathematically tuned to target Boltzmann at fixed $T$ .

Checkpoint

Boltzmann is the fixed-temperature target.
Detailed balance is the mathematical consistency check.
Arrhenius intuition explains the exponential acceptance shape.

Deep Dive (Optional): Arrhenius and Full Detailed-Balance Proof

Arrhenius and Physical Rates

In thermal systems, transition rates often satisfy Arrhenius-like ratios:

w(i \to j) / w(j \to i) = e^{-(E_j - E_i)/T}

This matches the same exponential structure seen in Metropolis acceptance. The classical Arrhenius equation is:

k = A e^{-E_a / (RT)}

Variable meanings:

$k$ : transition/reaction rate
$E_a$ : activation barrier
$T$ : temperature

This is exactly the physical intuition behind "rare uphill moves."

Arrhenius-Based Derivation of Boltzmann

At thermal equilibrium, detailed balance states:

P(i) \cdot w(i \to j) = P(j) \cdot w(j \to i)

where $w(i \to j)$ is the transition rate from state $i$ to $j$ .

If rates follow Arrhenius-like behavior:

\frac{w(i \to j)}{w(j \to i)} = e^{-(E_j - E_i)/T}

then:

P(i)\cdot w(i \to j)
= P(j)\cdot w(i \to j)\cdot e^{-(E_i-E_j)/T}

Rearranging:

\frac{P(i)}{P(j)} = e^{-(E_i-E_j)/T}

which implies:

P(i) \propto e^{-E_i/T}

What is Arrhenius-like behavior?

The Arrhenius equation gives a temperature-dependent rate law:

k = A e^{-E_a/(RT)}

with $k$ as rate, $E_a$ as activation barrier, $R$ as gas constant, and $T$ as temperature.

Physical interpretation:

To move from one state to another, the system must overcome an energy barrier.
The probability of enough thermal fluctuation scales like an exponential penalty.
Higher temperature increases barrier-crossing events.
Larger barriers suppress transition rates.

Full Two-Case Detailed Balance Proof

We want to verify:

\pi(x)P(x\to x') = \pi(x')P(x'\to x)

with:

P(x\to x')=q(x'|x)A(x\to x'), \quad
\pi(x)\propto e^{-E(x)/T}

assuming symmetric proposals $q(x'|x)=q(x|x')$ .

Case 1: Downhill move ( $\Delta E \le 0$ )

\Delta E = E(x')-E(x)\le 0

Then:

$A(x\to x')=1$
$A(x'\to x)=e^{+\Delta E/T}$ (since reverse is uphill by amount $-\Delta E$ )

Left side:

\pi(x)P(x\to x')=\pi(x)\,q(x'|x)

Right side:

\pi(x')P(x'\to x)
=\pi(x')\,q(x|x')\,e^{+\Delta E/T}

Using $\pi(x')=\pi(x)e^{-\Delta E/T}$ and proposal symmetry:

\pi(x')\,q(x|x')\,e^{+\Delta E/T}
=\pi(x)e^{-\Delta E/T}q(x'|x)e^{+\Delta E/T}
=\pi(x)q(x'|x)

So detailed balance holds.

Case 2: Uphill move ( $\Delta E > 0$ )

Then:

$A(x\to x')=e^{-\Delta E/T}$
$A(x'\to x)=1$

Left side:

\pi(x)P(x\to x')
=\pi(x)\,q(x'|x)\,e^{-\Delta E/T}

Right side:

\pi(x')P(x'\to x)
=\pi(x')\,q(x|x')

Using $\pi(x')=\pi(x)e^{-\Delta E/T}$ and symmetry:

\pi(x')\,q(x|x')
=\pi(x)e^{-\Delta E/T}q(x'|x)

Left and right are equal, so detailed balance also holds in this case.

Therefore, with symmetric proposals and Metropolis acceptance, detailed balance holds in both regimes.

4) Simulated Annealing: Turning Sampling into Optimization

At fixed $T$ , Metropolis samples a Boltzmann distribution.

SA changes this by making temperature depend on iteration:

T = T(k), \quad T(k) \downarrow 0

So SA is a time-inhomogeneous (non-stationary) Markov process.

Why Cooling Must Be Slow

There is a classic theorem (Hajek, 1988):

If we cool infinitely slowly (i.e., $T(k) \to 0$ as $k \to \infty$ in a specific way), SA is guaranteed to find the global minimum with probability 1.

One sufficient schedule is:

T(k) \geq \frac{C}{\log(k+1)}

with large enough $C$ (related to worst energy barriers). Under standard regularity conditions, this implies:

\lim_{k\to\infty} P(\text{SA reaches a global minimum}) = 1

Core Intuition Behind the Theorem

Escaping a barrier of height $\Delta E$ has probability roughly:

P_{\text{escape}}(T) \propto e^{-\Delta E/T}

As $T$ drops, escape becomes exponentially harder. So if cooling is too fast, the chain freezes before crossing key barriers.

Logarithmic cooling is slow enough to keep giving the chain time to mix and escape.

Practice vs Theory

The theorem is asymptotic and conservative. In practice, logarithmic cooling is often far too slow.

Example: for moderate constants, iteration counts can become astronomically large. For example, reaching $T=0.1$ from $T_0=10$ with $C=10$ gives $k \approx e^{100} \approx 10^{43}$ iterations.

So practitioners typically choose faster schedules that work well empirically but lose strict global-optimum guarantees.

Deep Dive (Optional): Mixing-Time View and Temperature-Window Scaling

Mixing-time scaling

For many landscapes, mixing time behaves like:

\tau_{\text{mix}}(T) \propto e^{\Delta E_{\max}/T}

The log schedule spends exponentially many iterations at low temperatures, matching this exponential difficulty when $C$ is sufficiently large.

Temperature-window calculus

For logarithmic cooling:

T(k) = \frac{C}{\log(k+1)}

Invert approximately (large $k$ ): $k \sim e^{C/T}$ .

For a small temperature window $\Delta T$ , iterations spent near $T$ satisfy:

\Delta k \approx \left|\frac{dk}{dT}\right|\Delta T

and:

\frac{dk}{dT} \propto -\frac{C}{T^2}e^{C/T}

hence:

\Delta k \propto \frac{C}{T^2}e^{C/T}

At low $T$ , the exponential dominates:

\Delta k \propto e^{C/T}

Putting this together with $\tau_{\text{mix}}(T) \propto e^{\Delta E_{\max}/T}$ : choosing $C \geq \Delta E_{\max}$ ensures the schedule allocates enough iterations per low-temperature band (in asymptotic terms) to keep up with the barrier-crossing difficulty.

Checkpoint

Slow cooling is tied to barrier-crossing probabilities, not folklore.
The convergence theorem is about the infinite-time limit.
Practical schedules trade guarantees for tractability.

5) Practical Cooling Schedules

Since logarithmic schedules are rarely practical, people use finite-time schedules:

Geometric: $T_{k+1} = \alpha T_k$

Pros: simple, robust, widely used
Cons: no strict asymptotic guarantee
Typical $\alpha$ : 0.90 to 0.99 (larger means slower cooling)

Adaptive: tune cooling from acceptance rate

Pros: adapts to problem landscape
Cons: more knobs, more implementation complexity
Typical idea: keep acceptance rate in a target band (for example 20%-50%)

Rule of thumb: start with geometric cooling, then move to adaptive only if needed.

Conclusion

Simulated Annealing is a beautiful bridge between physics and optimization.

At a fixed temperature, Metropolis sampling follows Boltzmann-like behavior. By lowering temperature over time, we turn that sampling mechanism into an optimizer that can first explore and then settle.

If there is one takeaway, it is this:

Accepting occasional uphill moves is not a hack; it is the mechanism that prevents premature trapping. Cooling controls when exploration gives way to exploitation.

Key ideas to retain:

Metropolis acceptance creates controlled uphill exploration.
Detailed balance links that rule to the Boltzmann target at fixed temperature.
Slow cooling enables global-optimum guarantees in theory.
Practical schedules trade theoretical guarantees for computational feasibility.

References

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6), 1087-1092. — Original Metropolis algorithm
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671-680. — Original Simulated Annealing paper
Hajek, B. (1988). Cooling schedules for optimal annealing. Mathematics of Operations Research, 13(2), 311-329. — Convergence theorem
Arrhenius, S. (1889). Über die Reaktionsgeschwindigkeit bei der Inversion von Rohrzucker durch Säuren. Zeitschrift für Physikalische Chemie, 4, 226-248. — Original Arrhenius equation
Chandler, D. (1987). Introduction to Modern Statistical Mechanics. Oxford University Press. — Statistical mechanics background
Binder, K., & Heermann, D. W. (2010). Monte Carlo Simulation in Statistical Physics: An Introduction. Springer. — Monte Carlo methods

…