A Tiny Boltzmann Machine


Boltzmann Machines

Here we give an introduction to Boltzmann machines and present a Tiny Restricted Boltzmann Machine that runs in the browser.


Boltzmann Machines are used for unsupervised learning, which means they can learn from data without being told what to look for.

They can be used to generate new data that is similar to the data they were trained on, also known as generative AI.

Boltzmann Machine

A Boltzmann Machine is a type of neural network that tries to learn patterns by mimicking how energy works in physics.

Each neuron can be on or off, and the machine is made up of many of these neurons connected to each other.

Some neurons are visible (we can see them and even set their state), and some are hidden (we can't see them).

The connections between neurons are called weights, and they can be positive or negative.

[Interactive figure: a network of visible units v0–v9 and hidden units h0–h9; hover over a neuron to highlight its connections.]

General Boltzmann Machine

[Interactive figure: a General Boltzmann Machine with visible units v0–v9 and hidden units h0–h9, all pairwise connected; hover over a neuron to highlight its connections.]

A General Boltzmann Machine has connections between all neurons. This makes it powerful, but its training involves calculating an O(2^n) term.

Restricted Boltzmann Machine

[Interactive figure: a Restricted Boltzmann Machine with visible units v0–v9 connected only to hidden units h0–h9.]

A Restricted Boltzmann Machine is a special case where there are no connections within the visible layer or within the hidden layer; neurons are only connected across the two layers. This makes it faster to train and easier to understand.

The energy of a configuration of the visible and hidden units is defined as:

E(v,h) = -\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} v_i h_j - \sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j

where v is the visible layer, h is the hidden layer, w is the weight matrix, and b and c are the biases for the visible and hidden layers, respectively.
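As a concrete illustration, the energy above can be computed in a few lines of NumPy. This is a sketch; the variable names and the tiny 3-by-2 example values are our own, not from the article:

```python
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -v·W·h - b·v - c·h for binary state vectors v and h."""
    return -(v @ W @ h) - (b @ v) - (c @ h)

# Tiny illustrative example: 3 visible units, 2 hidden units.
v = np.array([1.0, 0.0, 1.0])      # visible states
h = np.array([1.0, 1.0])           # hidden states
W = np.zeros((3, 2))               # weights (zero here, so no interaction term)
b = np.array([0.5, 0.0, 0.5])      # visible biases
c = np.array([0.0, -0.5])          # hidden biases
print(energy(v, h, W, b, c))       # -(0) - (1.0) - (-0.5) = -0.5
```

Lower energy corresponds to configurations the machine considers more probable, which is why training pushes the energy of training samples down.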

The visualisation on the right randomises the weights, biases and activation values of a Boltzmann machine and calculates its energy.

During training it is given examples (e.g., images, text) and the machine adjusts its weights to lower the energy of those samples.

It effectively learns P(v), the probability of visible units v, which is proportional to e^{-E(v)}.

After training, it can sample new data from the learned distribution using Gibbs sampling.

These samples are new, never-before-seen examples that are statistically similar to the training data.
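A minimal sketch of this sampling procedure in NumPy, assuming the standard logistic conditionals of an RBM (the function names, step count, and example parameters are illustrative, not from the article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample(W, b, c, steps=100, rng=None):
    """Draw one visible configuration from an RBM by alternating between
    sampling hidden given visible and visible given hidden."""
    if rng is None:
        rng = np.random.default_rng()
    v = (rng.random(b.shape) < 0.5).astype(float)  # random initial visible state
    for _ in range(steps):
        h = (rng.random(c.shape) < sigmoid(c + v @ W)).astype(float)
        v = (rng.random(b.shape) < sigmoid(b + W @ h)).astype(float)
    return v

# Example with an untrained 3-by-2 RBM (all-zero parameters, so samples are uniform):
W, b, c = np.zeros((3, 2)), np.zeros(3), np.zeros(2)
sample = gibbs_sample(W, b, c, steps=10, rng=np.random.default_rng(0))
```

With trained weights, the same loop produces samples that concentrate on low-energy (high-probability) configurations.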

Here is our training data.

We want the network to learn how to make similar samples to these.

Let's simulate one step at a time.

A Restricted Boltzmann Machine (RBM) is trained using a process called Contrastive Divergence. The steps are as follows:

  1. Clamp the visible units to a training sample
  2. Sample the hidden units given the visible units
  3. Sample a reconstruction of the visible units given the hidden units
  4. Sample the hidden units again, given the reconstruction
  5. Update the weights and biases

A more formal description of the steps above is given in the Appendix.
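The steps above can be sketched as a single CD-1 update in NumPy. This is an illustrative sketch: the variable names, the learning rate, and the choice to use hidden probabilities rather than binary samples in the final update are our own assumptions, not taken from the article:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1, rng=None):
    """One contrastive-divergence (CD-1) update; returns new W, b, c."""
    if rng is None:
        rng = np.random.default_rng()
    # Steps 1-2: clamp the visible units to the data, sample the hidden units.
    ph0 = sigmoid(c + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Step 3: sample a reconstruction of the visible units.
    pv1 = sigmoid(b + W @ h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    # Step 4: hidden probabilities for the reconstruction.
    ph1 = sigmoid(c + v1 @ W)
    # Step 5: update weights and biases with the CD-1 gradient estimate.
    W = W + lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b = b + lr * (v0 - v1)
    c = c + lr * (ph0 - ph1)
    return W, b, c

# One update on a single 3-unit training vector with a 3-by-2 RBM:
v0 = np.array([1.0, 0.0, 1.0])
W, b, c = cd1_step(v0, np.zeros((3, 2)), np.zeros(3), np.zeros(2))
```

Repeating this update over many training samples is what the simulator below animates.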

[Interactive figure: visible units v0–v9 and hidden units h0–h9 during a single training step.]

Simulator

Press the "Run Simulation" button to start training the RBM. If you let the simulation run for a while, you will see the weights of the RBM converge to a stable state. The energy loss will also decrease over time.

You can compare the input and output states of the RBM by pausing the simulation.

In the beginning, the input and output states will be dissimilar. As the simulation progresses, the input and output states will become more similar.

[Interactive figure: visible units v0–v9 and hidden units h0–h9 in the simulator.]

Appendix: Contrastive Divergence

Starting with a Boltzmann machine as defined earlier, we want to derive the contrastive divergence algorithm for training. The goal is to adjust the weights of the network to minimize the energy of the training data.

We have:

  • A visible layer v and a hidden layer h.
  • A weight matrix W that connects the visible and hidden layers.
  • A bias vector b for the visible layer and a bias vector c for the hidden layer.

Energy function:

E(v,h) = -\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} v_i h_j - \sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j

Joint distribution:

P(v,h) = \frac{1}{Z} e^{-E(v,h)}

Where Z is the partition function, which normalizes the distribution.

We train the RBM by maximizing the likelihood of the training data, i.e. maximizing log P(v).

The marginal likelihood of the visible layer is given by:

P(v) = \sum_{h} P(v,h)

Then the log-likelihood is:

\log P(v) = \log \sum_{h} \frac{1}{Z} e^{-E(v,h)} = \log \sum_{h} e^{-E(v,h)} - \log Z

Differentiating with respect to the weights w_{ij} gives:

\begin{align*} \frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} &= \frac{1}{\sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}} \sum_{\mathbf{h}} \left( -\frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial w_{ij}} \right) e^{-E(\mathbf{v}, \mathbf{h})} \\ &\quad - \frac{1}{Z} \sum_{\mathbf{v}, \mathbf{h}} \left( -\frac{\partial E(\mathbf{v}, \mathbf{h})}{\partial w_{ij}} \right) e^{-E(\mathbf{v}, \mathbf{h})} \end{align*}
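These two terms simplify because the energy is linear in each weight, so its derivative is just -v_i h_j. Substituting this in and recognising the normalised sums as expectations gives (an intermediate step we spell out here for clarity):

```latex
\frac{\partial E(v,h)}{\partial w_{ij}} = - v_i h_j
\quad\Longrightarrow\quad
\frac{\partial \log P(v)}{\partial w_{ij}}
  = \sum_{h} P(h \mid v)\, v_i h_j - \sum_{v,h} P(v,h)\, v_i h_j
  = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}
```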

Similar forms exist for the biases b_i and c_j.

Since we are performing gradient ascent, the update at each step is:

\Delta w_{ij} \leftarrow \Delta w_{ij} + \eta \frac{\partial \log P(v)}{\partial w_{ij}}

Therefore we get our weight update rule:

\Delta w_{ij} = \eta \frac{\partial \log P(v)}{\partial w_{ij}} = \eta \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right)

A similar process can be followed for the biases b_i and c_j.

\Delta b_i = \eta \left( \langle v_i \rangle_{data} - \langle v_i \rangle_{model} \right)

\Delta c_j = \eta \left( \langle h_j \rangle_{data} - \langle h_j \rangle_{model} \right)

Where ⟨⋅⟩_data is the expectation with respect to the training data and ⟨⋅⟩_model is the expectation with respect to the model distribution.

The next step is to approximate the model expectation using Gibbs sampling.

  1. Positive phase: sample \mathbf{h}^{(0)} \sim P(\mathbf{h} \mid \mathbf{v}^{(0)} = \text{data})
  2. Negative phase: run k steps of Gibbs sampling, alternating between
  • \mathbf{v}^{(t+1)} \sim P(\mathbf{v} \mid \mathbf{h}^{(t)})
  • and \mathbf{h}^{(t+1)} \sim P(\mathbf{h} \mid \mathbf{v}^{(t+1)})
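Both phases rely on the fact that an RBM's conditionals factorize into independent logistic units, which is what makes each Gibbs step cheap. This standard result, stated here for reference, is:

```latex
P(h_j = 1 \mid v) = \sigma\!\left(c_j + \sum_{i=1}^{m} w_{ij} v_i\right), \qquad
P(v_i = 1 \mid h) = \sigma\!\left(b_i + \sum_{j=1}^{n} w_{ij} h_j\right),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```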

Once those steps are done we update the weights and biases according to:

\Delta w_{ij} = \eta \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right)

\Delta b_i = \eta \left( \langle v_i \rangle_{data} - \langle v_i \rangle_{model} \right)

\Delta c_j = \eta \left( \langle h_j \rangle_{data} - \langle h_j \rangle_{model} \right)
