Importance Sampling: A Comprehensive Guide
Introduction
Importance sampling is a powerful approximation technique used to estimate properties of a distribution when direct sampling is difficult or impossible. Instead of directly sampling from the target distribution, importance sampling draws samples from a different, more convenient distribution (called the importance function) and then weights these samples to correct for the difference between the two distributions. This article provides a comprehensive tutorial on importance sampling, covering its theoretical foundations, practical considerations, and applications.
The Essence of Importance Sampling
Importance sampling is an approximation method, not a direct sampling method. It's a trick derived from a mathematical transformation that reformulates a problem. It is especially useful in machine learning for calculating expectations, which are frequently the answers sought by many ML algorithms. Expectations arise when fitting models, performing probability queries, summarizing model performance, and training reinforcement learning agents.
Mathematically, we often need to calculate the expectation of a function ( f(x) ) with respect to a probability distribution ( p(x) ):
[E[f(x)] = \int f(x) p(x) dx]
Here, ( x ) is a continuous random vector and ( p(x) ) is its probability density. The problem is that this integral is often impossible to compute exactly, especially when the dimension of ( x ) is high. Monte Carlo methods offer a solution by approximating the expectation with an average:
[E[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)]
where the ( x_i )'s are sampled from ( p(x) ). As ( N ) gets large, this average approaches the true expectation.
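To make this concrete, here is a minimal sketch (not from the original article) of plain Monte Carlo estimation, using ( f(x) = x^2 ) under a standard normal, for which the true expectation is exactly 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sampler, n):
    """Plain Monte Carlo: average f over n samples drawn from p(x)."""
    samples = sampler(n)
    return np.mean(f(samples))

# E[x^2] under a standard normal is exactly 1 (it is the variance).
est = mc_expectation(lambda x: x**2, lambda n: rng.standard_normal(n), 100_000)
print(est)  # close to 1.0
```

With 100,000 samples the estimate lands very close to the true value; smaller ( N ) gives a noisier answer, which is the subject of the next section.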
Monte Carlo Methods: A Foundation
Monte Carlo methods are concerned with calculating expectations, a critical task in machine learning. Given a random variable ( x ) with probability density ( p(x) ), the expectation of a function ( f(x) ) is:
[E[f(x)] = \int f(x) p(x) dx]
This can be understood as the probability-weighted average of ( f(x) ) over the entire space where ( x ) exists. For discrete ( x ), this becomes:
[E[f(x)] = \sum_x f(x) p(x)]
However, when ( x ) has a high dimension, calculating the expectation directly becomes infeasible. Monte Carlo methods address this by sampling from the probability distribution of ( x ) and estimating the expectation from the samples:
[E[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)]
As ( N ) becomes large, this estimation approaches the real expectation.
Estimators, Bias, and Variance
An estimator is a rule for calculating an estimate of a given quantity based on observed data. Running the Monte Carlo method yields an estimator for the expectation. Each time we run through the process of sampling and estimating, we produce a slightly different estimation, denoted as ( s ), for the expectation. These estimations have a distribution with its own expectation and variance.
The Central Limit Theorem (CLT) states that for independent and identically distributed random variables, the sampling distribution of the standardized sample mean tends towards the standard normal distribution, even if the original variables are not normally distributed. Thus, when ( N ) is large enough, the distribution of ( s ) is a normal distribution, even if ( p(x) ) is not normally distributed.
Monte Carlo methods generate unbiased estimators for the expectation of a distribution. The expectation and variance of the estimator are:
[E[s] = E[f(x)]]
[Var(s) = \frac{Var(f(x))}{N}]
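The ( \frac{1}{N} ) scaling of the variance can be checked empirically. The sketch below (the function and sample sizes are my own illustrative choices) repeats the Monte Carlo estimate many times at two sample sizes and compares the spread of the resulting estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

def estimator_variance(n_samples, n_repeats=2000):
    """Empirical variance of the Monte Carlo mean of f(x) = x^2, x ~ N(0, 1)."""
    estimates = [np.mean(rng.standard_normal(n_samples) ** 2)
                 for _ in range(n_repeats)]
    return np.var(estimates)

v_small = estimator_variance(100)
v_large = estimator_variance(1000)
# Ten times the samples should give roughly a tenth of the variance,
# matching Var(s) = Var(f(x)) / N.
print(v_small / v_large)
```

The printed ratio comes out near 10, as the formula predicts.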
Importance Sampling: The Details
Importance sampling introduces a new distribution ( q(x) ) to improve the estimation of the expectation. We can rewrite the expectation formula as:
[E[f(x)] = \int f(x) \frac{p(x)}{q(x)} q(x) dx]
This involves multiplying the integrand by ( \frac{q(x)}{q(x)} ), which equals 1, as long as ( q(x) > 0 ) wherever ( p(x) ) is non-zero. We then reorder the terms, grouping ( f(x) \frac{p(x)}{q(x)} ) as the new function of interest. This allows us to sample from ( q(x) ) and still estimate the expectation of ( f(x) ) over ( p(x) ).
The original expectation can be approximated using samples from ( q(x) ):
[E[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) \frac{p(x_i)}{q(x_i)}]
where the ( x_i )'s are sampled from ( q(x) ). Let's call this new average ( r ).
Advantages of Importance Sampling
Unbiased Estimator: Just like in the Monte Carlo method, importance sampling yields an unbiased estimator:
[E[r] = E[f(x)]]
Variance Reduction: Importance sampling can potentially reduce variance. The variance of ( r ) is:
[Var(r) = \frac{Var_q\left(f(x) \frac{p(x)}{q(x)}\right)}{N}]
where the variance is now taken with respect to ( q(x) ), since that is the distribution we sample from.
The goal is to choose ( q(x) ) such that this variance is less than the variance we dealt with earlier. It turns out that ( q(x) ) should favor ( x ) values where the absolute value of ( p(x)f(x) ) is high.
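This guideline has a precise justification. A standard result (stated here for completeness; it is not derived in this article) is that the variance-minimizing choice is

[q^*(x) = \frac{|f(x)| p(x)}{\int |f(x')| p(x') dx'}]

When ( f(x) \ge 0 ) everywhere, this choice makes every weighted sample contribute exactly ( E[f(x)] ), giving zero variance. It cannot be used directly, because the normalizing integral is essentially the quantity we are trying to estimate, but it explains why a good ( q(x) ) should be large where ( |f(x)| p(x) ) is large.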
How to Choose the New Distribution ( q(x) )
Choosing ( q(x) ) in practice is tricky. The best choice depends on ( p(x) ) and ( f(x) ), which are, by assumption, already difficult to work with. Here are some guidelines:
- Use any available knowledge about ( p(x) ) and ( f(x) ) to guide the choice.
- Choose a ( q(x) ) that approximates ( p(x) ).
- Select a ( q(x) ) that is high where the absolute value of ( p(x) ) times ( f(x) ) is high.
A poorly selected ( q(x) ) can lead to a density ratio that varies wildly over the samples, with the majority of them being very small. This results in an average effectively determined by a small number of samples, leading to high variance.
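A standard diagnostic for this failure mode, not covered in the original text, is the effective sample size computed from the importance weights; a value far below ( N ) signals that a few samples dominate the average:

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2; ranges from 1 (one sample dominates
    the average) up to N (all weights are equal)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))       # 4.0: healthy
print(effective_sample_size([100.0, 0.01, 0.01, 0.01]))  # ~1.0: degenerate
```

When the ESS collapses toward 1, the importance sampling estimate is effectively based on a single sample, and a different ( q(x) ) should be chosen.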
Importance Sampling in Action: Examples and Applications
A Simple Example
Let's consider a scenario where importance sampling can significantly reduce variance. Suppose ( p(x) ) and ( f(x) ) are defined such that ( f(x) ) is large only in rare events under ( p(x) ).
In this case, a straightforward Monte Carlo approach would yield estimates that bounce around considerably (high variance). This is because a small set of samples have an outsized impact on the average. Importance sampling can help by sampling the more important regions more frequently.
For example, let's define ( q(x) ) as a distribution that samples the regions where ( f(x) ) is large more often. To use samples from ( q(x) ), we adjust ( f(x) ) by multiplying it by the density ratio ( \frac{p(x)}{q(x)} ). After this adjustment, we proceed as with plain Monte Carlo: sample from ( q(x) ), evaluate the density-ratio-adjusted ( f(x) ) at those samples, and average.
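A classic instance of this rare-event setup (the specific numbers here are my own) is estimating ( P(x > 4) ) for a standard normal, i.e. ( f(x) ) is the indicator of ( x > 4 ). Shifting the proposal to be centered on the event region makes the estimate usable:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 10_000

# Rare event: f(x) = 1 if x > 4 else 0, with x ~ N(0, 1).
# The true value P(X > 4) is about 3.17e-5, so plain Monte Carlo
# with 10,000 samples often sees no hits at all.
plain = rng.standard_normal(n)
print("plain MC:", np.mean(plain > 4))  # often exactly 0.0

# Importance sampling: q(x) = N(4, 1) concentrates samples near the event.
q_samples = rng.normal(4, 1, n)
ratio = stats.norm(0, 1).pdf(q_samples) / stats.norm(4, 1).pdf(q_samples)
is_est = np.mean((q_samples > 4) * ratio)
print("importance sampling:", is_est)  # close to 3.17e-5
```

The importance sampling estimate is stable at a value plain Monte Carlo can barely detect, because roughly half the proposal's samples land in the event region and the density ratio corrects for the shift.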
Python Code Example
Consider estimating the expectation of a function ( f(x) ) where ( x \sim p(x) ). The Monte Carlo sampling method involves sampling ( x ) from ( p(x) ) and averaging the samples. If ( p(x) ) is hard to sample from, we can use importance sampling with a known distribution ( q(x) ). The key formula is:
[E[f(x)] = \int f(x) \frac{p(x)}{q(x)} q(x) dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i) \frac{p(x_i)}{q(x_i)}]
where ( x_i ) is sampled from ( q(x) ), and ( \frac{p(x)}{q(x)} ) is the sampling weight.
Here’s a Python example:
```python
import numpy as np
from scipy import stats

def f_x(x):
    return 1 / (1 + np.exp(-x))

def distribution(mu=0, sigma=1):
    return stats.norm(mu, sigma)

n = 1000
mu_target = 3.5
sigma_target = 1
mu_appro = 3
sigma_appro = 1

p_x = distribution(mu_target, sigma_target)
q_x = distribution(mu_appro, sigma_appro)

# Compute true value sampled from p(x)
s = 0
for i in range(n):
    x_i = np.random.normal(mu_target, sigma_target)
    s += f_x(x_i)
print("Simulated value (from p(x)):", s / n)

# Sample from q(x) and reweight by the density ratio p(x)/q(x)
value_list = []
for i in range(n):
    x_i = np.random.normal(mu_appro, sigma_appro)
    value = f_x(x_i) * (p_x.pdf(x_i) / q_x.pdf(x_i))
    value_list.append(value)
print("Importance sampling estimate:", np.mean(value_list))
print("Variance:", np.var(value_list))
```

In this example, both ( p(x) ) and ( q(x) ) are normal distributions. You can modify the code to use a ( p(x) ) that is hard to sample from.
Comparison of Different Distributions
If ( q(x) ) is too similar to ( p(x) ), the benefits of importance sampling might be marginal. However, if ( q(x) ) is significantly different, more samples may be needed to approximate the value. If the distributions are too dissimilar, ( \frac{p(x)}{q(x)} ) can vary greatly, increasing the variance.
For example, consider the case where ( p(x) ) and ( q(x) ) are normal distributions with different means:
```python
n = 5000
mu_target = 3.5
sigma_target = 1
mu_appro = 1
sigma_appro = 1

p_x = distribution(mu_target, sigma_target)
q_x = distribution(mu_appro, sigma_appro)

value_list = []
for i in range(n):
    x_i = np.random.normal(mu_appro, sigma_appro)
    value = f_x(x_i) * (p_x.pdf(x_i) / q_x.pdf(x_i))
    value_list.append(value)
print("Importance sampling estimate:", np.mean(value_list))
print("Variance:", np.var(value_list))
```

In this case, you might get a good estimation value, but the variance can be quite high.
Importance Sampling in Reinforcement Learning
One application of importance sampling in ML is policy-based reinforcement learning. When updating a policy, estimating the value functions of the new policy can be costly. If the new policy is relatively close to the old policy, you can calculate the total rewards based on the old policy and reweight them according to the new policy.
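As a toy sketch of this reweighting idea (the two-action bandit, its policies, and its reward values are all hypothetical), we can evaluate a new policy using only data collected under an old one:

```python
import numpy as np

# Hypothetical two-action bandit; each array gives action probabilities.
old_policy = np.array([0.5, 0.5])   # behavior policy the data came from
new_policy = np.array([0.8, 0.2])   # target policy we want to evaluate

rng = np.random.default_rng(3)
true_reward = np.array([1.0, 0.0])  # action 0 pays 1, action 1 pays 0

# Collect experience under the old policy only.
actions = rng.choice(2, size=50_000, p=old_policy)
rewards = true_reward[actions]

# Importance-weight each observed reward by new_policy(a) / old_policy(a).
weights = new_policy[actions] / old_policy[actions]
est = np.mean(weights * rewards)
print(est)  # should approach 0.8 * 1.0 + 0.2 * 0.0 = 0.8
```

This is exactly the importance sampling identity with the old policy playing the role of ( q(x) ) and the new policy playing the role of ( p(x) ); no fresh interaction with the environment is needed.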
Multiple Importance Sampling (MIS)
In many cases, we might have several sampling strategies (i.e., a bunch of distributions), each of which works in some cases and doesn't work in others. Multiple importance sampling (MIS) is a technique to combine several distributions at once.
With MIS, the estimate is given by a weighted average:
[E[f(x)] \approx \sum_{i=1}^{N} w_i(x_i) \frac{f(x_i)}{p_i(x_i)}]
where each ( x_i ) is drawn from its corresponding distribution ( p_i(x) ) and the ( w_i(x) ) are weight functions. A common choice for the weights is the balance heuristic:
[w_i(x) = \frac{p_i(x)}{\sum_{j=1}^{N} p_j(x)}]
For MIS to work, the weights must sum to 1 wherever ( f(x) \neq 0 ), and ( w_i(x) ) must be 0 wherever ( p_i(x) = 0 ); in particular, whenever ( f(x) \neq 0 ), at least one ( p_i(x) ) must be positive.
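Here is a sketch of MIS with the balance heuristic under a toy setup of my own: the quantity being integrated is ( f(x) p(x) ) for a standard-normal target with ( f(x) = x^2 ) (true expectation 1), and two normal proposals play the role of the sampling distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Target: E[f(x)] with x ~ N(0, 1) and f(x) = x^2 (true value 1.0).
p = stats.norm(0, 1)
f = lambda x: x ** 2

# Two proposal distributions; n samples drawn from each.
proposals = [stats.norm(-1, 1.5), stats.norm(1, 1.5)]
n = 20_000

total = 0.0
for q_i in proposals:
    x = q_i.rvs(size=n, random_state=rng)
    # Balance heuristic: w_i(x) = n_i q_i(x) / sum_j n_j q_j(x)
    denom = sum(n * q_j.pdf(x) for q_j in proposals)
    w = n * q_i.pdf(x) / denom
    # Each sample contributes w_i(x) * f(x) p(x) / q_i(x), averaged over n.
    total += np.mean(w * f(x) * p.pdf(x) / q_i.pdf(x))
print(total)  # close to 1.0
```

The balance heuristic effectively treats the combined samples as if they came from the mixture of the proposals, which is what makes it robust when any single proposal covers only part of the integrand.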
Importance Sampling and Bayesian Inference
In Bayesian inference, importance sampling is used to estimate posterior quantities by drawing from a specified importance function rather than from the posterior distribution itself, which is typically hard to sample. It is particularly useful when the region of interest has a small probability of occurrence.
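As a small illustration (a hypothetical coin-flip model, not from the original article), one can estimate a posterior mean by drawing from the prior as the importance function and weighting each draw by its likelihood. The conjugate posterior is known in closed form here, so the estimate can be checked:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy Bayesian model: theta ~ Uniform(0, 1) prior, 7 heads in 10 flips.
# The exact posterior is Beta(8, 4), whose mean is 8 / 12 ~ 0.667.
heads, flips = 7, 10

# Draw from the prior as the importance function; since the prior is
# uniform, the importance weight of each draw is just its likelihood.
theta = rng.uniform(0, 1, 100_000)
weights = theta ** heads * (1 - theta) ** (flips - heads)

# Self-normalized importance sampling estimate of the posterior mean.
post_mean = np.sum(weights * theta) / np.sum(weights)
print(post_mean)  # close to 0.667
```

Note the self-normalization (dividing by the sum of the weights): it lets us ignore the unknown normalizing constant of the posterior, which is exactly the situation that makes direct sampling hard in the first place.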

