Introduction

A common “gotcha” with creating simple Monte Carlo scripts in Python or Excel is the request to add correlation to the distributions being sampled. This is a difficult thing to do in Excel without using plugins like AtRisk, but in Python, correlated sampling can be performed conveniently with the use of Copulas.

The fundamental intuition to build when working with copulas is that all continuous distributions can be remapped to a \(\text{Uniform}[0, 1]\) distribution with their Cumulative Distribution Function (CDF), and transformed back with the inverse CDF, also called the Percent Point Function (PPF). Furthermore, sampling from any distribution can be done by running uniform samples through its PPF (that is how random variate sampling is done in general).

With that trick in mind, we can make use of the multivariate Gaussian distribution and its CDF to easily generate correlated samples of any distributions we want, which is very useful when creating numerical simulation models.

Example

Generate Correlated Samples

Let’s say we have three parameters \(a\), \(b\) and \(c\) that we want to sample with some correlation coefficient \(\rho\). Before we worry about the specifics of those distributions, we can sample from a multivariate Gaussian with that correlation structure.

Code

# Number of samples
n = 2500

# Correlation matrix
rho = 0.80
corr = np.array([
    [1.0, rho, rho],
    [rho, 1.0, rho],
    [rho, rho, 1.0],
])

# Mean vector (zeros with standard mv normal)
mu = [0.0, 0.0, 0.0]

# Generate samples with dims (n, parameter)
samples = np.random.multivariate_normal(
    mean=mu,
    cov=corr,
    size=n
)

This generates 2500 samples from the standard multivariate Gaussian distribution with correlation \(\rho = 0.8\).

Code

df = pd.DataFrame({
    "a": samples[:, 0],
    "b": samples[:, 1],
    "c": samples[:, 2],
})

def create_pairplot(df: pd.DataFrame, color: str = "dodgerblue") -> alt.RepeatChart:
    return (
        alt.Chart(df)
        .mark_circle(
            color=color,
            opacity=0.25
        )
        .encode(
            alt.X(
                alt.repeat("column"), 
                type="quantitative", 
                scale=alt.Scale(zero=False)
            ),
            alt.Y(
                alt.repeat("row"), 
                type="quantitative", 
                scale=alt.Scale(zero=False)
            )
        )
        .properties(
            height=PLOT_HEIGHT / 3,
            width=180,
        )
        .repeat(
            row=list(df.columns),
            column=list(df.columns)
        )
    )

create_pairplot(df, color="lightcoral")

Code

df.corr()

	a	b	c
a	1.000000	0.797864	0.799834
b	0.797864	1.000000	0.800729
c	0.799834	0.800729	1.000000

Transform to Uniform

We can transform those samples to \(\text{Uniform} [0, 1]\) by simply passing them through the standard normal CDF.

Code

uniform_samples = stats.norm.cdf(df.values)
uniform_df = pd.DataFrame({
    "a": uniform_samples[:, 0],
    "b": uniform_samples[:, 1],
    "c": uniform_samples[:, 2],
})

create_pairplot(uniform_df, color="pink")

Now the sample are transformed to the uniform domain, but importantly their correlation structure has been preserved.

Code

uniform_df.corr()

	a	b	c
a	1.000000	0.785667	0.783459
b	0.785667	1.000000	0.785617
c	0.783459	0.785617	1.000000

Transform to Distributions

Now we have a bunch of uniformly distributed random samples that have our desired correlation structure. All that is left is to map the samples to our desired distributions using their respective PPFs.

Let’s define our distributions as:

\[ \begin{align} A &\sim \text{Normal}(500, 50) \\ B &\sim \text{Gamma}(2, 5) \\ C &\sim \text{Beta}(5, 8) \end{align} \]

Notice I’m not using normal distribtions exclusively, I can use the copula to map to whatever distributions I want. Note only that that the more skewed and kurtotic the distributions, the more warping will occur in their correlations out the other end. There are other types of Copula that can handle this better than the Gaussian Copula used here.

Code

dist_df = pd.DataFrame({
    "a": stats.norm(500, 25).ppf(uniform_df["a"]),
    "b": stats.gamma(a=2, scale=5, loc=0).ppf(uniform_df["b"]),
    "c": stats.beta(5, 8).ppf(uniform_df["c"])
})

create_pairplot(dist_df, color="orange")

Code

dist_df.corr()

	a	b	c
a	1.000000	0.750857	0.796169
b	0.750857	1.000000	0.766907
c	0.796169	0.766907	1.000000

Comparison with Uncorrelated Sampling

If we skip the Copula and just sample from our distributions, we get this:

Code

unc_dist_df = pd.DataFrame({
    "a": stats.norm(500, 25).rvs(size=n),
    "b": stats.gamma(a=2, scale=5, loc=0).rvs(size=n),
    "c": stats.beta(5, 8).rvs(size=n)
})

create_pairplot(unc_dist_df, color="lightblue")

Code

unc_dist_df.corr()

	a	b	c
a	1.000000	0.007764	0.007308
b	0.007764	1.000000	-0.002946
c	0.007308	-0.002946	1.000000

Effect on Monte Carlo Simulation

Let’s say that whatever these parameters represent, we want to know the result of this expression:

\[ y = ab^c \]

Code

unc_dist_df["y"] = unc_dist_df["a"] * unc_dist_df["b"] ** unc_dist_df["c"]
dist_df["y"] = dist_df["a"] * dist_df["b"] ** dist_df["c"]

unc_dist_df["Case"] = "Uncorrelated"
dist_df["Case"] = "Correlated"

combined_df = pd.concat([dist_df, unc_dist_df], ignore_index=True)

# Filter tail
upper_lim = np.percentile(combined_df["y"], q=99.5)
combined_df = combined_df[
    combined_df["y"] < upper_lim
]

chart = (
    alt.Chart(combined_df)
    .mark_bar(opacity=0.75)
    .encode(
        x=alt.X("y:Q", bin=alt.Bin(maxbins=50), axis=alt.Axis(title="y")),
        y=alt.Y(
            "count():Q", 
            axis=alt.Axis(labels=False, ticks=False, title=None),
            stack=False
        ),
        color=alt.Color(
            "Case:N", 
            scale=alt.Scale(
                domain=["Correlated", "Uncorrelated"], 
                range=["orange", "lightblue"]
            ),
            legend=alt.Legend(orient="top-right", title="")
        )
    )
    .properties(height=PLOT_HEIGHT, width=PLOT_WIDTH)
)


chart.show()

Code

percentiles = [10, 50, 90]

result_df = pd.DataFrame({
    "10th": np.round(np.percentile(dist_df["y"], q=percentiles), 3),
    "50th": np.round(np.percentile(unc_dist_df["y"], q=percentiles), 3)
}).T

# Fix column labels
result_df.columns = [f"P{p}" for p in percentiles]
result_df.index = ["Correlated", "Uncorrelated"]

result_df

	P10	P50	P90
Correlated	600.174	1115.298	2656.351
Uncorrelated	658.274	1052.506	1874.278

The correlated samples show a wider uncertainty range in the result than the uncorrelated samples, as is expected. This may be an important detail to capture depending on the sensitivity of the analysis. The effect of correlation can also be quite unintuitive, so it is always worth checking the effect it has on results.

Summary of Process

graph TD
    corr["Correlation Matrix"]
    mvn["Standard MvNorm(0, corr)"]

    corr --> mvn --> mvn_samples

    mvn_samples["MvNorm Samples [2500, 3]"]
    uniform["Uniform Samples [2500, 3]"]

    mvn_samples -->|Normal CDF| uniform

    dist_a["a"]
    dist_b["b"]
    dist_c["c"]

    dist_a_samples["a Samples [2500]"]
    dist_b_samples["b Samples [2500]"]
    dist_c_samples["c Samples [2500]"]

    dist_a --->|PPF| dist_a_samples
    dist_b --->|PPF| dist_b_samples
    dist_c --->|PPF| dist_c_samples

    uniform --> dist_a_samples
    uniform --> dist_b_samples
    uniform --> dist_c_samples