To make the processes outlined in the previous two posts more concrete, the following is the story of a young couple trying to manage a household budget. I’ll use an aggregated zero-inflated model of spending to simulate monthly expenses.
Introduction
Consider the household of Amy and Brett, a young couple who want to better plan their monthly budget.
Amy and Brett have a pattern of discretionary spending outlined as follows:
Amy
Amy is a freelancer and goes to work on 85% of days, where she spends no money
On days she is not working, she spends anywhere between $20 and $150
\[
A \sim
\begin{cases}
0, & \text{with probability } 0.15 \\
C_A \sim \text{PERT}(20, 50, 150), & \text{with probability } 0.85
\end{cases}
\]
Brett
Brett is unemployed, and on a given day has a 70% chance of spending
When he spends, usually gambling, he makes a real impact, spending anywhere between $20 on a cheeky bet to $1500 when “we’re on here…”
On days Amy is home, Brett tends not to gamble, but will still place bets while on the loo, out of Amy’s sight
The amount Brett bets depends on Amy’s spending… if she is leaving money on the table, he tends to use it, but if she’s making big purchases he keeps his bets smaller.
\[
B \sim
\begin{cases}
0, & \text{with probability } 0.3 \\
C_B \sim \text{PERT}(20, 20, 1500), & \text{with probability } 0.7
\end{cases}
\]
The cost of household essentials is \(\$7500\) a month. So the total monthly expenses \(C_{\text{total}}\) can be given by:
\[
C_{\text{total}} = 7500 + (30(A + B))
\]
The Model
Broadly speaking the model has two parts – days that Amy and Brett spend, and on those days, how much each of them spend.
Spending Days
First I’ll simulate whether or not Amy and Brett spend money on a given day, and by extension their spending days in each month:
Code
# Simulate a month of spending days (True or False) for Amy and Brettdays =30n =2500p_a =0.15p_b =0.7# Negative correlation between Amy and Brett spending daysrho =-0.95spend_corr_matrix = np.array([ [1.0, rho], [rho, 1.0],])# Use a Copula to generate correlated spending days for the monthuniform_simdata = stats.norm.cdf( np.random.multivariate_normal( mean=[0, 0], cov=spend_corr_matrix, size=(n, days) ))# Map to binary outcomes (True = spend, False = no spend)spending_days = np.where( uniform_simdata < [p_a, p_b],True,False)df = pd.DataFrame({"Amy": np.sum(spending_days[:, :, 0], axis=1),"Brett": np.sum(spending_days[:, :, 1], axis=1),})flat_df = df.melt(var_name="Person", value_name="Spending Days")def plot_disrete_histogram(df, col: str, color: str="dodgerblue") -> alt.Chart: tick_values =list(range(0, days +1))return ( alt.Chart(df) .mark_bar(color=color, opacity=0.5) .encode( alt.X(f"{col}:Q", bin=alt.Bin(step=1), scale=alt.Scale(domain=[0, days]), axis=alt.Axis(values=tick_values, format=".0f"), ), alt.Y("count():Q", stack=None, title='', axis=alt.Axis( labels=False, ticks=False, grid=False, domain=False ) ), alt.Color("Person:N", legend=alt.Legend(orient="top-right", title="") ), ) .properties( title="Number of Days Money Spent in Month", height=PLOT_HEIGHT, width=PLOT_WIDTH ) )plot_disrete_histogram(flat_df, "Spending Days").show()
Since Amy is often at work, she spends money on less days during the month. Brett, on the other hand is at home, in his element, making legendary bets.
It’s not obvious from the plot above, but Brett is less likely to gamble on days that Amy is at home. A 2D plot can show this better:
Code
tick_values =list(range(0, days +1, 2))chart = ( alt.Chart(df) .mark_rect() .encode( alt.X("Amy:Q", bin=alt.Bin(maxbins=days //2), scale=alt.Scale(domain=[0, days]), axis=alt.Axis(values=tick_values) ), alt.Y("Brett:Q", bin=alt.Bin(maxbins=days //2), scale=alt.Scale(domain=[0, days]), axis=alt.Axis(values=tick_values) ), alt.Color("count():Q", scale=alt.Scale(scheme="oranges"), legend=None ) ) .properties( title="Amy vs Brett Spending Days in Month", height=PLOT_HEIGHT, width=PLOT_WIDTH ))chart.show()
There is a negative trend between Amy’s days at home and Brett’s spending. This is because on months where Amy isn’t getting as much work, she is at home and Brett has less opportunities to gamble. However, on months where she is working a lot, Brett is at home alone, turning the place into his personal casino.
Amount Spent
Next, I’ll simulate how much Amy and Brett each spend on a given day. I’ll use the PERT distribution, as it is a convenient and flexible choice for subjective models like this.
Code
class PERT:def__init__(self, minimum: float, mode: float, maximum: float):self.minimum = minimumself.mode = modeself.maximum = maximumself.alpha = ( (4* (self.mode -self.minimum) / (self.maximum -self.minimum)) +1 )self.beta = ( (4* (self.maximum -self.mode) / (self.maximum -self.minimum)) +1 )# Beta distribution on [0,1]self.beta_dist = stats.beta(self.alpha, self.beta)def pdf(self, x):""" Calculate the PDF of the PERT distribution at x. """if x <self.minimum or x >self.maximum:return0.0# Scale x to [0,1] x_scaled = (x -self.minimum) / (self.maximum -self.minimum)returnself.beta_dist.pdf(x_scaled) / (self.maximum -self.minimum)def rvs(self, shape: int|tuple[int] =1) -> np.ndarray:""" Generate random variates of given size from the PERT distribution. """ samples_scaled =self.beta_dist.rvs(size=shape)# Scale samples back to [minimum, maximum]returnself.minimum + samples_scaled * (self.maximum -self.minimum)def ppf(self, q):""" Percent-point function (inverse CDF) at q of the PERT distribution. """ x_scaled =self.beta_dist.ppf(q)returnself.minimum + x_scaled * (self.maximum -self.minimum)def plot(self, color: str="dodgerblue", title: str="PERT PDF", height=PLOT_HEIGHT, width=PLOT_WIDTH ) -> alt.Chart: x_vals = np.linspace(self.minimum, self.maximum, 200) pdf_vals = [self.pdf(x) for x in x_vals] df = pd.DataFrame({'x': x_vals,'pdf': pdf_vals }) chart = ( alt.Chart(df) .mark_area(line={"color": color}, color=color, opacity=0.5) .encode( x=alt.X('x', title='$'), y=alt.Y('pdf', title='', axis=alt.Axis( labels=False, ticks=False, grid=False, domain=False ) ) ) .properties( width=width, height=height, title=title ) )return chart
Amy
\[
C_A \sim \text{PERT}(20, 50, 150)
\]
Amy is not a big spender. She is usually at work, but when she has a day off she never spends more than $150 during the day, and usually hovers around the $50 mark.
Brett’s habits are a lot more volatile. He is always at home, and overall much more likely to spend. When he does spend, it is anything between a small $20 bet to significant $1500 wagers.
Using the spending day simulation in combination with correlated sampling of the two spending distributions, I can estimate the monthly spending of each member of the household:
Code
# Negative correlation between Amy and Brett spendingrho =-0.95spend_corr_matrix = np.array([ [1.0, rho], [rho, 1.0],])# Use a Copula to generate correlated spending days for the monthuniform_simdata = stats.norm.cdf( np.random.multivariate_normal( mean=[0, 0], cov=spend_corr_matrix, size=(n, days) ))simdata_a = spend_a.ppf(uniform_simdata[:, :, 0])simdata_b = spend_b.ppf(uniform_simdata[:, :, 1])# Mask with array of spending dayssimdata_a = np.where( spending_days[:, :, 0], simdata_a,0)simdata_b = np.where( spending_days[:, :, 1], simdata_b,0)df = pd.DataFrame({"Amy": np.sum(simdata_a, axis=1),"Brett": np.sum(simdata_b, axis=1),})flat_df = df.melt(var_name="Person", value_name="Spending")df["TOTAL"] = df["Amy"] + df["Brett"]
Plotting to show Amy and Brett’s spending separately paints a pretty clear picture:
This tells us that the household has a 31.28% chance of losing money each month, and that Amy should probably get a divorce.
Summary
This is a silly example, but hopefully it shows that with a fairly basic toolkit you can learn to simulate a wide range of problems. Statistics and probability are domain agnostic, so they are very useful “glue” skills to develop, compared to narrower skillsets.