import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pymc as pm
import arviz as az
from scipy.stats import norm
%config InlineBackend.figure_format = 'retina'
Independent Probabilistic A/B calls
Testing probabilistic modeling of E1 into A/B calls on low-res data with independent bins
Probabilistic Modeling of A/B Compartments from E1
This model infers A/B chromatin compartment structure from the first eigenvector (E1) of the Hi-C contact matrix using a Bayesian framework.
Goals
- Replace hard sign-thresholding of E1 with a probabilistic compartment assignment.
- Quantify uncertainty in compartment calls for each genomic bin.
- Allow for modeling of missing E1 values due to low signal.
Assumptions
- Each bin belongs to one of two latent states: A or B compartment.
- E1 values are generated from Gaussian distributions specific to each compartment.
- Prior belief: A has positive E1 values (μ ≈ 0.5), B has negative (μ ≈ -0.5).
- Bins are initially treated as independent
- Spatial depency can be added later (HMM)
The data
# Input: csv file with only the E1 columns
# Low res example
= 100000
resolution
= pd.Series(pd.read_csv(f"../data/eigs/fibroblast.eigs.{resolution}.cis.vecs.tsv", sep="\t")['E1'].values.flatten())
y = np.arange(0, y.shape[0])
bins = pd.Series(bins*resolution)
x
# Make a DataFrame object
= pd.DataFrame({"bin": bins,
df "start": x,
"e1": y})
"bin", inplace=True)
df.set_index(
=True)
df.dropna(inplace
display(df)
start | e1 | |
---|---|---|
bin | ||
24 | 2400000 | 0.922314 |
25 | 2500000 | 0.932528 |
26 | 2600000 | 0.756230 |
27 | 2700000 | 0.816144 |
28 | 2800000 | 1.082757 |
... | ... | ... |
1617 | 161700000 | 0.006918 |
1618 | 161800000 | 0.007641 |
1619 | 161900000 | -0.026140 |
1620 | 162000000 | -0.036445 |
1621 | 162100000 | 0.262607 |
1460 rows × 2 columns
# Plot the data as a track (stairs for niceity)
= plt.subplots(figsize=(15, 3))
fig, ax
= np.zeros(2*y.shape[0])
x_stair = np.zeros(2*y.shape[0])
y_stair 0::2] = x
x_stair[1::2] = x + resolution
x_stair[0::2] = y
y_stair[1::2] = y
y_stair[
=(y_stair<0), color="tab:blue", ec='None')
ax.fill_between(x_stair, y_stair, where=(y_stair>0), color="tab:red", ec='None')
ax.fill_between(x_stair, y_stair, where
0, df["start"].max()+resolution)
ax.set_xlim(= np.linspace(0, df["start"].max()+resolution, num=10)
ticks
ax.set_xticks(ticks)round(ticks / 1e6, 1), rotation=45)
ax.set_xticklabels(np."Genomic Position (Mbp)")
ax.set_xlabel("E1")
ax.set_ylabel("E1"); ax.set_title(
# Histogram of distribution of E1 values (A and B)
= plt.subplots(figsize=(13, 3))
f, ax
# A compartments
>0], bins=100, color="tab:red", alpha=0.5, label="A", density=True)
ax.hist(y[y# B compartments
<0], bins=100, color="tab:blue", alpha=0.5, label="B", density=True)
ax.hist(y[y# Mean values (vline)
>0].mean(), color="tab:red", linestyle="--", label="Mean A")
ax.axvline(y[y<0].mean(), color="tab:blue", linestyle="--", label="Mean B")
ax.axvline(y[y
# Plot the normal distributions
= np.linspace(y.min(), y.max(), 1000)
x_values = y[y>0].median()
mean_a = y[y<0].median()
mean_b = y[y>0].std()
std_a = y[y<0].std()
std_b
=mean_a, scale=std_a), color="tab:red", label="Normal A")
ax.plot(x_values, norm.pdf(x_values, loc=mean_b, scale=std_b), color="tab:blue", label="Normal B")
ax.plot(x_values, norm.pdf(x_values, loc= norm.pdf(x_values, loc=mean_a, scale=std_a) + norm.pdf(x_values, loc=mean_b, scale=std_b)
stacked ="tab:purple", label="Stacked Normals")
ax.plot(x_values, stacked, color
"E1")
plt.xlabel("Count")
plt.ylabel(='upper left')
plt.legend(loc plt.tight_layout()
Model Specification
Before specifying the model, we need to layout the model structure.
The model is a mixture of two Gaussian distributions, one for each compartment (A and B). Then a latent variable indicates the compartment assignment for each bin. More formally;
Description
We aim to model the A/B chromatin compartments using a probabilistic generative approach, rather than relying on deterministic eigenvector sign thresholding. Each genomic bin is modeled as belonging to one of two latent states (A or B), with corresponding expected E1 values.
Notation
Let:
- \(y_i\): Observed E1 value at bin \(i\)
- \(z_i \in \{0, 1\}\): Latent compartment assignment (0 = B, 1 = A)
- \(\mu_A, \mu_B\): Mean E1 values for compartments A and B
- \(\sigma\): Shared standard deviation across bins
- \(p\): Prior probability of a bin belonging to compartment A
Model variables
\[ \begin{aligned} \mu_A &\sim \mathcal{N}(0.5, 0.5) \\ \mu_B &\sim \mathcal{N}(-0.5, 0.5) \\ \sigma &\sim \text{HalfNormal}(1) \\ p &\sim \text{Beta}(a, b) \quad \text{(e.g., } a = 2, b = 2 \text{ or something bimodal)} \\ z_i &\sim \text{Bernoulli}(p) \\ \mu_i &= z_i \cdot \mu_A + (1 - z_i) \cdot \mu_B \\ y_i &\sim \mathcal{N}(\mu_i, \sigma) \\ \end{aligned} \]
This model allows us to infer both the compartment state and the uncertainty of the assignment, especially around boundaries between A and B compartments.
The Priors
- \(\mu_A\) and \(\mu_B\): We assume that the mean E1 values for compartments A and B are normally distributed around 0.5 and -0.5, respectively, with a standard deviation of 0.5. This reflects our prior belief about the expected E1 values for each compartment.
- \(\sigma\): The standard deviation is modeled using a half-normal distribution, which ensures that it is always positive. This allows for flexibility in modeling the variability of E1 values within each compartment.
- \(p\): The prior probability of a bin belonging to compartment A is modeled using a Beta distribution. This allows us to incorporate prior beliefs about the expected proportion of bins in each compartment. The parameters \(a\) and \(b\) can be adjusted to reflect different prior beliefs (e.g., bimodal distributions).
- \(z_i\): The latent variable indicating the compartment assignment for each bin is modeled using a Bernoulli distribution, with the probability of being in compartment A given by \(p\).
- \(y_i\): The observed E1 values are modeled as normally distributed around the mean E1 value for the assigned compartment, with a shared standard deviation \(\sigma\). This generative model allows us to infer the compartment assignments and their associated uncertainties based on the observed E1 values, while also accounting for the variability in the data.
Let’s look at a few different options for the Beta distribution. There are 3 main options:
- Uniform: \(a = 1, b = 1\) (no prior preference for A or B)
- Bimodal: \(a = 0.5, b = 0.5\) (most observations are certain to belong to A or B, but a few observations are hard to categorize—borders)
- Unimodal: \(a = 2, b = 2\) (most obsevations are hard to categorize)
Here, they are visualized:
# First, have a look at different Beta distributions for the prior on the latent state (z)
= plt.subplots()
fig, ax
= [(0.5, 0.5), (2, 2), (1, 1)]
params = ["tab:blue", "tab:orange", "tab:green"]
colors
for (a, b),col in zip(params, colors):
with pm.Model() as model:
= pm.Beta("p", alpha=a, beta=b)
p = pm.sample_prior_predictive(samples=10000)
prior "p"].values, label=f"Beta({a}, {b})", ax=ax, fill_kwargs={"alpha": 0.2, "color":col})
az.plot_kde(prior.prior[
"Comparison of Beta priors for p")
ax.set_title("p")
ax.set_xlabel("Density")
ax.set_ylabel(
ax.legend()
plt.show()
Sampling: [p]
Sampling: [p]
Sampling: [p]
~~I find that the most useful prior is the bimodal one, as we have a strong prior belief that most bins are either A or B, but a few bins are hard to categorize. This is especially useful for the edges of compartments, where the E1 values are close to 0. The unimodal prior is not very useful, as it assumes that most bins are hard to categorize, which is not the case in our data. The uniform prior could be useful as the most uninformative prior —the least biased.*~~
Above logic is flawed. This is not what the prior is reflecting. A Beta(0.5, 0.5) prior assumes that a data set is most likely consisting of only A or B compartments. Therefore, we need Beta(2,2) to reflect that compartments are expected to of both A and B.
= plt.subplots((2))
fig, ax
#params = [(0.5, 0.5), (2, 2), (1, 1)]
= [(1,0), (3,0), (5,0)]
params = ["tab:blue", "tab:orange", "tab:green"]
colors
for (a, b),col in zip(params, colors):
with pm.Model() as ylik_priors:
= pm.HalfNormal("nu_norm", sigma=a)
nu_norm = pm.Exponential("nu_exp", lam=a)
nu_exp = pm.sample_prior_predictive(samples=10000)
prior 0].set_title("Comparison of HalfNormal priors for nu")
ax["nu_norm"].values, label=f"HalfN(sigma={a})", ax=ax[0], fill_kwargs={"alpha": 0.2, "color":col})
az.plot_kde(prior.prior[1].set_title("Comparison of Exponential priors for nu")
ax["nu_exp"].values, label=f"Exp(lam={a})", ax=ax[1], fill_kwargs={"alpha": 0.2, "color":col})
az.plot_kde(prior.prior[
plt.tight_layout()
Sampling: [nu_exp, nu_norm]
Sampling: [nu_exp, nu_norm]
Sampling: [nu_exp, nu_norm]
Model in PyMC
= {"bins": df.index.values,
coords "pos": df.start.values}
with pm.Model(coords=coords) as model:
# Let the model be aware of the data
## Cannot be done with missing values or masked arrays
# e1 = pm.Data("e1", df.e1.values, dims="bins")
# Two means for A and B compartments
= pm.Normal("mu_a", 0.5, 0.3)
mu_a = pm.Normal("mu_b", -0.5, 0.3)
mu_b
# Standard deviation for the noise (only positive values HalfStudent)
= pm.HalfNormal("sigma", 0.3)
sigma
# Latent state for each bin: 1 = A, 0 = B
= pm.Beta("p", 2, 2, dims="bins")
p
# Latent state variable z, indicating whether a bin is A or B
= pm.Bernoulli("z", p=p, dims="bins")
z
# per-bin expected mean
= pm.math.switch(z, mu_a, mu_b)
mu
# Likelihood estimate
= pm.Normal("y_lik", mu=mu, sigma=sigma, observed=df.e1.values, dims="bins") y_lik
Let’s overwrite the model with an alternative that does not try to infer the actual state (Bernoulli part), but only draws the mean value from a Beta dist. Then, the model draw the E1 values from a mixture of mu_a
and mu_b
with the weights given by the probabiliy w
(w
and 1-w
). This could be more accurate, and also it will be much faster to sample from, as we do not need to sample the latent variable z_i
for each bin which is using a really slow BinaryMetropolisHastings sampler.
with pm.Model(coords={"pos": df.index.values}) as model:
# mu_a = pm.Normal("mu_a", 0.5, 0.3)
# mu_b = pm.Normal("mu_b", -0.5, 0.3)
# sigma = pm.HalfNormal("sigma", 0.3)
# New: Make ordered parameters for mu_a and mu_b in stead
= pm.Normal("mu", mu=[-0.5, 0.5], sigma=0.3,
mu =pm.distributions.transforms.ordered, # IMPORTANT
transform=2,
shape
)= pm.HalfNormal("sigma", 0.3, shape=2)
sigma
= pm.Beta("w", 1, 1)
w
# components = pm.Normal.dist(mu = pm.math.stack([mu_a, mu_b]), sigma = sigma, shape=(2,))
# New: Use the ordered mu parameters (cleaner code as well)
= pm.StudentT.dist(nu=10, mu=mu, sigma=sigma, shape=2)
components
# Likelihood estimate
= pm.Mixture("y_lik", w=[w,1-w], comp_dists=components, observed=df.e1.values) y_lik
Right, let’s rewrite it again, so that the model uses a switch again, but this time based on the latent probability og being in A.
= {"bins": df.index.values,
coords "pos": df.start.values}
with pm.Model(coords=coords) as model_inactive:
# Let the model be aware of the data
## Cannot be done with missing values or masked arrays
#e1 = pm.Data("e1", df.e1.dropna().values, dims="bins")
# Two means for A and B compartments
= pm.Normal("mu_a", 0.5, 0.3)
mu_a = pm.Normal("mu_b", -0.5, 0.3)
mu_b
# Standard deviation for the noise (only positive values HalfStudent)
= pm.HalfNormal("sigma", 0.3)
sigma
# Latent state for each bin: 1 = A, 0 = B
= pm.Beta("p", 2, 2, dims="bins")
p
# per-bin expected mean
= pm.math.switch(p > 0.5, mu_a, mu_b)
mu
# Likelihood estimate
= pm.Normal("y_lik", mu=mu, sigma=sigma, observed=df.e1.values, dims="bins") y_lik
Model graphs
Test the priors
with model:
# Sample from the prior
= pm.sample_prior_predictive(samples=1000)
trace
trace
Sampling: [mu, sigma, w, y_lik]
-
<xarray.Dataset> Size: 48kB Dimensions: (chain: 1, draw: 1000, sigma_dim_0: 2, mu_dim_0: 2) Coordinates: * chain (chain) int64 8B 0 * draw (draw) int64 8kB 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999 * sigma_dim_0 (sigma_dim_0) int64 16B 0 1 * mu_dim_0 (mu_dim_0) int64 16B 0 1 Data variables: w (chain, draw) float64 8kB 0.7316 0.09718 ... 0.8654 0.5234 sigma (chain, draw, sigma_dim_0) float64 16kB 0.1871 ... 0.2639 mu (chain, draw, mu_dim_0) float64 16kB -0.6882 ... -0.03438 Attributes: created_at: 2025-07-03T08:51:20.930468+00:00 arviz_version: 0.21.0 inference_library: pymc inference_library_version: 5.22.0
-
<xarray.Dataset> Size: 12MB Dimensions: (chain: 1, draw: 1000, y_lik_dim_0: 1460) Coordinates: * chain (chain) int64 8B 0 * draw (draw) int64 8kB 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999 * y_lik_dim_0 (y_lik_dim_0) int64 12kB 0 1 2 3 4 ... 1455 1456 1457 1458 1459 Data variables: y_lik (chain, draw, y_lik_dim_0) float64 12MB -0.6862 ... -1.436 Attributes: created_at: 2025-07-03T08:51:20.935245+00:00 arviz_version: 0.21.0 inference_library: pymc inference_library_version: 5.22.0
-
<xarray.Dataset> Size: 23kB Dimensions: (y_lik_dim_0: 1460) Coordinates: * y_lik_dim_0 (y_lik_dim_0) int64 12kB 0 1 2 3 4 ... 1455 1456 1457 1458 1459 Data variables: y_lik (y_lik_dim_0) float64 12kB 0.9223 0.9325 ... -0.03644 0.2626 Attributes: created_at: 2025-07-03T08:51:20.936718+00:00 arviz_version: 0.21.0 inference_library: pymc inference_library_version: 5.22.0
= az.extract(trace, group="prior_predictive", var_names=["y_lik"])
prior_predictive_e1
print(f"""
y_lik (min, max, std): {prior_predictive_e1.values.min(), prior_predictive_e1.values.max(), prior_predictive_e1.values.std()}
""")
#y_lik_norm (min, max, std): {prior_predictive_e1.y_lik_norm.values.min(), prior_predictive_e1.y_lik_norm.values.max(), prior_predictive_e1.y_lik_norm.values.std()}
y_lik (min, max, std): (np.float64(-6.2835348565157405), np.float64(7.055839972817577), np.float64(0.6783511312136824))
= df['e1'].dropna().values
e1_obs
az.plot_dist(
e1_obs,="hist",
kind="C1",
color={"alpha": 0.6, "bins": 25},
hist_kwargs="observed"
label
)
= trace.prior_predictive["y_lik"].stack(sample=("chain", "draw")).values
prior_pred_obs = prior_pred_obs[prior_pred_obs > e1_obs.min()] # Constrain the range to observed values
prior_pred_obs = prior_pred_obs[prior_pred_obs < e1_obs.max()] # Constrain the range to observed values
prior_pred_obs
az.plot_dist(
prior_pred_obs,="hist",
kind={"alpha": 0.6, "bins":25},
hist_kwargs="simulated",
label
)=90);
plt.xticks(rotation
This was quite involved to get to plot, as the prior predictions take values between -Inf and Inf. It could mean that the priors are not super robust to the random variables defining the StudentT draws from y_lik
: If the nu
or sigma
draws are too small, the T-dist could become numerically unstable and produce Inf values.
With y_lik adhering to a Normal distribution, there are no infinites, and the range is more narrow. That must mean something to the predicitions as well, but I’m not quite sure about the effect.
Posterior sampling
Here, we both create a trace (by sampling) and then sample from the predictive posterior.
with model:
# Sample from the posterior
= pm.sample(2000, tune=2000, cores=4, chains=4, target_accept=0.9,
trace # step = [
# pm.NUTS([mu_a, mu_b, sigma, p], target_accept=0.9),
# pm.BinaryGibbsMetropolis([z], transit_p=0.8)],
# nuts_sampler='nutpie'
)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [mu, sigma, w]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 12 seconds.
with model:
=True) pm.sample_posterior_predictive(trace, extend_inferencedata
Sampling: [y_lik]
with model:
= pm.sample_prior_predictive(samples=1000)
prior_pred trace.extend(prior_pred)
Sampling: [mu, sigma, w, y_lik]
with model:
=True) pm.compute_log_likelihood(trace, extend_inferencedata
Evaluation
We now have some variables that are fitted and all have a posterior distribution.
Which ones are interesting? Well, the p
represents the probability \(p \in [0;1]\) for a bin to be in compartment A (remember, A is 1, and B is 0). z
is the actual compartment assignment \(z \in \{0, 1\}\), where \(z \sim \mathcal{Bernoulli}(p) = \mathcal{B}(1,p)\). y
is the observed E1 value drawn from a Normal distribution with the fitted mu and sigma.
trace
-
<xarray.Dataset> Size: 336kB Dimensions: (chain: 4, draw: 2000, mu_dim_0: 2, sigma_dim_0: 2) Coordinates: * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 16kB 0 1 2 3 4 5 ... 1994 1995 1996 1997 1998 1999 * mu_dim_0 (mu_dim_0) int64 16B 0 1 * sigma_dim_0 (sigma_dim_0) int64 16B 0 1 Data variables: mu (chain, draw, mu_dim_0) float64 128kB -0.4062 0.4609 ... 0.4697 sigma (chain, draw, sigma_dim_0) float64 128kB 0.2515 ... 0.3594 w (chain, draw) float64 64kB 0.5492 0.547 ... 0.5485 0.5599 Attributes: created_at: 2025-07-03T08:51:40.365650+00:00 arviz_version: 0.21.0 inference_library: pymc inference_library_version: 5.22.0 sampling_time: 12.236668586730957 tuning_steps: 2000
-
<xarray.Dataset> Size: 93MB Dimensions: (chain: 4, draw: 2000, y_lik_dim_0: 1460) Coordinates: * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 16kB 0 1 2 3 4 5 ... 1994 1995 1996 1997 1998 1999 * y_lik_dim_0 (y_lik_dim_0) int64 12kB 0 1 2 3 4 ... 1455 1456 1457 1458 1459 Data variables: y_lik (chain, draw, y_lik_dim_0) float64 93MB -0.02907 ... -0.5131 Attributes: created_at: 2025-07-03T08:52:23.988254+00:00 arviz_version: 0.21.0 inference_library: pymc inference_library_version: 5.22.0
-
<xarray.Dataset> Size: 93MB Dimensions: (chain: 4, draw: 2000, y_lik_dim_0: 1460) Coordinates: * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 16kB 0 1 2 3 4 5 ... 1994 1995 1996 1997 1998 1999 * y_lik_dim_0 (y_lik_dim_0) int64 12kB 0 1 2 3 4 ... 1455 1456 1457 1458 1459 Data variables: y_lik (chain, draw, y_lik_dim_0) float64 93MB -1.507 ... -0.8269 Attributes: created_at: 2025-07-03T08:52:31.319747+00:00 arviz_version: 0.21.0 inference_library: pymc inference_library_version: 5.22.0
-
<xarray.Dataset> Size: 992kB Dimensions: (chain: 4, draw: 2000) Coordinates: * chain (chain) int64 32B 0 1 2 3 * draw (draw) int64 16kB 0 1 2 3 4 ... 1996 1997 1998 1999 Data variables: (12/17) step_size (chain, draw) float64 64kB 0.3518 0.3518 ... 0.4007 perf_counter_start (chain, draw) float64 64kB 7.867e+06 ... 7.867e+06 reached_max_treedepth (chain, draw) bool 8kB False False ... False False energy (chain, draw) float64 64kB 1.086e+03 ... 1.084e+03 process_time_diff (chain, draw) float64 64kB 0.002204 ... 0.002185 lp (chain, draw) float64 64kB -1.085e+03 ... -1.083e+03 ... ... perf_counter_diff (chain, draw) float64 64kB 0.002224 ... 0.002197 smallest_eigval (chain, draw) float64 64kB nan nan nan ... nan nan index_in_trajectory (chain, draw) int64 64kB -1 2 -5 -1 -4 ... 6 -3 -3 -7 step_size_bar (chain, draw) float64 64kB 0.4114 0.4114 ... 0.4229 acceptance_rate (chain, draw) float64 64kB 0.9918 0.8169 ... 0.9845 largest_eigval (chain, draw) float64 64kB nan nan nan ... nan nan Attributes: created_at: 2025-07-03T08:51:40.386804+00:00 arviz_version: 0.21.0 inference_library: pymc inference_library_version: 5.22.0 sampling_time: 12.236668586730957 tuning_steps: 2000
-
<xarray.Dataset> Size: 48kB Dimensions: (chain: 1, draw: 1000, sigma_dim_0: 2, mu_dim_0: 2) Coordinates: * chain (chain) int64 8B 0 * draw (draw) int64 8kB 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999 * sigma_dim_0 (sigma_dim_0) int64 16B 0 1 * mu_dim_0 (mu_dim_0) int64 16B 0 1 Data variables: w (chain, draw) float64 8kB 0.05413 0.7738 ... 0.0997 0.8081 sigma (chain, draw, sigma_dim_0) float64 16kB 0.308 0.301 ... 0.493 mu (chain, draw, mu_dim_0) float64 16kB -0.3437 0.5381 ... 0.3318 Attributes: created_at: 2025-07-03T08:52:29.829557+00:00 arviz_version: 0.21.0 inference_library: pymc inference_library_version: 5.22.0
-
<xarray.Dataset> Size: 12MB Dimensions: (chain: 1, draw: 1000, y_lik_dim_0: 1460) Coordinates: * chain (chain) int64 8B 0 * draw (draw) int64 8kB 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999 * y_lik_dim_0 (y_lik_dim_0) int64 12kB 0 1 2 3 4 ... 1455 1456 1457 1458 1459 Data variables: y_lik (chain, draw, y_lik_dim_0) float64 12MB 0.6404 ... -0.4974 Attributes: created_at: 2025-07-03T08:52:29.832465+00:00 arviz_version: 0.21.0 inference_library: pymc inference_library_version: 5.22.0
-
<xarray.Dataset> Size: 23kB Dimensions: (y_lik_dim_0: 1460) Coordinates: * y_lik_dim_0 (y_lik_dim_0) int64 12kB 0 1 2 3 4 ... 1455 1456 1457 1458 1459 Data variables: y_lik (y_lik_dim_0) float64 12kB 0.9223 0.9325 ... -0.03644 0.2626 Attributes: created_at: 2025-07-03T08:51:40.391266+00:00 arviz_version: 0.21.0 inference_library: pymc inference_library_version: 5.22.0
Let’s do some stuff on the log-likelihood of the posterior. We can use arviz to generate Leave-One-Out (LOO) cross-validation
= az.loo(trace)
model_loo model_loo
Computed from 8000 posterior samples and 1460 observations log-likelihood matrix.
Estimate SE
elpd_loo -1084.72 24.90
p_loo 4.77 -
------
Pareto k diagnostic values:
Count Pct.
(-Inf, 0.70] (good) 1460 100.0%
(0.70, 1] (bad) 0 0.0%
(1, Inf) (very bad) 0 0.0%
=500, group="prior", data_pairs={"y_lik": "y_lik"});
az.plot_ppc(trace, num_pp_samples="upper right")
plt.legend(loc0,10);
plt.ylim(=500, group="posterior");
az.plot_ppc(trace, num_pp_samples="upper right")
plt.legend(loc plt.tight_layout()
Slight deviation near 1 in the posterior predictive distribution suggests possible additional states, but the model otherwise captures the main features of the E1 distribution.
= az.summary(trace, var_names=['y_lik'], kind='stats', group='posterior_predictive') trace_ylik
trace_ylik
mean | sd | hdi_3% | hdi_97% | |
---|---|---|---|---|
y_lik[0] | -0.017 | 0.545 | -0.900 | 0.983 |
y_lik[1] | -0.006 | 0.556 | -0.867 | 1.049 |
y_lik[2] | -0.016 | 0.549 | -0.853 | 1.030 |
y_lik[3] | -0.017 | 0.554 | -0.908 | 1.014 |
y_lik[4] | -0.010 | 0.551 | -0.902 | 1.019 |
... | ... | ... | ... | ... |
y_lik[1455] | -0.007 | 0.549 | -0.857 | 1.017 |
y_lik[1456] | -0.005 | 0.547 | -0.847 | 1.034 |
y_lik[1457] | -0.011 | 0.550 | -0.869 | 1.010 |
y_lik[1458] | -0.017 | 0.550 | -0.895 | 0.999 |
y_lik[1459] | -0.008 | 0.548 | -0.845 | 1.045 |
1460 rows × 4 columns
print(pd.DataFrame({
"mean": [trace_ylik['mean'].mean(), trace_ylik['mean'].std()],
"sd": [trace_ylik['sd'].mean(), trace_ylik['sd'].std()]
'means', 'sds'])).round(4))
}).set_index(pd.Index([
= plt.subplots(1,2, figsize=(10, 5))
f,ax = ax.flatten()
ax 0].set_title("Mean of y_lik estimate")
ax[0].hist(trace_ylik['mean'], bins=300, color="tab:grey", alpha=0.5)
ax[#ax[0].set_xlim(-0.05,0.05) # zoom-in on the near-zero peak
1].set_title("SD of y_lik")
ax[1].hist(trace_ylik['sd'], bins=30, color="tab:grey", alpha=0.5)
ax[ plt.tight_layout()
mean sd
means -0.0134 0.5497
sds 0.0063 0.0040
Why does it seem to contain a lot of means on zero? Let’s find out how many
'mean'] == 0 ).sum() (trace_ylik[
np.int64(13)
Well, that was not very informative, but it looks like the peak is very close to the mean (-0.0149), and not zero (upon closer inspection).
Let’s look at the actual values from the trace.
= plt.subplots(figsize=(12,4))
f,ax 'start'], trace_ylik['mean'], lw=0.3, label="prediction confidence")
ax.plot(df['start'], trace_ylik['hdi_97%'], trace_ylik['hdi_3%'], color="tab:grey", alpha=0.3, label="95% HDI")
ax.fill_between(df[ plt.tight_layout()
That looks promising, but we have to look at the predictive samples. Let’s look at the posterior predictive samples to see how well the model fits the data.
=("chain", "draw")) trace.posterior_predictive.stack(sample
<xarray.Dataset> Size: 94MB Dimensions: (y_lik_dim_0: 1460, sample: 8000) Coordinates: * y_lik_dim_0 (y_lik_dim_0) int64 12kB 0 1 2 3 4 ... 1455 1456 1457 1458 1459 * sample (sample) object 64kB MultiIndex * chain (sample) int64 64kB 0 0 0 0 0 0 0 0 0 0 ... 3 3 3 3 3 3 3 3 3 3 * draw (sample) int64 64kB 0 1 2 3 4 5 ... 1995 1996 1997 1998 1999 Data variables: y_lik (y_lik_dim_0, sample) float64 93MB -0.02907 0.4569 ... -0.5131 Attributes: created_at: 2025-07-03T08:52:23.988254+00:00 arviz_version: 0.21.0 inference_library: pymc inference_library_version: 5.22.0
# p_B = trace.posterior["z"].mean(dim=["chain", "draw"]).values
= trace.posterior_predictive["y_lik"]
y_ppc
= y_ppc.stack(sample=("chain", "draw"))
y_ppc_flat
= y_ppc_flat.mean(dim="sample").values # mean of predictions per bin
ppc_mean = y_ppc_flat.std(dim="sample").values # predictive std (uncertainty)
ppc_std = az.hdi(y_ppc, hdi_prob=0.95) # 95% HDI using ArviZ ppc_hdi
Posterior predictive checks
How does the posterior predictive look? We can visualize the distribution of predicted E1s based on the posterior samples. Here, the mean and sd of the posterior predictive for each bin are visualized in histograms.
= plt.subplots(1,2, figsize=(10, 3))
f,ax0].set_title("Posterior Predictive Mean of y_lik")
ax[0].set_ylabel("Counts")
ax[0].hist(ppc_mean, bins=40, alpha=0.7, label="Mean(y_lik)")
ax[1].set_title("Posterior Predictive Std of y_lik")
ax[1].hist(ppc_std, bins=40, alpha=0.7, label="Std(y_lik)")
ax[ plt.tight_layout()
Full mixture Hmpf. That was not very promising at all. We would like the posterior predictive to match the observed E1 values, but what we get is a unimodal. Why can that be? The data is modeled as a mixture without a latent binary variable, \(z \in {0,1}\).
Latent z
hard assignment Here, we do get a distribution of \(\hat{y}\) that somewhat resembles the means of the two suggested distributions for A and B compartments.
Latent z
model Here, there are clear signs of overdispersion, or just something about the structure of the model that does not fit. I’m thinking the model is overdispersed as the bins are i.i.d. There is no limit on the ‘walk distance’ between the bins, so the model can easily make too large jumps between bins.
Consistency
We can also try to calculate how consistent the predicted E1 is with the observed E1. Maybe this is useful.
"y_lik"].mean(dim=["chain", "draw"]).values trace.posterior_predictive[
array([-0.01666207, -0.00627334, -0.01618608, ..., -0.01147345,
-0.01697497, -0.00775287], shape=(1460,))
# ppc_mean: shape (bins,)
# mean_ll: shape (bins,)
= trace.posterior_predictive["y_lik"].mean(dim=["chain", "draw"]).values
posterior_mean = trace.log_likelihood.y_lik.mean(dim=["chain", "draw"]).values
mean_ll
# Plot the data as a track (stairs for niceity)
= plt.subplots(figsize=(20, 5))
fig, ax
# Stairs
= np.zeros(2*y.shape[0])
x_stair = np.zeros(2*y.shape[0])
y_stair 0::2] = x
x_stair[1::2] = x + resolution
x_stair[0::2] = y
y_stair[1::2] = y
y_stair[
=(y_stair<0), color="tab:blue", alpha=0.5, ec='None')
ax.fill_between(x_stair, y_stair, where=(y_stair>0), color="tab:red", alpha=0.5, ec='None')
ax.fill_between(x_stair, y_stair, where
# plot the y_ppc_flat mean
= np.zeros(2*y.shape[0])
post_mean_stairs 0::2] = posterior_mean
post_mean_stairs[1::2] = posterior_mean
post_mean_stairs[=1, label="Posterior Predictive Mean", color="tab:orange")
ax.plot(x_stair, post_mean_stairs, lw
# Plot the point-wise log-likelihood
= np.zeros(2*mean_ll.shape[0])
ll_stair 0::2] = mean_ll
ll_stair[1::2] = mean_ll
ll_stair[=1, label="Binwise Log-Likelihood", color="tab:green")
ax.plot(x_stair, ll_stair, lw
0, 57_984_683)
ax.set_xlim(= np.arange(0, 57_984_683, step=5_000_000)
ticks
ax.set_xticks(ticks)/ 1e6).astype(int), rotation=45)
ax.set_xticklabels((ticks "Genomic Position (Mbp)")
ax.set_xlabel("E1")
ax.set_ylabel("E1")
ax.set_title(="best"); ax.legend(loc
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[32], line 24 22 # plot the y_ppc_flat mean 23 post_mean_stairs = np.zeros(2*y.shape[0]) ---> 24 post_mean_stairs[0::2] = posterior_mean 25 post_mean_stairs[1::2] = posterior_mean 26 ax.plot(x_stair, post_mean_stairs, lw=1, label="Posterior Predictive Mean", color="tab:orange") ValueError: could not broadcast input array from shape (1460,) into shape (1623,)
= np.arange(len(y))
bins
=(15, 4))
plt.figure(figsize
# Observed data
="Observed E1", color="black", alpha=0.5)
plt.plot(bins, y, label
# Posterior predictive mean
="Posterior Mean", color="tab:red")
plt.plot(bins, ppc_mean, label
# Credible interval shading
plt.fill_between(
bins,='lower').values,
ppc_hdi.y_lik.sel(hdi='higher').values,
ppc_hdi.y_lik.sel(hdi="tab:red",
color=0.3,
alpha="95% Credible Interval"
label
)
"Genomic bin")
plt.xlabel("E1 value")
plt.ylabel("Posterior Predictive Mean with 95% CI")
plt.title(
plt.legend()
plt.tight_layout() plt.show()
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[33], line 9 6 plt.plot(bins, y, label="Observed E1", color="black", alpha=0.5) 8 # Posterior predictive mean ----> 9 plt.plot(bins, ppc_mean, label="Posterior Mean", color="tab:red") 11 # Credible interval shading 12 plt.fill_between( 13 bins, 14 ppc_hdi.y_lik.sel(hdi='lower').values, (...) 18 label="95% Credible Interval" 19 ) File ~/miniconda3/envs/pymc/lib/python3.13/site-packages/matplotlib/pyplot.py:3838, in plot(scalex, scaley, data, *args, **kwargs) 3830 @_copy_docstring_and_deprecators(Axes.plot) 3831 def plot( 3832 *args: float | ArrayLike | str, (...) 3836 **kwargs, 3837 ) -> list[Line2D]: -> 3838 return gca().plot( 3839 *args, 3840 scalex=scalex, 3841 scaley=scaley, 3842 **({"data": data} if data is not None else {}), 3843 **kwargs, 3844 ) File ~/miniconda3/envs/pymc/lib/python3.13/site-packages/matplotlib/axes/_axes.py:1777, in Axes.plot(self, scalex, scaley, data, *args, **kwargs) 1534 """ 1535 Plot y versus x as lines and/or markers. 1536 (...) 1774 (``'green'``) or hex strings (``'#008000'``). 1775 """ 1776 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D) -> 1777 lines = [*self._get_lines(self, *args, data=data, **kwargs)] 1778 for line in lines: 1779 self.add_line(line) File ~/miniconda3/envs/pymc/lib/python3.13/site-packages/matplotlib/axes/_base.py:297, in _process_plot_var_args.__call__(self, axes, data, return_kwargs, *args, **kwargs) 295 this += args[0], 296 args = args[1:] --> 297 yield from self._plot_args( 298 axes, this, kwargs, ambiguous_fmt_datakey=ambiguous_fmt_datakey, 299 return_kwargs=return_kwargs 300 ) File ~/miniconda3/envs/pymc/lib/python3.13/site-packages/matplotlib/axes/_base.py:494, in _process_plot_var_args._plot_args(self, axes, tup, kwargs, return_kwargs, ambiguous_fmt_datakey) 491 axes.yaxis.update_units(y) 493 if x.shape[0] != y.shape[0]: --> 494 raise ValueError(f"x and y must have same first dimension, but " 495 f"have shapes {x.shape} and {y.shape}") 496 if x.ndim > 2 or y.ndim > 2: 497 raise ValueError(f"x and y can be no greater than 2D, but have " 498 f"shapes {x.shape} and {y.shape}") ValueError: x and y must have same first dimension, but have shapes (1623,) and (1460,)
Next up: Introducing dependency between bins
In the next notebook, we will implement a Gaussian Random Walk (GRW) model to introduce spatial dependency between bins. We will investigate how it affects smoothing and uncertainty in compartment assignment.
We will update the model to
\[\begin{aligned} p_t &\sim \mathcal{N}(p_{t-1}, \sigma^2_{t-1}) \\ z_t &\sim \text{Bernoulli}(p_t) \\ \\ \mu_{z_t} &= z_t \cdot \mu_A + (1 - z_t) \cdot \mu_B \\ y_t &\sim \mathcal{N}(\mu_{z_t}, \sigma^2) \end{aligned}\]