15 Hierarchical SCC Elimination and Composition

Introduction

This file formalizes the hierarchical strongly-connected-component (SCC) elimination procedure that phasic uses to keep parameterized moment computation tractable on graphs with hundreds of thousands of vertices. Where the monolithic elimination of [06] builds a single parameterized reward-compute graph (PRC) at cost O(|V|^3), the hierarchical procedure builds one PRC per SCC — each on a small synthetic SCC subgraph with just two extra vertices — and composes the per-SCC results in sink-first level order. When SCCs are small (the typical regime for spatial coalescent and queuing models), the total elimination cost drops from O(|V|^3) to \sum_i O((n_i + 2)^3), often with a large additional reduction from hash-deduplicated caching when many SCCs are structurally identical.

The synthetic-graph construction wraps each SCC C_i with a single synthetic source vertex s_i^\dagger (which feeds the SCC’s external inputs as channel-indexed placeholders) and a single synthetic absorbing vertex a_i^\dagger (onto which all edges leaving the SCC are collapsed, again as channel-indexed placeholders). The per-SCC PRC — the elimination-command sequence of [06] applied to this synthetic graph — is the unit of caching and the unit of parallel work. Composition binds the placeholder edges to (i) the parent graph’s actual edge weights for inputs and (ii) the already-computed downstream-SCC result values for outputs, then replays the per-SCC PRC and writes its internal-vertex results into the parent-wide result vector.

The hierarchical path is opt-in: it activates only when the environment variable PHASIC_HIERAR_ELIMINATION=1 is set, the graph is parameterised, and no reward vector is supplied. Reward-transformed moments (Definition 2.9) always fall through to the monolithic path of [06]. The output is numerically equivalent to monolithic elimination at the same \boldsymbol{\theta} by Theorem 15.1; the composition is exact, not approximate.

A separate Python module src/phasic/hierarchical_trace_cache.py implements an alternative hierarchical pipeline based on Python EliminationTrace stitching (the path consumed by JAX trace replay). That alternative is independent of the C composer formalized here and is not invoked when ptd_compose_scc_prcs runs; it is documented only as an Implementation Note at the end of this file.

Prerequisites: [01], [02], [04], [06], [11], [12]

Source files:

src/c/scc_compose.c (functions: ptd_compose_scc_prcs, ptd_compose_scc_prcs_inner, ptd_compose_scc_one)
src/c/scc_synthetic.c (functions: ptd_scc_build_synthetic_graph, ptd_scc_get_or_compute_prc, ptd_synth_get_or_compute_prc)
src/c/phasic.c (the hierarchical gate inside ptd_expected_waiting_time, L8576–L8615; the parameterised PRC serializer used by both monolithic and per-SCC caches, L2871–L3700)
api/c/phasic.h (public API: ptd_compose_scc_prcs, ptd_scc_get_or_compute_prc, ptd_synth_get_or_compute_prc, telemetry struct ptd_scc_compose_stats)
src/phasic/cache.py (Python introspection: param_compute_cache_info, clear_param_compute_cache)

Definitions

Recall (Definition 5.1). A strongly connected component (SCC) of G is a maximal subset C \subseteq V such that for every pair of vertices u, v \in C there exists a directed path from u to v and from v to u.

Recall (Definition 5.2). The condensation of G is the DAG G^{\mathrm{SCC}} = (V^{\mathrm{SCC}}, E^{\mathrm{SCC}}) whose meta-vertices are the SCCs and whose edges represent inter-component transitions in G.

Recall (Definition 7.6). The reward compute graph is a cached sequence of N commands C = (c_0, c_1, \ldots, c_{N-1}), each of the form (\text{from}, \text{to}, \text{multiplier}) representing \mathrm{result}[\text{from}] \mathrel{+}= \mathrm{result}[\text{to}] \times \text{multiplier}. For parameterised graphs, the multiplier slot is a pointer to a parameter-dependent value rather than a concrete double; we call the resulting object the parameterised reward-compute graph (PRC).

Definition 15.1 (Synthetic SCC subgraph) Let G = (V, E, W) be a parameterised phase-type graph and let C_i \subseteq V be an SCC of G with n_i = |C_i|. The synthetic SCC subgraph for C_i is the parameterised phase-type graph \widehat{G}_i = (\widehat{V}_i, \widehat{E}_i, \widehat{W}_i) defined as follows.

Vertices: \widehat{V}_i = C_i \cup \{s_i^\dagger, a_i^\dagger\}, where s_i^\dagger is the synthetic source (the starting vertex of \widehat{G}_i) and a_i^\dagger is the synthetic absorber.
Internal edges: for every edge (u \to v) \in E with u, v \in C_i, the edge (u \to v) \in \widehat{E}_i carries the same coefficient vector \mathbf{c}_{u\to v} as in G.
Input edges: every vertex v \in C_i that receives at least one edge from V \setminus C_i in G gets a single input edge (s_i^\dagger \to v) \in \widehat{E}_i. This input edge is a channel-indexed placeholder: it carries the special marker PTD_PCG_PTR_EXTERNAL with channel index equal to the position of v in the canonically ordered list of input vertices of C_i. The placeholder’s value is bound at composition time to the sum of the parent’s external-to-v edge weights at the current \boldsymbol{\theta}.
Output edges: every edge (v \to z) \in E with v \in C_i and z \notin C_i contributes one channel-indexed placeholder edge (v \to a_i^\dagger) \in \widehat{E}_i. The channel index is the position of (v \to z) in the canonically ordered list of outgoing-from-C_i edges of G. The placeholder’s value is bound at composition time to the parent’s edge weight times the downstream-SCC result r_z already computed.

Canonical ordering: input and output channels are assigned indices by sorting lexicographically first by the state vector of the in-C_i endpoint, then by the (state vector, channel) pair of the out-of-C_i endpoint. This ensures \widehat{G}_i’s structural signature (Definition 14.1) depends only on C_i’s shape in G, not on the absolute vertex indices of G.

Intuition

The synthetic subgraph is the smallest self-contained parameterised phase-type graph whose elimination yields, after appropriate boundary binding, the same per-vertex moment values for the internal vertices of C_i as the monolithic elimination on G would yield. The two synthetic vertices play distinct roles: s_i^\dagger is a source that injects probability mass from upstream SCCs, while a_i^\dagger is an absorber that captures all outflow to downstream SCCs in one place. The placeholders are the “ports” of the SCC — they store which external edge each channel corresponds to, without committing to a value until composition.

Example

Consider a 3-vertex SCC C_i = \{u_1, u_2, u_3\} in a larger graph G, with one external incoming edge w \to u_2 from outside and two external outgoing edges u_1 \to z_1, u_3 \to z_2. The synthetic subgraph \widehat{G}_i has 5 vertices: \{s_i^\dagger, u_1, u_2, u_3, a_i^\dagger\}. It carries the three internal edges of C_i (e.g. u_1 \to u_2, u_2 \to u_3, u_3 \to u_1 if C_i is a 3-cycle), plus one input placeholder s_i^\dagger \to u_2 (channel 0) and two output placeholders u_1 \to a_i^\dagger (channel 0) and u_3 \to a_i^\dagger (channel 1).

Definition 15.2 (Per-SCC parameterised reward-compute graph (per-SCC PRC)) The per-SCC PRC for component C_i is the parameterised reward-compute graph \widehat{P}_i = (\widehat{c}_0, \widehat{c}_1, \ldots, \widehat{c}_{N_i - 1}) obtained by applying the parameterised elimination algorithm of [06] (ptd_graph_ex_absorbation_time_comp_graph_parameterized) to the synthetic subgraph \widehat{G}_i.

Because \widehat{G}_i’s structural signature is determined by C_i’s internal shape and the number and identity of its input/output channels (and not by the absolute vertex indices of G), the graph content hash H(\widehat{G}_i) from [12] depends only on the canonical structure of C_i. Two structurally identical SCCs anywhere in the same graph, or across different parent graphs, yield the same hash and therefore share the same on-disk PRC.

The per-SCC PRC is cached on disk at ~/.phasic_cache/parameterized_reward_compute/scc_<hash>.bin, with the magic PTDPRMC1, format_revision = 2 (required for the PTD_PCG_PTR_EXTERNAL operand kind), and atomic write-then-rename semantics. Caching is gated by a size threshold: SCCs with n_i < \tau_{\mathrm{cache}} (where \tau_{\mathrm{cache}} is read from the environment variable PHASIC_MIN_SCC_SIZE_TO_CACHE, default 4) skip both load and save and are rebuilt every call. Both load and save honour PHASIC_DISABLE_CACHE=1.

Intuition

The per-SCC PRC is the symbolic version of the SCC’s contribution to the parent’s moment computation. Because it is parameterised, it does not commit to a concrete \boldsymbol{\theta} at build time — the same PRC is replayed across all SVGD/MCMC iterations. Caching it on disk amortises the O(n_i^3) build cost across processes (notebook restarts, SLURM workers, CLI runs) for any model that contains a structurally identical SCC.

Example

In a hexagonal spatial coalescent with K cells, each cell’s local lineage-merger dynamics forms an SCC. All K cells share the same canonical SCC structure — they differ only in their parent-graph vertex indices and in which external edges they connect to. The per-SCC PRC is built once (or loaded once from disk) and reused K times per moment evaluation.

Definition 15.3 (SCC level and level scheduling) Let G^{\mathrm{SCC}} be the condensation of G. The level \ell\colon V^{\mathrm{SCC}} \to \mathbb{N} of an SCC is defined recursively as the longest directed path from that SCC to a sink in G^{\mathrm{SCC}}:

\ell(C_i) = \begin{cases} 0 & \text{if } C_i \text{ has no outgoing edges in } G^{\mathrm{SCC}}, \\ 1 + \max\{ \ell(C_j) : (C_i \to C_j) \in E^{\mathrm{SCC}} \} & \text{otherwise.} \end{cases} \tag{15.1}

The level scheduling of G partitions V^{\mathrm{SCC}} into level sets \mathcal{L}_l = \{C_i : \ell(C_i) = l\} for l = 0, 1, \ldots, L where L = \max_i \ell(C_i). Within any single level \mathcal{L}_l there are no inter-SCC edges in G^{\mathrm{SCC}}.

Intuition

Levels group SCCs that can be processed independently. By construction, all of C_i’s downstream SCCs — the SCCs whose result vectors are needed to bind C_i’s output-edge placeholders — are at strictly lower levels. Processing levels in order 0, 1, \ldots, L guarantees that by the time level l runs, every SCC at every lower level is finished and its results are available. Within a level, SCCs are independent and can run in parallel without synchronisation.

Definition 15.4 (SCC composition) Let G be a parameterised phase-type graph with SCC decomposition \{C_1, \ldots, C_m\}, per-SCC PRCs \{\widehat{P}_1, \ldots, \widehat{P}_m\}, and level scheduling \mathcal{L}_0, \ldots, \mathcal{L}_L. Let \boldsymbol{\theta} \in \mathbb{R}^k be a parameter vector. The SCC composition of \{\widehat{P}_i\} at \boldsymbol{\theta} produces a parent-wide result vector \mathbf{r} \in \mathbb{R}^{|V|} as follows.

Initialize \mathbf{r} \leftarrow \mathbf{0} and update the parent’s edge weights from \boldsymbol{\theta} via ptd_graph_update_weights. Then for l = 0, 1, \ldots, L, process every C_i \in \mathcal{L}_l (in parallel within the level) by:

Build or load \widehat{G}_i and \widehat{P}_i via ptd_scc_get_or_compute_prc.
Bind input-channel placeholders: for each input vertex v \in C_i, set channel j_v^{\mathrm{in}}’s placeholder value to \sum_{u \notin C_i,\,(u\to v) \in E} w(u \to v; \boldsymbol{\theta}) (the parent’s accumulated external inflow to v).
Bind output-channel placeholders: for each external edge (v \to z) \in E with v \in C_i, z \notin C_i, set channel j_{v\to z}^{\mathrm{out}}’s placeholder value to w(v \to z; \boldsymbol{\theta}) \cdot r_z where r_z is the already-computed result at the downstream vertex z.
Replay \widehat{P}_i to obtain a synthetic-graph result vector \widehat{\mathbf{r}}_i \in \mathbb{R}^{n_i + 2}.
Copy each internal-vertex result back to the parent: for every v \in C_i, set r_{\mathrm{idx}(v)} \leftarrow \widehat{r}_{i,\mathrm{idx}_i(v)} where \mathrm{idx} and \mathrm{idx}_i are the parent-graph and synthetic-graph vertex indices of v.

The output of composition is the resulting \mathbf{r}.

Intuition

Steps 2 and 3 are the only places where parent-graph \boldsymbol{\theta}-dependent weights cross the SCC boundary. By binding them as concrete values into the placeholders, the per-SCC PRC — which knows nothing about the parent graph beyond its own input/output channel signatures — can replay autonomously. Step 4 is a constant-time table lookup (the PRC is precomputed), and step 5 is a vector copy.

Theorems and Proofs

Theorem 15.1 (Synthetic-graph correctness) Let G be a parameterised phase-type graph, C_i an SCC of G, and \boldsymbol{\theta} a parameter vector at which the monolithic elimination of [06] applied to G produces a finite result \mathbf{r}^{\mathrm{mono}} \in \mathbb{R}^{|V|}. Let \mathbf{r}^{\mathrm{comp}} be the vector produced by the composition of Definition 15.4 with downstream results already correctly computed (i.e. r^{\mathrm{comp}}_z = r^{\mathrm{mono}}_z for every z \notin C_i with \ell(\mathrm{SCC}(z)) < \ell(C_i)). Then for every v \in C_i, r^{\mathrm{comp}}_{\mathrm{idx}(v)} = r^{\mathrm{mono}}_{\mathrm{idx}(v)}.

Proof. The monolithic moment computation of Algorithm 7.3 satisfies, for every vertex v, r^{\mathrm{mono}}_v = \mathbb{E}_v[\tilde{\tau}] = x_v + \sum_{(v \to u) \in E} w'(v \to u; \boldsymbol{\theta}) \cdot r^{\mathrm{mono}}_u, where w' denotes the normalised edge weight (i.e. weight divided by total outgoing rate at v) and x_v is the vertex scalar from [06]. Splitting the sum at the SCC boundary: r^{\mathrm{mono}}_v = x_v + \sum_{u \in C_i,\,(v \to u) \in E} w'(v \to u) r^{\mathrm{mono}}_u + \sum_{z \notin C_i,\,(v \to z) \in E} w'(v \to z) r^{\mathrm{mono}}_z.

Now consider the synthetic graph \widehat{G}_i. Its internal edges are exactly the within-C_i edges of G; its s_i^\dagger \to v placeholder edges contribute zero rate to any v \in C_i during that vertex’s elimination (because s_i^\dagger is the source, not a child, of v); and its v \to a_i^\dagger placeholder edges, once bound by composition step 3 to w(v \to z; \boldsymbol{\theta}) \cdot r^{\mathrm{mono}}_z for each external (v \to z), contribute exactly \sum_{z \notin C_i} w(v \to z) r^{\mathrm{mono}}_z to vertex v’s outflow. Dividing by the same denominator (the total outgoing rate at v, which is identical in G and \widehat{G}_i at any vertex in C_i, because \widehat{G}_i duplicates every outgoing edge of every v \in C_i — internal edges directly and external edges collapsed onto a_i^\dagger), the normalised-weight equation for v in \widehat{G}_i is identical to the split form above.

The within-C_i self-consistency equations (the first two terms of the split) form a closed linear system on \{r^{\mathrm{comp}}_v\}_{v \in C_i}, with the third term (the boundary contribution) appearing as a constant once the downstream results have been bound. By the correctness of Algorithm 7.3 on \widehat{G}_i, replaying \widehat{P}_i at \boldsymbol{\theta} solves this system exactly. The hypothesis r^{\mathrm{comp}}_z = r^{\mathrm{mono}}_z for z \notin C_i ensures the constants are correct, so the unique solution agrees with r^{\mathrm{mono}} on C_i. \square

Theorem 15.2 (Level-scheduling correctness) Composition in sink-first level order (Definition 15.3, Definition 15.4) produces \mathbf{r}^{\mathrm{comp}} = \mathbf{r}^{\mathrm{mono}} at every vertex, independently of the order in which SCCs within the same level are processed.

Proof. We induct on the level. Base case (l = 0): SCCs in \mathcal{L}_0 are sinks of G^{\mathrm{SCC}}, hence have no outgoing inter-SCC edges; every z \notin C_i that is a child of some v \in C_i in G is itself absorbing in G (because absorbing vertices of G are their own SCCs and become sinks of G^{\mathrm{SCC}}). By convention, r^{\mathrm{mono}}_z = r^{\mathrm{comp}}_z = 0 at every absorbing vertex of G (or whatever boundary value the moment computation specifies; for \mathbb{E}[\tau] this is 0). The hypothesis of Theorem 15.1 is satisfied, so each C_i \in \mathcal{L}_0 produces correct internal results. Since SCCs in \mathcal{L}_0 have no inter-SCC edges between them, they read disjoint slots of \mathbf{r}^{\mathrm{comp}} and write disjoint slots, so the order of processing within \mathcal{L}_0 is immaterial.

Inductive step: assume r^{\mathrm{comp}}_v = r^{\mathrm{mono}}_v for every v in an SCC of level < l. Consider any C_i \in \mathcal{L}_l. Every downstream SCC C_j in G^{\mathrm{SCC}} (i.e. C_j such that some v \in C_i has an edge to some z \in C_j) satisfies \ell(C_j) < l = \ell(C_i) by definition of \ell. By the inductive hypothesis, r^{\mathrm{comp}}_z = r^{\mathrm{mono}}_z for every such z. The hypothesis of Theorem 15.1 is satisfied, so C_i’s composition produces correct internal results. Within \mathcal{L}_l there are no inter-SCC edges, so simultaneous processing is safe. \square

Theorem 15.3 (Composition complexity) Let G be a parameterised phase-type graph with SCC decomposition \{C_1, \ldots, C_m\} of sizes n_1, \ldots, n_m and per-SCC PRC lengths N_1, \ldots, N_m. Let u be the number of distinct content hashes among \{H(\widehat{G}_i)\}_{i=1}^m. Then:

The cost of building the per-SCC PRCs from scratch is O\!\big(\sum_{i=1}^{m} (n_i + 2)^3\big), which simplifies in the hash-deduplicated case to O\!\big(\sum_{j=1}^{u} (n_j + 2)^3\big) when SCCs with the same hash share their PRC.
The cost of composition (steps 2–5 of Definition 15.4 for all SCCs) is \Theta\!\big(\sum_{i=1}^{m} N_i\big).
The wall-clock cost benefits from intra-level parallelism: at level l with |\mathcal{L}_l| SCCs and T available threads, the level’s wall time is \Theta\!\big(\max_{C_i \in \mathcal{L}_l} N_i\big) \cdot \lceil |\mathcal{L}_l|/T \rceil rather than \sum_{C_i \in \mathcal{L}_l} N_i.

Proof.

The parameterised elimination of [06] runs in O(|\widehat{V}_i|^3) = O((n_i + 2)^3) time and produces a PRC of size O((n_i + 2)^2). Summing over all SCCs gives the first bound. With hash deduplication, each distinct \widehat{G}_i is built once, replacing m by u in the sum.
Composition for C_i performs: at most |\widehat{V}_i| placeholder bindings (steps 2–3), one PRC replay of cost \Theta(N_i) (step 4), and n_i result copies (step 5). Steps 2–3 and 5 are dominated by step 4 because N_i \geq |\widehat{V}_i| (every vertex has at least one elimination command). Summing \Theta(N_i) over all SCCs gives the bound.
Within level l, the SCC compositions are independent (no shared memory writes by Definition 15.3), so a parallel-for over |\mathcal{L}_l| tasks each costing \Theta(N_i) has wall time \Theta(\max_{C_i \in \mathcal{L}_l} N_i \cdot \lceil |\mathcal{L}_l|/T \rceil) on T threads with perfect load balancing. The C implementation realises this via OpenMP parallel for schedule(dynamic), capped by PHASIC_MAX_PARALLEL_SCCS to bound contention with nested OMP regions inside the per-SCC eliminator. \square

Algorithms

15.0.0.1 Hierarchical SCC Elimination and Composition

Description. This algorithm computes the parameterised reward-compute result vector \mathbf{r}(\boldsymbol{\theta}) \in \mathbb{R}^{|V|} on a parent graph G by decomposing G into SCCs, building (or loading from disk) a per-SCC PRC for each SCC’s synthetic subgraph, and composing the results in sink-first level order. The output is numerically equivalent to the monolithic elimination of Algorithm 7.3 at the same \boldsymbol{\theta} by Theorem 15.1, but the work is decomposed into smaller, cacheable, and parallel pieces.

Hierarchical SCC Elimination and Composition
 1: Let G = (V, E, W) be a parameterised phase-type graph
 2: Let θ ∈ R^k be a parameter vector
 3:
 4: function HierarchicalCompose(G, θ)
 5:   ▷ Stage 1: SCC decomposition and level scheduling
 6:   S = (C_1, ..., C_m) ← StronglyConnectedComponents(G)        ▷ @alg-04-tarjan-scc
 7:   G^SCC ← Condensation(G, S)
 8:   topo ← KahnTopologicalSort(G^SCC)                            ▷ forward-topological
 9:   for each C_i in topo reversed do
10:     ℓ[C_i] ← max(0, 1 + max{ℓ[C_j] : (C_i → C_j) ∈ E^SCC})    ▷ @def-13-scc-level
11:   end for
12:   L ← max ℓ
13:   levels ← partition of S by ℓ                                 ▷ levels[l] = {C : ℓ(C) = l}
14:
15:   ▷ Stage 2: parent edge weights from θ
16:   UpdateGraphWeights(G, θ)                                     ▷ ptd_graph_update_weights
17:   r ← 0 ∈ R^|V|                                                ▷ parent-wide result vector
18:
19:   ▷ Stage 3: level-by-level composition (sink-first)
20:   for l ← 0 to L do
21:     parallel for each C_i ∈ levels[l] do
22:       (Ĝ_i, P̂_i, meta_i) ← GetOrComputePRC(S, i)               ▷ @def-13-per-scc-prc
23:       ▷ Bind input-channel placeholders from parent edges
24:       for each input channel j in meta_i.in do
25:         v ← meta_i.in[j].target_vertex_in_C_i
26:         Ĝ_i.input_placeholder[j] ← Σ_{u∉C_i, (u→v)∈E} w(u→v; θ)
27:       end for
28:       ▷ Bind output-channel placeholders from downstream results
29:       for each output channel j in meta_i.out do
30:         (v, z) ← meta_i.out[j].external_edge                   ▷ v ∈ C_i, z ∉ C_i
31:         Ĝ_i.output_placeholder[j] ← w(v→z; θ) · r[idx(z)]
32:       end for
33:       ▷ Replay the per-SCC PRC
34:       r̂_i ← ReplayPRC(P̂_i, θ)
35:       ▷ Write internal-vertex results back into parent
36:       for each v in C_i do
37:         r[idx(v)] ← r̂_i[idx_i(v)]
38:       end for
39:     end parallel for
40:   end for
41:
42:   return r
43: end function
44:
45: function GetOrComputePRC(S, i)
46:   Ĝ_i ← BuildSyntheticSubgraph(S, i)                            ▷ @def-13-synthetic-subgraph
47:   meta_i ← canonical input/output channel tables of Ĝ_i
48:   if n_i < τ_cache then                                         ▷ PHASIC_MIN_SCC_SIZE_TO_CACHE
49:     P̂_i ← ParameterizedEliminate(Ĝ_i)                          ▷ no cache load/save
50:     return (Ĝ_i, P̂_i, meta_i)
51:   end if
52:   h_i ← GraphContentHash(Ĝ_i)                                   ▷ @alg-12-graph-hash
53:   path ← phasic_cache_dir / "parameterized_reward_compute" / ("scc_" ++ hex(h_i) ++ ".bin")
54:   if FileExists(path) and not PHASIC_DISABLE_CACHE then
55:     P̂_i ← LoadPRCFromDisk(path, Ĝ_i)                            ▷ may return NULL on rev mismatch
56:     if P̂_i ≠ NULL then return (Ĝ_i, P̂_i, meta_i) end if
57:   end if
58:   P̂_i ← ParameterizedEliminate(Ĝ_i)                            ▷ ptd_graph_ex_absorbation_time_comp_graph_parameterized
59:   if not PHASIC_DISABLE_CACHE then SavePRCToDisk(path, Ĝ_i, P̂_i) end if
60:   return (Ĝ_i, P̂_i, meta_i)
61: end function

Correspondence table:

Pseudocode variable	Math symbol	Code variable (file:function)
G	G	`parent` (`scc_compose.c:ptd_compose_scc_prcs_inner`)
S	\{C_1,\ldots,C_m\}	`scc_graph` (`api/c/phasic.h: struct ptd_scc_graph`)
G^{\mathrm{SCC}}	condensation	implicit in `scc_graph->vertices[k]->edges`
topo	Kahn’s forward-topological order	`topo_order` (`scc_compose.c:316--375`)
\ell	level (longest path to sink)	`level_of[k]` (`scc_compose.c:392--410`)
levels[l]	\mathcal{L}_l	`level_indices[level_offsets[l] \ldots level_offsets[l+1]]`
\mathbf{r}	parent result vector	`parent_result` (`scc_compose.c:282`)
\widehat{G}_i	synthetic subgraph	`synth` (`scc_synthetic.c:ptd_scc_build_synthetic_graph`)
\widehat{P}_i	per-SCC PRC	return value of `ptd_synth_get_or_compute_prc`
meta_i	channel tables	`metadata` (`scc_synthetic.c: struct ptd_scc_synthetic_metadata`)
ReplayPRC	parameterised PRC replay	inner loop of `ptd_compose_scc_one` (`scc_compose.c:138+`)
BuildSyntheticSubgraph	\widehat{G}_i constructor	`ptd_scc_build_synthetic_graph` (`scc_synthetic.c:509+`)
GetOrComputePRC	cache-or-compute factory	`ptd_scc_get_or_compute_prc` (`scc_synthetic.c:1195+`)
ParameterizedEliminate	parameterised PRC build	`ptd_graph_ex_absorbation_time_comp_graph_parameterized` (`src/c/phasic.c:7190+`)
LoadPRCFromDisk	binary cache load	`ptd_load_parameterized_reward_compute_graph` (`src/c/phasic.c:3672+`)
SavePRCToDisk	binary cache save	`ptd_save_parameterized_reward_compute_graph` (`src/c/phasic.c:3603+`)
\tau_{\mathrm{cache}}	min SCC size to cache	`PHASIC_MIN_SCC_SIZE_TO_CACHE` (`scc_synthetic.c:60--80`)

Complexity. Time (sequential): O\!\big(\sum_{j=1}^{u} (n_j + 2)^3\big) + \Theta\!\big(\sum_{i=1}^{m} N_i\big). Time (with T threads at level l): \Theta\!\big(\max_{C_i \in \mathcal{L}_l} N_i \cdot \lceil |\mathcal{L}_l|/T \rceil\big) per level, T \leq PHASIC_MAX_PARALLEL_SCCS. Space: O(|V| + \sum_i (n_i + 2)^2) for the parent result vector plus per-SCC PRCs held simultaneously.

Algorithm 15.1: Correctness. Follows from Theorem 15.1 (per-SCC) and Theorem 15.2 (level scheduling). The cache layer is correctness-preserving because the on-disk format includes the truncated content hash from [12] in its header; a stale cache file is silently rejected and triggers a rebuild rather than returning stale data.

Numerical Considerations

Re-entrancy guard. The composer sets a thread-local flag ptd_scc_compose_in_progress (declared in api/c/phasic.h, defined in scc_compose.c) before entering the level loop and clears it on exit. The hierarchical gate inside ptd_expected_waiting_time (src/c/phasic.c L8576–L8615) checks this flag and falls through to the monolithic path whenever it is set, so inner moment calls on synthetic graphs use monolithic elimination directly. Without this guard, the composer would recurse infinitely on the synthetic graph’s own SCC decomposition.

Cache format revision. Per-SCC PRCs require format_revision = 2 of the on-disk format because the channel-indexed placeholders are encoded with the operand kind PTD_PCG_PTR_EXTERNAL (src/c/phasic.c L2871–L2879). When a load encounters format_revision < 2 it is treated as a cache miss and the file is silently overwritten on save. Top-level (non-SCC) PRCs continue to use format_revision = 1 because they contain no external placeholders.

Compose-failure fall-through. If ptd_compose_scc_prcs fails for any reason (OOM, a per-SCC build error, a cycle in the condensation indicating a corrupted SCC decomposition), the gate clears ptd_err and falls through to the monolithic path. The user sees a slightly slower run, never a NaN or a missing result.

Reward-vector restriction. The gate explicitly requires rewards == NULL. For reward-transformed moments the synthetic-graph construction would need to propagate per-vertex reward state across the SCC boundary, which is not implemented. Reward-transformed calls always use the monolithic path, regardless of PHASIC_HIERAR_ELIMINATION.

MPFR fallback. The MPFR arbitrary-precision fallback of Algorithm 7.3 operates only on the monolithic path. Per-SCC PRCs do not currently have an MPFR variant; an ill-conditioned per-SCC PRC produces NaN, which the composer surfaces as a failure (triggering the monolithic fall-through above).

Numerical equivalence to monolithic. Composition is exact in exact arithmetic. In floating-point it differs from monolithic elimination only by the order of summation in the boundary bindings (step 2 of Definition 15.4), which sums external inflows in canonical channel order rather than parent edge-iteration order. Empirically the per-vertex relative difference is below 10^{-12} on well-conditioned graphs; ill-conditioned graphs that produce different values between the two paths are exactly the ones where the monolithic path triggers MPFR fall-back.

Cache observability. Process-wide telemetry counters (struct ptd_scc_compose_stats, api/c/phasic.h L1429+) track cache hits, cache misses, total compose calls, total wall time, and cache-bypassed SCCs (those below PHASIC_MIN_SCC_SIZE_TO_CACHE). Retrieve them with ptd_scc_compose_stats_get(&stats) and reset with ptd_scc_compose_stats_reset(). From Python, phasic.cache.param_compute_cache_info() reports the disk footprint of all parameterised PRC files (both monolithic <hash>.bin and per-SCC scc_<hash>.bin).

Implementation Notes

Source code mapping:

Algorithm	File	Function	Lines
Algorithm 15.1 (public entry)	`src/c/scc_compose.c`	`ptd_compose_scc_prcs`	L547–L570
Algorithm 15.1 (core body)	`src/c/scc_compose.c`	`ptd_compose_scc_prcs_inner`	L265–L540
Per-SCC worker (steps 22–38)	`src/c/scc_compose.c`	`ptd_compose_scc_one`	L138–L260
Cache-or-compute factory	`src/c/scc_synthetic.c`	`ptd_scc_get_or_compute_prc`	L1195–L1229
Synthetic graph construction (Definition 15.1)	`src/c/scc_synthetic.c`	`ptd_scc_build_synthetic_graph`	L509–L1110
Synth-only cache wrapper	`src/c/scc_synthetic.c`	`ptd_synth_get_or_compute_prc`	L1128–L1190
Hierarchical gate (in `ptd_expected_waiting_time`)	`src/c/phasic.c`	gate block	L8576–L8615
Binary PRC cache (load)	`src/c/phasic.c`	`ptd_load_parameterized_reward_compute_graph`	L3672+
Binary PRC cache (save)	`src/c/phasic.c`	`ptd_save_parameterized_reward_compute_graph`	L3603+
Telemetry struct	`api/c/phasic.h`	`struct ptd_scc_compose_stats`	L1429–L1442
Python cache introspection	`src/phasic/cache.py`	`param_compute_cache_info`, `clear_param_compute_cache`	(module top)

Deviations from pseudocode:

The composer splits into a thin public wrapper ptd_compose_scc_prcs (which handles telemetry counters and call timing) and a body ptd_compose_scc_prcs_inner (which contains the level loop). The split is to ensure every early-return path contributes to total_compose_ns and compose_calls even when the body fails.
Level computation uses Kahn’s algorithm to obtain a forward-topological order first, then walks that order in reverse to compute longest-path-to-sink levels. The forward sort is necessary because the SCC decomposition stored by ptd_isolate_starting_vertex_scc is not in general a valid topological order; line 8 of the pseudocode glosses over this.
Within each level, the parallel-for is schedule(dynamic) with the iteration count clamped to the level size and the thread count capped by PHASIC_MAX_PARALLEL_SCCS (when set). The cap applies only to the SCC-level parallel-for; nested OpenMP regions inside the per-SCC eliminator still see the full OMP_NUM_THREADS. This lets users limit SCC-level fan-out to keep memory bounded while still parallelising whatever the inner eliminator does.
The parent->edges array is mutated by ptd_graph_update_weights to reflect the current \boldsymbol{\theta} before composition begins. The composer assumes the caller wants this side effect (which matches the contract of ptd_expected_waiting_time’s monolithic path); the public API documentation in api/c/phasic.h makes this explicit.
Step 26 of the pseudocode glosses over the fact that the parent’s external inflow to a vertex is the sum of all external edge weights at the current \boldsymbol{\theta}, not a single edge weight. The C implementation accumulates this sum inside ptd_compose_scc_one (L138–L220) using a temporary buffer indexed by input-channel index from meta_i.

Relationship to the Python EliminationTrace hierarchical pipeline: A separate Python module src/phasic/hierarchical_trace_cache.py implements an alternative hierarchical pipeline based on Python EliminationTrace (Definition 13.2) stitching, recording one trace per SCC on an enhanced subgraph (the SCC plus a one-hop fringe of upstream and downstream vertices) and concatenating the per-SCC traces into a single global trace. That pipeline is consumed by JAX trace-replay workflows for autodiff and jit/vmap integration. The two pipelines are independent: the C composer formalised in this file is not invoked when the Python pipeline runs, and vice versa. The C composer and the Python pipeline can share their on-disk caches indirectly through the graph content hash from [12]: a top-level trace cache hit short-circuits the Python pipeline entirely, while a per-SCC PRC cache hit short-circuits an inner build of the C composer. They do not share cache files.

Python adapter: src/phasic/cache.py exposes param_compute_cache_info() and clear_param_compute_cache() for introspection and maintenance of both the top-level and per-SCC PRC caches (they live in the same directory). There is no Python-level handle to ptd_compose_scc_prcs itself; the composer is invoked transparently through ptd_expected_waiting_time when the environment is configured.

Symbol Index

Symbol	Name	First appearance
a_i^\dagger	Synthetic absorbing vertex of SCC C_i	Definition 15.1
\widehat{G}_i	Synthetic subgraph for SCC C_i	Definition 15.1
\mathcal{L}_l	Set of SCCs at level l	Definition 15.3
L	Maximum SCC level in G^{\mathrm{SCC}}	Definition 15.3
\ell(C_i)	Level of SCC C_i (longest path to sink)	Definition 15.3
n_i	Number of vertices in SCC C_i	Definition 15.1
N_i	Length of per-SCC PRC \widehat{P}_i	Theorem 15.3
\widehat{P}_i	Per-SCC parameterised reward-compute graph	Definition 15.2
\mathbf{r}^{\mathrm{comp}}	Composition result vector	Theorem 15.1
\mathbf{r}^{\mathrm{mono}}	Monolithic-elimination result vector	Theorem 15.1
s_i^\dagger	Synthetic source vertex of SCC C_i	Definition 15.1
\tau_{\mathrm{cache}}	Min SCC size to cache (`PHASIC_MIN_SCC_SIZE_TO_CACHE`)	Definition 15.2
\widehat{V}_i	Vertex set of \widehat{G}_i	Definition 15.1