Compartments Analysis

Analysis of Chromatin Compartments using the Open2C Ecosystem

Working with coolers

In this notebook, we use files generated by the workflow master_workflow that, in short, does the following:

  1. Download the data from the source
  2. Index the reference genome with bwa index and samtools faidx
  3. Align the reads to the reference genome with bwa mem
  4. Pair and sort the reads to .pairs files with pairtools parse | pairtools sort
  5. Deduplicate the pairs with pairtools dedup
  6. Convert the .pairs to cooler files with cooler cload pairs

We will:

  1. Load the cooler files
  2. Merge the coolers from the same BioSample ID –> Create ‘replicates’
  3. Zoomify the merged cooler files (coarsen) to create a multicooler (.mcool) file
  4. Balance the matrices (use commandline cooler balance)
  5. Calculate E1 compartments with cooltools compute_cis_eig

Overview

Data (Accessions)

To get an overview of the data accessions used in this analysis, we will first summarize the SRA-runtable.tsv that contains the accession numbers and some metadata for each sample (Table 17.1).

Table 5.1: The most relevant columns of the SRA-runtable.tsv file
  source_name BioSample Run GB Bases Reads
16 fibroblast SAMN08375237 SRR6502335 29.771059 73,201,141,800 244,003,806
17 fibroblast SAMN08375237 SRR6502336 22.755361 65,119,970,100 217,066,567
18 fibroblast SAMN08375236 SRR6502337 21.434722 52,769,196,300 175,897,321
19 fibroblast SAMN08375236 SRR6502338 21.420030 52,378,949,100 174,596,497
20 fibroblast SAMN08375236 SRR6502339 10.207410 28,885,941,600 96,286,472
9 fibroblast SAMN08375237 SRR7349189 52.729173 139,604,854,200 465,349,514
10 fibroblast SAMN08375236 SRR7349190 53.085520 142,008,353,400 473,361,178
21 pachytene spermatocyte SAMN08375234 SRR6502342 60.258880 150,370,993,500 501,236,645
22 pachytene spermatocyte SAMN08375234 SRR6502344 27.146048 65,697,684,300 218,992,281
23 pachytene spermatocyte SAMN08375234 SRR6502345 26.202707 63,490,538,700 211,635,129
0 pachytene spermatocyte SAMN09427370 SRR7345458 55.970557 153,281,577,900 510,938,593
1 pachytene spermatocyte SAMN09427370 SRR7345459 53.982492 144,993,841,200 483,312,804
11 pachytene spermatocyte SAMN08375235 SRR7349191 51.274476 137,821,979,100 459,406,597
24 round spermatid SAMN08375232 SRR6502351 20.924497 55,095,075,300 183,650,251
25 round spermatid SAMN08375232 SRR6502352 41.133960 115,578,475,800 385,261,586
26 round spermatid SAMN08375232 SRR6502353 36.444117 96,195,161,400 320,650,538
2 round spermatid SAMN09427369 SRR7345460 38.244654 104,105,827,200 347,019,424
3 round spermatid SAMN09427369 SRR7345461 53.996261 144,532,309,500 481,774,365
12 round spermatid SAMN08375232 SRR7349192 52.384556 140,431,608,000 468,105,360
29 sperm SAMN08375229 SRR6502360 26.653940 64,752,370,800 215,841,236
30 sperm SAMN08375228 SRR6502362 23.973440 58,369,232,700 194,564,109
13 sperm SAMN08375229 SRR7349193 52.806276 141,148,572,300 470,495,241
14 sperm SAMN08375229 SRR7349195 22.444378 60,523,788,600 201,745,962
15 sperm SAMN08375229 SRR7349196 38.253606 104,119,671,000 347,065,570
27 spermatogonia SAMN08375231 SRR6502356 22.845286 58,909,579,800 196,365,266
28 spermatogonia SAMN08375231 SRR6502357 17.947471 46,888,332,900 156,294,443
4 spermatogonia SAMN09427379 SRR7345462 18.686342 52,032,780,000 173,442,600
5 spermatogonia SAMN09427379 SRR7345463 29.956561 82,384,836,000 274,616,120
6 spermatogonia SAMN09427379 SRR7345464 39.145759 105,153,716,100 350,512,387
7 spermatogonia SAMN09427378 SRR7345465 35.816184 96,048,594,600 320,161,982
8 spermatogonia SAMN09427378 SRR7345467 28.396816 77,248,140,900 257,493,803
Table 5.2: Summary of the data accessions used in this analysis
source_name GB Bases Reads
0 fibroblast 211.403275 553,968,406,500 1,846,561,355
1 pachytene spermatocyte 274.835160 715,656,614,700 2,385,522,049
2 round spermatid 243.128044 655,938,457,200 2,186,461,524
3 sperm 164.131640 428,913,635,400 1,429,712,118
4 spermatogonia 192.794420 518,665,980,300 1,728,886,601

Folder structure

For ease of mind, here is the folder structure of the project. ../steps/bwa/PE/ is the base directory artificially defined in the master_workflow.py. It ccould be any other directory inside steps. It is defined relative to the master_workflow.py file (inside the worklow), and converted to an absolute path by python.

../steps/bwa/PE
├── bamfiles
│   ├── fibroblast
│   ├── pachytene_spermatocyte
│   ├── round_spermatid
│   ├── sperm
│   └── spermatogonia
├── cool
│   ├── fibroblast
│   ├── pachytene_spermatocyte
│   ├── round_spermatid
│   ├── sperm
│   └── spermatogonia
└── pairs
    ├── fibroblast
    ├── pachytene_spermatocyte
    ├── round_spermatid
    ├── sperm
    └── spermatogonia

18 directories
Figure 5.1

FullMerge (pool all from each source_name)

We will use cooler merge to merge all samples in each sub-folder (cell type) to just one interaction matrix for each cell type. The reason for that is that we choose to trust (Wang et al. 2019) when they say that compartments are highly reproducible between replicates, and by merging all replicates, we will have a more robust signal.

Thus, we will merge all samples from the same source_name into a single cooler file. The coolers are produced to only contain the real chromosomes (1-22, X, Y), and not the mitochondrial DNA or the unplaced contigs (~2,900), as they would not bring any information to the analysis.

Overview of this section:

  • Locate the coolers (glob –> dictionary)
  • Merge the coolers (cooler.merge_coolers)
  • Zoomify the merged cooler (cooler.zoomify_cooler) to resolutions: 10kb, 50kb, 100kb, 500kb.
  • Balance the matrices (!cooler balance) (use the CLI, as it is more easily parallelized)

Create cooler dictionary (glob)

First, we will create a dictionary with the paths to the coolers for each sample. We use glob.glob to fetch all the coolers in each sub-folder that remain after filtering and (automatic) quality control (those are the ones with nodups in their names).

import glob
import os.path as op
from pprint import pprint as pp
import pandas as pd

# Get the list of cell type dirs
base_dir = '../steps/bwa/PE/cool'
folders = glob.glob(op.join(base_dir, '*'))

files_dict = {f:glob.glob(f"{f}/*.nodups.*") for f in folders}
cooler_dict = {op.basename(k): [op.basename(f) for f in v] for k,v in files_dict.items()}
#pp(cooler_dict)
df = pd.DataFrame.from_dict(cooler_dict, orient='index').T.fillna('-')
df[['fibroblast', 'spermatogonia', 'pachytene_spermatocyte', 'round_spermatid', 'sperm']]
Table 5.3: Dictionary of coolers for each cell type
fibroblast spermatogonia pachytene_spermatocyte round_spermatid sperm
0 SRR6502339.nodups.10000.cool SRR6502357.nodups.10000.cool SRR7345459.nodups.10000.cool SRR7349192.nodups.10000.cool SRR7349196.nodups.10000.cool
1 SRR7349190.nodups.10000.cool SRR7345467.nodups.10000.cool SRR6502344.nodups.10000.cool SRR6502353.nodups.10000.cool SRR6502362.nodups.10000.cool
2 SRR7349189.nodups.10000.cool SRR6502356.nodups.10000.cool SRR6502342.nodups.10000.cool SRR6502352.nodups.10000.cool SRR7349193.nodups.10000.cool
3 SRR6502335.nodups.10000.cool SRR7345464.nodups.10000.cool SRR7345458.nodups.10000.cool SRR6502351.nodups.10000.cool SRR6502360.nodups.10000.cool
4 SRR6502338.nodups.10000.cool SRR7345462.nodups.10000.cool SRR6502345.nodups.10000.cool SRR7345460.nodups.10000.cool SRR7349195.nodups.10000.cool
5 SRR6502336.nodups.10000.cool SRR7345465.nodups.10000.cool SRR7349191.nodups.10000.cool SRR7345461.nodups.10000.cool -
6 SRR6502337.nodups.10000.cool SRR7345463.nodups.10000.cool - - -

Merge coolers

The coolers are merged by summing each bin in the matrices, meaning we can only merge matrices with same dimensions. We iterate through the dictionary and merge the coolers with cooler merge. The mergebuf parameter should be adjusted if you don’t have 32G memory. Default: mergebuf = 20000000. Below, we also check if the output file already exists. If it does, we skip the merge.

# NB adjust `mergebuf` if you don't have 32G of RAM

import cooler

for folder,cooler_list in cooler_dict.items():
    in_uris = [op.join(base_dir, folder, file) for file in cooler_list]
    out_uri = op.join(base_dir, folder, f'{folder}.fullmerge.cool')

    if op.exists(out_uri):
        print(f"Skipping {out_uri}: exists...")
        continue

    print(f"Creating {out_uri} by \nMerging {len(cooler_list)} coolers into one:", end=" ")
    print("\t",[file.split('.')[0] for file in cooler_list])
    cooler.merge_coolers(output_uri=out_uri,
                         input_uris=in_uris,
                         mergebuf=int(5e7),
                         )
    print("... Done!")
Skipping ../steps/bwa/PE/cool/round_spermatid/round_spermatid.fullmerge.cool: exists...
Skipping ../steps/bwa/PE/cool/spermatogonia/spermatogonia.fullmerge.cool: exists...
Skipping ../steps/bwa/PE/cool/sperm/sperm.fullmerge.cool: exists...
Skipping ../steps/bwa/PE/cool/fibroblast/fibroblast.fullmerge.cool: exists...
Skipping ../steps/bwa/PE/cool/pachytene_spermatocyte/pachytene_spermatocyte.fullmerge.cool: exists...

Zoomify the merged cooler files

After merging the coolers, we will zoomify them to resolutions: 10kb, 50kb, 100kb, 500kb. It will recursively create the zoom levels for the cooler file from the ‘base’ resolution (in our case 10kb bins) to the ‘max’ resolution (in our case 500kb bins), and save all zoom levels in a multi-resolution cooler (.mcool) file. The resolutions are stored under the resolutions key in the cooler file (e.g. cell_type.mcool::/resolutions/10000).

Here, we also check if the output file already exists. If it does, we skip the zoomify. Again, can tailor the chunksize (pixels loaded per process) parameter according to memory availability. I found that the command stalled completely (without quitting) when the memory allocation was not sufficient, as I started out with 32 cores and 32G of RAM. I did not test what was more time-efficient; increasing number of processes or chunksize. After all, it didn’t take long to run anyways.

Finally, I list the resolutions in the mcool file.

# NB 8 cores and 32G of RAM was used

import glob
import cooler
import os.path as op

base_dir = '../steps/bwa/PE/cool'
merged_coolers = glob.glob(op.join(base_dir, '*/*.fullmerge.cool'))


for clr in merged_coolers:
    out_uri = clr.replace('.fullmerge.cool', '.mcool')
    if op.exists(out_uri):
        print(f"Skipping {out_uri}: exists...")
        continue

    print(f"Zoomifying cooler: \n\t   {clr}\n\t-> {out_uri}", end="")
    
    cooler.zoomify_cooler(base_uris = clr,
                          outfile = out_uri,
                          resolutions = [10000,50000,100000,500000],
                          chunksize = 10000000,
                          nproc = 8)
    print(" --> done")
Skipping ../steps/bwa/PE/cool/round_spermatid/round_spermatid.mcool: exists...
Skipping ../steps/bwa/PE/cool/spermatogonia/spermatogonia.mcool: exists...
Skipping ../steps/bwa/PE/cool/sperm/sperm.mcool: exists...
Skipping ../steps/bwa/PE/cool/fibroblast/fibroblast.mcool: exists...
Skipping ../steps/bwa/PE/cool/pachytene_spermatocyte/pachytene_spermatocyte.mcool: exists...
import glob
import cooler
mcools = glob.glob("../steps/bwa/PE/cool/*/*.mcool")

for mcool in mcools:
    print(f"{mcool}:")
    print(cooler.fileops.list_coolers(mcool))
    print()
../steps/bwa/PE/cool/round_spermatid/round_spermatid.mcool:
['/resolutions/10000', '/resolutions/50000', '/resolutions/100000', '/resolutions/500000']

../steps/bwa/PE/cool/spermatogonia/spermatogonia.mcool:
['/resolutions/10000', '/resolutions/50000', '/resolutions/100000', '/resolutions/500000']

../steps/bwa/PE/cool/sperm/sperm.mcool:
['/resolutions/10000', '/resolutions/50000', '/resolutions/100000', '/resolutions/500000']

../steps/bwa/PE/cool/fibroblast/fibroblast.mcool:
['/resolutions/10000', '/resolutions/50000', '/resolutions/100000', '/resolutions/500000']

../steps/bwa/PE/cool/pachytene_spermatocyte/pachytene_spermatocyte.mcool:
['/resolutions/10000', '/resolutions/50000', '/resolutions/100000', '/resolutions/500000']

Balance the matrices

Finally, we balance the matrices using the cooler CLI. We use the cooler balance command with the default options using 32 cores, which iteratively balances the matrix (Iterative Corecction). It is first described as a method for bias correction of Hi-C matrices in (Imakaev et al. 2012), where it is paired with eigenvector decomposition (ICE). Here the eigenvector decomposition of the obtained maps is shown to provide insights into local chromatin states.

According to cooler documentation, we have to balance the matrices on each resolution, and thus it cannot be done prior to zoomifying. The argument is that the balancing weights are resolution-specific and will no longer retain its meaning when binned with other weights. Therefore, we use a nested for-loop that iterates through all the .mcools and all the resolutions in each .mcool. cooler balance will create a new column in the bins group of each cooler , weight, which can then be included or not in the downstream analysis. This means we will have access to both the balanced and the unbalanced matrix.

The default mode uses genome-wide data to calculate the weights for each bin. It would maybe be more suitable to calculate the weights for cis contacts only, and that is possible through the --cis-only flag, and that can be added to another column, so that we can compare the difference between the two methods easily. However, we will only use the default mode for now. The default options are:

  • ignore-diags: 2 (ignore the diagonals (-1,0,1))
  • mad-max: 5 (median absolute deviation threshold)
  • min-nnz: 10 (minimum number of non-zero entries before exclusion)

Conveniently, cooler balance automatically checks if there is already a weight column, and skips the balancing if it already exists. We can overwrite with --force.

%%capture balance_out --no-stderr

import sys
import glob

mcools = glob.glob("../steps/bwa/PE/cool/*/*.mcool")
resolutions = [10000, 50000, 100000, 500000]

for mcool in mcools:
    print(f"Balancing {mcool}:", file=sys.stderr)
    for res in resolutions:
        full_name = f"{mcool}::resolutions/{res}"
        print(f"\tresolution {res}...", end=" ", file=sys.stderr)
        # First, just default values
        !cooler balance -p 32 {full_name}
        # With only cis-contacts, save column seprately in the cooler file
        #!cooler balance -p 32 -c 20000000 --cis-only -n cis_weights {full_name}
        print("--> Done!", file=sys.stderr)
Balancing ../steps/bwa/PE/cool/round_spermatid/round_spermatid.mcool:
    resolution 10000... --> Done!
    resolution 50000... --> Done!
    resolution 100000... --> Done!
    resolution 500000... --> Done!
Balancing ../steps/bwa/PE/cool/spermatogonia/spermatogonia.mcool:
    resolution 10000... --> Done!
    resolution 50000... --> Done!
    resolution 100000... --> Done!
    resolution 500000... --> Done!
Balancing ../steps/bwa/PE/cool/sperm/sperm.mcool:
    resolution 10000... --> Done!
    resolution 50000... --> Done!
    resolution 100000... --> Done!
    resolution 500000... --> Done!
Balancing ../steps/bwa/PE/cool/fibroblast/fibroblast.mcool:
    resolution 10000... --> Done!
    resolution 50000... --> Done!
    resolution 100000... --> Done!
    resolution 500000... --> Done!
Balancing ../steps/bwa/PE/cool/pachytene_spermatocyte/pachytene_spermatocyte.mcool:
    resolution 10000... --> Done!
    resolution 50000... --> Done!
    resolution 100000... --> Done!
    resolution 500000... --> Done!
# Display the balancing stdout if you want (but it is up to 200 lines per cooler)
#balance_out()

RepMerge (pool all from each BioSample ID)

Initially, I planned to re-create the replicates that they used in the paper, but I reason that it is not necessary and might even be better to pool all the samples in stead. They state alredy that their compartments are highly repdoducible between replicates, so I choose to trust that and not bother with the replicates.

Therefore, this section was only briefly initialized, and now it is commented out.

#df.groupby(['source_name','BioSample'])['Reads'].sum()
# from pprintpp import pprint as pp

# grouped_df = df.groupby(['source_name','BioSample'])

# # Initialize an empty dictionary
# rep_dict = {}

# # Iterate over each group
# for (source_name, BioSample), group in grouped_df:
#     # Extract the 'Run' column and convert it to a list
#     run_list = group['Run'].tolist()
    
#     # Populate the nested dictionary
#     if source_name not in rep_dict:
#         rep_dict[source_name] = {}
#     rep_dict[source_name][BioSample] = run_list

# pp(rep_dict)
# for source_name, BioSample_dict in rep_dict.items():
#     print(f"source_name: {source_name}")
#     print(f"Changing working dir: {source_name}/")
#     for BioSample, run_list in BioSample_dict.items():
#         print(f"Merging samples for BioSample: {BioSample}: {run_list}")

Plotting

The following section contains the visualization of the matrices and includes the calculation of the E1 compartments and their visualization. I discuss differents methods for construction and try to get both methodologically and result-wise close to (Wang et al. 2019).

First, we will work on matrices in 500kb resolution, as documentation [maybe it was HiCExplorer??] states it is sufficient for chromosome-wide analysis and plotting. We will follow the cooltools-recommended pipeline (except that they use a 100kb cooler) for visualization and compartment calling with one example cooler (the merged fibroblast cooler at 500kb resolution).

Then, the most relevant parts will be generalized to run in a loop over all the coolers. First at 500kb resolution, then at 100kb resolution.

To accomodate the approach (barely) described in the paper, we will discuss a smoothing step to the observed/expected matrices before compartment calling, and we will apply a smoothing step to the E1 compartments and compare to the raw compartments.

500kb resolution

Example with a single sample

First, I will explore the visualization pipeline for a single cooler at 500kb resolution. I will modify the plot to be ‘stairs’ in stead of just a regular line plot, as it is both a more accurate representation of the data and it is more aesthetically pleasing with less spiky lines and holes.

In practice, the length of the dataframe is doubled, as it now contains an E1 value for both the start and end position for each bin in stead of only for the start. However, I first make the regular line plot to show the difference.

Imports

# import standard python libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import os, subprocess
# Import python package for working with cooler files and tools for analysis
import cooler
import cooltools.lib.plotting
import cooltools

Load cooler

mclr = "../steps/bwa/PE/cool/fibroblast/fibroblast.mcool"

clr = cooler.Cooler(f"{mclr}::resolutions/500000")

Calculate gc covariance (from the reference genome)

I will use bioframe to load the reference and calculate the GC content of each bin in the cooler file. The convention in Hi-C is to use GC content as a phasing track to orient eigenvector track to positively correlated with the GC content. In this subsection, I calculate the GC content for all chromosomes and filter afterwards to only include the ‘X’ chromosome. I will do that smarter in the next subsection.

import bioframe
import os

bins = clr.bins()[:]
rheMac10 = bioframe.load_fasta('../data/links/ucsc_ref/rheMac10.fa')
gc_cov_csv = '../steps/rheMac10_gc_cov_500kb.tsv'
if not os.path.exists(gc_cov_csv):
    print('Calculate the fraction of GC basepairs for each bin')
    gc_cov = bioframe.frac_gc(bins[['chrom', 'start', 'end']], rheMac10)
    gc_cov.to_csv(gc_cov_csv, index=False, sep='\t')
    display(gc_cov)
else: 
    print("Already exists, just read from file:")
    gc_cov = pd.read_csv(gc_cov_csv, sep='\t')
    display(gc_cov)
Already exists, just read from file:
chrom start end GC
0 chr1 0 500000 0.424508
1 chr1 500000 1000000 0.378836
2 chr1 1000000 1500000 0.388272
3 chr1 1500000 2000000 0.445226
4 chr1 2000000 2500000 0.439485
... ... ... ... ...
5715 chrY 9500000 10000000 0.399518
5716 chrY 10000000 10500000 0.396028
5717 chrY 10500000 11000000 0.395280
5718 chrY 11000000 11500000 0.382006
5719 chrY 11500000 11753682 NaN

5720 rows × 4 columns

Calculate the E1 compartments

Here, I will calculate the E1 compartments for the ‘X’ chromosome in the fibroblast cooler at 500kb resolution. I will use the cooltools package to do eigendecomposition and calculate the E1 compartments. I use the GC content (gc_cov obtained above) as a phasing track.

I will only calculate the within-chromosomes compartmentalization (cis contacts). I use cooltools.eigs_cis that decorrelate the contact-frequency by distance before performing the eigendecomposition.

I this example, I will use the simplest ‘view’ (chromosome-wide) of the chromosome to calculate the E1 values, but later I will explore how partitioning the chromosome affects the E1 values. As the GC content was calculated for all chromosomes, the eigenvectors are also calculated for the whole matrix. This will also be optimized in the next subsection, as we only ever look at the ‘X’ chromosome.

# Make the full view frame
view_df = pd.DataFrame(
    {
    'chrom': clr.chromnames,
    'start': 0,
    'end': clr.chromsizes.values,
    'name': clr.chromnames
    }
                      )
#display(view_df)
# obtain first 3 eigenvectors
cis_eigs = cooltools.eigs_cis(
                        clr,
                        gc_cov,
                        view_df=view_df,
                        n_eigs=3,
                        )

# cis_eigs[0] returns eigenvalues, here we focus on eigenvectors
eigenvector_track = cis_eigs[1][['chrom','start','end','E1']]
len(eigenvector_track)
5720
# full track
eigenvector_track
chrom start end E1
0 chr1 0 500000 -0.756052
1 chr1 500000 1000000 -1.001063
2 chr1 1000000 1500000 -0.928042
3 chr1 1500000 2000000 -0.507579
4 chr1 2000000 2500000 0.123708
... ... ... ... ...
5715 chrY 9500000 10000000 -0.499798
5716 chrY 10000000 10500000 -0.588820
5717 chrY 10500000 11000000 -0.988677
5718 chrY 11000000 11500000 -0.566585
5719 chrY 11500000 11753682 NaN

5720 rows × 4 columns

As I have made myself a detour and calculated the eigenvectors for all chromosomes, I will now filter the eigenvector track to only include the ‘X’ chromosome. Then, I will (in the name of explicitness) construct an array of all the indices where the E1 values change sign. This is the simplest way to save the coordinates of the compartment boundaries. It can later on be saved to .csv or whatever for comparison between different samples. Also, it will be simplified in the next subsection.

# subset the chrX   
eigenvector_track_chrX = eigenvector_track.loc[eigenvector_track['chrom'] == 'chrX']
nbins = len(eigenvector_track_chrX)
eigenvector_track_chrX
e1X_values = eigenvector_track_chrX['E1'].values
# Where does the E1 change sign
np.where(np.diff( (cis_eigs[1][cis_eigs[1]['chrom']=='chrX']['E1']>0).astype(int)))[0]
array([  1,   6,  16,  26,  29,  40,  42,  49,  54,  55,  59,  60,  61,
        62,  74,  83,  86, 101, 102, 103, 104, 108, 123, 124, 129, 138,
       141, 142, 147, 149, 185, 188, 194, 196, 197, 201, 206, 209, 211,
       215, 222, 224, 228, 233, 238, 240, 250, 253, 256, 265, 273, 274,
       288, 289, 290, 293, 296, 302])

Plot E1 and matrix

Finally, I will plot the matrix, the GC content, and the E1 compartments for the ‘X’ chromosome in the fibroblast cooler at 500kb resolution. I will use the cooltools package to plot the matrix and the compartments. I will use the matplotlib package to plot the GC content.

I will plot the matrix as a heatmap, the GC content as a line plot, and the E1 compartments as a line plot. I will also plot the compartment boundaries as vertical lines. The compartments are colored according to the sign of the E1 values.

Following the cooltools recommendation, the colorbar itself will be log-transformed, as otherwise the balanced interaction matrix (values from 0 to 1) will be hard to interpret. The E1 values are (as is the colorbar) added to the axis with an AxesDivider.append_axes object to make them match the axes of the matrix.

I’m testing out the %%capture magic to capture the plot to a variable, then displaying it in the next cell. Initially, the plot took a while to generate, and I wanted to avoid re-generating when adjusting graphics on cell level with the YAML options. Now it is not necessary, but I keep it for the sake of the example.

%%capture chrX_matrix_e1_500kb

from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable

f, ax = plt.subplots(
    figsize=(15, 10),
)

norm = LogNorm(vmax=0.1)

im = ax.matshow(
    clr.matrix().fetch('chrX'),
    norm=norm,
    cmap='fall',
);
plt.axis([0,nbins,nbins,0])


divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(im, cax=cax, label='corrected frequencies');
ax.set_ylabel('chrX:500kb bins, #bin')
ax.xaxis.set_visible(False)

ax1 = divider.append_axes("top", size="20%", pad=0.05, sharex=ax)
#weights = clr.bins()[:]['weight'].values
#ax1.plot([0,nbins],[0,0],'k',lw=0.25)

#ax1.plot(e1X_values, label='E1')

# Fill between the line and 0
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values > 0), color='tab:red')
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values < 0), color='tab:blue')

ax1.set_ylabel('E1')
ax1.set_xticks([]);
ax1.get_subplotspec()


for i in np.where(np.diff( (cis_eigs[1][cis_eigs[1]['chrom']=='chrX']['E1']>0).astype(int)))[0]:
    # Horisontal lines where E1 intersects 0
    ax.plot([0,nbins],[i,i],'k',lw=0.4)

    # Vertical lines where E1 intersects 0
    ax.plot([i,i],[0,nbins],'k',lw=0.4)
chrX_matrix_e1_500kb.outputs[0]
Figure 6.1: chrX interaction matrix with E1 eigenvector values in 500kb resolution. The sign change of the E1 is overlayed as thin black lines, making it more easily interpretable. We should qualitatively determine how well the E1 values capture the plaid pattern in the matrix.

Markdown: Figure 10.1

We observe that the E1 values only somewhat captures the plaid pattern in the matrix Figure 10.1, but it is not related to the size of the compartments, as both small and large compartments can be observed. The B-compartment that starts from around bin 150 (75.000.000 bp) seems to have squares that are not captured by the E1 values. Maybe it could be captures by TAD calling, but that is not the scope of this analysis.

import matplotlib.pyplot as plt

f, ax1 = plt.subplots(
    figsize=(10, 2),
)

# Fill between the line and 0
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values > 0), color='tab:red')
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values < 0), color='tab:blue')

#ax1.set_ylabel('E1')
ax1.set_xticks([])
ax1.set_yticks([])

# Remove borders
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.spines['left'].set_visible(False)
ax1.spines['bottom'].set_visible(False)

plt.tight_layout()

# Save the plot as a SVG file
#plt.savefig('e1_plot.svg', bbox_inches='tight')

plt.show()
Figure 6.2: 500kb binned E1 eigenvector values for chrX. Freshly calculated from the cooler file.

Markdown the next plot: Figure 10.3 that is

Stairs plot of the E1 compartments

(Less spiky, more smooth)

import matplotlib.pyplot as plt

f, (ax1, ax2) = plt.subplots(2,1,
    figsize=(10, 4)
)

chrom_start = eigenvector_track_chrX['start'].values
window_size = chrom_start[1] - chrom_start[0]

# Fill between the line and 0
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values > 0), color='tab:red')
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values < 0), color='tab:blue')

# Create stairs
x = np.zeros(2*chrom_start.size)
y = np.zeros(2*chrom_start.size)
x[0::2] = chrom_start
x[1::2] = chrom_start + window_size
y[0::2] = e1X_values
y[1::2] = e1X_values

# Layout
ax1.set_ylabel('Spiky E1')
ax2.set_ylabel('Stairs E1')
ax1.set_xticks([])
ax1.set_yticks([])
ax2.set_xticks([])
ax2.set_yticks([])
ax2.set_ylim(-1, 1)

# Remove borders
ax1.spines[:].set_visible(False)
ax2.spines[:].set_visible(False)

ax2.fill_between(x, y, 0, where=(y > 0), color='tab:red')
ax2.fill_between(x, y, 0, where=(y < 0), color='tab:blue')

plt.tight_layout()
# Save the plot as a high-resolution PNG file
#plt.savefig('../steps/e1_plot.png', dpi=320, bbox_inches='tight')

500kb binned E1 eigenvector values for chrX.

500kb binned E1 eigenvector values for chrX.

All 5 full merges

Load coolers

import glob
import os.path as op
import cooler

mcools = glob.glob("../steps/bwa/PE/cool/*/*.mcool")
res = "::resolutions/500000"

clrs = {op.basename(op.dirname(mcool)): cooler.Cooler(mcool+res) for mcool in mcools}

chron_order = ['fibroblast', 'spermatogonia', 'pachytene_spermatocyte', 'round_spermatid', 'sperm']

clrs = {key: clrs[key] for key in chron_order}
clrs
{'fibroblast': <Cooler "fibroblast.mcool::/resolutions/500000">,
 'spermatogonia': <Cooler "spermatogonia.mcool::/resolutions/500000">,
 'pachytene_spermatocyte': <Cooler "pachytene_spermatocyte.mcool::/resolutions/500000">,
 'round_spermatid': <Cooler "round_spermatid.mcool::/resolutions/500000">,
 'sperm': <Cooler "sperm.mcool::/resolutions/500000">}

Calculate gc covariance (from the reference genome)

Do this with any of the clrs - it just needs the bins positions.

# Try with only the gc_cov for chrX

import bioframe
import pandas as pd
import os.path as op

bins = clrs['fibroblast'].bins().fetch('chrX')[:]
out_name = '../steps/rheMac10_gc_cov_X_500kb.tsv'

rheMac10 = bioframe.load_fasta('../data/links/ucsc_ref/rheMac10.fa')
if not op.exists(out_name):
    print('Calculate the fraction of GC basepairs for each bin')
    gc_cov = bioframe.frac_gc(bins[['chrom', 'start', 'end']], rheMac10)
    gc_cov.to_csv(out_name, index=False,sep='\t')
    print(gc_cov.info())
else: 
    print("Already exists, read from file")
    gc_cov = pd.read_csv(out_name, sep='\t')
    print(gc_cov.info())
Already exists, read from file
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307 entries, 0 to 306
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   chrom   307 non-null    object 
 1   start   307 non-null    int64  
 2   end     307 non-null    int64  
 3   GC      307 non-null    float64
dtypes: float64(1), int64(2), object(1)
memory usage: 9.7+ KB
None

Calculate the E1 compartments

Loop: view_df, cis_eigs, e1_values

Plot GC covariance
import matplotlib.pyplot as plt

f, ax = plt.subplots(
    figsize=(10, 2),
)

ax.plot(gc_cov['start'],gc_cov['GC'])

plt.tight_layout()

Chromosome restricted E1 compartments
# Use gc_cov to calculate eigenvectors with cooltools.eigs_cis

import cooltools
import pandas as pd

eigs_full = {}
e1_values_full = {}

chrX_size = clrs['fibroblast'].chromsizes['chrX']

# Divide into chromosome arms
clr = clrs['fibroblast']
view_df_full = pd.DataFrame(
    {
    'chrom': 'chrX',
    'start': 0,
    'end': chrX_size,
    'name': 'chrX'
    }, index=[0]
                      )

for name, clr in clrs.items():
    print(f"Calculating eigenvectors for {name}, size {clr.binsize}")
    cis_eigs_full = cooltools.eigs_cis(
                        clr,
                        gc_cov,
                        view_df=view_df_full,
                        n_eigs=3,
                        )
    eigs_full[name] = cis_eigs_full[1]
    e1_track_full = cis_eigs_full[1][['chrom','start','end','E1']]
    e1_values_full[name] = e1_track_full['E1'].values

#eigs
Calculating eigenvectors for fibroblast, size 500000
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for spermatogonia, size 500000
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for pachytene_spermatocyte, size 500000
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for round_spermatid, size 500000
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for sperm, size 500000
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Chromosome arms restricted
# Use gc_cov to calculate eigenvectors with cooltools.eigs_cis

import cooltools
import pandas as pd

eigs = {}
e1_values = {}

chrX_size = clrs['fibroblast'].chromsizes['chrX']

# Divide into chromosome arms
view_df = pd.DataFrame(
    {
    'chrom': 'chrX',
    'start': [0, 59_000_001],
    'end': [59_000_000, chrX_size],
    'name': ['X_short', 'X_long']
    }, index=[0,1]
                      )

for name, clr in clrs.items():
    print(f"Calculating eigenvectors for {name}")
    cis_eigs = cooltools.eigs_cis(
                        clr,
                        gc_cov,
                        view_df=view_df,
                        n_eigs=3,
                        )
    eigs[name] = cis_eigs[1]
    e1_track = cis_eigs[1][['chrom','start','end','E1']]
    e1_values[name] = e1_track['E1'].values

#eigs
Calculating eigenvectors for fibroblast
Calculating eigenvectors for spermatogonia
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for pachytene_spermatocyte
Calculating eigenvectors for round_spermatid
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for sperm
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Refined: 10Mb windows restricted
# Use gc_cov to calculate eigenvectors with cooltools.eigs_cis

import cooltools

eigs_10mb = {}
e1_values_10mb = {}

chrX_size = clrs['fibroblast'].chromsizes['chrX']

# Calculate in 1Mb windows
# Define the window size (1Mb)

window_size = 10_000_000

# Generate the start and end positions for each window
start_positions = list(range(0, chrX_size, window_size))
end_positions = [min(start + window_size, chrX_size) for start in start_positions]
# Create the DataFrame
view_df_10mb = pd.DataFrame({
    'chrom': ['chrX'] * len(start_positions),
    'start': start_positions,
    'end': end_positions,
    'name': [f'X_{i}' for i in range(len(start_positions))]
})
#display(view_df)

for name, clr in clrs.items():
    print(f"Calculating eigenvectors for {name}")
    cis_eigs_10mb = cooltools.eigs_cis(
                        clr,
                        gc_cov,
                        view_df=view_df_10mb,
                        n_eigs=3,
                        )
    eigs_10mb[name] = cis_eigs_10mb[1]
    e1_track_10mb = cis_eigs_10mb[1][['chrom','start','end','E1']]
    e1_values_10mb[name] = e1_track_10mb['E1'].values

#eigs
Calculating eigenvectors for fibroblast
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for spermatogonia
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for pachytene_spermatocyte
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for round_spermatid
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for sperm
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Save the A compartment coordinates
### Just testing for a single sample

# import numpy as np
# import pandas as pd

# e1 = pd.Series(e1_values_full['sperm'])
# res = 500_000

# e1_filled = e1.ffill(limit=2,limit_area='inside')

# # Type is preserved for boolean arrays, so the result will contain 
# # False when consecutive elements are the same and True when they differ.
# # We then use np.where to get the indices where the values change sign
# sign_change_coords = np.where(np.diff((e1_filled > 0).astype(int)))[0]
# # Filter out the indices that point to a NaN value
# sign_change_coords = sign_change_coords[~e1_filled.isna().iloc[sign_change_coords].values]

# a_start_bin = sign_change_coords[e1_filled[sign_change_coords+1] > 0]
# a_end_bin = sign_change_coords[e1_filled[sign_change_coords] > 0]

# # if a_start_bin[0] > b_start_bin[0]:
# #     a_start_bin = np.insert(a_start_bin, 0, 0)
# #     b_start_bin = np.insert(b_start_bin, len(b_start_bin), len(e1))


# display(sign_change_coords)
# print(f"A compartment start {len(a_start_bin)}:", a_start_bin)
# print(f"A compartment end {len(a_end_bin)}:", a_end_bin)

# display(e1_filled[:20])
This is copied to hicstuff.py as a function for later import
# ###### Restriction: full chromosome E1 values #######

# import pandas as pd
# import numpy as np
# import os.path as op

# for name,e1 in e1_values_full.items():
#     e1 = pd.Series(e1)
#     e1_filled = e1.ffill(limit=2)
#     sign_change_coords = np.where(np.diff((e1_filled > 0).astype(int)))[0]

#     a_start_bin = sign_change_coords[e1_filled[sign_change_coords+1] > 0]
#     a_end_bin = sign_change_coords[e1_filled[sign_change_coords] > 0]

#     print(f"Calculating A-compartment intervals for {name}")
#     print(len(a_start_bin), len(a_end_bin))

#     if e1_filled.iloc[-1] > 0:
#         print("Last bin is A, appending end bin")
#         a_end_bin = np.append(a_end_bin, len(e1_filled)-1)
#         print(len(a_start_bin), len(a_end_bin))
    
#     df = pd.DataFrame({
#         'chrom': ['chrX'] * len(a_start_bin),
#         'bin_start': a_start_bin,
#         'bin_end': a_end_bin,
#         'start': a_start_bin*res,
#         'end': a_end_bin*res,
#         'length': (a_end_bin-a_start_bin)*res
#     })
#     csv_name = f'../results/{name}_a_comp_coords_{int(res*0.001)}kb_full.csv'
#     print(csv_name)
#     if not op.exists(csv_name):
#         df[['chrom', 'start', 'end']].to_csv(csv_name, index=False)
#     else:
#         print("Already exists, skipping")
#     print()
######## Restriction: chromosome arm E1 values ########
import importlib
import hicstuff
import os
importlib.reload(hicstuff)
from hicstuff import extract_a_coordinates


# Binsize in bp
res = 500_000
outdir = '../results/compartments/'
if not os.path.exists(outdir):
    os.makedirs(outdir)
    

######## Restriction: full chromosome E1 values ########
restriction = 'full'
for name, e1 in e1_values_full.items():
    extract_a_coordinates(e1=e1, name=name, restriction=restriction, chrom='chrX', res=500_000, csv=True,output_dir=outdir, force=True)

######## Restriction: chromosome arm E1 values ########
restriction = 'arms'
for name, e1 in e1_values.items():
    extract_a_coordinates(e1=e1, name=name, restriction=restriction, chrom='chrX', res=500_000, csv=True,output_dir=outdir, force=True)

######## Restriction: 10Mb E1 values ########
restriction = '10Mb'
for name, e1 in e1_values_10mb.items():
    extract_a_coordinates(e1=e1, name=name, restriction=restriction, chrom='chrX', res=500_000, csv=True,output_dir=outdir, force=True)
Saving eigenvector track to: ../results/compartments/fibroblast_e1_500kb_full.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_500kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_500kb_full.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_500kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_500kb_full.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_500kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_500kb_full.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_500kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_500kb_full.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_500kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/fibroblast_e1_500kb_arms.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_500kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_500kb_arms.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_500kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_500kb_arms.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_500kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_500kb_arms.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_500kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_500kb_arms.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_500kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/fibroblast_e1_500kb_10Mb.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_500kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_500kb_10Mb.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_500kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_500kb_10Mb.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_500kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_500kb_10Mb.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_500kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_500kb_10Mb.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_500kb_10Mb.csv already exists. Overwriting.
Plot NaN histogram
# Check the number of NaN values in the E1 column
import numpy as np

# Check the number of NaN values in the E1 column and create a DataFrame
nan_counts = {k: {'length': len(v), 'NaNs': np.isnan(v).sum()} for k, v in e1_values.items()}
#display(pd.DataFrame.from_dict(nan_counts, orient='index'))

# Locate the NaN values (histogram)
import matplotlib.pyplot as plt

f, ax = plt.subplots(1, 5, figsize=(20,2))
for i, (name, track) in enumerate(eigs.items()):
    e1 = track['E1'].values
    # Locate NaN values
    e1_nan = np.where(np.isnan(e1))
    # Plot histogram
    ax[i].hist(e1_nan, bins=100)
    
    # Plot median line
    median_pos = np.median(e1_nan)
    ax[i].axvline(median_pos, color='r')

    # Layout
    ax[i].set_title(name)
    xticks = np.linspace(0, len(e1), num=5)
    xticks = np.append(xticks, median_pos)  # Add median position to xticks
    ax[i].set_xticks(xticks)
    xticklabels = np.linspace(0, len(e1) * 0.5, num=5, dtype = 'int').tolist()
    xticklabels.append(median_pos*0.5)  # Add median label
    ax[i].set_xticklabels(xticklabels, rotation=45)
    ax[i].set_xlabel('Position (Mbp)')
                     

Histogram of NaN values in the E1 eigenvector for each cell type.

Histogram of NaN values in the E1 eigenvector for each cell type.

Plot the E1 compartments

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.patches import Rectangle

res = 500_000
names_abbr = {'fibroblast': 'Fibroblast', 'spermatogonia': 'Spermatogonia', 'pachytene_spermatocyte': 'Pachytene Spermatogonia', 'round_spermatid': 'Round Spermatid', 'sperm': 'Sperm'}

chrom_start = e1_track_full['start'].values
chrom_end = e1_track_full['end'].values-1

f, axs = plt.subplots(5, 3, figsize=(18, 8), sharex=True, sharey=True)

# Populate the first column
axs[0,0].set_title('E1: Full-chromosome restricted')
for i, (name, e1) in enumerate(e1_values_full.items()):
    ax = axs[i,0]
    ax.set_ylabel(names_abbr[name])

    #ax.set_title(name)
    
    # Create stairs
    x = np.zeros(2*chrom_start.size)
    y = np.zeros(2*chrom_start.size)
    x[0::2] = chrom_start
    x[1::2] = chrom_end
    y[0::2] = e1
    y[1::2] = e1

    ax.fill_between(x, y, 0, where=(y > 0), color='tab:red', linewidth=0)
    ax.fill_between(x, y, 0, where=(y < 0), color='tab:blue', linewidth=0)

    # Test to see how well my coordinates match the e1 values
    coords = pd.read_csv(f'../results/compartments/{name}_a_comp_coords_500kb_full.csv')
    # Iterate over each interval in the DataFrame
    for start, end in zip(coords['start'], coords['end']):
        rect = Rectangle((start, 0.2), width=end-start, height=0.05, color='k', linewidth=0.05)
        ax.add_patch(rect)

# Populate the second column
axs[0,1].set_title('E1: Chromosome-arms restricted')
for i, (name, e1) in enumerate(e1_values.items()):
    ax = axs[i,1]
    #ax.set_title(name)
    
    # Create stairs
    x = np.zeros(2*chrom_start.size)
    y = np.zeros(2*chrom_start.size)
    x[0::2] = chrom_start
    x[1::2] = chrom_end
    y[0::2] = e1
    y[1::2] = e1

    ax.fill_between(x, y, 0, where=(y > 0), color='tab:red', linewidth=0)
    ax.fill_between(x, y, 0, where=(y < 0), color='tab:blue', linewidth=0)

    # Test to see how well my coordinates match the e1 values
    coords = pd.read_csv(f'../results/compartments/{name}_a_comp_coords_500kb_arms.csv')
    # Iterate over each interval in the DataFrame
    for start, end in zip(coords['start'], coords['end']):
        rect = Rectangle((start, 0.2), width=end-start, height=0.05, color='k', linewidth=0.05)
        ax.add_patch(rect)

# Populate the third column
axs[0,2].set_title('E1: 10Mb restricted (refined)')
for i, (name, e1) in enumerate(e1_values_10mb.items()):
    ax = axs[i,2]
    #ax.set_title(name)
    
    # Create stairs
    x = np.zeros(2*chrom_start.size)
    y = np.zeros(2*chrom_start.size)
    x[0::2] = chrom_start
    x[1::2] = chrom_end
    y[0::2] = e1
    y[1::2] = e1

    ax.fill_between(x, y, 0, where=(y > 0), color='tab:red', linewidth=0)
    ax.fill_between(x, y, 0, where=(y < 0), color='tab:blue', linewidth=0)

    # Test to see how well my coordinates match the e1 values
    coords = pd.read_csv(f'../results/compartments/{name}_a_comp_coords_500kb_10Mb.csv')
    # Iterate over each interval in the DataFrame
    for start, end in zip(coords['start'], coords['end']):
        rect = Rectangle((start, 0.2), width=end-start, height=0.05, color='k', linewidth=0.05)
        ax.add_patch(rect)

# Set y-limits for all subplots
for ax in axs.flat:
    ax.set_ylim(-0.8, 0.8)
    ax.spines[:].set_visible(False)
    ax.set_yticks([])
    ax.set_xticks([])

plt.tight_layout()
plt.savefig('../steps/e1_plot_full.svg', bbox_inches='tight')

E1 eigenvector values for all merged samples at 500kb resolution. Left: Chromosome-arm restricted E1. Right: 10Mb window restricted E1.

E1 eigenvector values for all merged samples at 500kb resolution. Left: Chromosome-arm restricted E1. Right: 10Mb window restricted E1.

Plot matrices with compartments (round spermatid)

import cooltools.lib.plotting
from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable
from cooltools.lib.numutils import adaptive_coarsegrain, interp_nan



clr = clrs['round_spermatid']
chrom_start, chrom_end = e1_track_full['start'].values, e1_track_full['end'].values-1
e1 = e1_values_10mb['round_spermatid']
nbins = len(clr.bins().fetch('chrX'))

f, ax = plt.subplots(
    figsize=(9,6),
)

norm = LogNorm(vmax=0.1)


e1_list = [e1_values_10mb['round_spermatid'], e1_values['round_spermatid'], e1_values_full['round_spermatid']]
e1_names = ['10Mb', 'Arms', 'Full']
colors = ['tab:red', 'tab:blue', 'tab:green']



im = ax.matshow(
    clr.matrix().fetch('chrX'),
    norm=norm,
    cmap='fall',
);
plt.axis([0,nbins,nbins,0])

divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="2%", pad=0.05)
plt.colorbar(im, cax=cax, label='corrected frequencies');
ax.set_ylabel('chrX:500kb bins, #bin')
ax.xaxis.set_visible(False)

for i, e1 in enumerate(e1_list):
    ax1 = divider.append_axes("top", size="10%", pad=0.05, sharex = ax)
    # weights = clr.bins()[:]['weight'].values

    # ax1.plot(e1, label='E1')

    # Create stairs
    x = np.zeros(2*chrom_start.size)
    y = np.zeros(2*chrom_start.size)
    #smooth_y = np.zeros(2*chrom_start.size)
    x[0::2] = chrom_start/500_000
    x[1::2] = chrom_end/500_000
    y[0::2] = e1
    y[1::2] = e1
    # smooth_e1 = (lambda x: x.rolling(5, 1, center=True).sum())(pd.DataFrame(e1, columns=['value']))['value'].values
    # smooth_y[0::2] = smooth_e1
    # smooth_y[1::2] = smooth_e1


    ax1.fill_between(x, y, 0, where=(y > 0), color='tab:red')
    ax1.fill_between(x, y, 0, where=(y < 0), color='tab:blue')

    ax1.set_ylabel(e1_names[i], rotation=70)
    ax1.set_ylim(-0.8, 0.8)
    ax1.set_xticks([])

ax1.set_title('Round Spermatid: 500kb bins')


#    # Plot the sign changes on the matrix 
#     col = colors[i]
#     for i in np.where(np.diff( (pd.Series(e1)>0).astype(int)))[0]:
#         # Horisontal lines where E1 intersects 0
#         ax.plot([0,nbins],[i,i],col,lw=0.5)

#         # Vertical lines where E1 intersects 0
#         ax.plot([i,i],[0,nbins],col,lw=0.5)

plt.tight_layout()
f.canvas.draw()

E1 eigenvector values for all merged samples at 500kb resolution for round spermatid, as well as the interaction matrix. E1 was restricted to top: Full-chromosome, middle: Chromosome-arms, bottom: 10Mb window.

E1 eigenvector values for all merged samples at 500kb resolution for round spermatid, as well as the interaction matrix. E1 was restricted to top: Full-chromosome, middle: Chromosome-arms, bottom: 10Mb window.

Sliding window summed E1 compartments

To mimic the smoothing applied in the Wang et al. 2019 paper, where they slide a 400kb window in 100kb steps on the obs/exp matrix, we will similarly slide a 400kb window in 100kb steps directly on the E1 compartments.

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable
import pandas as pd
from scipy.signal.windows import triang

resolution = 500_000
window_size = 2_500_000
step_size = window_size // resolution

chrom_start = e1_track['start'].values
chrom_end = e1_track['end'].values-1

f, axs = plt.subplots(5, 2, figsize=(30, 10), sharex=True)

axs[0, 0].set_title('Chromosome-arm E1 (arm restricted)')
axs[0, 1].set_title('Smoothed by summation')

for i, (name, e1) in enumerate(e1_values.items()):
    # print(i, name, e1.size)
    smooth_e1 = (lambda x: x.rolling(5, 1, center=True).sum())(pd.DataFrame(e1, columns=['value']))

    ax0 = axs[i, 0]
    ax1 = axs[i, 1]
    
    # Create stairs
    x = np.zeros(2*chrom_start.size)
    y = np.zeros(2*chrom_start.size)
    smooth_y = np.zeros(2*chrom_start.size)
    x[0::2] = chrom_start
    x[1::2] = chrom_end
    y[0::2] = e1
    y[1::2] = e1
    smooth_y[0::2] = smooth_e1['value'].values
    smooth_y[1::2] = smooth_e1['value'].values


    ax0.fill_between(x, y, 0, where=(y > 0), color='tab:red')
    ax0.fill_between(x, y, 0, where=(y < 0), color='tab:blue')
    
    # Overlay the smoothed line (divided by 5 to make it a mean)
    ax0.plot(x, smooth_y/5, color='C1')

    ax1.fill_between(x, smooth_y, 0, where=(smooth_y > 0), color='tab:red')
    ax1.fill_between(x, smooth_y, 0, where=(smooth_y < 0), color='tab:blue')

    ax0.set_ylabel(name)
    ylim = 1.5
    #ax0.set_ylim(-ylim, ylim)
    #ax1.set_ylim(-ylim*4, ylim*4)

for ax in axs.flat:
    ax.spines[:].set_visible(False)
    ax.set_xticks([])
    #ax.set_yticks([])

# plt.tight_layout()

E1 eigenvector values for all merged samples at 500kb resolution with rolling summation, with window size 2.5Mb, step size 500Kb: Each value is now the sum of the surrounding n=5 bins.

E1 eigenvector values for all merged samples at 500kb resolution with rolling summation, with window size 2.5Mb, step size 500Kb: Each value is now the sum of the surrounding n=5 bins.

100kb resolution

All 5 full merges

Load coolers

import glob
import os.path as op
import cooler

mcools = glob.glob("../steps/bwa/PE/cool/*/*.mcool")
res = "::resolutions/100000"

clrs = {op.basename(op.dirname(mcool)): cooler.Cooler(mcool+res) for mcool in mcools}

chron_order = ['fibroblast', 'spermatogonia', 'pachytene_spermatocyte', 'round_spermatid', 'sperm']

clrs = {key: clrs[key] for key in chron_order}
clrs

names_abbr = {'fibroblast': 'Fibroblast', 'spermatogonia': 'Spermatogonia', 'pachytene_spermatocyte': 'Pachytene Spermatogonia', 'round_spermatid': 'Round Spermatid', 'sperm': 'Sperm'}

# Calculate chromstart and chromend for each bin on chrX
chrX_size = clrs['fibroblast'].chromsizes['chrX']
chrom_start = clrs['fibroblast'].bins().fetch('chrX')['start'].values
chrom_end = clrs['fibroblast'].bins().fetch('chrX')['end'].values-1
nbins = len(clrs['fibroblast'].bins().fetch('chrX'))
binsize = clrs['fibroblast'].binsize

Calculate gc covariance (from the reference genome)

Do this with any of the clrs - it just needs the bins positions.

# Try with only the gc_cov for chrX

import bioframe
import pandas as pd
import os.path as op

bins = clrs['fibroblast'].bins().fetch('chrX')[:]
out_name = '../steps/rheMac10_gc_cov_X_100kb.tsv'

rheMac10 = bioframe.load_fasta('../data/links/ucsc_ref/rheMac10.fa')
if not op.exists(out_name):
    print('Calculate the fraction of GC basepairs for each bin')
    gc_cov = bioframe.frac_gc(bins[['chrom', 'start', 'end']], rheMac10)
    gc_cov.to_csv(out_name, index=False,sep='\t')
    print(gc_cov.info())
else: 
    print("Already exists, read from file")
    gc_cov = pd.read_csv(out_name, sep='\t')
    print(gc_cov.info())
Already exists, read from file
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1534 entries, 0 to 1533
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   chrom   1534 non-null   object 
 1   start   1534 non-null   int64  
 2   end     1534 non-null   int64  
 3   GC      1533 non-null   float64
dtypes: float64(1), int64(2), object(1)
memory usage: 48.1+ KB
None

Calculate the E1 compartments

Loop: view_df, cis_eigs, e1_values

Plot GC covariance
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(10, 2))

ax.plot(gc_cov['start'],gc_cov['GC'])
plt.tight_layout()

Create viewframes: full, arms, 10Mb windows
import pandas as pd

# Fetch the chromsize of X from one of the coolers
chrX_size = clrs['fibroblast'].chromsizes['chrX']

views = {}

# Make the full view frame
views['full'] = pd.DataFrame({
    'chrom': 'chrX',
    'start': 0,
    'end': chrX_size,
    'name': 'chrX'}, index=[0])

# Divide into chromosome arms
views['arms'] = pd.DataFrame({
    'chrom': 'chrX',
    'start': [0, 59_000_001],
    'end': [59_000_000, chrX_size],
    'name': ['X_short', 'X_long']}, index=[0,1])

# Calculate in 10Mb windows
window_size = 10_000_000
start_positions = list(range(0, chrX_size, window_size))
end_positions = [min(start + window_size, chrX_size) for start in start_positions]

# Create the DataFrame
views['10Mb'] = pd.DataFrame({
    'chrom': ['chrX'] * len(start_positions),
    'start': start_positions,
    'end': end_positions,
    'name': [f'X_{i}' for i in range(len(start_positions))]
})
Calulate the E1 compartments
import cooltools

eigs = {}
e1_values = {}

for name, clr in clrs.items():
    if name not in eigs or name not in e1_values:
        eigs[name] = {} 
        e1_values[name] = {}
    for view, view_df in views.items():
        print(f"Calculating eigenvectors for {name} at {view}")
        cis_eigs = cooltools.eigs_cis(
                            clr,
                            gc_cov,
                            view_df=view_df,
                            n_eigs=1)
        eigs[name][view] = cis_eigs[1]
        e1_values[name][view] = cis_eigs[1]['E1'].values
        
Calculating eigenvectors for fibroblast at full
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for fibroblast at arms
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for fibroblast at 10Mb
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for spermatogonia at full
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for spermatogonia at arms
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for spermatogonia at 10Mb
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for pachytene_spermatocyte at full
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for pachytene_spermatocyte at arms
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for pachytene_spermatocyte at 10Mb
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for round_spermatid at full
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for round_spermatid at arms
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for round_spermatid at 10Mb
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for sperm at full
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for sperm at arms
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
Calculating eigenvectors for sperm at 10Mb
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Extract the A compartment coordinates

import importlib
import hicstuff
importlib.reload(hicstuff)
from hicstuff import extract_a_coordinates


res = 100_000
outdir = '../results/compartments/'

for name, view_df in e1_values.items():
    print(f"Calculating A-compartment intervals for {name}")
    for view, e1 in view_df.items():
        print(view)
        extract_a_coordinates(
            e1=e1,
            name=name,
            restriction=view,
            chrom='chrX',
            res=res,
            csv=True,
            output_dir=outdir,
            force=True
        )
        extract_a_coordinates(
            e1=e1,
            name=name,
            restriction=view,
            chrom='chrX',
            res=res,
            csv=True,
            output_dir=outdir,
            smooth=True,
            force=True
        )
    
Calculating A-compartment intervals for fibroblast
full
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_full.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_full_smoothed.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_full_smoothed.csv already exists. Overwriting.
arms
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_arms.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_arms_smoothed.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_arms_smoothed.csv already exists. Overwriting.
10Mb
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_10Mb.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_10Mb_smoothed.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_10Mb_smoothed.csv already exists. Overwriting.
Calculating A-compartment intervals for spermatogonia
full
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_full.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_full_smoothed.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_full_smoothed.csv already exists. Overwriting.
arms
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_arms.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_arms_smoothed.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_arms_smoothed.csv already exists. Overwriting.
10Mb
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_10Mb.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_10Mb_smoothed.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_10Mb_smoothed.csv already exists. Overwriting.
Calculating A-compartment intervals for pachytene_spermatocyte
full
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_full.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_full_smoothed.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_full_smoothed.csv already exists. Overwriting.
arms
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_arms.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_arms_smoothed.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_arms_smoothed.csv already exists. Overwriting.
10Mb
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_10Mb.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_10Mb_smoothed.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_10Mb_smoothed.csv already exists. Overwriting.
Calculating A-compartment intervals for round_spermatid
full
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_full.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_full_smoothed.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_full_smoothed.csv already exists. Overwriting.
arms
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_arms.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_arms_smoothed.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_arms_smoothed.csv already exists. Overwriting.
10Mb
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_10Mb.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_10Mb_smoothed.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_10Mb_smoothed.csv already exists. Overwriting.
Calculating A-compartment intervals for sperm
full
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_full.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_full_smoothed.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_full_smoothed.csv already exists. Overwriting.
arms
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_arms.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_arms_smoothed.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_arms_smoothed.csv already exists. Overwriting.
10Mb
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_10Mb.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_10Mb_smoothed.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_10Mb_smoothed.csv already exists. Overwriting.

Plot the E1 compartments

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable

f, axs = plt.subplots(5, 3, figsize=(18, 8), sharex=True, sharey=True)

# Loop through eigs and e1_values to plot the E1 values
for i, (name, e1_dict) in enumerate(e1_values.items()):
    axs[i,0].set_ylabel(names_abbr[name])
    for j, (view, e1) in enumerate(e1_dict.items()):
        ax = axs[i, j]

        if i==0:
            ax.set_title(view)

        # Create stairs
        x = np.zeros(2*chrom_start.size)
        y = np.zeros(2*chrom_start.size)
        smooth_y = np.zeros(2*chrom_start.size)
        x[0::2] = chrom_start
        x[1::2] = chrom_end
        y[0::2] = e1
        y[1::2] = e1
        smooth_e1 = (lambda x: x.rolling(5, 1, center=True).sum())(pd.DataFrame(e1, columns=['value']))['value'].values
        smooth_y[0::2] = smooth_e1
        smooth_y[1::2] = smooth_e1


        # ax.fill_between(x, y, 0, where=(y > 0), color='tab:red')
        # ax.fill_between(x, y, 0, where=(y < 0), color='tab:blue')
        # ax.plot(x, smooth_y/5, color='C1', linewidth=0.5)

        ax.fill_between(x, smooth_y, 0, where=(smooth_y > 0), color='tab:red')
        ax.fill_between(x, smooth_y, 0, where=(smooth_y < 0), color='tab:blue')

        # Test to see how well my coordinates match the e1 values
        coords = pd.read_csv(f'../results/compartments/{name}_a_comp_coords_100kb_{view}_smoothed.csv')
        # Iterate over each interval in the DataFrame
        for start, end in zip(coords['start'], coords['end']):
            rect = Rectangle((start, 0.4), width=end-start, height=0.5, color='k', linewidth=0.05)
            ax.add_patch(rect)

# Set y-limits for all subplots
for ax in axs.flat:
    ax.set_ylim(-4, 4)
    ax.spines[:].set_visible(False)
    ax.set_xticks([])
    ax.set_yticks([])

plt.tight_layout()
plt.savefig('../steps/e1_plot_full_100kb_smoothed.svg', bbox_inches='tight')
        
    

E1 eigenvector values for all merged samples at 100kb resolution. Left: Chromosomes (not restricted) Middle: Chromosome-arm restricted E1. Right: 10Mb window restricted E1.Values are smoothed with a sliding window of 5 bins, step size 1 bin.

E1 eigenvector values for all merged samples at 100kb resolution. Left: Chromosomes (not restricted) Middle: Chromosome-arm restricted E1. Right: 10Mb window restricted E1.Values are smoothed with a sliding window of 5 bins, step size 1 bin.

Plot matrices with compartments

import cooltools.lib.plotting
from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable
from cooltools.lib.numutils import adaptive_coarsegrain, interp_nan

f, axs = plt.subplots(1,5,
    figsize=(25,10)
)

norm = LogNorm(vmax=0.1)

# Loop through the clrs and its matrix: plot the matrix on its axis
for i, (name, e1_dict) in enumerate(e1_values.items()):
    ax = axs[i]

    im = ax.matshow(
        clr.matrix().fetch('chrX'),
        norm=norm,
        cmap='fall',
    );
    ax.set_xlim(0, nbins)
    ax.set_ylim(nbins, 0)

    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="2%", pad=0.05)
    plt.colorbar(im, cax=cax, label='corrected frequencies');
    ax.set_ylabel('chrX:100kb bins, #bin')
    ax.xaxis.set_visible(False)

    for j, (view, e1) in enumerate(e1_dict.items()):
        ax1 = divider.append_axes("top", size="10%", pad=0.1, sharex = ax)

        # ax1.plot(e1, label='E1')

        # Create stairs
        x = np.zeros(2*chrom_start.size)
        y = np.zeros(2*chrom_start.size)
        #smooth_y = np.zeros(2*chrom_start.size)
        x[0::2] = chrom_start/binsize
        x[1::2] = chrom_end/binsize
        y[0::2] = e1
        y[1::2] = e1
        # smooth_e1 = (lambda x: x.rolling(5, 1, center=True).sum())(pd.DataFrame(e1, columns=['value']))['value'].values
        # smooth_y[0::2] = smooth_e1
        # smooth_y[1::2] = smooth_e1


        ax1.fill_between(x, y, 0, where=(y > 0), color='tab:red')
        ax1.fill_between(x, y, 0, where=(y < 0), color='tab:blue')

        ax1.set_ylabel(view, rotation=-70)
        ax1.set_ylim(-0.8, 0.8)
        ax1.set_xticks([])

    ax1.set_title(name)

#    # Plot the sign changes on the matrix 
#     col = colors[i]
#     for i in np.where(np.diff( (pd.Series(e1)>0).astype(int)))[0]:
#         # Horisontal lines where E1 intersects 0
#         ax.plot([0,nbins],[i,i],col,lw=0.5)

#         # Vertical lines where E1 intersects 0
#         ax.plot([i,i],[0,nbins],col,lw=0.5)

# Now show the plot
f.set_layout_engine('compressed')
f.canvas.draw()
plt.show()

E1 eigenvector values for all merged samples at 500kb resolution for round spermatid, as well as the interaction matrix. E1 was restricted to top: Full-chromosome, middle: Chromosome-arms, bottom: 10Mb window.

E1 eigenvector values for all merged samples at 500kb resolution for round spermatid, as well as the interaction matrix. E1 was restricted to top: Full-chromosome, middle: Chromosome-arms, bottom: 10Mb window.

Smooth the Observed/Expected matrix

Wang et al. 2019 used a 400Kb window with a step size of 100Kb to smooth the O/E matrix before calculating the E1. I have tried various different methods to smooth the interaction matrix, but I can’t get dimensions of the pixels to match with the bins. Also, cooltools.eigs_cis need a full cooler to fetch data from, and I can’t get it to work without a pixels object.

Thus, I will try smoothing the E1 values instead, as I can’t really rationalize what biological difference it would make. I think I need help figuring that out.

# import glob
# import os.path as op
# import cooler

# mcools = glob.glob("../steps/bwa/PE/cool/*/*.mcool")
# res = "::resolutions/100000"

# clrs100kb = {op.basename(op.dirname(mcool)): cooler.Cooler(mcool+res) for mcool in mcools}

# chron_order = ['fibroblast', 'spermatogonia', 'pachytene_spermatocyte', 'round_spermatid', 'sperm']

# clrs100kb = {key: clrs100kb[key] for key in chron_order}
# clrs100kb

Calculate gc covariance (from the reference genome)

Do this with any of the clrs - it just needs the bins positions.

# import bioframe
# import pandas as pd
# import os.path as op

# bins = clrs100kb['fibroblast'].bins().fetch('chrX')[:]
# out_name = 'rheMac10_gc_cov_X_100kb.tsv'

# rheMac10 = bioframe.load_fasta('../data/links/ucsc_ref/rheMac10.fa')
# if not op.exists(out_name):
#     print('Calculate the fraction of GC basepairs for each bin')
#     gc_cov = bioframe.frac_gc(bins[['chrom', 'start', 'end']], rheMac10)
#     gc_cov.to_csv(out_name, index=False,sep='\t')
#     print(gc_cov.info())
# else: 
#     print("Already exists, read from file")
#     gc_cov = pd.read_csv(out_name, sep='\t')
#     print(gc_cov.info())
# import cooler
# import numpy as np

# def smooth_matrix(matrix, window_size=4, step_size=1):
#     n = matrix.shape[0]
#     smoothed_matrix = np.zeros_like(matrix)
    
#     for i in range(0, n - window_size + 1, step_size):
#         for j in range(0, n - window_size + 1, step_size):
#             window = matrix[i:i+window_size, j:j+window_size]
#             smoothed_matrix[i:i+window_size, j:j+window_size] = np.sum(window)
    
#     return smoothed_matrix

# # Load the Hi-C interaction matrix from a multi-resolution .mcool file
# cooler_file = '../steps/bwa/PE/cool/fibroblast/fibroblast.mcool'
# clr = cooler.Cooler(cooler_file + '::resolutions/100000')

# # Extract the matrix
# matrix = clr.matrix(balance=True).fetch('chrX')[:]

# # Smooth the matrix
# smoothed_matrix = smooth_matrix(matrix, window_size=4, step_size=1)
# #clr.bins().fetch('chrX')[:]
# clr.pixels().fetch('chrX')[:]
# #print(matrix.shape)
# #print(smoothed_matrix.shape)
bin1_id bin2_id count
145469195 26898 26898 2013
145469196 26898 26899 171
145469197 26898 26901 75
145469198 26898 26902 100
145469199 26898 26903 187
... ... ... ...
146369705 28431 28532 2
146369706 28431 28533 1
146369707 28431 28534 2
146369708 28431 28539 1
146369709 28431 28544 1

900515 rows × 3 columns

# import pandas as pd
# import cooler
# import numpy as np

# pixels = pd.DataFrame(smoothed_matrix).stack().rename_axis(['bin1_id', 'bin2_id']).reset_index(name='count')
# pixels = pixels[pixels['bin1_id'] <= pixels['bin2_id']]
# pixels.sort_values(['bin1_id', 'bin2_id'])[['bin1_id', 'bin2_id', 'count']].reset_index(drop=True)
# # Get the eigs

# import cooltools
# import pandas as pd

# eigvecs, eigvals = cooltools.lib.numutils.get_eig(matrix, n=1)
# # create a cooler for the smoothed matrix

# import cooler
# import numpy as np

# # Create a new cooler file for the smoothed matrix
# smoothed_cooler_file = cooler_file.replace('.mcool', '.smoothed.mcool')

# # Create a new cooler object
# cooler.create_cooler(smoothed_cooler_file, bins=clr.bins()[:], pixels=smoothed_matrix)