Compartments Analysis

Analysis of Chromatin Compartments using the Open2C Ecosystem

Working with coolers

In this notebook, we use files generated by the workflow master_workflow that, in short, does the following:

Download the data from the source
Index the reference genome with bwa index and samtools faidx
Align the reads to the reference genome with bwa mem
Pair and sort the reads to .pairs files with pairtools parse | pairtools sort
Deduplicate the pairs with pairtools dedup
Convert the .pairs to cooler files with cooler cload pairs

We will:

Load the cooler files
Merge the coolers from the same BioSample ID –> Create ‘replicates’
Zoomify the merged cooler files (coarsen) to create a multicooler (.mcool) file
Balance the matrices (use commandline cooler balance)
Calculate E1 compartments with cooltools compute_cis_eig

Overview

Data (Accessions)

To get an overview of the data accessions used in this analysis, we will first summarize the SRA-runtable.tsv that contains the accession numbers and some metadata for each sample (Table 17.1).

Table 5.1: The most relevant columns of the SRA-runtable.tsv file

	source_name	BioSample	Run	GB	Bases	Reads
16	fibroblast	SAMN08375237	SRR6502335	29.771059	73,201,141,800	244,003,806
17	fibroblast	SAMN08375237	SRR6502336	22.755361	65,119,970,100	217,066,567
18	fibroblast	SAMN08375236	SRR6502337	21.434722	52,769,196,300	175,897,321
19	fibroblast	SAMN08375236	SRR6502338	21.420030	52,378,949,100	174,596,497
20	fibroblast	SAMN08375236	SRR6502339	10.207410	28,885,941,600	96,286,472
9	fibroblast	SAMN08375237	SRR7349189	52.729173	139,604,854,200	465,349,514
10	fibroblast	SAMN08375236	SRR7349190	53.085520	142,008,353,400	473,361,178
21	pachytene spermatocyte	SAMN08375234	SRR6502342	60.258880	150,370,993,500	501,236,645
22	pachytene spermatocyte	SAMN08375234	SRR6502344	27.146048	65,697,684,300	218,992,281
23	pachytene spermatocyte	SAMN08375234	SRR6502345	26.202707	63,490,538,700	211,635,129
0	pachytene spermatocyte	SAMN09427370	SRR7345458	55.970557	153,281,577,900	510,938,593
1	pachytene spermatocyte	SAMN09427370	SRR7345459	53.982492	144,993,841,200	483,312,804
11	pachytene spermatocyte	SAMN08375235	SRR7349191	51.274476	137,821,979,100	459,406,597
24	round spermatid	SAMN08375232	SRR6502351	20.924497	55,095,075,300	183,650,251
25	round spermatid	SAMN08375232	SRR6502352	41.133960	115,578,475,800	385,261,586
26	round spermatid	SAMN08375232	SRR6502353	36.444117	96,195,161,400	320,650,538
2	round spermatid	SAMN09427369	SRR7345460	38.244654	104,105,827,200	347,019,424
3	round spermatid	SAMN09427369	SRR7345461	53.996261	144,532,309,500	481,774,365
12	round spermatid	SAMN08375232	SRR7349192	52.384556	140,431,608,000	468,105,360
29	sperm	SAMN08375229	SRR6502360	26.653940	64,752,370,800	215,841,236
30	sperm	SAMN08375228	SRR6502362	23.973440	58,369,232,700	194,564,109
13	sperm	SAMN08375229	SRR7349193	52.806276	141,148,572,300	470,495,241
14	sperm	SAMN08375229	SRR7349195	22.444378	60,523,788,600	201,745,962
15	sperm	SAMN08375229	SRR7349196	38.253606	104,119,671,000	347,065,570
27	spermatogonia	SAMN08375231	SRR6502356	22.845286	58,909,579,800	196,365,266
28	spermatogonia	SAMN08375231	SRR6502357	17.947471	46,888,332,900	156,294,443
4	spermatogonia	SAMN09427379	SRR7345462	18.686342	52,032,780,000	173,442,600
5	spermatogonia	SAMN09427379	SRR7345463	29.956561	82,384,836,000	274,616,120
6	spermatogonia	SAMN09427379	SRR7345464	39.145759	105,153,716,100	350,512,387
7	spermatogonia	SAMN09427378	SRR7345465	35.816184	96,048,594,600	320,161,982
8	spermatogonia	SAMN09427378	SRR7345467	28.396816	77,248,140,900	257,493,803

Table 5.2: Summary of the data accessions used in this analysis

	source_name	GB	Bases	Reads
0	fibroblast	211.403275	553,968,406,500	1,846,561,355
1	pachytene spermatocyte	274.835160	715,656,614,700	2,385,522,049
2	round spermatid	243.128044	655,938,457,200	2,186,461,524
3	sperm	164.131640	428,913,635,400	1,429,712,118
4	spermatogonia	192.794420	518,665,980,300	1,728,886,601

Folder structure

For ease of mind, here is the folder structure of the project. ../steps/bwa/PE/ is the base directory artificially defined in the master_workflow.py. It ccould be any other directory inside steps. It is defined relative to the master_workflow.py file (inside the worklow), and converted to an absolute path by python.

../steps/bwa/PE
├── bamfiles
│   ├── fibroblast
│   ├── pachytene_spermatocyte
│   ├── round_spermatid
│   ├── sperm
│   └── spermatogonia
├── cool
│   ├── fibroblast
│   ├── pachytene_spermatocyte
│   ├── round_spermatid
│   ├── sperm
│   └── spermatogonia
└── pairs
    ├── fibroblast
    ├── pachytene_spermatocyte
    ├── round_spermatid
    ├── sperm
    └── spermatogonia

18 directories

Figure 5.1

FullMerge (pool all from each source_name)

We will use cooler merge to merge all samples in each sub-folder (cell type) to just one interaction matrix for each cell type. The reason for that is that we choose to trust (Wang et al. 2019) when they say that compartments are highly reproducible between replicates, and by merging all replicates, we will have a more robust signal.

Thus, we will merge all samples from the same source_name into a single cooler file. The coolers are produced to only contain the real chromosomes (1-22, X, Y), and not the mitochondrial DNA or the unplaced contigs (~2,900), as they would not bring any information to the analysis.

Overview of this section:

Locate the coolers (glob –> dictionary)
Merge the coolers (cooler.merge_coolers)
Zoomify the merged cooler (cooler.zoomify_cooler) to resolutions: 10kb, 50kb, 100kb, 500kb.
Balance the matrices (!cooler balance) (use the CLI, as it is more easily parallelized)

Create cooler dictionary (`glob`)

First, we will create a dictionary with the paths to the coolers for each sample. We use glob.glob to fetch all the coolers in each sub-folder that remain after filtering and (automatic) quality control (those are the ones with nodups in their names).

import glob
import os.path as op
from pprint import pprint as pp
import pandas as pd

# Get the list of cell type dirs
base_dir = '../steps/bwa/PE/cool'
folders = glob.glob(op.join(base_dir, '*'))

files_dict = {f:glob.glob(f"{f}/*.nodups.*") for f in folders}
cooler_dict = {op.basename(k): [op.basename(f) for f in v] for k,v in files_dict.items()}
#pp(cooler_dict)
df = pd.DataFrame.from_dict(cooler_dict, orient='index').T.fillna('-')
df[['fibroblast', 'spermatogonia', 'pachytene_spermatocyte', 'round_spermatid', 'sperm']]

Table 5.3: Dictionary of coolers for each cell type

	fibroblast	spermatogonia	pachytene_spermatocyte	round_spermatid	sperm
0	SRR6502339.nodups.10000.cool	SRR6502357.nodups.10000.cool	SRR7345459.nodups.10000.cool	SRR7349192.nodups.10000.cool	SRR7349196.nodups.10000.cool
1	SRR7349190.nodups.10000.cool	SRR7345467.nodups.10000.cool	SRR6502344.nodups.10000.cool	SRR6502353.nodups.10000.cool	SRR6502362.nodups.10000.cool
2	SRR7349189.nodups.10000.cool	SRR6502356.nodups.10000.cool	SRR6502342.nodups.10000.cool	SRR6502352.nodups.10000.cool	SRR7349193.nodups.10000.cool
3	SRR6502335.nodups.10000.cool	SRR7345464.nodups.10000.cool	SRR7345458.nodups.10000.cool	SRR6502351.nodups.10000.cool	SRR6502360.nodups.10000.cool
4	SRR6502338.nodups.10000.cool	SRR7345462.nodups.10000.cool	SRR6502345.nodups.10000.cool	SRR7345460.nodups.10000.cool	SRR7349195.nodups.10000.cool
5	SRR6502336.nodups.10000.cool	SRR7345465.nodups.10000.cool	SRR7349191.nodups.10000.cool	SRR7345461.nodups.10000.cool	-
6	SRR6502337.nodups.10000.cool	SRR7345463.nodups.10000.cool	-	-	-

Merge coolers

The coolers are merged by summing each bin in the matrices, meaning we can only merge matrices with same dimensions. We iterate through the dictionary and merge the coolers with cooler merge. The mergebuf parameter should be adjusted if you don’t have 32G memory. Default: mergebuf = 20000000. Below, we also check if the output file already exists. If it does, we skip the merge.

# NB adjust `mergebuf` if you don't have 32G of RAM

import cooler

for folder,cooler_list in cooler_dict.items():
    in_uris = [op.join(base_dir, folder, file) for file in cooler_list]
    out_uri = op.join(base_dir, folder, f'{folder}.fullmerge.cool')

    if op.exists(out_uri):
        print(f"Skipping {out_uri}: exists...")
        continue

    print(f"Creating {out_uri} by \nMerging {len(cooler_list)} coolers into one:", end=" ")
    print("\t",[file.split('.')[0] for file in cooler_list])
    cooler.merge_coolers(output_uri=out_uri,
                         input_uris=in_uris,
                         mergebuf=int(5e7),
                         )
    print("... Done!")

Skipping ../steps/bwa/PE/cool/round_spermatid/round_spermatid.fullmerge.cool: exists...
Skipping ../steps/bwa/PE/cool/spermatogonia/spermatogonia.fullmerge.cool: exists...
Skipping ../steps/bwa/PE/cool/sperm/sperm.fullmerge.cool: exists...
Skipping ../steps/bwa/PE/cool/fibroblast/fibroblast.fullmerge.cool: exists...
Skipping ../steps/bwa/PE/cool/pachytene_spermatocyte/pachytene_spermatocyte.fullmerge.cool: exists...

Zoomify the merged cooler files

After merging the coolers, we will zoomify them to resolutions: 10kb, 50kb, 100kb, 500kb. It will recursively create the zoom levels for the cooler file from the ‘base’ resolution (in our case 10kb bins) to the ‘max’ resolution (in our case 500kb bins), and save all zoom levels in a multi-resolution cooler (.mcool) file. The resolutions are stored under the resolutions key in the cooler file (e.g. cell_type.mcool::/resolutions/10000).

Here, we also check if the output file already exists. If it does, we skip the zoomify. Again, can tailor the chunksize (pixels loaded per process) parameter according to memory availability. I found that the command stalled completely (without quitting) when the memory allocation was not sufficient, as I started out with 32 cores and 32G of RAM. I did not test what was more time-efficient; increasing number of processes or chunksize. After all, it didn’t take long to run anyways.

Finally, I list the resolutions in the mcool file.

# NB 8 cores and 32G of RAM was used

import glob
import cooler
import os.path as op

base_dir = '../steps/bwa/PE/cool'
merged_coolers = glob.glob(op.join(base_dir, '*/*.fullmerge.cool'))


for clr in merged_coolers:
    out_uri = clr.replace('.fullmerge.cool', '.mcool')
    if op.exists(out_uri):
        print(f"Skipping {out_uri}: exists...")
        continue

    print(f"Zoomifying cooler: \n\t   {clr}\n\t-> {out_uri}", end="")
    
    cooler.zoomify_cooler(base_uris = clr,
                          outfile = out_uri,
                          resolutions = [10000,50000,100000,500000],
                          chunksize = 10000000,
                          nproc = 8)
    print(" --> done")

Skipping ../steps/bwa/PE/cool/round_spermatid/round_spermatid.mcool: exists...
Skipping ../steps/bwa/PE/cool/spermatogonia/spermatogonia.mcool: exists...
Skipping ../steps/bwa/PE/cool/sperm/sperm.mcool: exists...
Skipping ../steps/bwa/PE/cool/fibroblast/fibroblast.mcool: exists...
Skipping ../steps/bwa/PE/cool/pachytene_spermatocyte/pachytene_spermatocyte.mcool: exists...

import glob
import cooler
mcools = glob.glob("../steps/bwa/PE/cool/*/*.mcool")

for mcool in mcools:
    print(f"{mcool}:")
    print(cooler.fileops.list_coolers(mcool))
    print()

../steps/bwa/PE/cool/round_spermatid/round_spermatid.mcool:
['/resolutions/10000', '/resolutions/50000', '/resolutions/100000', '/resolutions/500000']

../steps/bwa/PE/cool/spermatogonia/spermatogonia.mcool:
['/resolutions/10000', '/resolutions/50000', '/resolutions/100000', '/resolutions/500000']

../steps/bwa/PE/cool/sperm/sperm.mcool:
['/resolutions/10000', '/resolutions/50000', '/resolutions/100000', '/resolutions/500000']

../steps/bwa/PE/cool/fibroblast/fibroblast.mcool:
['/resolutions/10000', '/resolutions/50000', '/resolutions/100000', '/resolutions/500000']

../steps/bwa/PE/cool/pachytene_spermatocyte/pachytene_spermatocyte.mcool:
['/resolutions/10000', '/resolutions/50000', '/resolutions/100000', '/resolutions/500000']

Balance the matrices

Finally, we balance the matrices using the cooler CLI. We use the cooler balance command with the default options using 32 cores, which iteratively balances the matrix (Iterative Corecction). It is first described as a method for bias correction of Hi-C matrices in (Imakaev et al. 2012), where it is paired with eigenvector decomposition (ICE). Here the eigenvector decomposition of the obtained maps is shown to provide insights into local chromatin states.

According to cooler documentation, we have to balance the matrices on each resolution, and thus it cannot be done prior to zoomifying. The argument is that the balancing weights are resolution-specific and will no longer retain its meaning when binned with other weights. Therefore, we use a nested for-loop that iterates through all the .mcools and all the resolutions in each .mcool. cooler balance will create a new column in the bins group of each cooler , weight, which can then be included or not in the downstream analysis. This means we will have access to both the balanced and the unbalanced matrix.

The default mode uses genome-wide data to calculate the weights for each bin. It would maybe be more suitable to calculate the weights for cis contacts only, and that is possible through the --cis-only flag, and that can be added to another column, so that we can compare the difference between the two methods easily. However, we will only use the default mode for now. The default options are:

ignore-diags: 2 (ignore the diagonals (-1,0,1))
mad-max: 5 (median absolute deviation threshold)
min-nnz: 10 (minimum number of non-zero entries before exclusion)

Conveniently, cooler balance automatically checks if there is already a weight column, and skips the balancing if it already exists. We can overwrite with --force.

%%capture balance_out --no-stderr

import sys
import glob

mcools = glob.glob("../steps/bwa/PE/cool/*/*.mcool")
resolutions = [10000, 50000, 100000, 500000]

for mcool in mcools:
    print(f"Balancing {mcool}:", file=sys.stderr)
    for res in resolutions:
        full_name = f"{mcool}::resolutions/{res}"
        print(f"\tresolution {res}...", end=" ", file=sys.stderr)
        # First, just default values
        !cooler balance -p 32 {full_name}
        # With only cis-contacts, save column seprately in the cooler file
        #!cooler balance -p 32 -c 20000000 --cis-only -n cis_weights {full_name}
        print("--> Done!", file=sys.stderr)

Balancing ../steps/bwa/PE/cool/round_spermatid/round_spermatid.mcool:
    resolution 10000... --> Done!
    resolution 50000... --> Done!
    resolution 100000... --> Done!
    resolution 500000... --> Done!
Balancing ../steps/bwa/PE/cool/spermatogonia/spermatogonia.mcool:
    resolution 10000... --> Done!
    resolution 50000... --> Done!
    resolution 100000... --> Done!
    resolution 500000... --> Done!
Balancing ../steps/bwa/PE/cool/sperm/sperm.mcool:
    resolution 10000... --> Done!
    resolution 50000... --> Done!
    resolution 100000... --> Done!
    resolution 500000... --> Done!
Balancing ../steps/bwa/PE/cool/fibroblast/fibroblast.mcool:
    resolution 10000... --> Done!
    resolution 50000... --> Done!
    resolution 100000... --> Done!
    resolution 500000... --> Done!
Balancing ../steps/bwa/PE/cool/pachytene_spermatocyte/pachytene_spermatocyte.mcool:
    resolution 10000... --> Done!
    resolution 50000... --> Done!
    resolution 100000... --> Done!
    resolution 500000... --> Done!

# Display the balancing stdout if you want (but it is up to 200 lines per cooler)
#balance_out()

RepMerge (pool all from each BioSample ID)

Initially, I planned to re-create the replicates that they used in the paper, but I reason that it is not necessary and might even be better to pool all the samples in stead. They state alredy that their compartments are highly repdoducible between replicates, so I choose to trust that and not bother with the replicates.

Therefore, this section was only briefly initialized, and now it is commented out.

#df.groupby(['source_name','BioSample'])['Reads'].sum()

# from pprintpp import pprint as pp

# grouped_df = df.groupby(['source_name','BioSample'])

# # Initialize an empty dictionary
# rep_dict = {}

# # Iterate over each group
# for (source_name, BioSample), group in grouped_df:
#     # Extract the 'Run' column and convert it to a list
#     run_list = group['Run'].tolist()
    
#     # Populate the nested dictionary
#     if source_name not in rep_dict:
#         rep_dict[source_name] = {}
#     rep_dict[source_name][BioSample] = run_list

# pp(rep_dict)

# for source_name, BioSample_dict in rep_dict.items():
#     print(f"source_name: {source_name}")
#     print(f"Changing working dir: {source_name}/")
#     for BioSample, run_list in BioSample_dict.items():
#         print(f"Merging samples for BioSample: {BioSample}: {run_list}")

Plotting

The following section contains the visualization of the matrices and includes the calculation of the E1 compartments and their visualization. I discuss differents methods for construction and try to get both methodologically and result-wise close to (Wang et al. 2019).

First, we will work on matrices in 500kb resolution, as documentation [maybe it was HiCExplorer??] states it is sufficient for chromosome-wide analysis and plotting. We will follow the cooltools-recommended pipeline (except that they use a 100kb cooler) for visualization and compartment calling with one example cooler (the merged fibroblast cooler at 500kb resolution).

Then, the most relevant parts will be generalized to run in a loop over all the coolers. First at 500kb resolution, then at 100kb resolution.

To accomodate the approach (barely) described in the paper, we will discuss a smoothing step to the observed/expected matrices before compartment calling, and we will apply a smoothing step to the E1 compartments and compare to the raw compartments.

500kb resolution

Example with a single sample

First, I will explore the visualization pipeline for a single cooler at 500kb resolution. I will modify the plot to be ‘stairs’ in stead of just a regular line plot, as it is both a more accurate representation of the data and it is more aesthetically pleasing with less spiky lines and holes.

In practice, the length of the dataframe is doubled, as it now contains an E1 value for both the start and end position for each bin in stead of only for the start. However, I first make the regular line plot to show the difference.

Imports

# import standard python libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import os, subprocess

# Import python package for working with cooler files and tools for analysis
import cooler
import cooltools.lib.plotting
import cooltools

Load cooler

mclr = "../steps/bwa/PE/cool/fibroblast/fibroblast.mcool"

clr = cooler.Cooler(f"{mclr}::resolutions/500000")

Calculate gc covariance (from the reference genome)

I will use bioframe to load the reference and calculate the GC content of each bin in the cooler file. The convention in Hi-C is to use GC content as a phasing track to orient eigenvector track to positively correlated with the GC content. In this subsection, I calculate the GC content for all chromosomes and filter afterwards to only include the ‘X’ chromosome. I will do that smarter in the next subsection.

import bioframe
import os

bins = clr.bins()[:]
rheMac10 = bioframe.load_fasta('../data/links/ucsc_ref/rheMac10.fa')
gc_cov_csv = '../steps/rheMac10_gc_cov_500kb.tsv'
if not os.path.exists(gc_cov_csv):
    print('Calculate the fraction of GC basepairs for each bin')
    gc_cov = bioframe.frac_gc(bins[['chrom', 'start', 'end']], rheMac10)
    gc_cov.to_csv(gc_cov_csv, index=False, sep='\t')
    display(gc_cov)
else: 
    print("Already exists, just read from file:")
    gc_cov = pd.read_csv(gc_cov_csv, sep='\t')
    display(gc_cov)

Already exists, just read from file:

	chrom	start	end	GC
0	chr1	0	500000	0.424508
1	chr1	500000	1000000	0.378836
2	chr1	1000000	1500000	0.388272
3	chr1	1500000	2000000	0.445226
4	chr1	2000000	2500000	0.439485
...	...	...	...	...
5715	chrY	9500000	10000000	0.399518
5716	chrY	10000000	10500000	0.396028
5717	chrY	10500000	11000000	0.395280
5718	chrY	11000000	11500000	0.382006
5719	chrY	11500000	11753682	NaN

5720 rows × 4 columns

Calculate the E1 compartments

Here, I will calculate the E1 compartments for the ‘X’ chromosome in the fibroblast cooler at 500kb resolution. I will use the cooltools package to do eigendecomposition and calculate the E1 compartments. I use the GC content (gc_cov obtained above) as a phasing track.

I will only calculate the within-chromosomes compartmentalization (cis contacts). I use cooltools.eigs_cis that decorrelate the contact-frequency by distance before performing the eigendecomposition.

I this example, I will use the simplest ‘view’ (chromosome-wide) of the chromosome to calculate the E1 values, but later I will explore how partitioning the chromosome affects the E1 values. As the GC content was calculated for all chromosomes, the eigenvectors are also calculated for the whole matrix. This will also be optimized in the next subsection, as we only ever look at the ‘X’ chromosome.

# Make the full view frame
view_df = pd.DataFrame(
    {
    'chrom': clr.chromnames,
    'start': 0,
    'end': clr.chromsizes.values,
    'name': clr.chromnames
    }
                      )
#display(view_df)

# obtain first 3 eigenvectors
cis_eigs = cooltools.eigs_cis(
                        clr,
                        gc_cov,
                        view_df=view_df,
                        n_eigs=3,
                        )

# cis_eigs[0] returns eigenvalues, here we focus on eigenvectors
eigenvector_track = cis_eigs[1][['chrom','start','end','E1']]
len(eigenvector_track)

# full track
eigenvector_track

	chrom	start	end	E1
0	chr1	0	500000	-0.756052
1	chr1	500000	1000000	-1.001063
2	chr1	1000000	1500000	-0.928042
3	chr1	1500000	2000000	-0.507579
4	chr1	2000000	2500000	0.123708
...	...	...	...	...
5715	chrY	9500000	10000000	-0.499798
5716	chrY	10000000	10500000	-0.588820
5717	chrY	10500000	11000000	-0.988677
5718	chrY	11000000	11500000	-0.566585
5719	chrY	11500000	11753682	NaN

5720 rows × 4 columns

As I have made myself a detour and calculated the eigenvectors for all chromosomes, I will now filter the eigenvector track to only include the ‘X’ chromosome. Then, I will (in the name of explicitness) construct an array of all the indices where the E1 values change sign. This is the simplest way to save the coordinates of the compartment boundaries. It can later on be saved to .csv or whatever for comparison between different samples. Also, it will be simplified in the next subsection.

# subset the chrX   
eigenvector_track_chrX = eigenvector_track.loc[eigenvector_track['chrom'] == 'chrX']
nbins = len(eigenvector_track_chrX)
eigenvector_track_chrX
e1X_values = eigenvector_track_chrX['E1'].values

# Where does the E1 change sign
np.where(np.diff( (cis_eigs[1][cis_eigs[1]['chrom']=='chrX']['E1']>0).astype(int)))[0]

array([  1,   6,  16,  26,  29,  40,  42,  49,  54,  55,  59,  60,  61,
        62,  74,  83,  86, 101, 102, 103, 104, 108, 123, 124, 129, 138,
       141, 142, 147, 149, 185, 188, 194, 196, 197, 201, 206, 209, 211,
       215, 222, 224, 228, 233, 238, 240, 250, 253, 256, 265, 273, 274,
       288, 289, 290, 293, 296, 302])

Plot E1 and matrix

Finally, I will plot the matrix, the GC content, and the E1 compartments for the ‘X’ chromosome in the fibroblast cooler at 500kb resolution. I will use the cooltools package to plot the matrix and the compartments. I will use the matplotlib package to plot the GC content.

I will plot the matrix as a heatmap, the GC content as a line plot, and the E1 compartments as a line plot. I will also plot the compartment boundaries as vertical lines. The compartments are colored according to the sign of the E1 values.

Following the cooltools recommendation, the colorbar itself will be log-transformed, as otherwise the balanced interaction matrix (values from 0 to 1) will be hard to interpret. The E1 values are (as is the colorbar) added to the axis with an AxesDivider.append_axes object to make them match the axes of the matrix.

I’m testing out the %%capture magic to capture the plot to a variable, then displaying it in the next cell. Initially, the plot took a while to generate, and I wanted to avoid re-generating when adjusting graphics on cell level with the YAML options. Now it is not necessary, but I keep it for the sake of the example.

%%capture chrX_matrix_e1_500kb

from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable

f, ax = plt.subplots(
    figsize=(15, 10),
)

norm = LogNorm(vmax=0.1)

im = ax.matshow(
    clr.matrix().fetch('chrX'),
    norm=norm,
    cmap='fall',
);
plt.axis([0,nbins,nbins,0])


divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
plt.colorbar(im, cax=cax, label='corrected frequencies');
ax.set_ylabel('chrX:500kb bins, #bin')
ax.xaxis.set_visible(False)

ax1 = divider.append_axes("top", size="20%", pad=0.05, sharex=ax)
#weights = clr.bins()[:]['weight'].values
#ax1.plot([0,nbins],[0,0],'k',lw=0.25)

#ax1.plot(e1X_values, label='E1')

# Fill between the line and 0
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values > 0), color='tab:red')
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values < 0), color='tab:blue')

ax1.set_ylabel('E1')
ax1.set_xticks([]);
ax1.get_subplotspec()


for i in np.where(np.diff( (cis_eigs[1][cis_eigs[1]['chrom']=='chrX']['E1']>0).astype(int)))[0]:
    # Horisontal lines where E1 intersects 0
    ax.plot([0,nbins],[i,i],'k',lw=0.4)

    # Vertical lines where E1 intersects 0
    ax.plot([i,i],[0,nbins],'k',lw=0.4)

chrX_matrix_e1_500kb.outputs[0]

Figure 6.1: chrX interaction matrix with E1 eigenvector values in 500kb resolution. The sign change of the E1 is overlayed as thin black lines, making it more easily interpretable. We should qualitatively determine how well the E1 values capture the *plaid* pattern in the matrix.

Markdown: Figure 10.1

We observe that the E1 values only somewhat captures the plaid pattern in the matrix Figure 10.1, but it is not related to the size of the compartments, as both small and large compartments can be observed. The B-compartment that starts from around bin 150 (75.000.000 bp) seems to have squares that are not captured by the E1 values. Maybe it could be captures by TAD calling, but that is not the scope of this analysis.

import matplotlib.pyplot as plt

f, ax1 = plt.subplots(
    figsize=(10, 2),
)

# Fill between the line and 0
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values > 0), color='tab:red')
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values < 0), color='tab:blue')

#ax1.set_ylabel('E1')
ax1.set_xticks([])
ax1.set_yticks([])

# Remove borders
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.spines['left'].set_visible(False)
ax1.spines['bottom'].set_visible(False)

plt.tight_layout()

# Save the plot as a SVG file
#plt.savefig('e1_plot.svg', bbox_inches='tight')

plt.show()

Figure 6.2: 500kb binned E1 eigenvector values for chrX. Freshly calculated from the cooler file.

Markdown the next plot: Figure 10.3 that is

Stairs plot of the E1 compartments

(Less spiky, more smooth)

import matplotlib.pyplot as plt

f, (ax1, ax2) = plt.subplots(2,1,
    figsize=(10, 4)
)

chrom_start = eigenvector_track_chrX['start'].values
window_size = chrom_start[1] - chrom_start[0]

# Fill between the line and 0
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values > 0), color='tab:red')
ax1.fill_between(range(len(e1X_values)), e1X_values, 0, where=(e1X_values < 0), color='tab:blue')

# Create stairs
x = np.zeros(2*chrom_start.size)
y = np.zeros(2*chrom_start.size)
x[0::2] = chrom_start
x[1::2] = chrom_start + window_size
y[0::2] = e1X_values
y[1::2] = e1X_values

# Layout
ax1.set_ylabel('Spiky E1')
ax2.set_ylabel('Stairs E1')
ax1.set_xticks([])
ax1.set_yticks([])
ax2.set_xticks([])
ax2.set_yticks([])
ax2.set_ylim(-1, 1)

# Remove borders
ax1.spines[:].set_visible(False)
ax2.spines[:].set_visible(False)

ax2.fill_between(x, y, 0, where=(y > 0), color='tab:red')
ax2.fill_between(x, y, 0, where=(y < 0), color='tab:blue')

plt.tight_layout()
# Save the plot as a high-resolution PNG file
#plt.savefig('../steps/e1_plot.png', dpi=320, bbox_inches='tight')

500kb binned E1 eigenvector values for chrX.

All 5 full merges

Load coolers

import glob
import os.path as op
import cooler

mcools = glob.glob("../steps/bwa/PE/cool/*/*.mcool")
res = "::resolutions/500000"

clrs = {op.basename(op.dirname(mcool)): cooler.Cooler(mcool+res) for mcool in mcools}

chron_order = ['fibroblast', 'spermatogonia', 'pachytene_spermatocyte', 'round_spermatid', 'sperm']

clrs = {key: clrs[key] for key in chron_order}
clrs

{'fibroblast': <Cooler "fibroblast.mcool::/resolutions/500000">,
 'spermatogonia': <Cooler "spermatogonia.mcool::/resolutions/500000">,
 'pachytene_spermatocyte': <Cooler "pachytene_spermatocyte.mcool::/resolutions/500000">,
 'round_spermatid': <Cooler "round_spermatid.mcool::/resolutions/500000">,
 'sperm': <Cooler "sperm.mcool::/resolutions/500000">}

Calculate gc covariance (from the reference genome)

Do this with any of the clrs - it just needs the bins positions.

# Try with only the gc_cov for chrX

import bioframe
import pandas as pd
import os.path as op

bins = clrs['fibroblast'].bins().fetch('chrX')[:]
out_name = '../steps/rheMac10_gc_cov_X_500kb.tsv'

rheMac10 = bioframe.load_fasta('../data/links/ucsc_ref/rheMac10.fa')
if not op.exists(out_name):
    print('Calculate the fraction of GC basepairs for each bin')
    gc_cov = bioframe.frac_gc(bins[['chrom', 'start', 'end']], rheMac10)
    gc_cov.to_csv(out_name, index=False,sep='\t')
    print(gc_cov.info())
else: 
    print("Already exists, read from file")
    gc_cov = pd.read_csv(out_name, sep='\t')
    print(gc_cov.info())

Already exists, read from file
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307 entries, 0 to 306
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   chrom   307 non-null    object 
 1   start   307 non-null    int64  
 2   end     307 non-null    int64  
 3   GC      307 non-null    float64
dtypes: float64(1), int64(2), object(1)
memory usage: 9.7+ KB
None

Calculate the E1 compartments

Loop: view_df, cis_eigs, e1_values

Plot GC covariance

import matplotlib.pyplot as plt

f, ax = plt.subplots(
    figsize=(10, 2),
)

ax.plot(gc_cov['start'],gc_cov['GC'])

plt.tight_layout()

Chromosome restricted E1 compartments

# Use gc_cov to calculate eigenvectors with cooltools.eigs_cis

import cooltools
import pandas as pd

eigs_full = {}
e1_values_full = {}

chrX_size = clrs['fibroblast'].chromsizes['chrX']

# Divide into chromosome arms
clr = clrs['fibroblast']
view_df_full = pd.DataFrame(
    {
    'chrom': 'chrX',
    'start': 0,
    'end': chrX_size,
    'name': 'chrX'
    }, index=[0]
                      )

for name, clr in clrs.items():
    print(f"Calculating eigenvectors for {name}, size {clr.binsize}")
    cis_eigs_full = cooltools.eigs_cis(
                        clr,
                        gc_cov,
                        view_df=view_df_full,
                        n_eigs=3,
                        )
    eigs_full[name] = cis_eigs_full[1]
    e1_track_full = cis_eigs_full[1][['chrom','start','end','E1']]
    e1_values_full[name] = e1_track_full['E1'].values

#eigs

Calculating eigenvectors for fibroblast, size 500000

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for spermatogonia, size 500000

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for pachytene_spermatocyte, size 500000

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for round_spermatid, size 500000

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for sperm, size 500000

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Chromosome arms restricted

# Use gc_cov to calculate eigenvectors with cooltools.eigs_cis

import cooltools
import pandas as pd

eigs = {}
e1_values = {}

chrX_size = clrs['fibroblast'].chromsizes['chrX']

# Divide into chromosome arms
view_df = pd.DataFrame(
    {
    'chrom': 'chrX',
    'start': [0, 59_000_001],
    'end': [59_000_000, chrX_size],
    'name': ['X_short', 'X_long']
    }, index=[0,1]
                      )

for name, clr in clrs.items():
    print(f"Calculating eigenvectors for {name}")
    cis_eigs = cooltools.eigs_cis(
                        clr,
                        gc_cov,
                        view_df=view_df,
                        n_eigs=3,
                        )
    eigs[name] = cis_eigs[1]
    e1_track = cis_eigs[1][['chrom','start','end','E1']]
    e1_values[name] = e1_track['E1'].values

#eigs

Calculating eigenvectors for fibroblast
Calculating eigenvectors for spermatogonia

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for pachytene_spermatocyte
Calculating eigenvectors for round_spermatid

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")
/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for sperm

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Refined: 10Mb windows restricted

# Use gc_cov to calculate eigenvectors with cooltools.eigs_cis

import cooltools

eigs_10mb = {}
e1_values_10mb = {}

chrX_size = clrs['fibroblast'].chromsizes['chrX']

# Calculate in 1Mb windows
# Define the window size (1Mb)

window_size = 10_000_000

# Generate the start and end positions for each window
start_positions = list(range(0, chrX_size, window_size))
end_positions = [min(start + window_size, chrX_size) for start in start_positions]
# Create the DataFrame
view_df_10mb = pd.DataFrame({
    'chrom': ['chrX'] * len(start_positions),
    'start': start_positions,
    'end': end_positions,
    'name': [f'X_{i}' for i in range(len(start_positions))]
})
#display(view_df)

for name, clr in clrs.items():
    print(f"Calculating eigenvectors for {name}")
    cis_eigs_10mb = cooltools.eigs_cis(
                        clr,
                        gc_cov,
                        view_df=view_df_10mb,
                        n_eigs=3,
                        )
    eigs_10mb[name] = cis_eigs_10mb[1]
    e1_track_10mb = cis_eigs_10mb[1][['chrom','start','end','E1']]
    e1_values_10mb[name] = e1_track_10mb['E1'].values

#eigs

Calculating eigenvectors for fibroblast

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for spermatogonia

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for pachytene_spermatocyte

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for round_spermatid

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for sperm

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Save the A compartment coordinates

### Just testing for a single sample

# import numpy as np
# import pandas as pd

# e1 = pd.Series(e1_values_full['sperm'])
# res = 500_000

# e1_filled = e1.ffill(limit=2,limit_area='inside')

# # Type is preserved for boolean arrays, so the result will contain 
# # False when consecutive elements are the same and True when they differ.
# # We then use np.where to get the indices where the values change sign
# sign_change_coords = np.where(np.diff((e1_filled > 0).astype(int)))[0]
# # Filter out the indices that point to a NaN value
# sign_change_coords = sign_change_coords[~e1_filled.isna().iloc[sign_change_coords].values]

# a_start_bin = sign_change_coords[e1_filled[sign_change_coords+1] > 0]
# a_end_bin = sign_change_coords[e1_filled[sign_change_coords] > 0]

# # if a_start_bin[0] > b_start_bin[0]:
# #     a_start_bin = np.insert(a_start_bin, 0, 0)
# #     b_start_bin = np.insert(b_start_bin, len(b_start_bin), len(e1))


# display(sign_change_coords)
# print(f"A compartment start {len(a_start_bin)}:", a_start_bin)
# print(f"A compartment end {len(a_end_bin)}:", a_end_bin)

# display(e1_filled[:20])

This is copied to hicstuff.py as a function for later import

# ###### Restriction: full chromosome E1 values #######

# import pandas as pd
# import numpy as np
# import os.path as op

# for name,e1 in e1_values_full.items():
#     e1 = pd.Series(e1)
#     e1_filled = e1.ffill(limit=2)
#     sign_change_coords = np.where(np.diff((e1_filled > 0).astype(int)))[0]

#     a_start_bin = sign_change_coords[e1_filled[sign_change_coords+1] > 0]
#     a_end_bin = sign_change_coords[e1_filled[sign_change_coords] > 0]

#     print(f"Calculating A-compartment intervals for {name}")
#     print(len(a_start_bin), len(a_end_bin))

#     if e1_filled.iloc[-1] > 0:
#         print("Last bin is A, appending end bin")
#         a_end_bin = np.append(a_end_bin, len(e1_filled)-1)
#         print(len(a_start_bin), len(a_end_bin))
    
#     df = pd.DataFrame({
#         'chrom': ['chrX'] * len(a_start_bin),
#         'bin_start': a_start_bin,
#         'bin_end': a_end_bin,
#         'start': a_start_bin*res,
#         'end': a_end_bin*res,
#         'length': (a_end_bin-a_start_bin)*res
#     })
#     csv_name = f'../results/{name}_a_comp_coords_{int(res*0.001)}kb_full.csv'
#     print(csv_name)
#     if not op.exists(csv_name):
#         df[['chrom', 'start', 'end']].to_csv(csv_name, index=False)
#     else:
#         print("Already exists, skipping")
#     print()

######## Restriction: chromosome arm E1 values ########
import importlib
import hicstuff
import os
importlib.reload(hicstuff)
from hicstuff import extract_a_coordinates


# Binsize in bp
res = 500_000
outdir = '../results/compartments/'
if not os.path.exists(outdir):
    os.makedirs(outdir)
    

######## Restriction: full chromosome E1 values ########
restriction = 'full'
for name, e1 in e1_values_full.items():
    extract_a_coordinates(e1=e1, name=name, restriction=restriction, chrom='chrX', res=500_000, csv=True,output_dir=outdir, force=True)

######## Restriction: chromosome arm E1 values ########
restriction = 'arms'
for name, e1 in e1_values.items():
    extract_a_coordinates(e1=e1, name=name, restriction=restriction, chrom='chrX', res=500_000, csv=True,output_dir=outdir, force=True)

######## Restriction: 10Mb E1 values ########
restriction = '10Mb'
for name, e1 in e1_values_10mb.items():
    extract_a_coordinates(e1=e1, name=name, restriction=restriction, chrom='chrX', res=500_000, csv=True,output_dir=outdir, force=True)

Saving eigenvector track to: ../results/compartments/fibroblast_e1_500kb_full.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_500kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_500kb_full.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_500kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_500kb_full.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_500kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_500kb_full.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_500kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_500kb_full.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_500kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/fibroblast_e1_500kb_arms.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_500kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_500kb_arms.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_500kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_500kb_arms.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_500kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_500kb_arms.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_500kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_500kb_arms.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_500kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/fibroblast_e1_500kb_10Mb.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_500kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_500kb_10Mb.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_500kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_500kb_10Mb.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_500kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_500kb_10Mb.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_500kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_500kb_10Mb.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_500kb_10Mb.csv already exists. Overwriting.

Plot NaN histogram

# Check the number of NaN values in the E1 column
import numpy as np

# Check the number of NaN values in the E1 column and create a DataFrame
nan_counts = {k: {'length': len(v), 'NaNs': np.isnan(v).sum()} for k, v in e1_values.items()}
#display(pd.DataFrame.from_dict(nan_counts, orient='index'))

# Locate the NaN values (histogram)
import matplotlib.pyplot as plt

f, ax = plt.subplots(1, 5, figsize=(20,2))
for i, (name, track) in enumerate(eigs.items()):
    e1 = track['E1'].values
    # Locate NaN values
    e1_nan = np.where(np.isnan(e1))
    # Plot histogram
    ax[i].hist(e1_nan, bins=100)
    
    # Plot median line
    median_pos = np.median(e1_nan)
    ax[i].axvline(median_pos, color='r')

    # Layout
    ax[i].set_title(name)
    xticks = np.linspace(0, len(e1), num=5)
    xticks = np.append(xticks, median_pos)  # Add median position to xticks
    ax[i].set_xticks(xticks)
    xticklabels = np.linspace(0, len(e1) * 0.5, num=5, dtype = 'int').tolist()
    xticklabels.append(median_pos*0.5)  # Add median label
    ax[i].set_xticklabels(xticklabels, rotation=45)
    ax[i].set_xlabel('Position (Mbp)')

Histogram of NaN values in the E1 eigenvector for each cell type.

Plot the E1 compartments

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.patches import Rectangle

res = 500_000
names_abbr = {'fibroblast': 'Fibroblast', 'spermatogonia': 'Spermatogonia', 'pachytene_spermatocyte': 'Pachytene Spermatogonia', 'round_spermatid': 'Round Spermatid', 'sperm': 'Sperm'}

chrom_start = e1_track_full['start'].values
chrom_end = e1_track_full['end'].values-1

f, axs = plt.subplots(5, 3, figsize=(18, 8), sharex=True, sharey=True)

# Populate the first column
axs[0,0].set_title('E1: Full-chromosome restricted')
for i, (name, e1) in enumerate(e1_values_full.items()):
    ax = axs[i,0]
    ax.set_ylabel(names_abbr[name])

    #ax.set_title(name)
    
    # Create stairs
    x = np.zeros(2*chrom_start.size)
    y = np.zeros(2*chrom_start.size)
    x[0::2] = chrom_start
    x[1::2] = chrom_end
    y[0::2] = e1
    y[1::2] = e1

    ax.fill_between(x, y, 0, where=(y > 0), color='tab:red', linewidth=0)
    ax.fill_between(x, y, 0, where=(y < 0), color='tab:blue', linewidth=0)

    # Test to see how well my coordinates match the e1 values
    coords = pd.read_csv(f'../results/compartments/{name}_a_comp_coords_500kb_full.csv')
    # Iterate over each interval in the DataFrame
    for start, end in zip(coords['start'], coords['end']):
        rect = Rectangle((start, 0.2), width=end-start, height=0.05, color='k', linewidth=0.05)
        ax.add_patch(rect)

# Populate the second column
axs[0,1].set_title('E1: Chromosome-arms restricted')
for i, (name, e1) in enumerate(e1_values.items()):
    ax = axs[i,1]
    #ax.set_title(name)
    
    # Create stairs
    x = np.zeros(2*chrom_start.size)
    y = np.zeros(2*chrom_start.size)
    x[0::2] = chrom_start
    x[1::2] = chrom_end
    y[0::2] = e1
    y[1::2] = e1

    ax.fill_between(x, y, 0, where=(y > 0), color='tab:red', linewidth=0)
    ax.fill_between(x, y, 0, where=(y < 0), color='tab:blue', linewidth=0)

    # Test to see how well my coordinates match the e1 values
    coords = pd.read_csv(f'../results/compartments/{name}_a_comp_coords_500kb_arms.csv')
    # Iterate over each interval in the DataFrame
    for start, end in zip(coords['start'], coords['end']):
        rect = Rectangle((start, 0.2), width=end-start, height=0.05, color='k', linewidth=0.05)
        ax.add_patch(rect)

# Populate the third column
axs[0,2].set_title('E1: 10Mb restricted (refined)')
for i, (name, e1) in enumerate(e1_values_10mb.items()):
    ax = axs[i,2]
    #ax.set_title(name)
    
    # Create stairs
    x = np.zeros(2*chrom_start.size)
    y = np.zeros(2*chrom_start.size)
    x[0::2] = chrom_start
    x[1::2] = chrom_end
    y[0::2] = e1
    y[1::2] = e1

    ax.fill_between(x, y, 0, where=(y > 0), color='tab:red', linewidth=0)
    ax.fill_between(x, y, 0, where=(y < 0), color='tab:blue', linewidth=0)

    # Test to see how well my coordinates match the e1 values
    coords = pd.read_csv(f'../results/compartments/{name}_a_comp_coords_500kb_10Mb.csv')
    # Iterate over each interval in the DataFrame
    for start, end in zip(coords['start'], coords['end']):
        rect = Rectangle((start, 0.2), width=end-start, height=0.05, color='k', linewidth=0.05)
        ax.add_patch(rect)

# Set y-limits for all subplots
for ax in axs.flat:
    ax.set_ylim(-0.8, 0.8)
    ax.spines[:].set_visible(False)
    ax.set_yticks([])
    ax.set_xticks([])

plt.tight_layout()
plt.savefig('../steps/e1_plot_full.svg', bbox_inches='tight')

E1 eigenvector values for all merged samples at 500kb resolution. Left: Chromosome-arm restricted E1. Right: 10Mb window restricted E1.

Plot matrices with compartments (round spermatid)

import cooltools.lib.plotting
from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable
from cooltools.lib.numutils import adaptive_coarsegrain, interp_nan



clr = clrs['round_spermatid']
chrom_start, chrom_end = e1_track_full['start'].values, e1_track_full['end'].values-1
e1 = e1_values_10mb['round_spermatid']
nbins = len(clr.bins().fetch('chrX'))

f, ax = plt.subplots(
    figsize=(9,6),
)

norm = LogNorm(vmax=0.1)


e1_list = [e1_values_10mb['round_spermatid'], e1_values['round_spermatid'], e1_values_full['round_spermatid']]
e1_names = ['10Mb', 'Arms', 'Full']
colors = ['tab:red', 'tab:blue', 'tab:green']



im = ax.matshow(
    clr.matrix().fetch('chrX'),
    norm=norm,
    cmap='fall',
);
plt.axis([0,nbins,nbins,0])

divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="2%", pad=0.05)
plt.colorbar(im, cax=cax, label='corrected frequencies');
ax.set_ylabel('chrX:500kb bins, #bin')
ax.xaxis.set_visible(False)

for i, e1 in enumerate(e1_list):
    ax1 = divider.append_axes("top", size="10%", pad=0.05, sharex = ax)
    # weights = clr.bins()[:]['weight'].values

    # ax1.plot(e1, label='E1')

    # Create stairs
    x = np.zeros(2*chrom_start.size)
    y = np.zeros(2*chrom_start.size)
    #smooth_y = np.zeros(2*chrom_start.size)
    x[0::2] = chrom_start/500_000
    x[1::2] = chrom_end/500_000
    y[0::2] = e1
    y[1::2] = e1
    # smooth_e1 = (lambda x: x.rolling(5, 1, center=True).sum())(pd.DataFrame(e1, columns=['value']))['value'].values
    # smooth_y[0::2] = smooth_e1
    # smooth_y[1::2] = smooth_e1


    ax1.fill_between(x, y, 0, where=(y > 0), color='tab:red')
    ax1.fill_between(x, y, 0, where=(y < 0), color='tab:blue')

    ax1.set_ylabel(e1_names[i], rotation=70)
    ax1.set_ylim(-0.8, 0.8)
    ax1.set_xticks([])

ax1.set_title('Round Spermatid: 500kb bins')


#    # Plot the sign changes on the matrix 
#     col = colors[i]
#     for i in np.where(np.diff( (pd.Series(e1)>0).astype(int)))[0]:
#         # Horisontal lines where E1 intersects 0
#         ax.plot([0,nbins],[i,i],col,lw=0.5)

#         # Vertical lines where E1 intersects 0
#         ax.plot([i,i],[0,nbins],col,lw=0.5)

plt.tight_layout()
f.canvas.draw()

E1 eigenvector values for all merged samples at 500kb resolution for round spermatid, as well as the interaction matrix. E1 was restricted to top: Full-chromosome, middle: Chromosome-arms, bottom: 10Mb window.

Sliding window summed E1 compartments

To mimic the smoothing applied in the Wang et al. 2019 paper, where they slide a 400kb window in 100kb steps on the obs/exp matrix, we will similarly slide a 400kb window in 100kb steps directly on the E1 compartments.

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable
import pandas as pd
from scipy.signal.windows import triang

resolution = 500_000
window_size = 2_500_000
step_size = window_size // resolution

chrom_start = e1_track['start'].values
chrom_end = e1_track['end'].values-1

f, axs = plt.subplots(5, 2, figsize=(30, 10), sharex=True)

axs[0, 0].set_title('Chromosome-arm E1 (arm restricted)')
axs[0, 1].set_title('Smoothed by summation')

for i, (name, e1) in enumerate(e1_values.items()):
    # print(i, name, e1.size)
    smooth_e1 = (lambda x: x.rolling(5, 1, center=True).sum())(pd.DataFrame(e1, columns=['value']))

    ax0 = axs[i, 0]
    ax1 = axs[i, 1]
    
    # Create stairs
    x = np.zeros(2*chrom_start.size)
    y = np.zeros(2*chrom_start.size)
    smooth_y = np.zeros(2*chrom_start.size)
    x[0::2] = chrom_start
    x[1::2] = chrom_end
    y[0::2] = e1
    y[1::2] = e1
    smooth_y[0::2] = smooth_e1['value'].values
    smooth_y[1::2] = smooth_e1['value'].values


    ax0.fill_between(x, y, 0, where=(y > 0), color='tab:red')
    ax0.fill_between(x, y, 0, where=(y < 0), color='tab:blue')
    
    # Overlay the smoothed line (divided by 5 to make it a mean)
    ax0.plot(x, smooth_y/5, color='C1')

    ax1.fill_between(x, smooth_y, 0, where=(smooth_y > 0), color='tab:red')
    ax1.fill_between(x, smooth_y, 0, where=(smooth_y < 0), color='tab:blue')

    ax0.set_ylabel(name)
    ylim = 1.5
    #ax0.set_ylim(-ylim, ylim)
    #ax1.set_ylim(-ylim*4, ylim*4)

for ax in axs.flat:
    ax.spines[:].set_visible(False)
    ax.set_xticks([])
    #ax.set_yticks([])

# plt.tight_layout()

E1 eigenvector values for all merged samples at 500kb resolution with rolling summation, with window size 2.5Mb, step size 500Kb: Each value is now the sum of the surrounding n=5 bins.

100kb resolution

All 5 full merges

Load coolers

import glob
import os.path as op
import cooler

mcools = glob.glob("../steps/bwa/PE/cool/*/*.mcool")
res = "::resolutions/100000"

clrs = {op.basename(op.dirname(mcool)): cooler.Cooler(mcool+res) for mcool in mcools}

chron_order = ['fibroblast', 'spermatogonia', 'pachytene_spermatocyte', 'round_spermatid', 'sperm']

clrs = {key: clrs[key] for key in chron_order}
clrs

names_abbr = {'fibroblast': 'Fibroblast', 'spermatogonia': 'Spermatogonia', 'pachytene_spermatocyte': 'Pachytene Spermatogonia', 'round_spermatid': 'Round Spermatid', 'sperm': 'Sperm'}

# Calculate chromstart and chromend for each bin on chrX
chrX_size = clrs['fibroblast'].chromsizes['chrX']
chrom_start = clrs['fibroblast'].bins().fetch('chrX')['start'].values
chrom_end = clrs['fibroblast'].bins().fetch('chrX')['end'].values-1
nbins = len(clrs['fibroblast'].bins().fetch('chrX'))
binsize = clrs['fibroblast'].binsize

Calculate gc covariance (from the reference genome)

Do this with any of the clrs - it just needs the bins positions.

# Try with only the gc_cov for chrX

import bioframe
import pandas as pd
import os.path as op

bins = clrs['fibroblast'].bins().fetch('chrX')[:]
out_name = '../steps/rheMac10_gc_cov_X_100kb.tsv'

rheMac10 = bioframe.load_fasta('../data/links/ucsc_ref/rheMac10.fa')
if not op.exists(out_name):
    print('Calculate the fraction of GC basepairs for each bin')
    gc_cov = bioframe.frac_gc(bins[['chrom', 'start', 'end']], rheMac10)
    gc_cov.to_csv(out_name, index=False,sep='\t')
    print(gc_cov.info())
else: 
    print("Already exists, read from file")
    gc_cov = pd.read_csv(out_name, sep='\t')
    print(gc_cov.info())

Already exists, read from file
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1534 entries, 0 to 1533
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   chrom   1534 non-null   object 
 1   start   1534 non-null   int64  
 2   end     1534 non-null   int64  
 3   GC      1533 non-null   float64
dtypes: float64(1), int64(2), object(1)
memory usage: 48.1+ KB
None

Calculate the E1 compartments

Loop: view_df, cis_eigs, e1_values

Plot GC covariance

import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(10, 2))

ax.plot(gc_cov['start'],gc_cov['GC'])
plt.tight_layout()

Create viewframes: full, arms, 10Mb windows

import pandas as pd

# Fetch the chromsize of X from one of the coolers
chrX_size = clrs['fibroblast'].chromsizes['chrX']

views = {}

# Make the full view frame
views['full'] = pd.DataFrame({
    'chrom': 'chrX',
    'start': 0,
    'end': chrX_size,
    'name': 'chrX'}, index=[0])

# Divide into chromosome arms
views['arms'] = pd.DataFrame({
    'chrom': 'chrX',
    'start': [0, 59_000_001],
    'end': [59_000_000, chrX_size],
    'name': ['X_short', 'X_long']}, index=[0,1])

# Calculate in 10Mb windows
window_size = 10_000_000
start_positions = list(range(0, chrX_size, window_size))
end_positions = [min(start + window_size, chrX_size) for start in start_positions]

# Create the DataFrame
views['10Mb'] = pd.DataFrame({
    'chrom': ['chrX'] * len(start_positions),
    'start': start_positions,
    'end': end_positions,
    'name': [f'X_{i}' for i in range(len(start_positions))]
})

Calulate the E1 compartments

import cooltools

eigs = {}
e1_values = {}

for name, clr in clrs.items():
    if name not in eigs or name not in e1_values:
        eigs[name] = {} 
        e1_values[name] = {}
    for view, view_df in views.items():
        print(f"Calculating eigenvectors for {name} at {view}")
        cis_eigs = cooltools.eigs_cis(
                            clr,
                            gc_cov,
                            view_df=view_df,
                            n_eigs=1)
        eigs[name][view] = cis_eigs[1]
        e1_values[name][view] = cis_eigs[1]['E1'].values

Calculating eigenvectors for fibroblast at full

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for fibroblast at arms

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for fibroblast at 10Mb

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for spermatogonia at full

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for spermatogonia at arms

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for spermatogonia at 10Mb

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for pachytene_spermatocyte at full

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for pachytene_spermatocyte at arms

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for pachytene_spermatocyte at 10Mb

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for round_spermatid at full

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for round_spermatid at arms

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for round_spermatid at 10Mb

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for sperm at full

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for sperm at arms

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Calculating eigenvectors for sperm at 10Mb

/home/sojern/miniconda3/envs/hic/lib/python3.12/site-packages/cooltools/lib/common.py:489: UserWarning: less than 50% of valid bins have been assigned a value
  warnings.warn("less than 50% of valid bins have been assigned a value")

Extract the A compartment coordinates

import importlib
import hicstuff
importlib.reload(hicstuff)
from hicstuff import extract_a_coordinates


res = 100_000
outdir = '../results/compartments/'

for name, view_df in e1_values.items():
    print(f"Calculating A-compartment intervals for {name}")
    for view, e1 in view_df.items():
        print(view)
        extract_a_coordinates(
            e1=e1,
            name=name,
            restriction=view,
            chrom='chrX',
            res=res,
            csv=True,
            output_dir=outdir,
            force=True
        )
        extract_a_coordinates(
            e1=e1,
            name=name,
            restriction=view,
            chrom='chrX',
            res=res,
            csv=True,
            output_dir=outdir,
            smooth=True,
            force=True
        )

Calculating A-compartment intervals for fibroblast
full
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_full.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_full_smoothed.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_full_smoothed.csv already exists. Overwriting.
arms
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_arms.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_arms_smoothed.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_arms_smoothed.csv already exists. Overwriting.
10Mb
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_10Mb.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/fibroblast_e1_100kb_10Mb_smoothed.csv
Calculating A-compartment intervals for fibroblast
File ../results/compartments/fibroblast_a_comp_coords_100kb_10Mb_smoothed.csv already exists. Overwriting.
Calculating A-compartment intervals for spermatogonia
full
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_full.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_full_smoothed.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_full_smoothed.csv already exists. Overwriting.
arms
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_arms.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_arms_smoothed.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_arms_smoothed.csv already exists. Overwriting.
10Mb
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_10Mb.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/spermatogonia_e1_100kb_10Mb_smoothed.csv
Calculating A-compartment intervals for spermatogonia
File ../results/compartments/spermatogonia_a_comp_coords_100kb_10Mb_smoothed.csv already exists. Overwriting.
Calculating A-compartment intervals for pachytene_spermatocyte
full
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_full.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_full_smoothed.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_full_smoothed.csv already exists. Overwriting.
arms
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_arms.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_arms_smoothed.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_arms_smoothed.csv already exists. Overwriting.
10Mb
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_10Mb.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/pachytene_spermatocyte_e1_100kb_10Mb_smoothed.csv
Calculating A-compartment intervals for pachytene_spermatocyte
File ../results/compartments/pachytene_spermatocyte_a_comp_coords_100kb_10Mb_smoothed.csv already exists. Overwriting.
Calculating A-compartment intervals for round_spermatid
full
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_full.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_full_smoothed.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_full_smoothed.csv already exists. Overwriting.
arms
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_arms.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_arms_smoothed.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_arms_smoothed.csv already exists. Overwriting.
10Mb
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_10Mb.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/round_spermatid_e1_100kb_10Mb_smoothed.csv
Calculating A-compartment intervals for round_spermatid
File ../results/compartments/round_spermatid_a_comp_coords_100kb_10Mb_smoothed.csv already exists. Overwriting.
Calculating A-compartment intervals for sperm
full
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_full.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_full.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_full_smoothed.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_full_smoothed.csv already exists. Overwriting.
arms
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_arms.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_arms.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_arms_smoothed.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_arms_smoothed.csv already exists. Overwriting.
10Mb
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_10Mb.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_10Mb.csv already exists. Overwriting.
Saving eigenvector track to: ../results/compartments/sperm_e1_100kb_10Mb_smoothed.csv
Calculating A-compartment intervals for sperm
File ../results/compartments/sperm_a_comp_coords_100kb_10Mb_smoothed.csv already exists. Overwriting.

Plot the E1 compartments

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable

f, axs = plt.subplots(5, 3, figsize=(18, 8), sharex=True, sharey=True)

# Loop through eigs and e1_values to plot the E1 values
for i, (name, e1_dict) in enumerate(e1_values.items()):
    axs[i,0].set_ylabel(names_abbr[name])
    for j, (view, e1) in enumerate(e1_dict.items()):
        ax = axs[i, j]

        if i==0:
            ax.set_title(view)

        # Create stairs
        x = np.zeros(2*chrom_start.size)
        y = np.zeros(2*chrom_start.size)
        smooth_y = np.zeros(2*chrom_start.size)
        x[0::2] = chrom_start
        x[1::2] = chrom_end
        y[0::2] = e1
        y[1::2] = e1
        smooth_e1 = (lambda x: x.rolling(5, 1, center=True).sum())(pd.DataFrame(e1, columns=['value']))['value'].values
        smooth_y[0::2] = smooth_e1
        smooth_y[1::2] = smooth_e1


        # ax.fill_between(x, y, 0, where=(y > 0), color='tab:red')
        # ax.fill_between(x, y, 0, where=(y < 0), color='tab:blue')
        # ax.plot(x, smooth_y/5, color='C1', linewidth=0.5)

        ax.fill_between(x, smooth_y, 0, where=(smooth_y > 0), color='tab:red')
        ax.fill_between(x, smooth_y, 0, where=(smooth_y < 0), color='tab:blue')

        # Test to see how well my coordinates match the e1 values
        coords = pd.read_csv(f'../results/compartments/{name}_a_comp_coords_100kb_{view}_smoothed.csv')
        # Iterate over each interval in the DataFrame
        for start, end in zip(coords['start'], coords['end']):
            rect = Rectangle((start, 0.4), width=end-start, height=0.5, color='k', linewidth=0.05)
            ax.add_patch(rect)

# Set y-limits for all subplots
for ax in axs.flat:
    ax.set_ylim(-4, 4)
    ax.spines[:].set_visible(False)
    ax.set_xticks([])
    ax.set_yticks([])

plt.tight_layout()
plt.savefig('../steps/e1_plot_full_100kb_smoothed.svg', bbox_inches='tight')

E1 eigenvector values for all merged samples at 100kb resolution. Left: Chromosomes (not restricted) Middle: Chromosome-arm restricted E1. Right: 10Mb window restricted E1.Values are smoothed with a sliding window of 5 bins, step size 1 bin.

Plot matrices with compartments

import cooltools.lib.plotting
from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable
from cooltools.lib.numutils import adaptive_coarsegrain, interp_nan

f, axs = plt.subplots(1,5,
    figsize=(25,10)
)

norm = LogNorm(vmax=0.1)

# Loop through the clrs and its matrix: plot the matrix on its axis
for i, (name, e1_dict) in enumerate(e1_values.items()):
    ax = axs[i]

    im = ax.matshow(
        clr.matrix().fetch('chrX'),
        norm=norm,
        cmap='fall',
    );
    ax.set_xlim(0, nbins)
    ax.set_ylim(nbins, 0)

    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="2%", pad=0.05)
    plt.colorbar(im, cax=cax, label='corrected frequencies');
    ax.set_ylabel('chrX:100kb bins, #bin')
    ax.xaxis.set_visible(False)

    for j, (view, e1) in enumerate(e1_dict.items()):
        ax1 = divider.append_axes("top", size="10%", pad=0.1, sharex = ax)

        # ax1.plot(e1, label='E1')

        # Create stairs
        x = np.zeros(2*chrom_start.size)
        y = np.zeros(2*chrom_start.size)
        #smooth_y = np.zeros(2*chrom_start.size)
        x[0::2] = chrom_start/binsize
        x[1::2] = chrom_end/binsize
        y[0::2] = e1
        y[1::2] = e1
        # smooth_e1 = (lambda x: x.rolling(5, 1, center=True).sum())(pd.DataFrame(e1, columns=['value']))['value'].values
        # smooth_y[0::2] = smooth_e1
        # smooth_y[1::2] = smooth_e1


        ax1.fill_between(x, y, 0, where=(y > 0), color='tab:red')
        ax1.fill_between(x, y, 0, where=(y < 0), color='tab:blue')

        ax1.set_ylabel(view, rotation=-70)
        ax1.set_ylim(-0.8, 0.8)
        ax1.set_xticks([])

    ax1.set_title(name)

#    # Plot the sign changes on the matrix 
#     col = colors[i]
#     for i in np.where(np.diff( (pd.Series(e1)>0).astype(int)))[0]:
#         # Horisontal lines where E1 intersects 0
#         ax.plot([0,nbins],[i,i],col,lw=0.5)

#         # Vertical lines where E1 intersects 0
#         ax.plot([i,i],[0,nbins],col,lw=0.5)

# Now show the plot
f.set_layout_engine('compressed')
f.canvas.draw()
plt.show()

Smooth the Observed/Expected matrix

Wang et al. 2019 used a 400Kb window with a step size of 100Kb to smooth the O/E matrix before calculating the E1. I have tried various different methods to smooth the interaction matrix, but I can’t get dimensions of the pixels to match with the bins. Also, cooltools.eigs_cis need a full cooler to fetch data from, and I can’t get it to work without a pixels object.

Thus, I will try smoothing the E1 values instead, as I can’t really rationalize what biological difference it would make. I think I need help figuring that out.

# import glob
# import os.path as op
# import cooler

# mcools = glob.glob("../steps/bwa/PE/cool/*/*.mcool")
# res = "::resolutions/100000"

# clrs100kb = {op.basename(op.dirname(mcool)): cooler.Cooler(mcool+res) for mcool in mcools}

# chron_order = ['fibroblast', 'spermatogonia', 'pachytene_spermatocyte', 'round_spermatid', 'sperm']

# clrs100kb = {key: clrs100kb[key] for key in chron_order}
# clrs100kb

Calculate gc covariance (from the reference genome)

Do this with any of the clrs - it just needs the bins positions.

# import bioframe
# import pandas as pd
# import os.path as op

# bins = clrs100kb['fibroblast'].bins().fetch('chrX')[:]
# out_name = 'rheMac10_gc_cov_X_100kb.tsv'

# rheMac10 = bioframe.load_fasta('../data/links/ucsc_ref/rheMac10.fa')
# if not op.exists(out_name):
#     print('Calculate the fraction of GC basepairs for each bin')
#     gc_cov = bioframe.frac_gc(bins[['chrom', 'start', 'end']], rheMac10)
#     gc_cov.to_csv(out_name, index=False,sep='\t')
#     print(gc_cov.info())
# else: 
#     print("Already exists, read from file")
#     gc_cov = pd.read_csv(out_name, sep='\t')
#     print(gc_cov.info())

# import cooler
# import numpy as np

# def smooth_matrix(matrix, window_size=4, step_size=1):
#     n = matrix.shape[0]
#     smoothed_matrix = np.zeros_like(matrix)
    
#     for i in range(0, n - window_size + 1, step_size):
#         for j in range(0, n - window_size + 1, step_size):
#             window = matrix[i:i+window_size, j:j+window_size]
#             smoothed_matrix[i:i+window_size, j:j+window_size] = np.sum(window)
    
#     return smoothed_matrix

# # Load the Hi-C interaction matrix from a multi-resolution .mcool file
# cooler_file = '../steps/bwa/PE/cool/fibroblast/fibroblast.mcool'
# clr = cooler.Cooler(cooler_file + '::resolutions/100000')

# # Extract the matrix
# matrix = clr.matrix(balance=True).fetch('chrX')[:]

# # Smooth the matrix
# smoothed_matrix = smooth_matrix(matrix, window_size=4, step_size=1)

# #clr.bins().fetch('chrX')[:]
# clr.pixels().fetch('chrX')[:]
# #print(matrix.shape)
# #print(smoothed_matrix.shape)

	bin1_id	bin2_id	count
145469195	26898	26898	2013
145469196	26898	26899	171
145469197	26898	26901	75
145469198	26898	26902	100
145469199	26898	26903	187
...	...	...	...
146369705	28431	28532	2
146369706	28431	28533	1
146369707	28431	28534	2
146369708	28431	28539	1
146369709	28431	28544	1

900515 rows × 3 columns

# import pandas as pd
# import cooler
# import numpy as np

# pixels = pd.DataFrame(smoothed_matrix).stack().rename_axis(['bin1_id', 'bin2_id']).reset_index(name='count')
# pixels = pixels[pixels['bin1_id'] <= pixels['bin2_id']]
# pixels.sort_values(['bin1_id', 'bin2_id'])[['bin1_id', 'bin2_id', 'count']].reset_index(drop=True)

# # Get the eigs

# import cooltools
# import pandas as pd

# eigvecs, eigvals = cooltools.lib.numutils.get_eig(matrix, n=1)

# # create a cooler for the smoothed matrix

# import cooler
# import numpy as np

# # Create a new cooler file for the smoothed matrix
# smoothed_cooler_file = cooler_file.replace('.mcool', '.smoothed.mcool')

# # Create a new cooler object
# cooler.create_cooler(smoothed_cooler_file, bins=clr.bins()[:], pixels=smoothed_matrix)

Working with coolers

Overview

Data (Accessions)

Folder structure

FullMerge (pool all from each source_name)

Create cooler dictionary (glob)

Merge coolers

Zoomify the merged cooler files

Balance the matrices

RepMerge (pool all from each BioSample ID)

Plotting

500kb resolution

Example with a single sample

Imports

Load cooler

Calculate gc covariance (from the reference genome)

Calculate the E1 compartments

Plot E1 and matrix

Stairs plot of the E1 compartments

All 5 full merges

Load coolers

Calculate gc covariance (from the reference genome)

Calculate the E1 compartments

Plot GC covariance

Chromosome restricted E1 compartments

Chromosome arms restricted

Refined: 10Mb windows restricted

Save the A compartment coordinates

This is copied to hicstuff.py as a function for later import

Plot NaN histogram

Plot the E1 compartments

Plot matrices with compartments (round spermatid)

Sliding window summed E1 compartments

100kb resolution

All 5 full merges

Load coolers

Calculate gc covariance (from the reference genome)

Calculate the E1 compartments

Plot GC covariance

Create viewframes: full, arms, 10Mb windows

Calulate the E1 compartments

Extract the A compartment coordinates

Plot the E1 compartments

Plot matrices with compartments

Smooth the Observed/Expected matrix

Calculate gc covariance (from the reference genome)

Create cooler dictionary (`glob`)