Submodules

Submodules are useful if you have analysis components that are used across many projects. It would good to have repos to use for subprojects for these components (between us, I think we have them)

HapMix
RFmix
Mosaic
Relate
ARGweaver
ARGweaver Clues
Relate Clues
SMC++
LDhat/helmet
pyro
GATK mapping and base-calling
ShapeIt phasing

The repos can each have their own adjustable workflow as decribed here, that are then easily put together in a larger workflow as described here.

Preliminaries

Make sure you install the current version of Git in your environments. The one on the cluster is ancient.

Terminal

pixi global install gh git

I also set these configs for each environment to get nicer/safer commands (commands below assume these are set):

git config --global diff.submodule log
git config status.submodulesummary 1
git config push.recurseSubmodules check

Now, say you have a project repo called “umbrella” that will contain other projects and that you have cloned that:

git clone git@github.com:kaspermunch/umbrella.git

Add a submodule

clone repository as submodule:

Terminal

git submodule add git@github.com:munch-group/rfmix.git

and pull the current state of the submodule repo:

Terminal

git submodule init rfmix
git submodule update rfmix

This also generates a .gitmodules configuration file that git uses to keep track of submodules. Commit that the addalong with the submodule:

Terminal

git add .gitmodules rfmix
git commit -m 'Added rfmix as submodule'
git push

If you want to work on/change submodule repo you need to check out a branch to work on (main or some other). Always do this. If you decide to make changes later and forgot you did not check out a branch you could loose those changes:

Terminal

cd rfmix
git checkout main  # (or some other branch)

Making changes to the submodule

now you can then do some work on the tester repo (E.g. change the README.md) and add, commit as usual:

Terminal

cd rfmix
# change README.md
git add README.md
git commit

Publishing submodule changes to GitHub

To publish your submodule commit to the tester repo on GitHub you run:

Terminal

cd rfmix
git push

Getting submodule changes from GitHub

if you run “git pull” in the umbrella repo, you pull upstream changes to the umbrella repo including the recorded state (commit) of the tester submodule:

Terminal

git pull

but it does not pull the tester submodule itself. To do that you run pull in the submodule:

Terminal

cd rfmix
git pull

Branches and multiple contributors

As you could see above submodules are really just repos inside other repos. The parent repo just treats the submodule as a file holding which state (commit) the submodule is in. So for most use, working alone on project it is quite simple. In some cases however, you need some of the git submodule commands. Those commands updates the relationship between the submodule state recorded in the parent repo and the state and of the submodule.

Terminal

git submodule update --checkout
git submodule update --rebase
git submodule update --merge

--checkout: Checkout the commit recorded in the superproject on a detached HEAD in the submodule. This is the default behavior, the main use of this option is to override submodule.$name.update when set to a value other than checkout.

--merge: Merge the commit recorded in the superproject into the current branch of the submodule. If this option is given, the submodule’s HEAD will not be detached. If a merge failure prevents this process, you will have to resolve the resulting conflicts within the submodule with the usual conflict resolution tools.

--rebase: Rebase the current branch onto the commit recorded in the superproject. If this option is given, the submodule’s HEAD will not be detached. If a merge failure prevents this process, you will have to resolve these failures with git-rebase[1].

Terminal

git submodule update --remote --checkout
git submodule update --remote --rebase
git submodule update --remote --merge

The last one fetches upsteam changes and pulls the submoduile branch recorded in the parent repo when the submodule was added. If this branch is the current branch in the local submodule, then the command is equivalent to git pull in the submodule.

--remote: Instead of using the superproject’s recorded SHA-1 to update the submodule, use the status of the submodule’s remote-tracking branch. In order to ensure a current tracking branch state, update –remote fetches the submodule’s remote repository before calculating the SHA-1. If you don’t want to fetch, you should use submodule update --remote --no-fetch.

Use this option to integrate changes from the upstream subproject with your submodule’s current HEAD. Alternatively, you can run git pull from the submodule, which is equivalent except for the remote branch name: update –remote uses the default upstream repository and submodule..branch, while git pull uses the submodule’s branch..merge. Prefer submodule..branch if you want to distribute the default upstream branch with the superproject and branch..merge if you want a more native feel while working in the submodule itself.

(read the difference between merge and rebase here)

Multiple submodules

You can have as many submodules as you want. With more submodules, each update command updates all submodules.

Combining GWF workflows from multiple Git submodules

import os
import pandas as pd
import yaml
from gwf.workflow import collect
from gwf import Workflow, AnonymousTarget

# get all targets with a given key
def target_output_files(targets, key):
    return [out for target in targets[key] for out in target.outputs]

###############################################################################
## Top workflow for all the stuff required to run the individual pioe lines
###############################################################################

gwf = Workflow()

# read config file for workflow
with open('workflow_config.yml') as f:
    config = yaml.safe_load(f)

# rfmix workflow
from rfmix.workflow import rfmix_workflow

# controls wheather submodule workflows merged with main workflow
# or run in isolation (only use False for initial submoduile setup)
merge_workflows = True

###############################################################################
##  Generate input files and run RFmix submodule workflow
###############################################################################

# full_list = ['Cynocephalus, Central Tanzania', 'Anubis, Kenya', 'Kindae, Zambia',
#     'Hamadryas, Ethiopia', 'Anubis, Tanzania',
#     'Cynocephalus, Western Tanzania', 'Papio, Senegal', 'Ursinus, Zambia',
#     'Anubis, Ethiopia']

# rfmix analyses
rfmix_analyses = config['rfmix_analyzes']

# rfmix output dir
rfmix_output_dir = "steps/rfmix_gen100/"

# compile reference/query sample lists for rfmix
meta_data_samples = pd.read_csv(config['sample_meta_data'], sep=" ")
analyzes = []
for analysis in rfmix_analyses:
    d = {}
    d["analysis"] = analysis
    os.makedirs(rfmix_output_dir+"/"+analysis, exist_ok=True)
    ref_samples = meta_data_samples.loc[meta_data_samples.C_origin.isin(rfmix_analyses[analysis])]
    query_samples = meta_data_samples.loc[~(meta_data_samples.C_origin.isin(rfmix_analyses[analysis])) &
                                        (meta_data_samples.C_origin != "Gelada, Captive")]
    d["ref_samples"] =list(ref_samples.PGDP_ID)
    d["query_samples"] = list(query_samples.PGDP_ID)
    analyzes.append(d)

# write sample/population info
for analysis in rfmix_analyses:
    analysis_dir = rfmix_output_dir + "/" + analysis
    sample_map_file = analysis_dir + "/ref_names.txt"
    gwf.target(f'sample_map_{analysis}', inputs=['workflow_config.yml'], outputs=[rfmix_output_dir+"/"+analysis+"/ref_names.txt"]) << f'''

    mkdir -p analysis_dir
    python scripts/rfmix_write_sample_map.py {analysis} workflow_config.yml {sample_map_file}
    '''

# write recombination maps
autosome_rec_map = rfmix_output_dir + "aut_genetic_map.txt"
x_rec_map = rfmix_output_dir + "X_genetic_map.txt"
gwf.target('format_genetic_maps', 
           memory='16gb',
           inputs=['workflow_config.yml'], 
           outputs=[autosome_rec_map, x_rec_map]) << f'''

mkdir -p output_dir
python scripts/rfmix_format_genetic_maps.py workflow_config.yml {autosome_rec_map} {x_rec_map}
'''



# run the rfmix pipeline
_gwf, rfmix_targets  = rfmix_workflow(
                          gwf if merge_workflows else Workflow(working_dir=os.getcwd()),
#                          merge_workflows and gwf or Workflow(working_dir=os.getcwd()),
                          analyzes=analyzes, 
                          output_dir=rfmix_output_dir, 
                          vcf_files=config['vcf_files'], 
                          autosome_rec_map=autosome_rec_map,
                          x_rec_map=x_rec_map
                          )

globals()['rfmix'] = _gwf

###############################################################################
## Next workflow...
###############################################################################

# # get relevant outputs from A for input to B
# input_files =  target_output_files(rfmix_targets, 'work')

In each submodule the workflow.py could look like this:

import os.path
import os
from collections import defaultdict
from gwf import Workflow

def submoduleA_workflow(working_dir=os.getcwd(), input_files=None, output_dir=None, summarize=True):

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # dict of targets as info for other workflows
    targets = defaultdict(list)

    gwf = Workflow(working_dir=working_dir)

    work_output = os.path.join(output_dir, 'A_output1.txt')
    target = gwf.target(
        name='A_work',
        inputs=input_files,
        outputs=[work_output],
    ) << f"""
    touch {work_output}
    """
    targets['work'].append(target)

    if summarize:
        summary_output = os.path.join(output_dir, 'A_output2.txt')
        target = gwf.target(
            name='A_summary',
            inputs=[work_output],
            outputs=[summary_output]
        ) << f"""
        touch {summary_output}
        """
        targets['summary'].append(target)

    return gwf, targets

# we need to assign the workflow to the gwf variable to allow the workflow to be
# run separetely with 'gwf run' in the submoduleA dir
gwf, targets = submoduleA_workflow(input_files=['./input.txt'], output_dir='A_outputs')

Thw workflow can be then be run the normal way:

gwf run

This way of writing workflows allows allows multiple submodule workflows to be combined in a master workflow file.

Try to put the workflow.py is in a submoduleA folder and the one below is in a submoduleB folder.

import os.path
import os
from collections import defaultdict
from gwf import Workflow

def submoduleB_workflow(working_dir=os.getcwd(), input_files=None, output_dir=None, summarize=True):

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # dict of targets as info for other workflows
    targets = defaultdict(list)

    gwf = Workflow(working_dir=working_dir)

    work_output = os.path.join(output_dir, 'B_output1.txt')
    target = gwf.target(
        name='B_work',
        inputs=input_files,
        outputs=[work_output],
    ) << f"""
    touch {work_output}
    """
    targets['work'].append(target)

    if summarize:
        summary_output = os.path.join(output_dir, 'B_output2.txt')
        target = gwf.target(
            name='B_summary',
            inputs=[work_output],
            outputs=[summary_output]
        ) << f"""
        touch {summary_output}
        """
        targets['summary'].append(target)

    return gwf, targets

# we need to assign the workflow to the gwf variable to allow the workflow to be
# run separetely with 'gwf run' in the submoduleB dir
gwf, targets = submoduleB_workflow(input_files=['./input.txt'], output_dir='./B_outputs')

The two submodule workflows above can be combined into parts of a master workflow. The master workflow.py sits in the parent dir of submoduleA and submoduleB:

├── submoduleA
│   └── workflow.py
├── submoduleB
│   └── workflow.py
└── workflow.py

If you write it like this:

import os
from gwf.workflow import collect

from submoduleA.workflow import submoduleA_workflow
from submoduleB.workflow import submoduleB_workflow

working_dir = os.getcwd()

def target_output_files(targets, key):
    return [out for target in targets[key] for out in target.outputs]

# submodule A workflow
gwf, A_targets  = submoduleA_workflow(working_dir=working_dir,
                          input_files=['./input.txt'],
                          output_dir='./A_outputs')
globals()['submoduleA'] = gwf

# add an extra target to glue workflows together
gwf.target('extra', inputs=['./input.txt'], outputs=['./A_outputs/extra.txt']) << """
    touch ./A_outputs/extra.txt
"""

# get relevant outputs from A for input to B
input_files =  target_output_files(A_targets, 'work')

# submodule B workflow
gwf, B_targets = submoduleB_workflow(working_dir=working_dir, 
                          input_files=input_files,
                          output_dir='./B_outputs' )
globals()['submoduleB'] = gwf

You can run each component workflow like this:

gwf -f workflow.py:submoduleA run
gwf -f workflow.py:submoduleB run