Submodules
Submodules are useful if you have analysis components that are used across many projects. It would be good to have repos that can serve as subprojects for these components (between us, I think we have them):
- HapMix
- RFmix
- Mosaic
- Relate
- ARGweaver
- ARGweaver Clues
- Relate Clues
- SMC++
- LDhat/helmet
- pyro
- GATK mapping and base-calling
- ShapeIt phasing
Each repo can have its own adjustable workflow, as described here, and these workflows are then easily put together in a larger workflow, as described here.
Preliminaries
Make sure you install the current version of Git in your environments. The one on the cluster is ancient.
Terminal
conda install -c conda-forge git
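To confirm that the environment picks up the newly installed Git rather than the old cluster one, you can check the reported version. The following is a minimal sketch; the helper names `parse_git_version` and `installed_git_version` are made up for illustration, and the version strings in the example are just sample inputs.

```python
import re
import subprocess

def parse_git_version(text):
    """Extract (major, minor, patch) from the output of `git --version`."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups(default="0")
    return int(major), int(minor), int(patch)

def installed_git_version():
    """Ask whichever git is first on PATH for its version (requires git)."""
    result = subprocess.run(["git", "--version"],
                            capture_output=True, text=True, check=True)
    return parse_git_version(result.stdout)

# Parsing works on typical version strings without calling git:
print(parse_git_version("git version 2.39.2"))   # (2, 39, 2)
print(parse_git_version("git version 1.8.3.1"))  # (1, 8, 3)
```

Version tuples compare element by element, so `installed_git_version() >= (2, 22, 0)` is one way to test for a minimum version of your choosing.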
I also set these configs for each environment to get nicer/safer commands (commands below assume these are set):
git config --global diff.submodule log
git config status.submodulesummary 1
git config push.recurseSubmodules check
Now, say you have a project repo called “umbrella” that will contain other projects, and that you have cloned it:
git clone git@github.com:kaspermunch/umbrella.git
Add a submodule
Clone the repository as a submodule:
Terminal
git submodule add git@github.com:munch-group/rfmix.git
and pull the current state of the submodule repo:
Terminal
git submodule init rfmix
git submodule update rfmix
Adding the submodule also generates a .gitmodules configuration file that Git uses to keep track of submodules. Commit it along with the submodule:
Terminal
git add .gitmodules rfmix
git commit -m 'Added rfmix as submodule'
git push
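The .gitmodules file is a small INI-style file with one section per submodule, so it can be inspected from Python. This is a sketch only: `read_gitmodules` is a hypothetical helper, and the example text is what I would expect the file to contain after the `git submodule add` command above.

```python
import configparser

def read_gitmodules(text):
    """Return {submodule name: {"path": ..., "url": ...}} from .gitmodules text."""
    # Git indents the keys with a tab; strip it so every line is a plain
    # "key = value" line for configparser.
    cleaned = "\n".join(line.strip() for line in text.splitlines())
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(cleaned)
    submodules = {}
    for section in parser.sections():
        # Section headers look like: submodule "rfmix"
        kind, _, quoted_name = section.partition(" ")
        if kind != "submodule":
            continue
        submodules[quoted_name.strip('"')] = dict(parser[section])
    return submodules

# Assumed file content for the rfmix example
example = '''[submodule "rfmix"]
\tpath = rfmix
\turl = git@github.com:munch-group/rfmix.git
'''
print(read_gitmodules(example))
```

In a real checkout you would pass the result of `open('.gitmodules').read()` instead of the example string.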
If you want to work on or change the submodule repo, you need to check out a branch to work on (main or some other). Always do this. If you make changes later and have forgotten that you never checked out a branch, you could lose those changes:
Terminal
cd rfmix
git checkout main # (or some other branch)
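Since forgetting this step can cost you work, it may be worth checking whether a submodule is on a branch before editing it. The sketch below assumes Git 2.22 or newer, where `git branch --show-current` prints the branch name, or nothing when HEAD is detached. The function names are made up for illustration.

```python
import subprocess

def branch_from_output(stdout):
    """Interpret `git branch --show-current` output; empty means detached HEAD."""
    name = stdout.strip()
    return name or None

def current_branch(repo_dir):
    """Return the checked-out branch in repo_dir, or None if HEAD is detached.

    Requires git on PATH and repo_dir to be a Git working tree.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "branch", "--show-current"],
        capture_output=True, text=True, check=True,
    )
    return branch_from_output(result.stdout)

print(branch_from_output("main\n"))  # main
print(branch_from_output("\n"))      # None
```

Calling `current_branch('rfmix')` from the umbrella repo and getting `None` back would mean the submodule is still on a detached HEAD.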
Making changes to the submodule
Now you can do some work on the rfmix repo (e.g. change the README.md), then add and commit as usual:
Terminal
cd rfmix
# change README.md
git add README.md
git commit
Publishing submodule changes to GitHub
To publish your submodule commit to the rfmix repo on GitHub, run:
Terminal
cd rfmix
git push
Getting submodule changes from GitHub
If you run “git pull” in the umbrella repo, you pull upstream changes to the umbrella repo, including the recorded state (commit) of the rfmix submodule:
Terminal
git pull
but it does not pull the rfmix submodule itself. To do that, run pull in the submodule:
Terminal
cd rfmix
git pull
Branches and multiple contributors
As you can see above, submodules are really just repos inside other repos. The parent repo just treats the submodule as a file recording which state (commit) the submodule is in. So for most uses, such as working alone on a project, it is quite simple. In some cases, however, you need some of the git submodule commands. Those commands update the relationship between the submodule state recorded in the parent repo and the actual state of the submodule.
Terminal
git submodule update --checkout
git submodule update --rebase
git submodule update --merge
--checkout
: Checkout the commit recorded in the superproject on a detached HEAD in the submodule. This is the default behavior, the main use of this option is to override submodule.$name.update when set to a value other than checkout.
--merge
: Merge the commit recorded in the superproject into the current branch of the submodule. If this option is given, the submodule’s HEAD will not be detached. If a merge failure prevents this process, you will have to resolve the resulting conflicts within the submodule with the usual conflict resolution tools.
--rebase
: Rebase the current branch onto the commit recorded in the superproject. If this option is given, the submodule’s HEAD will not be detached. If a merge failure prevents this process, you will have to resolve these failures with git-rebase[1].
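To see whether the commit recorded in the parent repo matches what is checked out in a submodule, `git submodule status` prints one line per submodule. Each line starts with a one-character flag: a space when they match, `-` when the submodule is not initialized, `+` when the checked-out commit differs from the recorded one, and `U` when there are merge conflicts. A small parser for those lines could look like this; the helper name and the commit hash in the example are made up.

```python
STATES = {
    " ": "matches recorded commit",
    "-": "not initialized",
    "+": "differs from recorded commit",
    "U": "merge conflicts",
}

def parse_status_line(line):
    """Parse one line of `git submodule status` (do not strip the line first)."""
    flag, rest = line[0], line[1:]
    commit, _, remainder = rest.partition(" ")
    path, describe = remainder, None
    # Initialized submodules end with a "(describe)" suffix
    if remainder.endswith(")") and " (" in remainder:
        path, _, describe = remainder.rpartition(" (")
        describe = describe[:-1]
    return {
        "state": STATES.get(flag, "unknown"),
        "commit": commit,
        "path": path,
        "describe": describe,
    }

fake_commit = "0123456789abcdef0123456789abcdef01234567"  # made-up hash
print(parse_status_line("+" + fake_commit + " rfmix (heads/main)"))
```

A `+` flag is the situation where the update commands above (or a new commit in the parent repo) are needed to bring the two states back in line.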
Terminal
git submodule update --remote --checkout
git submodule update --remote --rebase
git submodule update --remote --merge
The last one fetches upstream changes and pulls the submodule branch recorded in the parent repo when the submodule was added. If this branch is the current branch in the local submodule, the command is equivalent to git pull in the submodule.
--remote
: Instead of using the superproject’s recorded SHA-1 to update the submodule, use the status of the submodule’s remote-tracking branch. In order to ensure a current tracking branch state, update --remote fetches the submodule’s remote repository before calculating the SHA-1. If you don’t want to fetch, you should use submodule update --remote --no-fetch.
Use this option to integrate changes from the upstream subproject with your submodule’s current HEAD. Alternatively, you can run git pull from the submodule, which is equivalent except for the remote branch name: update --remote uses the default upstream repository and submodule.<name>.branch, while git pull uses the submodule’s branch.<name>.merge.
(read the difference between merge and rebase here)
Multiple submodules
You can have as many submodules as you want. With more submodules, each update command updates all submodules.
Combining GWF workflows from multiple Git submodules
import os
import pandas as pd
import yaml
from gwf.workflow import collect
from gwf import Workflow, AnonymousTarget

# get the output files of all targets with a given key
def target_output_files(targets, key):
    return [out for target in targets[key] for out in target.outputs]

###############################################################################
## Top workflow for all the stuff required to run the individual pipelines
###############################################################################

gwf = Workflow()

# read config file for workflow
with open('workflow_config.yml') as f:
    config = yaml.safe_load(f)

# rfmix workflow
from rfmix.workflow import rfmix_workflow

# controls whether submodule workflows are merged with the main workflow
# or run in isolation (only use False for initial submodule setup)
merge_workflows = True

###############################################################################
## Generate input files and run RFmix submodule workflow
###############################################################################

# full_list = ['Cynocephalus, Central Tanzania', 'Anubis, Kenya', 'Kindae, Zambia',
#              'Hamadryas, Ethiopia', 'Anubis, Tanzania',
#              'Cynocephalus, Western Tanzania', 'Papio, Senegal', 'Ursinus, Zambia',
#              'Anubis, Ethiopia']

# rfmix analyses
rfmix_analyses = config['rfmix_analyzes']

# rfmix output dir
rfmix_output_dir = "steps/rfmix_gen100/"

# compile reference/query sample lists for rfmix
meta_data_samples = pd.read_csv(config['sample_meta_data'], sep=" ")
analyzes = []
for analysis in rfmix_analyses:
    d = {}
    d["analysis"] = analysis
    os.makedirs(rfmix_output_dir+"/"+analysis, exist_ok=True)
    # reference samples: origins listed for this analysis
    ref_samples = meta_data_samples.loc[meta_data_samples.C_origin.isin(rfmix_analyses[analysis])]
    # query samples: everything else, except captive geladas
    query_samples = meta_data_samples.loc[~(meta_data_samples.C_origin.isin(rfmix_analyses[analysis])) &
                                          (meta_data_samples.C_origin != "Gelada, Captive")]
    d["ref_samples"] = list(ref_samples.PGDP_ID)
    d["query_samples"] = list(query_samples.PGDP_ID)
    analyzes.append(d)

# write sample/population info
for analysis in rfmix_analyses:
    analysis_dir = rfmix_output_dir + "/" + analysis
    sample_map_file = analysis_dir + "/ref_names.txt"
    gwf.target(f'sample_map_{analysis}', inputs=['workflow_config.yml'], outputs=[sample_map_file]) << f'''
    mkdir -p {analysis_dir}
    python scripts/rfmix_write_sample_map.py {analysis} workflow_config.yml {sample_map_file}
    '''

# write recombination maps
autosome_rec_map = rfmix_output_dir + "aut_genetic_map.txt"
x_rec_map = rfmix_output_dir + "X_genetic_map.txt"
gwf.target('format_genetic_maps',
           memory='16gb',
           inputs=['workflow_config.yml'],
           outputs=[autosome_rec_map, x_rec_map]) << f'''
mkdir -p {rfmix_output_dir}
python scripts/rfmix_format_genetic_maps.py workflow_config.yml {autosome_rec_map} {x_rec_map}
'''

# run the rfmix pipeline
_gwf, rfmix_targets = rfmix_workflow(
    gwf if merge_workflows else Workflow(working_dir=os.getcwd()),
    # merge_workflows and gwf or Workflow(working_dir=os.getcwd()),
    analyzes=analyzes,
    output_dir=rfmix_output_dir,
    vcf_files=config['vcf_files'],
    autosome_rec_map=autosome_rec_map,
    x_rec_map=x_rec_map
)
globals()['rfmix'] = _gwf

###############################################################################
## Next workflow...
###############################################################################

# # get relevant outputs from A for input to B
# input_files = target_output_files(rfmix_targets, 'work')
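The reference/query split in the workflow above is the part that is easiest to get wrong, so here is the same selection logic on toy data. This sketch deliberately uses plain Python lists instead of pandas so it runs without the meta data file. The function name `split_samples` and the sample IDs are made up; only the column names `C_origin` and `PGDP_ID` and the origin labels come from the workflow.

```python
def split_samples(samples, ref_origins, excluded_origins=("Gelada, Captive",)):
    """Return (reference IDs, query IDs) for one analysis.

    Reference samples have an origin listed for the analysis. Query samples
    are all remaining samples, except those from excluded origins.
    """
    ref = [s["PGDP_ID"] for s in samples if s["C_origin"] in ref_origins]
    query = [
        s["PGDP_ID"] for s in samples
        if s["C_origin"] not in ref_origins
        and s["C_origin"] not in excluded_origins
    ]
    return ref, query

# Toy meta data with made-up sample IDs
samples = [
    {"PGDP_ID": "sample_1", "C_origin": "Anubis, Kenya"},
    {"PGDP_ID": "sample_2", "C_origin": "Hamadryas, Ethiopia"},
    {"PGDP_ID": "sample_3", "C_origin": "Kindae, Zambia"},
    {"PGDP_ID": "sample_4", "C_origin": "Gelada, Captive"},
]

ref, query = split_samples(samples, ["Anubis, Kenya", "Hamadryas, Ethiopia"])
print(ref)    # ['sample_1', 'sample_2']
print(query)  # ['sample_3']
```

Judging from how it is used, the `rfmix_analyzes` entry in the config maps each analysis name to such a list of reference origins.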
In each submodule the workflow.py could look like this:
import os.path
import os
from collections import defaultdict
from gwf import Workflow

def submoduleA_workflow(working_dir=os.getcwd(), input_files=None, output_dir=None, summarize=True):

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # dict of targets as info for other workflows
    targets = defaultdict(list)

    gwf = Workflow(working_dir=working_dir)

    work_output = os.path.join(output_dir, 'A_output1.txt')
    target = gwf.target(
        name='A_work',
        inputs=input_files,
        outputs=[work_output],
    ) << f"""
    touch {work_output}
    """
    targets['work'].append(target)

    if summarize:
        summary_output = os.path.join(output_dir, 'A_output2.txt')
        target = gwf.target(
            name='A_summary',
            inputs=[work_output],
            outputs=[summary_output]
        ) << f"""
        touch {summary_output}
        """
        targets['summary'].append(target)

    return gwf, targets

# we need to assign the workflow to the gwf variable to allow the workflow to be
# run separately with 'gwf run' in the submoduleA dir
gwf, targets = submoduleA_workflow(input_files=['./input.txt'], output_dir='A_outputs')
The workflow can then be run the normal way:
gwf run
This way of writing workflows allows multiple submodule workflows to be combined in a master workflow file.
Try putting the submoduleA workflow.py in a submoduleA folder and the one below in a submoduleB folder.
import os.path
import os
from collections import defaultdict
from gwf import Workflow

def submoduleB_workflow(working_dir=os.getcwd(), input_files=None, output_dir=None, summarize=True):

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # dict of targets as info for other workflows
    targets = defaultdict(list)

    gwf = Workflow(working_dir=working_dir)

    work_output = os.path.join(output_dir, 'B_output1.txt')
    target = gwf.target(
        name='B_work',
        inputs=input_files,
        outputs=[work_output],
    ) << f"""
    touch {work_output}
    """
    targets['work'].append(target)

    if summarize:
        summary_output = os.path.join(output_dir, 'B_output2.txt')
        target = gwf.target(
            name='B_summary',
            inputs=[work_output],
            outputs=[summary_output]
        ) << f"""
        touch {summary_output}
        """
        targets['summary'].append(target)

    return gwf, targets

# we need to assign the workflow to the gwf variable to allow the workflow to be
# run separately with 'gwf run' in the submoduleB dir
gwf, targets = submoduleB_workflow(input_files=['./input.txt'], output_dir='./B_outputs')
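Both functions return a dict of targets keyed by role ('work', 'summary'), which is what lets a parent workflow look up a submodule's output files. The sketch below shows that lookup with stand-in objects instead of real gwf targets, assuming only that a target exposes an `outputs` list. The file names mirror the submoduleA example.

```python
from collections import defaultdict
from types import SimpleNamespace

def target_output_files(targets, key):
    """Flatten the output files of all targets registered under key."""
    return [out for target in targets[key] for out in target.outputs]

# Stand-ins for gwf targets: only the .outputs attribute is needed here
targets = defaultdict(list)
targets['work'].append(SimpleNamespace(outputs=['A_outputs/A_output1.txt']))
targets['summary'].append(SimpleNamespace(outputs=['A_outputs/A_output2.txt']))

print(target_output_files(targets, 'work'))     # ['A_outputs/A_output1.txt']
print(target_output_files(targets, 'missing'))  # []
```

Because `targets` is a `defaultdict(list)`, asking for a key that no target was registered under gives an empty list rather than an error.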
The two submodule workflows above can be combined into parts of a master workflow. The master workflow.py sits in the parent dir of submoduleA and submoduleB:
├── submoduleA
│ └── workflow.py
├── submoduleB
│ └── workflow.py
└── workflow.py
If you write it like this:
import os
from gwf.workflow import collect
from submoduleA.workflow import submoduleA_workflow
from submoduleB.workflow import submoduleB_workflow

working_dir = os.getcwd()

def target_output_files(targets, key):
    return [out for target in targets[key] for out in target.outputs]

# submodule A workflow
gwf, A_targets = submoduleA_workflow(working_dir=working_dir,
                                     input_files=['./input.txt'],
                                     output_dir='./A_outputs')
globals()['submoduleA'] = gwf

# add an extra target to glue workflows together
gwf.target('extra', inputs=['./input.txt'], outputs=['./A_outputs/extra.txt']) << """
touch ./A_outputs/extra.txt
"""

# get relevant outputs from A for input to B
input_files = target_output_files(A_targets, 'work')

# submodule B workflow
gwf, B_targets = submoduleB_workflow(working_dir=working_dir,
                                     input_files=input_files,
                                     output_dir='./B_outputs')
globals()['submoduleB'] = gwf
You can run each component workflow like this:
gwf -f workflow.py:submoduleA run
gwf -f workflow.py:submoduleB run
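The `workflow.py:submoduleA` form names an object inside the file, which is why the master workflow stores each workflow under its own module-level name with `globals()[...]`. The sketch below illustrates the idea with a hypothetical loader. It is not gwf's actual implementation, and plain strings stand in for Workflow objects.

```python
import importlib.util
import os
import tempfile

def load_workflow_object(spec, default_name="gwf"):
    """Load 'path/to/file.py:name' and return the named module-level object.

    Hypothetical illustration only; assumes the path itself contains no colon.
    """
    path, _, name = spec.partition(":")
    module_spec = importlib.util.spec_from_file_location("workflow_sketch", path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, name or default_name)

# A stand-in master workflow: the name gwf is reused, the aliases are not
source = (
    "gwf = 'workflow A'\n"
    "globals()['submoduleA'] = gwf\n"
    "gwf = 'workflow B'\n"
    "globals()['submoduleB'] = gwf\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "workflow.py")
    with open(path, "w") as handle:
        handle.write(source)
    first = load_workflow_object(path + ":submoduleA")
    second = load_workflow_object(path + ":submoduleB")
    default = load_workflow_object(path)

print(first, second, default, sep=" | ")
```

In this sketch the plain `gwf` name ends up holding whatever was assigned last, while the aliases keep each component reachable by name. Writing `submoduleA = gwf` would have the same effect as the `globals()` assignment.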