Gene info

import sys, os
import re
from collections import defaultdict
import numpy as np
import pandas as pd

%matplotlib inline

import matplotlib.pyplot as plt
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('retina', 'png')
from matplotlib.patches import Rectangle, Polygon
import seaborn as sns
sns.set_style('white')

import geneinfo as gi
gi.email('ariadna.saez@alum.esci.upf.edu')


# for use on the cluster
%env ftp_proxy http://proxy-default:3128

env: ftp_proxy=http://proxy-default:3128

help(gi)

Help on package geneinfo:

NAME
    geneinfo

PACKAGE CONTENTS
    intervals

CLASSES
    builtins.Exception(builtins.BaseException)
        NotFound
    builtins.object
        WrSubObo
        nice
    goatools.go_enrichment.GOEnrichmentRecord(builtins.object)
        My_GOEnrichemntRecord
    
    class My_GOEnrichemntRecord(goatools.go_enrichment.GOEnrichmentRecord)
     |  My_GOEnrichemntRecord(goid, **kwargs)
     |  
     |  Represents one result (from a single GOTerm) in the GOEnrichmentStudy
     |  
     |  Method resolution order:
     |      My_GOEnrichemntRecord
     |      goatools.go_enrichment.GOEnrichmentRecord
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __str__(self)
     |      Return str(self).
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from goatools.go_enrichment.GOEnrichmentRecord:
     |  
     |  __init__(self, goid, **kwargs)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  __repr__(self)
     |      Return repr(self).
     |  
     |  get_field_values(self, fldnames, rpt_fmt=True, itemid2name=None)
     |      Get flat namedtuple fields for one GOEnrichmentRecord.
     |  
     |  get_indent_dots(self)
     |      Get a string of dots ("....") representing the level of the GO term.
     |  
     |  get_method_name(self)
     |      Return name of first method in the method_flds list.
     |  
     |  get_prtflds_all(self)
     |      When converting to a namedtuple, get all possible fields in their original order.
     |  
     |  get_prtflds_default(self)
     |      Get default fields.
     |  
     |  get_pvalue(self)
     |      Returns pval for 1st method, if it exists. Else returns uncorrected pval.
     |  
     |  set_corrected_pval(self, nt_method, pvalue)
     |      Add object attribute based on method name.
     |  
     |  set_goterm(self, go2obj)
     |      Set goterm and copy GOTerm's name and namespace.
     |  
     |  update_remaining_fldsdefprt(self, min_ratio=None)
     |      Finish updating self (GOEnrichmentRecord) field, is_ratio_different.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from goatools.go_enrichment.GOEnrichmentRecord:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from goatools.go_enrichment.GOEnrichmentRecord:
     |  
     |  namespace2NS = OrderedDict([('biological_process', 'BP'), ('mol..._fun...
    
    class NotFound(builtins.Exception)
     |  query returned no result
     |  
     |  Method resolution order:
     |      NotFound
     |      builtins.Exception
     |      builtins.BaseException
     |      builtins.object
     |  
     |  Data descriptors defined here:
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from builtins.Exception:
     |  
     |  __init__(self, /, *args, **kwargs)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  ----------------------------------------------------------------------
     |  Static methods inherited from builtins.Exception:
     |  
     |  __new__(*args, **kwargs) from builtins.type
     |      Create and return a new object.  See help(type) for accurate signature.
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from builtins.BaseException:
     |  
     |  __delattr__(self, name, /)
     |      Implement delattr(self, name).
     |  
     |  __getattribute__(self, name, /)
     |      Return getattr(self, name).
     |  
     |  __reduce__(...)
     |      Helper for pickle.
     |  
     |  __repr__(self, /)
     |      Return repr(self).
     |  
     |  __setattr__(self, name, value, /)
     |      Implement setattr(self, name, value).
     |  
     |  __setstate__(...)
     |  
     |  __str__(self, /)
     |      Return str(self).
     |  
     |  with_traceback(...)
     |      Exception.with_traceback(tb) --
     |      set self.__traceback__ to tb and return self.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from builtins.BaseException:
     |  
     |  __cause__
     |      exception cause
     |  
     |  __context__
     |      exception context
     |  
     |  __dict__
     |  
     |  __suppress_context__
     |  
     |  __traceback__
     |  
     |  args
    
    class WrSubObo(builtins.object)
     |  WrSubObo(fin_obo=None, optional_attrs=None, load_obsolete=None)
     |  
     |  Read a large GO-DAG from an obo file. Write a subset GO-DAG into a small obo file.
     |  
     |  Methods defined here:
     |  
     |  __init__(self, fin_obo=None, optional_attrs=None, load_obsolete=None)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  wrobo(self, fout_obo, goid_sources)
     |      Write a subset obo file containing GO ID sources and their parents.
     |  
     |  ----------------------------------------------------------------------
     |  Static methods defined here:
     |  
     |  get_goids(fin_obo, name)
     |      Get GO IDs whose name matches given name.
     |  
     |  prt_goterms(fin_obo, goids, prt, b_prt=True)
     |      Print the specified GO terms for GO IDs in arg.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class nice(builtins.object)
     |  Methods defined here:
     |  
     |  __rlshift__(self, df)
     |      Left align columns of data frame: df << nice()
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)

FUNCTIONS
    chrom_ideogram(annot, hspace=0.1, min_visible_width=200000, figsize=(10, 10), assembly='hg38')
    
    download_and_move_go_basic_obo(prt)
    
    download_ncbi_associations(prt=<ipykernel.iostream.OutStream object at 0x151c47b74340>)
    
    email(email_address)
    
    ensembl2ncbi(ensembl_id)
    
    ensembl2symbol(ensembl_id)
    
    ensembl_get_gene_info_by_symbol(symbols, assembly=None, species='homo_sapiens')
    
    ensembl_get_genes_region(chrom, window_start, window_end, assembly=None, species='homo_sapiens')
    
    ensembl_id(name, species='homo_sapiens')
    
    fetch_background_genes(taxid=9606)
    
    gene_annotation_table(taxid=9606)
    
    gene_info(query, species='human', scopes='hgnc')
    
    gene_info_region(chrom, window_start, window_end, assembly='GRCh38', db='ncbiRefSeq')
    
    gene_plot(chrom, start, end, assembly, highlight=[], db='ncbiRefSeq', collapse_splice_var=True, hard_limits=False, exact_exons=False, figsize=None, aspect=1, despine=False, clip_on=True, gene_density=60, font_size=None, return_axes=1)
    
    get_genes_for_go_regex(regex, taxid=9606)
    
    get_genes_for_go_terms(terms, taxid=9606)
    
    get_genes_region(chrom, window_start, window_end, assembly='GRCh38', db='ncbiRefSeq')
    
    get_genes_region_dataframe(chrom, start, end, assembly='GRCh38', db='ncbiRefSeq')
    
    get_go_terms_for_genes(genes, taxid=9606, evidence=None)
    
    get_terms_for_go_regex(regex, taxid=9606, add_children=False)
    
    go_annotation_table(taxid=9606)
    
    go_enrichment(gene_list, taxid=9606, background_chrom=None, background_genes=None, terms=None, list_study_genes=False, alpha=0.05)
    
    go_info(terms)
    
    go_name2term(name)
    
    go_term2name(term)
    
    hgcn_symbol(name)
    
    log10(x, /)
        Return the base 10 logarithm of x.
    
    map_interval(chrom, start, end, strand, map_from, map_to, species='homo_sapiens')
    
    mygene_get_gene_info(query, species='human', scopes='hgnc', fields='symbol,alias,name,type_of_gene,summary,genomic_pos,genomic_pos_hg19')
    
    reduce(...)
        reduce(function, sequence[, initial]) -> value
        
        Apply a function of two arguments cumulatively to the items of a sequence,
        from left to right, so as to reduce the sequence to a single value.
        For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates
        ((((1+2)+3)+4)+5).  If initial is present, it is placed before the items
        of the sequence in the calculation, and serves as a default when the
        sequence is empty.
    
    show_go_dag_enrichment_results(results)
    
    show_go_dag_for_gene(gene, taxid=9606, evidence=None, add_relationships=True)
    
    show_go_dag_for_terms(terms, add_relationships=True)
    
    show_go_evidence_codes()
    
    show_string_network(my_genes, nodes=10)
    
    sqrt(x, /)
        Return the square root of x.
    
    string_network_table(my_genes, nodes=10)
    
    symbols_protein_coding(taxid=9606)
    
    tabulate_genes(words, ncols=None)

DATA
    CACHE = {}

FILE
    /home/ari/miniconda3/envs/arifdp/lib/python3.8/site-packages/geneinfo/__init__.py

RELATE

## AFRICANS ##
import pandas as pd
import os

# Define the threshold
threshold = 6

# Function to define window to search nearby genes
def get_window_range(position, window_size):
    x_min = position - window_size // 2
    x_max = position + window_size // 2
    return x_min, x_max

# List of file paths
file_paths = [
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/LWK/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele',
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/ESN/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele',
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/MSL/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele',
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/GWD/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele',
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/YRI/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele'
]

# Dictionary to store gene regions for each population
gene_regions = {}

# Set to store unique gene symbols
unique_genes = set()

# Iterate over file paths
for file_path in file_paths:
    # Extract population name from the filename
    population_name = file_path.split('/')[-4]  # Get the third element from the end when splitting by '/'
    
    # Read data with '\s+' separator and make the values in "when_mutation_has_freq2" column positive
    data = pd.read_csv(file_path, sep='\s+')
    data['when_mutation_has_freq2'] = data['when_mutation_has_freq2'].abs()
    
    # Add 'population' column
    data['population'] = population_name
    
    # Filter data based on threshold
    filtered_data = data[data['when_mutation_has_freq2'] > threshold]
    
    # Initialize a list to store gene regions for the current population
    population_gene_regions = []
    
    # Iterate over rows and calculate gene regions
    for index, row in filtered_data.iterrows():
        genomic_position = row['pos']
        window_size = 1000  # Set window size !!!!!!! 500 bp on each SNP side
        x_min, x_max = get_window_range(genomic_position, window_size)
        
        # Call gi.get_genes_region() for the current region
        gene_region = gi.get_genes_region('chrX', x_min, x_max, assembly='hg38')
        
        # Extract gene symbol, start position, and end position from the result
        if gene_region:  # Check if gene_region is not empty
            gene_symbol = gene_region[0][0]
            start_position = gene_region[0][1]
            end_position = gene_region[0][2]
            
            # Append the extracted information to the list of gene regions for the current population
            population_gene_regions.append((gene_symbol, start_position, end_position))
            
            # Add gene symbol to set of unique genes
            unique_genes.add(gene_symbol)
    
    # Store the list of gene regions for the current population in the dictionary
    gene_regions[population_name] = population_gene_regions

    # Print gene regions for the current population
    print(f"Gene regions for {population_name}:")
    for region in population_gene_regions:
        print(region)
    print()

Gene regions for LWK:
('PTCHD1', 23334395, 23404374)
('CASK', 41514933, 41665852)
('FAM120C', 54068323, 54183254)
('LOC105377212', 63237521, 63277646)
('LINC01278', 63426557, 63561095)
('AMMECR1', 110194185, 110318085)
('RTL4', 112083012, 112457514)
('ENOX2', 130622324, 130720491)
('RAP2C-AS1', 132218506, 132432811)
('MAMLD1', 150361571, 150514173)
('ZNF185', 152898066, 152973474)

Gene regions for ESN:
('CDKL5', 18425607, 18640196)
('CDKL5', 18425607, 18640196)
('CDKL5', 18425607, 18640196)
('CDKL5', 18425607, 18640196)
('PTCHD1-AS', 22193004, 23293146)
('CASK', 41514933, 41923034)
('CASK', 41514933, 41923034)
('LOC124905191', 53361522, 53370780)
('RTL4', 112083012, 112457514)
('TENM1', 124375902, 124963817)
('IGSF1', 131273505, 131289306)
('G6PD', 154531389, 154547018)

Gene regions for MSL:
('PTCHD1-AS', 22193004, 23293146)

Gene regions for GWD:
('ARSL', 2934520, 2964341)
('BCOR', 40051245, 40177277)
('BCOR', 40051245, 40177277)
('HUWE1', 53532095, 53680089)
('ACSL4', 109641334, 109733257)
('AMMECR1', 110194185, 110318085)
('AMMECR1', 110194185, 110318085)
('AMMECR1', 110194185, 110318085)
('AMMECR1', 110194185, 110318085)
('AMMECR1', 110194185, 110318085)
('AMMECR1', 110194185, 110318085)
('PAK3', 110944396, 111217494)
('IGSF1', 131273505, 131289306)

Gene regions for YRI:
('PRKX', 3604339, 3713649)
('CDKL5', 18425607, 18640196)
('DMD', 31119221, 32155469)
('NYX', 41447342, 41475652)
('CASK', 41514933, 41923034)
('CASK', 41514933, 41923034)
('GNL3L', 54530218, 54645854)
('LINC01278', 63426557, 63561095)
('LINC01278', 63426557, 63561095)
('LINC01278', 63426557, 63561095)
('LINC01278', 63426557, 63561095)
('ZMYM3', 71239623, 71253721)
('ZMYM3', 71239623, 71254649)
('MAGT1', 77825746, 77895435)
('TMEM164', 110002368, 110177788)
('TMEM164', 110002368, 110177788)
('TMEM164', 110002368, 110177788)
('RAB33A', 130110622, 130184870)
('RAB33A', 130110622, 130184870)
('MAMLD1', 150361571, 150514173)

## PUERTO RICANS ##
import pandas as pd
import os

# Define the threshold
threshold = 6

# Function to define window to search nearby genes
def get_window_range(position, window_size):
    x_min = position - window_size // 2
    x_max = position + window_size // 2
    return x_min, x_max

# List of file paths
file_paths = [
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/PUR/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele'
]

# Dictionary to store gene regions for each population
gene_regions = {}

# Set to store unique gene symbols
unique_genes = set()

# Iterate over file paths
for file_path in file_paths:
    # Extract population name from the filename
    population_name = file_path.split('/')[-4]  # Get the third element from the end when splitting by '/'
    
    # Read data with '\s+' separator and make the values in "when_mutation_has_freq2" column positive
    data = pd.read_csv(file_path, sep='\s+')
    data['when_mutation_has_freq2'] = data['when_mutation_has_freq2'].abs()
    
    # Add 'population' column
    data['population'] = population_name
    
    # Filter data based on threshold
    filtered_data = data[data['when_mutation_has_freq2'] > threshold]
    
    # Initialize a list to store gene regions for the current population
    population_gene_regions = []
    
    # Iterate over rows and calculate gene regions
    for index, row in filtered_data.iterrows():
        genomic_position = row['pos']
        window_size = 1000  # Set window size !!!!!!! 500 bp on each SNP side
        x_min, x_max = get_window_range(genomic_position, window_size)
        
        # Call gi.get_genes_region() for the current region
        gene_region = gi.get_genes_region('chrX', x_min, x_max, assembly='hg38')
        
        # Extract gene symbol, start position, and end position from the result
        if gene_region:  # Check if gene_region is not empty
            gene_symbol = gene_region[0][0]
            start_position = gene_region[0][1]
            end_position = gene_region[0][2]
            
            # Append the extracted information to the list of gene regions for the current population
            population_gene_regions.append((gene_symbol, start_position, end_position))
            
            # Add gene symbol to set of unique genes
            unique_genes.add(gene_symbol)
    
    # Store the list of gene regions for the current population in the dictionary
    gene_regions[population_name] = population_gene_regions

    # Print gene regions for the current population
    print(f"Gene regions for {population_name}:")
    for region in population_gene_regions:
        print(region)
    print()

Gene regions for PUR:
('ARSL', 2934520, 2964341)
('SHROOM2', 9786428, 9949443)
('WWC3', 10015253, 10144474)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10620585)
('MID1', 10445309, 10620585)
('MID1', 10445309, 10620585)
('MID1', 10445309, 10620585)
('MID1', 10445309, 10677739)
('MID1', 10445309, 10677739)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('HCCS-DT', 10847542, 11111138)
('EGFL6', 13569600, 13633575)
('NHS', 17375199, 17735994)
('PHKA2', 18892297, 18938716)
('PHKA2', 18892297, 18938716)
('PHKA2', 18892297, 18954298)
('MAP3K15', 19360058, 19515508)
('SH3KBP1', 19533976, 19668254)
('LOC124905257', 20606476, 20727382)
('CNKSR2', 21374417, 21645440)
('PTCHD1-AS', 22193004, 23293146)
('PTCHD1-AS', 22193004, 23293146)
('PTCHD1-AS', 22193004, 23293146)
('PTCHD1-AS', 22193004, 23293146)
('DMD', 31119221, 33211549)
('DMD', 31119221, 33211549)
('CFAP47', 35919733, 35990150)
('CFAP47', 35919733, 35990150)
('CFAP47', 35919733, 36145079)
('LOC101928627', 36365625, 36440292)
('PRRG1', 37349363, 37442224)
('PRRG1', 37349363, 37442224)
('PRRG1', 37349363, 37442224)
('TSPAN7', 38561541, 38688918)
('PAGE1', 49687446, 49695984)
('CLCN5', 49922595, 50042541)
('CLCN5', 49922595, 50042541)
('CLCN5', 49922595, 50042541)
('CLCN5', 49922595, 50042541)
('SHROOM4', 50575533, 50814194)
('SHROOM4', 50575533, 50814194)
('SHROOM4', 50575533, 50814194)
('LOC105377209', 52195604, 52265931)
('LOC105377209', 52195604, 52265931)
('FAM120C', 54068323, 54183254)
('FAM120C', 54068323, 54183254)
('NHSL2', 71910844, 72140776)
('LOC107985713', 88494129, 88613490)
('LOC107985713', 88494129, 88613490)
('LOC107985713', 88494129, 88613490)
('LOC107985713', 88494129, 88613490)
('DIAPH2', 96684841, 97469836)
('IL1RAPL2', 104566198, 105767829)
('FRMPD3', 107449651, 107605251)
('MID2', 107825734, 107931637)
('AMMECR1', 110194185, 110440233)
('PAK3', 110944396, 111217494)
('TRPC5', 111768010, 112082776)
('TRPC5', 111768010, 112082776)
('TRPC5', 111768010, 112082776)
('TRPC5', 111768010, 112082776)
('TRPC5', 111768010, 112082776)
('TRPC5', 111768010, 112082776)
('TRPC5', 111768010, 112082776)
('TRPC5', 111768010, 112082776)
('TRPC5', 111768010, 112082776)
('RTL4', 112083012, 112457514)
('LOC101928437', 113042726, 113520614)
('LOC101928437', 113042726, 113520614)
('LOC101928437', 113042726, 113520614)
('LOC101928437', 113042726, 113520614)
('LOC101928437', 113042726, 113520614)
('LOC101928437', 113042726, 113520614)
('LOC101928437', 113042726, 113520614)
('LOC101928437', 113042726, 113520614)
('KLHL13', 117897812, 118116810)
('KLHL13', 117897812, 118116810)
('LINC01285', 118839555, 118882015)
('KIAA1210', 119078634, 119150579)
('KIAA1210', 119078634, 119150579)
('LINC03098', 119291528, 119335610)
('LINC03098', 119291528, 119335610)
('LINC03098', 119291528, 119335610)
('TMEM255A', 120251432, 120311461)
('TENM1', 124375902, 124963817)
('TENM1', 124375902, 124963817)
('TENM1', 124375902, 124963817)
('TENM1', 124375902, 124963817)
('TENM1', 124375902, 125204312)
('TENM1', 124375902, 125204312)
('XPNPEP2', 129738978, 129769536)
('ZDHHC9', 129803287, 129843599)
('RBMX2', 130401986, 130413656)
('ENOX2', 130622324, 130887549)
('ENOX2', 130622324, 130887549)
('ENOX2', 130622324, 130887549)
('LINC01201', 131016468, 131058146)
('LINC01201', 131016468, 131058146)
('ARHGAP36', 131058345, 131089885)
('ARHGAP36', 131058345, 131089885)
('ARHGAP36', 131058345, 131089885)
('ARHGAP36', 131058345, 131089885)
('PLAC1', 134565837, 134658471)
('PLAC1', 134565837, 134658471)
('PLAC1', 134565837, 134764322)
('SLC9A6', 135973836, 136047269)
('SLC9A6', 135973836, 136047269)
('LOC105373344', 140517784, 140526183)
('LOC101928808', 146394681, 146401858)
('AFF2', 148500616, 149000663)
('AFF2', 148500616, 149000663)
('AFF2', 148500616, 149000663)
('AFF2', 148500616, 149000663)
('AFF2', 148500616, 149000663)
('LINC00850', 149825707, 149879799)
('LINC00850', 149825707, 149879799)
('MAGEA3-DT', 152554376, 152698727)
('FLNA', 154348530, 154374634)

## EUROPEANS ##
import pandas as pd
import os

# Define the threshold
threshold = 6

# Function to define window to search nearby genes
def get_window_range(position, window_size):
    x_min = position - window_size // 2
    x_max = position + window_size // 2
    return x_min, x_max

# List of file paths
file_paths = [
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/GBR/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele',
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/FIN/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele',
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/IBS/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele',
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/TSI/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele'
]

# Dictionary to store gene regions for each population
gene_regions = {}

# Set to store unique gene symbols
unique_genes = set()

# Iterate over file paths
for file_path in file_paths:
    # Extract population name from the filename
    population_name = file_path.split('/')[-4]  # Get the third element from the end when splitting by '/'
    
    # Read data with '\s+' separator and make the values in "when_mutation_has_freq2" column positive
    data = pd.read_csv(file_path, sep='\s+')
    data['when_mutation_has_freq2'] = data['when_mutation_has_freq2'].abs()
    
    # Add 'population' column
    data['population'] = population_name
    
    # Filter data based on threshold
    filtered_data = data[data['when_mutation_has_freq2'] > threshold]
    
    # Initialize a list to store gene regions for the current population
    population_gene_regions = []
    
    # Iterate over rows and calculate gene regions
    for index, row in filtered_data.iterrows():
        genomic_position = row['pos']
        window_size = 1000  # Set window size !!!!!!! 500 bp on each SNP side
        x_min, x_max = get_window_range(genomic_position, window_size)
        
        # Call gi.get_genes_region() for the current region
        gene_region = gi.get_genes_region('chrX', x_min, x_max, assembly='hg38')
        
        # Extract gene symbol, start position, and end position from the result
        if gene_region:  # Check if gene_region is not empty
            gene_symbol = gene_region[0][0]
            start_position = gene_region[0][1]
            end_position = gene_region[0][2]
            
            # Append the extracted information to the list of gene regions for the current population
            population_gene_regions.append((gene_symbol, start_position, end_position))
            
            # Add gene symbol to set of unique genes
            unique_genes.add(gene_symbol)
    
    # Store the list of gene regions for the current population in the dictionary
    gene_regions[population_name] = population_gene_regions

    # Print gene regions for the current population
    print(f"Gene regions for {population_name}:")
    for region in population_gene_regions:
        print(region)
    print()

Gene regions for GBR:
('PTCHD1-AS', 22193004, 23293146)
('PTCHD1-AS', 22193004, 23293146)

Gene regions for FIN:
('IL1RAPL1', 28587445, 29956718)
('DMD', 31119221, 31266954)
('DMD', 31119221, 33211549)
('XPNPEP2', 129738978, 129769536)

Gene regions for IBS:
('PRKX', 3604339, 3713649)
('PRKX', 3604339, 3713649)
('PRKX', 3604339, 3713649)
('PRKX', 3604339, 3713649)
('PRKX', 3604339, 3713649)
('NHS', 17375199, 17735994)
('PHKA1', 72578813, 72714306)
('MIR325HG', 76657797, 77014532)
('TENM1', 124375902, 124963817)
('TMLHE', 155489010, 155536365)

Gene regions for TSI:
('PRKX', 3604339, 3713649)
('LOC112268307', 64205945, 64312562)

## ASIANS ##
import pandas as pd
import os

# Define the threshold
threshold = 6

# Function to define window to search nearby genes
def get_window_range(position, window_size):
    x_min = position - window_size // 2
    x_max = position + window_size // 2
    return x_min, x_max

# List of file paths
file_paths = [
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/CDX/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele',
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/CHS/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele',
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/CHB/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele',
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/JPT/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele',
    '/home/ari/ari-intern/people/ari/ariadna-intern/steps/KHV/relate/run_relate/1000g_ppl_phased_haplotypes_selection.sele'
]

# Dictionary to store gene regions for each population
gene_regions = {}

# Set to store unique gene symbols
unique_genes = set()

# Iterate over file paths
for file_path in file_paths:
    # Extract population name from the filename
    population_name = file_path.split('/')[-4]  # Get the third element from the end when splitting by '/'
    
    # Read data with '\s+' separator and make the values in "when_mutation_has_freq2" column positive
    data = pd.read_csv(file_path, sep='\s+')
    data['when_mutation_has_freq2'] = data['when_mutation_has_freq2'].abs()
    
    # Add 'population' column
    data['population'] = population_name
    
    # Filter data based on threshold
    filtered_data = data[data['when_mutation_has_freq2'] > threshold]
    
    # Initialize a list to store gene regions for the current population
    population_gene_regions = []
    
    # Iterate over rows and calculate gene regions
    for index, row in filtered_data.iterrows():
        genomic_position = row['pos']
        window_size = 1000  # Set window size !!!!!!! 500 bp on each SNP side
        x_min, x_max = get_window_range(genomic_position, window_size)
        
        # Call gi.get_genes_region() for the current region
        gene_region = gi.get_genes_region('chrX', x_min, x_max, assembly='hg38')
        
        # Extract gene symbol, start position, and end position from the result
        if gene_region:  # Check if gene_region is not empty
            gene_symbol = gene_region[0][0]
            start_position = gene_region[0][1]
            end_position = gene_region[0][2]
            
            # Append the extracted information to the list of gene regions for the current population
            population_gene_regions.append((gene_symbol, start_position, end_position))
            
            # Add gene symbol to set of unique genes
            unique_genes.add(gene_symbol)
    
    # Store the list of gene regions for the current population in the dictionary
    gene_regions[population_name] = population_gene_regions

    # Print gene regions for the current population
    print(f"Gene regions for {population_name}:")
    for region in population_gene_regions:
        print(region)
    print()

Gene regions for CDX:
('WWC3', 10015253, 10144474)
('FRMPD4', 11822438, 12724523)
('FRMPD4', 11822438, 12724523)
('DMD', 31119221, 33211549)
('IL1RAPL2', 104566198, 105767829)

Gene regions for CHS:
('PRKX', 3604339, 3713649)
('CLCN4', 10156974, 10237660)
('LOC101928359', 122422005, 122478081)

Gene regions for CHB:
('PRKX', 3604339, 3713649)
('PRKX', 3604339, 3713649)
('DMD', 31119221, 33211549)
('PASD1', 151563674, 151676739)

Gene regions for JPT:
('CLCN5', 49922595, 50042541)

Gene regions for KHV:
('PASD1', 151563674, 151676739)
('TMEM187', 153972753, 153983194)

COMBINED

## AFRICANS##

# Define the threshold
threshold = 6

# Function to define window to search nearby genes
def get_window_range(position, window_size):
    x_min = position - window_size // 2
    x_max = position + window_size // 2
    return x_min, x_max

# List of file paths
file_paths = [
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_LWK.csv',
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_ESN.csv',
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_GWD.csv',
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_MSL.csv',
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_YRI.csv'
]

# Dictionary to store gene regions for each population
gene_regions = {}

# Set to store unique gene symbols
unique_genes = set()

# Iterate over file paths
for file_path in file_paths:
    # Extract population name from the filename
    population_name = os.path.splitext(os.path.basename(file_path))[0]
    
    # Read data
    data = pd.read_csv(file_path)
    
    # Add 'population' column
    data['population'] = population_name
    
    # Filter data based on threshold
    filtered_data = data[data['combined_p_values'] > threshold]
    
    # Initialize a list to store gene regions for the current population
    population_gene_regions = []
    
    # Iterate over rows and calculate gene regions
    for index, row in filtered_data.iterrows():
        genomic_position = row['pos']
        window_size = 1000  # Set window size !!!!!!! 500 bp on each SNP side
        x_min, x_max = get_window_range(genomic_position, window_size)
        
        # Call gi.get_genes_region() for the current region
        gene_region = gi.get_genes_region('chrX', x_min, x_max, assembly='hg38')
        
        # Extract gene symbol, start position, and end position from the result
        if gene_region:  # Check if gene_region is not empty
            gene_symbol = gene_region[0][0]
            start_position = gene_region[0][1]
            end_position = gene_region[0][2]
            
            # Append the extracted information to the list of gene regions for the current population
            population_gene_regions.append((gene_symbol, start_position, end_position))
            
# Add gene symbol to set of unique genes
            unique_genes.add(gene_symbol)
    
    # Store the list of gene regions for the current population in the dictionary
    gene_regions[population_name] = population_gene_regions

# Print gene regions for all populations
for population, regions in gene_regions.items():
    print(f"Gene regions for {population}:")
    for region in regions:
        print(region)
    print()


print()
print()

# Set to store gene symbols for which information has already been retrieved
retrieved_symbols = set()

# Run gi.gene_info() for each unique gene symbol
for gene_symbol in unique_genes:
    # Check if gene information has already been retrieved for this symbol
    if gene_symbol not in retrieved_symbols:
        gene_info = gi.gene_info(gene_symbol)
        if gene_info is not None:
            print(f"Gene info for {gene_symbol}: {gene_info}")

Gene regions for combined_LWK:
('CASK', 41514933, 41665852)
('RTL4', 112083012, 112457514)

Gene regions for combined_ESN:
('CASK', 41514933, 41923034)
('RTL4', 112083012, 112457514)

Gene regions for combined_GWD:
('PAK3', 110944396, 111217494)

Gene regions for combined_MSL:
('PTCHD1-AS', 22193004, 23293146)

Gene regions for combined_YRI:
('PRKX', 3604339, 3713649)
('TMEM164', 110002368, 110177788)
('TMEM164', 110002368, 110177788)
('TMEM164', 110002368, 110177788)
('RAB33A', 130110622, 130184870)
('RAB33A', 130110622, 130184870)

Symbol: PTCHD1-AS
PTCHD1 antisense RNA (head to head)
Gene card

Symbol: PAK3 (protein-coding) Aliases: ARA, MRX30, MRX47, OPHN3, PAK-3, PAK3beta, XLID30, bPAK, beta-PAK
p21 (RAC1) activated kinase 3
Summary: The protein encoded by this gene is a serine-threonine kinase and forms an activated complex with GTP-bound RAS-like (P21), CDC2 and RAC1. This protein may be necessary for dendritic development and for the rapid cytoskeletal reorganization in dendritic spines associated with synaptic plasticity. Defects in this gene are the cause of a non-syndromic form of X-linked intellectual disability. Alternatively spliced transcript variants encoding different isoforms have been identified. [provided by RefSeq, Jul 2017].
Genomic position: X:110944285-111227361 (hg38), X:110187513-110470589 (hg19)
Gene card

Symbol: RAB33A (protein-coding) Aliases: RabS10
RAB33A, member RAS oncogene family
Summary: The protein encoded by this gene belongs to the small GTPase superfamily, Rab family. It is GTP-binding protein and may be involved in vesicle transport. [provided by RefSeq, Jul 2008].
Genomic position: X:130171962-130184870 (hg38), X:129305623-129318844 (hg19)
Gene card

Symbol: PRKX (protein-coding) Aliases: PKX1
protein kinase cAMP-dependent X-linked catalytic subunit
Summary: This gene encodes a serine threonine protein kinase that has similarity to the catalytic subunit of cyclic AMP dependent protein kinases. The encoded protein is developmentally regulated and may be involved in renal epithelial morphogenesis. This protein may also be involved in macrophage and granulocyte maturation. Abnormal recombination between this gene and a related pseudogene on chromosome Y is a frequent cause of sex reversal disorder in XX males and XY females. Pseudogenes of this gene are found on chromosomes X, 15 and Y. [provided by RefSeq, Feb 2010].
Genomic position: X:3604340-3713649 (hg38), X:3522411-3631649 (hg19)
Gene card

Symbol: TMEM164 (protein-coding) Aliases: bB360B22.3
transmembrane protein 164
Summary: Predicted to be integral component of membrane. [provided by Alliance of Genome Resources, Apr 2022]
Genomic position: X:110002631-110182734 (hg38), X:109245859-109425962 (hg19)
Gene card

Symbol: RTL4 (protein-coding) Aliases: Mar4, Mart4, SIRH11, ZCCHC16
retrotransposon Gag like 4
Summary: Predicted to enable nucleic acid binding activity and zinc ion binding activity. Predicted to act upstream of or within cognition and norepinephrine metabolic process. [provided by Alliance of Genome Resources, Apr 2022]
Genomic position: X:112083016-112457514 (hg38), X:111697727-111700473 (hg19)
Gene card

Symbol: CASK (protein-coding) Aliases: CAGH39, CAMGUK, CMG, FGS4, LIN2, MICPCH, MRXSNA, TNRC8, hCASK
calcium/calmodulin dependent serine protein kinase
Summary: This gene encodes a calcium/calmodulin-dependent serine protein kinase. The encoded protein is a MAGUK (membrane-associated guanylate kinase) protein family member. These proteins are scaffold proteins and the encoded protein is located at synapses in the brain. Mutations in this gene are associated with FG syndrome 4, intellectual disability and microcephaly with pontine and cerebellar hypoplasia, and a form of X-linked intellectual disability. Multiple transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2017].
Genomic position: X:41514934-41923554 (hg38), X:41374187-41782716 (hg19)
Gene card

## PUR ##

threshold = 6

# Function to define window to search nearby genes
def get_window_range(position, window_size):
    x_min = position - window_size // 2
    x_max = position + window_size // 2
    return x_min, x_max

# List of file paths
file_paths = [
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_PUR.csv'
]

# Dictionary to store gene regions for each population
gene_regions = {}

# Set to store unique gene symbols
unique_genes = set()

# Iterate over file paths
for file_path in file_paths:
    # Extract population name from the filename
    population_name = os.path.splitext(os.path.basename(file_path))[0]
    
    # Read data
    data = pd.read_csv(file_path)
    
    # Add 'population' column
    data['population'] = population_name
    
    # Filter data based on threshold
    filtered_data = data[data['combined_p_values'] > threshold]
    
    # Initialize a list to store gene regions for the current population
    population_gene_regions = []
    
    # Iterate over rows and calculate gene regions
    for index, row in filtered_data.iterrows():
        genomic_position = row['pos']
        window_size = 1000  # Set window size !!!!!!! 500 bp on each SNP side
        x_min, x_max = get_window_range(genomic_position, window_size)
        
        # Call gi.get_genes_region() for the current region
        gene_region = gi.get_genes_region('chrX', x_min, x_max, assembly='hg38')
        
        # Extract gene symbol, start position, and end position from the result
        if gene_region:  # Check if gene_region is not empty
            gene_symbol = gene_region[0][0]
            start_position = gene_region[0][1]
            end_position = gene_region[0][2]
            
            # Append the extracted information to the list of gene regions for the current population
            population_gene_regions.append((gene_symbol, start_position, end_position))
            
# Add gene symbol to set of unique genes
            unique_genes.add(gene_symbol)
    
    # Store the list of gene regions for the current population in the dictionary
    gene_regions[population_name] = population_gene_regions

# Print gene regions for all populations
for population, regions in gene_regions.items():
    print(f"Gene regions for {population}:")
    for region in regions:
        print(region)
    print()


print()
print()

# Set to store gene symbols for which information has already been retrieved
retrieved_symbols = set()

# Run gi.gene_info() for each unique gene symbol
for gene_symbol in unique_genes:
    # Check if gene information has already been retrieved for this symbol
    if gene_symbol not in retrieved_symbols:
        gene_info = gi.gene_info(gene_symbol)
        if gene_info is not None:
            print(f"Gene info for {gene_symbol}: {gene_info}")

Gene regions for combined_PUR:
('WWC3', 10015253, 10144474)
('WWC3', 10015253, 10144474)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10576917)
('MID1', 10445309, 10620585)
('MID1', 10445309, 10620585)
('MID1', 10445309, 10677739)
('MID1', 10445309, 10677739)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('MID1', 10445309, 10833683)
('PHKA2', 18892297, 18938716)
('PHKA2', 18892297, 18938716)
('MAP3K15', 19360058, 19515508)
('PTCHD1-AS', 22193004, 23293146)
('TSPAN7', 38561541, 38688918)
('CLCN5', 49922595, 50042541)
('SHROOM4', 50575533, 50814194)
('SHROOM4', 50575533, 50814194)
('SHROOM4', 50575533, 50814194)
('APEX2', 55000362, 55009057)
('HMGN5', 81113698, 81201913)
('LOC107985713', 88494129, 88613490)
('LOC107985713', 88494129, 88613490)
('IL1RAPL2', 104566198, 105767829)
('MID2', 107825734, 107931637)
('AMMECR1', 110194185, 110440233)
('TRPC5', 111768010, 112082776)
('TRPC5', 111768010, 112082776)
('TRPC5', 111768010, 112082776)
('RTL4', 112083012, 112457514)
('TENM1', 124375902, 124963817)
('TENM1', 124375902, 125204312)
('XPNPEP2', 129738978, 129769536)
('LOC101928808', 146394681, 146401858)
('AFF2', 148500616, 149000663)
('AFF2', 148500616, 149000663)
('AFF2', 148500616, 149000663)
('AFF2', 148500616, 149000663)
('LINC00850', 149825707, 149879799)
('LINC00850', 149825707, 149879799)
('MAGEA3-DT', 152554376, 152698727)

Symbol: PTCHD1-AS
PTCHD1 antisense RNA (head to head)
Gene card

Symbol: LINC00850 (ncRNA) Aliases: KUCG1
long intergenic non-protein coding RNA 850
Gene card

Symbol: MID1 (protein-coding) Aliases: BBBG1, FXY, GBBB, GBBB1, MIDIN, OGS1, OS, OSX, RNF59, TRIM18, XPRF, ZNFXY
midline 1
Summary: The protein encoded by this gene is a member of the tripartite motif (TRIM) family, also known as the ‘RING-B box-coiled coil’ (RBCC) subgroup of RING finger proteins. The TRIM motif includes three zinc-binding domains, a RING, a B-box type 1 and a B-box type 2, and a coiled-coil region. This protein forms homodimers which associate with microtubules in the cytoplasm. The protein is likely involved in the formation of multiprotein structures acting as anchor points to microtubules. Mutations in this gene have been associated with the X-linked form of Opitz syndrome, which is characterized by midline abnormalities such as cleft lip, laryngeal cleft, heart defects, hypospadias, and agenesis of the corpus callosum. This gene was also the first example of a gene subject to X inactivation in human while escaping it in mouse. Alternative promoter use, alternative splicing and alternative polyadenylation result in multiple transcript variants that have different tissue specificities. [provided by RefSeq, Dec 2016].
Genomic position: X:10445310-10833654 (hg38), X:10413350-10851773 (hg19)
Gene card

Symbol: TSPAN7 (protein-coding) Aliases: A15, CCG-B7, CD231, DXS1692E, MRX58, MXS1, TALLA-1, TM4SF2, TM4SF2b, XLID58
tetraspanin 7
Summary: The protein encoded by this gene is a member of the transmembrane 4 superfamily, also known as the tetraspanin family. Most of these members are cell-surface proteins that are characterized by the presence of four hydrophobic domains. The proteins mediate signal transduction events that play a role in the regulation of cell development, activation, growth and motility. This encoded protein is a cell surface glycoprotein and may have a role in the control of neurite outgrowth. It is known to complex with integrins. This gene is associated with X-linked cognitive disability and neuropsychiatric diseases such as Huntington’s chorea, fragile X syndrome and myotonic dystrophy. [provided by RefSeq, Jul 2008].
Genomic position: X:38561542-38688920 (hg38), X:38420623-38548169 (hg19)
Gene card

Symbol: HMGN5 (protein-coding) Aliases: NBP-45, NSBP1
high mobility group nucleosome binding domain 5
Summary: This gene encodes a nuclear protein with similarities to the high mobility group proteins, HMG14 and HMG17, which suggests that this protein may function as a nucleosomal binding and transcriptional activating protein. [provided by RefSeq, Sep 2009].
Genomic position: X:81113699-81201913 (hg38), X:80369200-80457441 (hg19)
Gene card

Symbol: AMMECR1 (protein-coding) Aliases: AMMERC1, MFHIEN
AMMECR nuclear protein 1
Summary: The exact function of this gene is not known, however, submicroscopic deletion of the X chromosome including this gene, COL4A5, and FACL4 genes, result in a contiguous gene deletion syndrome, the AMME complex (Alport syndrome, mental retardation, midface hypoplasia, and elliptocytosis). Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jan 2010].
Genomic position: X:110194186-110440318 (hg38), X:109437414-109683461 (hg19)
Gene card

Symbol: XPNPEP2 (protein-coding) Aliases: AEACEI, APP2
X-prolyl aminopeptidase 2
Summary: Aminopeptidase P is a hydrolase specific for N-terminal imido bonds, which are common to several collagen degradation products, neuropeptides, vasoactive peptides, and cytokines. Structurally, the enzyme is a member of the ‘pita bread fold’ family and occurs in mammalian tissues in both soluble and GPI-anchored membrane-bound forms. A membrane-bound and soluble form of this enzyme have been identified as products of two separate genes. [provided by RefSeq, Jul 2008].
Genomic position: X:129738949-129769536 (hg38), X:128872950-128903514 (hg19)
Gene card

Symbol: MAP3K15 (protein-coding) Aliases: ASK3, bA723P2.3
mitogen-activated protein kinase kinase kinase 15
Summary: The protein encoded by this gene is a member of the mitogen-activated protein kinase (MAPK) family. These family members function in a protein kinase signal transduction cascade, where an activated MAPK kinase kinase (MAP3K) phosphorylates and activates a specific MAPK kinase (MAP2K), which then activates a specific MAPK. This MAP3K protein plays an essential role in apoptotic cell death triggered by cellular stresses. [provided by RefSeq, Jul 2010].
Genomic position: X:19360056-19515508 (hg38), X:19378174-19533379 (hg19)
Gene card

Symbol: TENM1 (protein-coding) Aliases: ODZ1, ODZ3, TEN-M1, TEN1, TNM, TNM1, ten-1
teneurin transmembrane protein 1
Summary: The protein encoded by this gene belongs to the tenascin family and teneurin subfamily. It is expressed in the neurons and may function as a cellular signal transducer. Several alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Sep 2009].
Genomic position: X:124375903-125204312 (hg38), X:123509753-124097666 (hg19)
Gene card

Symbol: SHROOM4 (protein-coding) Aliases: MRXSSDS, SHAP, shrm4
shroom family member 4
Summary: This gene encodes a member of the APX/Shroom family, which contain an N-terminal PDZ domain and a C-terminal ASD2 motif. The encoded protein may play a role in cytoskeletal architecture. Mutations in this gene have been linked to the X-linked Stocco dos Santos syndrome characterized by cognitive disabilities. Alternatively spliced transcript variants have been described. [provided by RefSeq, Jul 2017].
Genomic position: X:50586796-50814302 (hg38), HG1433_PATCH:50334647-50557302, X:50334647-50557302 (hg19)
Gene card

Symbol: AFF2 (protein-coding) Aliases: FMR2, FMR2P, FRAXE, MRX2, OX19, XLID109
ALF transcription elongation factor 2
Summary: This gene encodes a putative transcriptional activator that is a member of the AF4 gene family. This gene is associated with the folate-sensitive fragile X E locus on chromosome X. A repeat polymorphism in the fragile X E locus results in silencing of this gene causing Fragile X E syndrome. Fragile X E syndrome is a form of nonsyndromic X-linked cognitive disability. In addition, this gene contains 6-25 GCC repeats that are expanded to >200 repeats in the disease state. Alternate splicing results in multiple transcript variants.[provided by RefSeq, Jul 2016].
Genomic position: X:148500617-149000663 (hg38), HG1459_PATCH:147582336-148082384, X:147582139-148082193 (hg19)
Gene card

Symbol: IL1RAPL2 (protein-coding) Aliases: IL-1R9, IL1R9, IL1RAPL-2, TIGIRR-1
interleukin 1 receptor accessory protein like 2
Summary: The protein encoded by this gene is a member of the interleukin 1 receptor family. This protein is similar to the interleukin 1 accessory proteins, and is most closely related to interleukin 1 receptor accessory protein-like 1 (IL1RAPL1). This gene and IL1RAPL1 are located at a region on chromosome X that is associated with X-linked non-syndromic cognitive disability. [provided by RefSeq, Jul 2008].
Genomic position: X:104566199-105767829 (hg38), HG375_PATCH:103810996-105011820, X:103810996-105011822 (hg19)
Gene card

Symbol: CLCN5 (protein-coding) Aliases: CLC5, CLCK2, ClC-5, DENT1, DENTS, NPHL1, NPHL2, XLRH, XRN, hCIC-K2
chloride voltage-gated channel 5
Summary: This gene encodes a member of the ClC family of chloride ion channels and ion transporters. The encoded protein is primarily localized to endosomal membranes and may function to facilitate albumin uptake by the renal proximal tubule. Mutations in this gene have been found in Dent disease and renal tubular disorders complicated by nephrolithiasis. Alternatively spliced transcript variants have been found for this gene. [provided by RefSeq, Jan 2013].
Genomic position: X:49922596-50099235 (hg38), HG1436_HG1432_PATCH:49732014-49908634, X:49687225-49863892 (hg19)
Gene card

Symbol: APEX2 (protein-coding) Aliases: APE2, APEXL2, XTH2, ZGRF2
apurinic/apyrimidinic endodeoxyribonuclease 2
Summary: Apurinic/apyrimidinic (AP) sites occur frequently in DNA molecules by spontaneous hydrolysis, by DNA damaging agents or by DNA glycosylases that remove specific abnormal bases. AP sites are pre-mutagenic lesions that can prevent normal DNA replication so the cell contains systems to identify and repair such sites. Class II AP endonucleases cleave the phosphodiester backbone 5’ to the AP site. This gene encodes a protein shown to have a weak class II AP endonuclease activity. Most of the encoded protein is located in the nucleus but some is also present in mitochondria. This protein may play an important role in both nuclear and mitochondrial base excision repair. Alternatively spliced transcript variants encoding multiple isoforms have been observed for this gene. [provided by RefSeq, Nov 2012].
Genomic position: X:55000363-55009057 (hg38), X:55026790-55035490 (hg19)
Gene card

Symbol: LOC101928808 (ncRNA)
uncharacterized LOC101928808
Gene card

Symbol: MID2 (protein-coding) Aliases: FXY2, MRX101, RNF60, TRIM1, XLID101
midline 2
Summary: The protein encoded by this gene is a member of the tripartite motif (TRIM) family. The TRIM motif includes three zinc-binding domains, a RING, a B-box type 1 and a B-box type 2, and a coiled-coil region. The protein localizes to microtubular structures in the cytoplasm. Alternate splicing of this gene results in two transcript variants encoding different isoforms. [provided by RefSeq, Feb 2009].
Genomic position: X:107825755-107931637 (hg38), X:107068985-107170423 (hg19)
Gene card

Symbol: TRPC5 (protein-coding) Aliases: PPP1R159, TRP5
transient receptor potential cation channel subfamily C member 5
Summary: This gene belongs to the transient receptor family. It encodes one of the seven mammalian TRPC (transient receptor potential channel) proteins. The encoded protein is a multi-pass membrane protein and is thought to form a receptor-activated non-selective calcium permeant cation channel. The protein is active alone or as a heteromultimeric assembly with TRPC1, TRPC3, and TRPC4. It also interacts with multiple proteins including calmodulin, CABP1, enkurin, Na(+)-H+ exchange regulatory factor (NHERF ), interferon-induced GTP-binding protein (MX1), ring finger protein 24 (RNF24), and SEC14 domain and spectrin repeat-containing protein 1 (SESTD1). [provided by RefSeq, May 2010].
Genomic position: X:111768011-112082776 (hg38), X:111017543-111326004 (hg19)
Gene card

Symbol: MAGEA3-DT
MAGEA3 divergent transcript
Gene card

Symbol: WWC3 (protein-coding) Aliases: BM042
WWC family member 3
Summary: This gene encodes a member of the WWC family of proteins, which also includes the WWC1 (KIBRA) gene product and the WWC2 gene product. The protein encoded by this gene includes a C2 domain, which is known to mediate homodimerization in the related WWC1 gene product. [provided by RefSeq, Sep 2011].
Genomic position: X:10015254-10144474 (hg38), X:9983602-10112518 (hg19)
Gene card

Symbol: PHKA2 (protein-coding) Aliases: GSD9A, PHK, PYK, PYKL, XLG, XLG2
phosphorylase kinase regulatory subunit alpha 2
Summary: Phosphorylase kinase is a polymer of 16 subunits, four each of alpha, beta, gamma and delta. The alpha subunit includes the skeletal muscle and hepatic isoforms, and the hepatic isoform is encoded by this gene. The beta subunit is the same in both the muscle and hepatic isoforms, and encoded by one gene. The gamma subunit also includes the skeletal muscle and hepatic isoforms, which are encoded by two different genes. The delta subunit is a calmodulin and can be encoded by three different genes. The gamma subunits contain the active site of the enzyme, whereas the alpha and beta subunits have regulatory functions controlled by phosphorylation. The delta subunit mediates the dependence of the enzyme on calcium concentration. Mutations in this gene cause glycogen storage disease type 9A, also known as X-linked liver glycogenosis. Alternatively spliced transcript variants have been reported, but the full-length nature of these variants has not been determined.[provided by RefSeq, Feb 2010].
Genomic position: X:18892298-18984114 (hg38), X:18910418-19002716 (hg19)
Gene card

Symbol: LOC107985713 (ncRNA)
uncharacterized LOC107985713
Gene card

## EUROPEANS ##
threshold = 6

# Function to define window to search nearby genes
def get_window_range(position, window_size):
    x_min = position - window_size // 2
    x_max = position + window_size // 2
    return x_min, x_max

# List of file paths
file_paths = [
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_GBR.csv',
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_FIN.csv',
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_IBS.csv',
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_TSI.csv'
]

# Dictionary to store gene regions for each population
gene_regions = {}

# Set to store unique gene symbols
unique_genes = set()

# Iterate over file paths
for file_path in file_paths:
    # Extract population name from the filename
    population_name = os.path.splitext(os.path.basename(file_path))[0]
    
    # Read data
    data = pd.read_csv(file_path)
    
    # Add 'population' column
    data['population'] = population_name
    
    # Filter data based on threshold
    filtered_data = data[data['combined_p_values'] > threshold]
    
    # Initialize a list to store gene regions for the current population
    population_gene_regions = []
    
    # Iterate over rows and calculate gene regions
    for index, row in filtered_data.iterrows():
        genomic_position = row['pos']
        window_size = 1000  # Set window size !!!!!!! 500 bp on each SNP side
        x_min, x_max = get_window_range(genomic_position, window_size)
        
        # Call gi.get_genes_region() for the current region
        gene_region = gi.get_genes_region('chrX', x_min, x_max, assembly='hg38')
        
        # Extract gene symbol, start position, and end position from the result
        if gene_region:  # Check if gene_region is not empty
            gene_symbol = gene_region[0][0]
            start_position = gene_region[0][1]
            end_position = gene_region[0][2]
            
            # Append the extracted information to the list of gene regions for the current population
            population_gene_regions.append((gene_symbol, start_position, end_position))
            
# Add gene symbol to set of unique genes
            unique_genes.add(gene_symbol)
    
    # Store the list of gene regions for the current population in the dictionary
    gene_regions[population_name] = population_gene_regions

# Print gene regions for all populations
for population, regions in gene_regions.items():
    print(f"Gene regions for {population}:")
    for region in regions:
        print(region)
    print()


print()
print()

# Set to store gene symbols for which information has already been retrieved
retrieved_symbols = set()

# Run gi.gene_info() for each unique gene symbol
for gene_symbol in unique_genes:
    # Check if gene information has already been retrieved for this symbol
    if gene_symbol not in retrieved_symbols:
        gene_info = gi.gene_info(gene_symbol)
        if gene_info is not None:
            print(f"Gene info for {gene_symbol}: {gene_info}")

Gene regions for combined_GBR:

Gene regions for combined_FIN:

Gene regions for combined_IBS:
('PRKX', 3604339, 3713649)
('TENM1', 124375902, 124963817)

Gene regions for combined_TSI:
('LOC112268307', 64205945, 64312562)

Symbol: LOC112268307 (protein-coding)
uncharacterized LOC112268307
Gene card

## ASIANS ##
threshold = 6

# Function to define window to search nearby genes
def get_window_range(position, window_size):
    x_min = position - window_size // 2
    x_max = position + window_size // 2
    return x_min, x_max

# List of file paths
file_paths = [
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_CDX.csv',
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_CHS.csv',
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_CHB.csv',
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_KHV.csv',
    '/home/ari/ari-intern/people/ari/ariadna-intern/results/combined_JPT.csv'
]

# Dictionary to store gene regions for each population
gene_regions = {}

# Set to store unique gene symbols
unique_genes = set()

# Iterate over file paths
for file_path in file_paths:
    # Extract population name from the filename
    population_name = os.path.splitext(os.path.basename(file_path))[0]
    
    # Read data
    data = pd.read_csv(file_path)
    
    # Add 'population' column
    data['population'] = population_name
    
    # Filter data based on threshold
    filtered_data = data[data['combined_p_values'] > threshold]
    
    # Initialize a list to store gene regions for the current population
    population_gene_regions = []
    
    # Iterate over rows and calculate gene regions
    for index, row in filtered_data.iterrows():
        genomic_position = row['pos']
        window_size = 1000  # Set window size !!!!!!! 500 bp on each SNP side
        x_min, x_max = get_window_range(genomic_position, window_size)
        
        # Call gi.get_genes_region() for the current region
        gene_region = gi.get_genes_region('chrX', x_min, x_max, assembly='hg38')
        
        # Extract gene symbol, start position, and end position from the result
        if gene_region:  # Check if gene_region is not empty
            gene_symbol = gene_region[0][0]
            start_position = gene_region[0][1]
            end_position = gene_region[0][2]
            
            # Append the extracted information to the list of gene regions for the current population
            population_gene_regions.append((gene_symbol, start_position, end_position))
            
# Add gene symbol to set of unique genes
            unique_genes.add(gene_symbol)
    
    # Store the list of gene regions for the current population in the dictionary
    gene_regions[population_name] = population_gene_regions

# Print gene regions for all populations
for population, regions in gene_regions.items():
    print(f"Gene regions for {population}:")
    for region in regions:
        print(region)
    print()


print()
print()

# Set to store gene symbols for which information has already been retrieved
retrieved_symbols = set()

# Run gi.gene_info() for each unique gene symbol
for gene_symbol in unique_genes:
    # Check if gene information has already been retrieved for this symbol
    if gene_symbol not in retrieved_symbols:
        gene_info = gi.gene_info(gene_symbol)
        if gene_info is not None:
            print(f"Gene info for {gene_symbol}: {gene_info}")

Gene regions for combined_CDX:

Gene regions for combined_CHS:
('CLCN4', 10156974, 10237660)

Gene regions for combined_CHB:
('PASD1', 151563674, 151676739)

Gene regions for combined_KHV:
('PASD1', 151563674, 151676739)

Gene regions for combined_JPT:

Symbol: PASD1 (protein-coding) Aliases: CT63, CT64, OXTES1
PAS domain containing repressor 1
Summary: This gene encodes a protein that is thought to function as a transcription factor. The protein is a cancer-associated antigen that can stimulate autologous T-cell responses, and it is therefore considered to be a potential immunotherapeutic target for the treatment of various hematopoietic malignancies, including diffuse large B-cell lymphoma. [provided by RefSeq, May 2010].
Genomic position: X:151563675-151676739 (hg38), X:150732094-150845211 (hg19)
Gene card

Symbol: CLCN4 (protein-coding) Aliases: CLC4, ClC-4, ClC-4A, MRX15, MRX49, MRXSRC
chloride voltage-gated channel 4
Summary: The CLCN family of voltage-dependent chloride channel genes comprises nine members (CLCN1-7, Ka and Kb) which demonstrate quite diverse functional characteristics while sharing significant sequence homology. Chloride channel 4 has an evolutionary conserved CpG island and is conserved in both mouse and hamster. This gene is mapped in close proximity to APXL (Apical protein Xenopus laevis-like) and OA1 (Ocular albinism type I), which are both located on the human X chromosome at band p22.3. The physiological role of chloride channel 4 remains unknown but may contribute to the pathogenesis of neuronal disorders. Alternate splicing results in two transcript variants that encode different proteins. [provided by RefSeq, Mar 2012].
Genomic position: X:10156975-10237660 (hg38), X:10125024-10205700 (hg19)
Gene card