Tools to index genomic files

BED Indexer

class mg_process_files.tool.bed_indexer.bedIndexerTool(configuration=None)[source]

Tool for running indexers over a BED file for use in the RESTful API

bed2bigbed(**kwargs)[source]

BED to BigBed converter

This uses the bedToBigBed program binary provided at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ to perform the conversion from BED to BigBed.

Parameters:
  • file_sorted_bed (str) – Location of the sorted BED file
  • file_chrom (str) – Location of the chrom.size file
  • file_bb (str) – Location of the bigBed file
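Under the hood this amounts to a single call to the UCSC binary. A minimal sketch of the invocation, assuming the binary is on the PATH (the helper name is hypothetical, not part of the tool's API):

```python
import subprocess

def bed_to_bigbed_cmd(file_sorted_bed, file_chrom, file_bb):
    # Standard UCSC invocation: bedToBigBed in.bed chrom.sizes out.bb
    # (hypothetical helper, not part of the tool's API)
    return ["bedToBigBed", file_sorted_bed, file_chrom, file_bb]

cmd = bed_to_bigbed_cmd("in.sorted.bed", "chrom.size", "out.bb")
# subprocess.call(cmd) would perform the conversion
```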

Example

if not self.bed2bigbed(bed_file, chrom_file, bb_file):
    output_metadata.set_exception(
        Exception(
            "bed2bigbed: Could not process files {}, {}.".format(*input_files)))
bed2hdf5(**kwargs)[source]

BED to HDF5 converter

Loads the BED file into the HDF5 index file that gets used by the REST API to determine if there are files that have data in a given region. Overlapping regions are condensed into a single feature block rather than maintaining all of the detail of the original bed file.

Parameters:
  • file_id (str) – The file_id as stored by the DM-API so that it can be used for file retrieval later
  • assembly (str) – Assembly of the genome that is getting indexed so that the chromosomes match
  • feature_length (int) – Defines the level of resolution that the features should be recorded at. The two options are 1 and 1000: 1 records features at every single base, whereas 1000 groups features into 1000bp chunks. The single base pair option should only be used when features are less than 10bp.
  • file_sorted_bed (str) – Location of the sorted BED file
  • file_hdf5 (str) – Location of the HDF5 index file
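The effect of feature_length can be sketched as a simple binning step; the helper below is illustrative only, not part of bed2hdf5:

```python
def feature_blocks(start, end, feature_length=1000):
    # Indices of the fixed-size chunks a feature overlaps
    # (illustrative helper, not part of the tool's API)
    return list(range(start // feature_length, (end - 1) // feature_length + 1))

# A feature spanning bases 500-3000 touches chunks 0, 1 and 2 at 1000bp
# resolution, but 2500 separate positions at single-base resolution
blocks = feature_blocks(500, 3000, 1000)
```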

Example

if not self.bed2hdf5(file_id, assembly, bed_file, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "bed2hdf5: Could not process files {}, {}.".format(*input_files)))
bed_feature_length(file_bed)[source]

BED Feature Length

Function to calculate the average length of a feature in a BED file.

Parameters:file_bed (str) – Location of the BED file
Returns:average_feature_length – The average length of the features in a BED file.
Return type:int
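Example

The calculation amounts to averaging end - start over the records; a self-contained sketch (the real method reads from file_bed rather than a list of lines):

```python
def average_feature_length(bed_lines):
    # Mean of (end - start) over BED records; illustrative stand-in
    # for bed_feature_length, which reads from a file on disk
    lengths = [
        int(end) - int(start)
        for chrom, start, end in (line.split("\t")[:3] for line in bed_lines)
    ]
    return int(sum(lengths) / len(lengths))

avg = average_feature_length(["chr1\t0\t100", "chr1\t200\t500"])  # 200
```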
run(input_files, input_metadata, output_files)[source]

Function to run the BED file sorter and indexer so that the files can be searched as part of the REST API

Parameters:
  • input_files (list) –
    bed_file : str
    Location of the sorted bed file
    chrom_size : str
    Location of chrom.size file
    hdf5_file : str
    Location of the HDF5 index file
  • metadata (list) –
    file_id : str
    file_id used to identify the original bed file
    assembly : str
    Genome assembly accession
Returns:

bed_file : str

Location of the sorted bed file

bb_file : str

Location of the BigBed file

hdf5_file : str

Location of the HDF5 index file

Return type:

list

Example

import tool

# Bed Indexer
b = tool.bedIndexerTool(self.configuration)
bi, bm = b.run(
    [bed_file_id, chrom_file_id, hdf5_file_id], [], {'assembly' : assembly}
)

WIG Indexer

class mg_process_files.tool.wig_indexer.wigIndexerTool(configuration=None)[source]

Tool for running indexers over a WIG file for use in the RESTful API

run(input_files, input_metadata, output_files)[source]

Function to run the WIG file sorter and indexer so that the files can be searched as part of the REST API

Parameters:
  • input_files (dict) –
    wig_file : str
    Location of the wig file
    chrom_size : str
    Location of chrom.size file
    hdf5_file : str
    Location of the HDF5 index file
  • meta_data (dict) –
Returns:

bw_file : str

Location of the BigWig file

hdf5_file : str

Location of the HDF5 index file

Return type:

list
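Example

A usage sketch analogous to the BED indexer example above; the argument layout is assumed from the parameter list, not confirmed by the source:

```python
import tool

# WIG Indexer
w = tool.wigIndexerTool(self.configuration)
wi, wm = w.run(
    [wig_file_id, chrom_file_id, hdf5_file_id], [], {'assembly' : assembly}
)
```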

wig2bigwig(**kwargs)[source]

WIG to BigWig converter

This uses the wigToBigWig program binary provided at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ to perform the conversion from WIG to BigWig.

Parameters:
  • file_wig (str) – Location of the wig file
  • file_chrom (str) – Location of the chrom.size file
  • file_bw (str) – Location of the bigWig file

Example

if not self.wig2bigwig(wig_file, chrom_file, bw_file):
    output_metadata.set_exception(
        Exception(
            "wig2bigWig: Could not process files {}, {}.".format(*input_files)))
wig2hdf5(**kwargs)[source]

WIG to HDF5 converter

Loads the WIG file into the HDF5 index file that gets used by the REST API to determine if there are files that have data in a given region. Overlapping regions are condensed into a single feature block rather than maintaining all of the detail of the original WIG file.

Parameters:
  • file_id (str) – The file_id as stored by the DMP so that it can be used for file retrieval later
  • assembly (str) – Assembly of the genome that is getting indexed so that the chromosomes match
  • file_wig (str) – Location of the wig file
  • file_hdf5 (str) – Location of the HDF5 index file

Example

if not self.wig2hdf5(file_id, assembly, wig_file, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "wig2hdf5: Could not process files {}, {}.".format(*input_files)))

GFF3 Indexer

class mg_process_files.tool.gff3_indexer.gff3IndexerTool(configuration=None)[source]

Tool for running indexers over a GFF3 file for use in the RESTful API

gff32hdf5(**kwargs)[source]

GFF3 to HDF5 converter

Loads the GFF3 file into the HDF5 index file that gets used by the REST API to determine if there are files that have data in a given region. Overlapping regions are condensed into a single feature block rather than maintaining all of the detail of the original bed file.

Parameters:
  • file_id (str) – The file_id as stored by the DM-API so that it can be used for file retrieval later
  • assembly (str) – Assembly of the genome that is getting indexed so that the chromosomes match
  • file_sorted_gff3 (str) – Location of the sorted GFF3 file
  • file_hdf5 (str) – Location of the HDF5 index file

Example

if not self.gff32hdf5(file_id, assembly, gff3_file, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "gff32hdf5: Could not process files {}, {}.".format(*input_files)))
gff32tabix(**kwargs)[source]

GFF3 to Tabix

Compresses the sorted GFF3 file and then uses Tabix to generate an index of the GFF3 file.

Parameters:
  • file_sorted_gff3 (str) – Location of a sorted GFF3 file
  • file_sorted_gz_gff3 (str) – Location of the bgzip compressed GFF3 file
  • file_gff3_tbi (str) – Location of the Tabix index file
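The two steps map onto the standard htslib command lines; a sketch that builds the commands (the helper name is hypothetical, and bgzip writes to stdout, so its output would be redirected to file_sorted_gz_gff3):

```python
def gff3_tabix_cmds(file_sorted_gff3, file_sorted_gz_gff3):
    # Standard htslib invocations (illustrative helper)
    compress = ["bgzip", "-c", file_sorted_gff3]          # stdout -> .gz file
    index = ["tabix", "-p", "gff", file_sorted_gz_gff3]   # writes .gz.tbi
    return compress, index

compress, index = gff3_tabix_cmds("sorted.gff3", "sorted.gff3.gz")
```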

Example

if not self.gff32tabix(file_sorted_gff3, gz_file, tbi_file):
    output_metadata.set_exception(
        Exception(
            "gff32tabix: Could not process files {}, {}.".format(*input_files)))
run(input_files, input_metadata, output_files)[source]

Function to run the GFF3 file sorter and indexer so that the files can be searched as part of the REST API

Parameters:
  • input_files (list) –
    gff3_file : str
Location of the GFF3 file
    hdf5_file : str
    Location of the HDF5 index file
  • meta_data (list) –
    file_id : str
file_id used to identify the original GFF3 file
    assembly : str
    Genome assembly accession
Returns:

gz_file : str

Location of the sorted gzipped GFF3 file

tbi_file : str

Location of the Tabix index file

hdf5_file : str

Location of the HDF5 index file

Return type:

list
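Example

A usage sketch analogous to the BED indexer example above; the argument layout is assumed from the parameter list, not confirmed by the source:

```python
import tool

# GFF3 Indexer
g = tool.gff3IndexerTool(self.configuration)
gi, gm = g.run(
    [gff3_file_id, hdf5_file_id], [], {'assembly' : assembly}
)
```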

3D JSON Indexer

class mg_process_files.tool.json_3d_indexer.json3dIndexerTool(configuration=None)[source]

Tool for running indexers over 3D JSON files for use in the RESTful API

json2hdf5(**kwargs)[source]

Genome Model Indexing

Load the JSON files generated by TADbit into a specified HDF5 file. The file includes the x, y and z coordinates of all the models for each region along with the matching stats, clusters, TADs and adjacency values used during the modelling.

Parameters:
  • json_files (list) – Locations of all the JSON 3D model files generated by TADbit for a given dataset
  • file_hdf5 (str) – Location of the HDF5 index file for this dataset.
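The JSON layout below is a simplified, hypothetical stand-in for the TADbit output (the real files also carry the stats, clusters, TADs and adjacency values); it shows the kind of per-model coordinate data the indexer pulls out:

```python
import json

# Hypothetical, cut-down model record; not the real TADbit schema
model_json = json.dumps({
    "models": [{"ref": 1,
                "x": [0.0, 1.5],
                "y": [0.2, 1.1],
                "z": [0.0, 0.9]}]
})

# One (x, y, z) coordinate set per model, as written into the HDF5 file
coords = [(m["x"], m["y"], m["z"]) for m in json.loads(model_json)["models"]]
```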

Example

if not self.json2hdf5(json_files, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "json2hdf5: Could not process files {}, {}.".format(*input_files)))
run(input_files, input_metadata, output_files)[source]

Function to index models of the genome structure generated by TADbit on a per-dataset basis so that they can be easily distributed as part of the RESTful API.

Parameters:
  • input_files (list) –
    gz_file : str
    Location of the archived JSON model files
    hdf5_file : str
    Location of the HDF5 index file
  • meta_data (list) –
    file_id : str
file_id used to identify the original model files
    assembly : str
    Genome assembly accession
Returns:

hdf5_file : str

Location of the HDF5 index file

Return type:

list

Example

import tool

# 3D JSON Indexer
j3d = tool.json3dIndexerTool(self.configuration)
j3di = j3d.run([gz_file, hdf5_file_id], [], {})
unzipJSON(file_targz)[source]

Unzips the zipped folder containing all the models for regions of the genome based on the information within the adjacency matrices generated by TADbit.

Parameters:archive_location (str) – Location of archived JSON files
Returns:json_file_locations – List of the locations of the files within an extracted archive
Return type:list

Example

gz_file = '/home/<user>/test.tar.gz'
json_files = unzipJSON(gz_file)