Tools to index genomic files

BED Indexer

class mg_process_files.tool.bed_indexer.bedIndexerTool(configuration=None)[source]

Tool for running indexers over a BED file for use in the RESTful API

bed2bigbed(**kwargs)[source]

BED to BigBed converter

This uses the bedToBigBed program binary provided at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ to perform the conversion from BED to BigBed.

Parameters:
  • file_sorted_bed (str) – Location of the sorted BED file
  • file_chrom (str) – Location of the chrom.size file
  • file_bb (str) – Location of the bigBed file
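Under the hood this amounts to a single call to the UCSC binary. A minimal sketch of the invocation, assuming the binary is on the PATH (the helper name is hypothetical, not part of the tool's API):

```python
import subprocess

def bed_to_bigbed_cmd(file_sorted_bed, file_chrom, file_bb):
    # Standard UCSC invocation: bedToBigBed in.bed chrom.sizes out.bb
    # (hypothetical helper, not part of the tool's API)
    return ["bedToBigBed", file_sorted_bed, file_chrom, file_bb]

cmd = bed_to_bigbed_cmd("in.sorted.bed", "chrom.size", "out.bb")
# subprocess.call(cmd) would perform the conversion
```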

Example

if not self.bed2bigbed(bed_file, chrom_file, bb_file):
    output_metadata.set_exception(
        Exception(
            "bed2bigbed: Could not process files {}, {}.".format(*input_files)))
bed2hdf5(**kwargs)[source]

BED to HDF5 converter

Loads the BED file into the HDF5 index file that gets used by the REST API to determine if there are files that have data in a given region. Overlapping regions are condensed into a single feature block rather than maintaining all of the detail of the original bed file.

Parameters:
  • file_id (str) – The file_id as stored by the DM-API so that it can be used for file retrieval later
  • assembly (str) – Assembly of the genome that is getting indexed so that the chromosomes match
  • feature_length (int) – Defines the level of resolution that the features should be recorded at. The two options are 1 and 1000: 1 records features at every single base, whereas 1000 groups features into 1000bp chunks. The single base pair option should only be used when features are less than 10bp.
  • file_sorted_bed (str) – Location of the sorted BED file
  • file_hdf5 (str) – Location of the HDF5 index file
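The effect of feature_length can be sketched as a simple binning step; the helper below is illustrative only, not part of bed2hdf5:

```python
def feature_blocks(start, end, feature_length=1000):
    # Indices of the fixed-size chunks a feature overlaps
    # (illustrative helper, not part of the tool's API)
    return list(range(start // feature_length, (end - 1) // feature_length + 1))

# A feature spanning bases 500-3000 touches chunks 0, 1 and 2 at 1000bp
# resolution, but 2500 separate positions at single-base resolution
blocks = feature_blocks(500, 3000, 1000)
```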

Example

if not self.bed2hdf5(file_id, assembly, bed_file, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "bed2hdf5: Could not process files {}, {}.".format(*input_files)))
bed_feature_length(file_bed)[source]

BED Feature Length

Function to calculate the average length of a feature in a BED file.

Parameters:file_bed (str) – Location of the BED file
Returns:average_feature_length – The average length of the features in a BED file.
Return type:int
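Example

The calculation amounts to averaging end - start over the records; a self-contained sketch (the real method reads from file_bed rather than a list of lines):

```python
def average_feature_length(bed_lines):
    # Mean of (end - start) over BED records; illustrative stand-in
    # for bed_feature_length, which reads from a file on disk
    lengths = [
        int(end) - int(start)
        for chrom, start, end in (line.split("\t")[:3] for line in bed_lines)
    ]
    return int(sum(lengths) / len(lengths))

avg = average_feature_length(["chr1\t0\t100", "chr1\t200\t500"])  # 200
```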
run(input_files, input_metadata, output_files)[source]

Function to run the BED file sorter and indexer so that the files can be searched as part of the REST API

Parameters:
  • input_files (list) –
    bed_file : str
    Location of the sorted bed file
    chrom_size : str
    Location of chrom.size file
    hdf5_file : str
    Location of the HDF5 index file
  • metadata (list) –
    file_id : str
    file_id used to identify the original bed file
    assembly : str
    Genome assembly accession
Returns:

bed_file : str

Location of the sorted bed file

bb_file : str

Location of the BigBed file

hdf5_file : str

Location of the HDF5 index file

Return type:

list

Example

import tool

# Bed Indexer
b = tool.bedIndexerTool(self.configuration)
bi, bm = b.run(
    [bed_file_id, chrom_file_id, hdf5_file_id], [], {'assembly' : assembly}
)

WIG Indexer

class mg_process_files.tool.wig_indexer.wigIndexerTool(configuration=None)[source]

Tool for running indexers over a WIG file for use in the RESTful API

run(input_files, input_metadata, output_files)[source]

Function to run the WIG file sorter and indexer so that the files can be searched as part of the REST API

Parameters:
  • input_files (dict) –
    wig_file : str
    Location of the wig file
    chrom_size : str
    Location of chrom.size file
    hdf5_file : str
    Location of the HDF5 index file
  • meta_data (dict) –
Returns:

bw_file : str

Location of the BigWig file

hdf5_file : str

Location of the HDF5 index file

Return type:

list
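Example

A usage sketch analogous to the BED indexer example above; the argument layout is assumed from the parameter list, not confirmed by the source:

```python
import tool

# WIG Indexer
w = tool.wigIndexerTool(self.configuration)
wi, wm = w.run(
    [wig_file_id, chrom_file_id, hdf5_file_id], [], {'assembly' : assembly}
)
```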

wig2bigwig(**kwargs)[source]

WIG to BigWig converter

This uses the wigToBigWig program binary provided at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ to perform the conversion from WIG to BigWig.

Parameters:
  • file_wig (str) – Location of the wig file
  • file_chrom (str) – Location of the chrom.size file
  • file_bw (str) – Location of the bigWig file

Example

if not self.wig2bigwig(wig_file, chrom_file, bw_file):
    output_metadata.set_exception(
        Exception(
            "wig2bigWig: Could not process files {}, {}.".format(*input_files)))
wig2hdf5(**kwargs)[source]

WIG to HDF5 converter

Loads the WIG file into the HDF5 index file that gets used by the REST API to determine if there are files that have data in a given region. Overlapping regions are condensed into a single feature block rather than maintaining all of the detail of the original WIG file.

Parameters:
  • file_id (str) – The file_id as stored by the DMP so that it can be used for file retrieval later
  • assembly (str) – Assembly of the genome that is getting indexed so that the chromosomes match
  • file_wig (str) – Location of the wig file
  • file_hdf5 (str) – Location of the HDF5 index file

Example

if not self.wig2hdf5(file_id, assembly, wig_file, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "wig2hdf5: Could not process files {}, {}.".format(*input_files)))

GFF3 Indexer

class mg_process_files.tool.gff3_indexer.gff3IndexerTool(configuration=None)[source]

Tool for running indexers over a GFF3 file for use in the RESTful API

gff32hdf5(**kwargs)[source]

GFF3 to HDF5 converter

Loads the GFF3 file into the HDF5 index file that gets used by the REST API to determine if there are files that have data in a given region. Overlapping regions are condensed into a single feature block rather than maintaining all of the detail of the original bed file.

Parameters:
  • file_id (str) – The file_id as stored by the DM-API so that it can be used for file retrieval later
  • assembly (str) – Assembly of the genome that is getting indexed so that the chromosomes match
  • file_sorted_gff3 (str) – Location of the sorted GFF3 file
  • file_hdf5 (str) – Location of the HDF5 index file

Example

if not self.gff32hdf5(file_id, assembly, gff3_file, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "gff32hdf5: Could not process files {}, {}.".format(*input_files)))
gff32tabix(**kwargs)[source]

GFF3 to Tabix

Compresses the sorted GFF3 file and then uses Tabix to generate an index of the GFF3 file.

Parameters:
  • file_sorted_gff3 (str) – Location of a sorted GFF3 file
  • file_sorted_gz_gff3 (str) – Location of the bgzip compressed GFF3 file
  • file_gff3_tbi (str) – Location of the Tabix index file
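The two steps map onto the standard htslib command lines; a sketch that builds the commands (the helper name is hypothetical, and bgzip writes to stdout, so its output would be redirected to file_sorted_gz_gff3):

```python
def gff3_tabix_cmds(file_sorted_gff3, file_sorted_gz_gff3):
    # Standard htslib invocations (illustrative helper)
    compress = ["bgzip", "-c", file_sorted_gff3]          # stdout -> .gz file
    index = ["tabix", "-p", "gff", file_sorted_gz_gff3]   # writes .gz.tbi
    return compress, index

compress, index = gff3_tabix_cmds("sorted.gff3", "sorted.gff3.gz")
```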

Example

if not self.gff32tabix(file_sorted_gff3, gz_file, tbi_file):
    output_metadata.set_exception(
        Exception(
            "gff32tabix: Could not process files {}, {}.".format(*input_files)))
run(input_files, input_metadata, output_files)[source]

Function to run the GFF3 file sorter and indexer so that the files can be searched as part of the REST API

Parameters:
  • input_files (list) –
    gff3_file : str
Location of the GFF3 file
    hdf5_file : str
    Location of the HDF5 index file
  • meta_data (list) –
    file_id : str
file_id used to identify the original GFF3 file
    assembly : str
    Genome assembly accession
Returns:

gz_file : str

Location of the sorted gzipped GFF3 file

tbi_file : str

Location of the Tabix index file

hdf5_file : str

Location of the HDF5 index file

Return type:

list
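Example

A usage sketch analogous to the BED indexer example above; the argument layout is assumed from the parameter list, not confirmed by the source:

```python
import tool

# GFF3 Indexer
g = tool.gff3IndexerTool(self.configuration)
gi, gm = g.run(
    [gff3_file_id, hdf5_file_id], [], {'assembly' : assembly}
)
```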

3D JSON Indexer

class mg_process_files.tool.json_3d_indexer.json3dIndexerTool(configuration=None)[source]

Tool for running indexers over 3D JSON files for use in the RESTful API

json2hdf5(**kwargs)[source]

Genome Model Indexing

Load the JSON files generated by TADbit into a specified HDF5 file. The file includes the x, y and z coordinates of all the models for each region along with the matching stats, clusters, TADs and adjacency values used during the modelling.

Parameters:
  • json_files (list) – Locations of all the JSON 3D model files generated by TADbit for a given dataset
  • file_hdf5 (str) – Location of the HDF5 index file for this dataset.
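The JSON layout below is a simplified, hypothetical stand-in for the TADbit output (the real files also carry the stats, clusters, TADs and adjacency values); it shows the kind of per-model coordinate data the indexer pulls out:

```python
import json

# Hypothetical, cut-down model record; not the real TADbit schema
model_json = json.dumps({
    "models": [{"ref": 1,
                "x": [0.0, 1.5],
                "y": [0.2, 1.1],
                "z": [0.0, 0.9]}]
})

# One (x, y, z) coordinate set per model, as written into the HDF5 file
coords = [(m["x"], m["y"], m["z"]) for m in json.loads(model_json)["models"]]
```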

Example

if not self.json2hdf5(json_files, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "json2hdf5: Could not process files {}, {}.".format(*input_files)))
run(input_files, input_metadata, output_files)[source]

Function to index models of the genome structure generated by TADbit on a per-dataset basis so that they can be easily distributed as part of the RESTful API.

Parameters:
  • input_files (list) –
    gz_file : str
    Location of the archived JSON model files
    hdf5_file : str
    Location of the HDF5 index file
  • meta_data (list) –
    file_id : str
file_id used to identify the original model files
    assembly : str
    Genome assembly accession
Returns:

hdf5_file : str

Location of the HDF5 index file

Return type:

list

Example

import tool

# 3D JSON Indexer
j3d = tool.json3dIndexerTool(self.configuration)
j3di = j3d.run([gz_file, hdf5_file_id], [], {})
unzipJSON(file_targz)[source]

Unzips the zipped folder containing all the models for regions of the genome based on the information within the adjacency matrices generated by TADbit.

Parameters:archive_location (str) – Location of archived JSON files
Returns:json_file_locations – List of the locations of the files within an extracted archive
Return type:list

Example

gz_file = '/home/<user>/test.tar.gz'
json_files = unzipJSON(gz_file)