Tools to index genomic files¶
BED Indexer¶
class mg_process_files.tool.bed_indexer.bedIndexerTool(configuration=None)[source]¶
Tool for running indexers over a BED file for use in the RESTful API
bed2bigbed(**kwargs)[source]¶
BED to BigBed converter

This uses the bedToBigBed program binary provided at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ to perform the conversion from BED to BigBed.
Parameters: - file_sorted_bed (str) – Location of the sorted BED file
- file_chrom (str) – Location of the chrom.size file
- file_bb (str) – Location of the bigBed file
Example
if not self.bed2bigbed(bed_file, chrom_file, bb_file):
    output_metadata.set_exception(
        Exception(
            "bed2bigbed: Could not process files {}, {}.".format(*input_files)))
bed2hdf5(**kwargs)[source]¶
BED to HDF5 converter
Loads the BED file into the HDF5 index file that gets used by the REST API to determine if there are files that have data in a given region. Overlapping regions are condensed into a single feature block rather than maintaining all of the detail of the original bed file.
Parameters: - file_id (str) – The file_id as stored by the DM-API so that it can be used for file retrieval later
- assembly (str) – Assembly of the genome that is getting indexed so that the chromosomes match
- feature_length (int) – Defines the resolution at which features are recorded. The two options are 1 and 1000: 1 records features at every single base, whereas 1000 groups features into 1000 bp chunks. The single-base option should really only be used when features are shorter than 10 bp.
- file_sorted_bed (str) – Location of the sorted BED file
- file_hdf5 (str) – Location of the HDF5 index file
Example
if not self.bed2hdf5(file_id, assembly, bed_file, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "bed2hdf5: Could not process files {}, {}.".format(*input_files)))
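As a sketch of the condensation described above (not the tool's actual implementation), overlapping BED features can be collapsed into fixed-size occupied chunks; the function name and interval layout here are assumptions for illustration:

```python
def condense_features(features, chrom_length, feature_length=1000):
    """Mark which fixed-size chunks of a chromosome contain any feature.

    features: iterable of (start, end) BED intervals (0-based, half-open).
    Returns one boolean per chunk: True if any feature overlaps it, so
    overlapping features collapse into the same occupied chunks.
    """
    n_chunks = (chrom_length + feature_length - 1) // feature_length
    occupied = [False] * n_chunks
    for start, end in features:
        first_chunk = start // feature_length
        last_chunk = (end - 1) // feature_length
        for i in range(first_chunk, min(last_chunk, n_chunks - 1) + 1):
            occupied[i] = True
    return occupied

# Two overlapping features mark the same two chunks of a 3000 bp chromosome
print(condense_features([(100, 1500), (1200, 1800)], 3000))  # [True, True, False]
```

This is why the 1000 bp setting loses per-base detail: the index only records whether a chunk holds data, not where within it.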
bed_feature_length(file_bed)[source]¶
BED Feature Length

Function to calculate the average length of a feature in a BED file.
Parameters: file_bed (str) – Location of the BED file
Returns: average_feature_length – The average length of the features in a BED file
Return type: int
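A minimal sketch of that calculation, assuming plain tab-separated BED records (the helper name mirrors the method but is otherwise hypothetical):

```python
def average_feature_length(file_bed):
    """Mean length (end - start) of the records in a BED file, as an int."""
    total, count = 0, 0
    with open(file_bed) as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue  # skip blank lines and header/track lines
            fields = line.split("\t")
            total += int(fields[2]) - int(fields[1])
            count += 1
    return total // count if count else 0
```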
run(input_files, input_metadata, output_files)[source]¶
Function to run the BED file sorter and indexer so that the files can get searched as part of the REST API
Parameters: - input_files (list) –
- bed_file : str
- Location of the sorted bed file
- chrom_size : str
- Location of chrom.size file
- hdf5_file : str
- Location of the HDF5 index file
- metadata (list) –
- file_id : str
- file_id used to identify the original bed file
- assembly : str
- Genome assembly accession
Returns: - bed_file : str
Location of the sorted bed file
- bb_file : str
Location of the BigBed file
- hdf5_file : str
Location of the HDF5 index file
Return type: list
Example
import tool

# BED Indexer
b = tool.bedIndexerTool(self.configuration)
bi, bm = b.run(
    [bed_file_id, chrom_file_id, hdf5_file_id],
    [],
    {'assembly' : assembly}
)
WIG Indexer¶
class mg_process_files.tool.wig_indexer.wigIndexerTool(configuration=None)[source]¶
Tool for running indexers over a WIG file for use in the RESTful API
run(input_files, input_metadata, output_files)[source]¶
Function to run the WIG file sorter and indexer so that the files can get searched as part of the REST API
Parameters: - input_files (dict) –
- wig_file : str
- Location of the wig file
- chrom_size : str
- Location of chrom.size file
- hdf5_file : str
- Location of the HDF5 index file
- meta_data (dict) –
Returns: - bw_file : str
Location of the BigWig file
- hdf5_file : str
Location of the HDF5 index file
Return type: list
wig2bigwig(**kwargs)[source]¶
WIG to BigWig converter

This uses the wigToBigWig program binary provided at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ to perform the conversion from WIG to BigWig.
Parameters: - file_wig (str) – Location of the wig file
- file_chrom (str) – Location of the chrom.size file
- file_bw (str) – Location of the bigWig file
Example
if not self.wig2bigwig(wig_file, chrom_file, bw_file):
    output_metadata.set_exception(
        Exception(
            "wig2bigWig: Could not process files {}, {}.".format(*input_files)))
wig2hdf5(**kwargs)[source]¶
WIG to HDF5 converter
Loads the WIG file into the HDF5 index file that gets used by the REST API to determine if there are files that have data in a given region. Overlapping regions are condensed into a single feature block rather than maintaining all of the detail of the original WIG file.
Parameters: - file_id (str) – The file_id as stored by the DMP so that it can be used for file retrieval later
- assembly (str) – Assembly of the genome that is getting indexed so that the chromosomes match
- file_wig (str) – Location of the wig file
- file_hdf5 (str) – Location of the HDF5 index file
Example
if not self.wig2hdf5(file_id, assembly, wig_file, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "wig2hdf5: Could not process files {}, {}.".format(*input_files)))
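The condensation step for WIG data can be sketched as merging consecutive non-zero positions into signal blocks; this is a hypothetical illustration for fixedStep tracks only, not the tool's actual parser:

```python
def wig_to_blocks(lines):
    """Collapse a fixedStep WIG track into (start, end) signal blocks.

    Consecutive positions with a non-zero value merge into one block;
    positions are 1-based as in the WIG format, blocks are inclusive.
    """
    blocks = []
    pos = step = None
    for line in lines:
        line = line.strip()
        if line.startswith("fixedStep"):
            params = dict(field.split("=") for field in line.split()[1:])
            pos = int(params["start"])
            step = int(params.get("step", 1))
            continue
        if not line or pos is None:
            continue
        if float(line) != 0:
            if blocks and pos == blocks[-1][1] + step:
                blocks[-1] = (blocks[-1][0], pos)  # extend the current block
            else:
                blocks.append((pos, pos))          # start a new block
        pos += step
    return blocks
```

Note that a real indexer would also track the chromosome named in each `fixedStep` header; this sketch merges positions only.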
GFF3 Indexer¶
class mg_process_files.tool.gff3_indexer.gff3IndexerTool(configuration=None)[source]¶
Tool for running indexers over a GFF3 file for use in the RESTful API
gff32hdf5(**kwargs)[source]¶
GFF3 to HDF5 converter

Loads the GFF3 file into the HDF5 index file that gets used by the REST API to determine if there are files that have data in a given region. Overlapping regions are condensed into a single feature block rather than maintaining all of the detail of the original GFF3 file.
Parameters: - file_id (str) – The file_id as stored by the DM-API so that it can be used for file retrieval later
- assembly (str) – Assembly of the genome that is getting indexed so that the chromosomes match
- file_sorted_gff3 (str) – Location of the sorted GFF3 file
- file_hdf5 (str) – Location of the HDF5 index file
Example
if not self.gff32hdf5(file_id, assembly, gff3_file, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "gff32hdf5: Could not process files {}, {}.".format(*input_files)))
gff32tabix(**kwargs)[source]¶
GFF3 to Tabix

Compresses the sorted GFF3 file and then uses Tabix to generate an index of the GFF3 file.
Parameters: - file_sorted_gff3 (str) – Location of a sorted GFF3 file
- file_sorted_gz_gff3 (str) – Location of the bgzip compressed GFF3 file
- file_gff3_tbi (str) – Location of the Tabix index file
Example
if not self.gff32tabix(file_sorted_gff3, gz_file, tbi_file):
    output_metadata.set_exception(
        Exception(
            "gff32tabix: Could not process files {}, {}.".format(*input_files)))
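Tabix requires its input to be coordinate-sorted before bgzip compression; the expected ordering can be sketched as follows (a hypothetical helper — the method above already receives a sorted file):

```python
def sort_gff3_lines(lines):
    """Order GFF3 lines as tabix expects: headers first, then records
    grouped by seqid (column 1) with ascending start (column 4)."""
    headers = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line.strip() and not line.startswith("#")]
    records.sort(key=lambda rec: (rec.split("\t")[0], int(rec.split("\t")[3])))
    return headers + records
```

Tabix also requires bgzip (block gzip) rather than plain gzip compression, so the compressed file can be seeked within.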
run(input_files, input_metadata, output_files)[source]¶
Function to run the GFF3 file sorter and indexer so that the files can get searched as part of the REST API
Parameters: - input_files (list) –
- gff3_file : str
- Location of the GFF3 file
- hdf5_file : str
- Location of the HDF5 index file
- meta_data (list) –
- file_id : str
- file_id used to identify the original GFF3 file
- assembly : str
- Genome assembly accession
Returns: - gz_file : str
Location of the sorted gzipped GFF3 file
- tbi_file : str
Location of the Tabix index file
- hdf5_file : str
Location of the HDF5 index file
Return type: list
3D JSON Indexer¶
class mg_process_files.tool.json_3d_indexer.json3dIndexerTool(configuration=None)[source]¶
Tool for running indexers over 3D JSON files for use in the RESTful API
json2hdf5(**kwargs)[source]¶
Genome Model Indexing
Load the JSON files generated by TADbit into a specified HDF5 file. The file includes the x, y and z coordinates of all the models for each region along with the matching stats, clusters, TADs and adjacency values used during the modelling.
Parameters: - json_files (list) – Locations of all the JSON 3D model files generated by TADbit for a given dataset
- file_hdf5 (str) – Location of the HDF5 index file for this dataset.
Example
if not self.json2hdf5(json_files, hdf5_file):
    output_metadata.set_exception(
        Exception(
            "json2hdf5: Could not process files {}, {}.".format(*input_files)))
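The kind of restructuring json2hdf5 performs can be sketched without h5py: each model's x, y and z lists become rows of a coordinate table keyed by model reference. The field names here are assumptions for illustration, not TADbit's actual JSON schema:

```python
def models_to_arrays(models_json):
    """Flatten a list of 3D model dicts into per-model coordinate rows.

    Returns (refs, coords): refs[i] labels the model whose particle
    coordinates are the (x, y, z) tuples in coords[i] -- the row layout
    one might then write out as HDF5 datasets.
    """
    refs, coords = [], []
    for model in models_json:
        refs.append(model["ref"])
        coords.append(list(zip(model["x"], model["y"], model["z"])))
    return refs, coords
```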
run(input_files, input_metadata, output_files)[source]¶
Function to index models of the genome structure generated by TADbit on a per-dataset basis so that they can be easily distributed as part of the RESTful API.
Parameters: - input_files (list) –
- gz_file : str
- Location of the archived JSON model files
- hdf5_file : str
- Location of the HDF5 index file
- meta_data (list) –
- file_id : str
- file_id used to identify the original model files
- assembly : str
- Genome assembly accession
Returns: - hdf5_file : str
Location of the HDF5 index file
Return type: list
Example
import tool

# 3D JSON Indexer
j3d = tool.json3dIndexerTool(self.configuration)
j3di = j3d.run((gz_file, hdf5_file_id), (), ())
unzipJSON(file_targz)[source]¶
Unzips the zipped folder containing all the models for regions of the genome based on the information within the adjacency matrices generated by TADbit.
Parameters: file_targz (str) – Location of the archived JSON files
Returns: json_file_locations – List of the locations of the files within the extracted archive
Return type: list
Example
gz_file = '/home/<user>/test.tar.gz'
json_files = unzipJSON(gz_file)
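The extraction step can be sketched with only the standard library; this is a minimal stand-in for the method above, unpacking into a temporary directory rather than a managed location:

```python
import glob
import os
import tarfile
import tempfile

def unzip_json(file_targz):
    """Extract a .tar.gz archive into a temporary directory and return
    the sorted locations of the JSON files it contained."""
    out_dir = tempfile.mkdtemp()
    with tarfile.open(file_targz, "r:gz") as archive:
        archive.extractall(out_dir)
    return sorted(glob.glob(os.path.join(out_dir, "**", "*.json"), recursive=True))
```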