Test Data for RNA-seq pipeline

This document describes how to prepare the data set required for testing the RNA-seq pipeline. It was written with macOS Sierra in mind, although many of the commands also work on other *nix platforms.

You need the tools listed in “Prerequisites” installed on your system. For more details on installing the tools for this pipeline, please refer to

http://multiscale-genomics.readthedocs.io/projects/mg-process-fastq/en/latest/full_installation.html

If you already have some of the packages installed, feel free to skip the corresponding steps. The bin, lib and code directories are assumed to be relative to the home directory; if this is not the case on your system, adjust the paths when running these commands.

Prerequisites

  • Kallisto
  • Samtools
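
Before starting, it can help to confirm that both tools are on the PATH. The following is a minimal sketch; the helper name check_tools is made up for this example and is not part of the pipeline.

```shell
# check_tools is a hypothetical helper used for illustration only.
# It reports, for each named tool, whether it can be found on the PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # The return status is the number of missing tools (0 means all present).
    return "$missing"
}

check_tools kallisto samtools || echo "Install the missing tools before continuing."
```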

Data set for genome file

Go to the Ensembl website >> Human >> Example gene

http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000139618;r=13:32315474-32400266

Copy the chromosome number and coordinates given in the “location” field. Go to BioMart (top panel), and select Filters from the left panel. Expand Regions and enter the information retrieved above.

Click on “Attributes” in the left panel. Select Gene stable ID and Transcript stable ID from Features. Select cDNA sequences under the Sequences radio button.

Click on the Results button above the left panel. Export the results to a FASTA file.
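
A quick sanity check on the exported file is to count its sequence records. The sketch below does this by counting header lines; the helper name count_fasta_records and the tiny demo file are assumptions made for illustration, so substitute the path of the real export.

```shell
# count_fasta_records is a hypothetical helper used for illustration only.
# It counts FASTA records by counting header lines (lines starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Demonstration on a tiny made-up FASTA file with two records.
demo_dir=$(mktemp -d)
printf '>GENE1|TX1\nACGTACGT\n>GENE1|TX2\nACGTTTGA\nCCGA\n' > "$demo_dir/example.fasta"
count_fasta_records "$demo_dir/example.fasta"
rm -r "$demo_dir"
```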

Index this file using Kallisto indexer:

kallisto index -i kallisto.Human.GRCh38.fasta.idx /path/to/file/exportSequences.fasta

Download the fastq files

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR030/ERR030872/ERR030872_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR030/ERR030872/ERR030872_2.fastq.gz
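
Large downloads are sometimes truncated, so it may be worth testing the archives before using them. This is a minimal sketch built on gzip -t; the helper name check_gz is made up for this example, and the demo file stands in for the real downloads.

```shell
# check_gz is a hypothetical helper used for illustration only.
# gzip -t exits with status 0 when the compressed stream is intact.
# The function stops at the first file that fails the test.
check_gz() {
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "ok:      $f"
        else
            echo "corrupt: $f"
            return 1
        fi
    done
}

# Demonstration on a small made-up file; for the real data the call would be
#   check_gz ERR030872_1.fastq.gz ERR030872_2.fastq.gz
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$demo_dir/demo.fastq.gz"
check_gz "$demo_dir/demo.fastq.gz"
rm -r "$demo_dir"
```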

Run the Kallisto quantifier, redirecting the SAM output produced by --pseudobam to a file:

kallisto quant -i kallisto.Human.GRCh38.fasta.idx -o out --pseudobam /path/to/ERR030872_1.fastq.gz /path/to/ERR030872_2.fastq.gz  >kallisto.ERR030872.sam

Keep only the aligned entries from the above SAM file, that is, those whose reference name (the third field) is not “*”:

awk '$3 != "*"' kallisto.ERR030872.sam >kallisto.ERR030872.filtered.sam
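
To see what this filter keeps and drops, the sketch below applies the same awk expression to a made-up four-line SAM file. The helper name filter_aligned and the demo records are assumptions for illustration only.

```shell
# filter_aligned is a hypothetical wrapper around the awk filter shown above.
# It keeps every line whose third field is not "*".
filter_aligned() {
    awk '$3 != "*"' "$1" > "$2"
}

demo_dir=$(mktemp -d)
{
    # Two header lines, one aligned read and one unaligned read (tab-separated).
    printf '@HD\tVN:1.0\n'
    printf '@SQ\tSN:TX1\tLN:100\n'
    printf 'read1\t0\tTX1\t5\t255\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n'
} > "$demo_dir/demo.sam"

filter_aligned "$demo_dir/demo.sam" "$demo_dir/demo.filtered.sam"
cat "$demo_dir/demo.filtered.sam"
rm -r "$demo_dir"
```

In this demo the two header lines and read1 are kept, while read2 is dropped. Header lines typically pass the filter because their third field is either absent or a tag such as LN:100, never “*”.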

Decompress the fastq files. They are gzip archives, so use gunzip rather than unzip:

gunzip ERR030872_1.fastq.gz
gunzip ERR030872_2.fastq.gz

Check out https://github.com/Multiscale-Genomics/mg-misc-scripts/blob/master/RNASeq_Scripts/makeFastQFiles.py and use the following commands to generate the fastq files:

python /path/to/makeFastQFiles.py --samfile kallisto.ERR030872.filtered.sam --fastQfile ERR030872_1.fastq --pathToOutput /path/to/make/fastqFile/ --fastqOut ERR030872_1.RNAseq.fastq
python /path/to/makeFastQFiles.py --samfile kallisto.ERR030872.filtered.sam --fastQfile ERR030872_2.fastq --pathToOutput /path/to/make/fastqFile/ --fastqOut ERR030872_2.RNAseq.fastq

Shorten these files by running the script at https://github.com/Multiscale-Genomics/mg-misc-scripts/blob/master/RNASeq_Scripts/randomSeqSelector.py as follows:

python PythonScripts/randomSeqSelector.py ERR030872_1.RNAseq.fastq kallisto.Human.ERR030872_1.fastq
python PythonScripts/randomSeqSelector.py ERR030872_2.RNAseq.fastq kallisto.Human.ERR030872_2.fastq

Then compress them with gzip:

gzip kallisto.Human.ERR030872_1.fastq
gzip kallisto.Human.ERR030872_2.fastq
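
As a final check, the number of reads in each compressed file can be counted. The sketch below assumes standard four-line FASTQ records; whether the two files should hold the same number of reads depends on how randomSeqSelector.py samples, which is not verified here. The helper name count_reads and the demo files are made up for illustration.

```shell
# count_reads is a hypothetical helper used for illustration only.
# It counts reads in a gzipped FASTQ file, assuming four lines per record.
count_reads() {
    # tr strips the padding that some wc implementations add.
    lines=$(gzip -dc "$1" | wc -l | tr -d ' ')
    echo $((lines / 4))
}

# Demonstration on two small made-up files; for the real data use
#   kallisto.Human.ERR030872_1.fastq.gz and kallisto.Human.ERR030872_2.fastq.gz
demo_dir=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' | gzip -c > "$demo_dir/demo_1.fastq.gz"
printf '@r1/2\nTGCA\n+\nIIII\n@r2/2\nCCAT\n+\nIIII\n' | gzip -c > "$demo_dir/demo_2.fastq.gz"

n1=$(count_reads "$demo_dir/demo_1.fastq.gz")
n2=$(count_reads "$demo_dir/demo_2.fastq.gz")
if [ "$n1" -eq "$n2" ]; then
    echo "both files contain $n1 reads"
else
    echo "read counts differ: $n1 vs $n2"
fi
rm -r "$demo_dir"
```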