Home page
Term 1
Term 2
Term 3
About me
Faculty website

Transcriptome reads mapping

THERE IS A TABLE WITH ALL THE COMMANDS USED AT THE END OF THE WEBPAGE

Read quality analysis

According to the output of the FastQC program, the given single-end reads have a satisfactory quality, despite the fails in "Per base sequence content" which may be attributed to primers, "Per sequence GC content", which can be explained by the unrepresentativeness of the whole genome region, "Sequence duplication levels", "Overrepresented sequences", and also a warning in the "Kmer content" section of the FastQC report, which can all be ascribed to the fact that it was RNA sequencing.
FastQC report
The main merit of the given set of reads is their initial quality:

Trimming with Trimmomatic

Trimmomatic with TRAILING:20 and MINLEN:50 didn't change the per base quality to any significant extent. What it did was to covert a check in the "Per tile quality" section into a warning:
Before trimmming:

After trimmming:

Thus it was decided to work with the original set of reads instead of the trimmed ones. FastQC report after trimming
There were 24156 reads in the original file and 24051 reads in the resulting file.

Mapping the reads

The hisat2 program was launched without the --no-spliced-alignment parameter since we are dealing with transcriptome this time, where the reads can contain exon junctions.
This was the program's output:

24156 reads; of these:
  24156 (100.00%) were unpaired; of these:
    502 (2.08%) aligned 0 times
    23643 (97.88%) aligned exactly 1 time
    11 (0.05%) aligned >1 times
97.92% overall alignment rate

Below there is a .sam file containing those 23654 reads mapped to the reference sequence (chr5.fasta):
Alignment

Alignment analysis

The binary file returned by samtools view: mapped_transcriptome.bam
The binary file returned by samtools sort: mapped_transcriptome_sorted.bam
Indexed file produced by samtools index: mapped_transcriptome_sorted.bam.bai
23654 reads were mapped to the chromosome sequence, 502 were not. Additional information from hisat2 can be found above or in the .sam file produced by it (link in the previous section).

Counting the reads with htseq-count

The htseq-count program counts how many reads map to each feature from the feature file (in this case features are genes). Some of its options are -f, -s, -i, -m. -f specifies the format of the input file (sam/bam, default is sam); -s {yes, no, reverse} establishes whether the data is from a strand-specific assay, 'reverse' means 'yes' with reverse strand interepretation; -i sets the feature ID for a GFF feature file (default is gene_id suitable for Ensembl GTF files); -m {union, intersection-strict, intersection-nonempty} specifies the protocol to follow in ambiguous mapping cases:

The resulting file contained a host of genes with zero reads mapped to each of them, so regular expressions were used to extract the genes with nonzero number of pertaining reads:

 
ENSG00000181163.9       22529
ENSG00000249353.2       316
__no_feature    798
__not_aligned   502
__alignment_not_unique  22

Here is the same data in a file.
To sum up, out of 24156 reads 502 were no aligned, 798 were aligned but didn't match any genes, 22 were aligned ambiguously and were not resolved by the union mode, 22529 reads were aligned with the ENSG00000181163.9 gene, and 316 reads were aligned with the ENSG00000249353.2 gene. The ENSG00000181163.9 gene codes for nucleophosmin 1 (NPM1), which is a protein associated with nucleolar ribonucleoprotein structures and involved in the biogenesis of ribosomes. It binds single-stranded and double-stranded nucleic acids, preferentially G-quadruplex forming ones. ENSG00000249353.2 is nucleophosmin 1 pseudogene 27

Table with all the commands used

Command	Function
fastqc chr5.1.fastq	It returns chr5.1_fastqc.html file containing an analysis of a sequencer output (.fastq file). It also returns a .zip archive containing the sequencing data.
java -jar /nfs/srv/databases/ngs/suvorova/trimmomatic/trimmomatic-0.30.jar SE -phred33 chr5.1.fastq chr5.1_trimmomatic.fastq TRAILING:20 MINLEN:50	Trims reads having nucleotides of lower quality than 20 and then disposes of the reads shorter than 50 nt.
hisat2 -x chr5 -U chr5.1.fastq -S mapped_transcriptome.sam --no-softclip	It receives an index following -x, a set of reads following -U, and an output file name following -S, creating an alignment of the reads with the reference sequence. The parameter that is left disallows alignment of reads with clipped 5' and 3' ends. --no-spliced-alignment is not used because of possible exon junctions contained in the reads.
samtools view mapped_transcriptome.sam -bo mapped_transcriptome.bam	This command receives a .sam file and converts it into a .bam file. The -b option specifies ther .bam format, while the -o option specifies the name of the output file. These two options combined result in the -bo "option".
samtools sort -T temp.txt -o mapped_transcriptome_sorted.bam mapped_transcriptome.bam	It sorts the alignment from the input file. The option -T is needed to specify a file which the program can write temporary files into. The file temp.txt is deleted after the program finishes operating.
samtools index mapped_transcriptome_sorted.bam	This program indexes the input file producing mapped_reads_sorted.bam.bai file.
htseq-count -f bam mapped_transcriptome_sorted.bam gencode.v19.chr_patch_hapl_scaff.annotation.gtf > feature_htseq	Counts the number of reads mapped to each gene from the .gtf(.gff) file. -f specifies the format of the input file.
grep -vw 0 feature_htseq > feature_htseq_nonzero	Returns the lines that match the regular expression. In this case, -w 0 confines the output to the lines which contain the word "0", and -v reverses the output, leaving only lines that do not contain solitary zeros (the ones surrounded by spaces).