Command table

Command	Description
fastqc chr9.2.fastq	FastQC reads a set of sequence files and produces from each one a quality control report consisting of a number of different modules, each one of which will help to identify a different potential type of problem in your data.
java -jar /nfs/srv/databases/ngs/suvorova/trimmomatic/trimmomatic-0.30.jar SE -phred33 chr9_1.fastq 9.1cut.fastq TRAILING:20 MINLEN:50	This command cuts off parts with average quality less than 20 and removes reads that are less than 50 nucleotides long. Quality changes are presented in Fig.1 and 2
hisat2-build chr9.fasta r_chr9_ind	Indexing of reference sequence
hisat2 -x r_chr9_ind -U 9.1cut.fastq --no-softclip -S chr9_alig.sam --summary-file hisat2.txt	Alignment of reads and reference sequence. --no-spliced-alignment was removed due to the fact that in RNA reads it is possible that 2 or more different parts of one read can be mapped distantly due to splicing.
samtools view -t chr9.fasta.fai -b chr9_alig.sam -o chr9_alig.bam	Conversion of alignment into binary format
samtools sort chr9_alig.bam -T temp -o chr9_sort.bam	Alignment sorting by the coordinate in the reference
samtools index chr9_sort.bam	Indexing of sorted file
htseq-count -f bam -s no -m intersection-nonempty -i gene_id chr9_sort.bam /P/y14/term3/block4/SNP/rnaseq_reads/gencode.v19.chr_patch_hapl_scaff.annotation.gtf -o file.HTSeq-count.txt	This tool takes an alignment file in SAM or BAM format and feature file in GFF format and calculates the number of reads mapping to each feature. -f (format), --format=(format) Format of the input data. Possible values are sam (for text SAM files) and bam (for binary BAM files). Default is sam. -s (yes/no/reverse), --stranded=(yes/no/reverse) whether the data is from a strand-specific assay (default: yes) -i (id attribute), --idattr=(id attribute) GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table. The default, suitable for RNA-Seq analysis using an Ensembl GTF file, is gene_id. -m (mode), --mode=(mode) Mode to handle reads overlapping more than one feature. Possible values for (mode) are union, intersection-strict and intersection-nonempty (default: union)

Table 1. Commands

Task 1. Reads preparation

Amount of reads didn't change dramatically: from 14020 to 13487 (533 were dropped). According to the command was used, reads whose length was less than 50 (after cutting off low quality parts) were dropped.

Figure 1 and 2. Per base sequence quality before and after trimming

From the graph it is clear that the quality of reading the first 25 nucleotides has hardly changed. Starting at about 30th nucleotide, the scatter of values slightly decreases and the quality improves.

Task 2. Reads mapping

Used commands are presented in Table 1. 13487 reads were unpaired, 78 of them were aligned 0 times, 13409 of them were aligned exactly one time (0 were aligned more than once).

13487 reads; of these:                    
  13487 (100.00%) were unpaired; of these:
    78 (0.58%) aligned 0 times            
    13409 (99.42%) aligned exactly 1 time 
    0 (0.00%) aligned >1 times            
99.42% overall alignment rate

Task 3. Reads count

Results of count.py script: __no_feature - 251; ENSG00000119335.12 gene - 13158;__not_aligned - 78

ENSG00000119335.12 is SET-201 gene, coding Protein SET. Multitasking protein, involved in apoptosis, transcription, nucleosome assembly and histone chaperoning.

It also seems that all reads ended up whithin ENSG00000119335.12 gene