Home page
Term 1
Term 2
Term 3
About me
Faculty website

Sequencing data analysis

THERE IS A TABLE WITH ALL THE COMMANDS USED AT THE END OF THE WEBPAGE

Read quality analysis

According to the output of the FastQC program, the given single-end reads have a high quality. There are warnings in the "Per base sequence content" (a little wobble there, mostly in the left-hand region, presumably caused by primers), "Per sequence GC content", and "Sequence length distribution" sections of the FastQC report, displaying some minor contaminations of the sample, but overall it can be concluded that the quality is high.
FastQC report
In the "Sequence length distribution" section there was a warning, although the plot clearly shows that there are only sequences with lengths of around 100. To clarify the cause of the warning, I opened the fastqc_data.txt file from the .zip archive, and it showed that there actually is one short read, though not seen in the diagram:

>>Sequence Length Distribution	warn
#Length	Count
40-41	1.0
42-43	0.0
44-45	0.0
46-47	0.0
48-49	0.0
50-51	0.0
52-53	0.0
54-55	0.0
56-57	0.0
58-59	0.0
60-61	0.0
62-63	0.0
64-65	0.0
66-67	0.0
68-69	0.0
70-71	0.0
72-73	0.0
74-75	0.0
76-77	0.0
78-79	0.0
80-81	0.0
82-83	0.0
84-85	0.0
86-87	0.0
88-89	0.0
90-91	0.0
92-93	0.0
94-95	0.0
96-97	0.0
98-99	0.0
100-101	8207.0

Trimming and discarding poor quality or inappropriately short reads

This is a diagram displaying per base quality before the application of Trimmomatic:

When the Trimmomatic program was launched with the parameters specified in the task (minimum length 50 and minimum quality 20), the resulting .fastq file happened to be much worse than the original one. The TRAILING parameter turned out to trim reads with low quality bases in the middle of the sequence, creating many short "reads" and thus lowering the overall quality of the sequencing data (the GC content section gave a fail instead of a warning), although the diagram in "Per base quality" section did not show low quality bases any more:
FastQC report after incorrect trimming
Afterwards, another trimming was performed with MINLEN set to 100. Such parameters neither allowed those short pseudoreads to contaminate the sequencing data, nor deleted any significant data since all the reads except one were 100 nt long in the original .fastq file. Thus, I ended up with 7049 100-nt-long reads (instead of 8207 in the original file). Below there is the "Per base quality" diagram from the correctly trimmed set of reads together with the complete FastQC report, which showes an improved quality:

FastQC report after correct trimming
There were 8208 reads (including the one short read) in the original file and 7049 reads in the resulting file.
This is the distribution of read length after the use of Trimmomatic (now there is no warning in this section):

Mapping the reads

This was the output of the hisat2 program.

7049 reads; of these:
  7049 (100.00%) were unpaired; of these:
    22 (0.31%) aligned 0 times
    7027 (99.69%) aligned exactly 1 time
    0 (0.00%) aligned >1 times
99.69% overall alignment rate

Below there is a .sam file containing those 7027 reads mapped to the reference sequence (chr5.fasta):
Alignment

Alignment analysis

The binary file returned by samtools view: mapped_reads.bam
The binary file returned by samtools sort: mapped_reads_sorted.bam
Indexed file produced by samtools index: mapped_reads_sorted.bam.bai
7049 reads were mapped to the chromosome sequence, 22 were not. Additional information from hisat2 can be found above or in the .sam file produced by it (link in the previous section).

SNP analysis

Three SNPs from the pileup.vcf file:

Position	Polymorphism type	Reference/Read	Depth	Quality
35857177	SNP	G/C	101	221.999
35857235	SNP	C/G	71	221.999
35857308	Indel	T/TC	30	173.457

There are 31 polymorphisms altogether: 4 indels and 27 SNPs.
Depth varies from one read to 138 (median 26, which is OK). Quality varies from 8.64911 to 225.009 whith the median of 179.009, which is quite a high quality (these data were obtained via Excel).

RefSeq SNP types

SNP are classified by Annovar according to their position as follows:

SNP type	Explanation
exonic	variant overlaps a coding exon
splicing	variant is within 2-bp of a splicing junction (use -splicing_threshold to change this)
ncRNA	variant overlaps a transcript without coding annotation in the gene definition (see Notes below for more explanation)
UTR5	variant overlaps a 5' untranslated region
UTR3	variant overlaps a 3' untranslated region
intronic	variant overlaps an intron
upstream	variant overlaps 1-kb region upstream of transcription start site
downstream	variant overlaps 1-kb region downtream of transcription end site (use -neargene to change this)
intergenic	variant is in intergenic region

Annotation analysis

Annovar with RefGene found the following types among the ones being analyzed (information from the .variant_function file):

SNP type	Count
intronic	22
exonic	4
UTR3	1

The same file provided information about which genes do the SNPs happen to be in:

Gene	SNP count
IL7R	14
CAPSL	4
HMGCR	9

According to the files returned by annovar with DBSNP, 24 SNPs have rs.
According to the output of the annovar with 1000Genomes, the mean frequency of occurence of SNPs is 0.417897.
Clinical annotation of some SNPs was returned by annovar with GWAS:

Medical condition	Position in the 5th chromosome	Reference/read
Multiple sclerosis,Type 1 diabetes	35874575	C/T
Type 1 diabetes	35910529	C/T
LDL cholesterol,Cholesterol, total	74651084	A/G
Quantitative traits,LDL cholesterol	74655726	C/T
Cholesterol, total,LDL cholesterol	74656539	T/C

Here one can find an Excel table summarizing all the data obtained from the annotation.

Table with all the commands used

Command	Function
fastqc chr5_fastq	It returns chr5_fastqc.html file containing an analysis of a sequencer output (.fastq file). It also returns a .zip archive containing the sequencing data.
java -jar /nfs/srv/databases/ngs/suvorova/trimmomatic/trimmomatic-0.30.jar SE -phred33 chr5.fastq chr5_trimmomatic.fastq TRAILING:20 MINLEN:100	Trims reads having nucleotides of lower quality than 20 and then disposes of the reads shorter than 100 nt.
PATH=${PATH}:/home/students/y06/anastaisha_w/hisat2-2.0.5	This command allows to launch programs located in the specified directory outside that directory.
hisat2-build chr5.fasta chr5	It builds a HISAT2 index from the .fasta input file. It creates several files whose basename is (names start with) "chr5" (the second argument).
hisat2 -x chr5 -U chr5_trimmomatic.fastq -S mapped_reads.sam --no-spliced-alignment --no-softclip	It receives an index following -x, a set of reads following -U, and an output file name following -S, creating an alignment of the reads with the reference sequence. The two parameters that are left disallow alignment of reads with clipped 5' and 3' ends and alignment of reads with an indel inside a read corresponding to an excised intron.
samtools view mapped_reads.sam -bo mapped_reads.bam	This command receives a .sam file and converts it into a .bam file. The -b option specifies ther .bam format, while the -o option specifies the name of the output file. These two options combined result in the -bo "option".
samtools sort -T temp.txt -o mapped_reads_sorted.bam mapped_reads.bam	It sorts the alignment from the input file. The option -T is needed to specify a file which the program can write temporary files into. The file temp.txt is deleted after the program finishes operating.
samtools index mapped_reads_sorted.bam	This program indexes the input file producing mapped_reads_sorted.bam.bai file.
samtools mpileup -uf chr5.fasta -o pileup.bcf mapped_reads_sorted.bam	It creates a binary file (pileup.bcf) containing information about polymorphisms. -f precedes a reference file (chr5.fasta), and -u specifies that the output file must be uncompressed.
bcftools call -cv pileup.bcf -o pileup.vcf	Creates a .vcf file from the binary .bcf.
vcftools --vcf pileup.vcf --remove-indels --recode --recode-INFO-all --out pileup_no_indels	Deletes indels from the input .vcf file. --recode-INFO-all allows to retain the INFO field (otherwise it would be deleted).
convert2annovar.pl -format vcf4 pileup_no_indels.recode.vcf -outfile pileup	This progam converts a .vcf file into an input file for annovar.
annotate_variation.pl -out refgene_anno -build hg19 pileup /nfs/srv/databases/annovar/humandb.old/	Annotation using the RefGene database.
annotate_variation.pl -filter -out dbsnp_anno -build hg19 -dbtype snp138 pileup /nfs/srv/databases/annovar/humandb.old/	Annotation using the DBSNP database.
annotate_variation.pl -filter -out 1000gen_anno -build hg19 -dbtype 1000g2014oct_all pileup /nfs/srv/databases/annovar/humandb.old/	Annotation using the 1000Genomes database
annotate_variation.pl -regionanno -out gwas_anno -build hg19 -dbtype gwasCatalog pileup /nfs/srv/databases/annovar/humandb.old/	Annotation using the GWAS database