< 3rd term

Resequencing and SNP search

Last update on the 20th of November, 2017

Here I process human exome on chromosome 5 raw reads for study purposes and look for significant SNPs.

List of downloads
File Link
Spreadsheet table table.ods
Description of commands (RU, UTF-8) flags.txt

Read trimming and quality control

Given reads were trimmed with Trimmomatic software with leading and trailing low-quality bases removed as well as short reads themsevles. Quality control was done with FastQC before and after triimming. Neccessary commands are presented in Fig. 1.

Fig. 1. CLI commands for read trimming.
fastqc chr5.fastq
java -jar /usr/share/java/trimmomatic.jar SE -phred33 chr5.fastq trim.fastq LEADING:20 TRAILING:20 MINLEN:50
fastqc trim.fastq

FastQC produces zip archive with statistics in text and graphics on various read properties. Most of them are common for both read sets. However, there are some differring properties. Untrimmed set contains 8208 reads of 41-100 nts, trimmed — 8114 of 50-100 nts. Upper boundary is a characteristic value of Illumina sequencing. Only 1,15% were trimmed away.

The distribution of sequence quality per base also differing (Fig.2 and 3). First, right-most box 10th decile was shortened. Other deciles and average line also got closer to the top. However, leading nts didn't change their boxes. Second, in untrimmed set 10th percentiles decreased evenly to the right end, whereas in trimmed set do not. That means 1) short reads (less than 50 nt) contain more low-quality bases; 2) left end is more precise than right one. Those are likely to be explained with standard sequencing protocol.

Fig. 2. Per base sequence quality of untrimmed set.
Fig. 3. Per base sequence quality of trimmed set.

Other properties don't differ significantly. However, there is something to say about them. Untrimmed reads are taken as an example.

First two nucleotides exhibit bias in base content (Fig. 4) compared to the null model of random base distribution across reads. This might be explained as imprecise adapter trimming or restriction site.

Fig. 4. Per base sequence content of untrimmed set.

GC content is declared as normal distribution along reads and calculated from the observed data. The presented set shows biased and roughly shaped distribution, which might indicate 1) difference of human GC content from overall observed; 2) putative statistically biased GC content of exons.

Fig. 5. Per sequence GC content of untrimmed set.

Read mapping

table.ods

Mapping was performed with HiSat2 (commands in Fig. 6) on hg19 with no soft-clipping (i.e. read ends are mapped strongly) and no spliced alignment (because no splicing occurs in DNA-seq). Results were saved to *.sam file. Of 8114 reads 8085 (99,64%) were aligned exactly 1 time and rest 29 (0,36%) were aligned 0 times. HiSat2 puts out some stats in --met-file but they seem to be relevant to algorithm details and do not carry signficant information.

Fig. 6. CLI commands for read mapping.
hisat2-build chr5.fasta indexed
hisat2 -x indexed -q trim.fastq -S chr5.sam --no-spliced-alignment --no-softclip --met-file hi.txt

Next step is statistics of mapping calculated with SamTools. First, *.sam file was transformed to more compact binary *.bam file, then sorted by start coordinate (as default) and indexed for quicker operating (command in Fig. 7). depth.txt file was processed in spreadsheet. Some stats are presented in Figs. 8 — 10.

Fig. 7. CLI commands for mapping stats.
samtools view -b chr5.sam -o chr5.bam
samtools sort -O bam chr5.bam -o sorted_5.bam -T temp
samtools index sorted_5.bam
samtools depth sorted_5.bam > depth.txt
samtools depth sorted_5.bam -r chr5:74643029-74643134 > exon_depth.txt
Fig. 8. Overall statistics of mapping.
Coverage differs from 1 to 212 with average of 33,38 and median of 17.
Fig. 9. Mapped regions (part 1).
Two extended exon-containing regions of ~20Kb and ~40Kb.
Fig. 10. Mapped regions (part 2).
The third extended exon-containing region of ~25Kb.

It is clearly seen that exonic peaks are focal. To investigate one of them I chose one of the most covered bases — 74643070. It corresponded to exon 74643029-74643134 with average coverage of 196 ranging from 173 to 212 (Fig. 7 for CLI command). Corresponding gene encodes HMG-CoA reductase.

Polymorphism search

table.ods

Polymorphisms were called via samtools |bcftools (Fig. 11). 28 SNPs and 4 indels were found. SNP coverage varies from 1 to 164 with median of 29, quality &mdash from 11 to 225 with median of 177. Indel qualities are 68, 73, 176, 217. Overall quality seems good. Three of polymorphisms are given in Table 1.

Fig. 11. CLI commands for polymorphism calling.
samtools mpileup -uvf chr5.fasta sorted_5.bam -o diff.vcf
bcftools call -cv diff.vcf -Ov -o snp.vcf
Table 1. Sample polymorphisms in reads.
Type Coordinate Reference Alternative Quality Coverage
SNP 35874575 C T 225.009 164
SNP 35873899 C A 3.0136 3
INDEL 74639544 CTTGTATTGT CTTGT 73.4665 N/A

All three polymoprhisms were visualized with IGV in Figs. 12-14.

Fig. 12. C35874575T SNP.
SNP is stable across reads.
Fig. 13. C35873899A SNP.
Few reads observed.
Fig. 14. Indel at 74639544 CTTGTATTGT -> CTTGT.
5 nts (TTGTA) are deleted in few reads. In some T74639546C occurs.

SNP annotation

table.ods

Only SNPs were annotated across reference SNP database dbsnp, gene reference refgene, confirmed SNP frequencies in 1000 genome project, trait associations in GWAS and clinically significant SNPs in Clinvar. Corresponding commands along with transformation of SNP file to appropriate format are shown in Fig. 15.

Fig. 15. CLI commands for SNP annotations.
perl /nfs/srv/databases/annovar/convert2annovar.pl -format vcf4 snp_2.vcf > snp.avinput
perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out snp_rs -build hg19 -dbtype snp138 snp.avinput /nfs/srv/databases/annovar/humandb/
perl /nfs/srv/databases/annovar/annotate_variation.pl  -out refseq -build hg19 snp.avinput /nfs/srv/databases/annovar/humandb/
perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out 1000g -buildver hg19 -dbtype 1000g2014oct_all snp.avinput /nfs/srv/databases/annovar/humandb/ 
perl /nfs/srv/databases/annovar/annotate_variation.pl -regionanno -out gwas -buildver hg19 -dbtype gwasCatalog snp.avinput /nfs/srv/databases/annovar/humandb/
perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -dbtype clinvar_20150629 -out clinvar -buildver hg19  snp.avinput /nfs/srv/databases/annovar/humandb/

Refseq annotation revealed three groups of SNPs: exonic (4), 3'-UTR (1) and intronic (23). Corresponding genes are IL7R (15), CAPSL (4), HMGCR (9). Exonic SNPs lead to nonsynonymous changes like Thr to Ile (see table.ods for more information) with no pattern in it. 24 of 28 SNPs has an rs reference and defined allele frequencies according to 1000 genomes. 6 are less than 0,1 frequent, 2 are more than 0,8 frequent and the rest are between 0,1 and 0,8.

Three SNPs are clinically annotated, all in IL7R gene exonic regions (exon 2, 4 and 6). First two lead to severe combined immunodeficiency. The last one is annoted as not specified in Clinvar, but connected with type 1 diabetes and multiple sclerosis according to GWAS. Whole information on SNPs is provided in table.ods file.

Exon coverage

Whole reads were visualized in IGV (Figs. 16-18). Reads cover whole exons with come flankings and few exons missed.

Fig. 16. IL7R exon coverage.
Only right flanking exon isn't covered with reads.
Fig. 17. CAPSL exon coverage.
All exons covered with reads from substantially to somehow.
Fig. 18. HMGCR exon coverage.
Only the last one, leftmost, isn't covered.