Resequencing and SNP search
Last update on the 20th of November, 2017Here I process human exome on chromosome 5 raw reads for study purposes and look for significant SNPs.
File | Link |
---|---|
Spreadsheet table | table.ods |
Description of commands (RU, UTF-8) | flags.txt |
Read trimming and quality control
Given reads were trimmed with Trimmomatic software with leading and trailing low-quality bases removed as well as short reads themsevles. Quality control was done with FastQC before and after triimming. Neccessary commands are presented in Fig. 1.
fastqc chr5.fastq java -jar /usr/share/java/trimmomatic.jar SE -phred33 chr5.fastq trim.fastq LEADING:20 TRAILING:20 MINLEN:50 fastqc trim.fastq
FastQC produces zip archive with statistics in text and graphics on various read properties. Most of them are common for both read sets. However, there are some differring properties. Untrimmed set contains 8208 reads of 41-100 nts, trimmed — 8114 of 50-100 nts. Upper boundary is a characteristic value of Illumina sequencing. Only 1,15% were trimmed away.
The distribution of sequence quality per base also differing (Fig.2 and 3). First, right-most box 10th decile was shortened. Other deciles and average line also got closer to the top. However, leading nts didn't change their boxes. Second, in untrimmed set 10th percentiles decreased evenly to the right end, whereas in trimmed set do not. That means 1) short reads (less than 50 nt) contain more low-quality bases; 2) left end is more precise than right one. Those are likely to be explained with standard sequencing protocol.
Other properties don't differ significantly. However, there is something to say about them. Untrimmed reads are taken as an example.
First two nucleotides exhibit bias in base content (Fig. 4) compared to the null model of random base distribution across reads. This might be explained as imprecise adapter trimming or restriction site.
GC content is declared as normal distribution along reads and calculated from the observed data. The presented set shows biased and roughly shaped distribution, which might indicate 1) difference of human GC content from overall observed; 2) putative statistically biased GC content of exons.
Read mapping
table.ods
Mapping was performed with HiSat2 (commands in Fig. 6) on hg19 with no
soft-clipping (i.e. read ends
are mapped strongly) and no spliced alignment (because no splicing occurs in DNA-seq).
Results were saved to *.sam file. Of 8114 reads 8085 (99,64%) were aligned exactly 1 time
and rest 29 (0,36%) were aligned 0 times. HiSat2 puts out some stats in --met-file
but they seem to be relevant to algorithm details and do not carry signficant information.
hisat2-build chr5.fasta indexed hisat2 -x indexed -q trim.fastq -S chr5.sam --no-spliced-alignment --no-softclip --met-file hi.txt
Next step is statistics of mapping calculated with SamTools. First, *.sam file was transformed
to more compact binary *.bam file, then sorted by start coordinate (as default) and indexed
for quicker operating (command in Fig. 7). depth.txt
file was processed in
spreadsheet. Some stats are presented in Figs. 8 — 10.
samtools view -b chr5.sam -o chr5.bam samtools sort -O bam chr5.bam -o sorted_5.bam -T temp samtools index sorted_5.bam samtools depth sorted_5.bam > depth.txt samtools depth sorted_5.bam -r chr5:74643029-74643134 > exon_depth.txt
It is clearly seen that exonic peaks are focal. To investigate one of them I chose one of the most covered bases — 74643070. It corresponded to exon 74643029-74643134 with average coverage of 196 ranging from 173 to 212 (Fig. 7 for CLI command). Corresponding gene encodes HMG-CoA reductase.
Polymorphism search
table.odsPolymorphisms were called via samtools |bcftools (Fig. 11). 28 SNPs and 4 indels were found. SNP coverage varies from 1 to 164 with median of 29, quality &mdash from 11 to 225 with median of 177. Indel qualities are 68, 73, 176, 217. Overall quality seems good. Three of polymorphisms are given in Table 1.
samtools mpileup -uvf chr5.fasta sorted_5.bam -o diff.vcf bcftools call -cv diff.vcf -Ov -o snp.vcf
Type | Coordinate | Reference | Alternative | Quality | Coverage |
---|---|---|---|---|---|
SNP | 35874575 | C | T | 225.009 | 164 |
SNP | 35873899 | C | A | 3.0136 | 3 |
INDEL | 74639544 | CTTGTATTGT | CTTGT | 73.4665 | N/A |
All three polymoprhisms were visualized with IGV in Figs. 12-14.
SNP annotation
table.odsOnly SNPs were annotated across reference SNP database dbsnp, gene reference refgene, confirmed SNP frequencies in 1000 genome project, trait associations in GWAS and clinically significant SNPs in Clinvar. Corresponding commands along with transformation of SNP file to appropriate format are shown in Fig. 15.
perl /nfs/srv/databases/annovar/convert2annovar.pl -format vcf4 snp_2.vcf > snp.avinput perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out snp_rs -build hg19 -dbtype snp138 snp.avinput /nfs/srv/databases/annovar/humandb/ perl /nfs/srv/databases/annovar/annotate_variation.pl -out refseq -build hg19 snp.avinput /nfs/srv/databases/annovar/humandb/ perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out 1000g -buildver hg19 -dbtype 1000g2014oct_all snp.avinput /nfs/srv/databases/annovar/humandb/ perl /nfs/srv/databases/annovar/annotate_variation.pl -regionanno -out gwas -buildver hg19 -dbtype gwasCatalog snp.avinput /nfs/srv/databases/annovar/humandb/ perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -dbtype clinvar_20150629 -out clinvar -buildver hg19 snp.avinput /nfs/srv/databases/annovar/humandb/
Refseq annotation revealed three groups of SNPs: exonic (4), 3'-UTR (1) and intronic (23). Corresponding
genes are IL7R (15), CAPSL (4), HMGCR (9). Exonic SNPs lead to nonsynonymous changes like Thr to Ile (see table.ods
for more information) with no pattern in it. 24 of 28 SNPs has an rs reference and defined allele frequencies according to 1000 genomes.
6 are less than 0,1 frequent, 2 are more than 0,8 frequent and the rest are between 0,1 and 0,8.
Three SNPs are clinically annotated, all in IL7R gene exonic regions (exon 2, 4 and 6). First two lead to severe combined immunodeficiency.
The last one is annoted as not specified in Clinvar, but connected with type 1 diabetes and multiple sclerosis according to GWAS. Whole information
on SNPs is provided in table.ods
file.
Exon coverage
Whole reads were visualized in IGV (Figs. 16-18). Reads cover whole exons with come flankings and few exons missed.