ResequencingSearching for polymorphisms in humans
Part I
Task 0
I created a directory /nfs/srv/databases/ngs/potanina.darya. This directory contains reads and chromosome files and all prodused during making the tasks ones.
Task 1 and 2
Firstly, I conducted quality control of reads by FastQC installed on kodomo. The result you can see here: chr9_1_fastqc.html.
Secondly, I sorted reads. It was made by Trimmomatic installed on kodomo with parameters TRAILING:20 (to cut off end less then 20 nucleotides quality) and MINLEN:50 (to leave only more then 50 nucleotides length reads).
$ java -jar /usr/share/java/trimmomatic.jar SE -phred33 chr9_1.fastq chr9_1_cleaned.fastq TRAILING:20 MINLEN:50 |
Output (from 10701 reads 10536 were survived and 165 were dropped):
TrimmomaticSE: Started with arguments: -phred33 chr9_1.fastq chr9_1_cleaned.fastq TRAILING:20 MINLEN:50
Automatically using 8 threads
Input Reads: 10701 Surviving: 10536 (98,46%) Dropped: 165 (1,54%)
TrimmomaticSE: Completed successfully |
Then I made quality control again. The result here: chr9_1_cleaned_fastqc.html.
$ fastqc chr9_1_cleaned.fastq |
As we can see quality of reads was increased after trimming of ends: average value increased and lenght of whiskers decreased which means reducing the number of drop-down values (see fig. 1,2).
Per base sequence quality |
Figure 1. Quality of reads before cleaning. | Figure 2. Quality of reads after cleaning. |
 |  |
Part II
Task 3
Cleaned reads were scanned by HISAT2. Firstly the reference sequence was indexed. The protocol of the program execution: hisat2.txt.
$ export PATH=${PATH}:/home/students/y06/anastaisha_w/hisat2-2.0.5
$ hisat2-build chr9.fasta chr9_indexed |
As a result of program execution 8 files were created they are named chr9_indexed.*.ht2 (*=1,2,...8).
Then I run hisat2 with parametres --no-spliced-alignment and --no-softclip. In the end, there was made an alignment of the reference sequence and purified reads in SAM format.
$ hisat2 -x chr9_indexed -U chr9_1_cleaned.fastq --no-spliced-alignment --no-softclip -S chr9_alignment.sam |
Parametres:
--no-softclip - Disallow soft-clipping
--no-spliced-alignment - Disable spliced alignment (gaps are forbidden)
-x - The basename of the index for the reference genome.
-U - input file with reads
-S - output file in SAM format
Output:
10536 reads; of these:
10536 (100.00%) were unpaired; of these:
74 (0.70%) aligned 0 times
10459 (99.27%) aligned exactly 1 time
3 (0.03%) aligned >1 times
99.30% overall alignment rate |
Task 4
Analisys of the alignment was made by samtools. Firstly, alignment was converted from SAM to BAM format.
$ samtools view chr9_alignment.sam -b -o chr9_alignment.bam |
Secondly, I sort the alignment of readings with the reference and the coordinate of the beginning reference reading (*_temp.txt is a temporary file created by the program).
$ samtools sort chr9_alignment.bam -T chr9_alignment_temp.txt
-o chr9_alignment_sort.bam |
Thirdly, sequence in the resulting file was indexed.
$ samtools index chr9_alignment_sort.bam |
Part III
Task 5
I created file with polynorphisms
$ samtools mpileup -uf chr9.fasta chr9_alignment_sort.bam -o chr9_snp.bcf |
By bcftools was created a file with a list of differences between reference sequense and reads.
$ bcftools call -cv chr9_snp.bcf -o chr9_snp.vcf |
In VCF file I found a few SNPs (see table 5.1). In total there are 5 indels and 106 snp.
Table 5.1.
# | Type of polimorphism | Coordinates | Reference | Read | Depth | Quality |
1 | Replasing | 5093614 | G | T | 3 | 4.12848 |
2 | Indel | 5122854 | agg | ag | 11 | 95.4652 |
3 | Replasing | 136135238 | T | C | 15 | 222 |
Task 6
For SNP annotation I deleted all indels from chr9_snp.vcf and saved it as chr9_snp_6.vcf. Then I converted the new one from VCF to AVINPUT format by annovar.
$ perl /nfs/srv/databases/annovar/convert2annovar.pl -format vcf4 chr9_snp_6.vcf -o chr9.avinput |
Output (106 SNPs: 73 transitions* and 33 transversions** and 0 indels/substitutions):
NOTICE: Finished reading 138 lines from VCF file
NOTICE: A total of 106 locus in VCF file passed QC threshold, representing 106 SNP (73 transitions and 33 transversions) and 0 indels/substitutions
NOTICE: Finished writing 106 SNP genotypes (73 transitions and 33 transversions) and 0 indels/substitutions for 1 sample |
*Transition is an interchange of purines (A-G) or pyrimidines (T-C), **transversion is an interchage of purine for pyrimidine.
$ perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out chr9_snp_rs -build hg19 -dbtype snp138 chr9.avinput /nfs/srv/databases/annovar/humandb/ |
Output (amount of rs 51):
NOTICE: Variants matching filtering criteria are written to chr9_snp_rs.hg19_snp138_dropped, other variants are written to chr9_snp_rs.hg19_snp138_filtered
NOTICE: Processing next batch with 106 unique variants in 106 input lines
NOTICE: Database index loaded. Total number of bins is 2894320 and the number of bins to be scanned is 51
NOTICE: Scanning filter database /nfs/srv/databases/annovar/humandb/hg19_snp138.txt...Done
|
Output files:
Then these 106 SNPs were annotate using differnt databases.
1. Refgene
$ perl /nfs/srv/databases/annovar/annotate_variation.pl -out chr9_refgene -build hg19 chr9.avinput /nfs/srv/databases/annovar/humandb/ |
Output:
NOTICE: The --geneanno operation is set to ON by default
NOTICE: Reading gene annotation from /nfs/srv/databases/annovar/humandb/hg19_refGene.txt ... Done with 50914 transcripts (including 11516 without coding sequence annotation) for 26271 unique genes
NOTICE: Reading FASTA sequences from /nfs/srv/databases/annovar/humandb/hg19_refGeneMrna.fa ... Done with 5 sequences
WARNING: A total of 345 sequences will be ignored due to lack of correct ORF annotation
NOTICE: Finished gene-based annotation on 106 genetic variants in chr9.avinput
NOTICE: Output files were written to chr9_refgene.variant_function, chr9_refgene.exonic_variant_function
|
Output files:
Table 6.1. Locations of SNPs.
Area | Amount of SNPs | Comment |
intronic | 78 | variant overlaps an intron |
exonic | 15 | variant overlaps a coding |
UTR3 | 8 | variant overlaps a 3' untranslated region |
intergenic | 3 | variant is in intergenic region |
downstream | 2 | variant overlaps 1-kb region downtream of transcription end site (use -neargene to change this) |
Table 6.2. SNPs in exons.*
Coordinates | Gene | exon:aa replacemetn:nucleotide replacement | Ref | Read | Synonymy | Quality | DP | Genotype |
5081780 | JAK2:NM_004972 | exon19:c.G2490A:p.L830L | G | A | synonymous | 221.999 | 50 | hom |
5126443 | JAK2:NM_004972 | exon24:c.T3288A:p.D1096E | T | A | nonsynonymous | 207.009 | 38 | het |
6253571 | IL33:NM_001199641 | exon3:c.C111T:p.Y37Y,IL33:NM_033439:exon6:c.C489T: p.Y163Y,IL33:NM_001199640:exon5:c.C363T:p.Y121Y | T | C | synonymous | 225.009 | 54 | het |
*There are 12 SNPs in unknow areas (due to various errors in the gene structure definition in the database file).
2. dbsnp
$ perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -dbtype snp138 -out chr9_dbsnp -build hg19 chr9.avinput /nfs/srv/databases/annovar/humandb/ |
Output:
NOTICE: Variants matching filtering criteria are written to chr9_dbsnp.hg19_snp138_dropped, other variants are written to chr9_dbsnp.hg19_snp138_filtered
NOTICE: Processing next batch with 106 unique variants in 106 input lines
NOTICE: Database index loaded. Total number of bins is 2894320 and the number of bins to be scanned is 51
NOTICE: Scanning filter database /nfs/srv/databases/annovar/humandb/hg19_snp138.txt...Done
|
Output files:
$ perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -dbtype 1000g2014oct_all -out chr_1000genomes -buildver hg19 chr9.avinput /nfs/srv/databases/annovar/humandb/ |
Output:
NOTICE: Variants matching filtering criteria are written to chr_1000genomes.hg19_ALL.sites.2014_10_dropped, other variants are written to chr_1000genomes.hg19_ALL.sites.2014_10_filtered
NOTICE: Processing next batch with 106 unique variants in 106 input lines
NOTICE: Database index loaded. Total number of bins is 2824642 and the number of bins to be scanned is 51
NOTICE: Scanning filter database /nfs/srv/databases/annovar/humandb/hg19_ALL.sites.2014_10.txt...Done
|
Output files:
$ perl /nfs/srv/databases/annovar/annotate_variation.pl -regionanno -dbtype gwasCatalog -out chr9_gwas -buildver hg19 chr9.avinput /nfs/srv/databases/annovar/humandb/ |
Output:
NOTICE: Reading annotation database /nfs/srv/databases/annovar/humandb/hg19_gwasCatalog.txt ... Done with 18027 regions
NOTICE: Finished region-based annotation on 106 genetic variants in chr9.avinput
NOTICE: Output file is written to chr9_gwas.hg19_gwasCatalog
|
Output files:
Table 6.3. The responsibility of the SNPs for various diseases.
Disease | Coordinate | Ref | Read | Genotype | Quality | DP |
Crohn's disease | 4985879 | A | G | het | 225.009 | 30 |
Endometriosis | 6253571 | C | T | het | 225.009 | 54 |
Tumor biomarkers, Coagulation factor levels | 136131188 | C | T | het | 13.2188 | 10 |
mean corpuscular hemoglobin concentration | 136131322 | G | T | het | 180.009 | 24 |
End-stage coagulation | 136131415 | C | T | het | 134.008 | 21 |
Malaria | 136132754 | C | A | het | 123.008 | 33 |
D-dimer levels, Venous thromboembolism, Activated partial thromboplastin time | 136137065 | A | G | hom | 42.7648 | 2 |
Coagulation factor levels | 136137106 | G | A | hom | 9.52546 | 1 |
$ perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -dbtype clinvar_20150629 -out chr9_clinvar -buildver hg19 chr9.avinput /nfs/srv/databases/annovar/humandb/ |
Output:
NOTICE: the --dbtype clinvar_20150629 is assumed to be in generic ANNOVAR database format
NOTICE: Variants matching filtering criteria are written to chr9_clinvar.hg19_clinvar_20150629_dropped, other variants are written to chr9_clinvar.hg19_clinvar_20150629_filtered
NOTICE: Processing next batch with 106 unique variants in 106 input lines
NOTICE: Database index loaded. Total number of bins is 49139 and the number of bins to be scanned is 7
NOTICE: Scanning filter database /nfs/srv/databases/annovar/humandb/hg19_clinvar_20150629.txt...Done
|
Output files:
Table 6.4. Clinical interpretation of the SNPs.*
CLNDBN | Coordinates | Ref | Read | Genotype | Quality | DP |
ABO_BLOOD_GROUP_SYSTEM | 136131315 | C | G | het | 204.009 | 24 |
ABO_BLOOD_GROUP_SYSTEM | 136131651 | G | A | het | 4.12848 | 3 |
*There is not_specified SNP in 5126443 position.
The summary result you can see in summary table (click the link, pls).
© Darya Potanina, 2017
|