Resequencing. The search for polymorphisms

Resequencing. A search for polymorphisms

Part 1

0. Make your new directory and copy your .fastq and .fasta files there

New address: /nfs/srv/databases/ngs/sophia.veselova/ Files: chr7.fastq, chr7.fasta

1. Use FASTQC to analyse .fastq file

Terminal command:
fastqc chr7.fastq

App: File > Open.. (or key 'cmd+O')
Result:

Used commands

CommandFunction
fastqc chr7.fastq
Analyse file with FASTQC
java -jar /usr/share/java/trimmomatic.jar
SE -phred33 chr7.fastq chr7_1.fastq TRAILING:20 MINLEN:50
Cut off end low quality (<20) nucleotides; left only >50 length reads
hisat2-build chr7.fasta chr7_neue
Indexing of a reference sequence
hisat2 --no-spliced-alignment --no-softclip
-x chr7_neue -U chr7_1.fastq -S alignment.sam
Alignment of the reference sequence
samtools view alignment.sam -b -o alignment.bam
SAM -> BAM (binary)
samtools sort alignment.bam -T alignment.txt -o samsorted.bam
Sorts alignment file; writes temporary files to PREFIX.nnnn.bam
samtools index samsorted.bam
Indexing of samsorted.bam
samtools mpileup -uf chr7.fasta samsorted.bam -o snp.bcf
Creates .bcf file (with polymorphisms)
bcftools call -cv snp.bcf -o SNP.vcf
Creates file with diefferences between reference and read in .vcf format
perl /nfs/srv/databases/annovar/convert2annovar.pl
-format vcf4 snp_neue.vcf > chr7.avinput
Creates file.avinput (required by annovar)
 perl /nfs/srv/databases/annovar/annotate_variation.pl
-filter -out chr7.snp -build hg19 -dbtype snp138 chr7.avinput
/nfs/srv/databases/annovar/humandb/
Shows which SNPs have rs
perl /nfs/srv/databases/annovar/annotate_variation.pl
-out refgene -build hg19 chr7.avinput
/nfs/srv/databases/annovar/humandb/
Makes refGene annotation
perl /nfs/srv/databases/annovar/annotate_variation.pl
-filter -out dbsnp -build hg19 -dbtype snp138 chr7.avinput
/nfs/srv/databases/annovar/humandb/
Makes dbsnp annotation
perl /nfs/srv/databases/annovar/annotate_variation.pl
-filter -dbtype 1000g2014oct_all -buildver hg19 -out 1000g
chr7.avinput /nfs/srv/databases/annovar/humandb/
Makes 1000 genomes annotation
perl /nfs/srv/databases/annovar/annotate_variation.pl
-regionanno -build hg19 -out GWAS -dbtype gwasCatalog
chr7.avinput /nfs/srv/databases/annovar/humandb/
Makes GWAS annotation
perl /nfs/srv/databases/annovar/annotate_variation.pl
-filter -out clinvar -dbtype clinvar_20150629 -buildver
hg19 chr7.avinput /nfs/srv/databases/annovar/humandb/
Makes Clinvar annotation

2. Use Trimmomatic to cut off end low quality (<20) nucleotides; leave only >50 length reads

Why is it important?
The end low quality nucleotides are acquired from the longest sequencing reaction products and they have to be removed because the reaction product peaks are not well resolved and may bleed into each other.
Reads shorter than 50 nucleotides have to be removed because they may contain adapters. Adapter contamination will lead to alignment errors and increased amount of unaligned reads. Even if there are no adapters after removing these reads the quality has improved (see results, there are no more columns in the orange zone)

Terminal command:

java -jar /usr/share/java/trimmomatic.jar SE -phred33 chr7.fastq chr7_1.fastq TRAILING:20 MINLEN:50

Result (compared with an old one):

Amount of reads before/after: 3752/3650

Part 2. Use HISAT2 for mapping sequence reads

3. Mapping

hisat2 commands:
Command Notes
hisat2-build chr7.fasta chr7_neue
Indexing of a reference sequence
hisat2 --no-spliced-alignment --no-softclip -x chr7_neue -U chr7_1.fastq -S alignment.sam
Alignment of the reference sequence
--no-softclip - without soft sequence clipping (when bases in 5' and 3' are not the part of the alignment BUT they haven't been removed from the read sequence)
--no-spliced-alignment - no gaps in alignment
-x - 'basename' for all indexed files
-U - file with reads
-S - SAM output file

Output:
    3650 reads; of these:
    3650 (100.00%) were unpaired; of these:
    8 (0.22%) aligned 0 times
    3458 (94.74%) aligned exactly 1 time
    184 (5.04%) aligned >1 times
    99.78% overall alignment rate
As we can see from the output, 8 reads were not aligned, and 184 were aligned more than once.
From the total of 3650 reads 3458 were aligned exactly one time.

4. Alignment analysis

SAM commands:
Command Notes
samtools view alignment.sam -b -o alignment.bam
SAM -> BAM (binary)
-b - BAM
-o - output file
samtools sort alignment.bam -T alignment.txt -o samsorted.bam
Sorts alignment file; writes temporary files to PREFIX.nnnn.bam
-T - temporary file
-o - output file
samtools index samsorted.bam
Indexing of samsorted.bam

Info, received from alignment.sam (description of all fields separated with tabs in the file):

1. Read name
2. Sum of all flags (example: 272)
3. Name of reference sequence where alignment occurs (chr7)
4. 1-based offset into the forward reference strand where leftmost character of the alignment occurs (example: 100488365)
5. Mapping quality (60)
6. CIGAR string representation of alignment (example: 97M)
7. Name of reference sequence where mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate. (*)
8. 1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate. (0)
9. Inferred fragment length. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if the mates did not align concordantly. (0)
10. Read sequence (reverse-complemented if aligned to the reverse strand) (example: GAAGGCACCGCGGGGGAGGGAGCTCAGCCTGAGACATGCAGAGGACCGGGAGCCCCGGGGGACGTCGGGGTGTGGTGGGGATGGGCAGAGTCTGGGGCTC)
11. ASCII-encoded read qualities (reverse-complemented if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33 (ASCII char !), similarly to a FASTQ file. (example: CCCCCCFFFFFHHHHHIJJEHDDDDDDDDDDDDD?CCCDCCDCDCBBBDBDDDBBBDDDBD@BDDDBDDDDDB;>5??A?BDB85?@B??BBDA>>ACAB@5?)
12. Alignment score (example: AS:i:-4)
13. The number of ambiguous bases in the reference covering this alignment (example: XN:i:0)
14. The number of mismatches in the alignment (example: XM:i:1)
15. The number of gap opens, for both read and reference gaps, in the alignment (example: XO:i:0)
16. The number of gap extensions, for both read and reference gaps, in the alignment (example: XG:i:0)
17. The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string (example: NM:i:1)
18. A string representation of the mismatched reference bases in the alignment (example: MD:Z:72G27)
19. Value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to aligned either concordantly or discordantly (example: YT:Z:UU)
20. The number of mapped locations for the read or the pair. (example: NH:i:1)

Part 3. SNP analysis

5. Search for SNP and indels

Commands:
    samtools mpileup -uf chr7.fasta samsorted.bam -o snp.bcf
    bcftools call -cv snp.bcf -o SNP.vcf
#Coordinates TypeReferenceReadDepthQuality
1120965718InsertiontaatAAaa216.5627
2120978918InsertionATTATTT8105.467
3134250322ReplacingAC68225.009

What does each column mean?
Coordinates - where each polymorphism took place (120965718 in #1)
Type - the type of the polymorphism (insertion in #1, #2, replacing in #3)
Reference - what WAS before (taa in #1)
Read - what is NOW (AA in #1 has inserted, C replaced A in #3)
Depth - read feature - #1, #2 have low depth
Quality - read feature - #1 has low quality

6. SNP annotation

First, all 3 indels were removed (new file snp_neue.vcf) Commands:
    perl /nfs/srv/databases/annovar/convert2annovar.pl -format vcf4 snp_neue.vcf > chr7.avinput | Result: A total of 31 locus in VCF file passed QC threshold,
                                                                                                  representing 31 SNPs (24 transitions and 7 transversions) and 0 indels/substitutions
                                                                                                  Note: transition - interchange of purines (A-G) or pyrimidines (T-C)
                                                                                                        transversion - interchage of purine for pyrimidine
    perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out chr7.snp -build hg19     | Result: amount of rs: 28
    -dbtype snp138 chr7.avinput /nfs/srv/databases/annovar/humandb/
Database InputOutputNotes
Refgene
perl /nfs/srv/databases/annovar/annotate_variation.pl
-out refgene -build hg19 chr7.avinput
        /nfs/srv/databases/annovar/humandb/
refgene.exonic_variant_function
refgene.log
refgene.variant_function
Groups: exonic/intronic/UTR3
25 intronic SNPs
4 exonic SNPs, 2 UTR3 SNPs - (UTR3 means 3' noncoding region)
dbsnp
perl /nfs/srv/databases/annovar/annotate_variation.pl
-filter -out dbsnp -build hg19 -dbtype snp138 chr7.avinput
/nfs/srv/databases/annovar/humandb/
dbsnp.hg19_snp138_dropped
dbsnp.hg19_snp138_filtered
dbsnp.log
28 rs for 31 SNPs (so, 28 SNPs were annotated)
1000 genomes
perl /nfs/srv/databases/annovar/annotate_variation.pl
-filter -dbtype 1000g2014oct_all -buildver hg19 -out 1000g
chr7.avinput /nfs/srv/databases/annovar/humandb/
1000g.hg19_ALL.sites.2014_10_dropped
1000g.hg19_ALL.sites.2014_10_filtered
1000g.log
There is frequency data for each SNP (except four of them). Minimum value: 0,0239617
Maximum value: 0,650759
Average: 0,32507983
Median: 0,336462
GWAS
perl /nfs/srv/databases/annovar/annotate_variation.pl
-regionanno -build hg19 -out GWAS -dbtype gwasCatalog chr7.avinput
/nfs/srv/databases/annovar/humandb/
GWAS.hg19_gwasCatalog
GWAS.log
See Table below (Type 2 diabetes, Bone mineral density, Longevity and Cortical thickness). from the received data we can assume that some kind of SNPs can be associated with different kind of diseases or physiological features.
Clinvar
perl /nfs/srv/databases/annovar/annotate_variation.pl
-filter -out clinvar -dbtype clinvar_20150629 -buildver hg19 chr7.avinput
        /nfs/srv/databases/annovar/humandb/
clinvar.hg19_clinvar_20150629_dropped
clinvar.hg19_clinvar_20150629_filtered
clinvar.log
.dropped is empty

CoordinateSNPQualityDPrefgenedbsnp1000 genomesGWASClinvar
100487721G T37.51364UTR3    ACHE(NM_001302622:c.*114C>A,NM_015831:c.*858C>A,NM_000665:c.*114C>A,NM_001282449:c.*114C>A)rs172286160.0936502--
100490077G A221.99934exonic  ACHErs76360.105232Type 2 diabetes-
120965652C T62.00738intronic   WNT16rs77826480.105232--
120969769G A222.97414exonic  WNT16rs29080040.510383Bone mineral density-
120979089 C T225.23exonic  WNT16rs27074660.502995Cortical thickness-
134221694G A7.799932intronic AKR1B10rs7061590.442891--
134222091C A10.42472intronic  AKR1B10rs138166076---
134237239T C6.202261intronic  AKR1B15rs737249740.0241613--
134237294A G6.202261intronic  AKR1B15rs731648570.341653--
134239978C A29.012310intronic  AKR1B15rs21134510.336462--
134246036G A3.545571intronic  AKR1B15----
134248045A G11.34291intronic  AKR1B15rs102615320.504992--
134250322A C225.00968intronic  AKR1B15rs47320380.503594Longevity-
134251041C T5.463831intronic  AKR1B15rs7062010.573482--
134252691C T26.01775intronic  AKR1B15rs591364740.273762--
134252718C T78.00756intronic  AKR1B15rs177759340.368011--
134253269C T135.0159intronic  AKR1B15rs560977120.337859--
134254029G A212.00947intronic  AKR1B15rs37925740.513578--
134254427G A185.99923intronic  AKR1B15rs7825380.650759--
134255326C T7.799931intronic  AKR1B15----
134259951T C87.00768intronic  AKR1B15rs77888010.241813--
134259962T C105.00812intronic  AKR1B15rs14654730.473243--
134260106C T225.00938intronic  AKR1B15rs7825390.0273562--
 134260464G A225.00971intronic  AKR1B15rs737249790.0239617--
 134261097C G225.00966intronic  AKR1B15rs21618030.482628--
 134261302T C145.00826intronic  AKR1B15rs593260830.024361--
 134261674C T225.00953intronic  AKR1B15rs69799330.306709--
134262441T C225.00971intronic  AKR1B15rs102419980.336262--
134262747G C68.00748intronic  AKR1B15rs102298760.336462--
 134264286C T187.00942exonic  AKR1B15rs64675380.335663--
 134264546C G3.01363UTR3    AKR1B15(NM_001080538:c.*245C>G)----


Back to term 3 page 🚶

© Sophia Veselova, 2017.