Resequencing. The search for polymorphisms

Resequencing. A search for polymorphisms

Part 1

0. Make your new directory and copy your .fastq and .fasta files there

New address: /nfs/srv/databases/ngs/sophia.veselova/ Files: chr7.fastq, chr7.fasta

1. Use FASTQC to analyse .fastq file

Terminal command:

fastqc chr7.fastq

App: File > Open.. (or key 'cmd+O')
Result: →

Used commands

Command	Function
fastqc chr7.fastq	Analyse file with FASTQC
java -jar /usr/share/java/trimmomatic.jar SE -phred33 chr7.fastq chr7_1.fastq TRAILING:20 MINLEN:50	Cut off end low quality (<20) nucleotides; left only >50 length reads
hisat2-build chr7.fasta chr7_neue	Indexing of a reference sequence
hisat2 --no-spliced-alignment --no-softclip -x chr7_neue -U chr7_1.fastq -S alignment.sam	Alignment of the reference sequence
samtools view alignment.sam -b -o alignment.bam	SAM -> BAM (binary)
samtools sort alignment.bam -T alignment.txt -o samsorted.bam	Sorts alignment file; writes temporary files to PREFIX.nnnn.bam
samtools index samsorted.bam	Indexing of samsorted.bam
samtools mpileup -uf chr7.fasta samsorted.bam -o snp.bcf	Creates .bcf file (with polymorphisms)
bcftools call -cv snp.bcf -o SNP.vcf	Creates file with diefferences between reference and read in .vcf format
perl /nfs/srv/databases/annovar/convert2annovar.pl -format vcf4 snp_neue.vcf > chr7.avinput	Creates file.avinput (required by annovar)
perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out chr7.snp -build hg19 -dbtype snp138 chr7.avinput /nfs/srv/databases/annovar/humandb/	Shows which SNPs have rs
perl /nfs/srv/databases/annovar/annotate_variation.pl -out refgene -build hg19 chr7.avinput /nfs/srv/databases/annovar/humandb/	Makes refGene annotation
perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out dbsnp -build hg19 -dbtype snp138 chr7.avinput /nfs/srv/databases/annovar/humandb/	Makes dbsnp annotation
perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -dbtype 1000g2014oct_all -buildver hg19 -out 1000g chr7.avinput /nfs/srv/databases/annovar/humandb/	Makes 1000 genomes annotation
perl /nfs/srv/databases/annovar/annotate_variation.pl -regionanno -build hg19 -out GWAS -dbtype gwasCatalog chr7.avinput /nfs/srv/databases/annovar/humandb/	Makes GWAS annotation
perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out clinvar -dbtype clinvar_20150629 -buildver hg19 chr7.avinput /nfs/srv/databases/annovar/humandb/	Makes Clinvar annotation

2. Use Trimmomatic to cut off end low quality (<20) nucleotides; leave only >50 length reads

Why is it important?
The end low quality nucleotides are acquired from the longest sequencing reaction products and they have to be removed because the reaction product peaks are not well resolved and may bleed into each other.
Reads shorter than 50 nucleotides have to be removed because they may contain adapters. Adapter contamination will lead to alignment errors and increased amount of unaligned reads. Even if there are no adapters after removing these reads the quality has improved (see results, there are no more columns in the orange zone)

Terminal command:

java -jar /usr/share/java/trimmomatic.jar SE -phred33 chr7.fastq chr7_1.fastq TRAILING:20 MINLEN:50

Result (compared with an old one):

Amount of reads before/after: 3752/3650

Part 2. Use HISAT2 for mapping sequence reads

3. Mapping

hisat2 commands:

Command	Notes
hisat2-build chr7.fasta chr7_neue	Indexing of a reference sequence
hisat2 --no-spliced-alignment --no-softclip -x chr7_neue -U chr7_1.fastq -S alignment.sam	Alignment of the reference sequence --no-softclip - without soft sequence clipping (when bases in 5' and 3' are not the part of the alignment BUT they haven't been removed from the read sequence) --no-spliced-alignment - no gaps in alignment -x - 'basename' for all indexed files -U - file with reads -S - SAM output file

Output:

    3650 reads; of these:
    3650 (100.00%) were unpaired; of these:
    8 (0.22%) aligned 0 times
    3458 (94.74%) aligned exactly 1 time
    184 (5.04%) aligned >1 times
    99.78% overall alignment rate

As we can see from the output, 8 reads were not aligned, and 184 were aligned more than once.
From the total of 3650 reads 3458 were aligned exactly one time.

4. Alignment analysis

SAM commands:

Command	Notes
samtools view alignment.sam -b -o alignment.bam	SAM -> BAM (binary) -b - BAM -o - output file
samtools sort alignment.bam -T alignment.txt -o samsorted.bam	Sorts alignment file; writes temporary files to PREFIX.nnnn.bam -T - temporary file -o - output file
samtools index samsorted.bam	Indexing of samsorted.bam

Info, received from alignment.sam (description of all fields separated with tabs in the file):

1. Read name
2. Sum of all flags (example: 272)
3. Name of reference sequence where alignment occurs (chr7)
4. 1-based offset into the forward reference strand where leftmost character of the alignment occurs (example: 100488365)
5. Mapping quality (60)
6. CIGAR string representation of alignment (example: 97M)
7. Name of reference sequence where mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate. (*)
8. 1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate. (0)
9. Inferred fragment length. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if the mates did not align concordantly. (0)
10. Read sequence (reverse-complemented if aligned to the reverse strand) (example: GAAGGCACCGCGGGGGAGGGAGCTCAGCCTGAGACATGCAGAGGACCGGGAGCCCCGGGGGACGTCGGGGTGTGGTGGGGATGGGCAGAGTCTGGGGCTC)
11. ASCII-encoded read qualities (reverse-complemented if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33 (ASCII char !), similarly to a FASTQ file. (example: CCCCCCFFFFFHHHHHIJJEHDDDDDDDDDDDDD?CCCDCCDCDCBBBDBDDDBBBDDDBD@BDDDBDDDDDB;>5??A?BDB85?@B??BBDA>>ACAB@5?)
12. Alignment score (example: AS:i:-4)
13. The number of ambiguous bases in the reference covering this alignment (example: XN:i:0)
14. The number of mismatches in the alignment (example: XM:i:1)
15. The number of gap opens, for both read and reference gaps, in the alignment (example: XO:i:0)
16. The number of gap extensions, for both read and reference gaps, in the alignment (example: XG:i:0)
17. The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string (example: NM:i:1)
18. A string representation of the mismatched reference bases in the alignment (example: MD:Z:72G27)
19. Value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to aligned either concordantly or discordantly (example: YT:Z:UU)
20. The number of mapped locations for the read or the pair. (example: NH:i:1)

Part 3. SNP analysis

5. Search for SNP and indels

Commands:

    samtools mpileup -uf chr7.fasta samsorted.bam -o snp.bcf
    bcftools call -cv snp.bcf -o SNP.vcf

#	Coordinates	Type	Reference	Read	Depth	Quality
1	120965718	Insertion	taa	tAAaa	2	16.5627
2	120978918	Insertion	ATT	ATTT	8	105.467
3	134250322	Replacing	A	C	68	225.009

What does each column mean?
Coordinates - where each polymorphism took place (120965718 in #1)
Type - the type of the polymorphism (insertion in #1, #2, replacing in #3)
Reference - what WAS before (taa in #1)
Read - what is NOW (AA in #1 has inserted, C replaced A in #3)
Depth - read feature - #1, #2 have low depth
Quality - read feature - #1 has low quality

6. SNP annotation

First, all 3 indels were removed (new file snp_neue.vcf) Commands:

    perl /nfs/srv/databases/annovar/convert2annovar.pl -format vcf4 snp_neue.vcf > chr7.avinput | Result: A total of 31 locus in VCF file passed QC threshold,
                                                                                                  representing 31 SNPs (24 transitions and 7 transversions) and 0 indels/substitutions
                                                                                                  Note: transition - interchange of purines (A-G) or pyrimidines (T-C)
                                                                                                        transversion - interchage of purine for pyrimidine

    perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out chr7.snp -build hg19     | Result: amount of rs: 28
    -dbtype snp138 chr7.avinput /nfs/srv/databases/annovar/humandb/

Database	Input	Output	Notes
Refgene	perl /nfs/srv/databases/annovar/annotate_variation.pl -out refgene -build hg19 chr7.avinput /nfs/srv/databases/annovar/humandb/	refgene.exonic_variant_function refgene.log refgene.variant_function	Groups: exonic/intronic/UTR3 25 intronic SNPs 4 exonic SNPs, 2 UTR3 SNPs - (UTR3 means 3' noncoding region)
dbsnp	perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out dbsnp -build hg19 -dbtype snp138 chr7.avinput /nfs/srv/databases/annovar/humandb/	dbsnp.hg19_snp138_dropped dbsnp.hg19_snp138_filtered dbsnp.log	28 rs for 31 SNPs (so, 28 SNPs were annotated)
1000 genomes	perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -dbtype 1000g2014oct_all -buildver hg19 -out 1000g chr7.avinput /nfs/srv/databases/annovar/humandb/	1000g.hg19_ALL.sites.2014_10_dropped 1000g.hg19_ALL.sites.2014_10_filtered 1000g.log	There is frequency data for each SNP (except four of them). Minimum value: 0,0239617 Maximum value: 0,650759 Average: 0,32507983 Median: 0,336462
GWAS	perl /nfs/srv/databases/annovar/annotate_variation.pl -regionanno -build hg19 -out GWAS -dbtype gwasCatalog chr7.avinput /nfs/srv/databases/annovar/humandb/	GWAS.hg19_gwasCatalog GWAS.log	See Table below (Type 2 diabetes, Bone mineral density, Longevity and Cortical thickness). from the received data we can assume that some kind of SNPs can be associated with different kind of diseases or physiological features.
Clinvar	perl /nfs/srv/databases/annovar/annotate_variation.pl -filter -out clinvar -dbtype clinvar_20150629 -buildver hg19 chr7.avinput /nfs/srv/databases/annovar/humandb/	clinvar.hg19_clinvar_20150629_dropped clinvar.hg19_clinvar_20150629_filtered clinvar.log	.dropped is empty

Coordinate	SNP	Quality	DP	refgene	dbsnp	1000 genomes	GWAS	Clinvar
100487721	G T	37.5136	4	UTR3 ACHE(NM_001302622:c.114C>A,NM_015831:c.858C>A,NM_000665:c.114C>A,NM_001282449:c.114C>A)	rs17228616	0.0936502	-	-
100490077	G A	221.999	34	exonic ACHE	rs7636	0.105232	Type 2 diabetes	-
120965652	C T	62.0073	8	intronic WNT16	rs7782648	0.105232	-	-
120969769	G A	222.974	14	exonic WNT16	rs2908004	0.510383	Bone mineral density	-
120979089	C T	225.2	3	exonic WNT16	rs2707466	0.502995	Cortical thickness	-
134221694	G A	7.79993	2	intronic AKR1B10	rs706159	0.442891	-	-
134222091	C A	10.4247	2	intronic AKR1B10	rs138166076	-	-	-
134237239	T C	6.20226	1	intronic AKR1B15	rs73724974	0.0241613	-	-
134237294	A G	6.20226	1	intronic AKR1B15	rs73164857	0.341653	-	-
134239978	C A	29.0123	10	intronic AKR1B15	rs2113451	0.336462	-	-
134246036	G A	3.54557	1	intronic AKR1B15	-	-	-	-
134248045	A G	11.3429	1	intronic AKR1B15	rs10261532	0.504992	-	-
134250322	A C	225.009	68	intronic AKR1B15	rs4732038	0.503594	Longevity	-
134251041	C T	5.46383	1	intronic AKR1B15	rs706201	0.573482	-	-
134252691	C T	26.0177	5	intronic AKR1B15	rs59136474	0.273762	-	-
134252718	C T	78.0075	6	intronic AKR1B15	rs17775934	0.368011	-	-
134253269	C T	135.015	9	intronic AKR1B15	rs56097712	0.337859	-	-
134254029	G A	212.009	47	intronic AKR1B15	rs3792574	0.513578	-	-
134254427	G A	185.999	23	intronic AKR1B15	rs782538	0.650759	-	-
134255326	C T	7.79993	1	intronic AKR1B15	-	-	-	-
134259951	T C	87.0076	8	intronic AKR1B15	rs7788801	0.241813	-	-
134259962	T C	105.008	12	intronic AKR1B15	rs1465473	0.473243	-	-
134260106	C T	225.009	38	intronic AKR1B15	rs782539	0.0273562	-	-
134260464	G A	225.009	71	intronic AKR1B15	rs73724979	0.0239617	-	-
134261097	C G	225.009	66	intronic AKR1B15	rs2161803	0.482628	-	-
134261302	T C	145.008	26	intronic AKR1B15	rs59326083	0.024361	-	-
134261674	C T	225.009	53	intronic AKR1B15	rs6979933	0.306709	-	-
134262441	T C	225.009	71	intronic AKR1B15	rs10241998	0.336262	-	-
134262747	G C	68.0074	8	intronic AKR1B15	rs10229876	0.336462	-	-
134264286	C T	187.009	42	exonic AKR1B15	rs6467538	0.335663	-	-
134264546	C G	3.0136	3	UTR3 AKR1B15(NM_001080538:c.*245C>G)	-	-	-	-

Back to term 3 page 🚶