Transcriptome analysis

Transcriptome analysis

0. Choose one file

Chosen file: chr7.2.fastq

1. Use FASTQC to analyse .fastq file

Terminal command:
fastqc chr7.2.fastq

App: File > Open.. (or key 'cmd+O')
Result:

Used commands

CommandFunction
fastqc chr7.fastq
Analyse file with FASTQC
hisat2-build chr7.fasta chr7
Indexing of a reference sequence
hisat2 --no-softclip -x chr7
-U chr7.2.fastq -S al.sam
Alignment of the reference sequence
samtools view al.sam -b -o al.bam
SAM -> BAM (binary)
samtools sort al.bam -T al.txt -o samsa.bam
Sorts alignment file; writes temporary files to PREFIX.nnnn.bam
samtools index samsa.bam
Indexing of samsorted.bam

Use Trimmomatic to cut off end low quality (<20) nucleotides; leave only >50 length reads


I've decided to not use this command because the quality is good enough to work with it.
Amount of reads: 13701

2. Mapping

hisat2 commands:
    hisat2-build chr7.fasta chr7
    hisat2 --no-softclip -x chr7 -U chr7.2.fastq -S al.sam
    (trancriptome isn't supposed to be not spliced, so '--no-spliced-alignment' parameter has been removed from the command line)

Output:

- 8 .ht2 files
    13701 reads; of these:
    13701 (100.00%) were unpaired; of these:
    122 (0.89%) aligned 0 times
    13529 (98.74%) aligned exactly 1 time
    50 (0.36%) aligned >1 times
    99.11% overall alignment rate
As we can see from the output, 122 reads were not aligned, and 50 were aligned more than once.
From the total of 13701 reads 13529 were aligned exactly one time.

3. Alignment analysis

sam commands:
    samtools view al.sam -b -o al.bam
    samtools sort al.bam -T al.txt -o samsa.bam
    samtools index samsa.bam

Info, received from alignment.sam (description of all fields separated with tabs in the file):

1. Read name
2. Sum of all flags (example: 272)
3. Name of reference sequence where alignment occurs (chr7)
4. 1-based offset into the forward reference strand where leftmost character of the alignment occurs (example: 120628221)
5. Mapping quality (60)
6. CIGAR string representation of alignment (example: 51M)
7. Name of reference sequence where mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate. (*)
8. 1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate. (0)
9. Inferred fragment length. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if the mates did not align concordantly. (0)
10. Read sequence (reverse-complemented if aligned to the reverse strand) (example: GTCCTACAACACCACTGTTCAGCCATTTTCTCTTAAAACTCTTTTGGCTTC)
11. ASCII-encoded read qualities (reverse-complemented if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33 (ASCII char !), similarly to a FASTQ file. (example: CCCFFFFFHHHHGJJJJHIJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJB)
12. Alignment score (example: AS:i:0)
13. The number of ambiguous bases in the reference covering this alignment (example: XN:i:0)
14. The number of mismatches in the alignment (example: XM:i:0)
15. The number of gap opens, for both read and reference gaps, in the alignment (example: XO:i:0)
16. The number of gap extensions, for both read and reference gaps, in the alignment (example: XG:i:0)
17. The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string (example: NM:i:0)
18. A string representation of the mismatched reference bases in the alignment (example: MD:Z:51)
19. Value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to aligned either concordantly or discordantly (example: YT:Z:UU)
20. The number of mapped locations for the read or the pair. (example: NH:i:1)

4. Read counting

CommandFunction
bedtools bamtobed -i samsa.bam > samsa.bed
converts sequence alignments in BAM format into BED
bedtools intersect -a /P/y14/term3/block4/SNP/rnaseq_reads/gencode.genes.bed -b samsa.bed -u > samsadone.bed
allows one to screen for overlaps between two sets of genomic features.
-a - BED file "A"
-b - one BED file "B"
-u - reports the fact at least one overlap was found in B.

From the samsadone.bed file we can receive an information about amounts of reads for each gene.
Download Excel data:
for CPED1: all 62 (because other genes are pseudogenes)

5. Results analysis

GeneNotes
CPED1Cadherin like and PC-esterase domain containing 1 coding gene. CDEP1 located in ER in cell. Its function is unknown.
HMGN1P18Pseudogene
RNA5SP241Pseudogene
RNU6-517PPseudogene

6. Additional

TaskCommand
1. Receive .fq file from .bam
bedtools bamtofastq -i result.bam -fq result.fq
2. Receive .fasta file with nucleotide sequence for your reference sequence
bedtools getfasta -fi chr7.fasta -bed nu2.bed > nu2.fasta

nu2.bed content:
chr7    120627524       120627575       D00795:16:C7BV0ACXX:5:1311:14764:31328  60      +
3. Split chr7 into equal (1 miilion of nuclotides length) fragments
bedtools makewindows -g data.txt -w 1000000 > nu3.bed
4. Unite all reads into one cluster
bedtools cluster -i samsa.bed -s >nu4.bed
5. Pick 100 random fragments 200 nucleotides length each
bedtools random -g data.txt -n 1000 -l 200 > ran.bed


Back to term 3 page 🚶

© Sophia Veselova, 2017.