Transcriptome analysis

0. Choose one file

Chosen file: chr7.2.fastq

1. Use FASTQC to analyse .fastq file

Terminal command:

fastqc chr7.2.fastq

App: File > Open.. (or key 'cmd+O')
Result: →

Used commands

Command	Function
fastqc chr7.fastq	Analyse file with FASTQC
hisat2-build chr7.fasta chr7	Indexing of a reference sequence
hisat2 --no-softclip -x chr7 -U chr7.2.fastq -S al.sam	Alignment of the reference sequence
samtools view al.sam -b -o al.bam	SAM -> BAM (binary)
samtools sort al.bam -T al.txt -o samsa.bam	Sorts alignment file; writes temporary files to PREFIX.nnnn.bam
samtools index samsa.bam	Indexing of samsorted.bam

Use Trimmomatic to cut off end low quality (<20) nucleotides; leave only >50 length reads

I've decided to not use this command because the quality is good enough to work with it.
Amount of reads: 13701

2. Mapping

hisat2 commands:

    hisat2-build chr7.fasta chr7
    hisat2 --no-softclip -x chr7 -U chr7.2.fastq -S al.sam
    (trancriptome isn't supposed to be not spliced, so '--no-spliced-alignment' parameter has been removed from the command line)

Output:

- 8 .ht2 files

    13701 reads; of these:
    13701 (100.00%) were unpaired; of these:
    122 (0.89%) aligned 0 times
    13529 (98.74%) aligned exactly 1 time
    50 (0.36%) aligned >1 times
    99.11% overall alignment rate

As we can see from the output, 122 reads were not aligned, and 50 were aligned more than once.
From the total of 13701 reads 13529 were aligned exactly one time.

3. Alignment analysis

sam commands:

    samtools view al.sam -b -o al.bam
    samtools sort al.bam -T al.txt -o samsa.bam
    samtools index samsa.bam

Info, received from alignment.sam (description of all fields separated with tabs in the file):

1. Read name
2. Sum of all flags (example: 272)
3. Name of reference sequence where alignment occurs (chr7)
4. 1-based offset into the forward reference strand where leftmost character of the alignment occurs (example: 120628221)
5. Mapping quality (60)
6. CIGAR string representation of alignment (example: 51M)
7. Name of reference sequence where mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate. (*)
8. 1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate. (0)
9. Inferred fragment length. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if the mates did not align concordantly. (0)
10. Read sequence (reverse-complemented if aligned to the reverse strand) (example: GTCCTACAACACCACTGTTCAGCCATTTTCTCTTAAAACTCTTTTGGCTTC)
11. ASCII-encoded read qualities (reverse-complemented if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33 (ASCII char !), similarly to a FASTQ file. (example: CCCFFFFFHHHHGJJJJHIJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJB)
12. Alignment score (example: AS:i:0)
13. The number of ambiguous bases in the reference covering this alignment (example: XN:i:0)
14. The number of mismatches in the alignment (example: XM:i:0)
15. The number of gap opens, for both read and reference gaps, in the alignment (example: XO:i:0)
16. The number of gap extensions, for both read and reference gaps, in the alignment (example: XG:i:0)
17. The edit distance; that is, the minimal number of one-nucleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string (example: NM:i:0)
18. A string representation of the mismatched reference bases in the alignment (example: MD:Z:51)
19. Value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to aligned either concordantly or discordantly (example: YT:Z:UU)
20. The number of mapped locations for the read or the pair. (example: NH:i:1)

4. Read counting

Command	Function
bedtools bamtobed -i samsa.bam > samsa.bed	converts sequence alignments in BAM format into BED
bedtools intersect -a /P/y14/term3/block4/SNP/rnaseq_reads/gencode.genes.bed -b samsa.bed -u > samsadone.bed	allows one to screen for overlaps between two sets of genomic features. -a - BED file "A" -b - one BED file "B" -u - reports the fact at least one overlap was found in B.

From the samsadone.bed file we can receive an information about amounts of reads for each gene.
Download Excel data:
for CPED1: all 62 (because other genes are pseudogenes)

5. Results analysis

Gene	Notes
CPED1	Cadherin like and PC-esterase domain containing 1 coding gene. CDEP1 located in ER in cell. Its function is unknown.
HMGN1P18	Pseudogene
RNA5SP241	Pseudogene
RNU6-517P	Pseudogene

6. Additional

Task	Command
1. Receive .fq file from .bam	bedtools bamtofastq -i result.bam -fq result.fq
2. Receive .fasta file with nucleotide sequence for your reference sequence	bedtools getfasta -fi chr7.fasta -bed nu2.bed > nu2.fasta nu2.bed content: chr7 120627524 120627575 D00795:16:C7BV0ACXX:5:1311:14764:31328 60 +
3. Split chr7 into equal (1 miilion of nuclotides length) fragments	bedtools makewindows -g data.txt -w 1000000 > nu3.bed
4. Unite all reads into one cluster	bedtools cluster -i samsa.bed -s >nu4.bed
5. Pick 100 random fragments 200 nucleotides length each	bedtools random -g data.txt -n 1000 -l 200 > ran.bed

Back to term 3 page 🚶