Analysis of the transcriptomeBedtools (↓)
I created a subdirectory /nfs/srv/databases/ngs/potanina.darya/pr12. This directory contains two samples of RNA sequencing reads derived from human's chromosome 9 and all prodused during making the tasks files.
All comands you can find here: script.sh.
Task 1
I conducted quality control of reads by FastQC installed on kodomo.
$ fastqc chr9.1.fastq
$ fastqc chr9.2.fastq |
The result you can see here: Output files:
| Per base sequence quality |
| Figure 1. First replica. | Figure 2. Second replica. |
 |  |
We can observe high quality of reads (are entirely in the green area).
|
| Per Illumina tile sequence quality |
| Figure 3. First replica. | Figure 4. Second replica. |
 |  |
The quality of reads in the cells colored blue is the highest and lower the cells colored yellow and red. As we can see the most of reads are high quality, but there are a few poorly read ones and even one very poor in the second replica. In general, the quality of the second replica is higher than the first.
|
| Per base sequence content |
| Figure 5. First replica. | Figure 6. Second replica. |
 |  |
The average amount of A and T is greater than the average number of C and G in each position along the reads. We can also notice that the number of T in the positions along the first half of the reads is increased.
|
| Per sequence GC content |
| Figure 7. First replica. | Figure 8. Second replica. |
 |  |
The amount of GC content is much higher than expected in 40ths and 20ths positions, but it is lower then expected in 25ths and 55ths positions.
|
Task 2
Reads were mapped on chromosome 9 by HISAT2. I run hisat2 without parameter --no-spliced-alignment because gaps are not forbidden for transcriptome reads (the indexed reference sequense was taken from the previous work). Then there was made an alignment of the reference sequence and readings in SAM format.
| $ export PATH=${PATH}:/home/students/y06/anastaisha_w/hisat2-2.0.5 |
$ hisat2 -x chr9_indexed -U chr9.1.fastq --no-softclip -S chr9.1.sam
$ hisat2 -x chr9_indexed -U chr9.2.fastq --no-softclip -S chr9.2.sam |
Output (here you can see the number of unaligned reads, reads aligned once and reads aligned more than once):
1st replica:
19976 reads; of these:
19976 (100.00%) were unpaired; of these:
115 (0.58%) aligned 0 times
19680 (98.52%) aligned exactly 1 time
181 (0.91%) aligned >1 times
99.42% overall alignment rate
2nd replica:
14020 reads; of these:
14020 (100.00%) were unpaired; of these:
83 (0.59%) aligned 0 times
13811 (98.51%) aligned exactly 1 time
126 (0.90%) aligned >1 times
99.41% overall alignment rate
|
Output files:
Task 3.
Analisys of the alignment was made by samtools. Firstly, alignment was converted from SAM to BAM format.
| $ samtools view chr9.1.sam -b -o chr9.1.bam
$ samtools view chr9.2.sam -b -o chr9.2.bam |
Output files:
Secondly, I sort the alignment of readings with the reference and the coordinate of the beginning reference reading (*_temp.txt is a temporary file created by the program).
$ samtools sort chr9.1.bam -T chr9.1_temp.txt -o chr9.1_sorted.bam
$ samtools sort chr9.2.bam -T chr9.2_temp.txt -o chr9.2_sorted.bam |
Output files:
chr9.1_sorted.bam chr9.2_sorted.bam
|
Thirdly, sequence in the resulting file was indexed.
$ samtools index chr9.1_sorted.bam
$ samtools index chr9.2_sorted.bam |
Output files:
chr9.1_sorted.bam.bai chr9.2_sorted.bam.bai
|
Task 4
I conducted reads' counting by bedtools. Firstly BAM files were converted to BED format.
$ bedtools bamtobed -i chr9.1_sorted.bam > chr9.1.bed
$ bedtools bamtobed -i chr9.2_sorted.bam > chr9.2.bed |
Output files:
Then compared RNA sequencing reads with the reference sequence of human genome h19 assembly (/nfs/srv/samba/public/y14/term3/block4/SNP/rnaseq_reads/gencode.genes.bed). This program allows one to screen for overlaps between two sets of genomic features. It was run with the following parametres:
-a - the first BED file
-b - the second BED file
-u - reports the fact at least one overlap was found in B.
grep - to extract chr9 only from the whole genome
$ export PATH=${PATH}:/P/y14/term3/block4/SNP/bedtools2/bin
$ bedtools intersect -a gencode.genes.bed -b chr9.1.bed -c | grep 'chr9' > chr9.1_intersect.bed
$ bedtools intersect -a gencode.genes.bed -b chr9.2.bed -c | grep 'chr9' > chr9.2_intersect.bed |
Output:
***** WARNING: File /P/y14/term3/block4/SNP/rnaseq_reads/gencode.genes.bed has inconsistent naming convention for record:
GL877870.2 2348 2545 pseudogene CICP10
***** WARNING: File /P/y14/term3/block4/SNP/rnaseq_reads/gencode.genes.bed has inconsistent naming convention for record:
GL877870.2 2348 2545 pseudogene CICP10
***** WARNING: File /P/y14/term3/block4/SNP/rnaseq_reads/gencode.genes.bed has inconsistent naming convention for record:
GL877870.2 2348 2545 pseudogene CICP10
***** WARNING: File /P/y14/term3/block4/SNP/rnaseq_reads/gencode.genes.bed has inconsistent naming convention for record:
GL877870.2 2348 2545 pseudogene CICP10
|
Output files:
There are gene types: [antisense, IG_C_pseudogene, IG_V_pseudogene, lincRNA, miRNA, misc_RNA, polymorphic_pseudogene, processed_transcript, protein_coding, rRNA, sense_intronic, sense_overlapping, snoRNA, snRNA, TR_V_gene, TR_V_pseudogene, pseudogene] - in the final files. The list of a few genes covered by reads you can see in table 1.
Table 1
| Gene | Start | End | Amount of reads in the 1st replica | Amount of reads in the 2nd replica | Gene function |
| CBWD1 | 121041 | 188979 | 235 | 235 | COBW domain-containing protein 1 |
| DOCK8 | 214854 | 465259 | 438 | 438 | Dedicator of cytokinesis 8 - a protein involved in intracellular signalling networks. |
| SMARCA2 | 2015342 | 2193624 | 326 | 326 | Probable global transcription activator |
| SPATA6L | 4553386 | 4666674 | 248 | 248 | Spermatogenesis associated 6 like |
| MPDZ | 13105703 | 13279589 | 652 | 652 | Multiple PDZ domain protein |
Some additional files (columns: amount of reads covered the region, gene type, gene name):
The list of genes covered by reads for 1st replica: chr9.1_genes_list.csv.txt
The list of genes covered by reads for 2nd replica: chr9.2_genes_list.csv.txt
The list of protein coding genes covered by reads for 1st replica: chr9.1_prot_cod.csv.txt
The list of protein coding genes covered by reads for 1st replica: chr9.2_prot_cod.csv.txt
I noted that two final BED files differ very little. Therefore "csv.txt" files are the same. I can't explain why it is so. A list of different lines you can find here. The result is that almost all genes are covered by the same number of reads.
Bedtools
Task 1
According the task I need to extract reads in FASTQ format from the file contain alignment.
| $ bedtools bamtofastq -i chr9.1_sorted.bam -fq chr9.1.fq |
Output files:
Task 2
According the task I need to receive FASTA file of one of the genes covered by reads from the whole chromosome. I saved coordinates of the gene in BED file CBWD1.bed. Then I used getfasta function of Bedtools.
| $ bedtools getfasta -fi chr9.fasta -bed CBWD1.bed > CBWD1.fasta |
Output files:
Task 3
According the task I need to break the chromosome into fragments of 1 million nucleotides. Informatoin about chromosome length I checked here and save it in chr9.genome. Total tenth of the chromosome = 141213431 nucleotodes. Amount of received fragments is 142.
| $ bedtools makewindows -g chr9.genome -w 1000000 > chr9_1mln_fragments.bed |
Output files:
Task 4
According the task I need to merge reads into clusters using BED file contains alignment from the previous task (pr 12). Parameter -d means the maximum distance between reads to merge them in the same clusters.
| $ bedtools cluster -i chr9.1.bed -d 10000 > clusters1.bed |
Output files (there are only 1 cluster even if use -d 100000):
Task 5
According the task I need to make 1000 random fragments of 200 nucleotides from the chromosome. Informatoin about chromosome lenth I checked here and save it in chr9.genome.
| $ bedtools random -g chr9.genome -n 1000 -l 200 > chr9_random.bed |
Output files:
Task 7
According the task I need to recieve coordinates of a gene covered by reads expanded to 1000 nucleotides in right and left. I saved coordinates of the gene in BED file CBWD1.bed. Then I used slop function of Bedtools with parameter -b = 1000 (increase the BED/GFF/VCF entry -b base pairs in each direction).
| $ bedtools slop -i CBWD1.bed -g chr9.genome -b 1000
|
Output:
Task 8
According the task I need to recieve coordinates of a gene covered by reads shifted 500 nucleotides closer to the start of the chromosome. I saved coordinates of the gene in BED file CBWD1.bed. Then I used slop function of Bedtools with parameters -l = 500 (the number of base pairs to subtract from the start coordinate) and -r = 500 (the number of base pairs to add to the end coordinate).
| $ bedtools slop -i CBWD1.bed -g chr9.genome -l 500 -r -500 |
Output files:
Task 9
According the task I need to obtain non-overlapping fragments which correspond to the regions covered by 'my' genes. I used BED file contains alignment from the previous task (pr 12) and this file (human genome assembly h19): /nfs/srv/samba/public/y14/term3/block4/SNP/rnaseq_reads/gencode.genes.bed.
| $ bedtools annotate -i chr9.1.bed -files gencode.genes.bed -both > annotation1.bed |
Output files:
© Darya Potanina, 2017
|