Analysis of the transcriptomeBedtools (↓)I created a subdirectory /nfs/srv/databases/ngs/potanina.darya/pr12. This directory contains two samples of RNA sequencing reads derived from human's chromosome 9 and all prodused during making the tasks files. All comands you can find here: script.sh. Task 1I conducted quality control of reads by FastQC installed on kodomo.
The result you can see here:
Task 2Reads were mapped on chromosome 9 by HISAT2. I run hisat2 without parameter --no-spliced-alignment because gaps are not forbidden for transcriptome reads (the indexed reference sequense was taken from the previous work). Then there was made an alignment of the reference sequence and readings in SAM format.
Output (here you can see the number of unaligned reads, reads aligned once and reads aligned more than once):
Output files:
Task 3.Analisys of the alignment was made by samtools. Firstly, alignment was converted from SAM to BAM format.
Output files:
Secondly, I sort the alignment of readings with the reference and the coordinate of the beginning reference reading (*_temp.txt is a temporary file created by the program).
Output files:
Thirdly, sequence in the resulting file was indexed.
Output files:
Task 4I conducted reads' counting by bedtools. Firstly BAM files were converted to BED format.
Output files:
Then compared RNA sequencing reads with the reference sequence of human genome h19 assembly (/nfs/srv/samba/public/y14/term3/block4/SNP/rnaseq_reads/gencode.genes.bed). This program allows one to screen for overlaps between two sets of genomic features. It was run with the following parametres:
Output:
Output files:
There are gene types: [antisense, IG_C_pseudogene, IG_V_pseudogene, lincRNA, miRNA, misc_RNA, polymorphic_pseudogene, processed_transcript, protein_coding, rRNA, sense_intronic, sense_overlapping, snoRNA, snRNA, TR_V_gene, TR_V_pseudogene, pseudogene] - in the final files. The list of a few genes covered by reads you can see in table 1. Table 1
Some additional files (columns: amount of reads covered the region, gene type, gene name): I noted that two final BED files differ very little. Therefore "csv.txt" files are the same. I can't explain why it is so. A list of different lines you can find here. The result is that almost all genes are covered by the same number of reads. BedtoolsTask 1According the task I need to extract reads in FASTQ format from the file contain alignment.
Output files:
Task 2According the task I need to receive FASTA file of one of the genes covered by reads from the whole chromosome. I saved coordinates of the gene in BED file CBWD1.bed. Then I used getfasta function of Bedtools.
Output files:
Task 3According the task I need to break the chromosome into fragments of 1 million nucleotides. Informatoin about chromosome length I checked here and save it in chr9.genome. Total tenth of the chromosome = 141213431 nucleotodes. Amount of received fragments is 142.
Output files:
Task 4According the task I need to merge reads into clusters using BED file contains alignment from the previous task (pr 12). Parameter -d means the maximum distance between reads to merge them in the same clusters.
Output files (there are only 1 cluster even if use -d 100000):
Task 5According the task I need to make 1000 random fragments of 200 nucleotides from the chromosome. Informatoin about chromosome lenth I checked here and save it in chr9.genome.
Output files:
Task 7According the task I need to recieve coordinates of a gene covered by reads expanded to 1000 nucleotides in right and left. I saved coordinates of the gene in BED file CBWD1.bed. Then I used slop function of Bedtools with parameter -b = 1000 (increase the BED/GFF/VCF entry -b base pairs in each direction).
Output:
Task 8According the task I need to recieve coordinates of a gene covered by reads shifted 500 nucleotides closer to the start of the chromosome. I saved coordinates of the gene in BED file CBWD1.bed. Then I used slop function of Bedtools with parameters -l = 500 (the number of base pairs to subtract from the start coordinate) and -r = 500 (the number of base pairs to add to the end coordinate).
Output files:
Task 9According the task I need to obtain non-overlapping fragments which correspond to the regions covered by 'my' genes. I used BED file contains alignment from the previous task (pr 12) and this file (human genome assembly h19): /nfs/srv/samba/public/y14/term3/block4/SNP/rnaseq_reads/gencode.genes.bed.
Output files:
© Darya Potanina, 2017 |