Transcriptome analysis and bedtools
Last update on the 4th of December, 2017Here I process RNA sequencing reads and analyze mapping with bedtools. Two samples derived from chromosome 5 are mapped side-by-side and compared. First, only one sample processed.
File | Link |
---|---|
Spreadsheet table | table.ods |
Read mapping
Quality control of sample 1 reads was done with FastQC. The quality is good and across almost all length of reads is uniform (fig. 1). However, per base sequence content and per sequence GC content are far away from uniform because of 1) read length (41—51), 2) read nature (RNA reads with putative specific content).
Reads were mapped on chromosome 5 with removed -no-spliced-alignment
option to enable splicing. Extraction of known splliced sites was done prior to mapping. All
24156 reads were presented, of these 504 (2,09%) were mapped 0 times, 17070 (70,67%) —
exactly 1 time and 6582 (27,25%) — more than one time. 97,91% reads were aligned in total.
Obtained .sam file was
transformed to .bam file, sorted and indexed. Sequencing depth was calcualted, too.
All required CLI commands are shown in fig. 2.
fastqc chr5.1.fastq hisat2_extract_splice_sites.py /P/y14/term3/block4/SNP/rnaseq_reads/gencode.v19.chr_patch_hapl_scaff.annotation.gtf > splicedsites.txt hisat2 -x ../../task11/indexed -q chr5.1.fastq -S chr5.1.sam --no-softclip --met-file hi1.txt --known-splicesite-infile splicesites.txt samtools view -b chr5.1.sam -o chr5.1.bam samtools sort -O bam chr5.1.bam -o sorted_5.1.bam -T temp samtools index sorted_5.1.bam samtools depth sorted_5.1.bam > depth.1.txt
Overall coverage distribution is presented in fig. 3. Coverages and alignments of particular genes are shown in figs. 4—7. Two regions were discovered: region 1 with apparently false alignment and region 2 with true alignment and multiple splicing events. Open figures in new tabs to see them clearly.
Analysis
Analysis was made with bedtools
package. To state genes covered
with reads, their coverage and reads that overflowed gene boundaries following
CLI commands were executed:
bedtools bamtobed -i sorted_5.1.bam > infile.bed bedtools intersect -b infile.bed -a chr5.bed -c > coverage1.bed bedtools subtract -a infile.bed -b chr5.bed > no_gene.bed
coverage1.bed
file consists of 6 columns:
chromosome, start, end, gene type, gene name, coverage, — respective
properties from chr5.bed
file.
no_gene.bed
exhibit 6 columns: chromosome, read start,
read end, read properties, length, strandness, — respective properties from
infile.bed
file.
Obtained genes are FAM172A, NPM1P27 and NPM1. First two genes are covered a little (score = 319), whereas is covered significantly (103 magnitude). The information about genes is provided in table 1.
gene | size | start | end | strand | function | exon | intron | additional information |
---|---|---|---|---|---|---|---|---|
FAM172A | 493974 | 92953431 | 93447404 | - | Protein product participate in alternative splicing, secrets extracellularly[1] | 11[2] | 10 | No |
NPM1P27 | 830 | 93018544 | 93019373 | - | Pseudogene[3] | No | No | No |
NPM1 | 23181 | 170814708 | 170837888 | + | Centrosome duplication, protein chaperoning, cell proliferation[4] | 11 | 10 | Mutations are associated with acute myeloid leukemia |
41 reads were aligned out of genes. 36 of them were aligned downstream NPM1 gene starting from 70 bases down (fig. 9). 5 reads were aligned upstream NPM1 gene (500 bases up, fig. 10).
Samples comparisons
Another biological sample reads are presented. They were piped as desribed above. Overall consistence of both samples was observed so it is unnecessary to show analogous pictures as of sample 1. But it is a concern of interest to compare coverage of both samples. To do this, the ratio of coverages was calculated and plotted along with boh coverages on genome. The results are shown in figs. 11 and 12.
bedtools training
Several tasks were done with bedtools. The working directory is task12/sample1/
.
1. bamtofastq
The task was to transform bam alignment into fastq reads.
Flag -i
denotes input bam file, -fq
denotes output fastq file.
$ bedtools bamtofastq -i sorted_5.1.bam -fq fast.fastq
2. getfasta
Retrieve fasta sequence of one covered gene. NPM1 was chosen. -fi
denotes
genome fasta file, -bed
denotes input bed file with gene intervals. The interval was passed through
pipe as string in bed format. Unix fold
utility was used to transform fasta record in good-loking view.
$ echo -e "chr5\t170814652\t170837888\t+\n" | bedtools getfasta -fi ../../task11/chr5.fasta -bed stdin | fold -w 60 > npm1.fasta
3. makewindows
Binning chromosome 5 into 1M bins. -g
denotes file with genome in bed format (fasta indexed file was good for it). The length
of chromosome is 180915260, got 181 bins.
$ head ../../task11/chr5.fasta.fai chr5 180915260 6 60 61 $ bedtools makewindows -g ../../task11/chr5.fasta.fai -w 1000000 > windows.bed $ head windows.bed chr5 0 1000000 chr5 1000000 2000000 chr5 2000000 3000000 chr5 3000000 4000000 chr5 4000000 5000000 chr5 5000000 6000000 chr5 6000000 7000000 chr5 7000000 8000000 chr5 8000000 9000000 chr5 9000000 10000000 $ wc -l windows.bed 181
4. cluster
Cluster reads from -i
flag with cluster
utility.
$ bedtools cluster -i infile.bed > cluster.bed $ head cluster.bed chr5 93018239 93018286 D00795:16:C7BV0ACXX:5:2115:11098:40724 0 + 1 chr5 93018243 93018294 D00795:16:C7BV0ACXX:5:2315:15445:45968 0 + 1 chr5 93018576 93018627 D00795:16:C7BV0ACXX:5:1115:6802:36381 60 + 2 chr5 93018615 93018666 D00795:16:C7BV0ACXX:5:2313:16747:76228 60 + 2 chr5 93018640 93018691 D00795:16:C7BV0ACXX:5:2215:7013:98402 60 + 2 chr5 93018650 93018701 D00795:16:C7BV0ACXX:5:2106:3736:77999 60 + 2 chr5 93018650 93018701 D00795:16:C7BV0ACXX:5:2108:18402:20632 60 + 2 chr5 93018650 93018701 D00795:16:C7BV0ACXX:5:2314:11224:6902 60 + 2 chr5 93018650 93018701 D00795:16:C7BV0ACXX:5:2314:9027:44428 60 + 2 chr5 93018650 93018701 D00795:16:C7BV0ACXX:5:1112:7786:31822 60 + 2
5. Random
Retrieve 1000 random intervals of 200 bases long. -l
denotes length,
-n
denotes amount, -g
denotes genome in bed format.
$ bedtools random -l 200 -n 1000 -g ../../task11/chr5.fasta.fai > random.bed $ head random.bed chr5 67147443 67147643 1 200 + chr5 110897329 110897529 2 200 + chr5 118003199 118003399 3 200 + chr5 65810329 65810529 4 200 + chr5 86387665 86387865 5 200 - chr5 77885218 77885418 6 200 - chr5 56364008 56364208 7 200 + chr5 81389630 81389830 8 200 - chr5 143847269 143847469 9 200 + chr5 1926676 1926876 10 200 -
6. 3'-UTR
The task is to get interval of 1000 bases downstream the particular gene. First, the bed file with
gene coordinates was done, then its interval was enlengthed with slop
. -i, -g
stand for input bed file and genome bed file, respectively, -r, -l
denote size of stretching at
right and left flanks, respectively, -s
stands for respect of strandness (as 3'-UTR must be defined regarding strandness of gene).
The obtained interval is gene + 3'-UTR, hence the subtraction must be done.
$ echo -e "chr5\t170814652\t170837888\t+\n" > npm1.bed $ bedtools slop -g ../../task11/chr5.fasta.fai -i npm1.bed -r 1000 -s -l 0 | bedtools subtract -a stdin -b npm1.bed > 3utr.bed $ head 3utr.bed chr5 170837888 170838888 +
7. Enlarge
Enlarge gene NPM1 by 1000 bases in both directions. The bed-formatted string was passed to stdin of slop
utility, -b
stands for both sides, -s
is redundant here.
$ echo -e "chr5\t170814652\t170837888\t+\n" | bedtools slop -g ../../task11/chr5.fasta.fai -i stdin -b 1000 -s chr5 170813652 170838888 +
8. Shift 500
The task was to shift interval to the start of chromosome by 500 bases, -s
flag stand for shift, -g
is necessary to prevent interval from overflowing genome boundaries.
$ echo -e "chr5\t170814652\t170837888\t+\n" | bedtools shift -i stdin -g ../../task11/chr5.fasta.fai -s -500 chr5 170814160 170837392 +
9. Disjoint regions
Obtain disjoint regions covered with reads from infile.bed
alignment.
$ bedtools merge -i infile.bed > merged.bed $ head merged.bed chr5 93018239 93018294 chr5 93018576 93018766 chr5 93018813 93019009 chr5 93019031 93019216 chr5 93019282 93019333 chr5 93019349 93019400 chr5 170813468 170813519 chr5 170814056 170814180 chr5 170814275 170814326 chr5 170814885 170814952
10. Coverage
Obtain coverage of reads in any format. genomecov
was used, -split
is needed
to count spliced reads distinctively, -bga
reports coverage in BedGraph format (intervals) with zero
coverages reported.
$ bedtools genomecov -bga -split -g ../../task11/chr5.fasta.fai -i infile.bed > genomecov.bed $ head genomecov.bed chr5 0 93018239 0 chr5 93018239 93018243 1 chr5 93018243 93018286 2 chr5 93018286 93018294 1 chr5 93018294 93018576 0 chr5 93018576 93018615 1 chr5 93018615 93018627 2 chr5 93018627 93018640 1 chr5 93018640 93018650 2 chr5 93018650 93018651 9
References
- Wikipedia article on FAM127A;
- UCGC page of FAM127A;
- UCGC page of NPM1P27;
- UCGC page of NPM1.