and its application in bioinformatics
Lecture 13
Anna Valyaeva
November 29, 2024
Which of the following is the name of a gene?
Some additional data (usually obtained from an external source) that supplement and explain your experimental data.
ENSG00000075624, ENTREZ: 60
ACTB
actin beta
protein-coding
ENST00000674681, …
P60709
chr7
These services and databases can also be accessed through R packages from Bioconductor.
Each new R release is accompanied by a new Bioconductor release with new package versions, so a given version of R corresponds to a specific version of each Bioconductor package. The correspondence between versions can be checked on the Bioconductor website.
Bioconductor packages are best installed with the dedicated package manager, BiocManager.
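A minimal sketch of a typical BiocManager session, assuming an internet connection. The package org.Hs.eg.db and the search pattern "Mmusculus" are used here only as examples.

```r
# Install BiocManager itself from CRAN if it is missing
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Bioconductor release that matches the current R installation
bioc_version <- BiocManager::version()
bioc_version

# Install a Bioconductor package (here an annotation package)
BiocManager::install("org.Hs.eg.db")

# Check whether the installed packages are consistent with this release
BiocManager::valid()

# Search the available packages by a name pattern
BiocManager::available("Mmusculus")
```

BiocManager::valid() reports packages that are out of date or too new for the current release and suggests an install command that would bring them back in sync.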
[1] '3.15'
* sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=Russian_Russia.utf8 LC_CTYPE=Russian_Russia.utf8
[3] LC_MONETARY=Russian_Russia.utf8 LC_NUMERIC=C
[5] LC_TIME=Russian_Russia.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] biomaRt_2.52.0 org.Hs.eg.db_3.15.0 AnnotationDbi_1.58.0
[4] IRanges_2.30.1 S4Vectors_0.34.0 Biobase_2.56.0
[7] BiocGenerics_0.42.0 lubridate_1.9.2 forcats_1.0.0
[10] stringr_1.5.0 dplyr_1.1.1 purrr_1.0.1
[13] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[16] ggplot2_3.4.1 tidyverse_2.0.0 dbplyr_2.3.4
loaded via a namespace (and not attached):
[1] httr_1.4.4 bit64_4.0.5 jsonlite_1.8.4
[4] carData_3.0-5 BiocManager_1.30.18 BiocFileCache_2.4.0
[7] blob_1.2.3 GenomeInfoDbData_1.2.8 yaml_2.3.5
[10] progress_1.2.2 backports_1.4.1 pillar_1.9.0
[13] RSQLite_2.2.18 glue_1.6.2 digest_0.6.29
[16] XVector_0.36.0 ggsignif_0.6.4 colorspace_2.0-3
[19] htmltools_0.5.8.1 XML_3.99-0.11 pkgconfig_2.0.3
[22] broom_1.0.4 zlibbioc_1.42.0 scales_1.2.1
[25] tzdb_0.4.0 timechange_0.2.0 KEGGREST_1.36.3
[28] car_3.1-1 generics_0.1.3 ellipsis_0.3.2
[31] ggpubr_0.6.0 cachem_1.0.6 withr_2.5.0
[34] cli_3.4.1 magrittr_2.0.3 crayon_1.5.2
[37] memoise_2.0.1 evaluate_0.17 fansi_1.0.3
[40] rstatix_0.7.2 xml2_1.3.6 tools_4.2.1
[43] prettyunits_1.1.1 hms_1.1.2 lifecycle_1.0.3
[46] munsell_0.5.0 Biostrings_2.64.1 compiler_4.2.1
[49] GenomeInfoDb_1.32.4 rlang_1.1.0 grid_4.2.1
[52] RCurl_1.98-1.9 rstudioapi_0.14 rappdirs_0.3.3
[55] htmlwidgets_1.6.4 bitops_1.0-7 rmarkdown_2.17
[58] gtable_0.3.1 abind_1.4-5 DBI_1.1.3
[61] curl_4.3.3 R6_2.5.1 knitr_1.40
[64] fastmap_1.1.1 bit_4.0.4 utf8_1.2.2
[67] filelock_1.0.2 stringi_1.7.8 Rcpp_1.0.10
[70] vctrs_0.6.2 png_0.1-7 tidyselect_1.2.0
[73] xfun_0.40
Bioconductor version '3.15'
* 304 packages out-of-date
* 5 packages too new
create a valid installation with
BiocManager::install(c(
"abind", "amap", "ape", "aplot", "arrow", "askpass", "backports", "bbmle",
"bdsmatrix", "BH", "bigD", "BiocManager", "BiocParallel", "bit", "bit64",
"bitops", "blob", "bookdown", "brew", "brio", "broom", "broom.helpers",
"bslib", "cachem", "callr", "car", "caTools", "checkmate", "chromote",
"circlize", "classInt", "cli", "clock", "clue", "clusterProfiler", "coda",
"colorspace", "commonmark", "corrplot", "cowplot", "cpp11", "crayon",
"credentials", "crosstalk", "crul", "curl", "data.table", "datasauRus",
"datawizard", "DBI", "dbplyr", "deldir", "dendextend", "desc", "DescTools",
"devEMF", "digest", "distributional", "DOSE", "dotCall64", "dplyr",
"dqrng", "DT", "dtplyr", "e1071", "ellipse", "emdbook", "emmeans",
"estimability", "eulerr", "evaluate", "Exact", "expm", "extrafont",
"FactoMineR", "fansi", "farver", "fastDummies", "fastmap", "fastmatch",
"filelock", "fitdistrplus", "flexdashboard", "flextable", "flipbookr",
"FNN", "fontawesome", "formatR", "formatters", "fs", "future",
"future.apply", "gargle", "gdtools", "GenSA", "geomtextpath", "gert",
"gganimate", "ggdendro", "ggdist", "ggforce", "ggfortify", "ggfun",
"ggh4x", "ggnewscale", "ggplot2", "ggplotify", "ggraph", "ggrepel",
"ggridges", "ggsci", "ggupset", "ggVennDiagram", "gh", "globals", "glue",
"googledrive", "googlesheets4", "GOSemSim", "gplots", "gprofiler2",
"graphlayouts", "gt", "gtable", "gtExtras", "gtsummary", "hardhat",
"haven", "highr", "Hmisc", "hms", "htmlTable", "httpuv", "httr", "igraph",
"insight", "interp", "ipred", "isoband", "jpeg", "jsonlite", "knitr",
"labeling", "labelled", "later", "leaflet", "leaflet.providers", "leaps",
"learnr", "lifecycle", "listenv", "lme4", "lmom", "locfit", "lubridate",
"magick", "maps", "markdown", "MatrixModels", "matrixStats", "minqa",
"multcompView", "munsell", "mvtnorm", "nloptr", "officer", "openssl",
"openxlsx", "packrat", "pagedown", "paletteer", "parallelly", "patchwork",
"pbkrtest", "pdftools", "pkgbuild", "pkgdown", "pkgload", "plotly", "plyr",
"png", "polyclip", "polylabelr", "prettyunits", "prismatic", "processx",
"prodlim", "profvis", "progress", "progressr", "proj4", "promises", "ps",
"psych", "purrr", "qpdf", "quantreg", "quarto", "R.oo", "R.utils", "ragg",
"randomForest", "RANN", "raster", "rbibutils", "Rcpp", "RcppAnnoy",
"RcppArmadillo", "RcppEigen", "RcppHNSW", "RcppNumerical", "RCurl",
"Rdpack", "reactR", "readr", "readxl", "recipes", "rematch", "remotes",
"renv", "reprex", "reticulate", "rio", "rjson", "rlang", "rmarkdown",
"rootSolve", "roxygen2", "rprojroot", "rsconnect", "RSpectra", "RSQLite",
"rstudioapi", "rtables", "Rtsne", "Rttf2pt1", "rvest", "rvg", "s2",
"scales", "scatterpie", "scatterplot3d", "servr", "Seurat", "SeuratObject",
"sf", "shadowtext", "shape", "shiny", "shinyWidgets", "showtext", "sjmisc",
"sourcetools", "sp", "spam", "SparseM", "spatstat.data",
"spatstat.explore", "spatstat.geom", "spatstat.random", "spatstat.sparse",
"spatstat.utils", "stringi", "stringr", "survminer", "sys", "sysfonts",
"systemfonts", "tern", "testthat", "textshaping", "tidygraph", "tidyr",
"tidyselect", "tidytree", "timechange", "timeDate", "tinytex", "tvthemes",
"tweenr", "units", "usethis", "utf8", "uuid", "uwot", "V8", "vcd", "vctrs",
"vegan", "vipor", "viridis", "viridisLite", "vroom", "waldo", "webr",
"webshot2", "websocket", "whisker", "withr", "wk", "xaringan",
"xaringanExtra", "xfun", "XML", "xopen", "yaml", "yulab.utils", "zip", "zoo"
), update = TRUE, ask = FALSE)
more details: BiocManager::valid()$too_new, BiocManager::valid()$out_of_date
[1] "BSgenome.Mmusculus.UCSC.mm10" "BSgenome.Mmusculus.UCSC.mm10.masked"
[3] "BSgenome.Mmusculus.UCSC.mm39" "BSgenome.Mmusculus.UCSC.mm8"
[5] "BSgenome.Mmusculus.UCSC.mm8.masked" "BSgenome.Mmusculus.UCSC.mm9"
[7] "BSgenome.Mmusculus.UCSC.mm9.masked" "EnsDb.Mmusculus.v75"
[9] "EnsDb.Mmusculus.v79" "PWMEnrich.Mmusculus.background"
[11] "TxDb.Mmusculus.UCSC.mm10.ensGene" "TxDb.Mmusculus.UCSC.mm10.knownGene"
[13] "TxDb.Mmusculus.UCSC.mm39.refGene" "TxDb.Mmusculus.UCSC.mm9.knownGene"
Working with Bioconductor packages relies on S4 class objects (object-oriented programming).
What matters for us is that S4 objects have a class name, slots that store the data, and methods, that is, functions for working with objects of that class. Class inheritance can also be defined.
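The following toy example illustrates these ideas in base R. The class names Gene and CodingGene and the generic describe() are hypothetical and are not part of any Bioconductor package.

```r
library(methods)

# A class has a name and typed slots that hold the data
setClass("Gene", slots = c(symbol = "character", chrom = "character"))

# Inheritance: CodingGene extends Gene with one more slot
setClass("CodingGene", contains = "Gene", slots = c(protein = "character"))

# A generic function and a method defined for the parent class
setGeneric("describe", function(x) standardGeneric("describe"))
setMethod("describe", "Gene", function(x) {
  paste0(x@symbol, " on ", x@chrom)
})

actb <- new("CodingGene", symbol = "ACTB", chrom = "chr7", protein = "P60709")

class(actb)        # class name
slotNames(actb)    # slots of the object
actb@protein       # direct access to a slot
describe(actb)     # method inherited from Gene
is(actb, "Gene")   # checks inheritance
```

With Bioconductor objects it is usually better to use accessor functions provided by the package than to read slots directly with @.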
[1] "TxDb"
attr(,"package")
[1] "GenomicFeatures"
Reference Class "TxDb":
Class fields:
Name: conn packageName user2seqlevels0 user_seqlevels
Class: SQLiteConnection character integer character
Name: user_genome isActiveSeq
Class: character logical
Class Methods:
"import", ".objectParent", "usingMethods", "show", "getClass", "untrace",
"export", "initialize", ".objectPackage", "callSuper", "copy",
"initFields", "getRefClass", "trace", "field", "finalize"
Reference Superclasses:
"AnnotationDb", "envRefClass"
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMAN_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| EGSOURCEDATE: 2022-Mar17
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
| GOSOURCEDATE: 2022-03-10
| GOEGSOURCEDATE: 2022-Mar17
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL:
| GPSOURCEDATE: 2022-Nov23
| ENSOURCEDATE: 2021-Dec21
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Fri Apr 1 14:42:16 2022
The AnnotationDbi package provides methods for working with data from OrgDb, TxDb, EnsDb, and GO.db.
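keys() returns the Entrez identifiers available in the annotation.
keytypes() and columns() return the identifier types available for conversion.
select()
Given a vector of identifiers, you can query the org.Hs.eg.db annotation with the function
select(annotationDb, keys, columns, keytype)
, where annotationDb is the annotation object, keys is the vector of identifiers, columns lists the fields to retrieve, and keytype is the type of the supplied identifiers.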
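A short sketch of these functions in use, assuming org.Hs.eg.db is installed. The three Ensembl IDs are the ones that appear elsewhere in this lecture (ACTB, NEAT1, MALAT1).

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

keytypes(org.Hs.eg.db)     # identifier types that can be used as keys
columns(org.Hs.eg.db)      # fields that can be retrieved
head(keys(org.Hs.eg.db))   # primary keys: Entrez gene IDs

ens_genes <- c("ENSG00000075624", "ENSG00000245532", "ENSG00000251562")

# AnnotationDbi::select is called explicitly because dplyr also exports select()
res <- AnnotationDbi::select(
  org.Hs.eg.db,
  keys    = ens_genes,
  columns = c("SYMBOL", "GENENAME"),
  keytype = "ENSEMBL"
)
res
```

The result is a data frame with one row per match, so a key that maps to several values appears on several rows.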
Install and load the org.Hs.eg.db package.
Using it and the select() function from the AnnotationDbi package, convert the ENSEMBL gene identifiers to gene symbols (SYMBOL), and also retrieve the full names of these genes (GENENAME) and their disease associations from the OMIM database.
mapIds()
Works like select(), but with a single annotation field (column instead of columns), and its multiVals parameter lets you specify what to do when one gene corresponds to several annotation values.
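A brief sketch of how multiVals changes the result, assuming org.Hs.eg.db is installed. The Ensembl IDs are the example genes used elsewhere in this lecture.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

ens_genes <- c("ENSG00000075624", "ENSG00000245532", "ENSG00000251562")

# Keep only the first match per key: the result is a named character vector
symbols <- mapIds(
  org.Hs.eg.db,
  keys      = ens_genes,
  column    = "SYMBOL",
  keytype   = "ENSEMBL",
  multiVals = "first"
)
symbols

# Keep every match per key: the result is a named list
go_terms <- mapIds(
  org.Hs.eg.db,
  keys      = ens_genes,
  column    = "GO",
  keytype   = "ENSEMBL",
  multiVals = "list"
)
lengths(go_terms)
```

Other accepted values of multiVals include "filter" (drop keys with several matches) and "asNA" (return NA for them).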
The TxDb and EnsDb packages contain gene models: the coordinates of genes, transcripts, and exons in the genome.
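# GENEID in this TxDb holds Entrez IDs, so convert the Ensembl IDs first
entrez_genes <- mapIds(org.Hs.eg.db, ens_genes, "ENTREZID", "ENSEMBL")
# Retrieve the transcripts of these genes together with their coordinates
select(
  txdb,
  keys = entrez_genes,
  columns = c("TXNAME", "TXCHROM", "TXSTART", "TXEND", "TXSTRAND"),
  keytype = "GENEID") %>%
  head()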
GENEID TXNAME TXCHROM TXSTRAND TXSTART TXEND
1 80221 ENST00000510410.5 chr17 + 50426158 50474670
2 80221 ENST00000504945.1 chr17 + 50426218 50461048
3 80221 ENST00000503408.5 chr17 + 50426218 50461265
4 80221 ENST00000506582.5 chr17 + 50426218 50462426
5 80221 ENST00000504392.5 chr17 + 50426218 50474670
6 80221 ENST00000300441.9 chr17 + 50426218 50474837
AnnotationHub gives access to a larger number of resources and database versions. It makes working with non-model organisms easier. It improves reproducibility.
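A minimal sketch of opening the hub and looking at what it contains, assuming the AnnotationHub package is installed and an internet connection is available on first use.

```r
library(AnnotationHub)

# Connect to the hub; its metadata is cached locally
ah <- AnnotationHub()
ah

# The snapshot date pins the set of records, which helps reproducibility
snapshotDate(ah)

# Data providers, species, and object classes represented in the hub
unique(ah$dataprovider)
head(unique(ah$species))
unique(ah$rdataclass)
```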
AnnotationHub with 67229 records
# snapshotDate(): 2022-04-25
# $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/g...
# $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus,...
# $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, ChainFile...
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH5012"]]'
title
AH5012 | Chromosome Band
AH5013 | STS Markers
AH5014 | FISH Clones
AH5015 | Recomb Rate
AH5016 | ENCODE Pilot
... ...
AH107043 | Zonotrichia_albicollis.Zonotrichia_albicollis-1.0.1.ncrna.2bit
AH107044 | Zosterops_lateralis_melanops.ASM128173v1.cdna.all.2bit
AH107045 | Zosterops_lateralis_melanops.ASM128173v1.dna_rm.toplevel.2bit
AH107046 | Zosterops_lateralis_melanops.ASM128173v1.dna_sm.toplevel.2bit
AH107047 | Zosterops_lateralis_melanops.ASM128173v1.ncrna.2bit
[1] "UCSC"
[2] "Ensembl"
[3] "RefNet"
[4] "Inparanoid8"
[5] "NHLBI"
[6] "ChEA"
[7] "Pazar"
[8] "NIH Pathway Interaction Database"
[9] "Haemcode"
[10] "BroadInstitute"
[11] "PRIDE"
[12] "Gencode"
[13] "CRIBI"
[14] "Genoscope"
[15] "MISO, VAST-TOOLS, UCSC"
[16] "UWashington"
[17] "Stanford"
[18] "dbSNP"
[19] "BioMart"
[20] "GeneOntology"
[21] "KEGG"
[22] "URGI"
[23] "EMBL-EBI"
[24] "MicrosporidiaDB"
[25] "FungiDB"
[26] "TriTrypDB"
[27] "ToxoDB"
[28] "AmoebaDB"
[29] "PlasmoDB"
[30] "PiroplasmaDB"
[31] "CryptoDB"
[32] "TrichDB"
[33] "GiardiaDB"
[34] "The Gene Ontology Consortium"
[35] "ENCODE Project"
[36] "SchistoDB"
[37] "NCBI/UniProt"
[38] "GENCODE"
[39] "http://www.pantherdb.org"
[40] "RMBase v2.0"
[41] "snoRNAdb"
[42] "tRNAdb"
[43] "NCBI"
[44] "DrugAge, DrugBank, Broad Institute"
[45] "DrugAge"
[46] "DrugBank"
[47] "Broad Institute"
[48] "HMDB, EMBL-EBI, EPA"
[49] "STRING"
[50] "OMA"
[51] "OrthoDB"
[52] "PathBank"
[53] "EBI/EMBL"
[54] "NCBI,DBCLS"
[55] "FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWISSPROT,TREMBL,ENSEMBL,CELLPHONEDB,BADERLAB,SINGLECELLSIGNALR,HOMOLOGENE"
[56] "WikiPathways"
[57] "UCSC Jaspar"
[58] "VAST-TOOLS"
[59] "pyGenomeTracks "
[60] "NA"
[61] "UoE"
[62] "mitra.stanford.edu/kundaje/akundaje/release/blacklists/"
[63] "ENCODE"
[64] "TargetScan,miRTarBase,USCS,ENSEMBL"
[65] "TargetScan"
[66] "QuickGO"
[67] "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/"
To search AnnotationHub for the database you need, use the query() function.
For example, suppose we want to use the Ensembl database of human genes, release 105. Bioconductor has EnsDb annotation packages, but not for the release we need. However, it can be found in AnnotationHub:
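A sketch of such a search, assuming the hub is reachable. The search terms are matched against the record metadata, and a record has to match all of them.

```r
library(AnnotationHub)

ah <- AnnotationHub()

# Every term in the vector must match somewhere in the record metadata
hits <- query(ah, c("EnsDb", "Homo sapiens", "105"))
hits

# Individual metadata fields of the matching records
hits$title
hits$genome
```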
AnnotationHub with 1 record
# snapshotDate(): 2022-04-25
# names(): AH98047
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2021-10-20
# $title: Ensembl 105 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("105", "Annotation", "AnnotationHubSoftware", "Coverage",
# "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
# "Transcript")
# retrieve record with 'object[["AH98047"]]'
Download the annotation we need and save it locally:
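A record is retrieved by its ID with double brackets. The sketch below assumes that the ensembldb package is installed, since it is needed to load EnsDb objects, and uses the record ID AH98047 reported in the query output above.

```r
library(AnnotationHub)

ah <- AnnotationHub()

# The first call downloads the file; later calls load it from the local cache
ensdb <- ah[["AH98047"]]
ensdb
```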
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.7
|Creation time: Sat Dec 18 14:48:15 2021
|ensembl_version: 105
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
| No. of genes: 69329.
| No. of transcripts: 268255.
|Protein data available.
The downloaded annotation can be handled with functions from the AnnotationDbi package:
[1] "ENTREZID" "EXONID" "GENEBIOTYPE"
[4] "GENEID" "GENENAME" "PROTDOMID"
[7] "PROTEINDOMAINID" "PROTEINDOMAINSOURCE" "PROTEINID"
[10] "SEQNAME" "SEQSTRAND" "SYMBOL"
[13] "TXBIOTYPE" "TXID" "TXNAME"
[16] "UNIPROTID"
In Ensembl, the primary identifier is the ENSEMBL ID (GENEID).
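As an illustration, the sketch below queries the EnsDb object by GENEID. It assumes the Ensembl 105 record AH98047 from this lecture and uses the NEAT1 gene ID that appears in its output.

```r
library(AnnotationHub)
library(ensembldb)

ah <- AnnotationHub()
ensdb <- ah[["AH98047"]]   # Ensembl 105 EnsDb for Homo sapiens

# Keys of type GENEID are Ensembl gene IDs
head(keys(ensdb, keytype = "GENEID"))

neat1 <- select(
  ensdb,
  keys    = "ENSG00000245532",
  keytype = "GENEID",
  columns = c("SYMBOL", "GENEBIOTYPE", "SEQNAME")
)
neat1
```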
The genes(), transcripts(), and exons() functions from the GenomicFeatures package return the coordinates of these features in the genome as a GRanges object.
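A sketch of these calls on the Ensembl 105 EnsDb object, assuming the record AH98047 from this lecture. The filter on the gene name NEAT1 is an example; ensembldb accepts filters written as a formula.

```r
library(AnnotationHub)
library(ensembldb)

ah <- AnnotationHub()
ensdb <- ah[["AH98047"]]

# Coordinates of all genes as a GRanges object
all_genes <- genes(ensdb)
all_genes

# Transcripts of a single gene, selected with a filter expression
neat1_tx <- transcripts(ensdb, filter = ~ gene_name == "NEAT1")
neat1_tx

# Exons of the same gene
neat1_exons <- exons(ensdb, filter = ~ gene_name == "NEAT1")
neat1_exons
```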
GRanges object with 69329 ranges and 9 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000223972 1 11869-14409 + | ENSG00000223972
ENSG00000227232 1 14404-29570 - | ENSG00000227232
ENSG00000278267 1 17369-17436 - | ENSG00000278267
ENSG00000243485 1 29554-31109 + | ENSG00000243485
ENSG00000284332 1 30366-30503 + | ENSG00000284332
... ... ... ... . ...
ENSG00000224240 Y 26549425-26549743 + | ENSG00000224240
ENSG00000227629 Y 26586642-26591601 - | ENSG00000227629
ENSG00000237917 Y 26594851-26634652 - | ENSG00000237917
ENSG00000231514 Y 26626520-26627159 - | ENSG00000231514
ENSG00000235857 Y 56855244-56855488 + | ENSG00000235857
gene_name gene_biotype seq_coord_system
<character> <character> <character>
ENSG00000223972 DDX11L1 transcribed_unproces.. chromosome
ENSG00000227232 WASH7P unprocessed_pseudogene chromosome
ENSG00000278267 MIR6859-1 miRNA chromosome
ENSG00000243485 MIR1302-2HG lncRNA chromosome
ENSG00000284332 MIR1302-2 miRNA chromosome
... ... ... ...
ENSG00000224240 CYCSP49 processed_pseudogene chromosome
ENSG00000227629 SLC25A15P1 unprocessed_pseudogene chromosome
ENSG00000237917 PARP4P1 unprocessed_pseudogene chromosome
ENSG00000231514 CCNQP2 processed_pseudogene chromosome
ENSG00000235857 CTBP2P1 processed_pseudogene chromosome
description gene_id_version canonical_transcript
<character> <character> <character>
ENSG00000223972 DEAD/H-box helicase .. ENSG00000223972.5 ENST00000450305
ENSG00000227232 WASP family homolog .. ENSG00000227232.5 ENST00000488147
ENSG00000278267 microRNA 6859-1 [Sou.. ENSG00000278267.1 ENST00000619216
ENSG00000243485 MIR1302-2 host gene .. ENSG00000243485.5 ENST00000473358
ENSG00000284332 microRNA 1302-2 [Sou.. ENSG00000284332.1 ENST00000607096
... ... ... ...
ENSG00000224240 CYCS pseudogene 49 [.. ENSG00000224240.1 ENST00000420810
ENSG00000227629 solute carrier famil.. ENSG00000227629.1 ENST00000456738
ENSG00000237917 poly(ADP-ribose) pol.. ENSG00000237917.1 ENST00000435945
ENSG00000231514 CCNQ pseudogene 2 [S.. ENSG00000231514.1 ENST00000435741
ENSG00000235857 CTBP2 pseudogene 1 [.. ENSG00000235857.1 ENST00000431853
symbol entrezid
<character> <list>
ENSG00000223972 DDX11L1 102725121,100287596,100287102,...
ENSG00000227232 WASH7P <NA>
ENSG00000278267 MIR6859-1 102466751
ENSG00000243485 MIR1302-2HG <NA>
ENSG00000284332 MIR1302-2 100302278
... ... ...
ENSG00000224240 CYCSP49 <NA>
ENSG00000227629 SLC25A15P1 <NA>
ENSG00000237917 PARP4P1 <NA>
ENSG00000231514 CCNQP2 <NA>
ENSG00000235857 CTBP2P1 <NA>
-------
seqinfo: 456 sequences (1 circular) from GRCh38 genome
GRanges object with 15 ranges and 12 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <character>
ENST00000499732 11 65422774-65426457 + | ENST00000499732
ENST00000687132 11 65422797-65426532 + | ENST00000687132
ENST00000501122 11 65422798-65445540 + | ENST00000501122
ENST00000685861 11 65422798-65426529 + | ENST00000685861
ENST00000601801 11 65422800-65426405 + | ENST00000601801
... ... ... ... . ...
ENST00000693290 11 65425414-65426529 + | ENST00000693290
ENST00000616315 11 65425551-65426385 + | ENST00000616315
ENST00000687943 11 65431820-65433023 + | ENST00000687943
ENST00000691530 11 65440182-65440864 + | ENST00000691530
ENST00000693747 11 65440182-65440864 + | ENST00000693747
tx_biotype tx_cds_seq_start tx_cds_seq_end gene_id
<character> <integer> <integer> <character>
ENST00000499732 lncRNA <NA> <NA> ENSG00000245532
ENST00000687132 lncRNA <NA> <NA> ENSG00000245532
ENST00000501122 lncRNA <NA> <NA> ENSG00000245532
ENST00000685861 lncRNA <NA> <NA> ENSG00000245532
ENST00000601801 lncRNA <NA> <NA> ENSG00000245532
... ... ... ... ...
ENST00000693290 lncRNA <NA> <NA> ENSG00000245532
ENST00000616315 lncRNA <NA> <NA> ENSG00000245532
ENST00000687943 lncRNA <NA> <NA> ENSG00000245532
ENST00000691530 lncRNA <NA> <NA> ENSG00000245532
ENST00000693747 lncRNA <NA> <NA> ENSG00000245532
tx_support_level tx_id_version gc_content
<integer> <character> <numeric>
ENST00000499732 2 ENST00000499732.3 47.6024
ENST00000687132 <NA> ENST00000687132.1 48.2334
ENST00000501122 <NA> ENST00000501122.2 44.0531
ENST00000685861 <NA> ENST00000685861.1 48.2315
ENST00000601801 4 ENST00000601801.3 48.4393
... ... ... ...
ENST00000693290 <NA> ENST00000693290.1 36.6487
ENST00000616315 2 ENST00000616315.2 36.5234
ENST00000687943 <NA> ENST00000687943.1 33.2226
ENST00000691530 <NA> ENST00000691530.1 45.0952
ENST00000693747 <NA> ENST00000693747.1 45.0952
tx_external_name tx_is_canonical tx_name symbol
<character> <integer> <character> <character>
ENST00000499732 NEAT1-201 0 ENST00000499732 NEAT1
ENST00000687132 NEAT1-211 0 ENST00000687132 NEAT1
ENST00000501122 NEAT1-202 1 ENST00000501122 NEAT1
ENST00000685861 NEAT1-210 0 ENST00000685861 NEAT1
ENST00000601801 NEAT1-203 0 ENST00000601801 NEAT1
... ... ... ... ...
ENST00000693290 NEAT1-214 0 ENST00000693290 NEAT1
ENST00000616315 NEAT1-205 0 ENST00000616315 NEAT1
ENST00000687943 NEAT1-212 0 ENST00000687943 NEAT1
ENST00000691530 NEAT1-213 0 ENST00000691530 NEAT1
ENST00000693747 NEAT1-215 0 ENST00000693747 NEAT1
-------
seqinfo: 1 sequence from GRCh38 genome
AnnotationHub with 1830 records
# snapshotDate(): 2022-04-25
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Escherichia coli, greater Indian_fruit_bat, Zootoca vivipara, Zo...
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH100399"]]'
title
AH100399 | org.Ag.eg.db.sqlite
AH100400 | org.At.tair.db.sqlite
AH100401 | org.Bt.eg.db.sqlite
AH100402 | org.Cf.eg.db.sqlite
AH100403 | org.Gg.eg.db.sqlite
... ...
AH102596 | org.Lobosporangium_transversale.eg.sqlite
AH102597 | org.Sulfolobus_acidocaldarius.eg.sqlite
AH102598 | org.Penicillium_rugulosum.eg.sqlite
AH102599 | org.Talaromyces_rugulosus.eg.sqlite
AH102600 | org.Metallosphaera_sedula.eg.sqlite
Suppose we are still interested in penguins…
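A sketch of finding and using an OrgDb for a non-model organism. Record IDs can differ between hub snapshots, so the first matching record is taken from the query result instead of hard-coding an ID.

```r
library(AnnotationHub)
library(AnnotationDbi)

ah <- AnnotationHub()

# OrgDb records whose metadata mentions the penguin genus Pygoscelis
penguin_hits <- query(ah, c("OrgDb", "Pygoscelis"))
penguin_hits

# Retrieve the first matching record
penguin_db <- ah[[names(penguin_hits)[1]]]

# Query it like any other OrgDb object
penguin_genes <- AnnotationDbi::select(
  penguin_db,
  keys    = head(keys(penguin_db), 5),
  columns = c("SYMBOL", "GENENAME"),
  keytype = "ENTREZID"
)
penguin_genes
```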
AnnotationHub with 2 records
# snapshotDate(): 2022-04-25
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Pygoscelis adeliae, Pygoscelis adelia
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH102335"]]'
title
AH102335 | org.Pygoscelis_adeliae.eg.sqlite
AH102336 | org.Pygoscelis_adelia.eg.sqlite
ENTREZID SYMBOL GENENAME
1 103922433 PON2 paraoxonase 2
2 103925683 SGSM1 small G protein signaling modulator 1
3 103922264 KCTD19 potassium channel tetramerization domain containing 19
4 103913577 RADIL Rap associating with DIL domain
5 103913313 TMEM61 transmembrane protein 61
biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 113
2 ENSEMBL_MART_MOUSE Mouse strains 113
3 ENSEMBL_MART_SNP Ensembl Variation 113
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 113
Organisms for which data are available:
dataset description
1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
2 acalliptera_gene_ensembl Eastern happy genes (fAstCal1.3)
3 acarolinensis_gene_ensembl Green anole genes (AnoCar2.0v2)
4 acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2)
5 acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5)
6 amelanoleuca_gene_ensembl Giant panda genes (ASM200744v2)
version
1 ASM259213v1
2 fAstCal1.3
3 AnoCar2.0v2
4 bAquChr1.2
5 Midas_v5
6 ASM200744v2
To create a mart object, it is best to use the useEnsembl() function.
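A minimal sketch of creating a mart object for human genes, assuming the biomaRt package is installed and the Ensembl servers are reachable.

```r
library(biomaRt)

# Marts offered by Ensembl
listEnsembl()

# Connect to the gene mart and choose the human dataset in one call
mart <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

# What can be requested (attributes) and what can be filtered on (filters)
head(listAttributes(mart))
head(listFilters(mart))
```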
You can query information from a dataset with the function
getBM(attributes, filters, values, mart)
, where mart is the object we have just created.
name description page
1 ensembl_gene_id Gene stable ID feature_page
2 ensembl_gene_id_version Gene stable ID version feature_page
3 ensembl_transcript_id Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5 ensembl_peptide_id Protein stable ID feature_page
6 ensembl_peptide_id_version Protein stable ID version feature_page
name description
1 chromosome_name Chromosome/scaffold name
2 start Start
3 end End
4 band_start Band Start
5 band_end Band End
6 marker_start Marker Start
two_genes <- c("NEAT1", "MALAT1")
getBM(
  attributes = c("ensembl_gene_id", "external_gene_name", "description"),
  filters = "external_gene_name",
  values = two_genes,
  mart = mart)
ensembl_gene_id external_gene_name
1 ENSG00000251562 MALAT1
2 ENSG00000245532 NEAT1
description
1 metastasis associated lung adenocarcinoma transcript 1 [Source:HGNC Symbol;Acc:HGNC:29665]
2 nuclear paraspeckle assembly transcript 1 [Source:HGNC Symbol;Acc:HGNC:30815]
Install and load the biomaRt package.
Using this package, retrieve the GC content of the genes from the list.
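One possible approach is sketched below. It assumes that the gene dataset exposes an attribute named percentage_gene_gc_content, which is worth confirming with searchAttributes() first. The two gene symbols are placeholders for the actual list.

```r
library(biomaRt)

mart <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")

# Look for attributes whose name mentions GC
searchAttributes(mart, pattern = "gc")

genes_of_interest <- c("NEAT1", "MALAT1")   # placeholder gene list

gc_content <- getBM(
  attributes = c("external_gene_name", "percentage_gene_gc_content"),
  filters    = "external_gene_name",
  values     = genes_of_interest,
  mart       = mart
)
gc_content
```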
By default, biomaRt works with the latest Ensembl release. However, if you used a different release in your analysis, it is better to stick with it. The available earlier releases can be listed with the listEnsemblArchives() function.
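A sketch of listing the archives and connecting to a specific release. Release 105 is chosen here only as an example, to match the EnsDb version used earlier in the lecture.

```r
library(biomaRt)

# Table of archived releases with their URLs
archives <- listEnsemblArchives()
head(archives)

# Connect to a specific release by number instead of the current one
mart_105 <- useEnsembl(
  biomart = "genes",
  dataset = "hsapiens_gene_ensembl",
  version = 105
)
mart_105
```

Alternatively, the archive URL from the table can be passed through the host argument of useEnsembl().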
name date url version
1 Ensembl GRCh37 Feb 2014 https://grch37.ensembl.org GRCh37
2 Ensembl 113 Oct 2024 https://oct2024.archive.ensembl.org 113
3 Ensembl 112 May 2024 https://may2024.archive.ensembl.org 112
4 Ensembl 111 Jan 2024 https://jan2024.archive.ensembl.org 111
5 Ensembl 110 Jul 2023 https://jul2023.archive.ensembl.org 110
6 Ensembl 109 Feb 2023 https://feb2023.archive.ensembl.org 109
current_release
1
2 *
3
4
5
6
# Connect to an archived Ensembl release and select the two datasets
ensembl <- useEnsembl("ensembl", host = "https://sep2019.archive.ensembl.org/")
human <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
chimpz <- useDataset("ptroglodytes_gene_ensembl", mart = ensembl)
# Link the human and chimpanzee datasets
hs2pt <- getLDS(
  # human
  mart = human,
  attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name"),
  # chimpanzee
  martL = chimpz,
  attributesL = c("ensembl_gene_id", "external_gene_name", "chromosome_name"))
hs2pt %>% head()
Gene.stable.ID Gene.name Chromosome.scaffold.name Gene.stable.ID.1
1 ENSG00000196757 ZNF700 19 ENSPTRG00000010515
2 ENSG00000172819 RARG 12 ENSPTRG00000005007
Gene.name.1 Chromosome.scaffold.name.1
1 ZNF700 19
2 RARG 12