Deadline: October 3, 2023, 23:59

Instructions

Use the tools of the tidyverse package when completing the tasks.

Pay special attention to the formatting of the homework and to the rules stated in class.

Include all the code you needed to obtain each answer.

Create an .html file from the completed .Rmd file.

Submit the completed .Rmd notebooks and the .html file via the Google form.

Task 1

1.1

An analysis produced a list (vector) of very important functional gene categories, encoded as identifiers. However, the vector mixes identifiers from different databases: Gene Ontology (starting with GO:) and KEGG (starting with hsa for human genes). Split this vector into two according to the two types of identifiers.

library(tidyverse)  # provides stringr, dplyr and readr, used throughout
gene_cat <- c("GO:1902222", "hsa00380", "hsa00630", "GO:0006559", "GO:0042773", "hsa00350", "hsa00730", "GO:0051792", "GO:0006572", "GO:0006573", "GO:0032324", "GO:0006390", "GO:0009250", "hsa01212", "GO:0005978")
go <- str_subset(gene_cat, "^GO:\\d+")
go
##  [1] "GO:1902222" "GO:0006559" "GO:0042773" "GO:0051792" "GO:0006572"
##  [6] "GO:0006573" "GO:0032324" "GO:0006390" "GO:0009250" "GO:0005978"
kegg <- str_subset(gene_cat, "^hsa\\d+")
kegg
## [1] "hsa00380" "hsa00630" "hsa00350" "hsa00730" "hsa01212"

1.2

In all of these identifiers, only the last few digits, those that follow the leading zeros, are significant, so let's extract them.

str_remove(go, "GO:0*")
##  [1] "1902222" "6559"    "42773"   "51792"   "6572"    "6573"    "32324"  
##  [8] "6390"    "9250"    "5978"
str_remove(kegg, "hsa0*")
## [1] "380"  "630"  "350"  "730"  "1212"
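
As an alternative sketch, the significant digits can be pulled out with one pattern that does not depend on the database prefix: match a non-zero digit followed by any digits up to the end of the string. The vector `ids` below is a small subset of `gene_cat` chosen for illustration, and `significant` is a hypothetical name.

```r
library(stringr)

# A few identifiers from gene_cat, mixing both databases
ids <- c("GO:1902222", "hsa00380", "GO:0006559", "hsa01212")

# First non-zero digit and everything after it, anchored at the end of the string
significant <- str_extract(ids, "[1-9]\\d*$")
significant
```

This relies on the assumption that every identifier ends in digits and contains at least one non-zero digit; an identifier made only of zeros would yield `NA`.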

Task 2

The abstracts of molecular biology papers show which genes a paper is mainly about. For each abstract, extract the gene names mentioned in it. The gene names we are looking for consist entirely of uppercase letters and digits and are at least 3 characters long.

abstracts <- c(
  "Atherosclerosis (AS) is one of the main causes of cardiovascular diseases (CVDs). Trimethylamine N-oxide (TMAO) exacerbates the development of AS. This study aimed to investigate the roles of TMAO in AS. In this study, mice were fed with high fat food (HF) and/or injected with TMAO. Oil red O staining was applied for histological analysis. ELISA, qRT-PCR, and western blot were conducted to determine the TMAO, serum, mRNA, and protein levels. CCK-8, colony formation assay, and flow cytometry assays were performed to detect the functions of human aortic endothelial cells (HUVECs). The results showed that TMAO induced thick internal and external walls and intimal plaques in vivo, and HUVECs dysfunction in vitro. TMAO and lncRNA enriched abundant transcript 1 (NEAT1) were increased in AS clinical samples and TMAO-HUVECs. Downregulated NEAT1 inhibited proliferation and promoted the apoptosis of HUVECs. NEAT1 regulated the expression of signal transducer and activator of transcription 3 (STAT3) via sponging miR-370-3p. Overexpression of miR-370-3p facilitated the effects of NEAT1 on the cellular functions of HUVECs, while STAT3 exerted opposing effects. The activation of STAT3 promoted the expression of flavin-containing monooxygenase-3 (FMO3). Taken together, our results show that TMAO-NEAT1/miR-370-3p/STAT3/FMO3 forms a positive feedback loop to exacerbate the development of AS. This novel feedback loop may be a promising therapeutic target for AS.",
  "RNA G-quadruplexes (rG4s) have functional roles in many cellular processes in diverse organisms. While a number of rG4 examples have been reported in coding messenger RNAs (mRNA), so far only limited works have studied rG4s in non-coding RNAs (ncRNAs), especially in long non-coding RNAs (lncRNAs) that are of emerging interest and significance in biology. Herein, we report that MALAT1 lncRNA contains conserved rG4 motifs, forming thermostable rG4 structures with parallel topology. We also show that rG4s in MALAT1 lncRNA can interact with NONO protein with high specificity and affinity in vitro and in nuclear cell lysate, and we provide cellular data to support that NONO protein recognizes MALAT1 lncRNA via rG4 motifs. Notably, we demonstrate that rG4s in MALAT1 lncRNA can be targeted by the rG4-specific small molecule, peptide, and L-aptamer, leading to the dissociation of MALAT1 rG4-NONO protein interaction. Altogether, this study uncovers new and important rG4s in MALAT1 lncRNAs, reveals their specific interactions with NONO protein, offers multiple strategies for targeting MALAT1 and its RNA-protein complex via its rG4 structure and illustrates the prevalence and significance of rG4s in ncRNAs.",
  "Automated and accurate EGFR mutation status prediction using computed tomography (CT) imagery is of great value for tailoring optimal treatments to non-small cell lung cancer (NSCLC) patients. However, existing deep learning based methods usually adopt a single task learning strategy to design and train EGFR mutation status prediction models with limited training data, which may be insufficient to learn distinguishable representations for promoting prediction performance. In this paper, a novel multi-task learning method named AIR-Net is proposed to precisely predict EGFR mutation status on CT images. First, an auxiliary image reconstruction task is effectively integrated with EGFR mutation status prediction, aiming at providing extra supervision at the training phase. Particularly, we adequately employ multi-level information in a shared encoder to generate more comprehensive representations of tumors. Second, a powerful feature consistency loss is further introduced to constrain semantic consistency of original and reconstructed images, which contributes to enhanced image reconstruction and offers more effective regularization to AIR-Net during training. Performance analysis of AIR-Net indicates that auxiliary image reconstruction plays an essential role in identifying EGFR mutation status. Furthermore, extensive experimental results demonstrate that our method achieves favorable performance against other competitive prediction methods. All the results executed in this study suggest that the effectiveness and superiority of AIR-Net in precisely predicting EGFR mutation status of NSCLC.",
  "Driver mutations promote initiation and progression of cancer. Pharmacological treatment can inhibit the action of the mutant protein; however, drug resistance almost invariably emerges. Multiple studies revealed that cancer drug resistance is based upon a plethora of distinct mechanisms. Drug resistance mutations can occur in the same protein or in different proteins; as well as in the same pathway or in parallel pathways, bypassing the intercepted signaling. The dilemma that the clinical oncologist is facing is that not all the genomic alterations as well as alterations in the tumor microenvironment that facilitate cancer cell proliferation are known, and neither are the alterations that are likely to promote metastasis. For example, the common KRasG12C driver mutation emerges in different cancers. Most occur in NSCLC, but some occur, albeit to a lower extent, in colorectal cancer and pancreatic ductal carcinoma. The responses to KRasG12C inhibitors are variable and fall into three categories, (i) new point mutations in KRas, or multiple copies of KRAS G12C which lead to higher expression level of the mutant protein; (ii) mutations in genes other than KRAS; (iii) original cancer transitioning to other cancer(s). Resistance to adagrasib, an experimental antitumor agent exerting its cytotoxic effect as a covalent inhibitor of the G12C KRas, indicated that half of the cases present multiple KRas mutations as well as allele amplification. Redundant or parallel pathways included MET amplification; emerging driver mutations in NRAS, BRAF, MAP2K1, and RET; gene fusion events in ALK, RET, BRAF, RAF1, and FGFR3; and loss-of-function mutations in NF1 and PTEN tumor suppressors. In the current review we discuss the molecular mechanisms underlying drug resistance while focusing on those emerging to common targeted cancer drivers. 
We also address questions of why cancers with a common driver mutation are unlikely to evolve a common drug resistance mechanism, and whether one can predict the likely mechanisms that the tumor cell may develop. These vastly important and tantalizing questions in drug discovery, and broadly in precision medicine, are the focus of our present review. We end with our perspective, which calls for target combinations to be selected and prioritized with the help of the emerging massive compute power which enables artificial intelligence, and the increased gathering of data to overcome its insatiable needs."
)
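# Extract tokens that follow a space and consist of uppercase letters followed by digits,
# then keep the unique matches for each abstract.
# The pattern requires trailing digits, so letter-only names (EGFR, KRAS) are not matched
# and MAP2K1 is cut to MAP2.
lapply(str_extract_all(abstracts, "(?<= )[[:upper:]]+\\d{2,}|(?<= )[[:upper:]]{2,}\\d+"), unique)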
## [[1]]
## [1] "NEAT1" "STAT3"
## 
## [[2]]
## [1] "MALAT1"
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "G12"   "MAP2"  "RAF1"  "FGFR3" "NF1"
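
The rule from the task (only uppercase letters and digits, at least 3 characters) can also be expressed with word boundaries, which keeps tokens such as MAP2K1 whole and allows names in parentheses. The sketch below is one possible reading of that rule, not a definitive gene-name detector; the sentence `toy` and the names `gene_pattern` and `toy_genes` are hypothetical. The lookahead requiring at least one uppercase letter is an added assumption, so that purely numeric tokens are skipped.

```r
library(stringr)

# Hypothetical sentence for illustration, not one of the abstracts
toy <- "Mutations in NRAS, BRAF, MAP2K1 and RET were found (FMO3), but miR-370-3p was not."

# \\b ... \\b keeps whole tokens only; the lookahead demands at least one uppercase letter
gene_pattern <- "\\b(?=[0-9]*[A-Z])[A-Z0-9]{3,}\\b"

toy_genes <- unique(str_extract_all(toy, gene_pattern)[[1]])
toy_genes
```

On real abstracts the same rule would also pick up non-gene abbreviations written in capitals (for example ELISA or NSCLC), so the result would still need manual filtering.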

Task 3

3.1

Read the dataset containing the characters' lines from the first five episodes of season one of "The Witcher". The dataset is available at https://raw.githubusercontent.com/kirushka/datasets/main/witcher.csv. The link can be used as the file path when reading the dataset in R.

Take a look at the dataset and examine its structure and contents.

witcher <- read_csv("https://raw.githubusercontent.com/kirushka/datasets/main/witcher.csv", show_col_types = FALSE)
witcher
## # A tibble: 1,478 × 2
##    Character Text                                                               
##    <chr>     <chr>                                                              
##  1 Isadora   What will it be?                                                   
##  2 Geralt    Point me to the alderman's house.                                  
##  3 Isadora   It's down the alley to the left-                                   
##  4 Inkeeper  Isadora! We don't want your kind here, Witcher.                    
##  5 Geralt    The alderman, tell me where he is and I'll be on my way.           
##  6 Nohorn    You don't give the orders around here you mutant son of a bitch.   
##  7 Inkeeper  Hear that? Go. On your own or at the end of a rope, your choice.   
##  8 Geralt    Not a hard choice.                                                 
##  9 Inkeeper  Yeah, fuck that. Kill him with your bare hands if you have to.     
## 10 Nohorn    C'mon, Witcher. You're not scared of us, are ya? Show us what you'…
## # ℹ 1,468 more rows

3.2

Calculate what percentage of Geralt's lines begin with his famous "Hm".

# Count Geralt's lines
geralt <- witcher %>% 
  filter(Character == "Geralt") %>% 
  nrow()
# Count Geralt's lines that start with "Hm"
hm <- witcher %>% 
  filter(Character == "Geralt", str_detect(Text, "^Hm")) %>% 
  nrow()
# Compute the percentage
hm / geralt * 100
## [1] 8.278146
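
The same percentage can be computed in a single pipeline, because the mean of a logical vector is the share of `TRUE` values. The sketch below uses a hypothetical toy table (`toy_lines`) instead of the real dataset, so the number it produces is not the answer to the task.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data, not the real witcher.csv
toy_lines <- tibble(
  Character = c("Geralt", "Geralt", "Geralt", "Geralt", "Jaskier"),
  Text = c("Hm.", "Hmm. Fine.", "Not a hard choice.", "Come on, Roach.", "Hm, what?")
)

# Share of Geralt's lines that start with "Hm", expressed as a percentage
hm_percent <- toy_lines %>%
  filter(Character == "Geralt") %>%
  summarise(percent = mean(str_detect(Text, "^Hm")) * 100) %>%
  pull(percent)
hm_percent
```

Here two of Geralt's four toy lines start with "Hm", so `hm_percent` is 50.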

3.3

In how many lines does Geralt address Roach?

# Treat "Roach" preceded by a comma and whitespace as a direct address
witcher %>% 
  filter(Character == "Geralt", str_detect(Text, ",\\sRoach")) %>% 
  nrow()
## [1] 2
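
The comma-based pattern counts a line only when "Roach" follows a comma. A broader variant, shown here as a sketch on hypothetical toy data (`toy_lines`), counts every line by Geralt that contains the name as a whole word; whether a mention is really an address is a judgment the pattern cannot make.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data, not the real witcher.csv
toy_lines <- tibble(
  Character = c("Geralt", "Geralt", "Geralt", "Jaskier"),
  Text = c("Come on, Roach.", "Roach, move.", "Where is the alderman?", "Nice horse, Roach.")
)

geralt_lines <- filter(toy_lines, Character == "Geralt")

# "Roach" directly after a comma and whitespace
n_after_comma <- sum(str_detect(geralt_lines$Text, ",\\sRoach"))
# "Roach" anywhere in the line, as a whole word
n_any_mention <- sum(str_detect(geralt_lines$Text, "\\bRoach\\b"))

c(after_comma = n_after_comma, any_mention = n_any_mention)
```

On the toy data the first count is 1 and the second is 2, because "Roach, move." starts with the name and has no preceding comma.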