class: center, middle, inverse, title-slide

# The R Language and Its Applications in Bioinformatics

### Анастасия Жарикова, Анна Валяева

### 24.09.2021

---

# Factors

- Used to work with categorical variables
- Factor levels - the limited set of known values a categorical variable can take
- The `forcats` package, part of `tidyverse`, provides tools for working with factors

```r
library(forcats) # or library(tidyverse)
```

<img src="img/2021-09-24/forcats.png" width="30%" style="display: block; margin: auto;" />

---

## Days of the week

```r
weekdays <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
weekdays
```

```
[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
[7] "Sunday"   
```

```r
weekdays_as_fct <- as_factor(weekdays) # keeps the levels in order of appearance
weekdays_as_fct
```

```
[1] Monday    Tuesday   Wednesday Thursday  Friday    Saturday  Sunday   
Levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
```

---

.pull-left[

```r
typeof(weekdays)
```

```
[1] "character"
```

```r
class(weekdays)
```

```
[1] "character"
```

```r
as.integer(weekdays)
```

```
[1] NA NA NA NA NA NA NA
```

```r
sort(weekdays)
```

```
[1] "Friday"    "Monday"    "Saturday"  "Sunday"    "Thursday"  "Tuesday"  
[7] "Wednesday"
```

]

.pull-right[

```r
typeof(weekdays_as_fct)
```

```
[1] "integer"
```

```r
class(weekdays_as_fct)
```

```
[1] "factor"
```

```r
as.integer(weekdays_as_fct)
```

```
[1] 1 2 3 4 5 6 7
```

```r
sort(weekdays_as_fct)
```

```
[1] Monday    Tuesday   Wednesday Thursday  Friday    Saturday  Sunday   
Levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
```

]

---

# Levels

```r
levels(weekdays)
```

```
NULL
```

```r
levels(weekdays_as_fct)
```

```
[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
[7] "Sunday"   
```

```r
weekdays_as_fct <- weekdays_as_fct[1:3]
weekdays_as_fct
```

```
[1] Monday    Tuesday   Wednesday
Levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
```

---

# Creating a factor vector

```r
# "sommer" is not among the levels, so it becomes NA
seasons_with_rep <- c("winter", "winter", "fall", "summer", "sommer", "fall", "fall")
seasons_levels <- c("winter", "spring", "summer", "fall")
seasons <- factor(seasons_with_rep, levels = seasons_levels)
seasons
```

```
[1] winter winter fall   summer <NA>   fall   fall  
Levels: winter spring summer fall
```

---

## `fct_count`

Count how many values fall into each level.

```r
seasons
```

```
[1] winter winter fall   summer <NA>   fall   fall  
Levels: winter spring summer fall
```

```r
fct_count(seasons)
```

```
# A tibble: 5 × 2
  f          n
  <fct>  <int>
1 winter     2
2 spring     0
3 summer     1
4 fall       3
5 NA         1
```

---

## `fct_drop`

Drop unused factor levels, here `spring`.

```r
seasons
```

```
[1] winter winter fall   summer <NA>   fall   fall  
Levels: winter spring summer fall
```

```r
fct_drop(seasons)
```

```
[1] winter winter fall   summer <NA>   fall   fall  
Levels: winter summer fall
```

---

## `fct_explicit_na`

Give missing values (`NA`) an explicit level name.

```r
seasons
```

```
[1] winter winter fall   summer <NA>   fall   fall  
Levels: winter spring summer fall
```

```r
fct_explicit_na(seasons) # na_level sets the name of the new level
```

```
[1] winter    winter    fall      summer    (Missing) fall      fall     
Levels: winter spring summer fall (Missing)
```

---

## `fct_inorder`

Order factor levels by their first appearance in the vector.

```r
fct_drop(seasons) %>% fct_inorder() # there must be no unused levels
```

```
[1] winter winter fall   summer <NA>   fall   fall  
Levels: winter fall summer
```

--

## `fct_infreq`

Order factor levels by frequency, most frequent first.

```r
fct_infreq(seasons)
```

```
[1] winter winter fall   summer <NA>   fall   fall  
Levels: fall winter summer spring
```

---

## `fct_rev`

Reverse the order of the levels.

```r
levels(seasons)
```

```
[1] "winter" "spring" "summer" "fall"  
```

```r
fct_rev(seasons) %>% levels()
```

```
[1] "fall"   "summer" "spring" "winter"
```

--

## `fct_shuffle`

Shuffle the levels randomly.
```r
fct_shuffle(seasons) %>% levels()
```

```
[1] "fall"   "spring" "winter" "summer"
```

---

## `fct_reorder`

```r
head(iris)
```

```
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
```

```r
iris$Species %>% as_factor() %>% levels()
```

```
[1] "setosa"     "versicolor" "virginica" 
```

---

## `fct_reorder`

```r
iris %>%
  group_by(Species) %>%
  summarise(mean_width = mean(Sepal.Width))
```

```
# A tibble: 3 × 2
  Species    mean_width
  <fct>           <dbl>
1 setosa           3.43
2 versicolor       2.77
3 virginica        2.97
```

```r
# Order the levels by the mean Sepal.Width within each species
iris$Species %>%
  as_factor() %>%
  fct_reorder(iris$Sepal.Width, mean) %>%
  levels()
```

```
[1] "versicolor" "virginica"  "setosa"    
```

--

See also `fct_reorder2`
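
---

## `fct_reorder2`

A minimal sketch of what `fct_reorder2` does, on a made-up toy table (the data frame `df` and its columns are illustrative assumptions, not course data). With its default summary function `last2`, it orders the levels by the `value` observed at the largest `time`, in descending order, which is handy for matching legend order to line ends.

```r
library(forcats)

# Hypothetical toy data: three groups measured at two time points
df <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 2)),
  time  = rep(c(1, 2), times = 3),
  value = c(5, 1,  2, 9,  3, 4)
)

# Values at the last time point: a = 1, b = 9, c = 4.
# fct_reorder2() sorts in descending order by default (.desc = TRUE).
reordered <- fct_reorder2(df$group, df$time, df$value)
levels(reordered)
```

Here the levels come out as `b`, `c`, `a`, because `b` has the largest value at `time = 2`.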
---

## `fct_lump`

```r
# starwars is a dataset that ships with dplyr
homeworld_fct <- starwars$homeworld %>%
  as_factor() %>%
  fct_explicit_na("Unknown")
levels(homeworld_fct) %>% length()
```

```
[1] 49
```

```r
fct_count(homeworld_fct, sort = TRUE) %>% head(5)
```

```
# A tibble: 5 × 2
  f            n
  <fct>    <int>
1 Naboo       11
2 Tatooine    10
3 Unknown     10
4 Alderaan     3
5 Kamino       3
```

---

## `fct_lump`

A family of functions that collapses some of the levels, chosen by a condition, into a single "Other" level.

```r
homeworld_fct %>% fct_lump_n(3) %>% table()
```

```
.
Tatooine    Naboo  Unknown    Other 
      10       11       10       56 
```

--
- `fct_lump_prop` - keeps levels that appear at least the given proportion of the time
- `fct_lump_min` - keeps levels that appear at least the given number of times
- `fct_lump_lowfreq` - lumps as many of the rarest levels into "Other" as possible while "Other" still remains the smallest group

---

# forcats - further reading

- [forcats cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/factors.pdf)
- List of all functions in the `forcats` package: `help(package = "forcats")`
- [Factors Chapter in R4DS](https://r4ds.had.co.nz/factors.html)

---

# Reading a file with `readr`

```r
library(readr) # or library(tidyverse)
penguins <- read_csv("data/2021-09-24/penguins.csv")
penguins
```

```
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# … with 334 more rows, and 2 more variables: sex <chr>, year <dbl>
```

---

# Reading a file with `readr`

See all parameters: `?read_csv`.

<img src="img/2021-09-24/readr_read_params.PNG" width="80%" style="display: block; margin: auto;" />

---

# Writing a file

See all parameters: `?write_csv`.

```r
# append = TRUE adds rows to an existing file instead of overwriting it
write_csv(penguins, "more_penguins.csv", append = TRUE)
```

---

# Reading an xlsx file with `readxl`

```r
library(readxl)
my_file <- read_excel("data/my_file.xlsx", sheet = "Best sheet ever")
```

---

# ggplot2

--

.pull-right[

<br> <br> <br>

<img src="img/2021-09-24/ggplot_layers.png" width="110%" style="display: block; margin: auto;" />

]

.pull-left[

<br> <br> <br>

```r
ggplot(
  data = <DATA>,
  mapping = aes(<MAPPINGS>)) +
  <GEOM_FUNCTION>() +
  ...
```

]

---
count: false

# Data -> axes -> plot type -> ...
.panel1-ggplot_basics-user[

```r
*ggplot(penguins)
```

]

.panel2-ggplot_basics-user[

<img src="figs/ggplot_basics_user_01_output-1.png" width="80%" style="display: block; margin: auto;" />

]

---
count: false

# Data -> axes -> plot type -> ...

.panel1-ggplot_basics-user[

```r
ggplot(penguins) +
* aes(
*   x = bill_length_mm,
*   y = bill_depth_mm)
```

]

.panel2-ggplot_basics-user[

<img src="figs/ggplot_basics_user_02_output-1.png" width="80%" style="display: block; margin: auto;" />

]

---
count: false

# Data -> axes -> plot type -> ...

.panel1-ggplot_basics-user[

```r
ggplot(penguins) +
  aes(
    x = bill_length_mm,
    y = bill_depth_mm) +
* geom_point(
*   size = 3,
*   alpha = 0.8)
```

]

.panel2-ggplot_basics-user[

<img src="figs/ggplot_basics_user_03_output-1.png" width="80%" style="display: block; margin: auto;" />

]

---
count: false

# Data -> axes -> plot type -> ...

.panel1-ggplot_basics-user[

```r
ggplot(penguins) +
  aes(
    x = bill_length_mm,
    y = bill_depth_mm) +
  geom_point(
    size = 3,
    alpha = 0.8) +
* theme_bw()
```

]

.panel2-ggplot_basics-user[

<img src="figs/ggplot_basics_user_04_output-1.png" width="80%" style="display: block; margin: auto;" />

]

<style>
.panel1-ggplot_basics-user {
  color: black;
  width: 38.6060606060606%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-ggplot_basics-user {
  color: black;
  width: 59.3939393939394%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-ggplot_basics-user {
  color: black;
  width: NA%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

---

# Saving part of a plot in a variable

```r
p <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  theme_bw()
p + geom_point(size = 3, alpha = 0.8)
```

<img src="figs/ggplot_var-1.png" width="80%" style="display: block; margin: auto;" />

---

# Constant *vs* variable

.pull-left[

```r
p +
* geom_point(color = "#2a9d8f")
```

<img src="figs/ggplot_aes_const-1.png" width="80%" style="display: block; margin: auto;" />

]
.pull-right[

```r
p +
* geom_point(aes(color = species))
```

<img src="figs/ggplot_aes_var-1.png" width="80%" style="display: block; margin: auto;" />

]

---

# `aes` options - aesthetics

.pull-left[

- `shape` - point symbol
- `color` - overall color / outline color
- `fill` - fill color
- `size` - size
- `stroke` - outline thickness
- `alpha` - transparency

]

.pull-right[

<img src="img/2021-09-24/point_shape.png" width="100%" style="display: block; margin: auto;" />

]

---

.pull-left[

Only `color`: it sets the color of the whole shape, a filled circle.

```r
p + theme(legend.position = "top") +
* geom_point(aes(color = species), shape = 16)
```

<img src="figs/point_shape_fill-1.png" width="100%" style="display: block; margin: auto;" />

]

.pull-right[

Only `color`: it sets the color of the whole shape, a hollow circle.

```r
p + theme(legend.position = "top") +
* geom_point(aes(color = species), shape = 1)
```

<img src="figs/point_shape_col-1.png" width="100%" style="display: block; margin: auto;" />

]

---

Both `color` and `fill`: one for the outline, the other for the fill.
```r
p +
* geom_point(aes(fill = species, color = island), shape = 21)
```

<img src="figs/point_shape_col_fill-1.png" width="80%" style="display: block; margin: auto;" />

---

# Color scale

```r
cols <- c("#00afb9", "#fed9b7", "#f07167")
p_labels <- c("Adélie", "Chinstrap", "Gentoo")
p <- p +
  geom_point(aes(color = species), size = 3, alpha = 0.8) +
* scale_color_manual(values = cols, labels = p_labels)
p
```

<img src="figs/ggplot_color_scale-1.png" width="80%" style="display: block; margin: auto;" />

---

# Axis titles

```r
# p already contains the point layer, so only the labels are added here
p <- p +
* labs(x = "Bill length (mm)", y = "Bill depth (mm)", color = "Species")
p
```

<img src="figs/ggplot_axis_title-1.png" width="80%" style="display: block; margin: auto;" />

---

# Plot title

```r
p <- p +
* labs(title = "Penguin bills",
*      subtitle = "The three penguin species have very different bills.",
*      caption = "Data: palmerpenguins")
p
```

<img src="figs/ggplot_title-1.png" width="80%" style="display: block; margin: auto;" />

---

# Let's make it pretty...
```r
p +
* theme(
*   plot.title = element_text(size = 18, hjust = 0.5),
*   axis.title = element_text(face = "italic"),
*   axis.title.x = element_text(color = "purple"))
```

<img src="figs/ggplot_theme-1.png" width="80%" style="display: block; margin: auto;" />

---

# Theme: text elements

*Image credit: Emi Tanaka*

<img src="img/2021-09-24/ggplot-theme-text-annotation.png" width="100%" style="display: block; margin: auto;" />

---

# Theme: grid lines

*Image credit: Emi Tanaka*

<img src="img/2021-09-24/ggplot-annotated-line-marks.png" width="100%" style="display: block; margin: auto;" />

---

# Theme: plot areas

*Image credit: Emi Tanaka*

<img src="img/2021-09-24/ggplot-annotated-rect-marks.png" width="80%" style="display: block; margin: auto;" />

---

# Histogram

```r
theme_set(theme_bw()) # make this theme the default for all later plots
ggplot(penguins, aes(x = body_mass_g)) +
* geom_histogram(fill = "#e5989b")
```

<img src="figs/geom_hist-1.png" width="80%" style="display: block; margin: auto;" />

---

# Histogram bin width

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(fill = "#e5989b", color = "#6d6875",
*                binwidth = 250)
```

<img src="figs/geom_hist_bin-1.png" width="80%" style="display: block; margin: auto;" />

---

# Density - a smoothed histogram

```r
ggplot(penguins, aes(x = body_mass_g)) +
* geom_density(fill = "#e5989b", alpha = 0.7)
```

<img src="figs/geom_density-1.png" width="80%" style="display: block; margin: auto;" />

---

# Wide format

I want to draw the distributions of bill length and bill depth as boxplots. For now the values I need are split across 2 columns. They have to be gathered into a single column, that is, the tibble/data frame has to be converted to long format.
```r
penguins %>%
  select(species, bill_length_mm, bill_depth_mm)
```

```
# A tibble: 344 × 3
   species bill_length_mm bill_depth_mm
   <chr>            <dbl>         <dbl>
 1 Adelie            39.1          18.7
 2 Adelie            39.5          17.4
 3 Adelie            40.3          18  
 4 Adelie            NA            NA  
 5 Adelie            36.7          19.3
 6 Adelie            39.3          20.6
 7 Adelie            38.9          17.8
 8 Adelie            39.2          19.6
 9 Adelie            34.1          18.1
10 Adelie            42            20.2
# … with 334 more rows
```

---

# Long and wide format

- From wide to long - `pivot_longer`
- From long to wide - `pivot_wider`
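
The two functions are inverses of each other. A minimal round-trip sketch on a made-up table (the data frame `wide` and the column names `measure` and `value` are illustrative assumptions, not the penguins data):

```r
library(tidyr)

# Hypothetical toy table in wide format: one column per measurement
wide <- data.frame(
  id     = c(1, 2),
  length = c(39.1, 39.5),
  depth  = c(18.7, 17.4)
)

# Wide -> long: column names go to `measure`, the numbers go to `value`
long <- pivot_longer(wide, cols = c(length, depth),
                     names_to = "measure", values_to = "value")

# Long -> wide: the inverse step restores one column per measurement
wide_again <- pivot_wider(long, names_from = measure, values_from = value)
```

The long table has one row per `id` and measurement (4 rows here), and `wide_again` has the same columns as `wide`.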
<img src="img/2021-09-24/wide_long.png" width="80%" style="display: block; margin: auto;" />

---

# Long format with `pivot_longer`

```r
penguins_long <- penguins %>%
  select(species, bill_length_mm, bill_depth_mm) %>%
* pivot_longer(cols = bill_length_mm:bill_depth_mm, names_to = "bill_measure_mm", values_to = "value_mm")
penguins_long
```

```
# A tibble: 688 × 3
   species bill_measure_mm value_mm
   <chr>   <chr>              <dbl>
 1 Adelie  bill_length_mm      39.1
 2 Adelie  bill_depth_mm       18.7
 3 Adelie  bill_length_mm      39.5
 4 Adelie  bill_depth_mm       17.4
 5 Adelie  bill_length_mm      40.3
 6 Adelie  bill_depth_mm       18  
 7 Adelie  bill_length_mm      NA  
 8 Adelie  bill_depth_mm       NA  
 9 Adelie  bill_length_mm      36.7
10 Adelie  bill_depth_mm       19.3
# … with 678 more rows
```

---

# Boxplot

```r
ggplot(penguins_long) +
* geom_boxplot(aes(x = bill_measure_mm, y = value_mm))
```

<img src="figs/geom_box-1.png" width="80%" style="display: block; margin: auto;" />

---

# Boxplot

```r
ggplot(penguins_long) +
  geom_boxplot(aes(x = bill_measure_mm, y = value_mm,
*                  fill = species))
```

<img src="figs/geom_box_col-1.png" width="80%" style="display: block; margin: auto;" />

---

# Facets

```r
ggplot(penguins_long) +
  geom_boxplot(aes(x = bill_measure_mm, y = value_mm, fill = species)) +
* facet_wrap(~ species) +
* theme(legend.position = "none")
```

<img src="figs/facet_box-1.png" width="80%" style="display: block; margin: auto;" />

---

# Facets

```r
ggplot(penguins) +
  geom_point(aes(x = flipper_length_mm, y = body_mass_g)) +
* facet_grid(sex ~ species)
```

<img src="figs/facet_point-1.png" width="80%" style="display: block; margin: auto;" />

---

# Barplot

```r
ggplot(penguins) +
* geom_bar(aes(x = species, fill = species))
```

<img src="figs/geom_bar-1.png" width="80%" style="display: block; margin: auto;" />

---

# Barplot - count

```r
penguins %>%
* mutate(species = as_factor(species),
*        species = fct_infreq(species)) %>%
  ggplot() +
* geom_bar(aes(x = species, fill = species))
```

<img src="figs/geom_bar_infreq-1.png" width="80%" 
style="display: block; margin: auto;" />

---

# Barplot - summary statistic

```r
penguins %>%
  mutate(species = as_factor(species),
*        species = fct_reorder(species, flipper_length_mm, mean, na.rm = TRUE, .desc = TRUE)) %>%
  ggplot() +
  geom_bar(aes(x = species, y = flipper_length_mm, fill = species),
*          stat = "summary", fun = "mean")
```

<img src="figs/geom_bar_sum-1.png" width="80%" style="display: block; margin: auto;" />

---

# Barplot & errorbar

```r
# min_/max_flipper_mm hold mean - SD and mean + SD, the error bar limits
penguins_stat <- penguins %>%
* group_by(species) %>%
* summarise(
*   avg_flipper_mm = mean(flipper_length_mm, na.rm = TRUE),
*   min_flipper_mm = avg_flipper_mm - sd(flipper_length_mm, na.rm = TRUE),
*   max_flipper_mm = avg_flipper_mm + sd(flipper_length_mm, na.rm = TRUE))
penguins_stat
```

```
# A tibble: 3 × 4
  species   avg_flipper_mm min_flipper_mm max_flipper_mm
  <chr>              <dbl>          <dbl>          <dbl>
1 Adelie              190.           183.           196.
2 Chinstrap           196.           189.           203.
3 Gentoo              217.           211.           224.
```

---

# Barplot & errorbar

```r
ggplot(penguins_stat) +
  geom_bar(aes(x = species, y = avg_flipper_mm, fill = species),
*          stat = "identity") +
* geom_errorbar(aes(x = species, ymin = min_flipper_mm, ymax = max_flipper_mm), width = 0.2)
```

<img src="figs/geom_bar_error-1.png" width="80%" style="display: block; margin: auto;" />

---

# Stacked bar chart

```r
penguins %>%
* count(species, island) %>%
  ggplot() +
* geom_col(aes(x = species, y = n, fill = island))
```

<img src="figs/geom_col-1.png" width="80%" style="display: block; margin: auto;" />

---

# Grouped bar chart

```r
penguins %>%
* count(species, island) %>%
  ggplot() +
  geom_col(aes(x = species, y = n, fill = island),
*          position = "dodge")
```

<img src="figs/geom_col_dodge-1.png" width="80%" style="display: block; margin: auto;" />

---

# Saving to a file

```r
ggsave("figures/my_plot.png", my_plot)
```

--

```r
ggsave("figures/my_plot.png", my_plot, dpi = 300, width = 10, height = 10, units = "cm")
```

--

```r
ggsave("figures/my_plot.pdf", my_plot)
```

--

```r
pdf("figures/my_plot.pdf") # Open the file for writing
print(my_plot)             # Draw the plot (print() is required inside loops and functions)
dev.off()                  # Close the file
```

---

# Combining plots

[Patchwork manual](https://patchwork.data-imaginist.com/index.html)

```r
library(patchwork)
p1 <- ggplot(mtcars) + geom_point(aes(mpg, disp))
p2 <- ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))
*p1 + p2
```

<img src="figs/patchwork-1.png" width="80%" style="display: block; margin: auto;" />

```r
plot_annotation(
  tag_levels = 'A',
  title = 'The surprising truth about mtcars',
  subtitle = 'These 3 plots will reveal yet-untold secrets about our beloved data-set',
  caption = 'Disclaimer: None of these plots are insightful')
```

```
$title
[1] "The surprising truth about mtcars"

$subtitle
[1] "These 3 plots will reveal yet-untold secrets about our beloved data-set"

$caption
[1] "Disclaimer: None of these plots are insightful"

$tag_levels
[1] "A"

$tag_prefix
NULL

$tag_suffix
NULL

$tag_sep
NULL

$theme
 Named list()
 - attr(*, "class")= chr [1:2] "theme" "gg"
 - attr(*, "complete")= logi FALSE
 - attr(*, "validate")= logi TRUE

attr(,"class")
[1] "plot_annotation"
```

---

# Combining plots

```r
p1 + p2 +
* plot_annotation(
*   tag_levels = 'A',
*   title = 'The surprising truth about mtcars',
*   subtitle = 'These 3 plots will reveal yet-untold secrets about our beloved data-set',
*   caption = 'Disclaimer: None of these plots are insightful')
```

<img src="figs/patchwork_ann-1.png" width="80%" style="display: block; margin: auto;" />

---
class: top, center
background-image: url("img/2021-09-24/ggext.png")
background-size: contain

# Ggplot2 extensions

## https://exts.ggplot2.tidyverse.org/

---

# Further reading and viewing

- [ggplot2 cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)
- [ggplot2 website](https://ggplot2.tidyverse.org/reference/index.html)
- ['A ggplot2 Tutorial for Beautiful Plotting in R' by Cédric Scherer](https://www.cedricscherer.com/2019/08/05/a-ggplot2-tutorial-for-beautiful-plotting-in-r/)
- [The R Graph
Gallery](https://www.r-graph-gallery.com/)
- [How to choose a chart for your data](https://www.data-to-viz.com/)
- [Data Visualization Chapter in R4DS](https://r4ds.had.co.nz/data-visualisation.html)
- [ggplot2: elegant graphics for data analysis](https://ggplot2-book.org/)

---

# Quiz 🏅

- `group_by() %>% summarise()`
- `group_by() %>% mutate() %>% ungroup()`
- `fct_reorder()`, `fct_lump()`
- barplot, `position = "stack"` / `position = "dodge"`
- changing axis and legend titles or removing them entirely
- setting colors from a vector with the `scale_*` functions
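
---

# `summarise()` vs `mutate()` on groups

As a warm-up for the first two quiz items, here is a minimal sketch of how the two patterns differ. The tibble `df` and its columns are made up for illustration and are not course data.

```r
library(dplyr)

# Hypothetical toy data
df <- tibble(
  species = c("A", "A", "B", "B", "B"),
  mass    = c(10, 20, 30, 40, 50)
)

# summarise(): collapses each group to a single row
per_group <- df %>%
  group_by(species) %>%
  summarise(mean_mass = mean(mass))

# mutate(): keeps every row and attaches the group statistic to each one;
# ungroup() drops the grouping so later steps act on the whole table
per_row <- df %>%
  group_by(species) %>%
  mutate(mean_mass = mean(mass)) %>%
  ungroup()
```

`per_group` has one row per species, while `per_row` keeps all five rows with the group mean repeated within each species.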