Язык R

и его применение в биоинформатике

Лекция 10

Анастасия Жарикова

3 ноября 2023

Задание №1

Смоделируйте скошенное распределение любым способом (положите в вектор)

Визуализируйте распределение, выбрав наиболее оптимальный тип графика

Решение 1.1

set.seed(123)
a <- c(rnorm(10000), rnorm(500, 4, 3), rnorm(500, 6, 5))

Решение 1.2

ggplot(mapping = aes(x = a)) + 
  geom_histogram(fill = '#598392', color = '#01161e') + 
  theme_bw()

Усеченное среднее

mean(a)

[1] 0.4604595

mean(a, trim = .2)

[1] 0.1117887

Что такое среднее с параметром trim = 0.5?

Асимметрия и эксцесс

skew(a)

[1] 3.24231

kurtosi(a)

[1] 15.81482

Асимметрия и эксцесс: примеры

Сводное описание

summary(a)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-7.9186 -0.6191  0.0928  0.4605  0.8825 20.0108

describe(a) # library(psych)

   vars     n mean   sd median trimmed  mad   min   max range skew kurtosis
X1    1 11000 0.46 2.18   0.09    0.13 1.11 -7.92 20.01 27.93 3.24    15.81
     se
X1 0.02

describe + group_by/summarise

iris %>%
  group_by(Species) %>%
  summarise(describe(Petal.Width))

# A tibble: 3 × 14
  Species  vars     n  mean    sd median trimmed   mad   min   max range    skew
  <fct>   <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1 setosa      1    50 0.246 0.105    0.2   0.238 0       0.1   0.6   0.5  1.18  
2 versic…     1    50 1.33  0.198    1.3   1.32  0.222   1     1.8   0.8 -0.0293
3 virgin…     1    50 2.03  0.275    2     2.03  0.297   1.4   2.5   1.1 -0.122 
# ℹ 2 more variables: kurtosis <dbl>, se <dbl>

library(skimr)

Data summary
Name	iris
Number of rows	150
Number of columns	5
_______________________
Column type frequency:
factor	1
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Species	0	1	FALSE	3	set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Sepal.Length	1	5.84	0.83	4.3	5.1	5.80	6.4	7.9	▆▇▇▅▂
Sepal.Width	1	3.06	0.44	2.0	2.8	3.00	3.3	4.4	▁▆▇▂▁
Petal.Length	1	3.76	1.77	1.0	1.6	4.35	5.1	6.9	▇▁▆▇▂
Petal.Width	1	1.20	0.76	0.1	0.3	1.30	1.8	2.5	▇▁▇▅▃

qqplot

set.seed(1)
my_df <- data.frame(y=rexp(200, rate=3))
ggplot(my_df, mapping = aes(x = y)) + geom_histogram(fill = '#598392', color = '#01161e') + theme_bw()

qqplot

ggplot(my_df, aes(sample=y)) +
  stat_qq() + 
  stat_qq_line() + theme_bw()

qqplot

Задание 2

Отличается ли в среднем длина лепестков у ирисов от 4 см?

Сначала рисуйте!

Нарисуем!

ggplot(iris) +
  geom_histogram(aes(x = Petal.Length, fill = Species)) + theme_bw()

Какие идеи?

Подготовим данные

df <- iris %>%
  filter(Species %in% c('versicolor', 'virginica'))

Решение 2.1

t.test(df$Petal.Length)


    One Sample t-test

data:  df$Petal.Length
t = 59.425, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 4.742187 5.069813
sample estimates:
mean of x 
    4.906

Что не так?

Какая нулевая гипотеза?

Решение 2.2

t.test(df$Petal.Length, mu = 4)


    One Sample t-test

data:  df$Petal.Length
t = 10.974, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 4
95 percent confidence interval:
 4.742187 5.069813
sample estimates:
mean of x 
    4.906

Без интерпретации результата - ответа нет!

Неверная интерпретация - ответа нет!

Вывод результата теста

t.test(df$Petal.Length, mu = 4) -> x
x$p.value

[1] 8.307922e-19

Задание 3

Больше ли в среднем длина лепестков у ирисов, чем 3 см?

Какая нулевая гипотеза?

Решение 3.1

t.test(df$Petal.Length, mu = 3, alternative = 'greater')


    One Sample t-test

data:  df$Petal.Length
t = 23.087, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 3
95 percent confidence interval:
 4.768922      Inf
sample estimates:
mean of x 
    4.906

Задание 4

Различаются ли в среднем длины лепестков у ирисов видов versicolor и virginica?

Какая нулевая гипотеза?

Решение 4.1

ggplot(df) + geom_boxplot(aes(Sepal.Length, fill = Species)) + theme_bw()

Теперь визуализируйте правильно

Решение 4.2

ggerrorplot(df, x = "Species", y = "Petal.Length", 
            desc_stat = "mean_sd", color = "black",
            add = "jitter", add.params = list(color = "darkgray"), size = 0.8
            )

Решение 4.3

a <- filter(df, Species == 'versicolor')
b <- filter(df, Species == 'virginica')

t.test(a$Petal.Length, b$Petal.Length)


    Welch Two Sample t-test

data:  a$Petal.Length and b$Petal.Length
t = -12.604, df = 95.57, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.49549 -1.08851
sample estimates:
mean of x mean of y 
    4.260     5.552

Решение 4.4

Лучше

t.test(Petal.Length ~ Species, df)


    Welch Two Sample t-test

data:  Petal.Length by Species
t = -12.604, df = 95.57, p-value < 2.2e-16
alternative hypothesis: true difference in means between group versicolor and group virginica is not equal to 0
95 percent confidence interval:
 -1.49549 -1.08851
sample estimates:
mean in group versicolor  mean in group virginica 
                   4.260                    5.552

Решение 4.5

Теперь визуализируем еще лучше

ggbarplot(df, x = "Species", y = "Petal.Length", 
          fill = ("Species"), 
          add = c('mean', 'jitter'),
          palette = c("#9e2a2b", "#62929e")) +
  stat_compare_means(method = "t.test")

Задание 5

Пусть виды ирисов будут “до” обработки супер-средством и “после”.

Влияет ли супер-средство на длину лепестков ирисов?

Решение 5.1

t.test(Petal.Length ~ Species, df, paired = T)


    Paired t-test

data:  Petal.Length by Species
t = -12.091, df = 49, p-value = 2.562e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -1.506744 -1.077256
sample estimates:
mean difference 
         -1.292

Решение 5.2

Визуализируем

ggpaired(df, x = "Species", y = "Petal.Length", 
          fill = ("Species"), 
          add = c('mean', 'jitter'),
          palette = c("#9e2a2b", "#62929e"),
         line.color = "gray", line.size = 0.4) +
  stat_compare_means(method = "t.test")

t.test(Petal.Length ~ Species, df, paired = T)$p.value -> a

ggpaired(df, x = "Species", y = "Petal.Length", 
          fill = ("Species"), 
          add = c('mean', 'jitter'),
          palette = c("#9e2a2b", "#62929e"),
         line.color = "gray", line.size = 0.4) + 
annotate("text", x=1.3, y=7.5, 
         label = paste('p.value (t.test): ', a, sep = ''), size = 4)

immer

Урожайность ячменя в 1931(Y1) и 1932(Y2) годах.

Задание 6

Различается ли урожайность?

Смотрим

library(MASS)
head(immer)

  Loc Var    Y1    Y2
1  UF   M  81.0  80.7
2  UF   S 105.4  82.3
3  UF   V 119.7  80.4
4  UF   T 109.7  87.2
5  UF   P  98.3  84.2
6   W   M 146.6 100.4

Поправим данные

im <- immer %>%
  pivot_longer(Y1:Y2, names_to = 'year', values_to = 'value')

Рисуем

im %>% 
  ggplot() + geom_boxplot(aes(value, fill = year)) + theme_bw()

Еще рисуем

im %>%
  ggplot(aes(x = year, y = value)) + 
  geom_bar(position='dodge', stat='summary', fun='mean', 
                      fill = 'lightblue', color = 'brown') + 
  geom_jitter(width = 0.25) + theme_bw()

Решение 6.1

wilcox.test(value ~ year, im, paired = F)


    Wilcoxon rank sum test with continuity correction

data:  value by year
W = 589, p-value = 0.04059
alternative hypothesis: true location shift is not equal to 0

Задание 7

Есть ли зависимость между длиной лепестка и длиной чашелистика у ирисов?

Рисуем

ggplot(data = iris) + 
  geom_point(aes(x = Sepal.Length, y = Petal.Length)) + theme_bw()

cor(iris$Sepal.Length, iris$Petal.Length)

[1] 0.8717538

Пирсона или Спирмена?

cor(iris$Sepal.Length, iris$Petal.Length)

[1] 0.8717538

cor(iris$Sepal.Length, iris$Petal.Length, method = 'spearman')

[1] 0.8818981

А значимость?

cor.test(iris$Sepal.Length, iris$Petal.Length)


    Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8270363 0.9055080
sample estimates:
      cor 
0.8717538

cor.test(iris$Sepal.Length, iris$Petal.Length, method = 'spearman')


    Spearman's rank correlation rho

data:  iris$Sepal.Length and iris$Petal.Length
S = 66429, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.8818981

Матрица корреляций

cor(mtcars)

            mpg        cyl       disp         hp        drat         wt
mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
            qsec         vs          am       gear        carb
mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

Рисуем

corrplot(cor(mtcars), col=colorRampPalette(c("#1b4965","white","#c1121f"))(200))

Конец!