Hands-On Data Processing in R: The ratings Data from the languageR Package

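This article uses the ratings data set from the languageR package to walk through basic data-processing operations in R: selecting columns, computing grouped statistics, and plotting.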

1. Selecting and saving data

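library(languageR)
data(ratings)

# Keep only the four columns used in this article
new_ratings <- ratings[, c('Word', 'Frequency', 'Complex', 'Class')]
# Replace 'path/to' with the target directory; row.names = FALSE omits the row index column
write.csv(new_ratings, file = 'path/to/new_ratings.csv', row.names = FALSE)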
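
A quick way to confirm that the file was written as intended is to read it back and compare it with the original. The following is a minimal sketch in base R; the data frame `toy` and its values are made up for illustration, and a temporary file stands in for the real path.

```r
# Hypothetical toy data standing in for new_ratings (values are made up).
toy <- data.frame(
  Word      = c("almond", "ant", "apple"),
  Frequency = c(4.2, 5.3, 7.0),
  Complex   = c("simplex", "simplex", "simplex"),
  Class     = c("plant", "animal", "plant"),
  stringsAsFactors = FALSE
)

# Write to a temporary file instead of a real path.
out_file <- tempfile(fileext = ".csv")
write.csv(toy, file = out_file, row.names = FALSE)

# Read the file back; because row.names = FALSE was used when writing,
# no extra index column appears.
toy_back <- read.csv(out_file, stringsAsFactors = FALSE)

print(names(toy_back))
print(isTRUE(all.equal(toy, toy_back)))
```

Note that factor columns such as Class and Complex come back from read.csv as plain character vectors by default in R 4.0 and later, so they may need to be converted with factor() again after loading.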

2. Grouped statistics

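library(dplyr)
library(tidyr)

# Mean and standard deviation of Frequency for each Class x Complex combination
ratings_mean_sd <- ratings %>%
  group_by(Class, Complex) %>%
  summarize(mean_freq = mean(Frequency), sd_freq = sd(Frequency),
            .groups = 'drop') %>%
  # One row per Class, one mean/sd column per level of Complex
  pivot_wider(names_from = Complex, values_from = c(mean_freq, sd_freq)) %>%
  # round() fails on a data frame that contains a factor column,
  # so round only the numeric columns
  mutate(across(where(is.numeric), ~ round(.x, 2)))

ratings_mean_sd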

Output:

In the ratings data, Class takes the values animal and plant, and Complex takes the values complex and simplex. The resulting table therefore has one row per Class and five columns: Class, mean_freq_complex, mean_freq_simplex, sd_freq_complex, and sd_freq_simplex.
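
As a cross-check on the grouped summary, the same kind of Class by Complex table can be sketched in base R with tapply(). The data frame `toy` below is hypothetical and its values are made up; it only mirrors the column names used above.

```r
# Hypothetical toy data (made-up values), not the real ratings data.
toy <- data.frame(
  Class     = c("animal", "animal", "animal", "animal",
                "plant", "plant", "plant", "plant"),
  Complex   = c("complex", "complex", "simplex", "simplex",
                "complex", "complex", "simplex", "simplex"),
  Frequency = c(2, 4, 5, 7, 1, 3, 6, 10)
)

# tapply() with two grouping vectors returns a Class x Complex matrix.
groups    <- list(toy$Class, toy$Complex)
mean_freq <- round(tapply(toy$Frequency, groups, mean), 2)
sd_freq   <- round(tapply(toy$Frequency, groups, sd), 2)

print(mean_freq)
print(sd_freq)
```

Each matrix has the Class values as rows and the Complex values as columns, which is the same layout pivot_wider() produces, except that means and standard deviations sit in two separate matrices rather than in one table.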

3. Boxplots

library(ggplot2)

ggplot(ratings, aes(x = Class, y = Frequency, fill = Complex)) +
  geom_boxplot() +
  labs(title = 'Boxplot of Frequency by Class and Complexity',
       x = 'Class', y = 'Frequency', fill = 'Complex')

Output:

[Figure: boxplot of Frequency by Class, filled by Complex]
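
To see which numbers a boxplot summarizes, base R's boxplot.stats() returns the whisker ends, hinges, and median, along with any points beyond the whiskers. This is a minimal sketch on a made-up vector `x`; it illustrates the base R convention, and geom_boxplot() computes its hinges as quartiles, which can differ slightly from these values.

```r
# Hypothetical toy vector (made-up values).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Default coef = 1.5: whiskers reach at most 1.5 box lengths beyond the hinges.
s <- boxplot.stats(x)

# s$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(s$stats)

# s$out: values beyond the whiskers, drawn as individual points
print(s$out)
```

Here the value 30 lies above the upper whisker and is reported in s$out, so a boxplot would draw it as a separate point rather than stretch the whisker to reach it.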

4. Scatterplots

# Scatterplot of singular against plural frequency, all words
ggplot(ratings, aes(x = FreqSingular, y = FreqPlural)) +
  geom_point() +
  labs(title = 'Scatterplot of FreqSingular vs FreqPlural',
       x = 'FreqSingular', y = 'FreqPlural')

# Drop words whose FreqPlural lies more than 2 standard deviations from the mean
ratings_no_outliers <- ratings %>%
  filter(abs(FreqPlural - mean(FreqPlural)) <= 2*sd(FreqPlural))

# Same scatterplot after removing those words
ggplot(ratings_no_outliers, aes(x = FreqSingular, y = FreqPlural)) +
  geom_point() +
  labs(title = 'Scatterplot of FreqSingular vs FreqPlural (Outliers Removed)',
       x = 'FreqSingular', y = 'FreqPlural')

Output:

[Figures: scatterplot of the full data; scatterplot with outliers removed]
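
The two-standard-deviation rule used in the filter above can be traced by hand on a small vector. This is a minimal sketch in base R; the vector `x` is made up, and the names `center`, `spread`, and `keep` are illustrative only.

```r
# Hypothetical toy vector (made-up values) with one extreme observation.
x <- c(10, 12, 11, 13, 12, 11, 10, 95)

center <- mean(x)  # pulled upward by the extreme value
spread <- sd(x)    # also inflated by the extreme value

# Same condition as in the filter() call: keep values within 2 SD of the mean.
keep <- abs(x - center) <= 2 * spread

print(x[!keep])  # values treated as outliers
print(x[keep])   # values retained
```

Because the extreme value itself raises both the mean and the standard deviation, the cutoff becomes more lenient as outliers grow more numerous. A median and mad() based cutoff is one common robust alternative, though it is a different rule from the one used in this article.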