R Data Analysis: Sampling and Descriptive Analysis
1. Read the IncomeESL.txt data
data <- read.table('IncomeESL.txt', header = TRUE, stringsAsFactors = TRUE) # keep text columns as factors (not the default on R >= 4.0.0)
2. Data type of each column
The sapply function can be used to report the class of every column:
sapply(data, class)
The output is:
Education Seniority Income Dual_Incomes
'integer' 'integer' 'integer' 'integer'
Married Gender Work_Type_1 Work_Type_2
'integer' 'factor' 'integer' 'integer'
Work_Type_3 Work_Type_4 Work_Type_5 Work_Type_6
'integer' 'integer' 'integer' 'integer'
Education_Cat
'factor'
Education, Seniority, Income, Dual_Incomes, and Married are integer columns, as are Work_Type_1 through Work_Type_6. Gender and Education_Cat are categorical and are stored as factors. On R 4.0.0 or later, read.table returns text columns as character unless stringsAsFactors = TRUE is passed.
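As a quick check on the type report, the sketch below uses a small made-up data frame (a stand-in, not IncomeESL.txt) to show how a text column that was read as character can be converted to a factor. The name toy and its values are assumptions for illustration only.

```r
# Toy data frame standing in for the real file (values are made up).
toy <- data.frame(
  Income = c(32000L, 45000L, 28000L, 51000L),
  Gender = c("F", "M", "M", "F"),
  stringsAsFactors = FALSE
)

# Column classes before conversion: Gender is plain character.
classes_before <- sapply(toy, class)

# Convert every character column to a factor so that summary() reports counts.
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], factor)

classes_after <- sapply(toy, class)
print(classes_before)
print(classes_after)
```

Converting after the read is one option; passing stringsAsFactors = TRUE to read.table achieves the same result at load time.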
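3. Draw a sample with replacement
The sample function can draw rows with replacement. Here the new sample has the same number of rows as the original data: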
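set.seed(123) # set the random seed so the draw is reproducible
n <- nrow(data) # number of rows in the original data
sample_idx <- sample(seq_len(n), n, replace = TRUE) # draw row indices with replacement
sample_data <- data[sample_idx, ] # build the resampled data set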
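Because the draw is with replacement, some rows appear several times in the new sample and others not at all. The sketch below illustrates this with a hypothetical n of 1000 (not the size of the real file) and compares the observed share of distinct rows with the theoretical expectation 1 - (1 - 1/n)^n, which approaches 1 - exp(-1), about 0.632, for large n.

```r
set.seed(123)                      # make the draw reproducible
n <- 1000                          # hypothetical size, not taken from the file
idx <- sample(seq_len(n), n, replace = TRUE)

# Fraction of original rows that appear at least once in the resample.
frac_unique <- length(unique(idx)) / n

# Theoretical expectation: 1 - (1 - 1/n)^n, which tends to 1 - exp(-1).
expected <- 1 - (1 - 1 / n)^n

cat("observed:", round(frac_unique, 3), " expected:", round(expected, 3), "\n")
```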
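4. Descriptive analysis and plots for the sample
The summary function gives descriptive statistics for the sample, and hist plots the distribution of Income: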
summary(sample_data)
hist(sample_data$Income, breaks = 20, main = 'Histogram of Income in Sample Data')
The output and plot are as follows:
Education       Seniority        Income          Dual_Incomes
Min.   : 7.00   Min.   : 0.000   Min.   : 5000   Min.   :0.0000
1st Qu.:12.00   1st Qu.:13.000   1st Qu.:24000   1st Qu.:0.0000
Median :14.00   Median :22.000   Median :36000   Median :0.0000
Mean   :13.33   Mean   :19.922   Mean   :36044   Mean   :0.4833
3rd Qu.:15.00   3rd Qu.:27.000   3rd Qu.:46000   3rd Qu.:1.0000
Max.   :20.00   Max.   :99.000   Max.   :96000   Max.   :1.0000

Married          Gender   Work_Type_1      Work_Type_2     Work_Type_3
Min.   :0.0000   F:194    Min.   :0.0000   Min.   :0.000   Min.   :0.000
1st Qu.:0.0000   M:206    1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.000
Median :1.0000            Median :0.0000   Median :0.000   Median :0.000
Mean   :0.5667            Mean   :0.3975   Mean   :0.135   Mean   :0.067
3rd Qu.:1.0000            3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:0.000
Max.   :1.0000            Max.   :1.0000   Max.   :1.000   Max.   :1.000

Work_Type_4     Work_Type_5     Work_Type_6     Education_Cat
Min.   :0.000   Min.   :0.000   Min.   :0.000   A: 52
1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   B: 99
Median :0.000   Median :0.000   Median :0.000   C:113
Mean   :0.158   Mean   :0.123   Mean   :0.096   D:136
3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:0.000   E:  0
Max.   :1.000   Max.   :1.000   Max.   :1.000   F:  0

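The summary statistics and histogram show that most sample incomes (Income) fall roughly between 20,000 and 50,000 (the first and third quartiles are 24,000 and 46,000) and that the distribution is right-skewed, with a tail extending to the maximum of 96,000.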
5. Descriptive analysis and plots for the original data
In the same way, summary and hist can be applied to the original data:
summary(data)
hist(data$Income, breaks = 20, main = 'Histogram of Income in Original Data')
The output and plot are as follows:
Education       Seniority        Income          Dual_Incomes
Min.   : 7.00   Min.   : 0.000   Min.   : 5000   Min.   :0.0000
1st Qu.:12.00   1st Qu.: 9.000   1st Qu.:19000   1st Qu.:0.0000
Median :14.00   Median :17.000   Median :32000   Median :0.0000
Mean   :13.33   Mean   :15.599   Mean   :32499   Mean   :0.4550
3rd Qu.:15.00   3rd Qu.:24.000   3rd Qu.:44000   3rd Qu.:1.0000
Max.   :20.00   Max.   :99.000   Max.   :96000   Max.   :1.0000

Married          Gender   Work_Type_1      Work_Type_2     Work_Type_3
Min.   :0.0000   F:415    Min.   :0.0000   Min.   :0.000   Min.   :0.000
1st Qu.:0.0000   M:385    1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.000
Median :1.0000            Median :0.0000   Median :0.000   Median :0.000
Mean   :0.6017            Mean   :0.3967   Mean   :0.134   Mean   :0.066
3rd Qu.:1.0000            3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:0.000
Max.   :1.0000            Max.   :1.0000   Max.   :1.000   Max.   :1.000

Work_Type_4     Work_Type_5     Work_Type_6     Education_Cat
Min.   :0.000   Min.   :0.000   Min.   :0.000   A:107
1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   B:160
Median :0.000   Median :0.000   Median :0.000   C:153
Mean   :0.157   Mean   :0.125   Mean   :0.093   D:141
3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:0.000   E: 39
Max.   :1.000   Max.   :1.000   Max.   :1.000   F:  0

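The summary statistics and histogram of the original data look similar to those of the sample: most incomes (Income) again fall roughly between 20,000 and 50,000 (the first and third quartiles are 19,000 and 44,000), and the distribution is right-skewed.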
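Two histograms are easier to compare when they share the same bins and sit side by side. The sketch below shows one way to do that. It uses synthetic log-normal incomes as a stand-in for data$Income, so the variable names and values are assumptions rather than the real data.

```r
set.seed(123)
# Synthetic right-skewed incomes standing in for data$Income (made-up values).
income <- round(rlnorm(500, meanlog = 10.3, sdlog = 0.5))
resampled <- sample(income, length(income), replace = TRUE)

# Shared break points so the two histograms are directly comparable.
breaks <- seq(min(income), max(income), length.out = 21)

old_par <- par(mfrow = c(1, 2))    # two panels side by side
h_orig <- hist(income, breaks = breaks, main = "Original", xlab = "Income")
h_samp <- hist(resampled, breaks = breaks, main = "Resample", xlab = "Income")
par(old_par)                       # restore the previous layout
```

With the real data, income would be replaced by data$Income and resampled by sample_data$Income.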
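6. Are the variables distributed the same way in the original data and in the sample?
Comparing the summary statistics and histograms, the variables are distributed broadly alike in the original data and in the sample. Education and the work-type indicators match almost exactly (for example, the Work_Type_1 mean is 0.3967 in the original data and 0.3975 in the sample), while the sample means of Income (36044 versus 32499) and Seniority (19.922 versus 15.599) are somewhat higher. Because the sample is random, a different draw would give a different distribution, so several samples should be compared before drawing a firm conclusion.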
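Comparing several samples can be sketched as a repeated resampling loop. The example below again uses synthetic incomes in place of data$Income (an assumption, as is the choice of 1000 repetitions) and records the mean of each resample to show how much that statistic varies from draw to draw.

```r
set.seed(123)
# Synthetic incomes standing in for data$Income (made-up values).
income <- round(rlnorm(500, meanlog = 10.3, sdlog = 0.5))

# Draw B resamples with replacement and record the mean of each.
B <- 1000
boot_means <- replicate(B, mean(sample(income, length(income), replace = TRUE)))

# Spread of the resample means around the original mean.
original_mean <- mean(income)
boot_se <- sd(boot_means)                       # bootstrap standard error
ci <- quantile(boot_means, c(0.025, 0.975))     # percentile interval

cat("original mean:", round(original_mean), "\n")
cat("bootstrap SE:", round(boot_se), "\n")
print(round(ci))
```

If the mean of a single resample lies well inside this interval, the gap from the original mean is consistent with ordinary sampling variation.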