1. Read the IncomeESL.txt data

data <- read.table('IncomeESL.txt', header = TRUE, stringsAsFactors = TRUE) # read text columns as factors (not the default since R 4.0.0)

2. Data type of each column

The sapply function shows the class of each column:

sapply(data, class)

Output:

    Education     Seniority        Income  Dual_Incomes
    "integer"     "integer"     "integer"     "integer"
      Married        Gender   Work_Type_1   Work_Type_2
    "integer"      "factor"     "integer"     "integer"
  Work_Type_3   Work_Type_4   Work_Type_5   Work_Type_6
    "integer"     "integer"     "integer"     "integer"
Education_Cat
     "factor"

Education, Seniority, Income, Dual_Incomes, and Married are integer columns; Gender and Education_Cat are factors (categorical data); Work_Type_1 through Work_Type_6 are also integer columns.
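Whether a text column such as Gender shows up as a factor depends on how the file is read. The following is a minimal sketch of that behaviour, assuming a small made-up file in place of IncomeESL.txt; the values and the temporary file are hypothetical.

```r
# Toy data standing in for IncomeESL.txt (hypothetical values)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Education Income Gender",
             "12 30000 F",
             "16 52000 M",
             "14 41000 F"), tmp)

# stringsAsFactors = FALSE (the default since R 4.0.0) keeps text as character
as_char <- read.table(tmp, header = TRUE, stringsAsFactors = FALSE)
# stringsAsFactors = TRUE converts text columns to factors
as_fact <- read.table(tmp, header = TRUE, stringsAsFactors = TRUE)

classes_char <- sapply(as_char, class)
classes_fact <- sapply(as_fact, class)
print(classes_char)
print(classes_fact)
```

With factors, summary() reports counts per level; with character columns it reports only length, class, and mode.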

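3. Draw a sample with replacement

The sample function can draw a sample with replacement; here the new sample has the same number of rows as the original data: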

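set.seed(123) # set the random seed for reproducibility
n <- nrow(data) # number of rows in the original data
sample_idx <- sample(seq_len(n), n, replace = TRUE) # draw row indices with replacement
sample_data <- data[sample_idx, ] # extract the new sample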
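To make the effect of replace = TRUE concrete, here is a small sketch on a made-up ten-row data frame (the toy data and its values are hypothetical). It shows that the resample keeps the original size while some rows are drawn more than once and others are not drawn at all.

```r
set.seed(123)  # fixed seed so the draw is reproducible

# Toy data frame standing in for the real data (hypothetical values)
toy <- data.frame(id = 1:10, Income = seq(10000, 100000, by = 10000))

n <- nrow(toy)
idx <- sample(seq_len(n), size = n, replace = TRUE)  # indices may repeat
boot <- toy[idx, ]

n_unique <- length(unique(idx))  # original rows that appear at least once
print(table(idx))                # how often each original row was drawn
print(n_unique)
```

For large n, roughly 63.2% (that is, 1 - 1/e) of the original rows are expected to appear at least once in such a resample.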

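4. Descriptive analysis and plots for the sample

The summary and hist functions give descriptive statistics and a view of the distribution for the sample data: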

summary(sample_data)
hist(sample_data$Income, breaks = 20, main = 'Histogram of Income in Sample Data')

Output (histogram not shown):

  Education        Seniority          Income     Dual_Incomes    
 Min.   : 7.00   Min.   : 0.000   Min.   : 5000   Min.   :0.0000  
 1st Qu.:12.00   1st Qu.:13.000   1st Qu.:24000   1st Qu.:0.0000  
 Median :14.00   Median :22.000   Median :36000   Median :0.0000  
 Mean   :13.33   Mean   :19.922   Mean   :36044   Mean   :0.4833  
 3rd Qu.:15.00   3rd Qu.:27.000   3rd Qu.:46000   3rd Qu.:1.0000  
 Max.   :20.00   Max.   :99.000   Max.   :96000   Max.   :1.0000  
    Married        Gender       Work_Type_1     Work_Type_2    Work_Type_3   
 Min.   :0.0000   F:194   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
 1st Qu.:0.0000   M:206   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.000  
 Median :1.0000           Median :0.0000   Median :0.000   Median :0.000  
 Mean   :0.5667           Mean   :0.3975   Mean   :0.135   Mean   :0.067  
 3rd Qu.:1.0000           3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:0.000  
 Max.   :1.0000           Max.   :1.0000   Max.   :1.000   Max.   :1.000  
   Work_Type_4     Work_Type_5     Work_Type_6 Education_Cat 
 Min.   :0.000   Min.   :0.000   Min.   :0.000   A: 52        
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   B: 99        
 Median :0.000   Median :0.000   Median :0.000   C:113        
 Mean   :0.158   Mean   :0.123   Mean   :0.096   D:136        
 3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:0.000   E:  0        
 Max.   :1.000   Max.   :1.000   Max.   :1.000   F:  0  

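The summary statistics and histogram show that most incomes (Income) in the sample fall between 20,000 and 50,000, with the first and third quartiles at 24,000 and 46,000, and that the distribution is right-skewed, with a tail reaching 96,000.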
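The kind of reading made above can also be checked numerically. This is a minimal sketch on made-up values (the toy data frame is hypothetical, not taken from IncomeESL.txt): quartiles locate the middle half of the data, a mean above the median hints at right skew, and proportions summarise a categorical column.

```r
# Toy stand-in for sample_data (hypothetical values)
toy <- data.frame(
  Income = c(12000, 18000, 24000, 30000, 36000, 42000, 60000, 96000),
  Gender = factor(c("F", "M", "F", "M", "M", "F", "F", "M"))
)

# Quartiles bound the range covering the middle half of the values
q <- quantile(toy$Income, probs = c(0.25, 0.5, 0.75))

# A mean above the median is a quick hint of right skew
mean_above_median <- mean(toy$Income) > median(toy$Income)

# Counts and proportions suit categorical columns better than quantiles
gender_share <- prop.table(table(toy$Gender))

print(q)
print(mean_above_median)
print(gender_share)
```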

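5. Descriptive analysis and plots for the original data

In the same way, the summary and hist functions give descriptive statistics and a view of the distribution for the original data: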

summary(data)
hist(data$Income, breaks = 20, main = 'Histogram of Income in Original Data')

Output (histogram not shown):

  Education        Seniority          Income     Dual_Incomes    
 Min.   : 7.00   Min.   : 0.000   Min.   : 5000   Min.   :0.0000  
 1st Qu.:12.00   1st Qu.: 9.000   1st Qu.:19000   1st Qu.:0.0000  
 Median :14.00   Median :17.000   Median :32000   Median :0.0000  
 Mean   :13.33   Mean   :15.599   Mean   :32499   Mean   :0.4550  
 3rd Qu.:15.00   3rd Qu.:24.000   3rd Qu.:44000   3rd Qu.:1.0000  
 Max.   :20.00   Max.   :99.000   Max.   :96000   Max.   :1.0000  
    Married        Gender       Work_Type_1     Work_Type_2    Work_Type_3   
 Min.   :0.0000   F:415   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
 1st Qu.:0.0000   M:385   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.000  
 Median :1.0000           Median :0.0000   Median :0.000   Median :0.000  
 Mean   :0.6017           Mean   :0.3967   Mean   :0.134   Mean   :0.066  
 3rd Qu.:1.0000           3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:0.000  
 Max.   :1.0000           Max.   :1.0000   Max.   :1.000   Max.   :1.000  
   Work_Type_4     Work_Type_5     Work_Type_6 Education_Cat 
 Min.   :0.000   Min.   :0.000   Min.   :0.000   A:107        
 1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000   B:160        
 Median :0.000   Median :0.000   Median :0.000   C:153        
 Mean   :0.157   Mean   :0.125   Mean   :0.093   D:141        
 3rd Qu.:0.000   3rd Qu.:0.000   3rd Qu.:0.000   E: 39        
 Max.   :1.000   Max.   :1.000   Max.   :1.000   F:  0  

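The summary statistics and histogram of the original data show a distribution similar to that of the sample: most incomes (Income) fall between roughly 20,000 and 50,000, with the first and third quartiles at 19,000 and 44,000, and the distribution is again right-skewed.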
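Two histograms are easier to compare when they share the same bins and sit side by side. The sketch below assumes simulated right-skewed incomes in place of data$Income (the values are hypothetical) and uses only base graphics.

```r
set.seed(123)

# Simulated right-skewed incomes standing in for data$Income (hypothetical)
orig <- round(rlnorm(500, meanlog = 10.3, sdlog = 0.5))
samp <- sample(orig, length(orig), replace = TRUE)

# Shared breaks make the bars of the two panels directly comparable
brks <- pretty(range(orig), n = 20)

old_par <- par(mfrow = c(1, 2))  # two panels in one row
h_orig <- hist(orig, breaks = brks, main = "Original", xlab = "Income")
h_samp <- hist(samp, breaks = brks, main = "Sample", xlab = "Income")
par(old_par)                     # restore the previous layout
```

Because every resampled value also occurs in the original vector, breaks computed from the original data always cover the sample.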

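6. Are the variables distributed the same way in the original data and in the sample?

Comparing the summary statistics and histograms, the variables have broadly the same distributions in the original data and in the sample. Education and the Work_Type indicators match closely, while the Income and Seniority quartiles and means of the sample are somewhat higher in the output above. Because the sample is random, a different sample gives a different distribution, so several samples should be compared before drawing a stable conclusion.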
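Comparing several samples, as suggested above, can be sketched as follows. The income values are simulated and the number of resamples B is an arbitrary choice; both are assumptions for illustration only.

```r
set.seed(123)

# Simulated incomes standing in for data$Income (hypothetical values)
income <- round(rlnorm(800, meanlog = 10.3, sdlog = 0.5))

# Draw B resamples with replacement and record the mean of each one
B <- 200
boot_means <- replicate(B, mean(sample(income, length(income), replace = TRUE)))

# Spread of the resampled means, to set against the original mean
orig_mean <- mean(income)
boot_range <- quantile(boot_means, probs = c(0.025, 0.975))

print(orig_mean)
print(boot_range)
```

If the mean of a single sample falls well inside this range, its difference from the original mean is consistent with ordinary sampling variation.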

R Data Analysis: Sampling and Descriptive Analysis
