Spark SQL column-count mismatch on insert: target table has 3 columns, but the inserted data has 4
When inserting data with Spark SQL, you may encounter the following error:
Exception in thread "main"
org.apache.spark.sql.AnalysisException: `deploy_uedsy`.`test_spark_sql` requires that the data to be inserted have the same number of columns as the target table: target table has 3 column(s) but the inserted data has 4 column(s), including 1 partition column(s) having constant value(s).;
at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:341)
at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:373)
at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion$$anonfun$apply$3.applyOrElse(rules.scala:368)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.apply(rules.scala:368)
at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.apply(rules.scala:328)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:124)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:118)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:103)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
at com.startdt.spark.simbaspark.SparkMr.main(SparkMr.java:146)
The error message says that the number of columns in the inserted data does not match the target table: the target table has 3 columns, but the inserted data has 4, one of which is a partition column with a constant value.
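A common way to hit exactly this message is a static-partition INSERT whose SELECT list still includes the partition column: when the partition value is given as a constant in the PARTITION clause, Spark counts that constant toward the inserted columns, so the SELECT must supply only the non-partition columns. A hypothetical sketch (the table and column names here are made up for illustration, not taken from the original report):

```sql
-- Hypothetical partitioned target table: 2 data columns + 1 partition column = 3 columns
CREATE TABLE demo_partitioned (id INT, name STRING) PARTITIONED BY (dt STRING);

-- Fails: the constant dt already counts as one inserted column, so this
-- 3-column SELECT yields 4 inserted columns for a 3-column table
INSERT INTO demo_partitioned PARTITION (dt = '2023-01-01')
SELECT id, name, age FROM source_table;

-- Works: supply only the non-partition columns in the SELECT
INSERT INTO demo_partitioned PARTITION (dt = '2023-01-01')
SELECT id, name FROM source_table;
```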
Solution:
The inserted data must have the same number of columns as the target table. The following checks usually resolve the problem:
- Verify that the column count of the inserted data matches the target table.
- If the insert involves a partition column, make sure it lines up with the target table's partition column; when the partition value is supplied as a constant in the PARTITION clause, do not supply that column again in the data.
- Use `select * from your_table` (or `DESCRIBE your_table`) to inspect the target table's columns.
- Use `select * from your_dataframe` to inspect the columns of the data to be inserted.
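The checks above can be sketched as a small pre-flight arity test run before issuing the INSERT. This is a minimal pure-Python sketch; the helper name and schemas are hypothetical, not a Spark API:

```python
def check_insert_arity(target_columns, data_columns, static_partition_columns=()):
    """Return True if the insert column count matches the target table.

    Partition columns given constant values in the PARTITION clause count
    toward the inserted columns, mirroring how Spark's analyzer counts them.
    """
    inserted = len(data_columns) + len(static_partition_columns)
    return inserted == len(target_columns)

# Target table has 3 columns; the data supplies 4 -> mismatch
print(check_insert_arity(["id", "name", "age"], ["id", "name", "age", "gender"]))  # False

# After dropping the extra column, the counts line up
print(check_insert_arity(["id", "name", "age"], ["id", "name", "age"]))  # True
```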
Example:
Suppose the target table test_spark_sql has 3 columns (id, name, age), while the inserted data has 4 (id, name, age, gender). To fix this, drop the gender column from the inserted data. Alternatively, if the target table really does have a gender column you cannot populate, fill it with a constant such as null.
Code example (PySpark):
from pyspark.sql.functions import lit

# Drop the extra gender column so the DataFrame matches the 3-column table
df = df.drop('gender')

# Alternatively, if the target table does have a gender column, fill it with null
df = df.withColumn('gender', lit(None))
With these adjustments, the number of inserted columns matches the target table's column count, and the error no longer occurs.
Original article: https://www.cveoy.top/t/topic/okr9. All rights reserved by the author. Please do not repost or scrape.