Hudi Testing

Posted by 集君 on 2024-07-22

Test Environment

  • minio-8.0.10 http://192.168.137.100:32000/minio/bigdata/
  • spark-operator-1.1.26
  • spark-history-server 3.2.2 http://192.168.137.100:32627/

Test Cases

Case: hudi-spark-test001

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: hudi-spark-test001
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "umr/spark:3.2.2_v2"
  imagePullPolicy: IfNotPresent
  mainClass: cc.hudi.HoodieSparkQuickstart
  mainApplicationFile: "s3a://bigdatas/jars/bigdataDemo-1.0-SNAPSHOT.jar"
  sparkVersion: "3.2.2"
  timeToLiveSeconds: 259200
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.2.2
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.2.2
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  sparkConf:
    spark.ui.port: "4045"
    spark.eventLog.enabled: "true"
    spark.eventLog.dir: "s3a://sparklogs/all"
    spark.hadoop.fs.s3a.access.key: "minio"
    spark.hadoop.fs.s3a.secret.key: "minio123"
    spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
    spark.hadoop.fs.s3a.endpoint: "http://10.19.64.205:32000"
    spark.hadoop.fs.s3a.connection.ssl.enabled: "false"
    spark.hadoop.fs.s3a.path.style.access: "true"
    spark.hadoop.fs.s3a.aws.credentials.provider: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
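
To submit the case, apply the manifest and let the Spark Operator create the driver. A minimal sketch, assuming the manifest above is saved as hudi-spark-test001.yaml (the filename is an assumption):

kubectl apply -f hudi-spark-test001.yaml
# Watch the application state reported by the operator
kubectl -n spark get sparkapplication hudi-spark-test001
# Follow driver logs once the driver pod is running
kubectl -n spark logs -f hudi-spark-test001-driver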

Errors:

  1. Insufficient ClusterRole permissions: the service account's ClusterRole is missing the persistentvolumeclaims permission:
23/07/07 06:26:54 ERROR Utils: Uncaught exception in thread main
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/spark-operator/persistentvolumeclaims?labelSelector=spark-app-selector%3Dspark-a9d7e8f78bc6459c9282db57a02815d9. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. persistentvolumeclaims is forbidden: User "system:serviceaccount:spark-operator:spark-operator" cannot list resource "persistentvolumeclaims" in API group "" in the namespace "spark-operator".

Fix the permissions:

# Partial example
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - persistentvolumeclaims
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - '*'
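
For context, a minimal sketch of the full ClusterRole plus its binding to the service account named in the Forbidden message; the object names are assumptions, so match them to the ClusterRole already installed for the operator (e.g. via kubectl edit clusterrole spark-operator):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: spark-operator        # assumed name; use your existing ClusterRole
rules:
- apiGroups: [""]
  resources: ["pods", "persistentvolumeclaims", "configmaps"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: spark-operator
subjects:
- kind: ServiceAccount
  name: spark-operator        # account from the error message
  namespace: spark-operator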

  2. Case hudi-spark-test001 error:

23/07/10 01:20:41 WARN SparkSession: Cannot use org.apache.spark.sql.hudi.HoodieSparkSessionExtension to configure session extensions.
java.lang.ClassNotFoundException: org.apache.spark.sql.hudi.HoodieSparkSessionExtension
	at java.base/java.net.URLClassLoader.findClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Unknown Source)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
	at org.apache.spark.sql.SparkSession$.$anonfun$applyExtensions$1(SparkSession.scala:1194)
	at org.apache.spark.sql.SparkSession$.$anonfun$applyExtensions$1$adapted(SparkSession.scala:1192)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$applyExtensions(SparkSession.scala:1192)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:956)
	at cc.utils.HoodieExampleSparkUtils.buildSparkSession(HoodieExampleSparkUtils.java:60)
	at cc.utils.HoodieExampleSparkUtils.defaultSparkSession(HoodieExampleSparkUtils.java:53)
	at cc.hudi.HoodieSparkQuickstart.main(HoodieSparkQuickstart.java:39)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
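
org.apache.spark.sql.hudi.HoodieSparkSessionExtension ships in the Hudi Spark bundle, so this error usually means the bundle jar is on neither the driver nor the executor classpath. A hedged sketch of one way to supply it through the SparkApplication spec; the jar location and bundle version are assumptions:

spec:
  deps:
    jars:
      # assumed path/version; use a hudi-spark3.2-bundle matching your Spark version
      - "s3a://bigdatas/jars/hudi-spark3.2-bundle_2.12-0.12.3.jar"

Alternatively, bake the bundle jar into the umr/spark:3.2.2_v2 image under $SPARK_HOME/jars so every pod sees it without an extra download.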

  3. GitHub source build error:

[ERROR] Failed to execute goal on project hudi-utilities_2.12: Could not resolve dependencies for project org.apache.hudi:hudi-utilities_2.12:jar:0.14.0-SNAPSHOT: The following artifacts could not be resolved: io.confluent:kafka-avro-serializer:jar:5.3.4, io.confluent:common-config:jar:5.3.4, io.confluent:common-utils:jar:5.3.4, io.confluent:kafka-schema-registry-client:jar:5.3.4: io.confluent:kafka-avro-serializer:jar:5.3.4 was not found in http://10.41.31.10:9081/repository/maven-public/ during a previous attempt. This failure was cached in the local repository and resolution is not reattempted until the update interval of chinaunicom has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :hudi-utilities_2.12
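
The io.confluent artifacts are not published to Maven Central; they come from the Confluent repository, and here the internal mirror has also cached the failed lookup. A hedged sketch of one way to recover, assuming you can pass flags to the build (the mirror configuration itself is site-specific):

# -U forces Maven to recheck artifacts whose failed resolution was cached
mvn clean install -DskipTests -U -rf :hudi-utilities_2.12

# If the mirror cannot proxy Confluent, declare the repository, e.g. in settings.xml:
# <repository>
#   <id>confluent</id>
#   <url>https://packages.confluent.io/maven/</url>
# </repository>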

References:

Guide for building from the GitHub source:

Developer Setup | Apache Hudi

Searching community-reported issues:

Apache Hudi - ASF JIRA
