Hitting a Wall with Spark + MongoDB

Posted by weixin_34321977 on 2017-05-29

While experimenting with Spark + MongoDB, I kept running into a `Cursor xxxxx not found` error. Adding `keep_alive_ms` and a `pipeline` did not solve the problem either.

The total data volume is currently only around 10,000 documents, loaded from MongoDB and handed to Spark's NaiveBayes for training.

    # The pipeline is passed as a JSON string (the original double-brace literal is not valid Python)
    pipeline = "[{ '$limit': 5000 }, { '$skip': 2000 }]"
    has_train = spark_session.read.format("com.mongodb.spark.sql.DefaultSource") \
        .option("spark.mongodb.input.uri", "mongodb://mongo_and_spark_server:27017/resume_db.has_train") \
        .option("spark.mongodb.keep_alive_ms", "3600000") \
        .option("pipeline", pipeline) \
        .load()
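If the pipeline stages are built in Python anyway, one way to avoid hand-writing the JSON string is `json.dumps` — a minimal sketch, assuming the connector accepts the pipeline as JSON text:

```python
import json

# Build the aggregation pipeline as Python dicts, then serialize to the
# JSON string form expected by the "pipeline" option.
stages = [{"$limit": 5000}, {"$skip": 2000}]
pipeline = json.dumps(stages)
print(pipeline)  # [{"$limit": 5000}, {"$skip": 2000}]
```

This keeps the stages as normal Python data, so they can be validated or reordered before submission.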

Before 1.6, we had to deploy third-party jar files manually and point Spark at their paths; when experimenting with 2.1, these packages are downloaded automatically via `--packages`:

# Note: spark-submit only honors the last --files flag, so all files go in one comma-separated list
./spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
    --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 \
    --py-files ./utility.py \
    --files ./dicts/statistic_college.txt,./dicts/degrees.txt,./dicts/diming.txt,./dicts/subjects.txt,./dicts/training_org.txt \
    naive_bayes.py

Output:

# ./submit.sh 
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/home/pluto/spark/spark-2.1.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found org.mongodb.spark#mongo-spark-connector_2.11;2.0.0 in central
    found org.mongodb#mongo-java-driver;3.2.2 in central
:: resolution report :: resolve 221ms :: artifacts dl 4ms
    :: modules in use:
    org.mongodb#mongo-java-driver;3.2.2 from central in [default]
    org.mongodb.spark#mongo-spark-connector_2.11;2.0.0 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/7ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

As the data volume grows, I always hit the following error, and I have not yet tracked down the root cause :(

17/05/29 01:23:16 INFO MongoClientCache: Closing MongoClient: [mongo_and_spark_server:27017]
17/05/29 01:23:16 INFO connection: Closed connection [connectionId{localValue:2, serverValue:42}] to mongo_and_spark_server:27017 because the pool has been closed.
17/05/29 01:23:16 INFO MongoClientCache: Closing MongoClient: [mongo_and_spark_server:27017]
17/05/29 01:23:16 INFO connection: Closed connection [connectionId{localValue:4, serverValue:46}] to mongo_and_spark_server:27017 because the pool has been closed.
17/05/29 01:27:56 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 5, mongo_and_spark_server, executor 2): com.mongodb.MongoCursorNotFoundException: Query failed with error code -5 and error message 'Cursor 34611963569 not found on server mongo_and_spark_server:27017' on server mongo_and_spark_server:27017
    at com.mongodb.operation.QueryHelper.translateCommandException(QueryHelper.java:27)
    at com.mongodb.operation.QueryBatchCursor.getMore(QueryBatchCursor.java:215)
    at com.mongodb.operation.QueryBatchCursor.hasNext(QueryBatchCursor.java:103)
    at com.mongodb.MongoBatchCursorAdapter.hasNext(MongoBatchCursorAdapter.java:46)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:120)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
    at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
    at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)

17/05/29 01:27:56 INFO TaskSetManager: Starting task 3.1 in stage 2.0 (TID 6, mongo_and_spark_server, executor 2, partition 3, ANY, 6787 bytes)
17/05/29 01:29:01 WARN TaskSetManager: Lost task 2.0 in stage 2.0 (TID 4, mongo_and_spark_server, executor 0): com.mongodb.MongoCursorNotFoundException: Query failed with error code -5 and error message 'Cursor 34615739977 not found on server mongo_and_spark_server:27017' on server mongo_and_spark_server:27017
    at com.mongodb.operation.QueryHelper.translateCommandException(QueryHelper.java:27)
    at com.mongodb.operation.QueryBatchCursor.getMore(QueryBatchCursor.java:215)
    at com.mongodb.operation.QueryBatchCursor.hasNext(QueryBatchCursor.java:103)
    at com.mongodb.MongoBatchCursorAdapter.hasNext(MongoBatchCursorAdapter.java:46)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:120)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
    at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
    at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)

17/05/29 01:29:01 INFO TaskSetManager: Starting task 2.1 in stage 2.0 (TID 7, mongo_and_spark_server, executor 2, partition 2, ANY, 6799 bytes)
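One likely explanation for the `MongoCursorNotFoundException` above: the MongoDB server reaps cursors that sit idle longer than `cursorTimeoutMillis` (default 600000 ms, i.e. 10 minutes), and a stalled Spark task can easily exceed that. Note also that, per the connector's configuration docs, `mongodb.keep_alive_ms` controls how long a cached `MongoClient` is kept for sharing, not the server's cursor timeout, so setting it as a read option would not prevent this. A sketch of raising the server-side timeout with pymongo (the host and timeout value mirror the ones used above; adjust to your deployment):

```python
from pymongo import MongoClient  # assumes pymongo is installed


def raise_cursor_timeout(uri="mongodb://mongo_and_spark_server:27017",
                         timeout_ms=3600000):
    """Raise MongoDB's server-side idle-cursor timeout.

    The default is 600000 ms (10 minutes); idle cursors past that are
    killed, which is exactly what produces 'Cursor ... not found'.
    """
    client = MongoClient(uri)
    # setParameter is an admin command; cursorTimeoutMillis applies server-wide.
    client.admin.command({"setParameter": 1, "cursorTimeoutMillis": timeout_ms})
```

This is a server-wide setting, so it trades cursor safety for longer-lived resources; shrinking the connector's partition size so each task drains its cursor faster is an alternative worth testing.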

References:

https://docs.mongodb.com/spark-connector/v2.0/configuration/#spark-input-conf
https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/
http://www.mongoing.com/tj/mongodb_shanghai_spark
