1. Introduction to Spark Structured Streaming
As we all know, Spark Streaming (the DStream API) entered maintenance mode after v2.4.5: no new major releases are planned for it. Moreover, Spark Streaming has always processed streaming data as micro-batches, so it can only achieve near real-time processing; it cannot process data in true real time the way Flink does. After Spark Streaming stopped evolving, Spark introduced Structured Streaming to compete with Flink. Structured Streaming supports windowed computation, watermarks, state, and other Flink-like features.
2. Connecting Spark Structured Streaming to a Kafka Data Source
Jar dependency:

```
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.2.0
```
1. Reading data from the Kafka source
```scala
// Subscribe to 1 topic
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to 1 topic, with headers
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
  .as[(String, String, Array[(String, Array[Byte])])]

// Subscribe to multiple topics
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to a pattern
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
```
2. Creating a Kafka Source for Batch Queries
```scala
// Subscribe to 1 topic, defaults to the earliest and latest offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to multiple topics, specifying explicit Kafka offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to a pattern, at the earliest and latest offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
```
Each row read from the source has the following schema:

Column | Type |
---|---|
key | binary |
value | binary |
topic | string |
partition | int |
offset | long |
timestamp | timestamp |
timestampType | int |
headers (optional) | array |
Required options:
Option | value | meaning |
---|---|---|
assign | json string {"topicA":[0,1],"topicB":[2,4]} | Specific TopicPartitions to consume. Only one of "assign", "subscribe" or "subscribePattern" options can be specified for the Kafka source. |
subscribe | A comma-separated list of topics | The topic list to subscribe to. Only one of "assign", "subscribe" or "subscribePattern" options can be specified for the Kafka source. |
subscribePattern | Java regex string | The pattern used to subscribe to topic(s). Only one of "assign", "subscribe" or "subscribePattern" options can be specified for the Kafka source. |
kafka.bootstrap.servers | A comma-separated list of host:port | The Kafka "bootstrap.servers" configuration. |
Optional configurations:
Option | value | default | query type | meaning |
---|---|---|---|---|
startingTimestamp | timestamp string e.g. "1000" | none (next preference is startingOffsetsByTimestamp) | streaming and batch | The start point of timestamp when a query is started, a string specifying a starting timestamp for all partitions in topics being subscribed. Please refer to the details on timestamp offset options below. If Kafka doesn't return the matched offset, the behavior will follow the value of the option startingOffsetsByTimestampStrategy. Note: For streaming queries, this only applies when a new query is started; resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest. |
startingOffsetsByTimestamp | json string """ {"topicA":{"0": 1000, "1": 1000}, "topicB": {"0": 2000, "1": 2000}} """ | none (next preference is startingOffsets) | streaming and batch | The start point of timestamp when a query is started, a json string specifying a starting timestamp for each TopicPartition. Please refer to the details on timestamp offset options below. If Kafka doesn't return the matched offset, the behavior will follow the value of the option startingOffsetsByTimestampStrategy. Note: For streaming queries, this only applies when a new query is started; resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest. |
startingOffsets | "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """ | "latest" for streaming, "earliest" for batch | streaming and batch | The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started; resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest. |
endingTimestamp | timestamp string e.g. "1000" | none (next preference is endingOffsetsByTimestamp) | batch query | The end point when a batch query is ended, a string specifying an ending timestamp for all partitions in topics being subscribed. Please refer to the details on timestamp offset options below. If Kafka doesn't return the matched offset, the offset will be set to latest. |
endingOffsetsByTimestamp | json string """ {"topicA":{"0": 1000, "1": 1000}, "topicB": {"0": 2000, "1": 2000}} """ | none (next preference is endingOffsets) | batch query | The end point when a batch query is ended, a json string specifying an ending timestamp for each TopicPartition. Please refer to the details on timestamp offset options below. If Kafka doesn't return the matched offset, the offset will be set to latest. |
endingOffsets | latest or json string {"topicA":{"0":23,"1":-1},"topicB":{"0":-1}} | latest | batch query | The end point when a batch query is ended, either "latest" which refers to the latest offsets, or a json string specifying an ending offset for each TopicPartition. In the json, -1 as an offset can be used to refer to latest, and -2 (earliest) as an offset is not allowed. |
failOnDataLoss | true or false | true | streaming and batch | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected. |
kafkaConsumer.pollTimeoutMs | long | 120000 | streaming and batch | The timeout in milliseconds to poll data from Kafka in executors. When not defined it falls back to spark.network.timeout. |
fetchOffset.numRetries | int | 3 | streaming and batch | Number of times to retry before giving up fetching Kafka offsets. |
fetchOffset.retryIntervalMs | long | 10 | streaming and batch | Milliseconds to wait before retrying to fetch Kafka offsets. |
maxOffsetsPerTrigger | long | none | streaming query | Rate limit on the maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume. |
minOffsetsPerTrigger | long | none | streaming query | Minimum number of offsets to be processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume. Note, if maxTriggerDelay is exceeded, a trigger will be fired even if the number of available offsets doesn't reach minOffsetsPerTrigger. |
maxTriggerDelay | time with units | 15m | streaming query | Maximum amount of time for which a trigger can be delayed between two triggers provided some data is available from the source. This option is only applicable if minOffsetsPerTrigger is set. |
minPartitions | int | none | streaming and batch | Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions into smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data. |
groupIdPrefix | string | spark-kafka-source | streaming and batch | Prefix of consumer group identifiers (group.id) that are generated by Structured Streaming queries. If "kafka.group.id" is set, this option will be ignored. |
kafka.group.id | string | none | streaming and batch | The Kafka group id to use in the Kafka consumer while reading from Kafka. Use this with caution. By default, each query generates a unique group id for reading data. This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer, and therefore can read all of the partitions of its subscribed topics. In some scenarios (for example, Kafka group-based authorization), you may want to use a specific authorized group id to read data. You can optionally set the group id. However, do this with extreme caution as it can cause unexpected behavior. Concurrently running queries (both batch and streaming) or sources with the same group id are likely to interfere with each other, causing each query to read only part of the data. This may also occur when queries are started/restarted in quick succession. To minimize such issues, set the Kafka consumer session timeout (by setting option "kafka.session.timeout.ms") to be very small. When this is set, option "groupIdPrefix" will be ignored. |
includeHeaders | boolean | false | streaming and batch | Whether to include the Kafka headers in the row. |
startingOffsetsByTimestampStrategy | "error" or "latest" | "error" | streaming and batch | The strategy used when the specified starting offset by timestamp (either global or per partition) doesn't match the offset Kafka returned. "error": fail the query; end users have to deal with workarounds requiring manual steps. "latest": assigns the latest offset for these partitions, so that Spark can read newer records from these partitions in further micro-batches. |
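Putting the timestamp-based options together, here is a minimal sketch of starting a stream from a point in time. Topic name, bootstrap servers, and the epoch-millisecond timestamp are placeholders, not values from this article:

```scala
// Sketch: start reading each partition from the first offset whose record
// timestamp is >= the given epoch-millisecond value. Names are placeholders.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  // one timestamp (in milliseconds) applied to every subscribed partition
  .option("startingTimestamp", "1650000000000")
  // if Kafka has no offset matching the timestamp, fall back to latest
  // instead of failing the query (the default strategy is "error")
  .option("startingOffsetsByTimestampStrategy", "latest")
  .load()
```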
Consumer Caching
It’s time-consuming to initialize Kafka consumers, especially in streaming scenarios where processing time is a key factor. Because of this, Spark pools Kafka consumers on executors, by leveraging Apache Commons Pool.
The caching key is built up from the following information:
- Topic name
- Topic partition
- Group ID
The following properties are available to configure the consumer pool:
Property Name | Default | Meaning | Since Version |
---|---|---|---|
spark.kafka.consumer.cache.capacity | 64 | The maximum number of consumers cached. Please note that it's a soft limit. | 3.0.0 |
spark.kafka.consumer.cache.timeout | 5m (5 minutes) | The minimum amount of time a consumer may sit idle in the pool before it is eligible for eviction by the evictor. | 3.0.0 |
spark.kafka.consumer.cache.evictorThreadRunInterval | 1m (1 minute) | The interval of time between runs of the idle evictor thread for consumer pool. When non-positive, no idle evictor thread will be run. | 3.0.0 |
spark.kafka.consumer.cache.jmx.enable | false | Enable or disable JMX for pools created with this configuration instance. Statistics of the pool are available via JMX instance. The prefix of JMX name is set to "kafka010-cached-simple-kafka-consumer-pool". | 3.0.0 |
The size of the pool is limited by spark.kafka.consumer.cache.capacity, but it works as a "soft limit" so as not to block Spark tasks. The idle eviction thread periodically removes consumers that have been unused for longer than the given timeout. If the capacity is reached when borrowing, the pool tries to remove the least-used entry that is currently not in use. If none can be removed, the pool will keep growing. In the worst case, the pool will grow to the maximum number of concurrent tasks that can run in the executor (that is, the number of task slots).
If a task fails for any reason, the new task is executed with a newly created Kafka consumer for safety reasons. At the same time, all consumers in the pool that have the same caching key are invalidated, to remove the consumer that was used in the failed execution. Consumers being used by other tasks will not be closed, but will also be invalidated when they are returned to the pool.

Along with consumers, Spark separately pools the records fetched from Kafka, to keep Kafka consumers stateless from Spark's point of view and to maximize the efficiency of pooling. This pool uses the same cache key as the Kafka consumer pool. Note that it does not leverage Apache Commons Pool due to the difference in characteristics.
The following properties are available to configure the fetched data pool:
Property Name | Default | Meaning | Since Version |
---|---|---|---|
spark.kafka.consumer.fetchedData.cache.timeout | 5m (5 minutes) | The minimum amount of time a fetched data may sit idle in the pool before it is eligible for eviction by the evictor. | 3.0.0 |
spark.kafka.consumer.fetchedData.cache.evictorThreadRunInterval | 1m (1 minute) | The interval of time between runs of the idle evictor thread for fetched data pool. When non-positive, no idle evictor thread will be run. | 3.0.0 |
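The pool properties above are ordinary Spark configs, so they can be set when building the session. A minimal sketch with illustrative (not recommended) values:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: tune the Kafka consumer pool; capacity and timeout values here
// are arbitrary examples, not tuning advice.
val spark = SparkSession.builder()
  .appName("kafka-pool-tuning")
  .config("spark.kafka.consumer.cache.capacity", "128") // soft limit on cached consumers
  .config("spark.kafka.consumer.cache.timeout", "10m")  // idle time before eviction
  .getOrCreate()
```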
Important note:

```scala
// "kafka.auto.offset.reset" -> "latest",  // reset offsets to latest/earliest -- not supported by Structured Streaming
"auto.offset.reset" -> "latest",  // has no effect as written; the native Kafka name would have to be prefixed as kafka.auto.offset.reset
// "kafka.enable.auto.commit" -> "true",  // if true, the consumer's offsets are auto-committed in the background -- not supported by Structured Streaming; use .option("startingOffsets", "latest") instead
"enable.auto.commit" -> "true",  // has no effect as written; the native Kafka name would have to be prefixed as kafka.enable.auto.commit
```
Some other optional parameters, for example when connecting to Kafka over SSL:
```scala
"kafka.security.protocol" -> "SSL",
"kafka.ssl.keystore.location" -> "/dbfs/FileStore/xxxxx/client_keystore.jks",
"kafka.ssl.keystore.password" -> "xxxxx",
"kafka.ssl.key.password" -> "xxxxxx",
"kafka.ssl.truststore.location" -> "/dbfs/FileStore/xxxxx/client_truststore.jks",
"kafka.ssl.truststore.password" -> "kafka",
"kafka.ssl.endpoint.identification.algorithm" -> "",
"kafka.request.timeout.ms" -> "100000",
"kafka.max.poll.records" -> "100",
"kafka.fetch.max.bytes" -> "12428800"
```
1. You can use `.option("maxOffsetsPerTrigger", "30000")` to cap the number of records read from Kafka per trigger interval.
2. `.option("startingOffsets", "latest")` only takes effect on the very first run, before any offsets have been committed. If offsets were committed previously, consumption resumes from the last committed offset rather than from latest.
3. Remember that Structured Streaming cannot actually honor `"enable.auto.commit" -> "true"`; offsets are maintained by Structured Streaming itself. When writing the stream to a sink, set `.option("checkpointLocation", "")`, much like Flink's checkpoint. Once set, Structured Streaming takes care of maintaining and committing offsets under the checkpointLocation path.
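The three notes above can be combined into one sketch. Paths and topic names are placeholders, and the console sink is used only for illustration:

```scala
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  // honored only on the very first run; afterwards offsets come from the checkpoint
  .option("startingOffsets", "latest")
  // caps records processed per trigger (per trigger interval, not per second)
  .option("maxOffsetsPerTrigger", "30000")
  .load()

df.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  // Structured Streaming commits offsets here itself; enable.auto.commit is never needed
  .option("checkpointLocation", "/tmp/checkpoints/topic1-console")
  .start()
```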
3. Writing streaming data to a Kafka sink
Kafka Sink for Streaming
```scala
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
val ds = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .start()

// Write key-value data from a DataFrame to Kafka using a topic specified in the data
val ds = df
  .selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .start()
```
Writing the output of Batch Queries to Kafka
```scala
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()

// Write key-value data from a DataFrame to Kafka using a topic specified in the data
df.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .save()
```
The DataFrame being written to Kafka should have the following columns in its schema:
Column | Type |
---|---|
key (optional) | string or binary |
value (required) | string or binary |
headers (optional) | array |
topic (*optional) | string |
partition (optional) | int |
\* The topic column is required if the "topic" configuration option is not specified.

The value column is the only required column. If a key column is not specified, a null-valued key column will be added automatically (see the Kafka semantics on how null-valued keys are handled). If a topic column exists, its value is used as the topic when writing the given row to Kafka, unless the "topic" configuration option is set; i.e., the "topic" configuration option overrides the topic column. If a "partition" column is not specified (or its value is null), then the partition is calculated by the Kafka producer. A Kafka partitioner can be specified in Spark by setting the `kafka.partitioner.class` option; if not present, the Kafka default partitioner will be used.
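As a sketch of the per-row routing described above, a DataFrame can carry its own topic and key columns. The input columns `level`, `id`, and `payload` and the topic names are hypothetical:

```scala
import org.apache.spark.sql.functions._

// Route each row to "errors" or "events" based on its content; Kafka's
// producer picks the partition, since no partition column is provided.
val out = df
  .withColumn("topic",
    when(col("level") === "ERROR", lit("errors")).otherwise(lit("events")))
  .selectExpr("topic", "CAST(id AS STRING) AS key", "CAST(payload AS STRING) AS value")

out.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  // note: no .option("topic", ...) here -- that would override the topic column
  .option("checkpointLocation", "/tmp/checkpoints/router")
  .start()
```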
The following options must be set for the Kafka sink for both batch and streaming queries.
Option | value | meaning |
---|---|---|
kafka.bootstrap.servers | A comma-separated list of host:port | The Kafka "bootstrap.servers" configuration. |
The following configurations are optional:
Option | value | default | query type | meaning |
---|---|---|---|---|
topic | string | none | streaming and batch | Sets the topic that all rows will be written to in Kafka. This option overrides any topic column that may exist in the data. |
includeHeaders | boolean | false | streaming and batch | Whether to include the Kafka headers in the row. |
Producer Caching
Since the Kafka producer instance is designed to be thread-safe, Spark initializes a single Kafka producer instance and shares it across tasks that have the same caching key.
The caching key is built up from the following information:
- Kafka producer configuration
This includes configuration for authorization, which Spark will automatically include when a delegation token is being used. Even taking authorization into account, you can expect the same Kafka producer instance to be used for the same producer configuration. A different producer is used when the delegation token is renewed; the producer instance for the old delegation token is evicted according to the cache policy.
The following properties are available to configure the producer pool:
Property Name | Default | Meaning | Since Version |
---|---|---|---|
spark.kafka.producer.cache.timeout | 10m (10 minutes) | The minimum amount of time a producer may sit idle in the pool before it is eligible for eviction by the evictor. | 2.2.1 |
spark.kafka.producer.cache.evictorThreadRunInterval | 1m (1 minute) | The interval of time between runs of the idle evictor thread for producer pool. When non-positive, no idle evictor thread will be run. | 3.0.0 |
The idle eviction thread periodically removes producers that have been unused for longer than the given timeout. Note that the producer is shared and used concurrently, so the last-used timestamp is determined by the moment the producer instance is returned and its reference count drops to 0.
Kafka Specific Configurations
Kafka's own configurations can be set via `DataStreamReader.option` with the `kafka.` prefix, e.g. `stream.option("kafka.bootstrap.servers", "host:port")`. For possible Kafka parameters, see the Kafka consumer config docs for parameters related to reading data, and the Kafka producer config docs for parameters related to writing data.
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
- group.id: Kafka source will create a unique group id for each query automatically. The user can set the prefix of the automatically generated group.id's via the optional source option `groupIdPrefix`, default value "spark-kafka-source". You can also set "kafka.group.id" to force Spark to use a special group id; however, please read the warnings for this option and use it with caution.
- auto.offset.reset: Set the source option `startingOffsets` to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it. This ensures that no data is missed when new topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new streaming query is started; resuming will always pick up from where the query left off. Note that when the offsets consumed by a streaming application no longer exist in Kafka (e.g., topics are deleted, offsets are out of range, or offsets are removed after the retention period), the offsets will not be reset and the streaming application will see data loss. In extreme cases, for example when the throughput of the streaming application cannot keep up with Kafka's retention, the input rows of a batch might gradually shrink to zero once the batch's offset ranges are completely absent from Kafka. Enabling the `failOnDataLoss` option asks Structured Streaming to fail the query in such cases.
- key.deserializer: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the keys.
- value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
- key.serializer: Keys are always serialized with ByteArraySerializer or StringSerializer. Use DataFrame operations to explicitly serialize the keys into either strings or byte arrays.
- value.serializer: Values are always serialized with ByteArraySerializer or StringSerializer. Use DataFrame operations to explicitly serialize the values into either strings or byte arrays.
- enable.auto.commit: Kafka source doesn't commit any offset.
- interceptor.classes: Kafka source always reads keys and values as byte arrays. It's not safe to use ConsumerInterceptor as it may break the query.
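Because keys and values always arrive as byte arrays, deserialization is done with DataFrame operations rather than Kafka deserializers. A sketch assuming the value holds JSON with hypothetical fields `id` and `amount`:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Hypothetical payload schema -- replace with the schema of your own messages.
val schema = new StructType()
  .add("id", StringType)
  .add("amount", DoubleType)

// Cast the raw bytes to strings, then parse the value as JSON.
val parsed = df
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS json")
  .select(col("key"), from_json(col("json"), schema).as("data"))
  .select("key", "data.id", "data.amount")
```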
3. Connecting Spark Structured Streaming to an Azure Event Hubs Data Source
Jar dependency:
```xml
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-eventhubs-spark_${scala.binary.version}</artifactId>
    <version>2.3.21</version>
    <scope>compile</scope>
</dependency>
```
1. Reading data from the Event Hubs source
```scala
import java.time.Duration

val eventHubsConf = EventHubsConf("xxxx") // connection string
  // https://github.com/Azure/azure-event-hubs-spark/blob/master/docs/structured-streaming-eventhubs-integration.md#eventhubsconf
  .setConsumerGroup("xxxx")
  .setMaxEventsPerTrigger(3000)
  // .setStartingPosition(EventPosition.fromEndOfStream) // or EventPosition.fromOffset("246812")
  // .setStartingPosition(EventPosition.fromStartOfStream)
  .setReceiverTimeout(Duration.ofSeconds(30)) // setReceiverTimeout expects a java.time.Duration

val df = spark
  .readStream
  .format("eventhubs")
  .options(eventHubsConf.toMap)
  .load()
  .withColumnRenamed("value", "body")
  .select(col("body").cast("string"))
```
```scala
import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition }

// To connect to an Event Hub, EntityPath is required as part of the connection string.
// Here, we assume that the connection string from the Azure portal does not have the EntityPath part.
val connectionString = ConnectionStringBuilder("{EVENT HUB CONNECTION STRING FROM AZURE PORTAL}")
  .setEventHubName("{EVENT HUB NAME}")
  .build

val eventHubsConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEndOfStream)

var streamingInputDF = spark.readStream
  .format("eventhubs")
  .options(eventHubsConf.toMap)
  .load()
```
2. Writing streaming data to Event Hubs
```scala
df.writeStream
  .format("eventhubs")
  .outputMode("append")
  .options(eventHubsConfWrite.toMap)
  .option("checkpointLocation", "xxxxxxxxxxx")
  .trigger(ProcessingTime("25 seconds"))
  .start()
  .awaitTermination()
```
```scala
import org.apache.spark.sql.streaming.Trigger.ProcessingTime

val query = streamingSelectDF
  .writeStream
  .format("parquet")
  .option("path", "/mnt/sample/test-data")
  .option("checkpointLocation", "/mnt/sample/check")
  .partitionBy("zip", "day")
  .trigger(ProcessingTime("25 seconds"))
  .start()
```
```scala
import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition }
import org.apache.spark.sql.streaming.Trigger.ProcessingTime

// The connection string for the Event Hub you will WRITE to.
val connString = "{EVENT HUB CONNECTION STRING}"
val eventHubsConfWrite = EventHubsConf(connString)

val source = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 100)
  .load()
  .withColumnRenamed("value", "body")
  .select($"body" cast "string")

val query = source
  .writeStream
  .format("eventhubs")
  .outputMode("update")
  .options(eventHubsConfWrite.toMap)
  .trigger(ProcessingTime("25 seconds"))
  .option("checkpointLocation", "/checkpoint/")
  .start()
```
JDBC Sink
```scala
import java.sql._

class JDBCSink(url: String, user: String, pwd: String) extends ForeachWriter[(String, String)] {
  val driver = "com.mysql.jdbc.Driver"
  var connection: Connection = _
  var statement: Statement = _

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = DriverManager.getConnection(url, user, pwd)
    statement = connection.createStatement
    true
  }

  def process(value: (String, String)): Unit = {
    // String values must be quoted in the SQL literal
    statement.executeUpdate("INSERT INTO zip_test " +
      "VALUES ('" + value._1 + "','" + value._2 + "')")
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.close
  }
}
```
```scala
val url = "jdbc:mysql://<mysqlserver>:3306/test"
val user = "user"
val pwd = "pwd"

val writer = new JDBCSink(url, user, pwd)
val query = streamingSelectDF
  .writeStream
  .foreach(writer)
  .outputMode("update")
  .trigger(ProcessingTime("25 seconds"))
  .start()
```
For more details, see: https://github.com/Azure/azure-event-hubs-spark/blob/master/docs/structured-streaming-eventhubs-integration.md#eventhubsconf
https://docs.microsoft.com/zh-cn/azure/databricks/_static/notebooks/structured-streaming-event-hubs-integration.html
Note: you can use `.setMaxEventsPerTrigger(3000)` to control how many events are read from Event Hubs per trigger.
4. Spark Structured Streaming output modes
There are a few types of output modes.
- Append mode (default) - This is the default mode, where only the new rows added to the Result Table since the last trigger will be output to the sink. This is supported only for those queries where rows added to the Result Table are never going to change. Hence, this mode guarantees that each row will be output only once (assuming a fault-tolerant sink). For example, queries with only `select`, `where`, `map`, `flatMap`, `filter`, `join`, etc. will support Append mode.
- Complete mode - The whole Result Table will be output to the sink after every trigger. This is supported for aggregation queries.
- Update mode - (Available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger will be output to the sink. More information to be added in future releases.
Different types of streaming queries support different output modes. Here is the compatibility matrix.
Query Type | | Supported Output Modes | Notes |
---|---|---|---|
Queries with aggregation | Aggregation on event-time with watermark | Append, Update, Complete | Append mode uses the watermark to drop old aggregation state. But the output of a windowed aggregation is delayed by the late threshold specified in withWatermark(), as by the mode's semantics rows can be added to the Result Table only once, after they are finalized (i.e. after the watermark is crossed). See the Late Data section for more details. Update mode uses the watermark to drop old aggregation state. Complete mode does not drop old aggregation state, since by definition this mode preserves all data in the Result Table. |
Queries with aggregation | Other aggregations | Complete, Update | Since no watermark is defined (it is only defined in the other category), old aggregation state is not dropped. Append mode is not supported, as aggregates can update, violating the semantics of this mode. |
Queries with mapGroupsWithState | | Update | Aggregations not allowed in a query with mapGroupsWithState. |
Queries with flatMapGroupsWithState | Append operation mode | Append | Aggregations are allowed after flatMapGroupsWithState. |
Queries with flatMapGroupsWithState | Update operation mode | Update | Aggregations not allowed in a query with flatMapGroupsWithState. |
Queries with joins | | Append | Update and Complete mode not supported yet. See the support matrix in the Join Operations section for more details on what types of joins are supported. |
Other queries | | Append, Update | Complete mode not supported, as it is infeasible to keep all unaggregated data in the Result Table. |
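For the first row of the matrix, here is a sketch of an event-time windowed aggregation emitted in Append mode, where a window is output only after the watermark passes its end. The input DataFrame `events` and its columns `eventTime` and `device` are hypothetical:

```scala
import org.apache.spark.sql.functions._

// Count events per device in 5-minute event-time windows, tolerating
// up to 10 minutes of late data before a window's state is dropped.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("device"))
  .count()

counts.writeStream
  .outputMode("append") // each window is emitted once, after the watermark crosses its end
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/windowed")
  .start()
```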
5. Other Spark Structured Streaming sinks
Sink | Supported Output Modes | Options | Fault-tolerant | Notes |
---|---|---|---|---|
File Sink | Append | path: path to the output directory, must be specified. retention: time to live (TTL) for output files; output files whose batches were committed longer ago than the TTL will eventually be excluded from the metadata log, meaning reader queries reading the sink's output directory may not process them. The value can be given as a time string ("12h", "7d", etc.); disabled by default. For file-format-specific options, see the related methods in DataFrameWriter (Scala/Java/Python/R), e.g. for the "parquet" format see DataFrameWriter.parquet(). | Yes (exactly-once) | Supports writes to partitioned tables. Partitioning by time may be useful. |
Kafka Sink | Append, Update, Complete | See the Kafka Integration Guide | Yes (at-least-once) | More details in the Kafka Integration Guide |
Foreach Sink | Append, Update, Complete | None | Yes (at-least-once) | More details in the next section |
ForeachBatch Sink | Append, Update, Complete | None | Depends on the implementation | More details in the next section |
Console Sink | Append, Update, Complete | numRows: number of rows to print every trigger (default: 20). truncate: whether to truncate the output if too long (default: true). | No | |
Memory Sink | Append, Complete | None | No. But in Complete mode, a restarted query will recreate the full table. | Table name is the query name. |
```scala
// ========== DF with no aggregations ==========
val noAggDF = deviceDataDf.select("device").where("signal > 10")

// Print new data to console
noAggDF
  .writeStream
  .format("console")
  .start()

// Write new data to Parquet files
noAggDF
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .option("path", "path/to/destination/dir")
  .start()

// ========== DF with aggregation ==========
val aggDF = df.groupBy("device").count()

// Print updated aggregations to console
aggDF
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Have all the aggregates in an in-memory table
aggDF
  .writeStream
  .queryName("aggregates") // this query name will be the table name
  .outputMode("complete")
  .format("memory")
  .start()

spark.sql("select * from aggregates").show() // interactively query in-memory table
```
Using Foreach and ForeachBatch
The `foreach` and `foreachBatch` operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases: while `foreach` allows custom write logic on every row, `foreachBatch` allows arbitrary operations and custom logic on the output of each micro-batch. Let's understand their usage in more detail.
ForeachBatch
`foreachBatch(...)` allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch, and the unique ID of the micro-batch.
```scala
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Transform and write batchDF
}.start()

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format(...).save(...)  // location 1
  batchDF.write.format(...).save(...)  // location 2
  batchDF.unpersist()
}
```
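One caveat worth noting: `foreachBatch` provides only at-least-once guarantees, because a micro-batch may be re-executed after a failure and the same batch delivered again. The `batchId` parameter can be used to make the write idempotent. A minimal sketch of this pattern, where `readLastCommittedBatchId` and `commitBatchId` are hypothetical helpers you would implement against your own storage:

```scala
import org.apache.spark.sql.DataFrame

// Sketch: skip batches that were already written, by tracking the last
// committed batchId in external storage (helpers below are hypothetical).
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  if (batchId > readLastCommittedBatchId()) {   // hypothetical helper
    batchDF.write.format("parquet").save("path/to/output")
    commitBatchId(batchId)                      // hypothetical helper
  }
  // else: this batch was already committed, so do nothing (idempotent replay)
}.start()
```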
Foreach
If `foreachBatch` is not an option (for example, a corresponding batch data writer does not exist, or you are using continuous processing mode), then you can express your custom writer logic using `foreach`. Specifically, you can express the data-writing logic by dividing it into three methods: `open`, `process`, and `close`. Since Spark 2.4, `foreach` is available in Scala, Java and Python.
```scala
streamingDatasetOfString.writeStream.foreach(
  new ForeachWriter[String] {

    def open(partitionId: Long, version: Long): Boolean = {
      // Open connection
    }

    def process(record: String): Unit = {
      // Write string to connection
    }

    def close(errorOrNull: Throwable): Unit = {
      // Close the connection
    }
  }
).start()

val spark: SparkSession = ...

// Create a streaming DataFrame
val df = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()

// Write the streaming DataFrame to a table
df.writeStream
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .toTable("myTable")

// Check the table result
spark.read.table("myTable").show()

// Transform the source dataset and write to a new table
spark.readStream
  .table("myTable")
  .select("value")
  .writeStream
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .format("parquet")
  .toTable("newTable")

// Check the new table result
spark.read.table("newTable").show()
```
六、The State Store in Spark Structured Streaming
State store is a versioned key-value store which provides both read and write operations. In Structured Streaming, we use the state store provider to handle the stateful operations across batches. There are two built-in state store provider implementations. End users can also implement their own state store provider by extending StateStoreProvider interface.
HDFS state store provider
The HDFS backend state store provider is the default implementation of `StateStoreProvider` and `StateStore`, in which all of the data is first stored in an in-memory map and then backed by files in an HDFS-compatible file system. All updates to the store have to be done in sets transactionally, and each set of updates increments the store's version. These versions can be used to re-execute the updates (by retries in RDD operations) on the correct version of the store, and to regenerate the store version.
RocksDB state store implementation
As of Spark 3.2, a new built-in state store implementation has been added: the RocksDB state store provider.
If you have stateful operations in your streaming query (for example, streaming aggregation, streaming dropDuplicates, stream-stream joins, mapGroupsWithState, or flatMapGroupsWithState) and you want to maintain millions of keys in the state, you may face issues related to large JVM garbage collection (GC) pauses causing high variation in micro-batch processing times. This occurs because, in the HDFSBackedStateStore implementation, the state data is maintained in the JVM memory of the executors, and a large number of state objects puts memory pressure on the JVM, causing high GC pauses.
In such cases, you can choose to use a more optimized state management solution based on RocksDB. Rather than keeping the state in the JVM memory, this solution uses RocksDB to efficiently manage the state in the native memory and the local disk. Furthermore, any changes to this state are automatically saved by Structured Streaming to the checkpoint location you have provided, thus providing full fault-tolerance guarantees (the same as default state management).
To enable the new built-in state store implementation, set `spark.sql.streaming.stateStore.providerClass` to `org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider`.
In Databricks' commercial Spark offering, this can be enabled via spark.conf.set("spark.sql.streaming.stateStore.providerClass", "com.databricks.sql.streaming.state.RocksDBStateStoreProvider"). For more details, see: https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/production#--configure-rocksdb-state-store
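For open-source Spark 3.2+, the provider is typically set when the session is created, so that it applies to all stateful queries. A minimal sketch (the application name is an arbitrary example):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable the RocksDB state store provider at session creation.
// Requires Spark 3.2 or later; the app name is an arbitrary example.
val spark = SparkSession.builder()
  .appName("rocksdb-state-demo")
  .config(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()
```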
Here are the configs regarding the RocksDB instance of the state store provider:
Config Name | Description | Default Value |
---|---|---|
spark.sql.streaming.stateStore.rocksdb.compactOnCommit | Whether to perform a range compaction of the RocksDB instance on commit | False |
spark.sql.streaming.stateStore.rocksdb.blockSizeKB | Approximate size in KB of user data packed per block for a RocksDB BlockBasedTable, which is RocksDB's default SST file format. | 4 |
spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB | The size capacity in MB for a cache of blocks. | 8 |
spark.sql.streaming.stateStore.rocksdb.lockAcquireTimeoutMs | The waiting time in milliseconds for acquiring a lock in the load operation for a RocksDB instance. | 60000 |
spark.sql.streaming.stateStore.rocksdb.resetStatsOnLoad | Whether to reset all ticker and histogram stats for RocksDB on load. | True |
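These can be set like any other Spark SQL configuration. A sketch that enlarges the block cache and lock timeout for a query with very large state (the chosen values are arbitrary examples, not recommendations):

```scala
// Sketch: tune the RocksDB state store; values here are arbitrary examples.
// A larger block cache trades executor memory for fewer disk reads.
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB", "64")
// Allow more time to acquire the instance lock when loading state.
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.lockAcquireTimeoutMs", "120000")
```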
State Store and task locality
The stateful operations store states for events in state stores of executors. State stores occupy resources such as memory and disk space to store the states. So it is more efficient to keep a state store provider running in the same executor across different streaming batches. Changing the location of a state store provider requires the extra overhead of loading checkpointed states. The overhead of loading state from checkpoint depends on the external storage and the size of the state, which tends to hurt the latency of micro-batch run. For some use cases such as processing very large state data, loading new state store providers from checkpointed states can be very time-consuming and inefficient.
The stateful operations in Structured Streaming queries rely on the preferred location feature of Spark’s RDD to run the state store provider on the same executor. If in the next batch the corresponding state store provider is scheduled on this executor again, it could reuse the previous states and save the time of loading checkpointed states.
However, generally the preferred location is not a hard requirement and it is still possible that Spark schedules tasks to the executors other than the preferred ones. In this case, Spark will load state store providers from checkpointed states on new executors. The state store providers run in the previous batch will not be unloaded immediately. Spark runs a maintenance task which checks and unloads the state store providers that are inactive on the executors.
By changing Spark configurations related to task scheduling, for example `spark.locality.wait`, users can configure how long Spark waits to launch a data-local task. For stateful operations in Structured Streaming, this can be used to keep state store providers running on the same executors across batches.
Specifically for the built-in HDFS state store provider, users can check state store metrics such as `loadedMapCacheHitCount` and `loadedMapCacheMissCount`. Ideally, the cache miss count should be minimized: a low miss count means Spark is not wasting much time loading checkpointed state. Users can increase the Spark locality waiting configurations to avoid loading state store providers on different executors across batches.
七、Joins in Spark Structured Streaming
To be continued.
For more details, see: https://spark.apache.org/docs/3.1.2/structured-streaming-programming-guide.html