Flink 1.7.2 DataStream Operator Examples
Source code
- https://github.com/opensourceteams/flink-maven-scala
- https://github.com/opensourceteams/flink-maven-scala/tree/master/src/main/scala/com/opensourceteams/module/bigdata/flink/example/datastream/operator
map
- Applies a function to every element
- Input data
hello
send data
- Program
package com.opensourceteams.module.bigdata.flink.example.stream.operator.map
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream = env.socketTextStream("localhost", port, '\n')
    // Append a suffix to every line
    val dataStreamMap = dataStream.map(x => x + " appended data")
    dataStreamMap.print()
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
1> hello appended data
2> send data appended data
flatMap
- Applies a function to every element and flattens the sub-collection produced for each line into a single collection
- Input data
a b c e f g
- Program
package com.opensourceteams.module.bigdata.flink.example.stream.operator.flatmap
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream = env.socketTextStream("localhost", port, '\n')
    // Split every line into words; each word becomes its own element
    val dataStream2 = dataStream.flatMap(x => x.split(" "))
    dataStream2.print()
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
a
b
c
e
f
g
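The difference between map and flatMap can be modeled with plain Scala collections, without Flink. This is only a sketch: the object name FlatMapSketch and its helpers are made up for illustration, and a List[String] stands in for the lines arriving on the socket.

```scala
object FlatMapSketch {
  // map keeps exactly one output element per input line
  // (here: the list of words of that line).
  def mapWords(lines: List[String]): List[List[String]] =
    lines.map(line => line.split(" ").toList)

  // flatMap splits each line and flattens all words into one collection.
  def flatMapWords(lines: List[String]): List[String] =
    lines.flatMap(line => line.split(" "))
}
```

Calling FlatMapSketch.flatMapWords on two lines returns the words of both lines as one flat list, which is the shape of the output printed by the program above.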
filter
- Keeps only the elements that satisfy a predicate
- Input data
a b c
a c
b b d d
- Program
package com.opensourceteams.module.bigdata.flink.example.stream.operator.filter
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream = env.socketTextStream("localhost", port, '\n')
    // Keep only the lines that contain "a"
    val dataStreamMap = dataStream.filter(x => x.contains("a"))
    dataStreamMap.print()
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
3> a b c
4> a c
keyBy
- Specifies a field as the key; typically used to group the stream by key
- Input data
c a b a
- Program
package com.opensourceteams.module.bigdata.flink.example.stream.operator.sum
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream = env.socketTextStream("localhost", port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      //dataStream.keyBy("someKey") // Key by field "someKey"
      //dataStream.keyBy(0) // Key by the first element of a Tuple
      .timeWindow(Time.seconds(2)) // 2-second tumbling window
      .sum(1)
    dataStream2.print()
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data. With multiple threads the output order is not fixed, but each value is computed by the same rule
- Default parallelism
6> (a,2)
4> (c,1)
2> (b,1)
- With parallelism 1, the observed order is: deduplicate the keys, take the first one, then read the rest from the last backwards; that is, c a b a becomes c a b and is then emitted as c b a
(c,1)
(b,1)
(a,2)
sum
- keyBy specifies a field as the key and groups the stream by it; sum then adds up a field within each key group
- Input data
c a b a
- Program
package com.opensourceteams.module.bigdata.flink.example.stream.operator.sum
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream = env.socketTextStream("localhost", port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      //dataStream.keyBy("someKey") // Key by field "someKey"
      //dataStream.keyBy(0) // Key by the first element of a Tuple
      .timeWindow(Time.seconds(2)) // 2-second tumbling window
      .sum(1)
    dataStream2.print()
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data. With multiple threads the output order is not fixed, but each value is computed by the same rule
- Default parallelism
6> (a,2)
4> (c,1)
2> (b,1)
- With parallelism 1, the observed order is: deduplicate the keys, take the first one, then read the rest from the last backwards; that is, c a b a becomes c a b and is then emitted as c b a
(c,1)
(b,1)
(a,2)
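The keyBy(0) followed by sum(1) step can be mimicked for the words of a single window with plain Scala collections. A minimal sketch, assuming all words fall into one window; KeyedSumSketch and countWords are illustrative names, not part of Flink or of the author's repository.

```scala
object KeyedSumSketch {
  // Models flatMap -> map((_, 1)) -> keyBy(0) -> sum(1) for ONE window.
  def countWords(line: String): Map[String, Int] =
    line.split(" ").toList
      .map(word => (word, 1))   // map((_, 1))
      .groupBy(_._1)            // keyBy(0): group by the first tuple field
      .map { case (key, pairs) => key -> pairs.map(_._2).sum } // sum(1)
}
```

For the input c a b a this yields a count of 2 for a and 1 each for b and c. A Map has no defined order, which mirrors the point that the order of the printed results is not something to rely on.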
reduce
- keyBy specifies a field as the key and groups the stream by it; reduce then combines the elements that share a key with a user-supplied function
- Input data
a b b c
- Program
package com.opensourceteams.module.bigdata.flink.example.stream.operator.reduce
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 2))
      .keyBy(0)
      //dataStream.keyBy("someKey") // Key by field "someKey"
      //dataStream.keyBy(0) // Key by the first element of a Tuple
      .timeWindow(Time.seconds(2)) // 2-second tumbling window
      // Multiply the values of the elements that share a key
      .reduce((a, b) => (a._1, a._2 * b._2))
    dataStream2.print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
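- Output data. With multiple threads the output order is not fixed, but each value is computed by the same rule
- Default parallelism
6> (a,2)
4> (c,2)
2> (b,4)
- With parallelism 1 (as set in the program), the observed order is: deduplicate the keys, take the first one, then read the rest from the last backwards; that is, a b b c becomes a b c and is then emitted as a c b
(a,2)
(c,2)
(b,4)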
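Why b ends up as 4 while a and c stay at 2 can be checked with a small collection-based model of the keyed reduce. This is a sketch under the assumption that all words land in the same window; KeyedReduceSketch and multiplyPerKey are hypothetical names.

```scala
object KeyedReduceSketch {
  // Models map((_, 2)) -> keyBy(0) -> reduce((a, b) => (a._1, a._2 * b._2))
  // for the words of one window.
  def multiplyPerKey(line: String): Map[String, Int] =
    line.split(" ").toList
      .map(word => (word, 2))
      .groupBy(_._1)
      .map { case (key, pairs) =>
        // The reduce function only runs when a key has two or more elements;
        // a key with a single element is emitted unchanged.
        key -> pairs.reduce((a, b) => (a._1, a._2 * b._2))._2
      }
}
```

With the input a b b c, only b occurs twice, so only b goes through the multiplication (2 * 2).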
fold
- Processes elements per key. The first argument is the initial value, a string here, which is placed at the front of each result. The second argument is a function with two parameters: the first is the accumulated value (initially the first argument), and the second is the next element of the current key as the elements are iterated
- Input data
a a b c c
- Program
package com.opensourceteams.module.bigdata.flink.example.stream.operator.fold
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      //dataStream.keyBy("someKey") // Key by field "someKey"
      //dataStream.keyBy(0) // Key by the first element of a Tuple
      .timeWindow(Time.seconds(2)) // 2-second tumbling window
      // Start from the initial string and append every element of the key
      .fold("start string")((str, i) => { str + "-" + i })
    dataStream2.print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
start string-(a,1)-(a,1)
start string-(c,1)-(c,1)
start string-(b,1)
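The roles of the two fold arguments can be seen in a collection-based model that uses foldLeft per key. A minimal sketch assuming one window; KeyedFoldSketch, foldPerKey and the parameter name initial are illustrative only.

```scala
object KeyedFoldSketch {
  // Models keyBy(0) -> fold(initial)((acc, element) => acc + "-" + element)
  // for the words of one window.
  def foldPerKey(line: String, initial: String): Map[String, String] =
    line.split(" ").toList
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (key, pairs) =>
        // acc starts as `initial`; each element of the key is appended in turn.
        key -> pairs.foldLeft(initial)((acc, element) => acc + "-" + element)
      }
}
```

Each key gets its own accumulator that starts from the initial value, which is why the initial string appears once at the front of every result line.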
Aggregations (sum, max, min)
sum
- Processes all elements and adds up the values of elements with the same key
- Input data
a a c b c
- Program
package com.opensourceteams.module.bigdata.flink.example.stream.operator.aggregations.sum
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1) // Set the parallelism; if unset, it defaults to the number of CPU threads (8 on my 4-core, 8-thread machine)
    val dataStream = env.socketTextStream("localhost", port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      //dataStream.keyBy("someKey") // Key by field "someKey"
      //dataStream.keyBy(0) // Key by the first element of a Tuple
      .timeWindow(Time.seconds(2)) // 2-second tumbling window
      .sum(1)
    dataStream2.print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
4> (c,2)
2> (b,1)
6> (a,2)
min (behaves like minBy in this example; in general minBy returns the whole element that holds the minimum, whereas min only returns the minimum of the chosen field)
- Processes all elements and finds the minimum value among elements with the same key
- Input data
b a b a a b
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.aggregations.sum
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1) // Set the parallelism; if unset, it defaults to the number of CPU threads (8 on my 4-core, 8-thread machine)
    val dataStream = env.socketTextStream("localhost", port, '\n')
    // Running counter used as the value; note that each parallel task works on its own copy
    var i = 0
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map(x => {
      i = i + 1
      (x, i)
    })
      .keyBy(0)
      .timeWindow(Time.seconds(2)) // 2-second tumbling window
      .min(1)
    dataStream2.print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
- Final output; the order depends on thread scheduling
2> (b,1)
6> (a,2)
- Intermediate data (the tuples produced by map, before the window)
6> (a,2)
2> (b,1)
6> (a,4)
2> (b,3)
6> (a,5)
2> (b,6)
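With two-field tuples keyed on the first field, min and minBy print the same thing, so the example above cannot tell them apart. A difference only becomes visible when the element carries an extra field. The sketch below models this with plain Scala; the Reading case class and both helper functions are invented for illustration, and keeping the other fields of the first element in the min model is an assumption of this sketch, not a documented guarantee.

```scala
object MinVsMinBySketch {
  // A record with a key, the field being compared, and an extra payload field.
  final case class Reading(key: String, value: Int, tag: String)

  // Model of min(field): only the compared field is replaced by the minimum;
  // the other fields stay those of the first element (assumption of this model).
  def min(readings: List[Reading]): Reading =
    readings.reduce((acc, next) => acc.copy(value = math.min(acc.value, next.value)))

  // Model of minBy(field): the whole element holding the minimum is returned.
  def minBy(readings: List[Reading]): Reading =
    readings.reduce((acc, next) => if (next.value < acc.value) next else acc)
}
```

Given the readings (a, 5, first) and (a, 2, second), the min model returns value 2 with the tag of the first element, while the minBy model returns the second element unchanged.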
max (behaves like maxBy in this example)
- Processes all elements and finds the maximum value among elements with the same key
- Input data
b a b a a b
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.aggregations.sum
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1) // Set the parallelism; if unset, it defaults to the number of CPU threads (8 on my 4-core, 8-thread machine)
    val dataStream = env.socketTextStream("localhost", port, '\n')
    // Running counter used as the value; note that each parallel task works on its own copy
    var i = 0
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map(x => {
      i = i + 1
      (x, i)
    })
      .keyBy(0)
      .timeWindow(Time.seconds(2)) // 2-second tumbling window
      .max(1)
    dataStream2.print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
- Final output; the order depends on thread scheduling
2> (b,6)
6> (a,5)
- Intermediate data (the tuples produced by map, before the window)
6> (a,2)
2> (b,1)
6> (a,4)
2> (b,3)
6> (a,5)
2> (b,6)
Window
window
- Defines a window and specifies how elements are assigned to windows
- Windows can be defined on an already partitioned KeyedStream. Windows group the data in each key according to some characteristic (for example, the data that arrived within the last 5 seconds). For a complete description of windows, see windows.
- Input data
b a b a a b
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      // Define the window and how elements are assigned to it:
      // a 2-second tumbling window based on processing time
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
      .sum(1)
    dataStream2.print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
(b,3)
(a,3)
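How a tumbling window assigns elements can be sketched without Flink: every timestamp maps to exactly one window, identified by its start time. The model below is an assumption-laden illustration (non-negative millisecond timestamps, windows aligned to 0, no offset); TumblingWindowSketch and its functions are hypothetical names.

```scala
object TumblingWindowSketch {
  // Start of the tumbling window of the given size that contains the timestamp.
  // Assumes non-negative timestamps and windows aligned to 0.
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - (timestampMs % sizeMs)

  // Groups (timestamp, word) events by (window start, key) and counts each group,
  // which is what keyBy + tumbling window + sum(1) does for (word, 1) tuples.
  def countPerWindow(events: List[(Long, String)], sizeMs: Long): Map[(Long, String), Int] =
    events
      .groupBy { case (ts, word) => (windowStart(ts, sizeMs), word) }
      .map { case (windowAndKey, group) => windowAndKey -> group.size }
}
```

With a 2000 ms window, events at 100 ms and 1900 ms share the window starting at 0, while an event at 2100 ms falls into the window starting at 2000. The same key can therefore produce several results, one per window.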
WindowAll
- Used together with a ProcessAllWindowFunction passed to process(). The function's parameter elements: Iterable[(String, Int)] holds all elements of the current window, which can be processed and then sent on to the downstream sink
- Input data
b a b a a b
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.windowAll
import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')
    dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(2)))
      .process(new ProcessAllWindowFunction[(String, Int), (String, Int), TimeWindow] {
        override def process(context: Context, elements: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
          // All elements of the current window can be processed here before being sent to the sink
          for (element <- elements) out.collect(element)
        }
      })
      .print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
8> (a,1)
7> (b,1)
3> (a,1)
2> (a,1)
1> (b,1)
4> (b,1)
Window.apply
- Applies a function to all elements of a window as a batch
- Input data
b a b a a b
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window.apply
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      // Define the window and how elements are assigned to it:
      // a 2-second tumbling window based on processing time
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
      // WindowFunction type parameters, in order:
      // IN (input value), OUT (output value), KEY (key), W (window)
      .apply(new WindowFunction[(String, Int), (String, Int), Tuple, TimeWindow] {
        override def apply(key: Tuple, window: TimeWindow, input: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
          // Process all elements of the window for this key
          for (element <- input) out.collect(element)
        }
      })
    dataStream2.print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
2> (a,1)
3> (a,1)
7> (b,1)
4> (b,1)
1> (b,1)
8> (a,1)
Window.reduce
- Processes all elements and applies a reduce function to the elements with the same key within a window
- Input data
b a b a a b
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window.reduce
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      // Define the window and how elements are assigned to it:
      // a 2-second tumbling window based on processing time
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
      // Add up the counts of the elements that share a key
      .reduce((a, b) => (a._1, a._2 + b._2))
    dataStream2.print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
2> (b,3)
6> (a,3)
Window.fold
- Processes elements per key. The first argument is the initial value, a string here, which is placed at the front of each result. The second argument is a function with two parameters: the first is the accumulated value (initially the first argument), and the second is the next element of the current key as the elements are iterated
- Input data
b a b a a b
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window.fold
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      // Define the window and how elements are assigned to it:
      // a 2-second tumbling window based on processing time
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
      // Start from the initial string and append every element of the key
      .fold("string start")((str, i) => { str + "-" + i })
    dataStream2.print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
6> string start-(a,1)-(a,1)-(a,1)
2> string start-(b,1)-(b,1)-(b,1)
DataStream union DataStream
- Merges two DataStreams of the same type into one. union only puts the elements of both streams together; it does not combine elements that share a key, so the output below contains both (a,2) and (a,1)
- Input data
a a b
c c a
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.union
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and nc -lk 12345 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream1 = getDataStream(env, 1234, "localhost")
    val dataStream2 = getDataStream(env, 12345, "localhost")
    // union only puts the elements of the two streams into one stream;
    // elements with the same key are not combined
    val dataStream3 = dataStream1.union(dataStream2)
    dataStream3.print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
  def getDataStream(env: StreamExecutionEnvironment, port: Int, host: String): DataStream[(String, Int)] = {
    //env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream(host, port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(5)) // 5-second tumbling window
      .sum(1)
    dataStream2
  }
}
- Output data
6> (a,2)
4> (c,2)
2> (b,1)
6> (a,1)
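The output above contains both (a,2) and (a,1) because union only concatenates the two streams. A collection-based sketch makes this visible; UnionSketch and its helpers are illustrative names, and sumByKey is a separate aggregation added here only for contrast.

```scala
object UnionSketch {
  // union only concatenates the two streams; records with the same key stay separate.
  def union(first: List[(String, Int)], second: List[(String, Int)]): List[(String, Int)] =
    first ++ second

  // A separate keyed aggregation is what would combine equal keys.
  def sumByKey(records: List[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (key, group) => key -> group.map(_._2).sum }
}
```

Here union of the per-stream results keeps two separate records for a, whereas sumByKey collapses them into a single count of 3.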
DataStream join DataStream
- Joins two streams on equal keys within a window, then calls the apply function to compute a result for each pair of elements that share a key
- Input data
a a b
c c c a b
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.join
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
/**
  * Run nc -lk 1234 and nc -lk 12345 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream1 = getDataStream(env, 1234, "localhost")
    val dataStream2 = getDataStream(env, 12345, "localhost")
    // Join the two streams on the first tuple field within a 2-second tumbling window
    val dataStream3 = dataStream1.join(dataStream2)
    dataStream3.where(x => x._1).equalTo(x => x._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
      // Called for each pair of elements with the same key: add the two counts
      .apply((a, b) => (a._1, a._2 + b._2))
      .print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
  def getDataStream(env: StreamExecutionEnvironment, port: Int, host: String): DataStream[(String, Int)] = {
    //env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream(host, port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(5)) // 5-second tumbling window
      .sum(1)
    dataStream2
  }
}
- Output data
6> (a,3)
2> (b,2)
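The join result can be reproduced with a for-comprehension over the elements of a single window. This is a rough model, assuming both per-stream results arrive in the same join window; WindowJoinSketch and join are made-up names.

```scala
object WindowJoinSketch {
  // Inner join of the elements of one window: every pair with equal keys is
  // passed to the join function (a, b) => (a._1, a._2 + b._2).
  def join(first: List[(String, Int)], second: List[(String, Int)]): List[(String, Int)] =
    for {
      a <- first
      b <- second
      if a._1 == b._1
    } yield (a._1, a._2 + b._2)
}
```

The first stream contributes (a,2) and (b,1), the second (c,3), (a,1) and (b,1). Only a and b exist on both sides, so c produces no output, which matches the two result lines above.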
DataStream.intervalJoin
- Joins two keyed streams by key: for elements that have the same key in both streams and whose timestamps lie within the given interval of each other, the process() function is called for each pair
- Input data
c c a
a a b
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.intervaljoin
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector
/**
  * Run nc -lk 1234 and nc -lk 12345 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // intervalJoin works on element timestamps; ingestion time assigns them at the source
    env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
    val dataStream1 = getDataStream(env, 1234, "localhost")
    val dataStream2 = getDataStream(env, 12345, "localhost")
    val dataStream3 = dataStream1.keyBy(0).intervalJoin(dataStream2.keyBy(0))
    // Match elements whose timestamps are at most 5 seconds apart
    dataStream3.between(Time.seconds(-5), Time.seconds(5))
      //.upperBoundExclusive(true) // optional
      //.lowerBoundExclusive(true) // optional
      .process(new ProcessJoinFunction[(String, Int), (String, Int), String] {
        override def processElement(left: (String, Int), right: (String, Int), ctx: ProcessJoinFunction[(String, Int), (String, Int), String]#Context, out: Collector[String]): Unit = {
          println(left + "," + right)
          out.collect(left + "," + right)
        }
      })
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
  def getDataStream(env: StreamExecutionEnvironment, port: Int, host: String): DataStream[(String, Int)] = {
    //env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream(host, port, '\n')
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
    dataStream2
  }
}
- Output data
(a,1),(a,1)
(a,1),(a,1)
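The matching rule of between(lower, upper) can be written down as a predicate on timestamps. The following sketch models it on plain lists; the Event case class, the millisecond timestamps and the function name are assumptions for illustration, and both bounds are treated as inclusive.

```scala
object IntervalJoinSketch {
  final case class Event(key: String, timestampMs: Long)

  // Emits every pair (left, right) with equal keys where
  // left.timestamp + lowerMs <= right.timestamp <= left.timestamp + upperMs.
  def intervalJoin(left: List[Event], right: List[Event], lowerMs: Long, upperMs: Long): List[(Event, Event)] =
    for {
      l <- left
      r <- right
      if l.key == r.key
      if r.timestampMs >= l.timestampMs + lowerMs && r.timestampMs <= l.timestampMs + upperMs
    } yield (l, r)
}
```

A key that appears once on the left and twice on the right within the interval yields two pairs, which is why the program above prints (a,1),(a,1) twice. Elements outside the interval, or with keys present on only one side, yield nothing.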
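DataStream.cogroup
- Processes two streams together: takes the union of the distinct keys of both streams, and for each key collects that key's elements from each stream separately so they can be processed as a group
- The first stream contains the keys c and a, and the second contains a and b, so the function is called once per distinct key (a, c and b in the output below), and each call shows which elements that key has in each stream
- Input data
c c a
a a b
- Program
Template (no program given)
- Output data
==============start
first
[(a,1)]
second
[(a,1), (a,1)]
==============end
==============start
first
[(c,1), (c,1)]
second
[]
==============end
==============start
first
[]
second
[(b,1)]
==============end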
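Since no program is given for this section, the grouping shown in the output can at least be modeled with plain Scala collections. This is not the author's program and does not use the Flink API; CoGroupSketch and coGroup are hypothetical names, and all elements are assumed to fall into one window.

```scala
object CoGroupSketch {
  // For every key present in either stream, returns that key's elements from
  // the first stream and from the second stream (either side may be empty).
  def coGroup(first: List[(String, Int)], second: List[(String, Int)])
      : Map[String, (List[(String, Int)], List[(String, Int)])] = {
    val keys = (first.map(_._1) ++ second.map(_._1)).distinct
    keys.map { key =>
      val fromFirst = first.filter(_._1 == key)
      val fromSecond = second.filter(_._1 == key)
      (key, (fromFirst, fromSecond))
    }.toMap
  }
}
```

Unlike a join, a key that exists in only one stream still produces a group, with an empty list for the other side. This corresponds to the [] entries in the output above.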
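DataStream.connect
- The data of both streams is processed through a function of the resulting ConnectedStreams. Each such function has two methods: one handles the first stream and the other handles the second stream
- Input data
c c a
a a b
- Program
Template (no program given)
- Output data
(a,1)
(a,1)
(b,1)
(c,1)
(c,1)
(a,1)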
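DataStream.coMap
- The data of both streams is processed through a function of the ConnectedStreams. Each such function has two methods: one handles the first stream and the other handles the second stream
- Input data
c c a
a a b
- Program
Template (no program given)
- Output data
(a,1)
(a,1)
(b,1)
(c,1)
(c,1)
(a,1)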
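The idea of one function per input stream can be modeled with Either, where Left stands for an element of the first stream and Right for an element of the second. This is only an illustrative model, not the Flink API; CoMapSketch, coMap and the use of Either are assumptions made for the sketch.

```scala
object CoMapSketch {
  // Models a map over two connected streams: elements of the first stream go
  // through fun1, elements of the second through fun2; both produce type R.
  def coMap[A, B, R](merged: List[Either[A, B]])(fun1: A => R, fun2: B => R): List[R] =
    merged.map {
      case Left(a)  => fun1(a)
      case Right(b) => fun2(b)
    }
}
```

The two inputs may have different element types, as long as both functions return the same result type, so the output is a single ordinary stream.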
DataStream.coFlatMap
- The data of both streams is processed through a function of the ConnectedStreams. Each such function has two methods: one handles the first stream and the other handles the second stream
- Input data
c c a
a a b
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.coFlatMap
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
/**
  * Run nc -lk 1234 and nc -lk 12345 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
    env.setParallelism(1) // set to 1 to make the structure of the data easier to see
    val dataStream1 = getDataStream(env, 1234, "localhost")
    val dataStream2 = getDataStream(env, 12345, "localhost")
    val dataStream3 = dataStream1.connect(dataStream2)
    dataStream3
      // The first function handles stream one, the second handles stream two
      .flatMap(x => x.toString.split(" "), x => x.toString.split(" "))
      .print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
  def getDataStream(env: StreamExecutionEnvironment, port: Int, host: String): DataStream[String] = {
    //env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream(host, port, '\n')
    // val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
    dataStream
  }
}
- Output data
a
a
b
c
c
a
DataStream.assignAscendingTimestamps
- Assigns a timestamp to each element
- Input data
c c a
- Program
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.assignTimestamps
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * Run nc -lk 1234 and type the input data
  */
object Run {
  def main(args: Array[String]): Unit = {
    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')
      // Keep the returned stream: assignAscendingTimestamps does not modify a stream in place.
      // The timestamps only affect timeWindow when the job uses event time.
      .assignAscendingTimestamps(x => System.currentTimeMillis())
    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(2)) // 2-second tumbling window
      .sum(1)
    dataStream2.print()
    println("=======================Print StreamPlanAsJSON=======================\n")
    println("Online JSON-to-graph tool: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")
    if (args == null || args.size == 0) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }
    println("finished")
  }
}
- Output data
(c,2)
(a,1)