Flink 1.7.2 DataStream Operator Examples

Posted by weixin_34127717 on 2019-03-08

Original URL: https://blog.csdn.net/weixin_34127717/article/details/89569443

map

Applies a function to every element of the stream.
Input data
```
你好
傳送資料
```

Program


package com.opensourceteams.module.bigdata.flink.example.stream.operator.map

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

import org.apache.flink.streaming.api.scala._

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream = env.socketTextStream("localhost", port, '\n')

    // Append a fixed suffix to every element (the literal is Chinese for "added data")
    val dataStreamMap = dataStream.map(x => x + " 增加的資料")

    dataStreamMap.print()


    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }

}

Output data

    1> 你好 增加的資料
    2> 傳送資料 增加的資料

flatMap

Applies a function to every element and flattens the collection produced for each line into one stream of elements.
Input data
```
a b c
e f g
```

Program

package com.opensourceteams.module.bigdata.flink.example.stream.operator.flatmap

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream = env.socketTextStream("localhost", port, '\n')

    // Split each line on spaces and emit every word as its own element
    val dataStream2 = dataStream.flatMap(x => x.split(" "))

    dataStream2.print()



    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

Output data
```
a
b
c
e
f
g
```
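
The difference between map and flatMap can be sketched with plain Scala collections. This is only an illustration of the semantics, not Flink code: the sequence `lines` stands in for the socket stream, and the object name `FlatMapSketch` is made up for this example.

```scala
// Plain Scala collections standing in for a DataStream (an assumption for illustration)
object FlatMapSketch {
  val lines: Seq[String] = Seq("a b c", "e f g")

  // map: exactly one output element per input element (here, one word list per line)
  def mapped: Seq[Seq[String]] = lines.map(line => line.split(" ").toSeq)

  // flatMap: the per-line word lists are flattened into a single sequence of words
  def flatMapped: Seq[String] = lines.flatMap(line => line.split(" ").toSeq)

  def main(args: Array[String]): Unit = {
    println(mapped)
    println(flatMapped)
  }
}
```

With map the two lines stay two elements; with flatMap they become six separate words, which is the shape of the output shown above.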

filter

Keeps only the elements for which the predicate returns true.
Input data
```
a b c
a c
b b
d d
```

Program

package com.opensourceteams.module.bigdata.flink.example.stream.operator.filter

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment}

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream = env.socketTextStream("localhost", port, '\n')

    // Keep only the lines that contain the letter "a"
    val dataStreamMap = dataStream.filter(x => x.contains("a"))

    dataStreamMap.print()


    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }



}

Output data
```
3> a b c
4> a c
```

keyBy

Designates a field as the key; typically used to group elements by key.
Input data
```
c a b a
```

Program

    package com.opensourceteams.module.bigdata.flink.example.stream.operator.sum
    
    import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
    import org.apache.flink.streaming.api.windowing.time.Time
    
    /**
      * Run nc -lk 1234, then type the input data into it
      */
    object Run {
    
      def main(args: Array[String]): Unit = {
    
    
        val port = 1234
        // get the execution environment
        val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
        val dataStream = env.socketTextStream("localhost", port, '\n')
    
    
        val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
          .keyBy(0)
    
          //dataStream.keyBy("someKey") // Key by field "someKey"
          //dataStream.keyBy(0) // Key by the first element of a Tuple
    
          .timeWindow(Time.seconds(2)) // tumbling window of 2 seconds
          .sum(1)
    
    
    
    
    
        dataStream2.print()
    
    
    
        // Use the first program argument as the job name, if there is one
        if (args == null || args.isEmpty) {
          env.execute("default job")
        } else {
          env.execute(args(0))
        }
    
        println("finished")
    
      }
    
    
    }

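Output data. With several threads the order of the output lines is not fixed, although the values are computed by the same rule.
With the default parallelism: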
```
    6> (a,2)
    4> (c,1)
    2> (b,1)
```
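With parallelism 1 the author observed the following pattern: the keys are deduplicated (c a b a becomes c a b), the first key is emitted first and the remaining keys follow in reverse, giving c b a. This is an observation from one run, not an ordering that Flink guarantees.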
```
 (c,1)
 (b,1)
 (a,2)
 
```
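
What keyBy followed by sum computes for one window can be sketched with plain Scala collections. This is a sketch of the semantics only, not the Flink API: `groupBy` plays the role of `keyBy(0)`, the per-group sum plays the role of `sum(1)`, and the name `KeyedSumSketch` is hypothetical.

```scala
// Plain-collections model of flatMap -> map -> keyBy(0) -> sum(1) for a single window
object KeyedSumSketch {
  def countWords(window: Seq[String]): Map[String, Int] =
    window
      .flatMap(line => line.split(" ").toSeq)   // one element per word
      .map(word => (word, 1))                   // pair every word with the count 1
      .groupBy(pair => pair._1)                 // group by the first tuple field
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // add up the counts

  def main(args: Array[String]): Unit =
    println(countWords(Seq("c a b a")))
}
```

The result is a Map, which has no defined order. That mirrors the point made above: the values per key are fixed by the input, while the order in which they are printed is not.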

sum

keyBy designates a field as the key for grouping; sum then adds up the chosen field for each key.
Input data
```
c a b a
```

Program

    package com.opensourceteams.module.bigdata.flink.example.stream.operator.sum
    
    import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
    import org.apache.flink.streaming.api.windowing.time.Time
    
    /**
      * Run nc -lk 1234, then type the input data into it
      */
    object Run {
    
      def main(args: Array[String]): Unit = {
    
    
        val port = 1234
        // get the execution environment
        val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
        val dataStream = env.socketTextStream("localhost", port, '\n')
    
    
        val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
          .keyBy(0)
    
          //dataStream.keyBy("someKey") // Key by field "someKey"
          //dataStream.keyBy(0) // Key by the first element of a Tuple
    
          .timeWindow(Time.seconds(2)) // tumbling window of 2 seconds
          .sum(1)
    
    
    
    
    
        dataStream2.print()
    
    
    
        // Use the first program argument as the job name, if there is one
        if (args == null || args.isEmpty) {
          env.execute("default job")
        } else {
          env.execute(args(0))
        }
    
        println("finished")
    
      }
    
    
    }

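Output data. With several threads the order of the output lines is not fixed, although the values are computed by the same rule.
With the default parallelism: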
```
    6> (a,2)
    4> (c,1)
    2> (b,1)
```
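With parallelism 1 the author observed the following pattern: the keys are deduplicated (c a b a becomes c a b), the first key is emitted first and the remaining keys follow in reverse, giving c b a. This is an observation from one run, not an ordering that Flink guarantees.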
```
 (c,1)
 (b,1)
 (a,2)
 
```

reduce

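keyBy designates a field as the key for grouping; reduce then combines the elements that share a key, two at a time, with the given function.
Input data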
```
a b b c
```

Program

package com.opensourceteams.module.bigdata.flink.example.stream.operator.reduce

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,2))
      .keyBy(0)

      //dataStream.keyBy("someKey") // Key by field "someKey"
      //dataStream.keyBy(0) // Key by the first element of a Tuple

      .timeWindow(Time.seconds(2)) // tumbling window of 2 seconds
      // Multiply the values of the elements that share a key
      .reduce((a,b) =>  (a._1,a._2 * b._2) )





    dataStream2.print()




    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

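Output data. The program sets the parallelism to 1, so the lines carry no subtask prefix. Every word starts with the value 2 and the values of elements with the same key are multiplied, so b, which occurs twice, yields 2 * 2 = 4. The order across keys is not guaranteed.
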

(a,2)
(c,2)
(b,4)
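
The per-key reduce can be modelled with plain Scala collections. This is a sketch of the semantics, not the Flink API; the object name `KeyedReduceSketch` is hypothetical, and the reduce function is the same multiplication used in the program above.

```scala
// Plain-collections model of map((_, 2)) -> keyBy(0) -> reduce(multiply) for one window
object KeyedReduceSketch {
  def reducePerKey(words: Seq[String]): Map[String, Int] =
    words
      .map(word => (word, 2))      // every word starts with the value 2
      .groupBy(pair => pair._1)    // group by the first tuple field
      .map { case (_, pairs) =>
        // combine the elements of one key two at a time
        pairs.reduce((a, b) => (a._1, a._2 * b._2))
      }

  def main(args: Array[String]): Unit =
    println(reducePerKey(Seq("a", "b", "b", "c")))
}
```

A key with a single element is returned unchanged, because reduce has nothing to combine it with; only b, with two elements, is actually multiplied.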

fold

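Processes the elements of each key. The first argument is the initial value (here a string), which ends up at the front of each result. The second argument is a function with two parameters: the value accumulated so far (starting from the initial value) and the next element of the current key.
Input data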

a a b c c

Program

package com.opensourceteams.module.bigdata.flink.example.stream.operator.fold

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)


      //dataStream.keyBy("someKey") // Key by field "someKey"
      //dataStream.keyBy(0) // Key by the first element of a Tuple

      .timeWindow(Time.seconds(2)) // tumbling window of 2 seconds
      // The initial value "開始字串" is Chinese for "start string"
      .fold("開始字串")((str, i) => { str + "-" + i} )





    dataStream2.print()




    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

Output data

開始字串-(a,1)-(a,1)
開始字串-(c,1)-(c,1)
開始字串-(b,1)
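
The way fold threads the initial value through the elements of each key can be sketched with `foldLeft` on plain Scala collections. This is a model of the semantics only, not the Flink API; the name `KeyedFoldSketch` and the initial value "start" are assumptions for the example.

```scala
// Plain-collections model of keyBy(0) -> fold(initial)(f) for one window
object KeyedFoldSketch {
  def foldPerKey(words: Seq[String], initial: String): Map[String, String] =
    words
      .map(word => (word, 1))
      .groupBy(pair => pair._1)
      .map { case (key, pairs) =>
        // start from the initial value and append each element of this key in turn
        (key, pairs.foldLeft(initial)((acc, element) => acc + "-" + element))
      }

  def main(args: Array[String]): Unit =
    println(foldPerKey(Seq("a", "a", "b", "c", "c"), "start"))
}
```

Unlike reduce, the result type (a String) differs from the element type (a tuple), which is why fold needs an explicit initial value.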

Aggregations (sum, max, min)

sum

Processes every element and adds up the values of the elements with the same key.
Input data

a a c b c

Program

package com.opensourceteams.module.bigdata.flink.example.stream.operator.aggregations.sum

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1)  // set the parallelism; if unset, a local run defaults to the number of CPU threads (8 on the author's 4-core, 8-thread machine)
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

      //dataStream.keyBy("someKey") // Key by field "someKey"
      //dataStream.keyBy(0) // Key by the first element of a Tuple

      .timeWindow(Time.seconds(2)) // tumbling window of 2 seconds
      .sum(1)





    dataStream2.print()




    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

Output data

4> (c,2)
2> (b,1)
6> (a,2)

min (gives the same result as minBy in this example)

Processes every element and finds the minimum value among the elements with the same key.
Input data

b a b a a b

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.aggregations.sum


import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
//import org.apache.flink.streaming.api.scala._

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1)  // set the parallelism; if unset, a local run defaults to the number of CPU threads (8 on the author's 4-core, 8-thread machine)
    val dataStream = env.socketTextStream("localhost", port, '\n')

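    // Counter that gives each word a distinct, increasing value.
    // Note: every parallel map subtask works on its own copy of this variable.
    var i  = 0

    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map( x => {
      i = i + 1
      (x,i)
    })
      .keyBy(0)

      .timeWindow(Time.seconds(2)) // tumbling window of 2 seconds
      .min(1)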





    dataStream2.print()




    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

Output data
Final output; the order depends on thread scheduling

2> (b,1)
6> (a,2)

Intermediate data (the keyed elements before min is applied)

6> (a,2)
2> (b,1)
6> (a,4)
2> (b,3)
6> (a,5)
2> (b,6)
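
The tuples in this example have only a key and a value, so min and minBy cannot be told apart. As far as I understand Flink's aggregations, the two differ once an element has further fields: min keeps the first element and only replaces the aggregated field, while minBy returns the whole element that holds the minimum. The sketch below models that with plain Scala; the third `tag` field and the object name `MinVsMinBySketch` are hypothetical, and this is not the Flink implementation.

```scala
// Model of the assumed difference between min(1) and minBy(1) on three-field elements
object MinVsMinBySketch {
  type Rec = (String, Int, String) // (key, value, tag); the tag field is made up for illustration

  // min-like: keep the first element, only overwrite its value field with the smaller value
  def minField(a: Rec, b: Rec): Rec = if (b._2 < a._2) a.copy(_2 = b._2) else a

  // minBy-like: keep the whole element that carries the smaller value
  def minByField(a: Rec, b: Rec): Rec = if (b._2 < a._2) b else a

  val records: Seq[Rec] = Seq(("a", 5, "first"), ("a", 2, "second"), ("a", 4, "third"))

  def main(args: Array[String]): Unit = {
    println(records.reduce(minField))
    println(records.reduce(minByField))
  }
}
```

Both variants report the minimum value 2, but only the minBy-like variant also reports the tag of the element that actually carried it.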

max (gives the same result as maxBy in this example)

Processes every element and finds the maximum value among the elements with the same key.
Input data

b a b a a b

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.aggregations.sum


import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
//import org.apache.flink.streaming.api.scala._

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1)  // set the parallelism; if unset, a local run defaults to the number of CPU threads (8 on the author's 4-core, 8-thread machine)
    val dataStream = env.socketTextStream("localhost", port, '\n')

    var i  = 0

    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map( x => {
      i = i + 1
      (x,i)
    })
      .keyBy(0)

      .timeWindow(Time.seconds(2)) // tumbling window of 2 seconds
      .max(1)





    dataStream2.print()




    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

Output data
Final output; the order depends on thread scheduling


2> (b,6)
6> (a,5)

Intermediate data (the keyed elements before max is applied)


6> (a,2)
2> (b,1)
6> (a,4)
2> (b,3)
6> (a,5)
2> (b,6)

Window

window

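Defines a window and specifies how elements are assigned to it.
Windows can be defined on an already partitioned KeyedStream. They group the data of each key according to some characteristic (for example, the data that arrived within the last 5 seconds). See the Flink windows documentation for a complete description.
Input data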

b a b a a b

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.{ TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

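      /**
        * Define a window and specify how elements are assigned to it.
        * Windows can be defined on an already partitioned KeyedStream. They group the data of
        * each key according to some characteristic (for example, the data that arrived within
        * the last 5 seconds). See the Flink windows documentation for a complete description.
        */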
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))

      .sum(1)





    dataStream2.print()




    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

Output data

(b,3)
(a,3)
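
How a tumbling window assigns elements can be sketched without Flink: each timestamp falls into exactly one window, whose start is the timestamp rounded down to a multiple of the window size. The code below is a simplified model (non-negative timestamps, no window offset); the object name `TumblingWindowSketch` and the sample timestamps are assumptions for the example.

```scala
// Simplified model of tumbling-window assignment followed by a per-key count
object TumblingWindowSketch {
  // Start of the tumbling window that contains timestampMs
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - (timestampMs % sizeMs)

  // Count (timestamp, word) events per (window start, word)
  def countPerWindow(events: Seq[(Long, String)], sizeMs: Long): Map[(Long, String), Int] =
    events
      .groupBy { case (ts, word) => (windowStart(ts, sizeMs), word) }
      .map { case (windowAndKey, group) => (windowAndKey, group.size) }

  def main(args: Array[String]): Unit = {
    val events = Seq((100L, "b"), (500L, "a"), (1900L, "b"), (2000L, "a"), (2100L, "a"), (3999L, "b"))
    println(countPerWindow(events, 2000L))
  }
}
```

The output shown above, (b,3) and (a,3), corresponds to the case where all six words arrive within the same 2-second window; if the line straddled a window boundary, the counts would be split across two windows as in this sketch.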

WindowAll

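Used together with a ProcessAllWindowFunction. The function's elements: Iterable[(String, Int)] parameter holds all elements of the current window, which can be processed and then sent on to the downstream sink.
Input data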

b a b a a b

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.windowAll

import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')


    dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)


      // windowAll collects all elements of a time slice into one window, regardless of key
      .windowAll( TumblingProcessingTimeWindows.of(Time.seconds(2)))
        .process(new ProcessAllWindowFunction[(String, Int),(String, Int),TimeWindow] {
          override def process(context: Context, elements: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
            // All elements of the current window can be processed here before they are sent to the sink
            for(element <- elements) out.collect(element)
          }
        })

        .print()





    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

Output data


8> (a,1)
7> (b,1)
3> (a,1)
2> (a,1)
1> (b,1)
4> (b,1)

Window.apply

Applies a function to all elements of a window at once.
Input data

b a b a a b

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window.apply

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.{TimeWindow, Window}
import org.apache.flink.util.Collector

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
   // env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

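      /**
        * Define a window and specify how elements are assigned to it.
        * Windows can be defined on an already partitioned KeyedStream. They group the data of
        * each key according to some characteristic (for example, the data that arrived within
        * the last 5 seconds). See the Flink windows documentation for a complete description.
        */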
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))

      /**
        * * @tparam IN The type of the input value.
        * * @tparam OUT The type of the output value.
        * * @tparam KEY The type of the key.
        */
      .apply(new WindowFunction[(String,Int),(String,Int),Tuple,TimeWindow] {
      override def apply(key: Tuple, window: TimeWindow, input: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit ={
        // Process all elements of the window for this key
        for(element <- input) out.collect(element)
      }

    })






    dataStream2.print()




    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

Output data

2> (a,1)
3> (a,1)
7> (b,1)
4> (b,1)
1> (b,1)
8> (a,1)

Window.reduce

Processes every element, combining the elements with the same key inside a window using the given function.
Input data

b a b a a b

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window.reduce

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
   // env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

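      /**
        * Define a window and specify how elements are assigned to it.
        * Windows can be defined on an already partitioned KeyedStream. They group the data of
        * each key according to some characteristic (for example, the data that arrived within
        * the last 5 seconds). See the Flink windows documentation for a complete description.
        */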
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))

      // Add up the values of the elements that share a key within the window
      .reduce((a,b) => (a._1,a._2 +b._2))








    dataStream2.print()




    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

Output data

2> (b,3)
6> (a,3)

Window.fold

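Processes the elements of each key within a window. The first argument is the initial value (here a string), which ends up at the front of each result. The second argument is a function with two parameters: the value accumulated so far (starting from the initial value) and the next element of the current key.
Input data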

b a b a a b

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window.fold

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
   // env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

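      /**
        * Define a window and specify how elements are assigned to it.
        * Windows can be defined on an already partitioned KeyedStream. They group the data of
        * each key according to some characteristic (for example, the data that arrived within
        * the last 5 seconds). See the Flink windows documentation for a complete description.
        */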
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))


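      /**
        * Fold the elements of each key. The first argument is the initial value, which ends up
        * at the front of each result. The second argument is a function of the value accumulated
        * so far and the next element of the current key.
        * The initial value "字串開始" is Chinese for "string start".
        */
      .fold("字串開始")((str, i) => { str + "-" + i} )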








    dataStream2.print()




    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

Output data

6> 字串開始-(a,1)-(a,1)-(a,1)
2> 字串開始-(b,1)-(b,1)-(b,1)

DataStream union DataStream

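Merges two DataStreams of the same type into one stream that contains all elements of both. The result is itself a DataStream and can be transformed further; this example simply prints it.
Input data (one line for each of the two sockets)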

a a b

c c a

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.union

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.scala._

/**
  * Run nc -lk 1234 and nc -lk 12345, then type the input data into each
  */
object Run {

  def main(args: Array[String]): Unit = {


    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

   val dataStream1 = getDataStream(env,1234,"localhost")
   val dataStream2 = getDataStream(env,12345,"localhost")

    /**
      * Merge the elements of the two streams into one stream.
      * The result is a regular DataStream; here it is only printed.
      */
    val dataStream3 = dataStream1.union(dataStream2)




    dataStream3.print()




    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }

  def getDataStream(env: StreamExecutionEnvironment,port:Int,host:String):DataStream[(String,Int)]={


    //env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream(host, port, '\n')

    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

      .timeWindow(Time.seconds(5)) // tumbling window of 5 seconds
      .sum(1)

    dataStream2

  }


}

Output data

6> (a,2)
4> (c,2)
2> (b,1)
6> (a,1)
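
What union does, and that its result can be processed further, can be sketched with plain Scala collections. This is a model of the semantics and not the Flink API; the object name `UnionSketch` is hypothetical, and the two input sequences are the per-stream window sums from the example above.

```scala
// Plain-collections model of union: all elements of both inputs, nothing merged or removed
object UnionSketch {
  val first: Seq[(String, Int)]  = Seq(("a", 2), ("b", 1))
  val second: Seq[(String, Int)] = Seq(("c", 2), ("a", 1))

  // union keeps every element as it is; both (a,2) and (a,1) survive
  val unioned: Seq[(String, Int)] = first ++ second

  // The result is an ordinary collection again, so further steps can follow
  val totals: Map[String, Int] =
    unioned.groupBy(pair => pair._1).map { case (key, pairs) => (key, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    println(unioned)
    println(totals)
  }
}
```

In a real job the elements of the two inputs are interleaved in arrival order, so the simple concatenation here should be read as one possible interleaving.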

DataStream join DataStream

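Joins two streams on equal keys within a window; the apply function is then called for every pair of elements with the same key to compute the result.
Input data (one line for each of the two sockets)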

a a b

c c c a b

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.join

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows

/**
  * Run nc -lk 1234 and nc -lk 12345, then type the input data into each
  */
object Run {

  def main(args: Array[String]): Unit = {


    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

   val dataStream1 = getDataStream(env,1234,"localhost")
   val dataStream2 = getDataStream(env,12345,"localhost")

    /**
      * Join the two streams. The key selectors, the window and the join
      * function are supplied on the joined streams below.
      */
    val dataStream3 = dataStream1.join(dataStream2)




    dataStream3.where(x => x._1).equalTo(x => x._1)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
        .apply((a,b) => (a._1,a._2 + b._2) )

        .print()






    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }

  def getDataStream(env: StreamExecutionEnvironment,port:Int,host:String):DataStream[(String,Int)]={


    //env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream(host, port, '\n')

    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

      .timeWindow(Time.seconds(5)) // tumbling window of 5 seconds
      .sum(1)

    dataStream2

  }


}

Output data

6> (a,3)
2> (b,2)
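
The join result above can be reproduced with a small model in plain Scala: within one window, every pair of elements with equal keys is passed to the join function. This is a sketch of the semantics, not the Flink API; the object name `WindowJoinSketch` is hypothetical, and the inputs are the per-stream window sums for the two input lines.

```scala
// Plain-collections model of an inner join of two windows on the first tuple field
object WindowJoinSketch {
  def join(left: Seq[(String, Int)], right: Seq[(String, Int)]): Seq[(String, Int)] =
    for {
      l <- left
      r <- right
      if l._1 == r._1               // corresponds to where(...).equalTo(...)
    } yield (l._1, l._2 + r._2)     // corresponds to apply((a, b) => ...)

  def main(args: Array[String]): Unit = {
    val left  = Seq(("a", 2), ("b", 1))            // sums for the line "a a b"
    val right = Seq(("c", 3), ("a", 1), ("b", 1))  // sums for the line "c c c a b"
    println(join(left, right))
  }
}
```

The key c has no partner in the first stream, so it produces no output; this is what makes the operation an inner join.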

DataStream.intervalJoin

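Joins two keyed streams: for every pair of elements that share a key and whose timestamps lie within the given interval of each other, the process() function is called.
Input data (one line for each of the two sockets)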

c c a

a a b

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.intervaljoin

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

/**
  * Run nc -lk 1234 and nc -lk 12345, then type the input data into each
  */
object Run {

  def main(args: Array[String]): Unit = {


    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)


   val dataStream1 = getDataStream(env,1234,"localhost")
   val dataStream2 = getDataStream(env,12345,"localhost")

    val dataStream3 = dataStream1.keyBy(0).intervalJoin(dataStream2.keyBy(0))




    dataStream3.between(Time.seconds(-5), Time.seconds(5))
      //.upperBoundExclusive(true) // optional
      //.lowerBoundExclusive(true) // optional
        .process(new ProcessJoinFunction[(String,Int),(String,Int),String] {
      override def processElement(left: (String, Int), right: (String, Int), ctx: ProcessJoinFunction[(String, Int), (String, Int), String]#Context, out: Collector[String]): Unit = {
        println(left + "," + right)
        out.collect( left + "," + right)
      }
    })





    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }

  def getDataStream(env: StreamExecutionEnvironment,port:Int,host:String):DataStream[(String,Int)]={


    //env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream(host, port, '\n')

    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))

    dataStream2

  }


}

Output data

(a,1),(a,1)
(a,1),(a,1)
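
The interval condition can be written out explicitly in plain Scala: a left element and a right element are paired when their keys match and the right timestamp lies between the left timestamp plus the lower bound and the left timestamp plus the upper bound (both inclusive here, matching the program above, where the exclusive options are commented out). The `Event` case class, the object name and the sample timestamps are assumptions for illustration; this is not the Flink API.

```scala
// Plain-collections model of intervalJoin(...).between(lower, upper)
object IntervalJoinSketch {
  case class Event(key: String, value: Int, timestampMs: Long)

  def intervalJoin(left: Seq[Event], right: Seq[Event],
                   lowerMs: Long, upperMs: Long): Seq[(Event, Event)] =
    for {
      l <- left
      r <- right
      if l.key == r.key
      if r.timestampMs >= l.timestampMs + lowerMs && r.timestampMs <= l.timestampMs + upperMs
    } yield (l, r)

  def main(args: Array[String]): Unit = {
    val left  = Seq(Event("c", 1, 0L), Event("c", 1, 10L), Event("a", 1, 20L))
    val right = Seq(Event("a", 1, 1000L), Event("a", 1, 1010L), Event("b", 1, 1020L), Event("a", 1, 9000L))
    intervalJoin(left, right, -5000L, 5000L).foreach(println)
  }
}
```

The single a on the left pairs with the two a elements on the right that arrive within 5 seconds, which gives the two output lines shown above; the late a at 9000 ms in this sketch falls outside the interval and is dropped.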

DataStream.coGroup

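Processes two streams together: the distinct keys of both streams are combined (a union of the keys), and for each key the elements of each stream are collected separately and handed to the function.
The first stream has the keys c and a and the second has a and b, so the union contains three distinct keys (a, b, c); for each key the output also shows which elements came from which stream.
Input data (one line for each of the two sockets)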

c c a

a a b

Program

(Template placeholder; the source gives no program for this operator.)

Output data

============== start
first
[(a,1)]
second
[(a,1), (a,1)]
============== end
============== start
first
[(c,1), (c,1)]
second
[]
============== end
============== start
first
[]
second
[(b,1)]
============== end
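
Since no program is given for this operator, the following is only a plain Scala model of the grouping that the output suggests, not the author's code and not the Flink API. For every key that occurs in either input it returns the matching elements of the first and of the second input; the object name `CoGroupSketch` is hypothetical.

```scala
// Plain-collections model of coGroup: per key, the elements of each input kept apart
object CoGroupSketch {
  type Pair = (String, Int)

  def coGroup(first: Seq[Pair], second: Seq[Pair]): Map[String, (Seq[Pair], Seq[Pair])] = {
    // union of the keys of both inputs, without duplicates
    val keys = (first.map(_._1) ++ second.map(_._1)).distinct
    keys.map { key =>
      (key, (first.filter(_._1 == key), second.filter(_._1 == key)))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val first  = Seq(("c", 1), ("c", 1), ("a", 1))  // the line "c c a"
    val second = Seq(("a", 1), ("a", 1), ("b", 1))  // the line "a a b"
    coGroup(first, second).foreach(println)
  }
}
```

Unlike join, a key that appears in only one input still produces a group, with an empty collection for the other side, as the c and b groups in the output show.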

DataStream.connect

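Connects two streams so that the data of both is handled by a single ConnectedStreams function. Each such function has two methods, one processing the first stream and one processing the second.
Input data (one line for each of the two sockets)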

c c a

a a b

Program

(Template placeholder; the source gives no program for this operator.)

Output data

(a,1)
(a,1)
(b,1)
(c,1)
(c,1)
(a,1)

DataStream.coMap

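Connects two streams so that the data of both is handled by a single ConnectedStreams function. Each such function has two methods, one processing the first stream and one processing the second.
Input data (one line for each of the two sockets)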

c c a

a a b

Program

(Template placeholder; the source gives no program for this operator.)

Output data

(a,1)
(a,1)
(b,1)
(c,1)
(c,1)
(a,1)
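
The "one function, two methods" idea can be sketched in plain Scala. Connected streams may carry different element types, and each input gets its own mapping method while both produce the same output type. In this model `Either` marks which input an element arrived on; that encoding, the object name `CoMapSketch` and the two mapping functions are assumptions for illustration, not the Flink API and not the author's program.

```scala
// Plain-collections model of a co-map over two connected inputs of different types
object CoMapSketch {
  // One method per input, both producing the same output type
  def map1(word: String): String = "first:" + word
  def map2(count: Int): String   = "second:" + count

  // Left = element from the first input, Right = element from the second input
  def coMap(elements: Seq[Either[String, Int]]): Seq[String] =
    elements.map {
      case Left(word)   => map1(word)
      case Right(count) => map2(count)
    }

  def main(args: Array[String]): Unit =
    println(coMap(Seq(Left("a"), Right(1), Left("b"))))
}
```

The elements are handled in arrival order, whichever input they come from, so the two methods can share state inside one function object in a real job.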

DataStream.coFlatMap

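Connects two streams so that the data of both is handled by a single ConnectedStreams function. Each such function has two methods, one processing the first stream and one processing the second; here each method may emit several elements per input element.
Input data (one line for each of the two sockets)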

c c a

a a b

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.coFlatMap

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.co.CoMapFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}

/**
  * Run nc -lk 1234 and nc -lk 12345, then type the input data into each
  */
object Run {

  def main(args: Array[String]): Unit = {


    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
    env.setParallelism(1) // set to 1 so that the structure of the data is easier to see


   val dataStream1 = getDataStream(env,1234,"localhost")
   val dataStream2 = getDataStream(env,12345,"localhost")

    val dataStream3 = dataStream1.connect(dataStream2)




    dataStream3
        .flatMap(x => x.toString.split(" ") , x => x.toString.split(" "))
        .print()



    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }

  def getDataStream(env: StreamExecutionEnvironment,port:Int,host:String):DataStream[String]={


    //env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream(host, port, '\n')

   // val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))

    dataStream

  }


}

Output data

a
a
b
c
c
a

DataStream.assignAscendingTimestamps

Assigns a timestamp to each element; the timestamps are expected to be ascending.
Input data

c c a

Program

package com.opensourceteams.module.bigdata.flink.example.datastream.operator.assignTimestamps

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * Run nc -lk 1234, then type the input data into it
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)  // set the parallelism
    val dataStream = env.socketTextStream("localhost", port, '\n')

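    // assignAscendingTimestamps returns a new stream; the rest of the pipeline is built on it.
    // Note: timeWindow only uses these timestamps when the job runs with event time.
    val timestamped = dataStream.assignAscendingTimestamps(x => System.currentTimeMillis())

    val dataStream2 = timestamped.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)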


      .timeWindow(Time.seconds(2)) // tumbling window of 2 seconds
      .sum(1)





    dataStream2.print()




    // Print the stream plan as JSON; it can be pasted into the online visualizer
    println("======================= StreamPlanAsJSON =======================\n")
    println("Online JSON plan visualizer: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("================================================================\n")

    // Use the first program argument as the job name, if there is one
    if (args == null || args.isEmpty) {
      env.execute("default job")
    } else {
      env.execute(args(0))
    }

    println("finished")

  }


}

Output data

(c,2)
(a,1)
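
The word "ascending" is an assumption the caller makes about the data: each timestamp should be at least as large as every earlier one. A quick way to check a sample of timestamps against that assumption is sketched below in plain Scala. The object name `AscendingTimestampSketch` and the helper `violations` are hypothetical and are not part of Flink.

```scala
// Checks a sequence of timestamps against the "ascending" assumption
object AscendingTimestampSketch {
  // Returns the timestamps that are smaller than some earlier timestamp
  def violations(timestamps: Seq[Long]): Seq[Long] = {
    // runningMax(i) is the largest timestamp seen before position i
    val runningMax = timestamps.scanLeft(Long.MinValue)((maxSoFar, t) => math.max(maxSoFar, t))
    timestamps.zip(runningMax).collect { case (t, maxBefore) if t < maxBefore => t }
  }

  def main(args: Array[String]): Unit =
    println(violations(Seq(1L, 2L, 2L, 5L, 4L, 6L)))
}
```

Timestamps taken from System.currentTimeMillis(), as in the program above, normally satisfy the assumption; timestamps read from the data itself may not, and then a different timestamp assigner is the safer choice.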
