Learning MR Hadoop streaming jobs: the combiner

Posted by weixin_34162629 on 2018-04-14

Original URL: https://blog.csdn.net/weixin_34162629/article/details/85982779

Hadoop

The code has been copied to my work computer at:

/Users/baidu/Documents/Data/Work/Code/Self/hadoop_mr_streaming_jobs

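First comes the driver script, main.sh.

It calls extract.py.

On rereading, the code turned out not to be very well written. It includes a combiner; for background on combiners, see: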

https://blog.csdn.net/u010700335/article/details/72649186
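In a streaming job the combiner is just another stdin/stdout script, which Hadoop may run on map output before the shuffle (possibly zero, one, or several times). For word count, summing the counts per word is a safe choice because addition is associative and commutative. The following is a minimal sketch, not the combiner from the original job: it assumes tab-separated `word<TAB>count` lines, and the function name `combine` is hypothetical.

```python
import sys
from collections import defaultdict


def combine(lines):
    """Pre-sum the counts per word for one batch of mapper output lines."""
    totals = defaultdict(int)
    for line in lines:
        try:
            word, count = line.rstrip("\n").split("\t", 1)
            n = int(count)
        except ValueError:
            # Malformed line or non-numeric count: skip it.
            continue
        totals[word] += n
    # Emit the same "word<TAB>count" format the reducer expects,
    # so the reducer works whether or not the combiner ran.
    for word in sorted(totals):
        yield "%s\t%d" % (word, totals[word])


if __name__ == "__main__":
    for record in combine(sys.stdin):
        print(record)
```

Because the output format equals the input format, the reducer gives the same totals with or without this step; the combiner only reduces the amount of data shuffled.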

Streaming scripts are built on pipes: each script reads its input from stdin and writes its output to stdout:

(5) Python script

import sys

for line in sys.stdin:
    .......
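Since every stage only reads lines and writes lines, the whole job can be imitated locally as map, sort, reduce, in the spirit of a shell pipe. The sketch below does this in pure Python; the function names `mapper`, `reducer`, and `run_local` are illustrative assumptions, and `sorted()` merely stands in for the sort Hadoop performs between the two stages.

```python
from collections import Counter


def mapper(lines):
    """Map stage: emit one "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reducer(lines):
    """Reduce stage: sum the counts of every word."""
    totals = Counter()
    for line in lines:
        word, count = line.split("\t", 1)
        totals[word] += int(count)
    for word in sorted(totals):
        yield "%s\t%d" % (word, totals[word])


def run_local(lines):
    """Chain the stages the way the streaming pipe does: map, sort, reduce."""
    return list(reducer(sorted(mapper(lines))))


print("\n".join(run_local(["a b a", "b c"])))
```

Testing the stages this way on a small sample is a cheap check before submitting the job to a cluster.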

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from STDIN

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # split() with no arguments ignores leading and trailing whitespace
    # and never returns empty strings
    words = line.split()
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
#---------------------------------------------------------------------------------------------------------
#!/usr/bin/env python
# reducer.py: sum the counts emitted by mapper.py for each word

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    try:
        # parse the tab-delimited input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # the line was malformed or count was not a number,
        # so silently ignore/discard this line
        continue
    word2count[word] = word2count.get(word, 0) + count

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
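The reducer above keeps every word in a dictionary until the input ends. Hadoop streaming hands the reducer its input already sorted by key, so an alternative is to total each run of identical keys and emit it as soon as the key changes. This is a sketch of that variant, not the original author's code; `parse` and `reduce_sorted` are hypothetical names, and it assumes `word<TAB>count` lines sorted by word.

```python
import sys
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs, skipping malformed lines."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if sep and count.strip().isdigit():
            yield word, int(count)


def reduce_sorted(lines):
    """Sum counts per word, relying on equal words being adjacent."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(count for _, count in group))


if __name__ == "__main__":
    for record in reduce_sorted(sys.stdin):
        print(record)
```

Only one running total is held at a time, so memory use does not grow with the number of distinct words. The trade-off is that unsorted input would produce several partial totals for the same word.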
