MapReduce Programming (4): Computing Averages

Posted by weixin_34117211 on 2017-03-31

Original URL: https://blog.csdn.net/weixin_34117211/article/details/89747446

1. Problem Description

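Three files store the students' Chinese, math, and English scores respectively. The task is to output each student's average score.

The data format is as follows:
Chinese.txt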

張三    78
李四    89
王五    96
趙六    67

Math.txt

張三    88
李四    99
王五    66
趙六    77

English.txt

張三    80
李四    82
王五    84
趙六    86
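
Before turning to Hadoop, the expected numbers can be checked locally. The following is a minimal plain-Java sketch, not part of the original program: the class name LocalAverageSketch and the method averageByName are made up for illustration. It groups "name score" lines by name and divides each sum by its count, which is the same arithmetic the reducer performs later.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only; it does not use Hadoop.
class LocalAverageSketch {

    /** Groups "name score" lines by name and returns each name's mean score. */
    static Map<String, Double> averageByName(List<String> lines) {
        Map<String, double[]> sumAndCount = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            double[] acc = sumAndCount.computeIfAbsent(parts[0], k -> new double[2]);
            acc[0] += Double.parseDouble(parts[1]); // running sum
            acc[1] += 1;                            // running count
        }
        Map<String, Double> averages = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : sumAndCount.entrySet()) {
            averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }

    public static void main(String[] args) {
        // The rows for two of the students, taken from the three files above.
        List<String> lines = Arrays.asList(
                "張三    78", "趙六    67",   // Chinese.txt
                "張三    88", "趙六    77",   // Math.txt
                "張三    80", "趙六    86");  // English.txt
        averageByName(lines).forEach((name, avg) -> System.out.println(name + "\t" + avg));
    }
}
```

For 張三 this works out to (78 + 88 + 80) / 3 = 82.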

2. The MapReduce Program

package com.javacore.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;


/**
 * Created by bee on 3/29/17.
 */
public class StudentAvgDouble {

    public static class MyMapper extends Mapper<Object, Text, Text, DoubleWritable> {

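        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // With the default TextInputFormat each call receives a single line,
            // so splitting on "\n" is only a safeguard.
            String eachline = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(eachline, "\n");
            while (tokenizer.hasMoreTokens()) {
                // Split the line on whitespace: the first token is the name, the second the score.
                StringTokenizer tokenizerLine = new StringTokenizer(tokenizer.nextToken());
                if (tokenizerLine.countTokens() < 2) {
                    continue; // skip blank or malformed lines
                }
                String strName = tokenizerLine.nextToken();
                String strScore = tokenizerLine.nextToken();
                Text name = new Text(strName);
                // Emit a DoubleWritable so the value matches the Mapper's declared output type.
                DoubleWritable score = new DoubleWritable(Double.parseDouble(strScore));
                context.write(name, score);
            }
        }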
    }

    public static class MyReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        public void reduce(Text key, Iterable<DoubleWritable> values, Context
                context) throws IOException, InterruptedException {
            double sum = 0.0;
            int count = 0;
            for (DoubleWritable val : values) {
                sum += val.get();
                count++;
            }
            DoubleWritable avgScore = new DoubleWritable(sum / count);
            context.write(key, avgScore);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

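        // Delete the output directory left by a previous run; Hadoop refuses to start
        // a job whose output path already exists. FileUtil appears to be the author's
        // own helper class (not shown in this post), not org.apache.hadoop.fs.FileUtil.
        FileUtil.deleteDir("output");
        Configuration conf = new Configuration();
        // The input and output paths are hard-coded here rather than read from args.
        String[] otherArgs = new String[]{"input/studentAvg", "output"};
        if (otherArgs.length != 2) {
            System.err.println("Invalid arguments");
            System.exit(2);
        }

        // Pass conf so that any settings made on it actually apply to the job.
        Job job = Job.getInstance(conf);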
        job.setJarByClass(StudentAvgDouble.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }
}
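
The program calls FileUtil.deleteDir("output"), but the post never shows that class. The sketch below is one guess at what such a helper might look like, assuming the job runs in local mode and "output" is a directory on the local file system. The method body and its return value are assumptions, not the author's code.

```java
import java.io.File;

// Hypothetical stand-in for the helper used in main(); place it in the same package.
class FileUtil {

    /** Recursively deletes the local path; returns true if nothing is left afterwards. */
    public static boolean deleteDir(String path) {
        return deleteRecursively(new File(path));
    }

    private static boolean deleteRecursively(File file) {
        File[] children = file.listFiles(); // null for regular files and missing paths
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        // A missing path counts as success; otherwise delete the now-empty entry.
        return !file.exists() || file.delete();
    }
}
```

If the output path lived on HDFS instead, the deletion would presumably go through Hadoop's FileSystem.delete(Path, boolean) rather than java.io.File.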

3. StringTokenizer vs. split()

The map function reads the input line by line and splits each line on whitespace. I previously did the splitting with split(), as shown below.

            // Split the input value into lines, then split each line on whitespace.
            String[] lines = value.toString().split("\n");
            for (String eachline : lines) {
                System.out.println("eachline:\t" + eachline);
                String[] words = eachline.split("\\s+");
                Text name = new Text(words[0]);
                DoubleWritable score = new DoubleWritable(Double.parseDouble(words[1]));
                context.write(name, score);
            }

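This approach is simple and clear, but it has a weakness: splitting sometimes fails when a line contains whitespace characters in an unusual encoding.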
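
One likely cause, assumed here rather than stated in the post, is a non-ASCII space such as the full-width space U+3000 that is common in Chinese text. By default Java's \s matches only ASCII whitespace, so such a line is not split. A small sketch (the class and method names are made up for illustration):

```java
import java.util.Arrays;

// Hypothetical demo class; it only uses String.split from the JDK.
class WhitespaceSplitSketch {

    /** Splits on \s+, which by default covers ASCII whitespace only. */
    static String[] splitAscii(String line) {
        return line.split("\\s+");
    }

    /** Also splits on the Unicode separator category, which includes U+3000 and U+00A0. */
    static String[] splitUnicode(String line) {
        return line.split("[\\s\\p{Z}]+");
    }

    public static void main(String[] args) {
        // Name and score separated by a full-width (ideographic) space, U+3000.
        String line = "張三\u300078";
        System.out.println(splitAscii(line).length + " " + Arrays.toString(splitAscii(line)));
        System.out.println(splitUnicode(line).length + " " + Arrays.toString(splitUnicode(line)));
    }
}
```

Note that StringTokenizer's default delimiter set is also ASCII-only, so it would need such characters passed explicitly as delimiters to handle this case.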
StringTokenizer is a string-tokenizing class in the java.util package. It has three constructors:

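StringTokenizer(String str): uses Java's default delimiters: space, tab ('\t'), newline ('\n'), carriage return ('\r'), and form feed ('\f').
StringTokenizer(String str, String delim): constructs a StringTokenizer that parses str using the given delimiter characters.
StringTokenizer(String str, String delim, boolean returnDelims): constructs a StringTokenizer that parses str using the given delimiter characters and also specifies whether the delimiters themselves are returned as tokens.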
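
The three constructors can be tried side by side. This is a small sketch for illustration; the class name and the tokens helper are made up, and the sample strings are not from the original data files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class exercising the three StringTokenizer constructors.
class TokenizerConstructorsSketch {

    /** Drains a tokenizer into a list so the tokens are easy to inspect. */
    static List<String> tokens(StringTokenizer st) {
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // 1. Default delimiters: here the tab separates the name from the score.
        System.out.println(tokens(new StringTokenizer("張三\t78")));
        // 2. Custom delimiters: every character in the second argument is a delimiter.
        System.out.println(tokens(new StringTokenizer("張三,78", ",")));
        // 3. returnDelims = true: each delimiter is returned as a one-character token.
        System.out.println(tokens(new StringTokenizer("張三,78", ",", true)));
    }
}
```

With returnDelims set to true, the comma itself appears in the token list between the name and the score.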

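Both StringTokenizer and split() can split strings. StringTokenizer is generally somewhat faster because it does not use regular expressions. It also treats delimiter characters literally, so delimiters that are regex metacharacters (such as "." or "|") behave as expected without escaping.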
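
The special-character point can be illustrated with "|" as the delimiter. In the sketch below (the class and method names are assumptions for illustration), split() interprets "|" as regex alternation unless it is quoted, while StringTokenizer takes it literally.

```java
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Hypothetical demo class comparing literal and regex delimiter handling.
class DelimiterSketch {

    /** Counts tokens with StringTokenizer, which reads delimiter characters literally. */
    static int countWithTokenizer(String line, String delim) {
        return new StringTokenizer(line, delim).countTokens();
    }

    /** Counts fields with split(), passing the delimiter straight through as a regex. */
    static int countWithRawSplit(String line, String delim) {
        return line.split(delim).length;
    }

    /** Counts fields with split() after quoting the delimiter so it is matched literally. */
    static int countWithQuotedSplit(String line, String delim) {
        return line.split(Pattern.quote(delim)).length;
    }

    public static void main(String[] args) {
        String line = "張三|78";
        System.out.println(countWithTokenizer(line, "|"));   // '|' taken literally
        System.out.println(countWithRawSplit(line, "|"));    // '|' parsed as a regex
        System.out.println(countWithQuotedSplit(line, "|")); // '|' quoted, so literal
    }
}
```

Only the unquoted split() call fails to produce the two expected fields, which is why a delimiter passed to split() should be escaped or wrapped in Pattern.quote.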

4. Results

張三  82.0
李四  90.0
王五  82.0
趙六  76.66666666666667
