A MapReduce program is, at heart, a piece of simple logic built on the framework's own behavior. Its most important property is that all values sharing the same key are merged into a single group; the second is that keys are sorted during the shuffle phase.
Writing an MR program is largely a matter of exploiting these grouping and sorting properties.
A single-table join (self-join) relies on the grouping property.
First, the test data:
child parent
Tom Lucy
Tom Jack
Lucy Marry
Lucy Ben
Jack Alice
Jack Jesse
And the expected result:
grandchild grandparent
Tom Marry
Tom Ben
Tom Alice
Tom Jesse
How it works:
The trick is to key each record by the person in the middle of the chain. Every (child, parent) record is emitted twice: once keyed by the parent and once keyed by the child. For any person X, the value list that reaches the reducer then contains both X's children (candidate grandchildren) and X's parents (candidate grandparents); a Cartesian product between the two sides yields the final result.
In a single-table join the left and right tables are both the table itself, so we tag each value to tell the two roles apart: c# marks a value from the left table (a child) and p# marks a value from the right table (a parent).
Map output:
context.write(" Lucy", " C#Tom") context.write(" Jack", " C#Tom") context.write(" Marry", " C#Lucy") context.write(" Alice", " C#Jack") ......
context.write(" Tom", " P#Lucy") context.write(" Tom", " P#Jack") context.write(" Lucy", " P#Marry") context.write(" Jack", " P#Alice") ......
<" Lucy" , {" C#Tom", " P#Marry", " P#Ben"}> <" Jack" , {" C#Tom", " P#Alice", " P#Jesse"}> <" Marry" , { " C#Lucy"}> <" Alice" , { " C#Jack"}> <" Tom" , {" P#Lucy"," P#Jack"}>
Reduce output:
context.write("Tom", "Marry")    context.write("Tom", "Ben")    context.write("Tom", "Alice")    context.write("Tom", "Jesse")
Only keys whose value list contains both c# and p# entries produce output; Marry, Alice, and Tom have values on one side only, so they contribute nothing.
The full code:
package cn.genekang.hadoop.test;

import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Single-table join: each input line is child<TAB>parent,
// e.g. Tom<TAB>Lucy, Tom<TAB>Jack, Lucy<TAB>Marry, ...
public class STjoin {

    public static class StjoinMap extends
            Mapper<LongWritable, Text, Text, Text> {
        private Text kText = new Text();
        private Text vText = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] lineSplit = value.toString().split("\t");
            // Skip the header line "child parent".
            if (lineSplit[0].equals("child")) {
                return;
            }
            // c# marks the left table (a child), p# marks the right table (a parent).
            // Left table: key by the parent, value is the tagged child.
            kText.set(lineSplit[1]);
            vText.set("c#" + lineSplit[0]);
            context.write(kText, vText);
            // Right table: key by the child, value is the tagged parent.
            kText.set(lineSplit[0]);
            vText.set("p#" + lineSplit[1]);
            context.write(kText, vText);
        }
    }

    public static class StjoinReduce extends Reducer<Text, Text, Text, Text> {
        private Text kText = new Text();
        private Text vText = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // For the person in "key": cList holds their children (candidate
            // grandchildren), pList holds their parents (candidate grandparents).
            ArrayList<String> cList = new ArrayList<String>();
            ArrayList<String> pList = new ArrayList<String>();
            for (Text v : values) {
                if (v.toString().startsWith("c#")) {
                    cList.add(v.toString().substring(2));
                } else if (v.toString().startsWith("p#")) {
                    pList.add(v.toString().substring(2));
                }
            }
            // Cartesian product; keys with values on only one side emit nothing.
            for (String c : cList) {
                for (String p : pList) {
                    kText.set(c);
                    vText.set(p);
                    context.write(kText, vText);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(STjoin.class);
        job.setMapperClass(StjoinMap.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(StjoinReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
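Assuming the class is packaged as STjoin.jar and the test file has been uploaded to HDFS (the jar name and both paths below are placeholders), the job can be run and checked like this:

hadoop jar STjoin.jar cn.genekang.hadoop.test.STjoin /user/hadoop/stjoin/in /user/hadoop/stjoin/out
hdfs dfs -cat /user/hadoop/stjoin/out/part-r-00000

The output contains the four grandchild-grandparent pairs listed above, though possibly in a different order, since reducers see keys in sorted order (Jack before Lucy).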