1. Overview
Recently, in a conversation, the topics of data similarity and data commonality came up, and as it happens there is a similar requirement on the business side. Today I'd like to share an approach to this kind of problem. The outline is as follows:
- Business background
- Coding practice
- Screenshot preview
Let's get into today's content.
2. Business Background
The background is this: we have a large amount of data that stores information about pictures, as shown in the figure below:
The figure above only illustrates the data format: the first column is the picture itself (the owner), and the second, third, and subsequent columns are the pictures associated with it. The question is: from this pile of data, how do we find, for pairs of pictures, the related pictures they have in common?
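The figure from the original post is not reproduced here, so purely as an illustration, a few whitespace-separated input lines in this format could look like the following. The picture IDs are made up; this is simply one hypothetical dataset that would be consistent (up to the ordering of the comma-separated values) with the partial result listed in section 2.1 below:

```
pic_001    pic_002 pic_003 pic_004 pic_005
pic_002    pic_001 pic_003 pic_004 pic_005
pic_003    pic_001 pic_002 pic_005
pic_004    pic_001 pic_002 pic_005
pic_005    pic_001 pic_002 pic_003 pic_004
```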
2.1 Implementation Approach
With the requirement clear, let's work out how to implement it. First, the target result we want to compute from the data above looks like this:
```
pic_001pic_002    pic_003,pic_004,pic_005
pic_001pic_003    pic_002,pic_005
pic_001pic_004    pic_002,pic_005
pic_001pic_005    pic_002,pic_003,pic_004
......
```
As shown above, for each pair of pictures we find the pictures they have in common. The listing is only partial; the full result can be found in the screenshot preview section.
Now for the solution idea. Looking at the data, to compute the common pictures of each pair we can start from the related-picture columns, because they carry the association information we need. For example, to compute the common pictures of pic_001 and pic_002, we look for every record whose related pictures contain both of them (the pic_001/pic_002 combination) and take the owner picture (the key) of each such record; the union of all these owner keys is exactly the set of pictures that the two have in common, as illustrated in the figure below:
From the figure above, the concrete implementation breaks down into two steps:
- Step 1: split each record and output every pairwise combination of its related pictures as the key, with the owner picture as the value.
- Step 2: group the records by key, then take the union of the owner values to obtain the result.
A single MapReduce job is enough for this. With the approach clear, we can move on to the code; a small plain-Java sketch of how the data flows through these two steps is shown below, followed by the MapReduce implementation.
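Before diving into the Hadoop code, here is a minimal, self-contained plain-Java sketch of these two steps. It involves no Hadoop at all, and the class name PairFlowSketch and the three sample lines are made up for illustration: it splits each line, emits every pair of related pictures with the owner as the value, then groups by pair key and joins the owners, which is the same data flow the MapReduce job below implements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class PairFlowSketch {

    public static void main(String[] args) {
        // Hypothetical input lines: owner picture followed by its related pictures.
        String[] lines = {
                "pic_003 pic_001 pic_002 pic_005",
                "pic_004 pic_001 pic_002 pic_005",
                "pic_005 pic_001 pic_002" };

        // Step 1 (map side): for each line, emit every pairwise combination of the
        // related pictures as a key, with the owner picture as the value.
        Map<String, List<String>> grouped = new TreeMap<String, List<String>>();
        for (String line : lines) {
            String[] fields = line.split("\\s+");
            String owner = fields[0];
            TreeSet<String> relations = new TreeSet<String>();
            for (int i = 1; i < fields.length; i++) {
                relations.add(fields[i]);
            }
            String[] rel = relations.toArray(new String[relations.size()]);
            for (int i = 0; i < rel.length; i++) {
                for (int j = i + 1; j < rel.length; j++) {
                    String pairKey = rel[i] + rel[j];
                    List<String> owners = grouped.get(pairKey);
                    if (owners == null) {
                        owners = new ArrayList<String>();
                        grouped.put(pairKey, owners);
                    }
                    owners.add(owner);
                }
            }
        }

        // Step 2 (reduce side): the owners collected under a pair key are exactly the
        // pictures that both members of the pair have in common.
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            StringBuilder common = new StringBuilder();
            for (String owner : entry.getValue()) {
                if (common.length() > 0) {
                    common.append(",");
                }
                common.append(owner);
            }
            // Prints, among others: pic_001pic_002    pic_003,pic_004,pic_005
            System.out.println(entry.getKey() + "\t" + common);
        }
    }
}
```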
3. Coding Practice
- Split the data and generate pairwise combinations.
```java
public static class PictureMap extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line holds the owner picture followed by its related pictures.
        StringTokenizer strToken = new StringTokenizer(value.toString());
        Text owner = new Text();
        Set<String> set = new TreeSet<String>();
        owner.set(strToken.nextToken());
        while (strToken.hasMoreTokens()) {
            set.add(strToken.nextToken());
        }
        // TreeSet removes duplicates and sorts the related pictures, so the same
        // pair always forms the same key regardless of the order in the input line.
        String[] relations = set.toArray(new String[set.size()]);
        for (int i = 0; i < relations.length; i++) {
            for (int j = i + 1; j < relations.length; j++) {
                // Pairwise combination as the key, owner picture as the value.
                String outPutKey = relations[i] + relations[j];
                context.write(new Text(outPutKey), owner);
            }
        }
    }
}
```
- Group by key and take the union.
```java
public static class PictureReduce extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Join all owner pictures of this pair into a comma-separated list;
        // these owners are the common pictures of the pair.
        StringBuilder common = new StringBuilder();
        for (Text val : values) {
            if (common.length() > 0) {
                common.append(",");
            }
            common.append(val.toString());
        }
        context.write(key, new Text(common.toString()));
    }
}
```
- Complete example
```java
package cn.hadoop.hdfs.example;

import java.io.IOException;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import cn.hadoop.hdfs.util.HDFSUtils;
import cn.hadoop.hdfs.util.SystemConfig;

/**
 * @Date Aug 31, 2015
 *
 * @Author dengjie
 *
 * @Note Find picture relations
 */
public class PictureRelations extends Configured implements Tool {

    private static Logger log = LoggerFactory.getLogger(PictureRelations.class);
    private static Configuration conf;

    static {
        // Load the HDFS HA settings for the target cluster.
        String tag = SystemConfig.getProperty("dev.tag");
        String[] hosts = SystemConfig.getPropertyArray(tag + ".hdfs.host", ",");
        conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://cluster1");
        conf.set("dfs.nameservices", "cluster1");
        conf.set("dfs.ha.namenodes.cluster1", "nna,nns");
        conf.set("dfs.namenode.rpc-address.cluster1.nna", hosts[0]);
        conf.set("dfs.namenode.rpc-address.cluster1.nns", hosts[1]);
        conf.set("dfs.client.failover.proxy.provider.cluster1",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
    }

    public static class PictureMap extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line holds the owner picture followed by its related pictures.
            StringTokenizer strToken = new StringTokenizer(value.toString());
            Text owner = new Text();
            Set<String> set = new TreeSet<String>();
            owner.set(strToken.nextToken());
            while (strToken.hasMoreTokens()) {
                set.add(strToken.nextToken());
            }
            // TreeSet removes duplicates and sorts, so the same pair always forms the same key.
            String[] relations = set.toArray(new String[set.size()]);
            for (int i = 0; i < relations.length; i++) {
                for (int j = i + 1; j < relations.length; j++) {
                    // Pairwise combination as the key, owner picture as the value.
                    String outPutKey = relations[i] + relations[j];
                    context.write(new Text(outPutKey), owner);
                }
            }
        }
    }

    public static class PictureReduce extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Join all owner pictures of this pair into a comma-separated list.
            StringBuilder common = new StringBuilder();
            for (Text val : values) {
                if (common.length() > 0) {
                    common.append(",");
                }
                common.append(val.toString());
            }
            context.write(key, new Text(common.toString()));
        }
    }

    public int run(String[] args) throws Exception {
        final Job job = Job.getInstance(conf);
        job.setJarByClass(PictureMap.class);
        job.setMapperClass(PictureMap.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(PictureReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, args[0]);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        int status = job.waitForCompletion(true) ? 0 : 1;
        return status;
    }

    public static void main(String[] args) {
        try {
            if (args.length != 1) {
                log.warn("args length must be 1 and as date param");
                return;
            }
            String tmpIn = SystemConfig.getProperty("hdfs.input.path.v2");
            String tmpOut = SystemConfig.getProperty("hdfs.output.path.v2");
            String inPath = String.format(tmpIn, "t_pic_20150801.log");
            String outPath = String.format(tmpOut, "meta/" + args[0]);
            // Back up the previous output on HDFS before writing the new result.
            HDFSUtils.bak(tmpOut, outPath, "meta/" + args[0] + "-old", conf);
            args = new String[] { inPath, outPath };
            int res = ToolRunner.run(new Configuration(), new PictureRelations(), args);
            System.exit(res);
        } catch (Exception ex) {
            ex.printStackTrace();
            log.error("Picture relations task has error, msg is " + ex.getMessage());
        }
    }
}
```
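Assuming the class above is packaged into a jar together with the author's SystemConfig and HDFSUtils helper classes (not shown in this post), the job takes a single date-like argument, which is only used to build the output directory under meta/; the actual input and output paths come from the hdfs.input.path.v2 and hdfs.output.path.v2 properties. A submission would then look roughly like `hadoop jar picture-relations.jar cn.hadoop.hdfs.example.PictureRelations 20150801`, where the jar name is just a placeholder for your own build.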
4. Screenshot Preview
The computed result is shown in the figure below:
5. Summary
This post only implements the picture-relation computation at the level of the basic approach. With large data volumes there is still room for optimization; I won't go into that here, and will analyze the details in a later post when time allows.
6. Closing Remarks
That's all I'd like to share in this post. If you run into any problems while studying or experimenting, feel free to join the discussion group or send me an email, and I'll do my best to answer. Let's keep learning together!