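MapReduce in Practice - Computing Posting Frequency per Time Slot from Article Records

Posted by dust1 on 2018-10-19

Original URL: https://juejin.im/post/5bc88a5be51d45181c0437be

MapReduce Overview

MapReduce is a distributed computing model proposed by Google. It was first used mainly in search, to run computations over massive data sets.
An MR job consists of two phases, Map and Reduce. The user only has to implement two functions, map() and reduce(), to get a distributed computation.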
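To make the two phases concrete, here is a minimal in-memory sketch in plain Java. It is not Hadoop code: the class MiniMapReduce and its methods are hypothetical names chosen for illustration, and word counting is only a stand-in task. map() emits (key, 1) pairs, a grouping step plays the role of the framework's shuffle, and reduce() sums the values for each key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {

    // "map" phase: turn one input record into zero or more (key, value) pairs.
    public static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1L));
            }
        }
        return out;
    }

    // "reduce" phase: combine all values that share one key into a single result.
    public static long reduce(String key, List<Long> values) {
        long sum = 0L;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: grouping by key stands in for the shuffle/sort done by the framework.
    public static Map<String, Long> run(List<String> lines) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Long> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In a real job the framework runs many map() calls in parallel on different machines and moves the grouped pairs over the network; only the two user functions look like the ones above.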

Example

Data source structure

First, look at the structure of the data source:

CREATE TABLE `article` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `allowed_add_tag` int(2) DEFAULT NULL,
  `attitudes` varchar(255) DEFAULT NULL,
  `attitudes_id` int(11) DEFAULT NULL,
  `banana_count` int(11) DEFAULT NULL,
  `big_cover_image` varchar(255) DEFAULT NULL,
  `channel_id` int(11) DEFAULT NULL,
  `channel_name` varchar(255) DEFAULT NULL,
  `channel_path` varchar(255) DEFAULT NULL,
  `comment_count` int(11) DEFAULT NULL,
  `contribute_time` datetime DEFAULT NULL,
  `cover_image` varchar(255) DEFAULT NULL,
  `description` varchar(255) DEFAULT NULL,
  `essense` int(2) DEFAULT NULL,
  `favorite_count` int(11) DEFAULT NULL,
  `latest_active_time` datetime DEFAULT NULL,
  `latest_comment_time` datetime DEFAULT NULL,
  `like_count` int(11) DEFAULT NULL,
  `link` varchar(255) DEFAULT NULL,
  `parent_channel_id` int(11) DEFAULT NULL,
  `parent_channel_name` varchar(255) DEFAULT NULL,
  `parent_realm_id` int(11) DEFAULT NULL,
  `realm_id` int(11) DEFAULT NULL,
  `realm_name` varchar(255) DEFAULT NULL,
  `recommended` int(2) DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `tag_list` varchar(255) DEFAULT NULL,
  `title` varchar(255) DEFAULT NULL,
  `top_level` int(2) DEFAULT NULL,
  `tudou_domain` int(2) DEFAULT NULL,
  `type_id` int(11) DEFAULT NULL,
  `user_avatar` varchar(255) DEFAULT NULL,
  `user_id` int(11) DEFAULT NULL,
  `username` varchar(255) DEFAULT NULL,
  `view_count` int(11) DEFAULT NULL,
  `view_only` int(2) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=13103 DEFAULT CHARSET=utf8mb4;

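I exported the data in this table to a CSV file.

Approach

In this example, I want to count, based on each post's publication time, how many posts fall into each 30-minute slot of the day.

Since I have not overridden the InputFormat interface, the job uses Hadoop's default line-by-line file reading (TextInputFormat). The arguments passed to map() are therefore <byte offset of the line, [one line of data]>.

InputFormat interface - this interface specifies the content format of the input files.

Its getSplits function divides the input data into splits (numSplits is a hint for how many), and each split is handed to one map task.

Its getRecordReader function provides an iterator object for parsing a split; it turns each record in the split into a key/value pair.

The steps are:

1. Extract the posting time from the record.
2. Compute which time slot of the day the posting time falls into and pass it to reduce() as <time slot, 1>.
3. In reduce(), count how many times each time slot occurs.

util

First, write the utility class Times.java with the method period(str:String, format:String), which does the following:

Given a time string and its format, return the 30-minute slot of the day that contains that time, for example:

Input: "2018-10-18 22:05:11", "yyyy-MM-dd HH:mm:ss"

Output: "201810182200-201810182230"

The method is as follows: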

public static String period(String time, String format) {
    Objects.requireNonNull(time);
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern(format);
    LocalDateTime dateTime = LocalDateTime.parse(time, formatter);
    // Snap to the start of the 30-minute slot: minute 0 or minute 30.
    LocalDateTime start = dateTime.withMinute(dateTime.getMinute() < 30 ? 0 : 30);
    // The slot ends 30 minutes later; this also rolls over the hour and the day.
    LocalDateTime end = start.plusMinutes(30);

    DateTimeFormatter dateTimeFormatter = DateTimeFormatter.ofPattern("yyyyMMddHHmm");
    return start.format(dateTimeFormatter) + "-" + end.format(dateTimeFormatter);
}

Test input:

period("2018-11-11 23:34", "yyyy-MM-dd HH:mm");

Result:

201811112330-201811120000
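Note that the key returned by period() includes the date, so counts keyed this way are per day and per slot. If the goal is one total per slot of the day across all dates, one option is a key without the date. The following is a sketch under that assumption; SlotOfDay and slotOfDay are hypothetical names, not part of the original Times.java.

```java
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class SlotOfDay {

    private static final DateTimeFormatter HHMM = DateTimeFormatter.ofPattern("HHmm");

    // Hypothetical helper: returns the slot of the day only, e.g. "2200-2230".
    public static String slotOfDay(String time, String format) {
        LocalDateTime dateTime = LocalDateTime.parse(time, DateTimeFormatter.ofPattern(format));
        // Drop minutes and seconds, then add back 0 or 30 minutes.
        LocalTime start = dateTime.toLocalTime()
                .truncatedTo(ChronoUnit.HOURS)
                .plusMinutes(dateTime.getMinute() < 30 ? 0 : 30);
        // LocalTime wraps around midnight, so 23:30 + 30 minutes is 00:00.
        LocalTime end = start.plusMinutes(30);
        return start.format(HHMM) + "-" + end.format(HHMM);
    }

    public static void main(String[] args) {
        System.out.println(slotOfDay("2018-10-18 22:05:11", "yyyy-MM-dd HH:mm:ss")); // 2200-2230
        System.out.println(slotOfDay("2018-11-11 23:34", "yyyy-MM-dd HH:mm"));       // 2330-0000
    }
}
```

With such a key, posts from different days that fall into the same half hour end up in the same reduce() call.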

Map

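The code of TimeMapper.java:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TimeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Matches a timestamp such as 2018/10/18 22:05:11 anywhere in the line.
    private static final String TIME_REGEX = "[0-9]{4}[/][0-9]{1,2}[/][0-9]{1,2}[ ][0-9]{1,2}[:][0-9]{1,2}[:][0-9]{1,2}";

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String time = Matchers.stringCutBymatchers(value.toString(), TIME_REGEX);
        // Skip lines without a timestamp (for example a CSV header) instead of failing the task.
        if (time == null) {
            return;
        }
        String result = Times.period(time, "yyyy/MM/dd HH:mm:ss");
        context.write(new Text(result), new LongWritable(1));
    }
}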

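Because the .csv file is read line by line and the time in each line has the form yyyy/MM/dd HH:mm:ss, the mapper extracts the time directly with a regular expression. It then computes the time slot and passes <time slot, 1> to reduce().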

Matchers.stringCutBymatchers():

public static String stringCutBymatchers(String str, String mstr) {
    Pattern pattern = Pattern.compile(mstr);
    Matcher matcher = pattern.matcher(str);
    String result = null;
    // Return the first match, or null when the line contains none.
    if (matcher.find()) {
        result = matcher.group(0);
    }
    return result;
}
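The extraction step can be tried outside Hadoop. The sketch below uses a made-up CSV line (the real export is not shown in the article) and a hypothetical class ExtractTime. It also illustrates one caveat: the regular expression accepts one-digit months, days and hours, while the pattern yyyy/MM/dd HH:mm:ss only parses two-digit fields, so the sketch parses with yyyy/M/d H:m:s, which accepts both.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractTime {

    // Same shape as the regex in TimeMapper, compiled once and reused.
    private static final Pattern TIME = Pattern.compile(
            "[0-9]{4}/[0-9]{1,2}/[0-9]{1,2} [0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2}");

    // Single pattern letters (M, d, H, m, s) accept one- or two-digit fields.
    private static final DateTimeFormatter LENIENT = DateTimeFormatter.ofPattern("yyyy/M/d H:m:s");

    // Returns the first timestamp found in the line, or null if there is none.
    public static LocalDateTime firstTimestamp(String line) {
        Matcher m = TIME.matcher(line);
        return m.find() ? LocalDateTime.parse(m.group(), LENIENT) : null;
    }

    public static void main(String[] args) {
        // Made-up sample rows; the column layout is an assumption.
        System.out.println(firstTimestamp("42,foo,2018/1/5 3:04:05,bar")); // 2018-01-05T03:04:05
        System.out.println(firstTimestamp("id,title,contribute_time"));    // null
    }
}
```

Because find() returns the first match, a row with several datetime columns yields the leftmost one; in the table above that would be contribute_time, assuming the CSV keeps the column order of the table.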

Reduce

The reduce() phase is simple: it only has to add up how many times each time slot occurs.

TimeReduce.java:

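import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TimeReduce extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the 1s emitted by the mappers for this time slot.
        long counts = 0L;
        for (LongWritable val : values) {
            counts += val.get();
        }
        context.write(key, new LongWritable(counts));
    }
}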

main

The main method is as follows:

TimeApp.java:

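import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class TimeApp {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = (new GenericOptionsParser(conf, args)).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.out.println("Please provide the input directory and the output directory");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "acfun-time");
        job.setJarByClass(TimeApp.class);
        job.setMapperClass(TimeMapper.class);
        job.setReducerClass(TimeReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Every argument except the last one is an input path.
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }

        // The last argument is the output directory.
        Path path = new Path(otherArgs[otherArgs.length - 1]);
        FileSystem fileSystem = path.getFileSystem(conf); // the file system that owns this path
        if (fileSystem.exists(path)) {
            fileSystem.delete(path, true); // true = recursive: delete the directory even if it has contents
        }

        FileOutputFormat.setOutputPath(job, path);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}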

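Running

The final file layout is as follows: [Figure: project directory tree]

The other packages are reserved for classes to be added later and are not used yet.

Here I use the same launch approach as hadoop-example and set up a single top-level Main.java: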

import org.apache.hadoop.util.ProgramDriver;

public class Main {

    public static void main(String[] args) {
        int exitCode = -1;
        ProgramDriver pgd = new ProgramDriver();
        try {
            // Register each job under the name used on the command line.
            pgd.addClass("citycount", CSVApp.class, "Count the cities mentioned in the articles");
            pgd.addClass("timecount", TimeApp.class, "Count the posts in each time slot");
            exitCode = pgd.run(args);
        } catch (Throwable e) {
            e.printStackTrace();
        }
        System.exit(exitCode);
    }

}

The job to run is selected by the command-line argument.

Package the jar, upload it, and then run it.

Run

yarn jar com.dust-1.0-SNAPSHOT.jar timecount /acfun/input/dust_acfun_article.csv /acfun/output

Wait for the job to finish.

After it completes, run

hdfs dfs -cat /acfun/output/part-r-00000

to view the result.

After that, extracting the data from this file and plotting it gives a direct view of the posting time slots.
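As a starting point for the charting step, here is a hedged sketch that parses lines of the form key, tab, count (the layout TextOutputFormat writes by default) into a map sorted by time slot. OutputParser is a hypothetical class, and the sample lines are made up, not real job output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OutputParser {

    // Parses "key<TAB>value" lines into a map sorted by key.
    public static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // not a key/value line, skip it
            }
            String slot = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            counts.put(slot, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the content of part-r-00000.
        List<String> sample = Arrays.asList(
                "201810182230-201810182300\t7",
                "201810182200-201810182230\t12");
        parse(sample).forEach((slot, n) -> System.out.println(slot + " " + n));
    }
}
```

The sorted map can then be handed to whatever charting tool is at hand, with the slot on the x axis and the count on the y axis.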

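Classes in MapReduce that users can customize:

InputFormat interface

Users implement this interface to specify the content format of the input files. The signatures in this section come from the old org.apache.hadoop.mapred API; the job above uses the newer org.apache.hadoop.mapreduce API, where the same roles exist with slightly different signatures. The interface has two methods: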

public interface InputFormat<K, V> {

    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    RecordReader<K, V> getRecordReader(InputSplit split,
                                       JobConf job,
                                       Reporter reporter) throws IOException;
}

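The getSplits function divides the input data into splits (numSplits is a hint for how many), and each split is handed to one map task. The getRecordReader function provides an iterator object for parsing a split; it turns each record in the split into a key/value pair.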

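Hadoop itself provides several InputFormat implementations:

InputFormat	Description
TextInputFormat	Each line of the file is a record. The key is the byte offset of the line and the value is the content of the line. key: LongWritable, value: Text
KeyValueTextInputFormat	Each line of the text file is a record. The first separator splits the line: everything before it is the key, everything after it is the value. The separator is set with the key.value.separator.in.input.line property and defaults to the tab character (\t). key: Text, value: Text
SequenceFileInputFormat<K, V>	An InputFormat for reading sequence files. Keys and values are user-defined. A sequence file is a Hadoop-specific binary file format that can be compressed. It is optimized for passing data from the output of one MapReduce job to the input of another. key: K, value: V
NLineInputFormat	Each split is guaranteed to contain exactly N lines. N is set with the mapred.line.input.format.linespermap property and defaults to 1. key: LongWritable, value: Text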
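To illustrate the TextInputFormat row, the plain-Java sketch below (no Hadoop involved; LineOffsets is a hypothetical class) mimics the (byte offset, line) records a mapper would see for a small text terminated by '\n'. The real TextInputFormat additionally deals with splits and other line endings, which this sketch ignores.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineOffsets {

    // Returns (byte offset of the line start -> line text) for '\n'-terminated UTF-8 text.
    public static Map<Long, String> records(String content) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            out.put(offset, line);
            // Advance by the encoded length of the line plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("id,title\n1,hello\n2,world\n"));
        // prints {0=id,title, 9=1,hello, 17=2,world}
    }
}
```

This is why the key passed to map() in the job above is rarely useful for a CSV file: it only says where the line starts, not which row it is.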

Mapper interface

Users implement the Mapper interface to write their own Mapper. The function that must be implemented is

void map(K1 key,
         V1 value,
         OutputCollector<K2, V2> output,
         Reporter reporter) throws IOException

Here, the <key, value> pair is produced by the RecordReader object of the InputFormat, OutputCollector collects the output of map(), and Reporter holds the progress of the current task.

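Hadoop itself provides several Mappers for users:

Class	Description
IdentityMapper<K, V>	Implements Mapper<K, V, K, V> and maps the input directly to the output
InverseMapper<K, V>	Implements Mapper<K, V, V, K> and swaps key and value on the way to the output
RegexMapper	Implements Mapper<K, Text, Text, LongWritable> and emits a (match, 1) pair for each regular expression match
TokenCountMapper	Implements Mapper<K, Text, Text, LongWritable> and emits a (token, 1) pair for each token of the input value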

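Partitioner interface

Users implement this interface to write their own Partitioner, which decides which reduce task handles each key/value pair produced by a map task. A good Partitioner gives every reduce task a similar amount of data and thereby balances the load. The function to implement in a Partitioner is

getPartition( K2 key, V2 value, int numPartitions)

This function returns the ID of the corresponding reduce task.

If the user does not provide a Partitioner, Hadoop uses the default one, HashPartitioner, which is in effect a hash function.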
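The rule behind the default partitioner is essentially the key's hash code, made non-negative, modulo the number of reduce tasks. A plain-Java sketch of that rule follows; HashPartitionSketch is a hypothetical name, and real keys are Writable types such as Text, whose hashCode() differs from that of String.

```java
public class HashPartitionSketch {

    // Clears the sign bit so that negative hash codes still map to a valid index,
    // then takes the remainder by the number of partitions.
    public static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int reducers = 4;
        String[] keys = {"201810182200-201810182230", "201810182230-201810182300"};
        for (String key : keys) {
            // Equal keys always get the same partition, so one reducer sees all their values.
            System.out.println(key + " -> partition " + getPartition(key, reducers));
        }
    }
}
```

The same key always lands in the same partition, which is what guarantees that a single reduce() call receives every value for that key.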

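Combiner

A Combiner pre-aggregates the output of a map task locally, which greatly reduces the amount of data transferred between map tasks and reduce tasks and can noticeably improve performance. In most cases the Combiner is the same class as the Reducer.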
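Reusing the Reducer as a Combiner is only safe when the operation gives the same result on partial groups, as a sum does. The plain-Java sketch below (CombinerSketch is a hypothetical class, not Hadoop code) compares summing five 1s directly with summing two locally combined partial sums. For the job in this article, the corresponding step would presumably be a call to job.setCombinerClass(TimeReduce.class), which the original code does not make.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSketch {

    // The same summing logic serves as both combiner and reducer.
    public static long sum(List<Long> values) {
        long total = 0L;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Values emitted for one key by two different map tasks.
        List<Long> mapTask1 = Arrays.asList(1L, 1L, 1L);
        List<Long> mapTask2 = Arrays.asList(1L, 1L);

        // Without a combiner: all five values cross the network to the reducer.
        long direct = sum(Arrays.asList(1L, 1L, 1L, 1L, 1L));

        // With a combiner: each map task sends one partial sum instead.
        long combined = sum(Arrays.asList(sum(mapTask1), sum(mapTask2)));

        System.out.println(direct + " " + combined); // prints 5 5
    }
}
```

Hadoop may run the Combiner zero, one, or several times on a map task's output, so the result must not depend on how often it runs; an average computed naively, for example, would not qualify.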

Reducer interface

Users implement the Reducer interface to write their own Reducer. The function that must be implemented is

void reduce(K2 key,
            Iterator<V2> values,
            OutputCollector<K3, V3> output,
            Reporter reporter) throws IOException

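Hadoop itself provides several Reducers for users:

Class	Description
IdentityReducer<K, V>	Implements Reducer<K, V, K, V> and maps the input directly to the output
LongSumReducer	Implements Reducer<K, LongWritable, K, LongWritable> and computes the sum of all values for a given key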

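OutputFormat

Users specify the content format of the output files through an OutputFormat; unlike the input side, there are no splits. Each reduce task writes its data to its own file named part-nnnnn, where nnnnn is the ID of the reduce task. With the newer mapreduce API used by the job above, the reducer output files are named part-r-nnnnn, as in part-r-00000.

Hadoop itself provides several OutputFormat implementations:

OutputFormat	Description
TextOutputFormat	Writes each record as one line of text. Keys and values are written as strings separated by a tab (\t); the separator can be changed with the mapred.textoutputformat.separator property
SequenceFileOutputFormat	Writes key/value pairs in Hadoop's own sequence file format. Used together with SequenceFileInputFormat
NullOutputFormat	Produces no output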
