MapReduce: data anomaly when writing data into an HBase table

Posted by _Karl on 2018-06-23

Problem:

The data written to HBase is wrong.

The input data is:

hello   nihao   hadoop  hehe    byebye
hello   nihao   hadoop
spark   scale

But once it is stored in the table, it becomes:

hbase(main):038:0> scan 't_user2'
ROW                   COLUMN+CELL                                                  
 byebye               column=MR_column:wordcount, timestamp=1529670719617, value=1 
 hadoop               column=MR_column:wordcount, timestamp=1529670719617, value=2 
 heheop               column=MR_column:wordcount, timestamp=1529670719617, value=1 
 hellop               column=MR_column:wordcount, timestamp=1529670719617, value=2 
 nihaop               column=MR_column:wordcount, timestamp=1529670719617, value=2 
 scalep               column=MR_column:wordcount, timestamp=1529670719617, value=1 
 sparkp               column=MR_column:wordcount, timestamp=1529670719617, value=1 

The data is clearly wrong. Notice that all the rowkeys above have exactly the same length, and that every word shorter than hadoop has been padded at the end with the trailing letters of hadoop.

The cause is in the following code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MRWriterToHbaseTableDriver {

    static class MRWriterToHbaseTableMapper extends
            Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line on tabs and emit (word, 1) pairs.
            String line = value.toString();
            String[] arr = line.split("\t");
            for (String name : arr) {
                context.write(new Text(name), new IntWritable(1));
            }
        }
    }

    static class MRWriterToHbaseTableReducer extends
            TableReducer<Text, IntWritable, ImmutableBytesWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> value,
                Context context) throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable iw : value) {
                count += iw.get();
            }

            // Put put = new Put(key.getBytes());
            // The commented-out line above is where the anomaly comes from;
            // replacing it with the statement below fixes the problem.
            Put put = new Put(key.toString().getBytes());  // set the rowkey
            put.addColumn("MR_column".getBytes(), "wordcount".getBytes(),
                    (count + "").getBytes());  // column family, qualifier and value

            // context.write(new ImmutableBytesWritable(key.getBytes()), put);
            // The ImmutableBytesWritable key here could also be another type,
            // e.g. null or Text, as long as it matches the KEYOUT generic
            // parameter declared above.
            context.write(new ImmutableBytesWritable(key.toString().getBytes()), put);
        }
    }

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("hbase.zookeeper.quorum", "min1");

        Job job = Job.getInstance(conf, "myjob");
        job.setJarByClass(MRWriterToHbaseTableDriver.class);

        job.setMapperClass(MRWriterToHbaseTableMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, "/aa");
        TableMapReduceUtil.initTableReducerJob("t_user2",
                MRWriterToHbaseTableReducer.class, job);

        boolean waitForCompletion = job.waitForCompletion(true);
        System.exit(waitForCompletion ? 0 : 1);
    }
}
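As an aside (not part of the original post), the round trip through toString() can also be avoided: Text.getLength() reports how many bytes of the buffer are actually valid, so copying just that prefix produces a correct rowkey as well. Below is a minimal sketch of the reducer written that way; the class name SafeBytesReducer is hypothetical.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Hypothetical alternative reducer: trim the reused Text buffer to its valid
// length instead of converting the key to a String first.
public class SafeBytesReducer extends
        TableReducer<Text, IntWritable, ImmutableBytesWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable iw : values) {
            count += iw.get();
        }

        // Copy only the first getLength() bytes; anything beyond that is a
        // stale tail left over from a longer key that used the same buffer.
        byte[] rowkey = Arrays.copyOf(key.getBytes(), key.getLength());

        Put put = new Put(rowkey);
        put.addColumn("MR_column".getBytes(), "wordcount".getBytes(),
                String.valueOf(count).getBytes());
        context.write(new ImmutableBytesWritable(rowkey), put);
    }
}

Either fix works; the important point is never to hand the raw getBytes() array to Put or ImmutableBytesWritable without taking its valid length into account.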

Analysis:

String.getBytes() and Text.getBytes() return byte arrays of different lengths. String.getBytes() always returns an array that is exactly as long as the string. Text.getBytes(), however, returns Text's internal buffer, and the MapReduce framework reuses the same Text key object from one reduce() call to the next: when it deserializes the next key it only grows the buffer, never shrinks it, and only overwrites the first getLength() bytes. So when the current key is shorter than an earlier one, the trailing bytes of that earlier key are still sitting in the buffer and end up in the rowkey; when the current key is at least as long, the buffer holds exactly the current key. The two tables below confirm this, and a short standalone sketch after them reproduces the same effect:

Contents of the HBase table when the rowkey is built with String.getBytes():

 byebye               column=MR_column:wordcount, timestamp=1529674870277, value=1 
 hadoop               column=MR_column:wordcount, timestamp=1529674870277, value=2 
 hehe                 column=MR_column:wordcount, timestamp=1529674870277, value=1 
 hello                column=MR_column:wordcount, timestamp=1529674870277, value=2 
 looperqu             column=MR_column:wordcount, timestamp=1529674870277, value=1 
 nihao                column=MR_column:wordcount, timestamp=1529674870277, value=2 
 scale                column=MR_column:wordcount, timestamp=1529674870277, value=1 
 spark                column=MR_column:wordcount, timestamp=1529674870277, value=1 
 zookeeper            column=MR_column:wordcount, timestamp=1529674870277, value=1 

Contents of the HBase table when the rowkey is built with Text.getBytes():
 byebye               column=MR_column:wordcount, timestamp=1529675062950, value=1 
 hadoop               column=MR_column:wordcount, timestamp=1529675062950, value=2 
 heheop               column=MR_column:wordcount, timestamp=1529675062950, value=1 
 hellop               column=MR_column:wordcount, timestamp=1529675062950, value=2 
 looperqu             column=MR_column:wordcount, timestamp=1529675062950, value=1 
 nihaorqu             column=MR_column:wordcount, timestamp=1529675062950, value=2 
 scalerqu             column=MR_column:wordcount, timestamp=1529675062950, value=1 
 sparkrqu             column=MR_column:wordcount, timestamp=1529675062950, value=1 
 zookeeper            column=MR_column:wordcount, timestamp=1529675062950, value=1 
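The stale-tail behavior can be reproduced outside MapReduce with a short standalone sketch (not from the original post; it only needs hadoop-common on the classpath). Text.set(byte[], int, int) reuses its internal buffer when the new value fits, which is essentially what the framework does when it deserializes each incoming key into the same Text instance during the reduce phase:

import org.apache.hadoop.io.Text;

// Demonstrates that Text reuses (and only grows) its internal buffer, so
// getBytes() can expose stale bytes from a previously held, longer value.
public class TextBufferDemo {
    public static void main(String[] args) {
        Text t = new Text();

        byte[] longer  = "looperqu".getBytes();  // 8 bytes
        byte[] shorter = "nihao".getBytes();     // 5 bytes

        t.set(longer, 0, longer.length);         // buffer now holds "looperqu"
        t.set(shorter, 0, shorter.length);       // buffer reused, only first 5 bytes overwritten

        System.out.println(new String(t.getBytes()));   // expected: "nihaorqu" (stale tail visible)
        System.out.println(t.getLength());              // 5 -- the valid length
        System.out.println(t.toString());               // "nihao" -- decodes only the valid bytes
    }
}

The expected first line of output matches the nihaorqu rowkey in the Text.getBytes() table above.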



