Extending Hive with Custom Functions

Published by the Ele.me Logistics Tech Team on 2018-12-30

About the Author

Chun Min is a logistics architect and team leader. Diligent and conscientious, he once "programmed facing the sea" while on vacation, which is impressive even if hard to fathom.

In Hive, users can define custom functions to extend the capabilities of HiveQL. Hive custom functions come in three main kinds:

  • UDF (user-defined function): processes a single row and produces a single row of output. Many of Hive's built-in string, math and date functions are of this type, e.g. concat, split, length and rand, and in most cases writing a function of this kind is enough. There are two ways to write one: extend the UDF class, or extend the GenericUDF class (the "generic" UDF).
  • UDAF (user-defined aggregate function): processes multiple rows and accumulates them into a result, generally used together with group by. Common examples are max, min, count, sum and collect_set. Again there are two ways to write one: extend the UDAF class, or extend the AbstractGenericUDAFResolver class.
  • UDTF (user-defined table-generating function): turns one row into multiple rows, or splits one column into several. explode, usually paired with Lateral View, expands a column into rows; parse_url_tuple turns one column into multiple columns.

Hive's UDF mechanism requires the user to implement a Resolver and an Evaluator: the Resolver handles the input and dispatches to the Evaluator, and the Evaluator implements the actual functionality.
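The rest of this post shows the plain UDF style; for the GenericUDF style mentioned in the list above, here is a minimal sketch (not from the original post; the class name MyUpperUDF and the function name my_upper are hypothetical). In a GenericUDF, initialize() plays the Resolver's type-checking role and evaluate() is the per-row Evaluator:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class MyUpperUDF extends GenericUDF {

  private StringObjectInspector inputOI;

  // Resolution step: validate argument types once per query.
  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 1 || !(arguments[0] instanceof StringObjectInspector)) {
      throw new UDFArgumentException("my_upper expects a single string argument");
    }
    inputOI = (StringObjectInspector) arguments[0];
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  // Evaluation step: called once per row.
  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    String value = inputOI.getPrimitiveJavaObject(arguments[0].get());
    return value == null ? null : value.toUpperCase();
  }

  @Override
  public String getDisplayString(String[] children) {
    return "my_upper(" + children[0] + ")";
  }
}

Compared with the reflection-based UDF class, a GenericUDF resolves its argument types explicitly through ObjectInspectors, which also lets it accept complex types.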

Custom UDF Implementation and Invocation Mechanism

Hive provides a base class, org.apache.hadoop.hive.ql.exec.UDF, which holds an object of DefaultUDFMethodResolver, an implementation of the UDFMethodResolver interface.

public class UDF {
  private UDFMethodResolver rslv;

  public UDF() {
    this.rslv = new DefaultUDFMethodResolver(this.getClass());
  }
	......
}

DefaultUDFMethodResolver provides a getEvalMethod method, which looks up the UDF's evaluate method via reflection so that it can be invoked:

public class DefaultUDFMethodResolver implements UDFMethodResolver {
  private final Class<? extends UDF> udfClass;

  public DefaultUDFMethodResolver(Class<? extends UDF> udfClass) {
    this.udfClass = udfClass;
  }

  public Method getEvalMethod(List<TypeInfo> argClasses) throws UDFArgumentException {
    return FunctionRegistry.getMethodInternal(this.udfClass, "evaluate", false, argClasses);
  }
}

A custom UDF is implemented by extending org.apache.hadoop.hive.ql.exec.UDF and providing an evaluate method, which is then located and executed through the DefaultUDFMethodResolver object.

Case Study: Testing Whether a Point Lies Inside a Polygon

public class DAIsContainPoint extends UDF {

  // Returns true if the point (longitude, latitude) lies inside the polygon
  // described by the GeoJSON string; any parse or geometry failure yields false.
  public Boolean evaluate(Double longitude, Double latitude, String geojson) {
    try {
      Polygon polygon = JTSHelper.parse(geojson); // team-internal GeoJSON-to-JTS helper
      Coordinate center = new Coordinate(longitude, latitude);
      GeometryFactory factory = new GeometryFactory();
      Point point = factory.createPoint(center);
      return polygon.contains(point);
    } catch (Throwable e) {
      // swallow malformed input instead of failing the whole query
      return false;
    }
  }
}

Once the code is done, it needs to be packaged and compiled into a jar. Note: the final jar must include all of its dependent jars; for Maven builds the maven-shade-plugin is recommended:

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.2</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <filters>
              <filter>
                <artifact>*:*</artifact>
                <excludes>
                  <exclude>META-INF/*.SF</exclude>
                  <exclude>META-INF/*.DSA</exclude>
                  <exclude>META-INF/*.RSA</exclude>
                </excludes>
              </filter>
            </filters>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

Finally, the resulting jar file needs to be referenced in HIVE SQL:

add jar hdfs://xxx/udf/ff8bd59f-d0a5-4b13-888b-5af239270869/udf.jar;
create temporary function is_in_polygon as 'me.ele.breat.hive.udf.DAIsContainPoint';

select lat, lng, geojson, is_in_polygon(lng, lat, geojson) as is_in from example;

Custom UDAFs and MapReduce

Hive speeds up aggregate computation by running it as MapReduce, and UDAF is the extension mechanism for writing custom aggregation logic. To explain UDAFs properly, we first need a picture of the MapReduce flow:

[Figure: the MapReduce flow]

Back to Hive: a UDAF implementation first extends org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver and implements the org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2 interface. It then builds a GenericUDAFEvaluator class that implements the MapReduce computation, in which three methods are key:

  • iterate: takes the mapper's input rows and feeds them on for merging
  • merge: the combiner merges the mappers' partial results
  • terminate: merges all combiner results and returns the final result

Finally, implement a class that extends AbstractGenericUDAFResolver and override its getEvaluator method to return an instance of the GenericUDAFEvaluator. How these methods line up with the MapReduce stages is sketched below.
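As a rough sketch (this follows the standard GenericUDAFEvaluator.Mode semantics of the Hive API, not code from this post), the stage-to-method mapping is:

// Which evaluator methods run in which stage, keyed by the Mode value
// that Hive passes to init():
//
//   Mode.PARTIAL1  map side     : iterate(raw rows)  -> terminatePartial()
//   Mode.PARTIAL2  combiner     : merge(partials)    -> terminatePartial()
//   Mode.FINAL     reduce side  : merge(partials)    -> terminate()
//   Mode.COMPLETE  map-only job : iterate(raw rows)  -> terminate()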

Case Study: Merging Geofences

public class DAJoinV2 extends AbstractGenericUDAFResolver implements GenericUDAFResolver2 {

  @Override
  public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo genericUDAFParameterInfo)
      throws SemanticException {
    return new DAJoinStringEvaluator();
  }

  public GenericUDAFEvaluator getEvaluator(TypeInfo[] typeInfos) throws SemanticException {
    if (typeInfos.length != 1) {
      throw new UDFArgumentTypeException(typeInfos.length - 1,
          "Exactly one argument is expected.");
    }

    if (typeInfos[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
      throw new UDFArgumentTypeException(0,
          "Only primitive type arguments are accepted but "
              + typeInfos[0].getTypeName() + " is passed.");
    }

    switch (((PrimitiveTypeInfo) typeInfos[0]).getPrimitiveCategory()) {
      case STRING:
        return new DAJoinStringEvaluator();
      default:
        throw new UDFArgumentTypeException(0,
            "Only string type arguments are accepted but "
                + typeInfos[0].getTypeName() + " is passed.");
    }
  }

  public static class DAJoinStringEvaluator extends GenericUDAFEvaluator {

    private PrimitiveObjectInspector mInput;
    private Text mResult;

    // Aggregation buffer class holding the running union of geometries
    static class PolygonAgg implements AggregationBuffer {
      Geometry geometry;
    }

    // Defines the UDAF's return type: DAJoin returns a Text value
    @Override
    public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
      assert (parameters.length == 1);
      super.init(m, parameters);
      mResult = new Text();
      mInput = (PrimitiveObjectInspector) parameters[0];
      return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    // Creates the in-memory buffer that stores the running aggregate across the mapper, combiner and reducer stages
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
      PolygonAgg polygonAgg = new PolygonAgg();
      reset(polygonAgg);
      return polygonAgg;
    }

    public void reset(AggregationBuffer aggregationBuffer) throws HiveException {
      PolygonAgg polygonAgg = (PolygonAgg) aggregationBuffer;
      GeometryFactory factory = new GeometryFactory();
      polygonAgg.geometry = factory.createPolygon(new Coordinate[]{});
    }

    // Map stage: take each mapper input row and merge it into the buffer
    public void iterate(AggregationBuffer aggregationBuffer, Object[] objects)
        throws HiveException {
      assert (objects.length == 1);
      merge(aggregationBuffer, objects[0]);
    }

    // Returns the partial aggregation so the combiner can merge map outputs
    public Object terminatePartial(AggregationBuffer aggregationBuffer) throws HiveException {
      return terminate(aggregationBuffer);
    }

    // Combiner: merge a map-side partial result (a GeoJSON string) into the buffer
    public void merge(AggregationBuffer aggregationBuffer, Object partial) throws HiveException {
      if (partial != null) {
        try {
          PolygonAgg polygonAgg = (PolygonAgg) aggregationBuffer;
          String geoJson = PrimitiveObjectInspectorUtils.getString(partial, mInput);
          Polygon polygon = JTSHelper.parse(geoJson);
          polygonAgg.geometry = polygonAgg.geometry.union(polygon);
        } catch (Exception e) {
          // skip geometries that fail to parse
        }
      }
    }

    // Reducer: merge all combiner results and return the final value
    public Object terminate(AggregationBuffer aggregationBuffer) throws HiveException {
      try {
        PolygonAgg polygonAgg = (PolygonAgg) aggregationBuffer;
        Geometry buffer = polygonAgg.geometry.buffer(0);
        mResult.set(JTSHelper.convert2String(buffer.convexHull()));
        return mResult;
      } catch (Exception e) {
        // return a writable Text to match the declared ObjectInspector
        mResult.set("");
        return mResult;
      }
    }

  }
}

After packaging, use it in HIVE SQL:

add jar hdfs://xxx/udf/ff8bd59f-d0a5-4b13-888b-5af239270869/udf.jar;
create temporary function da_join as 'me.ele.breat.hive.udf.DAJoinV2';

create table udaf_example as
select id, da_join(da_range) as da_union_polygon
  from example
group by id;

Custom UDTFs

A UDTF implementation first extends org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and implements the process, initialize and close methods (a minimal skeleton follows the list below):

  • initialize returns a StructObjectInspector object, which determines the names and types of the output columns
  • process handles each input record, producing a new array that is passed on to the forward method
  • close is the callback at the end of the whole invocation, used to clean up memory
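Before the full case study, a minimal skeleton may help. This is a hypothetical example (SplitToRowsUDTF is not from the original post) that splits a comma-separated string into one output row per element:

import java.util.Arrays;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class SplitToRowsUDTF extends GenericUDTF {

  // One output column named "item" of type string.
  @Override
  public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
    return ObjectInspectorFactory.getStandardStructObjectInspector(
        Arrays.asList("item"),
        Arrays.<ObjectInspector>asList(PrimitiveObjectInspectorFactory.javaStringObjectInspector));
  }

  // Each forward() call emits one output row.
  @Override
  public void process(Object[] args) throws HiveException {
    for (String item : String.valueOf(args[0]).split(",")) {
      forward(new Object[]{item});
    }
  }

  @Override
  public void close() throws HiveException {
    // stateless: nothing to flush
  }
}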

Case Study: Converting an Input Polygon into a Set of S2 Cells

public class S2SimpleRegionCoverV2 extends GenericUDTF {

  private static final int LEVEL = 16; // target S2 cell level for the covering

  @Override
  public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
    List<String> structFieldNames = Lists.newArrayList("s2cellid");
    List<ObjectInspector> structFieldObjectInspectors = Lists.<ObjectInspector>newArrayList(
        PrimitiveObjectInspectorFactory.javaLongObjectInspector);

    return ObjectInspectorFactory
        .getStandardStructObjectInspector(structFieldNames, structFieldObjectInspectors);
  }

  @Override
  public void process(Object[] objects) throws HiveException {
    String json = String.valueOf(objects[0]);

    List<Long> s2cellids = toS2CellIds(json);

    for (Long s2cellid: s2cellids){
      forward(new Long[]{s2cellid});
    }
  }

  public static List<Long> toS2CellIds(String json) {
    GeometryFactory factory = new GeometryFactory();
    GeoJsonReader reader = new GeoJsonReader();

    Geometry geometry = null;
    try {
      geometry = reader.read(json);
    } catch (ParseException e) {
      geometry = factory.createPolygon(new Coordinate[]{});
    }

    List<S2Point> polygonS2Point = new ArrayList<S2Point>();
    for (Coordinate coordinate : geometry.getCoordinates()) {
      S2LatLng s2LatLng = S2LatLng.fromDegrees(coordinate.y, coordinate.x);
      polygonS2Point.add(s2LatLng.toPoint());
    }

    List<S2Point> points = polygonS2Point;

    if (points.size() == 0) {
      return Lists.newArrayList();
    }

    ArrayList<S2CellId> result = new ArrayList<S2CellId>();
    S2RegionCoverer
        .getSimpleCovering(new S2Polygon(new S2Loop(points)), points.get(0), LEVEL, result);

    List<Long> output = new ArrayList<Long>();
    for (S2CellId s2CellId : result) {
      output.add(s2CellId.id());
    }

    return output;
  }

  @Override
  public void close() throws HiveException {
    // stateless: nothing to clean up
  }
}

It is used together with lateral view:

add jar hdfs://bipcluster/data/upload/udf/ff8bd59f-d0a5-4b13-888b-5af239270869/google_s2_udf.jar;
create temporary function da_cover as 'me.ele.breat.hive.udf.S2SimpleRegionCoverV2';

drop table if exists temp.cm_s2_id_cover_list;

create table temp.cm_s2_id_cover_list as
select tb_s2cellid.s2cellid, source.shop_id
from (
select
  geometry,
  shop_id
from
  example) source
lateral view da_cover(geometry) tb_s2cellid as s2cellid;
