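Data Warehouse Tools: Calling Hive from Other Languages (15)

Posted by 不二人生 on 2020-12-26

Original URL: https://blog.csdn.net/king14bhhb/article/details/111770561

Calling Hive from Other Languages

Accessing Hive

Earlier in this series we covered Hive's architecture and the HiveServer2 service, and saw that HiveServer2 is what makes it possible to reach Hive from other languages. This post walks through how to do that. First, a quick review of a few concepts: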

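metadata: Hive's metadata, that is, the table names, column names, types, partitions, and users that Hive defines. It is usually stored in a relational database such as MySQL; for testing, Hive's built-in Derby database also works.
metastore: the Hive metastore service. It gives clients such as HiveServer2 access to the metadata. It does not itself turn DDL or DML statements into MapReduce jobs; compiling and submitting jobs is done by the Hive driver that runs inside HiveServer2 or the CLI.
hiveserver2: the Hive server. It exposes Hive as a service, and clients can connect to it through beeline, JDBC (for example from Java code), and other means.
beeline: a client tool for connecting to Hive. Think of it as the counterpart of a MySQL client such as Navicat.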

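Other languages reach Hive mainly through HiveServer2 (HS2), a service that lets clients run queries against Hive. HiveServer2 supports both embedded and remote access, handles multiple concurrent clients, and supports authentication. It is designed to give better support to open API clients such as JDBC and ODBC.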

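Starting HiveServer2 brings up a Hive server that listens on port 10000 by default; clients can connect through beeline, JDBC, or ODBC. On startup, HiveServer2 checks whether hive.metastore.uris is configured. If it is not, HiveServer2 first starts a metastore service of its own and then starts itself. If it is, HiveServer2 connects to the remote metastore service, which is the most common setup. The deployment looks like this:

[Figure: HiveServer2 deployment diagram]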
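The startup decision described above depends only on whether hive.metastore.uris has a value. As a minimal sketch, assuming the standard hive-site.xml layout of property/name/value elements, the check can be reproduced in Python. The function names and the sample host master:9083 are illustrative assumptions, not part of Hive.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def metastore_uris(hive_site_xml: str) -> Optional[str]:
    """Return the value of hive.metastore.uris, or None if it is unset or empty."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == "hive.metastore.uris":
            value = prop.findtext("value", default="").strip()
            return value or None
    return None


def metastore_mode(hive_site_xml: str) -> str:
    """Describe which metastore HiveServer2 would use for this configuration."""
    uris = metastore_uris(hive_site_xml)
    return f"remote metastore at {uris}" if uris else "embedded (local) metastore"


# Hypothetical hive-site.xml content; the host name is an assumption.
SAMPLE = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://master:9083</value>
  </property>
</configuration>"""

print(metastore_mode(SAMPLE))  # remote metastore at thrift://master:9083
```

Running the same check against a configuration without the property reports the embedded case, which corresponds to HiveServer2 starting its own metastore.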

Accessing Hive from Python

Accessing Hive from Python 3 requires the following dependencies:

pip3 install thrift
pip3 install PyHive
pip3 install sasl
pip3 install thrift_sasl

Here is a utility class for accessing Hive from Python:

# -*- coding:utf-8 -*-

from pyhive import hive


class HiveClient(object):
	"""docstring for HiveClient"""
	def __init__(self, host='hadoop-master',port=10000,username='hadoop',password='hadoop',database='hadoop',auth='LDAP'):
		""" 
		create connection to hive server2 
		"""  
		self.conn = hive.Connection(host=host,  
			port=port,  
			username=username,  
			password=password,  
			database=database,
			auth=auth) 

	def query(self, sql):
		""" 
		query 
		""" 
		with self.conn.cursor() as cursor: 
			cursor.execute(sql)
			return cursor.fetchall()

	def insert(self, sql):
		"""
		insert action
		"""
		with self.conn.cursor() as cursor:
			cursor.execute(sql)
			# self.conn.commit()
			# self.conn.rollback()

	def close(self):
		""" 
		close connection 
		"""  
		self.conn.close()

To use it, import the class, create an instance, and pass SQL to the query method to run a query.

# Get a connection (assumes the HiveClient class above has been imported)
hclient = HiveClient()

# Run queries
...

# Close the connection
hclient.close()

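Note: in the insert method, self.conn.commit() and self.conn.rollback() are commented out. These are transaction operations from traditional relational databases. Hive does not support explicit commit and rollback; each statement is auto-committed.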
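The query method returns plain tuples from fetchall(), so column names are lost. A minimal sketch of pairing rows with names follows, assuming the caller also reads cursor.description (a standard DB-API attribute) inside the with block. The helper name rows_to_dicts and the sample values are illustrative, not part of PyHive.

```python
from typing import Any, Dict, List, Sequence


def rows_to_dicts(description: Sequence[Sequence[Any]],
                  rows: Sequence[Sequence[Any]]) -> List[Dict[str, Any]]:
    """Pair each row with the column names from a DB-API cursor.description."""
    # The first item of every description entry is the column name.
    # Hive may prefix names with the table ("t.id"); keep the part after the last dot.
    names = [col[0].rsplit(".", 1)[-1] for col in description]
    return [dict(zip(names, row)) for row in rows]


# Stand-in values shaped like cursor.description and cursor.fetchall().
description = [
    ("t.user_id", "STRING_TYPE", None, None, None, None, True),
    ("t.cnt", "INT_TYPE", None, None, None, None, True),
]
rows = [("u1", 3), ("u2", 5)]

print(rows_to_dicts(description, rows))
# [{'user_id': 'u1', 'cnt': 3}, {'user_id': 'u2', 'cnt': 5}]
```

Stripping the table prefix is a design choice; if two joined tables share a column name, keeping the full name avoids one value overwriting the other.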

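Connecting to Hive from Java

Java is the foundational language of the big data ecosystem, so its support for connecting to Hive is naturally good. This section covers connecting through JDBC and through MyBatis, followed by a Spring Boot integration.

1. JDBC connection

Java connects to HiveServer2 over JDBC the same way it connects to MySQL over JDBC.

It requires the hive-jdbc dependency: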

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>

The code follows the same pattern as connecting to MySQL, using DriverManager.getConnection(url, username, password):

@NoArgsConstructor
@AllArgsConstructor
@Data
@ToString
public class HiveConfigModel {

    private String url = "jdbc:hive2://localhost:10000";
    private String username = "hadoop";
    private String password = "hadoop";
    
}

@Test
public void test(){
    // Initialize the configuration
    HiveConfigModel hiveConfigModel = ConfigureContext.getInstance("hive-config.properties")
            .addClass(HiveConfigModel.class)
            .getModelProperties(HiveConfigModel.class);

    try {
        Connection conn = DriverManager.getConnection(hiveConfigModel.getUrl(),
                hiveConfigModel.getUsername(), hiveConfigModel.getPassword());


        String sql = "show tables";
        PreparedStatement preparedStatement = conn.prepareStatement(sql);
        ResultSet rs = preparedStatement.executeQuery();
        List<String> tables = new ArrayList<>();
        while (rs.next()){
            tables.add(rs.getString(1));
        }

        System.out.println(tables);
    } catch (SQLException e) {
        e.printStackTrace();
    }
}

Inside hive-jdbc-1.2.1.jar, the META-INF directory contains a services directory with a file named java.sql.Driver, whose content is:

org.apache.hive.jdbc.HiveDriver

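java.sql.DriverManager uses SPI (Service Provider Interface) to separate the service interface from its implementation and so decouple the two. Here the JDBC implementation, org.apache.hive.jdbc.HiveDriver, implements its logic against the common contract defined by java.sql.Driver. Client code that uses JDBC does not need to change; it only needs a different SPI provider on the classpath.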

DriverManager.getConnection(url, username, password)

This call is enough to obtain a connection, provided the implementation follows the SPI convention.

2. Integrating MyBatis

MyBatis is commonly used as the DAO layer for database access, and accessing Hive works much the same way.

The configuration file sqlConfig.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE configuration PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <environments default="production">
        <environment id="production">
            <transactionManager type="JDBC"/>
            <dataSource type="POOLED">
                <property name="driver" value="org.apache.hive.jdbc.HiveDriver"/>
                <property name="url" value="jdbc:hive2://master:10000/default"/>
                <property name="username" value="hadoop"/>
                <property name="password" value="hadoop"/>
            </dataSource>
        </environment>
    </environments>
    <mappers>
        <mapper resource="mapper/hive/test/test.xml"/>
    </mappers>
</configuration>

The mapper code is omitted; the implementation code:

public class TestMapperImpl implements TestMapper {

    private static SqlSessionFactory sqlSessionFactory = HiveSqlSessionFactory.getInstance().getSqlSessionFactory();

    @Override
    public int getTestCount(String dateTime) {
        SqlSession sqlSession = sqlSessionFactory.openSession();
        TestMapper testMapper = sqlSession.getMapper(TestMapper.class);

        int count = testMapper.getTestCount(dateTime);

        sqlSession.close();

        return count;
    }
}

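3. Integrating Spring Boot

Skill levels vary across the departments of a company. Not everyone can use a big data analysis backend, and even fewer can write SQL. In that situation you can build a self-service data retrieval system where people get the data they need through a web page. This usually means using Spring Boot to connect to both MySQL and Hive to generate reports. The Spring Boot and Hive integration here uses the Druid connection pool.

Tasks to accomplish

Anyone can write SQL on a web page and run a Hive query;
The amount of data queried must not be too large: no more than 60 days of data (anything more would be a disaster);
After a query is submitted, check YARN's resource usage and reject the query if resources are tight;
The backend reports errors, and the reasons for rejecting a request, to the frontend page by throwing exceptions;
If someone has already run the same query, the result is stored in MySQL, so a second person's query reads from MySQL instead of querying Hive again;

1) Required dependencies

To save space, only the main Maven dependencies for connecting to Hive through HiveServer2 are listed; the parent Spring Boot dependencies are omitted.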

<!-- Version information -->
<properties>
    <hadoop.version>2.6.5</hadoop.version>
    <mybatis.version>3.2.7</mybatis.version>
</properties>
<dependency>
    <groupId>org.mybatis</groupId>
    <artifactId>mybatis</artifactId>
    <version>${mybatis.version}</version>
</dependency>

<!-- Hadoop dependency -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
</dependency>

<!-- hive-jdbc -->
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>

<!-- HTML parsing -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>

2) The application-test.yml file:

# Spring configuration
spring:
  # Data source configuration
  datasource:
    type: com.alibaba.druid.pool.DruidDataSource
    driverClassName: com.mysql.cj.jdbc.Driver
    druid:
      # Primary data source
      master:
        url: jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=utf8&useSSL=true&serverTimezone=GMT%2B8
        username: root
        password: root
      # Replica data source
      slave:
        # Replica data source switch (off by default)
        enabled: true
        url: jdbc:mysql://localhost:3306/test2?useUnicode=true&characterEncoding=utf8&useSSL=true&serverTimezone=GMT%2B8
        username: root
        password: root
      # Replica data source 2
      # ...omitted...
      # Hive data source
      slave3:
        # Data source switch (off by default)
        enabled: true
        driverClassName: org.apache.hive.jdbc.HiveDriver
        url: jdbc:hive2://master:10000/default
        username: hive
        password: hive
      # Initial number of connections
      initialSize: 5
      # Minimum number of connections in the pool
      minIdle: 10
      # Maximum number of connections in the pool
      maxActive: 20
      # Maximum time to wait when acquiring a connection, in milliseconds
      maxWait: 60000
      # Interval between checks for idle connections that should be closed, in milliseconds
      timeBetweenEvictionRunsMillis: 60000
      # Minimum time a connection stays in the pool, in milliseconds
      minEvictableIdleTimeMillis: 300000
      # Maximum time a connection stays in the pool, in milliseconds
      maxEvictableIdleTimeMillis: 900000

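The data sources configured here cover MySQL and Hive. By default the primary data source, master, is used, which points at MySQL; to use another one, switch data sources at the mapper layer.

The code is structured like any other application, with mapper, service, and controller layers. Two YARN resource queues are set up, one for real-time jobs and one for offline jobs. Because queries from other departments can put too much pressure on a queue, each query is limited to a date range of at most 60 days and is accepted only while cluster resource usage is below 55%. The focus here is on how the controller layer guards against oversized queries.

The entity class UserModel: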

@NoArgsConstructor
@AllArgsConstructor
@Data
@ToString
public class UserModel extends BaseEntity{

    private String userId;
    private Integer count;
}

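3) Cluster resource usage must stay below 55%

Many controllers for business queries need the same guard against oversized workloads, so an annotation bound to a Spring aspect is used to mark the controller methods.

Define the aspect YarnResourceAspect and bind it to the annotation @YarnResource: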

@Target({ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
public @interface YarnResource {

}

@Aspect
@Component
public class YarnResourceAspect {

    private static final Logger log = LoggerFactory.getLogger(YarnResourceAspect.class);

    /**
     * Pointcut definition
     */
    @Pointcut("@annotation(com.ruoyi.common.annotation.YarnResource)")
    public void yarnResourcdPointCut(){
    }

    /**
     * Check whether YARN resources are available
     */
    @Before("yarnResourcdPointCut()")
    public void before(){
        log.info("Checking whether YARN resources are available");
        // YARN resources are tight
        if(!YarnClient.yarnResourceOk()){
            throw new InvalidStatusException();
        }
    }

}

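Fetching YARN resource usage:

Queries can be submitted at any time, so whether a Hive query may run is decided from the state of YARN resources at the moment the user submits it, to avoid affecting production jobs.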

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

@Slf4j
public class YarnClient {

    /**
     * Upper limit on YARN resource usage, in percent
     */
    private static final int YARN_RESOURCE = 55;

    /**
     *
     * @return true: resources are fine; false: resources are tight or usage could not be determined
     */
    public static boolean yarnResourceOk() {
        try {
            URL url = new URL("http://master:8088/cluster/scheduler");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setUseCaches(false);
            // Connection timeout: 5 seconds
            conn.setConnectTimeout(5000);
            // Set the HTTP headers
            conn.setRequestProperty("Accept", "*/*");
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36");
            // Connect and send the HTTP request
            conn.connect();

            // Check that the HTTP response status is 200
            if (conn.getResponseCode() != 200) {
                throw new IOException("bad response: " + conn.getResponseCode());
            }
            // Log all response headers
            Map<String, List<String>> map = conn.getHeaderFields();
            for (String key : map.keySet()) {
                log.debug("{}: {}", key, map.get(key));
            }
            // Read the response body and decode the bytes as UTF-8
            byte[] datas = readInputStream(conn.getInputStream());
            String result = new String(datas, StandardCharsets.UTF_8);

            Document document = Jsoup.parse(result);

            Elements elements = document.getElementsByClass("qstats");

            String[] ratios = elements.text().split("used");

            return Double.valueOf(ratios[3].replace("%", "")) < YARN_RESOURCE;
        } catch (IOException | RuntimeException e) {
            // Any fetch or parse failure falls through to "resources tight", so the query is rejected
            log.error("Failed to fetch YARN resource usage", e);
        }

        return false;

    }

    private static byte[] readInputStream(InputStream inStream) throws IOException {
        // try-with-resources closes both streams even if reading fails
        try (InputStream in = inStream;
             ByteArrayOutputStream outStream = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[1024];
            int len;
            while ((len = in.read(buffer)) != -1) {
                outStream.write(buffer, 0, len);
            }
            return outStream.toByteArray();
        }
    }
}
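Scraping the scheduler page ties the check to the page's HTML layout. As a hedged alternative, not the author's method, the ResourceManager also exposes cluster metrics as JSON at /ws/v1/cluster/metrics, with allocatedMB and totalMB fields under clusterMetrics. The Python sketch below only shows the threshold logic on such a payload; the function names and the sample numbers are assumptions, and the figure is cluster-wide memory usage, which may differ from the per-queue value the Java code reads.

```python
import json

YARN_RESOURCE_LIMIT = 55  # percent, mirroring YARN_RESOURCE in the Java class


def memory_used_percent(metrics_json: str) -> float:
    """Compute allocated memory as a percentage of total cluster memory."""
    metrics = json.loads(metrics_json)["clusterMetrics"]
    total = metrics["totalMB"]
    if total <= 0:
        raise ValueError("cluster reports no memory")
    return 100.0 * metrics["allocatedMB"] / total


def yarn_resource_ok(metrics_json: str, limit: float = YARN_RESOURCE_LIMIT) -> bool:
    """Return True only when usage is known and below the limit."""
    try:
        return memory_used_percent(metrics_json) < limit
    except (KeyError, TypeError, ValueError):
        # Unknown or malformed metrics are treated as "resources tight".
        return False


# Made-up payload in the shape of the metrics response.
SAMPLE = '{"clusterMetrics": {"allocatedMB": 4096, "totalMB": 16384}}'

print(memory_used_percent(SAMPLE))  # 25.0
print(yarn_resource_ok(SAMPLE))     # True
```

Failing closed on malformed input matches the Java method, which returns false whenever usage cannot be determined.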

Mark the controller method with the @YarnResource annotation:

@Controller
@RequestMapping("/hero/hive")
public class HiveController {

    /**
     * Path prefix for the HTML files
     */
    private String prefix = "hero";

    @Autowired
    IUserService iUserService;

    @RequestMapping("")
    @RequiresPermissions("hero:hive:view")
    public String heroHive(){
        return prefix + "/hive";
    }

    @YarnResource
    @RequestMapping("/user")
    @RequiresPermissions("hero:hive:user")
    @ResponseBody
    public TableDataInfo user(UserModel userModel){
        DateCheckUtils.checkInputDate(userModel);

        PageInfo pageInfo = iUserService.queryUser(userModel);
        TableDataInfo tableDataInfo = new TableDataInfo();

        tableDataInfo.setTotal(pageInfo.getTotal());
        tableDataInfo.setRows(pageInfo.getList());

        return tableDataInfo;
    }
}

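4) Checking that the query date range does not exceed 60 days

With this in place, every request that reaches the controller is checked to make sure the queried date range does not exceed 60 days, which prevents loading too much data and starving other jobs of resources.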

public class DateCheckUtils {

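    /**
     * Validate the dates sent from the frontend to keep queries over large
     * amounts of data from overloading the cluster
     * @param o the request entity whose params carry beginTime and endTime
     */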
    public static void checkInputDate(BaseEntity o){
        if("".equals(o.getParams().get("beginTime")) && "".equals(o.getParams().get("endTime"))){
            throw new InvalidTaskException();
        }

        String beginTime = "2019-01-01";
        String endTime = DateUtils.getDate();

        if(!"".equals(o.getParams().get("beginTime"))){
            beginTime = String.valueOf(o.getParams().get("beginTime"));
        }

        if(!"".equals(o.getParams().get("endTime"))){
            endTime = String.valueOf(o.getParams().get("endTime"));
        }

        // The query spans more than two months
        if(DateUtils.getDayBetween(beginTime, endTime) > 60){
            throw new InvalidTaskException();
        }
    }
}
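DateUtils.getDayBetween is not shown in the post. The span calculation it presumably performs can be sketched in Python as follows, assuming dates arrive as ISO strings such as 2020-10-01. The function name check_date_range and the choice of ValueError are illustrative assumptions.

```python
from datetime import date

MAX_DAYS = 60  # same limit as the Java check


def check_date_range(begin: str, end: str, max_days: int = MAX_DAYS) -> int:
    """Return the span in days; raise ValueError if it is negative or too wide."""
    span = (date.fromisoformat(end) - date.fromisoformat(begin)).days
    if span < 0:
        raise ValueError("beginTime is after endTime")
    if span > max_days:
        raise ValueError(f"date range of {span} days exceeds {max_days} days")
    return span


print(check_date_range("2020-10-01", "2020-11-30"))  # 60
```

A span of exactly 60 days passes, matching the strict greater-than comparison in the Java code. Rejecting a reversed range is an extra guard that the Java version does not have.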

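Accessing Hive here requires switching data sources, because other pages still read their data from MySQL; keep that in mind.

5) Every query result is stored in MySQL

When someone has already run a query, its data is saved to MySQL before being returned to the page. When a second person from another department runs the same query later, the data is read from MySQL first, and Hive is queried only if nothing is found. The code below also lives in the controller; it is pulled out separately here.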

// Query MySQL first and fall back to Hive only on a miss; MySQL acts as a cache
PageInfo pageInfo = iToplocationService.queryToplocation(toplocationCountModel);
if(pageInfo.getList().size() > 0){
    log.info("Data exists, reading directly from MySQL...");
    tableDataInfo.setTotal(pageInfo.getTotal());
    tableDataInfo.setRows(pageInfo.getList());
}else if(iToplocationService.queryExistsToplocation(toplocationCountModel) == null){
    log.info("Querying data from Hive...");
    PageInfo pageInfo2 = iToplocationService.query(toplocationCountModel);

    // Save to MySQL in batches of at most 10000 rows
    log.info("Batch saving to MySQL...");
    List<ToplocationCountModel> toplocationCountModels = pageInfo2.getList();
    int i = 0;
    while (i < toplocationCountModels.size()){
        int end = Math.min(i + 10000, toplocationCountModels.size());
        iToplocationService.insertToplocation(toplocationCountModels.subList(i, end));

        i = i + 10000;
    }
}
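The batching loop above splits the Hive result into slices of at most 10000 rows before inserting them into MySQL. A minimal Python sketch of the same slicing follows; the generator name chunks is an illustrative assumption, and the integers stand in for result rows.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunks(items: Sequence[T], size: int = 10000) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items, in order."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


rows = list(range(25000))  # stand-in for rows fetched from Hive
print([len(batch) for batch in chunks(rows)])  # [10000, 10000, 5000]
```

Each slice would be passed to one insert call, which bounds the size of any single statement sent to MySQL.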

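The feature looks simple for now and uses nothing fancy; it will be improved gradually.

Summary

Integrating Spring Boot successfully makes it easy to bring Hive into our data platform, and also makes it convenient to build other Hive-based data products.
Today's step is a small one, but there are many interesting things to build on top of it.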
