Prerequisite: Hadoop and Hive are installed, configured, and running normally.
First, to query Hive from an external program you need to install Thrift and fb303. There may be other ways, but from my practical experience this is the simplest route. Hive itself provides a Thrift interface; after unpacking Hive, the definition file is at the following path:
hive/hive-0.7.1/src/service/if/hive_service.thrift
Then copy these four files:
metastore/if/hive_metastore.thrift
src/service/if/hive_service.thrift
ql/if/queryplan.thrift
fb303.thrift, from the fb303 directory in the Thrift source tree.
Edit hive_service.thrift and change the include paths for the other three files so they point to the correct locations. Save and exit, then run thrift:
#thrift -r --gen py hive_service.thrift

Put the generated Python code under /usr/lib/python/site-packages, then write the following script:
#!/usr/bin/python
#-*-coding:UTF-8 -*-
import sys
import os
import string
import re
import MySQLdb

from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

def hiveExe(hsql, dbname):
    # Run a query against the Hive server and return all result rows.
    try:
        transport = TSocket.TSocket("192.168.10.1", 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)

        client = ThriftHive.Client(protocol)
        transport.open()

        client.execute("ADD jar /opt/modules/hive/hive-0.7.1/lib/hive-contrib-0.7.1.jar")

        client.execute("use " + dbname)
        row = client.fetchOne()
        # Switching databases returns a single row, so one fetchOne() is enough.

        client.execute(hsql)
        result = client.fetchAll()
        # fetchAll() retrieves every row of the result.

        transport.close()
        return result

    except Thrift.TException, tx:
        print "%s" % (tx.message)

def mysqlExe(sql):
    try:
        conn = MySQLdb.connect(user="test", passwd="test123", host="127.0.0.1", db="active2_ip", port=5029)
    except Exception, data:
        print "Could not connect to MySQL server.:", data
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        rows = cursor.fetchall()
        conn.commit()
        cursor.close()
        conn.close()
        return rows
    except Exception, data:
        print "Could not Fetch anything:", data

dbname = "active2"
date = os.popen("date -d '1 day ago' +%Y%m%d").read().strip()
# Get yesterday's date via the shell; read the output and strip surrounding whitespace.

sql = "create table IF NOT EXISTS " + dbname + "_group_ip_" + date + " like " + dbname + "_group_ip;load data infile '/tmp/" + dbname + "_" + date + ".csv' into table " + dbname + "_group_ip_" + date + " FIELDS TERMINATED BY ','"
# Create the per-date table from the template table, then LOAD DATA the CSV into it.

hsql = "insert overwrite local directory '/tmp/" + dbname + "_" + date + "' select count(version) as vc,stat_hour,type,version,province,city,isp from " + dbname + "_" + date + " group by province,city,version,type,stat_hour,isp"
# Hive query; the results are exported to the local /tmp/active2_20111129 directory and may span multiple files.

hiveExe(hsql, dbname)
# Run the Hive query.

os.system("sudo cat /tmp/" + dbname + "_" + date + "/* > /tmp/tmplog ")
# Merge the multiple output files into a single file, tmplog, via the shell.

file1 = open("/tmp/tmplog", "r")
# Open the merged temporary file.
file2 = open("/tmp/" + dbname + "_" + date + ".csv", "w")
# Open a second file for the converted text. Hive's export uses a special
# character as its field separator, so we replace it with commas to produce CSV.
sep = ","
for line in file1:
    tmp = line[:-1].split("\x01")
    # Hive's export separator is ASCII 001; "\x01" is its hex escape (decimal 1).
    replace = sep.join(tmp)
    file2.write(replace + "\n")

file1.close()
file2.close()

os.system("sudo rm -f /tmp/tmplog")
# Remove the temporary tmplog.

mysqlExe(sql)
# Run the MySQL statements: create the table and load the data.
os.system("sudo rm -rf /tmp/" + dbname + "_" + date)

The real difficulty is querying Hive through Thrift. Another point: when Hive dumps data, writing insert overwrite directory dumps it into HDFS; you must write insert overwrite local directory for it to be dumped onto the local working machine.
The other thing to watch is the field separator in the exported file contents; everything else works just like an ordinary database.
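The separator conversion in the script above can be sketched in isolation. The sample row below is made up for illustration: Hive joins fields with ASCII 0x01, and the conversion simply splits on that byte and re-joins with commas:

```python
# Hypothetical example row as Hive writes it: fields joined by ASCII 0x01,
# terminated by a newline.
hive_line = "12\x012011112900\x01pc\x011.0.3\x01beijing\x01beijing\x01cnc\n"

# Strip the trailing newline, split on the 0x01 separator, re-join with commas.
fields = hive_line.rstrip("\n").split("\x01")
csv_line = ",".join(fields)

print(csv_line)  # 12,2011112900,pc,1.0.3,beijing,beijing,cnc
```

Note this does no CSV quoting; it is safe only because the exported fields cannot themselves contain commas or 0x01 bytes.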
PS: I'm a Python newcomer and haven't used it long, so I relied on a lot of shell commands to fetch current state values; someone more familiar with Python could rewrite those parts.
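For instance, the shell call used above to compute yesterday's date could be done in pure Python with the datetime module; a minimal sketch producing the same %Y%m%d format:

```python
import datetime

# Yesterday's date in the same YYYYMMDD format the "date -d '1 day ago'"
# shell command produces, with no subprocess needed.
yesterday = datetime.date.today() - datetime.timedelta(days=1)
date = yesterday.strftime("%Y%m%d")

print(date)  # e.g. 20111129
```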