Prerequisite: Hadoop and Hive are installed, configured, and running normally.
First, to query Hive from an external program you need to install Thrift and fb303. There may be other ways, but from my practical experience this is the simplest route. Hive itself provides a Thrift interface; after unpacking Hive, the definition file is at the following path:
hive/hive-0.7.1/src/service/if/hive_service.thrift
Then copy these four files:
metastore/if/hive_metastore.thrift
src/service/if/hive_service.thrift
ql/if/queryplan.thrift
fb303.thrift, from the fb303 directory in the Thrift source tree.
Edit hive_service.thrift and change the include paths for the other three files so they point to the correct locations. Save and exit, then run thrift:
#thrift -r --gen py hive_service.thrift

Put the generated Python code under /usr/lib/python/site-packages, then write the following script:
#!/usr/bin/python
#-*-coding:UTF-8 -*-
import sys
import os
import string
import re
import MySQLdb

from hive_service import ThriftHive
from hive_service.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

def hiveExe(hsql, dbname):
    # Run a query against the Hive server and return all result rows.
    try:
        transport = TSocket.TSocket("192.168.10.1", 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)

        client = ThriftHive.Client(protocol)
        transport.open()

        client.execute("ADD jar /opt/modules/hive/hive-0.7.1/lib/hive-contrib-0.7.1.jar")

        client.execute("use " + dbname)
        row = client.fetchOne()
        # Switching databases returns a single row, so one fetchOne() is enough.

        client.execute(hsql)
        result = client.fetchAll()
        # fetchAll() retrieves every row of the result.

        transport.close()
        return result

    except Thrift.TException, tx:
        print "%s" % (tx.message)

def mysqlExe(sql):
    try:
        conn = MySQLdb.connect(user="test", passwd="test123", host="127.0.0.1", db="active2_ip", port=5029)
    except Exception, data:
        print "Could not connect to MySQL server.:", data
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        rows = cursor.fetchall()
        conn.commit()
        cursor.close()
        conn.close()
        return rows
    except Exception, data:
        print "Could not Fetch anything:", data

dbname = "active2"
date = os.popen("date -d '1 day ago' +%Y%m%d").read().strip()
# Get yesterday's date via the shell; read the output and strip surrounding whitespace.

sql = "create table IF NOT EXISTS " + dbname + "_group_ip_" + date + " like " + dbname + "_group_ip;load data infile '/tmp/" + dbname + "_" + date + ".csv' into table " + dbname + "_group_ip_" + date + " FIELDS TERMINATED BY ','"
# Create the per-date table from the template table, then LOAD DATA the CSV into it.

hsql = "insert overwrite local directory '/tmp/" + dbname + "_" + date + "' select count(version) as vc,stat_hour,type,version,province,city,isp from " + dbname + "_" + date + " group by province,city,version,type,stat_hour,isp"
# Hive query; the results are exported to the local /tmp/active2_20111129 directory and may span multiple files.

hiveExe(hsql, dbname)
# Run the Hive query.

os.system("sudo cat /tmp/" + dbname + "_" + date + "/* > /tmp/tmplog ")
# Merge the multiple output files into a single file, tmplog, via the shell.

file1 = open("/tmp/tmplog", "r")
# Open the merged temporary file.
file2 = open("/tmp/" + dbname + "_" + date + ".csv", "w")
# Open a second file for the converted text. Hive's export uses a special
# character as its field separator, so we replace it with commas to produce CSV.
sep = ","
for line in file1:
    tmp = line[:-1].split("\x01")
    # Hive's export separator is ASCII 001; "\x01" is its hex escape (decimal 1).
    replace = sep.join(tmp)
    file2.write(replace + "\n")

file1.close()
file2.close()

os.system("sudo rm -f /tmp/tmplog")
# Remove the temporary tmplog.

mysqlExe(sql)
# Run the MySQL statements: create the table and load the data.
os.system("sudo rm -rf /tmp/" + dbname + "_" + date)

The real difficulty is querying Hive through Thrift. Another point: when Hive dumps data, writing insert overwrite directory dumps it into HDFS; you must write insert overwrite local directory for it to be dumped onto the local working machine.
The other thing to watch is the field separator in the exported file contents; everything else works just like an ordinary database.
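The separator conversion in the script above can be sketched in isolation. The sample row below is made up for illustration: Hive joins fields with ASCII 0x01, and the conversion simply splits on that byte and re-joins with commas:

```python
# Hypothetical example row as Hive writes it: fields joined by ASCII 0x01,
# terminated by a newline.
hive_line = "12\x012011112900\x01pc\x011.0.3\x01beijing\x01beijing\x01cnc\n"

# Strip the trailing newline, split on the 0x01 separator, re-join with commas.
fields = hive_line.rstrip("\n").split("\x01")
csv_line = ",".join(fields)

print(csv_line)  # 12,2011112900,pc,1.0.3,beijing,beijing,cnc
```

Note this does no CSV quoting; it is safe only because the exported fields cannot themselves contain commas or 0x01 bytes.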
PS: I'm a Python newcomer and haven't used it long, so I relied on a lot of shell commands to fetch current state values; someone more familiar with Python could rewrite those parts.
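For instance, the shell call used above to compute yesterday's date could be done in pure Python with the datetime module; a minimal sketch producing the same %Y%m%d format:

```python
import datetime

# Yesterday's date in the same YYYYMMDD format the "date -d '1 day ago'"
# shell command produces, with no subprocess needed.
yesterday = datetime.date.today() - datetime.timedelta(days=1)
date = yesterday.strftime("%Y%m%d")

print(date)  # e.g. 20111129
```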