Implementing Hadoop client functionality with pycurl

Posted by liran728729 on 2013-03-11


I am currently testing a Hadoop feature and need to interact with Hadoop frequently. At first I used Python's subprocess module to call the command-line tools that Hadoop provides.

I. The command-line format provided by Hadoop

The general form is hadoop fs [cmd]. The available commands are:

hadoop fs [-fs <local | file system URI>] [-conf <configuration file>]

[-D <property=value>] [-ls <path>] [-lsr <path>] [-du <path>]

[-dus <path>] [-mv <src> <dst>] [-cp <src> <dst>] [-rm [-skipTrash] <src>]

[-rmr [-skipTrash] <src>] [-put <localsrc> ... <dst>] [-copyFromLocal <localsrc> ... <dst>]

[-moveFromLocal <localsrc> ... <dst>] [-get [-ignoreCrc] [-crc] <src> <localdst>]

[-getmerge <src> <localdst> [addnl]] [-cat <src>]

[-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>] [-moveToLocal <src> <localdst>]

[-mkdir <path>] [-report] [-setrep [-R] [-w] <rep> <path/file>]

[-touchz <path>] [-test -[ezd] <path>] [-stat [format] <path>]

[-tail [-f] <path>] [-text <path>]

[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]

[-chown [-R] [OWNER][:[GROUP]] PATH...]

[-chgrp [-R] GROUP PATH...]

[-count[-q] <path>]

[-help [cmd]]

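As the listing shows, the command-line tool is quite powerful and covers all kinds of operations on files and directories.

For example:

To list the files under the Hadoop root directory, the command is: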

#hadoop fs -ls hdfs://192.168.0.112:50081/

drwx---r-x - test test 0 2013-03-08 11:20 /static

drwx---r-x - test test 0 2013-02-19 15:40 /system

drwxrwxrwx - test test 0 2013-01-22 18:42 /video

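I won't go through the other commands one by one; the help documentation should make them clear.

This approach has a problem: every command starts a new JVM, which puts a heavy load on the machine running the commands. When many commands are issued, top shows the java process reaching 99%, which seriously affects normal use. That is what led to the approach described in the next section.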
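For reference, the subprocess-based approach can be sketched roughly as follows. This is a minimal sketch written for Python 3, not the original script; the helper names `build_fs_command` and `run_fs_command` are hypothetical. Each call to `run_fs_command` launches a separate `hadoop` process, and therefore a new JVM, which is where the overhead comes from.

```python
import subprocess


def build_fs_command(cmd, *args, hadoop_bin="hadoop"):
    """Build the argv list for `hadoop fs -<cmd> <args...>`."""
    return [hadoop_bin, "fs", "-" + cmd] + list(args)


def run_fs_command(cmd, *args):
    """Run one `hadoop fs` command and return (exit_code, output_text).

    Every call launches a new `hadoop` process (and therefore a new JVM).
    """
    proc = subprocess.run(
        build_fs_command(cmd, *args),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


# Only the command line is built here; nothing is executed.
print(build_fs_command("ls", "hdfs://192.168.0.112:50081/"))
```

Calling `run_fs_command("ls", "hdfs://192.168.0.112:50081/")` in a loop would reproduce the load pattern described above.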

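II. The web interface provided by Hadoop

Looking through the official client APIs online, I found that Hadoop provides a web REST API (WebHDFS), so the same operations can easily be done with curl. The official documentation is at: http://hadoop.apache.org/docs/stable/webhdfs.html

It explains the usage thoroughly.

With curl you can perform basic operations on the files and directories in Hadoop.

The official site currently documents the following operations: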

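1. Create and write a file

2. Append to a file

3. Open and read a file

4. Create a directory

5. Rename a file or directory

6. Delete a file or directory

7. Get the status of a file or directory

8. List a directory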
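Each of these operations corresponds to an HTTP method plus an `op` query parameter. The sketch below (Python 3) maps the eight operations to their WebHDFS `op` values and builds the request URL; the short action names and the `webhdfs_request` helper are hypothetical and exist only for illustration.

```python
from urllib.parse import quote, urlencode

# HTTP method and `op` value for each of the eight operations listed above.
WEBHDFS_OPS = {
    "create": ("PUT", "CREATE"),
    "append": ("POST", "APPEND"),
    "open": ("GET", "OPEN"),
    "mkdir": ("PUT", "MKDIRS"),
    "rename": ("PUT", "RENAME"),
    "delete": ("DELETE", "DELETE"),
    "status": ("GET", "GETFILESTATUS"),
    "list": ("GET", "LISTSTATUS"),
}


def webhdfs_request(host, port, path, action, **params):
    """Return (http_method, url) for one WebHDFS operation."""
    method, op = WEBHDFS_OPS[action]
    # Extra parameters are percent-encoded and appended after `op`.
    query = urlencode([("op", op)] + sorted(params.items()))
    url = "http://%s:%s/webhdfs/v1%s?%s" % (host, port, quote(path), query)
    return method, url


print(webhdfs_request("192.168.0.112", 50071, "/test", "rename", destination="/test1"))
```

Building the query string with `urlencode` also avoids having to escape `&` by hand.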

Here are some concrete examples:

a. Get the status of a directory

#curl -i http://192.168.0.112:50071/webhdfs/v1/?op=GETFILESTATUS

HTTP/1.1 200 OK

Content-Type: application/json

Transfer-Encoding: chunked

Server: Jetty(6.1.26)

{"FileStatus":{"accessTime":0,"blockSize":0,"group":"TEST","length":0,"modificationTime":1362812718704,"owner":"TEST","pathSuffix":"","permission":"705","replication":0,"type":"DIRECTORY"}}
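The response body is plain JSON, so it can be parsed directly. The following Python 3 sketch parses the body shown above and renders the octal `permission` field in the `rwx` form that `hadoop fs -ls` prints; `permission_to_string` is a hypothetical helper, not part of any Hadoop API.

```python
import json

# The response body from the example above (without the HTTP headers).
body = ('{"FileStatus":{"accessTime":0,"blockSize":0,"group":"TEST","length":0,'
        '"modificationTime":1362812718704,"owner":"TEST","pathSuffix":"",'
        '"permission":"705","replication":0,"type":"DIRECTORY"}}')


def permission_to_string(octal_text):
    """Convert an octal permission such as "705" to "rwx---r-x"."""
    out = []
    for digit in octal_text[-3:]:
        bits = int(digit, 8)
        out.append("r" if bits & 4 else "-")
        out.append("w" if bits & 2 else "-")
        out.append("x" if bits & 1 else "-")
    return "".join(out)


status = json.loads(body)["FileStatus"]
print(status["type"], status["owner"], permission_to_string(status["permission"]))
```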

b. Rename a directory

#curl -i -X PUT "http://192.168.0.112:50071/webhdfs/v1/test?op=RENAME&destination=/test1"

HTTP/1.1 200 OK

Content-Type: application/json

Transfer-Encoding: chunked

{"boolean":true}
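Operations such as RENAME answer with this small JSON object. A caller can parse it rather than search the text for the word true; the Python 3 sketch below shows one way to do that. The function name `boolean_result` is hypothetical.

```python
import json


def boolean_result(body):
    """Return True only if body is a JSON object of the form {"boolean": true}."""
    try:
        return json.loads(body).get("boolean") is True
    except (ValueError, TypeError, AttributeError):
        # Not JSON, not a string, or not a JSON object: treat as failure.
        return False


print(boolean_result('{"boolean":true}'))
```

Anything that is not exactly a JSON `true` under the `boolean` key, including an error page, is treated as a failure.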

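I won't cover the other operations one by one; see the official documentation for the details.

III. What the curl approach led to

My program runs in Python, so calling the curl command line would still mean shelling out to an external command. Python has plenty of modules, though, so using Python's curl library (pycurl) should make it easy to operate on Hadoop files and directories directly from Python.

After some research I wrote a basic WebHadoop class. The basic functionality is roughly done; the rest can be added later.

The code is as follows: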

#!/usr/bin/env python
# -*- encoding:utf-8 -*-
"""A library to access the Hadoop HTTP REST API (WebHDFS).

Make sure your Hadoop cluster has HTTP access enabled.
"""
'''
author : liran
date   : 2013-03-11

Acknowledgements: xwu
     武漢雲雅科技有限公司
'''
import StringIO
import json
import pycurl
import re
import sys
import logging
import os

class WebHadoop(object):
    def __init__(self,host,port,username,logger,prefix="/webhdfs/v1"):
        self.host = host
        self.port = port
        self.user = username
        self.logger = logger
        self.prefix = prefix
        self.status = None
        self.url = "http://%s:%s" % (host,port)
        self.url_path = self.url + self.prefix

    def checklink(self):
        """Return the number of live datanodes (as text); exit if there are none."""
        b = StringIO.StringIO()
        c = pycurl.Curl()
        checkurl = self.url + "/dfsnodelist.jsp?whatNodes=LIVE"
        try:
            c.setopt(pycurl.URL, checkurl)
            c.setopt(pycurl.HTTPHEADER, ["Accept:"])
            c.setopt(pycurl.WRITEFUNCTION, b.write)
            c.setopt(pycurl.FOLLOWLOCATION, 1)
            c.setopt(pycurl.MAXREDIRS, 5)
            c.perform()
            self.status = c.getinfo(c.HTTP_CODE)
            body = b.getvalue()
            self.Write_Debug_Log(self.status,checkurl)
            # The namenode page contains "Live Datanodes : N".
            p = re.compile(r'''Live Datanodes :(.*)</a''')
            results = p.findall(body)
            if not results or results[0].strip() == "0":
                self.logger.error("Sorry, there are no live datanodes in the Hadoop cluster!")
                sys.exit(255)
            return results[0]
        except pycurl.error as e:
            self.logger.error("Sorry, cannot reach the Hadoop HTTP interface. Error: %s" % e)
            sys.exit(255)
        finally:
            # Runs on every path, including sys.exit().
            c.close()
            b.close()

    def lsdir(self,path):
        """Return the list of FileStatus entries of a directory, or False."""
        b = StringIO.StringIO()
        c = pycurl.Curl()
        lsdir_url = self.url_path + path + "?op=LISTSTATUS"
        body = ""
        self.status = None
        try:
            c.setopt(pycurl.URL, lsdir_url)
            c.setopt(pycurl.HTTPHEADER, ["Accept:"])
            c.setopt(pycurl.WRITEFUNCTION, b.write)
            c.setopt(pycurl.FOLLOWLOCATION, 1)
            c.setopt(pycurl.MAXREDIRS, 5)
            c.perform()
            body = b.getvalue()
            self.status = c.getinfo(c.HTTP_CODE)
        except Exception as e:
            self.logger.error("LISTSTATUS request failed: %s" % e)
        finally:
            c.close()
            b.close()

        if self.status == 200:
            # Parse the JSON response instead of eval()-ing remote data.
            data_dir = json.loads(body)
            return data_dir['FileStatuses']['FileStatus']
        else:
            self.logger.error("Sorry, cannot list the dir or file status!")
            self.Write_Debug_Log(self.status,lsdir_url)
            return False

    def lsfile(self,path):
        """Return the FileStatus of a file, or False (also for directories)."""
        c = pycurl.Curl()
        b = StringIO.StringIO()
        lsfile_url = self.url_path + path + "?op=GETFILESTATUS"
        body = ""
        self.status = None
        try:
            c.setopt(pycurl.URL, lsfile_url)
            c.setopt(pycurl.HTTPHEADER, ["Accept:"])
            c.setopt(pycurl.WRITEFUNCTION, b.write)
            c.setopt(pycurl.FOLLOWLOCATION, 1)
            c.setopt(pycurl.MAXREDIRS, 5)
            c.perform()
            body = b.getvalue()
            self.status = c.getinfo(c.HTTP_CODE)
        except Exception as e:
            self.logger.error("GETFILESTATUS request failed: %s" % e)
        finally:
            c.close()
            b.close()

        if self.status == 200:
            data_dir = json.loads(body)
            if data_dir['FileStatus']['type'] == "DIRECTORY":
                self.logger.error("Sorry, %s is actually a directory!" % (path))
                return False
            else:
                return data_dir['FileStatus']
        else:
            self.logger.error("Sorry, cannot list the dir or file status!")
            self.Write_Debug_Log(self.status,lsfile_url)
            return False

    def mkdir(self,path,permission="755"):
        """Create a directory; return True on success, otherwise False."""
        b = StringIO.StringIO()
        c = pycurl.Curl()
        # WebHDFS takes its parameters from the query string; the small body
        # below is only a placeholder sent along with the PUT request.
        mkdir_str = '[{"op":"MKDIRS","permission":"%s"}]' % permission
        mkdir_url = "%s%s?op=MKDIRS&permission=%s" % (self.url_path,path,permission)
        body = ""
        self.status = None
        try:
            c.setopt(pycurl.URL, mkdir_url)
            c.setopt(pycurl.HTTPHEADER,['Content-Type: application/json','Content-Length: '+str(len(mkdir_str))])
            c.setopt(pycurl.CUSTOMREQUEST,"PUT")
            c.setopt(pycurl.POSTFIELDS,mkdir_str)
            c.setopt(pycurl.WRITEFUNCTION, b.write)
            c.setopt(pycurl.FOLLOWLOCATION, 1)
            c.setopt(pycurl.MAXREDIRS, 5)
            c.perform()
            self.status = c.getinfo(c.HTTP_CODE)
            body = b.getvalue()
        except Exception as e:
            self.logger.error("MKDIRS request failed: %s" % e)
        finally:
            c.close()
            b.close()

        if self.status == 200:
            if "true" in body:
                self.logger.info("Successfully created dir %s in the Hadoop cluster." % (path))
                return True
            else:
                self.logger.info("Sorry, cannot create dir %s in the Hadoop cluster." % (path))
                return False
        else:
            self.logger.error("Sorry, cannot create dir %s in the Hadoop cluster." % (path))
            self.Write_Debug_Log(self.status,mkdir_url)
            return False
                     
 
    def remove(self,path,recursive="true"):
        """Delete a file or directory; return True on success, otherwise False."""
        c = pycurl.Curl()
        b = StringIO.StringIO()
        # Placeholder body; the real parameters are in the query string.
        remove_str = '[{"op":"DELETE","recursive":"%s"}]' % recursive
        remove_url = "%s%s?op=DELETE&recursive=%s" % (self.url_path,path,recursive)
        body = ""
        self.status = None
        try:
            c.setopt(pycurl.URL, remove_url)
            c.setopt(pycurl.HTTPHEADER,['Content-Type: application/json','Content-Length: '+str(len(remove_str))])
            c.setopt(pycurl.CUSTOMREQUEST,"DELETE")
            c.setopt(pycurl.POSTFIELDS,remove_str)
            c.setopt(pycurl.WRITEFUNCTION, b.write)
            c.setopt(pycurl.FOLLOWLOCATION, 1)
            c.setopt(pycurl.MAXREDIRS, 5)
            c.perform()
            body = b.getvalue()
            self.status = c.getinfo(c.HTTP_CODE)
        except Exception as e:
            self.logger.error("DELETE request failed: %s" % e)
        finally:
            c.close()
            b.close()

        if self.status == 200:
            if "true" in body:
                self.logger.info("Successfully deleted dir or file %s in the Hadoop cluster." % (path))
                return True
            else:
                self.logger.info("Sorry, cannot delete %s; maybe it does not exist." % (path))
                return False
        else:
            self.logger.error("Sorry, cannot delete %s in the Hadoop cluster." % (path))
            self.Write_Debug_Log(self.status,remove_url)
            return False
             
    def rename(self,src,dst):
        """Rename a file or directory; return True on success, otherwise False."""
        c = pycurl.Curl()
        b = StringIO.StringIO()
        # Placeholder body; the real parameters are in the query string.
        rename_str = '[{"op":"RENAME"}]'
        rename_url = "%s%s?op=RENAME&destination=%s" % (self.url_path,src,dst)
        body = ""
        self.status = None
        try:
            c.setopt(pycurl.URL, rename_url)
            c.setopt(pycurl.HTTPHEADER,['Content-Type: application/json','Content-Length: '+str(len(rename_str))])
            c.setopt(pycurl.CUSTOMREQUEST,"PUT")
            c.setopt(pycurl.POSTFIELDS,rename_str)
            c.setopt(pycurl.WRITEFUNCTION, b.write)
            c.setopt(pycurl.FOLLOWLOCATION, 1)
            c.setopt(pycurl.MAXREDIRS, 5)
            c.perform()
            body = b.getvalue()
            self.status = c.getinfo(c.HTTP_CODE)
        except Exception as e:
            self.logger.error("RENAME request failed: %s" % e)
        finally:
            c.close()
            b.close()

        if self.status == 200:
            if "true" in body:
                self.logger.info("Successfully renamed %s to %s in the Hadoop cluster." % (src,dst))
                return True
            else:
                self.logger.info("Sorry, cannot rename %s; maybe it does not exist." % (src))
                return False
        else:
            self.logger.error("Sorry, cannot rename %s in the Hadoop cluster." % (src))
            self.Write_Debug_Log(self.status,rename_url)
            return False
 
    def put_file(self,local_path,hdfs_path,overwrite="true",permission="755",buffersize="128"):
        """Upload a local file to HDFS; return True on HTTP 201, otherwise False."""
        if not os.path.isfile(local_path):
            self.logger.error("Sorry, %s does not exist or is not a regular file." % local_path)
            return False

        c = pycurl.Curl()
        b = StringIO.StringIO()
        header_str = StringIO.StringIO()
        # Placeholder body; the real parameters are in the query string.
        put_str = '[{"op":"CREATE","overwrite":"%s","permission":"%s","buffersize":"%s"}]' % (overwrite,permission,buffersize)
        put_url = "%s%s?op=CREATE&overwrite=%s&permission=%s&buffersize=%s" % (self.url_path,hdfs_path,overwrite,permission,buffersize)
        self.status = None
        f = None
        try:
            # First request: goes to the namenode, which redirects to a datanode.
            # Redirects are followed, so EFFECTIVE_URL ends up being the
            # datanode URL that the file content is sent to afterwards.
            c.setopt(pycurl.URL, put_url)
            c.setopt(pycurl.HTTPHEADER,['Content-Type: application/json','Content-Length: '+str(len(put_str))])
            c.setopt(pycurl.CUSTOMREQUEST,"PUT")
            c.setopt(pycurl.HEADER,1)
            c.setopt(pycurl.HEADERFUNCTION,header_str.write)
            c.setopt(pycurl.POSTFIELDS,put_str)
            c.setopt(pycurl.WRITEFUNCTION, b.write)
            c.setopt(pycurl.FOLLOWLOCATION, 1)
            c.setopt(pycurl.MAXREDIRS, 5)
            c.perform()
            redirect_url = c.getinfo(pycurl.EFFECTIVE_URL)

            # Second request: upload the file content to the datanode URL.
            f = open(local_path,'rb')
            filesize = os.path.getsize(local_path)
            c.setopt(pycurl.URL, redirect_url)
            # Replace the headers of the first request (its Content-Length
            # described the placeholder body, not the file).
            c.setopt(pycurl.HTTPHEADER,['Content-Type: application/octet-stream'])
            c.setopt(pycurl.CUSTOMREQUEST,"PUT")
            c.setopt(pycurl.UPLOAD,1)
            c.setopt(pycurl.READFUNCTION,f.read)
            c.setopt(pycurl.INFILESIZE,filesize)
            c.perform()
            self.status = c.getinfo(c.HTTP_CODE)
        except Exception as e:
            self.logger.error("CREATE request failed: %s" % e)
        finally:
            c.close()
            b.close()
            header_str.close()
            if f is not None:
                f.close()

        if self.status != 201:
            self.Write_Debug_Log(self.status,put_url)
            return False
        else:
            self.logger.info("Successfully put file into hdfs %s" % hdfs_path)
            return True
 
    def append(self,local_path,hdfs_path,buffersize=None): 
        pass         
 
     
     
    def get_file(self, local_path, hdfs_path,buffersize="128"):
        """Download an HDFS file to local_path; return True on HTTP 200."""
        c = pycurl.Curl()
        # Opening in 'wb' mode creates the local file if it does not exist.
        f = open(local_path,'wb')
        open_url = "%s%s?op=OPEN&buffersize=%s" % (self.url_path,hdfs_path,buffersize)
        self.status = None
        try:
            # A plain GET is enough for OPEN; the parameters are in the query string.
            c.setopt(pycurl.URL, open_url)
            c.setopt(pycurl.WRITEFUNCTION,f.write)
            c.setopt(pycurl.FOLLOWLOCATION, 1)
            c.setopt(pycurl.MAXREDIRS, 5)
            c.setopt(pycurl.CONNECTTIMEOUT,60)
            c.setopt(pycurl.TIMEOUT,300)
            c.perform()
            self.status = c.getinfo(pycurl.HTTP_CODE)
        except Exception as e:
            self.logger.error("OPEN request failed: %s" % e)
        finally:
            c.close()
            f.close()

        if self.status != 200:
            self.Write_Debug_Log(self.status,open_url)
            return False
        else:
            self.logger.info("Successfully fetched %s from hdfs into %s" % (hdfs_path,local_path))
            return True
 
         
         
    def cat_file(self, hdfs_path,buffersize="128"):
        """Print the content of an HDFS file; return True on HTTP 200."""
        c = pycurl.Curl()
        b = StringIO.StringIO()
        open_url = "%s%s?op=OPEN&buffersize=%s" % (self.url_path,hdfs_path,buffersize)
        self.status = None
        try:
            # A plain GET is enough for OPEN; the parameters are in the query string.
            c.setopt(pycurl.URL, open_url)
            c.setopt(pycurl.WRITEFUNCTION,b.write)
            c.setopt(pycurl.FOLLOWLOCATION, 1)
            c.setopt(pycurl.MAXREDIRS, 5)
            c.perform()
            self.status = c.getinfo(pycurl.HTTP_CODE)
            print(b.getvalue())
        except Exception as e:
            self.logger.error("OPEN request failed: %s" % e)
        finally:
            c.close()
            b.close()

        if self.status != 200:
            self.Write_Debug_Log(self.status,open_url)
            return False
        else:
            self.logger.info("Successfully read file %s from hdfs" % hdfs_path)
            return True
         
    def copy_in_hdfs(self,src,dst,overwrite="true",permission="755",buffersize="128"):
        """Copy within HDFS by downloading to a temporary file and re-uploading it."""
        tmpfile = "/tmp/copy_inhdfs_tmpfile"
        try:
            if not self.get_file(tmpfile,src):
                return False
            return bool(self.put_file(tmpfile,dst,overwrite=overwrite,permission=permission,buffersize=buffersize))
        finally:
            # Always clean up the temporary file, whatever the outcome.
            if os.path.exists(tmpfile):
                os.remove(tmpfile)
         
                  
    def Write_Debug_Log(self,status,url):
        """Log the URL and HTTP status when a request did not succeed."""
        if status not in (200, 201):
            self.logger.error('Url : "%s" ,Exit code : %s' % (url,status))
            self.logger.error("An error was returned, but execution continues.")

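Compared with the command-line tools that ship with Hadoop (Java), the curl approach still has some shortcomings:

1. It does not support copying files within Hadoop.

2. It does not support uploading or downloading directories.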

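3. In my tests, uploading through the shell reports an error if the file already exists, whereas uploading through curl only succeeds when overwrite=true is set. I am not sure why. One likely cause is that put_file sends its first PUT with redirects followed, so that request already reaches the datanode and creates the file; the second PUT, which carries the real data, then needs overwrite=true.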
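On the third point: the WebHDFS documentation describes CREATE as a two-step operation. The first PUT goes to the namenode without file data and is answered with a 307 redirect; the second PUT sends the file content to the URL in the Location header. The Python 3 sketch below shows that flow with an injected transport so that it runs without a cluster. The `send` callable, `fake_send`, and the `namenode.example` and `datanode.example` URLs are all hypothetical stand-ins.

```python
def two_step_create(send, namenode_url, data):
    """Upload `data` using the two-step CREATE flow.

    `send(method, url, body)` is a hypothetical transport callable that
    performs ONE HTTP request without following redirects and returns
    (status_code, headers_dict).
    """
    # Step 1: ask the namenode where to write; no file data is sent.
    status, headers = send("PUT", namenode_url, b"")
    if status != 307 or "Location" not in headers:
        raise RuntimeError("expected a 307 redirect, got %s" % status)

    # Step 2: send the file content to the datanode named in Location.
    status, _ = send("PUT", headers["Location"], data)
    return status == 201


def fake_send(method, url, body):
    """Stand-in transport used to exercise the flow without a cluster."""
    fake_send.calls.append((method, url, body))
    if "datanode" not in url:
        return 307, {"Location": "http://datanode.example:50075/webhdfs/v1/f?op=CREATE"}
    return 201, {}


fake_send.calls = []
print(two_step_create(fake_send, "http://namenode.example:50070/webhdfs/v1/f?op=CREATE", b"hello"))
```

With pycurl, step 1 would correspond to a request with FOLLOWLOCATION set to 0 and the Location header read from the response headers. Because no file is created in step 1, overwrite=true should then only be needed when the target really exists.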

The one advantage is that execution is much faster.

Take the same directory-listing command:

#time hadoop fs -ls hdfs://192.168.0.112:50081/

real 0m10.916s

user 0m4.082s

sys 0m6.799s

#time curl -i http://192.168.0.112:50071/webhdfs/v1/?op=LISTSTATUS

real 0m0.005s

user 0m0.002s

sys 0m0.000s

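When the same request is made from Python through pycurl, the execution time should be around 0.01s.

That is much faster. The class is still being improved.

More work to come!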
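The 0.01s figure is an estimate. A small timing wrapper like the Python 3 sketch below could be used to measure a call such as `lsdir` directly; `timed` is a hypothetical helper, and `sum` is only a stand-in workload so that the example runs without a cluster.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in practice this would be a call such as lsdir("/").
result, elapsed = timed(sum, range(1000))
print("result=%s elapsed=%.6fs" % (result, elapsed))
```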
