A simple Python crawler

Posted by lipeng08 on 2016-05-13

#Introduction
Every time a paper is rejected and has to be resubmitted, I need to look up recent conferences related to it. Whenever that happens, my usual approach is to walk through the Huiban site www.myhuiban.com and check conferences one by one, paying attention to three things: the submission deadline, the CCF rank, and the conference topics.

##TODO: add a screenshot of the site to make the motivation clearer
Thinking about it, maybe I could write a simple Python crawler that downloads the whole conference list and builds a local index of all conferences, so that searches can be customized.

From yesterday afternoon to noon today I wrote roughly 400 lines of Python and basically covered the requirements. The main problems encountered were simulating the site login, regular-expression matching, JSON encoding and decoding, file reading and writing, and command-line argument handling.

#Overall structure
Cookie module: simulates the user login and saves the cookie so it can be reused on the next run.
Crawler module: crawls all conferences on the first page of the list and parses the URL of the next page, then crawls the remaining pages one by one; for every conference it fetches the detail page from its URL and stores the result in an in-memory dictionary and in a file.
JSON module: converts the conference dictionary to a string and saves it to a file, and loads the dictionary back from the file to rebuild the conference list.
Search module: builds the command-line options and walks over the conferences doing pattern matching.
#Module details
##Cookie module
The site being crawled only shows the conference list to logged-in users, so the crawler has to simulate the login itself and save the cookie; later visits can then reuse the cookie directly instead of logging in again.

Code snippet that simulates the login and saves the cookie:

    def requestCookie(self):
        # Create a MozillaCookieJar instance to hold the cookie; it is written to a file below
        self.cookie = cookielib.MozillaCookieJar(self.filename)
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookie))
        postdata = urllib.urlencode({
                    'LoginForm[email]':'xxxxxxx',
                    'LoginForm[password]':'xxxxxx',
                    'LoginForm[rememberMe]':'1',
                    'yt0':'登入'
                })
        loginUrl = 'http://www.myhuiban.com/login'
        # Simulate the login; the resulting cookie is captured in self.cookie
        result = opener.open(loginUrl,postdata)
        # Save the cookie to cookie.txt
        self.cookie.save(ignore_discard=True, ignore_expires=True)

The code above first creates the MozillaCookieJar object, then builds an opener and uses it to open the login URL, sending the user name and password to the server in a POST request.
A practical tip:
You may want to crawl some other site that also requires a login, and the data it expects in the POST body will probably be different. How do you handle that? First, use the browser's built-in developer tools to see exactly what is POSTed when you log in. Second, build postdata by hand and issue the login request with the Python code above. Third, compare the cookie values the browser received with the ones obtained by the simulated login.
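
For the third step, here is a minimal sketch that assumes the CookieMgr class shown later in this post; the helper name compareCookies is made up for illustration. It walks the cookie jar produced by the simulated login and compares each value against the cookies copied out of the browser's developer tools (for example the dictionary returned by directCookie below).

    # Hypothetical helper (not part of the original code): compare the cookies
    # obtained by the simulated login with values copied from the browser.
    def compareCookies(mgr, browser_cookies):
        jar = mgr.getCookie()   # loads cookie.txt, or logs in if that fails
        for item in jar:
            expected = browser_cookies.get(item.name)
            print 'Cookie', item.name, 'matches browser:', item.value == expected

    #mgr = CookieMgr()
    #compareCookies(mgr, mgr.directCookie())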

Code that loads the cookie directly from a file:

        self.cookie = cookielib.MozillaCookieJar()
        self.cookie.load(self.filename, ignore_discard=True, ignore_expires=True)
        for item in self.cookie:  
            print 'LoadCookie: Name = '+item.name  
            print 'LoadCookie:Value = '+item.value

Printing the cookies makes it easy to check that they match what was saved.

Combining cookie loading from file and from the network
The code below first tries to load the cookie from the file; if that fails, it requests one over the network and then saves it to the file. This effectively adds a small cache, so the cookie does not have to be fetched from the network on every run.

def getCookie(self):
    if(self.cookie == None):    
        self.loadCookie()
    if(self.cookie == None):
        self.requestCookie()
    return self.cookie

##The Conference object
Every conference is stored in a Conference object, so that object has to be defined first. Based on the attributes shown on the web page, it looks like this:

class Conference:
    # Note: the json decoder rebuilds instances by calling this constructor with keyword
    # arguments, so the parameter names must match the attribute names exactly.
    # The default values also let createConf() start from an empty instance.
    def __init__(self, ccf='', core='', qualis='', simple='', traditinal='', delay='', sub='', note='', conf='', place='', num='', browser='', content='', rate='', id=-1):
        self.ccf = ccf
        self.core = core
        self.qualis = qualis
        self.simple = simple
        self.traditinal = traditinal
        self.delay = delay
        self.sub = sub
        self.note = note
        self.conf = conf
        self.place = place
        self.num = num
        self.browser = browser
        self.content = content
        self.id = id
        self.rate = rate
        
    def setContent(self, content):
        self.content = content
    def setAcceptRate(self, rate):
        self.rate = rate
    def printConf(self):
        print 'CCF CORE QUALIS ShortN Long Delay SubmissionDt Notification Conference Place NumberedHold Nread'
        #print('x=%s, y=%s, z=%s' % (x, y, z))
        print self.ccf, self.core , self.qualis , self.simple , self.traditinal , self.delay , self.sub , self.note , self.conf , self.place , self.num , self.browser
    def printDetailConf(self):
        self.printConf()
        print('Content:')
        print self.content
    def isValid(self):
        return self.id == -1
    def json(self):
        pass
    @staticmethod    
    def createConf(line):
        #u'<td></td><td>c</td><td>b2</td><td>IDEAL</td><td><a target="_blank" href="/conference/535">International Conference on Intelligent Data Engineering and Automated Learning</a></td><td></td><td>2016-05-15</td><td>2016-06-15</td><td>2016-10-12</td><td>Wroclaw, Poland</td><td>17</td><td>3932</td>'
        #CCF	CORE	QUALIS	Short	Full name	Delayed	Submission	Notification	Conference	Place	Edition	Views
        pattern = r'<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td><a target="_blank" href="(.*?)">(.*)</a></td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>'
        info = re.findall(pattern, line, re.S)[0]
        
        conf = Conference()
        conf.ccf = info[0]
        conf.core = info[1]
        conf.qualis = info[2]
        conf.simple = info[3]
        conf.id = info[4]
        conf.traditinal = info[5]
        conf.delay = info[6]
        conf.sub = info[7]
        conf.note = info[8]
        conf.conf = info[9]
        conf.place = info[10]
        conf.num = info[11]
        conf.browser = info[12]
        #conf.printConf()
        return conf

The createConf method takes a chunk of HTML for one table row, parses it, and fills in the corresponding attributes.
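
As a quick check (not part of the program itself), the sample row from the comment above can be fed straight to createConf; this sketch relies on the no-argument construction made possible by the constructor defaults:

    row = ('<td></td><td>c</td><td>b2</td><td>IDEAL</td>'
           '<td><a target="_blank" href="/conference/535">International Conference on '
           'Intelligent Data Engineering and Automated Learning</a></td>'
           '<td></td><td>2016-05-15</td><td>2016-06-15</td><td>2016-10-12</td>'
           '<td>Wroclaw, Poland</td><td>17</td><td>3932</td>')
    conf = Conference.createConf(row)
    print conf.simple   # IDEAL
    print conf.id       # /conference/535  (relative link to the detail page)
    conf.printConf()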

##JSON handling
Since we do not want to re-crawl the whole conference list every time the program runs, the Conference objects have to be serialized to strings and written to a file. JSON is one of the most common ways to serialize objects.

So I borrowed the following object/JSON conversion code from the web.
Disclaimer: the conversion code below is not mine; it was copied from http://www.cnblogs.com/coser/archive/2011/12/14/2287739.html

class MyEncoder(json.JSONEncoder):
    def default(self,obj):
        #convert object to a dict
        d = {}
        d['__class__'] = obj.__class__.__name__
        d['__module__'] = obj.__module__
        d.update(obj.__dict__)
        return d
 
class MyDecoder(json.JSONDecoder):
    def __init__(self):
        json.JSONDecoder.__init__(self,object_hook=self.dict2object)
    def dict2object(self,d):
        #convert dict to object
        if '__class__' in d:
            class_name = d.pop('__class__')
            module_name = d.pop('__module__')
            module = __import__(module_name)
            class_ = getattr(module,class_name)
            args = dict((key.encode('ascii'), value) for key, value in d.items()) #get args
            #pdb.set_trace()
            inst = class_(**args) #create new instance
        else:
            inst = d
        return inst

These two small classes are enough to convert between objects and JSON strings.
For example, given an object p, d = MyEncoder().encode(p) encodes p into a JSON string, and o = MyDecoder().decode(d) decodes that string back into an equivalent object.
Note that in the dict2object decoding function the new object is created with inst = class_(**args), which calls the constructor of the class referred to by class_. For the code in this post that means Conference.__init__; since args is a dictionary whose keys are the attribute names of Conference, the constructor's parameter names must match the attribute names one to one.
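
A minimal round-trip sketch using the Conference class above (the field values are made up for illustration; the remaining attributes take the constructor defaults):

    p = Conference(ccf='b', simple='CIDR', sub='2016-08-01', id='/conference/1')
    s = MyEncoder().encode(p)      # Conference object -> JSON string
    o = MyDecoder().decode(s)      # JSON string -> new Conference instance
    print o.simple, o.sub, o.ccf   # CIDR 2016-08-01 b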

##Crawler module
The logic of this module is the most involved. It downloads the HTML of the conference list page and parses out the conferences; for each conference it then downloads the detail page. When the current page is finished it moves on to the next page, until every page has been processed. Finally storeConf writes all conferences to a file, so the next run can load them straight from disk.

The code is as follows:

class UrlCrawler:
    def __init__(self):
        self.rootUrl = "http://www.myhuiban.com"
        self.mgr = CookieMgr()
        self.curPage = 1
        self.endPage = -1
        self.confDic = {}
    def requestUrl(self, url):
        r = requests.get(url, cookies=self.mgr.getCookie(), headers=self.mgr.getHeader())
        r.encoding = 'utf-8'  # set the response encoding
        return r
    def setRoot(self, url):
        self.rootUrl = url
    def firstPage(self, url):
        self.pageHtml = self.requestUrl(url).text
        self.endPage = self.searchEndPage()
        self.procAPage()
    def hasNext(self):
        #pdb.set_trace()
        return self.curPage <= self.endPage
    def nextPage(self):
        #http://www.myhuiban.com/conferences?Conference_page=14
        self.pageHtml = self.requestUrl(self.rootUrl+"/conferences?Conference_page="+str(self.curPage)).text
        self.procAPage()
    def procAPage(self):
        # ... parse self.pageHtml into table rows (full version in the complete code below)
        for i in range(len(table)):
            #print table[i]
            conf = Conference.createConf(table[i])
            self.parseConfContent(conf)
            self.confDic[conf.simple] = conf
        self.curPage += 1
    def parseConfContent(self, conf):
        # ... fetch the conference detail page and fill conf.content (full version below)
    def requestConf(self):
        #load first page
        self.firstPage(self.rootUrl+"/conferences")
        #print self.confDic
        while(self.hasNext()):
            self.nextPage()
    def crawlWeb(self):
        if(len(self.confDic) == 0):
            self.loadConf()
        if(len(self.confDic) == 0):
            self.requestConf()    
        self.storeConf()    
        
    def storeConf(self):
        f = file('conferences.txt', 'w')
        confJson = json.dumps(self.confDic, cls=MyEncoder)
        #print confJson
        f.write(confJson)
        f.close()
    def loadConf(self):
        f = file('conferences.txt', 'r')
        confJson = f.read()
        self.confDic = MyDecoder().decode(confJson)
        #pdb.set_trace()
        #print confJson
        f.close()
        
    def searchEndPage(self):
        pattern = r'<li class="last"><a href="/conferences\?Conference_page=">'
        pattern = r'<li class="last"><a href="/conferences\?Conference_page=(\d*)">'
        #print self.pageHtml
        return int(re.search(pattern, self.pageHtml).group(1))

##Command-line options module
A program should have a decent external interface so that it is convenient to use. Here getopt is used to parse the command-line arguments, so several kinds of queries are supported.

The main code:

  1. The main function builds the query options from the user's input.
  2. The SearchEngine class passes the options to each Conference object and calls its match method to decide whether the conference matches; if it does, the conference is printed.

class SearchEngine:
    def __init__(self):
        self.crawler = UrlCrawler()
        self.crawler.crawlWeb()
    #date='6-8', ccf='a|b|c', conf='FAST',  key='xyz', showdetail=True    
    def searchConf(self, date='2016-0[6-8]', ccf='b', name='CIDR',  key='cloud', showdetail=True):
        pat = OnePattern(date, ccf, name, key, showdetail)
        matNum = 0
        for key, value in self.crawler.confDic.items():
            if(pat.match(value)):
                matNum += 1
        print 'Total Matched Conference Num: ', matNum

if __name__ == '__main__':
    opts, args = getopt.getopt(sys.argv[1:], "hd:c:n:k:s:", ['help', 'date=', 'ccf=', 'name=',  'key=', 'showDetail='])
    date = None
    ccf = None
    name = None
    key = None
    showdetail = False
    for op, value in opts:
        if op in ("-d", '--date'):
            date = value    
        elif op in ("-c", '--ccf'):
            ccf = value
        elif op in ("-n", '--name'):
            name = value
        elif op in ("-k", "--key"):
            key = value
        elif op in ("-s", "-showDetail"):
            showdetail = value
        elif op in ("-h", "--help"):
            print 'Usage:', sys.argv[0], 'options'
            print '    Mandatory arguments to long options are mandatory for short options too.'
            print '    -d, --date=DATE  search the given date, default none.'
            print '    -c, --ccf=[abc]  search the given CCF level, default none.'
            print '    -n, --name=NAME  search the given conference name (supports short and long names).'
            print '    -k, --key=KEYWORD  search the given KEYWORD in conference description.'
            print '    -h, --help  print this help info.'
            print 'Example:'
            print '    ', sys.argv[0], '-d 2016-0[6-8] -c b -n CIDR -k cloud -s 1'
            print '        ', 'search the conference whose submission date from 2016-06 to 2016-08, level is CCF B, keyword is cloud and detail information is shown.'
            print '    ', sys.argv[0], '--name=CIDR'
            print '    ', sys.argv[0], '-n CIDR'
            print '        ', 'search the conference name which contains CIDR.'
            print '    ', sys.argv[0], '-n ^CIDR$ -s 1'
            print '        ', 'search the conference name which contains exactly CIDR and print the detail info.'
            print '    ', sys.argv[0], '-c [ab] -k "cloud computing"'
            print '        ', 'search the conference who is CCF A or B and is about cloud computing.'
            print '    ', sys.argv[0], '-k "(cloud computing) | (distributed systems)"'
            print '        ', 'search the conference which is about cloud computing or distributed systems.'
            print '    ', sys.argv[0], '-k "cloud | computing | distributed"'
            print '        ', 'search the conference which contains cloud or computing or distributed.'
            print '    ', sys.argv[0], '-k "(Cloud migration service).*(cloud computing)"'
            print '        ', 'search the conference which contains both and are ordered.'
            print 
            sys.exit()
    SearchEngine().searchConf(date, ccf, name, key, showdetail)
    #print date, ccf, name, key, showdetail
    sys.exit(0)

#The complete crawler code
The full code is pasted below. It took about a day and has not been tested rigorously, so there are certainly bugs left; I will fix them gradually as I run into them while using the tool.

Some techniques used in the code, summarized below:

  1. Simulating the site login
  2. Saving the cookie so the next run can reuse it directly, reducing requests
  3. Issuing network requests and parsing the returned HTML with regular expressions
  4. Converting between objects and JSON with custom encoder/decoder classes
  5. Saving the objects to a file and loading them back on the next run, avoiding a network request every time
  6. Long and short command-line options, covering the common cases

Coding tricks:

  1. getCookie first tries the file and only falls back to a network request if that fails.
  2. UrlCrawler's firstPage and nextPage form an iterator-like interface, which keeps the paging logic clear.

Possible improvements:

  1. Store the objects in a database instead of a file and query it with SQL (see the sketch after this list).
  2. A richer Pattern abstraction: I wanted to support expressions like (a|b) & (c|d), but did not want to complicate the code, so it is not implemented.
  3. A query GUI. Using a GUI from Python (Tk and friends, which I have not studied closely) requires starting a local Xming server, which annoys me, so I did not look into it further. What I want is a GUI where I can hand the finished code to the Python interpreter and immediately see the interface, without depending on an extra server.
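
For the first item, here is a rough sketch of what the database variant could look like, using sqlite3 from the standard library. It is untested, and the function, table, and column names (storeConfDb, searchConfDb, conf) are made up for illustration, not part of the current code.

    import sqlite3

    def storeConfDb(confDic):
        # Dump the in-memory conference dictionary into a small SQLite table.
        db = sqlite3.connect('conferences.db')
        db.execute('CREATE TABLE IF NOT EXISTS conf '
                   '(simple TEXT PRIMARY KEY, ccf TEXT, sub TEXT, content TEXT)')
        for name, c in confDic.items():
            db.execute('INSERT OR REPLACE INTO conf VALUES (?, ?, ?, ?)',
                       (c.simple, c.ccf, c.sub, c.content))
        db.commit()
        db.close()

    def searchConfDb(ccf, keyword):
        # Query with SQL instead of looping over the dictionary in Python.
        db = sqlite3.connect('conferences.db')
        rows = db.execute('SELECT simple, sub FROM conf WHERE ccf = ? AND content LIKE ?',
                          (ccf, '%' + keyword + '%')).fetchall()
        db.close()
        return rows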

One more disclaimer:
I have not studied Python in depth; I only skimmed a basic Python book (just the core syntax and the built-in data structures), so I may not have fully grasped the language's idioms. Some parts of the code below may therefore look unidiomatic (they do not feel that way to me); after all, my habits come from C++.

#!/usr/bin/python
# -*- coding:utf-8 -*-

import urllib
import urllib2
import cookielib
import requests
import sys
import os
import pdb
import re
import json
import getopt
#pdb.set_trace()

class CookieMgr:
    def __init__(self):
        self.cookie = None
        self.filename = 'cookie.txt'
    def getCookie(self):
        if(self.cookie == None):    
            self.loadCookie()
        if(self.cookie == None):
            self.requestCookie()
        return self.cookie
    def loadCookie(self):
        self.cookie = cookielib.MozillaCookieJar()
        self.cookie.load(self.filename, ignore_discard=True, ignore_expires=True)
        for item in self.cookie:  
            print 'LoadCookie: Name = '+item.name  
            print 'LoadCookie:Value = '+item.value
    def requestCookie(self):
        # Create a MozillaCookieJar instance to hold the cookie; it is written to a file below
        self.cookie = cookielib.MozillaCookieJar(self.filename)
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookie))
        postdata = urllib.urlencode({
                    'LoginForm[email]':'xxxxxxx',
                    'LoginForm[password]':'xxxxxx',
                    'LoginForm[rememberMe]':'1',
                    'yt0':'登入'
                })
        loginUrl = 'http://www.myhuiban.com/login'
        # Simulate the login; the resulting cookie is captured in self.cookie
        result = opener.open(loginUrl,postdata)
        # Save the cookie to cookie.txt
        self.cookie.save(ignore_discard=True, ignore_expires=True)
        for item in self.cookie:  
            print 'ReqCookie: Name = '+item.name  
            print 'ReqCookie:Value = '+item.value
    def badCode(self):
        #server return 500 internal error, not use, just show tips
        login_url = 'http://www.myhuiban.com/conferences'
        # Use the saved cookie to request another URL directly
        #gradeUrl = 'http://www.myhuiban.com/conferences'
        # request that URL with the cookie-aware opener
        #result = opener.open(gradeUrl)
        #print result.read()
    def test(self):
        mgr = CookieMgr()
        cookie = mgr.getCookie() 
    def directCookie(self):
        # On the login page http://www.myhuiban.com/login, log in once and use the browser's developer tools to inspect the request and copy the cookie values below
        cookies = {
            'PHPSESSID':'3ac588c9f271065eb3f1066bfb74f4e9',
            'cb813690bb90b0461edd205fc53b6b1c':'9b40de79863acaa3df499703611cdb1e123b15c9a%3A4%3A%7Bi%3A0%3Bs%3A14%3A%22lpstudy%40qq.com%22%3Bi%3A1%3Bs%3A14%3A%22lpstudy%40qq.com%22%3Bi%3A2%3Bi%3A2592000%3Bi%3A3%3Ba%3A3%3A%7Bs%3A2%3A%22id%22%3Bs%3A4%3A%221086%22%3Bs%3A4%3A%22name%22%3Bs%3A8%3A%22Lp+Study%22%3Bs%3A4%3A%22type%22%3Bs%3A1%3A%22n%22%3B%7D%7D',
            '__utmt':'1',
            '__utma':'201260338.796552597.1428908352.1463018481.1463024893.21',
            '__utmb':'201260338.15.10.1463024893',
            '__utmc':'201260338',
            '__utmz':'201260338.1461551356.19.9.utmcsr=baidu|utmccn=(organic)|utmcmd=organic',
        }
        return cookies
    def getHeader(self):
        headers = {
            'Host': 'www.myhuiban.com',
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
        }
        return headers
       

confer_url = 'http://www.myhuiban.com/conferences'


#This is the Conference object; each instance corresponds to one conference
#Basic structure of each conference (the list-page columns):
#CCF	CORE	QUALIS	Short	Full name	Delayed	Submission	Notification	Conference	Place	Edition	Views
#Structure of a concrete conference detail page:
#http://www.myhuiban.com/conference/1733
#Call for papers: the conference description
#Acceptance rate: statistics for past years
class Conference:
    # Note: the json decoder rebuilds instances by calling this constructor with keyword
    # arguments, so the parameter names must match the attribute names exactly.
    # The default values also let createConf() start from an empty instance.
    def __init__(self, ccf='', core='', qualis='', simple='', traditinal='', delay='', sub='', note='', conf='', place='', num='', browser='', content='', rate='', id=-1):
        self.ccf = ccf
        self.core = core
        self.qualis = qualis
        self.simple = simple
        self.traditinal = traditinal
        self.delay = delay
        self.sub = sub
        self.note = note
        self.conf = conf
        self.place = place
        self.num = num
        self.browser = browser
        self.content = content
        self.id = id
        self.rate = rate
        
    def setContent(self, content):
        self.content = content
    def setAcceptRate(self, rate):
        self.rate = rate
    def printConf(self):
        print 'CCF CORE QUALIS ShortN Long Delay SubmissionDt Notification Conference Place NumberedHold Nread'
        #print('x=%s, y=%s, z=%s' % (x, y, z))
        print self.ccf, self.core , self.qualis , self.simple , self.traditinal , self.delay , self.sub , self.note , self.conf , self.place , self.num , self.browser
    def printDetailConf(self):
        self.printConf()
        print('Content:')
        print self.content
    def isValid(self):
        return self.id == -1
    def json(self):
        pass
    @staticmethod    
    def createConf(line):
        #u'<td></td><td>c</td><td>b2</td><td>IDEAL</td><td><a target="_blank" href="/conference/535">International Conference on Intelligent Data Engineering and Automated Learning</a></td><td></td><td>2016-05-15</td><td>2016-06-15</td><td>2016-10-12</td><td>Wroclaw, Poland</td><td>17</td><td>3932</td>'
        #CCF	CORE	QUALIS	Short	Full name	Delayed	Submission	Notification	Conference	Place	Edition	Views
        pattern = r'<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td><a target="_blank" href="(.*?)">(.*)</a></td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>'
        info = re.findall(pattern, line, re.S)[0]
        
        conf = Conference()
        conf.ccf = info[0]
        conf.core = info[1]
        conf.qualis = info[2]
        conf.simple = info[3]
        conf.id = info[4]
        conf.traditinal = info[5]
        conf.delay = info[6]
        conf.sub = info[7]
        conf.note = info[8]
        conf.conf = info[9]
        conf.place = info[10]
        conf.num = info[11]
        conf.browser = info[12]
        #conf.printConf()
        return conf
class MyEncoder(json.JSONEncoder):
    def default(self,obj):
        #convert object to a dict
        d = {}
        d['__class__'] = obj.__class__.__name__
        d['__module__'] = obj.__module__
        d.update(obj.__dict__)
        return d
 
class MyDecoder(json.JSONDecoder):
    def __init__(self):
        json.JSONDecoder.__init__(self,object_hook=self.dict2object)
    def dict2object(self,d):
        #convert dict to object
        if '__class__' in d:
            class_name = d.pop('__class__')
            module_name = d.pop('__module__')
            module = __import__(module_name)
            class_ = getattr(module,class_name)
            args = dict((key.encode('ascii'), value) for key, value in d.items()) #get args
            #pdb.set_trace()
            inst = class_(**args) #create new instance
        else:
            inst = d
        return inst
        

    
class UrlCrawler:
    def __init__(self):
        self.rootUrl = "http://www.myhuiban.com"
        self.mgr = CookieMgr()
        self.curPage = 1
        self.endPage = -1
        self.confDic = {}
    def requestUrl(self, url):
        r = requests.get(url, cookies=self.mgr.getCookie(), headers=self.mgr.getHeader())
        r.encoding = 'utf-8'  # set the response encoding
        return r
    def setRoot(self, url):
        self.rootUrl = url
    def firstPage(self, url):
        self.pageHtml = self.requestUrl(url).text
        self.endPage = self.searchEndPage()
        self.procAPage()
    def hasNext(self):
        #pdb.set_trace()
        return self.curPage <= self.endPage
    def nextPage(self):
        #http://www.myhuiban.com/conferences?Conference_page=14
        self.pageHtml = self.requestUrl(self.rootUrl+"/conferences?Conference_page="+str(self.curPage)).text
        self.procAPage()
    def procAPage(self):
        print 'process page: ', self.curPage, ' end:', self.endPage
        pattern = r'<tbody>\n(.*?)\n</tbody>'
        #print self.pageHtml
        table = re.findall(pattern, self.pageHtml, re.S)[0]
        #pdb.set_trace()
        re_class = re.compile(r'<tr class=.*?>\n')
        table = re_class.sub('', table)  # strip the <tr class=...> lines
        re_style = re.compile('<span.*?>', re.I)  # opening <span> tags (non-greedy)
        table = re_style.sub('', table)
        re_style = re.compile('</span>', re.I)  # closing </span> tags
        table = re_style.sub('', table)
        
        table = table.split('\n')
        for i in range(len(table)):
            #print table[i]
            conf = Conference.createConf(table[i])
            self.parseConfContent(conf)
            self.confDic[conf.simple] = conf
        self.curPage += 1
    def parseConfContent(self, conf):
        html = self.requestUrl(self.rootUrl + conf.id).text
        #print html
        pattern = r'<div class="portlet-content">\n<pre>(.*?)</pre><div class'
        table = re.findall(pattern, html, re.S)[0]
        conf.setContent(table)
    def requestConf(self):
        #load first page
        self.firstPage(self.rootUrl+"/conferences")
        #print self.confDic
        while(self.hasNext()):
            self.nextPage()
    def crawlWeb(self):
        if(len(self.confDic) == 0):
            self.loadConf()
        if(len(self.confDic) == 0):
            self.requestConf()    
        self.storeConf()    
        
    def storeConf(self):
        f = file('conferences.txt', 'w')
        confJson = json.dumps(self.confDic, cls=MyEncoder)
        #print confJson
        f.write(confJson)
        f.close()
    def loadConf(self):
        f = file('conferences.txt', 'r')
        confJson = f.read()
        self.confDic = MyDecoder().decode(confJson)
        #pdb.set_trace()
        #print confJson
        f.close()
        
    def searchEndPage(self):
        pattern = r'<li class="last"><a href="/conferences\?Conference_page=">'
        pattern = r'<li class="last"><a href="/conferences\?Conference_page=(\d*)">'
        #print self.pageHtml
        return int(re.search(pattern, self.pageHtml).group(1))

class Pattern:
    def __init__(self):
        self.patList = list()
    def match(self, conf):
        
        return True
    def addPat(self, pat):
        self.patList.append(pat)
class OnePattern:
    def __init__(self, date, ccf, conf, key, showdetail):
        self.date = date
        self.ccf = ccf
        self.conf = conf#name
        self.key = key
        self.showdetail = showdetail
        
    def match(self, conf):
        res = True
        if(self.date):
            res = re.findall(self.date, conf.sub, re.S)
        if(self.conf):
            res = (re.findall(self.conf, conf.simple, re.S) or  re.findall(self.conf, conf.traditinal, re.S)) and res
        if(self.ccf):  
            #print conf.ccf, self.ccf, re.findall(self.ccf, conf.ccf, re.S)
            res = re.findall(self.ccf, conf.ccf, re.S) and res
        if(self.key):    
            res = re.findall(self.key, conf.content, re.S) and res    


        if(res):
            if(not self.showdetail):
                conf.printConf()
            else:
                conf.printDetailConf()
            return True
        else:
            return False
        
class TwoOpPattern(Pattern):
    def __init__(self, a, b):
        Pattern.__init__(self)
        self.patList.append(a)
        self.patList.append(b)        
class OrPattern(TwoOpPattern):
    def match(self, conf):
        return self.patList[0].match(conf) or self.patList[1].match(conf)
class AndPattern(TwoOpPattern):
    def match(self, conf):
        return self.patList[0].match(conf) and self.patList[1].match(conf)
class NotPattern(Pattern):
    def match(self, conf):
        return not self.patList[0].match(conf)
    
    
class SearchEngine:
    def __init__(self):
        self.crawler = UrlCrawler()
        self.crawler.crawlWeb()
    #date='6-8', ccf='a|b|c', conf='FAST',  key='xyz', showdetail=True    
    def searchConf(self, date='2016-0[6-8]', ccf='b', name='CIDR',  key='cloud', showdetail=True):
        pat = OnePattern(date, ccf, name, key, showdetail)
        matNum = 0
        for key, value in self.crawler.confDic.items():
            if(pat.match(value)):
                matNum += 1
        print 'Total Matched Conference Num: ', matNum

if __name__ == '__main__':
    opts, args = getopt.getopt(sys.argv[1:], "hd:c:n:k:s:", ['help', 'date=', 'ccf=', 'name=',  'key=', 'showDetail='])
    date = None
    ccf = None
    name = None
    key = None
    showdetail = False
    for op, value in opts:
        if op in ("-d", '--date'):
            date = value    
        elif op in ("-c", '--ccf'):
            ccf = value
        elif op in ("-n", '--name'):
            name = value
        elif op in ("-k", "--key"):
            key = value
        elif op in ("-s", "-showDetail"):
            showdetail = value
        elif op in ("-h", "--help"):
            print 'Usage:', sys.argv[0], 'options'
            print '    Mandatory arguments to long options are mandatory for short options too.'
            print '    -d, --date=DATE  search the given date, default none.'
            print '    -c, --ccf=[abc]  search the given CCF level, default none.'
            print '    -n, --name=NAME  search the given conference name (supports short and long names).'
            print '    -k, --key=KEYWORD  search the given KEYWORD in conference description.'
            print '    -h, --help  print this help info.'
            print 'Example:'
            print '    ', sys.argv[0], '-d 2016-0[6-8] -c b -n CIDR -k cloud -s 1'
            print '        ', 'search the conference whose submission date from 2016-06 to 2016-08, level is CCF B, keyword is cloud and detail information is shown.'
            print '    ', sys.argv[0], '--name=CIDR'
            print '    ', sys.argv[0], '-n CIDR'
            print '        ', 'search the conference name which contains CIDR.'
            print '    ', sys.argv[0], '-n ^CIDR$ -s 1'
            print '        ', 'search the conference name which contains exactly CIDR and print the detail info.'
            print '    ', sys.argv[0], '-c [ab] -k "cloud computing"'
            print '        ', 'search the conference who is CCF A or B and is about cloud computing.'
            print '    ', sys.argv[0], '-k "(cloud computing) | (distributed systems)"'
            print '        ', 'search the conference which is about cloud computing or distributed systems.'
            print '    ', sys.argv[0], '-k "cloud | computing | distributed"'
            print '        ', 'search the conference which contains cloud or computing or distributed.'
            print '    ', sys.argv[0], '-k "(Cloud migration service).*(cloud computing)"'
            print '        ', 'search the conference which contains both and are ordered.'
            print 
            sys.exit()
    SearchEngine().searchConf(date, ccf, name, key, showdetail)
    #print date, ccf, name, key, showdetail
    sys.exit(0)
    
