Python personal-use code (cleaning the web page source of CNKI conference papers)

Posted by 右介 on 2017-07-17

#coding=utf-8
from pymongo import MongoClient
from lxml import etree
import requests

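# Label text of the <p> elements that hold organizations and authors.
# The XPath queries compare against it exactly, whitespace included.
jigou = u"\r\n      【機構】\r\n      "
zuozhe = u"\r\n        【作者】\r\n          "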

# Get the database handle
def get_db():
    client = MongoClient('localhost', 27017)
    db = client.cnki
    db.authenticate("username", "password")
    return db

# Get the num-th record (1-based); records without HTML are skipped
def get_data(table, num):
    i = 1
    for item in table.find({}, {"html":1,"_id":0}):
        if i==num:
            if 'html' in item and item['html']:
                return item['html']
        else:
            i+=1

# Return the first element of a list as a string ("" if the list is empty)
def list_str(items):
    if len(items)!=0:
        return items[0]
    else:
        return ""

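# English author names and English organization names
def en_ls(items, length1, length2):
    if len(items)!=0:
        # Drop the label and line breaks, then split the ";"-separated names
        parts = items[0].replace(u"【Author】","").replace("\r\n","").strip().split(";")
        # Expect length1 authors + length2 organizations + one trailing piece,
        # which the -1 slice below drops
        if len(parts)==(length2+length1)+1:
            return list2str(parts[:length1]), list2str(parts[length1:-1])
        else:
            return "", ""
    else:
        return "", ""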

# Conference information: proceedings title, conference name, date, venue,
# classification number, organizer
def hyxx(items):
    if len(items)!=0:
        hylmc,hymc,hysj,hydd,flh,zbdw = "","","","","",""
        for item in items:
            if u"【會議錄名稱】" in item:
                hylmc = item.replace(u"【會議錄名稱】","").replace("\r\n","").strip()
                continue
            if u"【會議名稱】" in item:
                hymc = item.replace(u"【會議名稱】","").replace("\r\n","").strip()
                continue
            if u"【會議時間】" in item:
                hysj = item.replace(u"【會議時間】","").replace("\r\n","").strip()
                continue
            if u"【會議地點】" in item:
                hydd = item.replace(u"【會議地點】","").replace("\r\n","").strip()
                continue
            if u"【分類號】" in item:
                flh = item.replace(u"【分類號】","").replace("\r\n","").strip()
                continue
            if u"【主辦單位】" in item:
                # Organizers are separated by "、" on the page; normalize to ";"
                zbdw = item.replace(u"【主辦單位】","").replace(u"、",";").replace("\r\n","").strip()
                continue
        return hylmc,hymc,hysj,hydd,flh,zbdw
    else:
        return "","","","","",""

# Join a list into a ";"-separated string ("" if the list is empty)
def list2str(items):
    if len(items)!=0:
        return ";".join(items)
    else:
        return ""

# Build the dictionary that is stored in the database for one paper
def standard_dict(html):
    dc = {}
    print(1)  # progress marker
    # print(html)
    tree = etree.HTML(html)
    # Paper title
    dc["title"] = list_str(tree.xpath("//span[@id='chTitle']/text()"))
    # Foreign-language title
    dc["title_eng"] = list_str(tree.xpath("//span[@id='enTitle']/text()"))
    # Authors
    dc["author"] = list2str(tree.xpath("//p[text()='%s']/a/text()"%zuozhe))
    # Number of authors
    length1 = len(tree.xpath("//p[text()='%s']/a/text()"%zuozhe))
    # Organization names
    dc["organization"] = list2str(tree.xpath("//p[text()='%s']/a/text()"%jigou))
    # Number of organizations
    length2 = len(tree.xpath("//p[text()='%s']/a/text()"%jigou))
    # English author names, English organization names
    dc["author_eng"], dc["organization_eng"] = en_ls(tree.xpath("//p[@id='au_en']/text()"), length1, length2)
    # Abstract
    dc["summary"] = list_str(tree.xpath("//span[@id='ChDivSummary']/text()"))
    # English abstract
    dc["summary_eng"] = list_str(tree.xpath("//span[@id='EnChDivSummary']/text()"))
    # Keywords
    dc["keywords"] = list2str(tree.xpath("//div[@class='keywords']/span[1]/a/text()"))
    # English keywords
    dc["keywords_eng"] = list2str(tree.xpath("//div[@class='keywords']/span[2]/a/text()"))
    # Conference information
    dc["proceeding_title"],dc["conference_title"],dc["conference_date"],dc["conference_place"],dc["huiyflh"],dc["conference_org"] = hyxx(tree.xpath("//div[@class='summary']/ul/li/text()"))
    # Fall back to the linked text when the proceedings title was not found above
    if dc["proceeding_title"]=="":
        print(2)  # marker: fallback path taken
        dc["proceeding_title"] = list_str(tree.xpath("//div[@class='summary']/ul[1]/li/a/text()"))

    return dc

# Main function
def main():
    db = get_db()
    # Raw pages are read from `conference`; cleaned records go to `conference_cleaned`
    collection = db.conference
    collection2 = db.conference_cleaned
    for item in collection.find({}, {"html":1,"_id":0}):
        if 'html' in item and item['html']:
            dc = standard_dict(item['html'])
            collection2.insert(dc)


if __name__ == '__main__':
    main()
    # The code below is for testing the cleaning of one specific record
    # db = get_db()
    # collection=db.conference
    # data = get_data(collection, 1)
    # dc = standard_dict(data)
    # for k,v in dc.items():
    #     print("%s %s" % (k, v))
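
The label-by-label matching in hyxx could also be written as a table-driven lookup. The following is only a minimal sketch, assuming the same 【…】-prefixed `li` text that hyxx receives. The names `LABELS` and `parse_labeled_fields` are hypothetical and not part of the script above, and the sample strings are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical table-driven variant of hyxx(); all names here are illustrative.

# Maps each page label to the output key used in standard_dict().
LABELS = [
    (u"【會議錄名稱】", "proceeding_title"),
    (u"【會議名稱】", "conference_title"),
    (u"【會議時間】", "conference_date"),
    (u"【會議地點】", "conference_place"),
    (u"【分類號】", "huiyflh"),
    (u"【主辦單位】", "conference_org"),
]


def parse_labeled_fields(items):
    """Return a dict with one entry per label; missing labels map to ""."""
    result = dict((key, "") for _, key in LABELS)
    for item in items:
        for label, key in LABELS:
            if label in item:
                value = item.replace(label, "").replace("\r\n", "").strip()
                if key == "conference_org":
                    # hyxx() normalizes the organizer separator to ";"
                    value = value.replace(u"、", ";")
                result[key] = value
                break  # first matching label wins, like `continue` in hyxx()
    return result


if __name__ == "__main__":
    # Made-up sample input, not taken from a real page
    sample = [u"\r\n  【會議名稱】Example Conference\r\n", u"【主辦單位】Org A、Org B"]
    fields = parse_labeled_fields(sample)
    print(fields["conference_title"])
    print(fields["conference_org"])
```

Returning a dict keyed by the same names as `dc` would let the caller merge the result with `dc.update(...)` instead of unpacking six values in a fixed order.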
