Python code for personal use (cleaning the web page source of standards records from a certain "Fang" site)

Posted by 右介 on 2017-06-27

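This script cleans the "standard" records stored in MongoDB. Each record holds the raw source of a web page, from which the following fields must be extracted:

standard title, foreign-language title, standard number, issuing body, issue date, status, implementation date, format and page count, adoption relationship, Chinese Library Classification (CLC) number, Chinese Classification for Standards (CCS) number, International Classification for Standards (ICS) number, country, keywords, abstract, and replaced (superseded) standard.

The extracted fields are assembled into a dictionary and inserted into another collection.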

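# coding=utf-8
from pymongo import MongoClient
from lxml import etree
import requests  # only used by the commented-out fetch test at the bottom

# Field labels as they appear in the page source (each ends with a full-width colon).
# The XPath queries in standard_dict() match them verbatim, so they are not translated.
s = [u'標準編號：',u'釋出單位：',u'釋出日期：',u'狀態：',u'實施日期：',u'開本頁數：',u'採用關係：',
    u'中圖分類號：',u'中國標準分類號：',u'國際標準分類號：',u'國別：',u'關鍵詞：',u'摘要：']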

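# Connect to MongoDB and return the database handle
def get_db():
    # 'IP', "username" and "password" are placeholders.
    # PyMongo 4 removed Database.authenticate(), so the credentials are passed to MongoClient.
    client = MongoClient('IP', 27017, username="username", password="password", authSource='wanfang')
    db = client.wanfang
    return db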

# Return the content of the num-th record (1-based), or None if it is missing or empty
def get_data(table, num):
    for i, item in enumerate(table.find({}, {"content": 1, "_id": 0}), 1):
        if i == num:
            return item.get('content') or None
    return None

# Return the first element of an XPath result list, or "" if the list is empty
def list_str(values):
    return values[0] if values else ""

# Extract classification codes: split on whitespace and drop tokens containing parentheses
def code_ls(values):
    if not values:
        return ""
    parens = (u"(", u")", u"（", u"）")
    return [token for token in values[0].split()
            if not any(p in token for p in parens)]

# Build the keyword list ("" when no keywords were found)
def keywords_ls(values):
    return values if values else ""

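# Replaced standard: take the first text node and drop its first 5 characters
# (presumably the field label)
def replace_str(replace):
    # `replace` is an XPath result list; it is empty when the page has no such block
    ls = [i.strip().replace("\r\n", "") for i in replace]
    if ls:
        return ls[0][5:]
    return ""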

# Extract the abstract; a value that starts with "<" is discarded
def summary_str(values):
    if values and not values[0].startswith("<"):
        return values[0]
    return ""

# Convert a date such as "2017年6月28日" to "2017-06-28"
def date_str(values):
    if not values:
        return ""
    text = values[0]
    year = text.find(u'年')
    month = text.find(u'月')
    day = text.find(u'日')
    # A gap of 2 between markers means a single-digit month or day, so zero-pad it
    if month - year == 2:
        text = text.replace(u"年", u"年0")
    if day - month == 2:
        text = text.replace(u"月", u"月0")
    return text.replace(u"日", u"").replace(u"月", u"-").replace(u"年", u"-")

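# Parse the adoption field into a list of {"code": ..., "type": ...} entries.
# Each entry is read as "<code>,<3-character type>"; parsing resumes right after the type.
def adopted_ls(string, ls):
    loc = string.find(',')
    if loc == -1:
        return ls
    ls.append({"code": string[:loc].strip(), "type": string[loc+1:loc+4]})
    return adopted_ls(string[loc+4:], ls)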

# Build the dictionary that is stored for one standard
def standard_dict(html):
    dc = {}
    tree = etree.HTML(html)
    # Standard title
    dc["title"] = list_str(tree.xpath("//h1/text()"))
    # Foreign-language title
    dc["title_eng"] = list_str(tree.xpath("//h2/text()"))
    # Standard number
    dc["standard_number"] = list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[0])))
    # Issuing body
    dc["publishing_department"] = list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[1])))
    # Issue date
    dc["release_date"] = date_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[2])))
    # Status
    dc["state"] = list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[3])))
    # Implementation date
    dc["enforcement_date"] = date_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[4])))
    # Format and page count
    dc["pages"] = list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[5])))
    # Adoption relationship
    dc["adopted"] = adopted_ls(list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[6]))), [])
    # Chinese Library Classification (CLC) number
    dc["clc"] = code_ls(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[7])))
    # Chinese Classification for Standards (CCS) number
    dc["ccs"] = code_ls(tree.xpath("//span[text()='%s']/following-sibling::*/child::*/text()"%(s[8])))
    # International Classification for Standards (ICS) number
    dc["ics"] = code_ls(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[9])))
    # Country
    dc["country"] = list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[10])))
    # Keywords
    dc["keywords"] = keywords_ls(tree.xpath("//span[text()='%s']/following-sibling::*/child::*/text()"%(s[11])))
    # Abstract
    dc["summary"] = summary_str(tree.xpath("//span[text()='%s']/parent::*/following-sibling::*/text()"%(s[12])))
    # Replaced standard
    dc["replace_for"] = replace_str(tree.xpath("//div[@id='replaceStandard']//child::*//text()"))
    return dc

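# Entry point: clean every record in `standard` and store the result in `standard_cleaned`
def main():
    db = get_db()
    collection = db.standard
    collection2 = db.standard_cleaned
    for item in collection.find({}, {"content": 1, "_id": 0}):
        # Skip records whose content is missing or empty
        if item.get('content'):
            dc = standard_dict(item['content'])
            # insert_one() replaces the insert() method that PyMongo 4 removed
            collection2.insert_one(dc)

if __name__ == '__main__':
    main()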
    
    # The code below tests cleaning one specific record
    # db = get_db()
    # collection = db.standard
    # collection2 = db.standard_cleaned
    # data = get_data(collection, 8)
    # dc = standard_dict(data)
    # collection2.insert_one(dc)
    # for k, v in dc.items():
    #     print(k, v)

    # # The code below tests abstract extraction
    # data = requests.get('http://d.wanfangdata.com.cn/Standard/ISO%208528-5-2013')
    # dc = standard_dict(data.text)
    # for k, v in dc.items():
    #     print(k, v)

    # # The code below tests the date reformatting
    # l1 = [u"2017年6月28日"]
    # l2 = [u"2017年10月27日"]
    # l3 = [u"2017年12月1日"]
    # l4 = [u"2017年7月1日"]
    # print(date_str(l1))
    # print(date_str(l2))
    # print(date_str(l3))
    # print(date_str(l4))
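
The two string-handling steps in the script, date normalization and adoption-field parsing, can be checked without a database or a network connection. The following is a minimal stdlib-only sketch for Python 3, not the author's code. `normalize_date` and `parse_adopted` are hypothetical names; the regular expression stands in for the index arithmetic in `date_str`, and the loop stands in for the recursion in `adopted_ls`. The sample adoption string used in the note below is an assumed shape for that field, not a value taken from the site.

```python
import re

# Matches dates written as "<year>年<month>月<day>日", e.g. "2017年6月28日".
_DATE_RE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def normalize_date(text):
    """Return 'YYYY-MM-DD' for a 'YYYY年M月D日' string, or '' if nothing matches.

    Unlike date_str in the script, unmatched input is dropped instead of
    being passed through with the markers replaced.
    """
    match = _DATE_RE.search(text)
    if match is None:
        return ""
    year, month, day = match.groups()
    # Zero-pad month and day to two digits.
    return "%s-%02d-%02d" % (year, int(month), int(day))


def parse_adopted(text):
    """Split an adoption string into {'code', 'type'} entries.

    Uses the same slicing as adopted_ls: the code is everything before a
    comma, the type is the 3 characters after it, and parsing resumes
    right after the type. The input format itself is an assumption.
    """
    entries = []
    while True:
        loc = text.find(",")
        if loc == -1:
            return entries
        entries.append({"code": text[:loc].strip(),
                        "type": text[loc + 1:loc + 4]})
        text = text[loc + 4:]
```

With these definitions, `normalize_date("2017年6月28日")` gives `"2017-06-28"`, and `parse_adopted("ISO 8528-5:2013,IDT")` gives a single entry with code `ISO 8528-5:2013` and type `IDT`. The loop avoids the recursion depth that `adopted_ls` would use on a field with many entries, though such fields are probably short in practice.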
