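Scraping Administrative Division Codes

Posted by bigroc on 2024-02-27

Original URL: https://www.cnblogs.com/bigroc/p/18037559

Scraping the National Bureau of Statistics' statistical division codes and urban-rural classification codes, 2023 edition

Python implementation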

1. Open the National Bureau of Statistics website

https://www.stats.gov.cn/sj/tjbz/qhdm/

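2. Analyze the URL at each level to find the pattern

Province level: https://www.stats.gov.cn/sj/tjbz/tjyqhdmhcxhfdm/2023/index.html
Prefecture (city) level: https://www.stats.gov.cn/sj/tjbz/tjyqhdmhcxhfdm/2023/61.html (61 is the code for Shaanxi)
County (district) level: https://www.stats.gov.cn/sj/tjbz/tjyqhdmhcxhfdm/2023/61/6101.html

The pattern: take the directory of the current page's URL and append a link's href to get the URL of the next level down.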
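The joining rule can be checked without touching the network. The snippet below is a minimal sketch that uses the standard library's urljoin, which resolves a relative href against the current page in the same way (the script in this post does it by hand with rsplit). The county-level href "01/610102.html" is a hypothetical value used only for illustration.

```python
from urllib.parse import urljoin

# Province index page for the 2023 edition (URL taken from the article).
index_url = "https://www.stats.gov.cn/sj/tjbz/tjyqhdmhcxhfdm/2023/index.html"

# Each href is relative to the page it appears on.
province_url = urljoin(index_url, "61.html")
city_url = urljoin(province_url, "61/6101.html")
# Hypothetical href, assumed here only to show one more level of joining.
county_url = urljoin(city_url, "01/610102.html")

print(province_url)
print(city_url)
print(county_url)
```

Because every href is resolved against the page that contains it, the crawler only needs to remember the URL of the parent page, never the full path from the root.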

The code

import json
import time

import requests
from bs4 import BeautifulSoup

main_url = "https://www.stats.gov.cn/sj/tjbz/tjyqhdmhcxhfdm/2023"


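class area_code:
    """One administrative division node: name, code, relative URL and child nodes."""
    # All attributes are set per instance in __init__; class-level defaults
    # (in particular a shared mutable list for child) are not needed.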

    def __init__(self, name, code, url, child, urban_rural_type=0):
        self.name = name
        self.code = code
        self.url = url
        self.child = child
        self.urban_rural_type = urban_rural_type
        self.lng = 0
        self.lat = 0


# Scrape the national statistical division codes and urban-rural classification codes
# Requires: pip install requests beautifulsoup4
def get_code(suffix_url="index.html"):
    _province_url = "{}/{}".format(main_url, suffix_url)
    response = requests.get(_province_url)
    response.encoding = "utf-8"
    _html = response.text
    _soup = BeautifulSoup(_html, "html.parser")
    _province_code = {}
    for a in _soup.find_all("a"):
        if a.get("href") and a.get("href").endswith(".html"):
            _province_code[a.text] = a.get("href")
    return _province_code


def get_child_code(_url, _parent_url=None, _retry=3):
    """
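    Recursively fetch the divisions listed on a page and everything below them.
    Returns a list of area_code objects, for example:
    [{name:"呼和浩特市", code:"150100000000", url:"15/1501.html"},{name:"包頭市", code:"150200000000", url:"15/1502.html"}]
    :param _url: href of the current page, relative to its parent page
    :param _parent_url: full URL of the parent page (None at the province level)
    :param _retry: number of retries left
    :return: list of area_code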
    """
    _city_code = []
    if _parent_url is not None and len(_parent_url) > 0:
        # Keep everything before the last "/" (the directory of the parent page)
        _parent_path = _parent_url.rsplit("/", 1)[0]
        _req_url = "{}/{}".format(_parent_path, _url)
    else:
        _req_url = "{}/{}".format(main_url, _url)
    try:
        response = requests.get(_req_url)
    except Exception as e:
        if _retry > 0:
            time.sleep(1)
            print("Request failed: {}, retry #{}".format(e, 4 - _retry))
            return get_child_code(_url, _parent_url, _retry - 1)
        else:
            raise e
    response.encoding = "utf-8"
    _html = response.text
    _soup = BeautifulSoup(_html, "html.parser")

    # class_="citytr" or class_="towntr" or class_="countytr" or class_="villagetr"
    for tr in _soup.find_all("tr", class_=["citytr", "towntr", "countytr"]):
        _tds = tr.find_all("td")
        print("Processing - {}".format(_tds[1].text))
        _link = _tds[0].find("a")
        _child_url = _link.get("href") if _link is not None else None
        if _child_url and _child_url.endswith(".html"):
            # The row links to a lower-level page: recurse into it
            _child = get_child_code(_child_url, _req_url)
            _city_code.append(area_code(_tds[1].text, _tds[0].text, _child_url, _child))
        else:
            # No usable link: record the row as a leaf instead of dropping it
            _city_code.append(area_code(_tds[1].text, _tds[0].text, "", []))
    for tr in _soup.find_all("tr", class_=["villagetr"]):
        _tds = tr.find_all("td")
        code = _tds[0].text
        urban_rural_type = _tds[1].text
        name = _tds[2].text
        _city_code.append(area_code(name, code, "", [], urban_rural_type))
    return _city_code


def get_province_list():
    """
    Get the codes of provinces, municipalities and autonomous regions.
    :return: list of area_code
    """
    province_map = get_code()
    _province_list = []
    for _name, _url in province_map.items():
        _province_list.append(area_code(_name, _url.split(".")[0], _url, []))
    return _province_list


if __name__ == '__main__':
    province_list = get_province_list()
    # Fetch the city-level codes (get_child_code recurses into the lower levels)
    for province in province_list:
        print("Processing - {}".format(province.name))
        city_code = get_child_code(province.url)
        province.child = city_code
    # Write the result to a JSON file
    with open("area_code.json", "w", encoding="utf-8") as f:
        f.write(json.dumps(province_list, default=lambda obj: obj.__dict__, ensure_ascii=False))
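The retry logic in get_child_code is recursive and hard-codes the attempt number as 4 - _retry. As an alternative, here is a minimal sketch of a loop-based helper with exponential backoff. The name call_with_retry and its parameters are hypothetical and not part of the original script. The sleep function is injectable so the helper can be exercised without actually waiting.

```python
import time


def call_with_retry(func, retries=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call func(); on any exception retry up to `retries` more times.

    The wait before each retry starts at `delay` seconds and is multiplied
    by `backoff` after every failed attempt. The last exception is re-raised
    once the retries are used up.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception as exc:
            if attempt >= retries:
                raise
            attempt += 1
            print("Request failed: {}, retry {} of {}".format(exc, attempt, retries))
            sleep(delay)
            delay *= backoff
```

Inside get_child_code it could be used as `response = call_with_retry(lambda: requests.get(_req_url, timeout=10))`; the 10-second timeout is an assumed value, chosen only so that a stalled connection eventually raises and triggers a retry.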

Limitations

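The JSON output is very large; it is better to write the records straight to a database or to generate a CSV file instead.
Resuming an interrupted crawl is not supported; this will be improved later.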
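For the CSV suggestion, one possible approach is to flatten the nested tree into one row per division, with a parent_code column to preserve the hierarchy. The sketch below assumes the input has the same keys the script writes to area_code.json (name, code, child, urban_rural_type). The function names, the column layout and the two-row sample are illustrative assumptions, not part of the original script.

```python
import csv
import io


def flatten(nodes, parent_code=""):
    """Yield (code, name, parent_code, urban_rural_type) for every node, depth-first."""
    for node in nodes:
        yield (node["code"], node["name"], parent_code, node.get("urban_rural_type", 0))
        yield from flatten(node.get("child", []), node["code"])


def write_csv(nodes, fileobj):
    """Write a header row followed by one row per division."""
    writer = csv.writer(fileobj)
    writer.writerow(["code", "name", "parent_code", "urban_rural_type"])
    writer.writerows(flatten(nodes))


# Hand-written sample in the shape of area_code.json (illustrative only).
sample = [
    {"name": "Shaanxi", "code": "61", "url": "61.html", "urban_rural_type": 0,
     "child": [
         {"name": "Xi'an", "code": "610100000000", "url": "61/6101.html",
          "urban_rural_type": 0, "child": []},
     ]},
]

buffer = io.StringIO()
write_csv(sample, buffer)
print(buffer.getvalue())
```

To convert the real output, load area_code.json with json.load and pass the result to write_csv with a file opened using newline="" and encoding="utf-8". Writing rows as each page is parsed, rather than after the whole crawl, would also be a natural starting point for resumable crawling.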
