Scraping Data with Python: Primary and Secondary Schools Nationwide

Posted by zenobia119 on 2018-07-08

Original URL: https://blog.csdn.net/zenobia119/article/details/80963222

Python

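Target site: http://www.xuexiaodaquan.com/ (學校大全, "School Directory")

Tech stack: requests + BeautifulSoup

The site's anti-scraping measures seem fairly aggressive: it often returns a page that automatically redirects to a 139 site. When that happens, try again from a different IP.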
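One way to act on the "switch IPs" advice is to retry a request through a small pool of proxies. The sketch below is only an illustration: `fetch_with_rotation`, the `fetch` callable, and the `is_blocked` check are hypothetical names that do not appear in the script, and recognising the redirect page is left to the caller because the post does not show what that page looks like.

```python
import itertools
import time


def fetch_with_rotation(fetch, url, proxies, is_blocked, max_attempts=5, delay=1.0):
    """Try `url` through successive proxies until the page does not look blocked.

    fetch(url, proxy) -> str   caller-supplied download function (hypothetical)
    is_blocked(html) -> bool   caller-supplied test for the redirect page (hypothetical)
    Returns the HTML, or None if every attempt fails or is blocked.
    """
    # Fall back to a single "no proxy" entry so an empty pool still works.
    pool = itertools.cycle(list(proxies) or [None])
    for _ in range(max_attempts):
        proxy = next(pool)
        try:
            html = fetch(url, proxy)
        except Exception:
            html = None  # treat a network error like a blocked response
        if html and not is_blocked(html):
            return html
        time.sleep(delay)  # brief pause before trying the next proxy
    return None
```

With requests, `fetch` could be a thin wrapper such as `lambda url, proxy: requests.get(url, proxies={'http': proxy} if proxy else None, timeout=30).text`; the proxy addresses themselves have to come from somewhere you control.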

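You need to prepare an Excel file containing Chinese city names and their pinyin: the first column is the city name and the second column is its pinyin. City-level granularity is enough.

List only the cities you want to scrape; to cover the whole country, list every city.

For example: (screenshot not preserved)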
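As a stand-in for the missing screenshot, here is a minimal sketch of how the two columns can be turned into the name-to-pinyin mapping the script relies on. `build_city_map` is a hypothetical helper, and the sample rows are illustrative only; they are not taken from the author's spreadsheet.

```python
def build_city_map(rows):
    """Map city name -> lowercase pinyin from (name, pinyin) pairs.

    Blank rows are skipped and surrounding whitespace is trimmed, because
    the pinyin later becomes part of a host name.
    """
    city_map = {}
    for name, pinyin in rows:
        name = str(name).strip()
        pinyin = str(pinyin).strip().lower()
        if name and pinyin:
            city_map[name] = pinyin
    return city_map


# Illustrative rows in the same layout as the spreadsheet (name, pinyin).
sample_rows = [("北京", "Beijing"), ("上海", "shanghai"), ("", "")]
city_map = build_city_map(sample_rows)
```

With xlrd, the pairs could come from `zip(sheet.col_values(0), sheet.col_values(1))`.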

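The output is a single Excel file with the following columns:

City

Type

School name

Address

Phone

Website

For example: (screenshot not preserved)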
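To make the column layout concrete, the sketch below writes records with these six columns in one pass. It deliberately swaps in the standard-library `csv` module rather than the xlrd/xlwt approach used in the script, since a CSV file can also be opened in Excel; `COLUMNS` and `write_schools_csv` are hypothetical names.

```python
import csv

# Column order matches the list above.
COLUMNS = ["City", "Type", "School name", "Address", "Phone", "Website"]


def write_schools_csv(records, stream):
    """Write a header row followed by one row per school record (a dict)."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)
```

When writing to disk, open the file with `open(path, "w", newline="", encoding="utf-8-sig")` so that Excel detects the encoding and the Chinese text displays correctly.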

Here is the code:

from bs4 import BeautifulSoup
import requests
import re
import sys
import xlwt
import xlrd
from xlutils.copy import copy

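# Fetch a page and return its HTML (decoded as GBK by default)
def getHtmlText(url, code="GBK"):
    try:
        headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36'}
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except requests.RequestException:
        # Return an empty page on failure so the parsers simply find nothing
        print("getHtmlText failed: " + url)
        return ""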
# Parse the regions and return a dict of region name -> link (currently disabled)
'''
def getGroundList(htext):
    try:
        grounddict = {}
        soup = BeautifulSoup(htext, "html.parser")
        gdname = soup.find('dl', attrs={'class':'nobackground'})
        keyList = gdname.find_all('a')
        for i in range(1,len(keyList)):
            key = keyList[i].text
            val = keyList[i].get('href')
            grounddict[key] = val
        return grounddict
    except:
        print("getGroundList failed")
'''
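# Parse the total number of pages from the "last page" link; 0 if there is none
def getPageCode(htext,typeitem):
    try:
        soup = BeautifulSoup(htext, "html.parser")
        s1 = soup.find('a', attrs={'class':'last'})
        if s1 and s1.get('href'):
            # The link looks like <typeitem>pn<N>.html; escape the literal parts
            pat = re.compile(re.escape(typeitem) + r'pn([0-9]+)\.html')
            code = pat.search(s1.get('href'))
            if code:
                return int(code.group(1))
        # No link or no match: there is only a single page
        return 0
    except Exception as e:
        print("getPageCode failed:", e)
        return 0
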

# Parse the schools on a page (name, address, phone, website) and save each one
def getSchoolList(htext,fileAddress,cityitem,typeitem):
    soup = BeautifulSoup(htext, "html.parser")
    sclist1 = soup.find_all('dl',attrs={'class':'left'})
    sclist2 = soup.find_all('dl',attrs={'class':'right'})
    sclist = sclist1 + sclist2
    for item in sclist:
        try:
            # A fresh dict per school; the keys are the ones savefile() expects
            schoolDict = {}
            schoolDict['城市'] = cityitem
            schoolDict['型別'] = typeitem
            schoolDict['學習名稱'] = item.find('p').text
            sl = item.find_all('li')
            schoolDict['地址'] = sl[0].text
            schoolDict['電話'] = sl[1].text
            schoolDict['網址'] = sl[2].text
            #f = open(fileAddress, 'a', encoding='utf-8')
            #f.write(str(schoolDict)  + '\n' )
            savefile(schoolDict,fileAddress)
        except Exception as e:
            # Skip a malformed entry instead of abandoning the rest of the page
            print("getSchoolList failed:", e)

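# Append one school as a new row at the end of the Excel file
def savefile(schoolDict,fileAddress):
    try:
        workbook = xlrd.open_workbook(fileAddress)
    except FileNotFoundError:
        # xlrd can only read an existing file, so create an empty workbook first
        newbook = xlwt.Workbook()
        newbook.add_sheet('school')
        newbook.save(fileAddress)
        workbook = xlrd.open_workbook(fileAddress)
    sheet = workbook.sheet_by_index(0)
    # xlrd workbooks are read-only; make a writable copy via xlutils
    wb = copy(workbook)
    ws = wb.get_sheet(0)
    rowNum = sheet.nrows  # index of the first empty row
    ws.write(rowNum,0,schoolDict['城市'])
    ws.write(rowNum,1,schoolDict['型別'])
    ws.write(rowNum,2,schoolDict['學習名稱'])
    ws.write(rowNum,3,schoolDict['地址'])
    ws.write(rowNum,4,schoolDict['電話'])
    ws.write(rowNum,5,schoolDict['網址'])
    wb.save(fileAddress)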
        
# Load the cities from the Excel file: column 0 is the city name, column 1 its pinyin
def getCityList():
    try:
        cityFileAddress = r'D:\中國省市名稱拼音.xls'
        file = xlrd.open_workbook(cityFileAddress)
        sheet = file.sheet_by_name('city')
        cityDic = {}
        for i in range(sheet.nrows):
            key = sheet.cell_value(i, 0)
            value = sheet.cell_value(i, 1).lower()
            cityDic[key] = value
        return cityDic
    except Exception as e:
        # Return an empty dict so the caller can still iterate safely
        print("getCityList failed:", e)
        return {}
            
def main():
    cityList = getCityList() or {}
    typeList = {'小學':'/xiaoxue/','初中':'/chuzhong/','高中':'/gaozhong/'}
    fileAddress = 'D:/school.xls'
    for cityitem in cityList:
        # Each city has its own subdomain, named after its pinyin
        searchUrl = 'http://'+ cityList[cityitem] + '.xuexiaodaquan.com'
        for typeitem in typeList:
            htext = getHtmlText(searchUrl+typeList[typeitem])
            getSchoolList(htext,fileAddress,cityitem,typeitem)
            # Page 1 is saved above; fetch pages 2..pagecode if there are any
            pagecode = int(getPageCode(htext,typeList[typeitem]) or 0)
            for i in range(2,pagecode+1):
                h1text = getHtmlText(searchUrl+typeList[typeitem]+'pn'+str(i)+'.html')
                getSchoolList(h1text,fileAddress,cityitem,typeitem)

if __name__ == '__main__':
    main()
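The pagination logic in the script can be isolated and checked without touching the network. The following is a minimal sketch under stated assumptions: `last_page_number` and `page_urls` are hypothetical helpers, and the example link shape `/xiaoxue/pn12.html` is inferred from the regular expression in the script rather than taken from the live site.

```python
import re


def last_page_number(href, type_path):
    """Extract N from a 'last page' link such as '/xiaoxue/pn12.html'.

    Returns 0 when the link is missing or does not match, which means
    the listing has a single page.
    """
    match = re.search(re.escape(type_path) + r"pn([0-9]+)\.html", href or "")
    return int(match.group(1)) if match else 0


def page_urls(base_url, type_path, last_page):
    """List every page URL: the bare listing first, then pn2.html .. pnN.html."""
    urls = [base_url + type_path]
    urls += [f"{base_url}{type_path}pn{n}.html" for n in range(2, last_page + 1)]
    return urls
```

Generating the full URL list up front also makes it easy to log which pages failed and retry only those.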
