Scraping the 91job Employment Knowledge Contest Question Bank

Posted by V-Sugar on 2020-10-30

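Recently there has been a college student employment knowledge contest (you need a student account to log in). Judging from the site, the question bank is hard to memorize if you just read through it yourself. My approach was to scrape the question bank with a crawler, so that I could look things up directly while answering. I have some free time today, so here is a write-up of how I scraped it. For learning purposes only.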

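  1. On the site, find the request whose response carries the data. As the screenshot shows, it is the "question" request, and its response contains the data we want.
    [Screenshot: the question request and its response]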

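  2. Once the response has been located, check the request URL and the data carried in the request headers. Because logging in requires a school student account, the cookie in those headers certainly contains the account information, so the code must send the cookie as well, as the screenshot shows.
    [Screenshot: the request URL and request headers]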
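A minimal sketch of step 2, assuming the cookie copied from the browser is kept in an environment variable instead of being hard-coded. The variable name JOB91_COOKIE, the helper build_headers, and the shortened USER_AGENTS list are illustrative and not part of the original script.

```python
import os
import random

# Shortened, illustrative list; the full script below uses a much longer one.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
]


def build_headers(cookie, user_agents=USER_AGENTS):
    """Return request headers carrying the login cookie and a random User-Agent."""
    if not cookie:
        raise ValueError("a Cookie value from a logged-in session is required")
    return {
        "User-Agent": random.choice(user_agents),
        "Cookie": cookie,
    }


if __name__ == "__main__":
    # JOB91_COOKIE is an assumed variable name; set it to the Cookie header
    # copied from your own logged-in browser session.
    cookie = os.environ.get("JOB91_COOKIE", "")
    if cookie:
        print(build_headers(cookie))
    else:
        print("JOB91_COOKIE is not set")
```

Keeping the cookie out of the source file also avoids publishing account details when the script is shared.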

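  3. With the request URL found, we can work out where the data we need sits in the page (this takes some XPath knowledge). We need four things: 1. the question, 2. its options, 3. the correct answer, 4. the link to the next question. The screenshot below shows exactly what we are looking for.
    [Screenshot: the question, options, and correct answer in the page source]
    After looking at a few questions, it is clear that the next-question URL follows a pattern, so there is no need to find its XPath.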
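Since the pattern is just a page number in the query string, the URLs can be generated rather than scraped. A small sketch, assuming the count of 530 questions given in this post; the names BASE_URL, TOTAL_QUESTIONS, and question_urls are illustrative.

```python
BASE_URL = "http://ccit.91job.org.cn/contest/question?page={}"
TOTAL_QUESTIONS = 530  # count stated in this post


def question_urls(total=TOTAL_QUESTIONS):
    """Yield the URL of every question page, from page=1 to page=total."""
    for page in range(1, total + 1):
        yield BASE_URL.format(page)


urls = list(question_urls())
print(len(urls))
print(urls[0])
print(urls[-1])
```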

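XPath for the question: "//div[@class='title']//text()"
XPath for the options: "//div[@class='answer']//text()"
XPath for the correct answer: "//div[@class='right']//text()"
Next-question URL: http://ccit.91job.org.cn/contest/question?page=i     Note: page=i means page equals a number, and that number is the question's position; there are 530 questions in total.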
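To see what these three expressions select, here is a sketch that runs on a made-up page fragment. Two assumptions: the sample markup is invented (the real page may differ), and the standard library's xml.etree.ElementTree stands in for lxml so the snippet runs without third-party packages. ElementTree only parses well-formed markup and supports a subset of XPath, so the real script should keep using lxml.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the real page layout may differ.
SAMPLE = """
<html><body>
  <div class="title">1. Sample question text?</div>
  <div class="answer"><p>A. first option</p><p>B. second option</p></div>
  <div class="right">Correct answer: A</div>
</body></html>
"""


def texts(root, css_class):
    """Collect the non-empty text fragments under every div with this class."""
    found = []
    for div in root.findall(".//div[@class='{}']".format(css_class)):
        # itertext() walks all nested text, like //text() in the XPath above.
        for fragment in div.itertext():
            fragment = fragment.strip()
            if fragment:
                found.append(fragment)
    return found


root = ET.fromstring(SAMPLE)
item = {
    "test": texts(root, "title"),
    "answer": texts(root, "answer"),
    "right": texts(root, "right"),
}
print(item)
```

The dictionary keys match the ones used in the script ("test", "answer", "right").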
  4. Write the code
# coding=utf-8
import requests
from lxml import etree
import json
import time
import random


class Job():
    """Innovation and entrepreneurship quiz questions"""

    def __init__(self, url):
        self.url = url
        self.comment_url = "http://ccit.91job.org.cn/contest/question?page={}"  # URL template for the next page
        self.USER_AGENT_LIST = [
    "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
    "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
    "Mozilla/2.02E (Win95; U)",
    "Mozilla/3.01Gold (Win95; I)",
    "Mozilla/4.8 [en] (Windows NT 5.1; U)",
    "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
    "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
    "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
]  # pool of User-Agent strings so the program can pose as different devices when visiting the site
        self.headers = {
            "User-Agent": random.choice(self.USER_AGENT_LIST),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            # Paste the Cookie header from your own logged-in browser session here.
            # It carries account details, so do not publish the real value.
            "Cookie": "PASTE_YOUR_OWN_COOKIE_HERE",

        }  # Accept, Accept-Encoding and Accept-Language are optional; User-Agent and Cookie are enough
    
    def parse_url(self, url):
        """Send the request and return the decoded page"""
        html = requests.get(url, headers=self.headers)
        return html.content.decode()

    def get_content_list(self, html):
        """Extract the question, options, and correct answer from one page"""
        html_xml = etree.HTML(html)
        item = dict()
        item["test"] = [i.replace("\n","") for i in html_xml.xpath("//div[@class='title']//text()")]
        item["answer"] = [i.replace("\r\n","") for i in html_xml.xpath("//div[@class='answer']//text()")]
        item["right"] = html_xml.xpath("//div[@class='right']//text()")

        return item

    def save_content(self, content):
        """Append one record to the output file"""
        file_path = "job.json"
        with open(file_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(content, ensure_ascii=False, indent=2))
            f.write('\n')
        print("Saved")
    
    def run(self):
        """Main control flow"""
        for i in range(1, 531):
            # 1. Build the URL of question i (page=1 is the first question),
            #    so that every page from 1 to 530 is fetched exactly once
            next_url = self.comment_url.format(i)
            # 2. Send the request and get the page
            html = self.parse_url(next_url)
            # 3. Extract the data
            content = self.get_content_list(html)
            # 4. Save the data
            self.save_content(content)

            print(i)
            time.sleep(5)



if __name__ == '__main__':
    url = "http://ccit.91job.org.cn/contest/question"
    job = Job(url)
    job.run()
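The loop above makes 530 requests in a row with no error handling, so a single failed request would stop the whole run. One possible guard is a small retry wrapper. This is only a sketch: fetch_with_retry and its parameters are illustrative names, and the demo uses a stand-in function instead of a real request.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=5.0):
    """Call fetch(url), waiting `delay` seconds and retrying when it raises.

    `fetch` is any callable that returns the page text, for example the
    parse_url method of the class above.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            print("attempt {} failed for {}: {}".format(attempt, url, error))
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Demo with a stand-in fetch that fails twice before succeeding.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise IOError("temporary failure")
    return "<html>ok</html>"


page = fetch_with_retry(flaky_fetch, "http://example.invalid/question?page=1", delay=0)
print(page)
```

In the script, the call would look like fetch_with_retry(self.parse_url, next_url) inside run().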





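  5. After running the code above I found that the scrape was very slow. The cause turned out to be the site's slow responses, so I rewrote the script to use multiple threads, which made it much faster. The code is as follows: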
# coding=utf-8
import requests
from lxml import etree
import json
import time
import random
from queue import Queue
import threading


class Job():
    """Innovation and entrepreneurship quiz questions"""

    def __init__(self, url):
        """
        Constructor; sets up the instance
        :param url: entry URL (not used)
        """
        self.url = url  # not used
        self.session = requests.session()
        self.comment_url = "http://ccit.91job.org.cn/contest/question?page={}"
        self.USER_AGENT_LIST = [
    "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
    "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
    "Mozilla/2.02E (Win95; U)",
    "Mozilla/3.01Gold (Win95; I)",
    "Mozilla/4.8 [en] (Windows NT 5.1; U)",
    "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
    "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
    "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
]
        self.headers = {
            "User-Agent": random.choice(self.USER_AGENT_LIST),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            # Paste the Cookie header from your own logged-in browser session here.
            # It carries account details, so do not publish the real value.
            "Cookie": "PASTE_YOUR_OWN_COOKIE_HERE",

        }
        self.url_queue = Queue()
        self.html_queue = Queue()
        self.content_queue = Queue()
    
    def get_url_list(self):
        """
        Build the URLs and put them on the queue
        :return: None
        """
        for i in range(1,531):
            self.url_queue.put(self.comment_url.format(i))

    def get_url(self):
        """
        Request each URL and queue the response
        :return: None
        """
        while True:
            url = self.url_queue.get()
            print(url)
            response = self.session.get(url,headers=self.headers)
            self.html_queue.put(response.content.decode())
            self.url_queue.task_done()  # mark this item as processed (unfinished count minus one)

    def get_content_list(self):
        """
        Extract the data and format it
        :return: None
        """
        while True:
            html_xml = etree.HTML(self.html_queue.get())
            item = dict()
            item["test"] = [i.replace("\n","") for i in html_xml.xpath("//div[@class='title']//text()")]
            item["answer"] = [i.replace("\r\n","") for i in html_xml.xpath("//div[@class='answer']//text()")]
            item["right"] = html_xml.xpath("//div[@class='right']//text()")
            self.content_queue.put(item)
            self.html_queue.task_done()  # mark this item as processed (unfinished count minus one)


    def save_content(self):
        """
        Save the data
        :return: None
        """
        file_path = "job1.json"
        while True:
            with open(file_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(self.content_queue.get(), ensure_ascii=False, indent=2))
                f.write('\n')
            self.content_queue.task_done()
            print("Saved")
    
    def run(self):
        """
        Main control flow
        :return: None
        """
        # Build the URLs first, so the queue is already filled when join() is
        # called; joining a still-empty queue would return immediately
        self.get_url_list()
        thread_list = list()
        for i in range(5):
            # send requests
            get_url = threading.Thread(target=self.get_url)
            thread_list.append(get_url)
        for i in range(4):
            # extract data
            content = threading.Thread(target=self.get_content_list)
            thread_list.append(content)
        # save
        save = threading.Thread(target=self.save_content)
        thread_list.append(save)

        for t in thread_list:
            t.daemon = True  # daemon threads are not essential: they stop when the main thread ends
            t.start()
        for q in [self.url_queue, self.html_queue, self.content_queue]:
            q.join()  # block the main thread until every item in the queue has been marked done

        print("Main thread finished")
            


if __name__ == '__main__':
    url = "http://ccit.91job.org.cn/contest/question"  # not used in the end; the first question is page=1
    job = Job(url)
    job.run()
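The threaded script relies on the queue's task_done() and join() pairing together with daemon threads. The single-stage sketch below isolates that mechanism, assuming a stand-in fake_fetch instead of a real HTTP request; run_pipeline and its parameters are illustrative names, and the real script chains three such queues.

```python
import threading
from queue import Queue


def run_pipeline(urls, fetch, workers=5):
    """Fetch every URL with `workers` daemon threads and return {url: result}."""
    url_queue = Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = url_queue.get()
            try:
                value = fetch(url)
            except Exception as error:  # keep the worker alive on failures
                value = error
            with lock:
                results[url] = value
            # Every get() needs a matching task_done(), or join() never returns.
            url_queue.task_done()

    # Fill the queue before starting the workers so join() cannot
    # return early on a still-empty queue.
    for url in urls:
        url_queue.put(url)

    for _ in range(workers):
        threading.Thread(target=worker, daemon=True).start()

    url_queue.join()  # blocks until task_done() has been called once per item
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP request."""
    return "page for " + url


pages = run_pipeline(["page=1", "page=2", "page=3"], fake_fetch, workers=2)
print(sorted(pages))
```

Because the workers loop forever, they are started as daemon threads; the program ends once join() returns and the main thread finishes.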


6. Scrape results

[Screenshot: the scraped results]
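One thing to keep in mind when reading the output back: save_content appends one pretty-printed JSON object per question, so the file as a whole is a sequence of JSON documents rather than a single one, and json.load rejects it. A sketch of one way to parse it, using json.JSONDecoder.raw_decode; the helper name iter_json_objects and the sample records are illustrative.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    position = 0
    length = len(text)
    while position < length:
        # Skip the whitespace (newlines) between documents.
        while position < length and text[position].isspace():
            position += 1
        if position >= length:
            break
        value, position = decoder.raw_decode(text, position)
        yield value


# Made-up records written the same way save_content writes them.
sample = ""
for record in ({"test": ["Q1"], "right": ["A"]}, {"test": ["Q2"], "right": ["B"]}):
    sample += json.dumps(record, ensure_ascii=False, indent=2) + "\n"

records = list(iter_json_objects(sample))
print(len(records))
print(records[1]["right"])
```

For the real file, read it with open("job.json", encoding="utf-8") and pass the text to iter_json_objects. The threaded version may write records out of question order.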

If this is not detailed enough, see the notes below:
Crawler notes
