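Scraping the 91job Employment Knowledge Competition Question Bank
Recently there was a university-student employment knowledge competition (you need a student account to log in). If you read the question bank directly on the website, it is hard to memorize, so I scraped the question bank with a crawler and could then simply look things up while practicing. I have some free time today, so here is a write-up of how I scraped it. For learning purposes only.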
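- In the browser's network panel, find the request whose response carries the data. As the screenshot shows, it is the question request, and its response contains the data we want.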
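- Once you have located that response, look at the request URL and the headers sent with the request (see the screenshot). Because logging in requires a school student account, the cookie in those headers certainly carries the account information, so the code must send the cookie as well.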
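The Cookie value copied from the browser is one long "name=value; name=value" string. It can be sent as-is in the headers, which is what the scripts in this post do, or split into a dict and passed through the `cookies=` parameter of `requests.get`. A minimal sketch of the split, assuming a made-up cookie string (the helper name `cookie_header_to_dict` and the sample values are hypothetical):

```python
def cookie_header_to_dict(header):
    """Split a raw Cookie header string into a {name: value} dict."""
    cookies = {}
    for pair in header.split(";"):
        # partition keeps any further "=" characters inside the value
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Hypothetical cookie string, not a real session
sample = "PHPSESSID2=abc123; __51cke__=; token=a=b"
print(cookie_header_to_dict(sample))
```

Either way, the cookie is a login credential, so it should come from your own session and should not be published together with the code.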
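- With the request URL in hand, we can work out where in the page the data we need lives (this takes some XPath to locate). We need four things: 1. the question, 2. the question's options, 3. the correct answer, 4. the link to the next question. The screenshot shows exactly the elements we are looking for.
After looking at a few questions, it is clear that the next-question URL follows a pattern, so there is no need to find its XPath.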
XPath for the question: "//div[@class='title']//text()"
XPath for the options: "//div[@class='answer']//text()"
XPath for the correct answer: "//div[@class='right']//text()"
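To see what these three expressions select, here is a small sketch that runs equivalent queries on a made-up page. Two assumptions to keep in mind: the sample markup is hypothetical (the real page layout is only known from the class names above), and the standard-library `xml.etree.ElementTree` is swapped in for lxml so the example runs without extra packages. ElementTree only accepts well-formed markup and a subset of XPath, which is why the scripts in this post use lxml's `etree.HTML` on the real pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one question page
SAMPLE_PAGE = """
<html><body>
  <div class="title">1. Sample question text?</div>
  <div class="answer"><p>A. first option</p><p>B. second option</p></div>
  <div class="right">Correct answer: B</div>
</body></html>
""".strip()


def texts(root, css_class):
    """Collect the non-empty text pieces under every div with this class."""
    found = []
    for div in root.iterfind(f".//div[@class='{css_class}']"):
        for piece in div.itertext():
            piece = piece.strip()
            if piece:
                found.append(piece)
    return found


def extract_question(page):
    """Return the same three fields the scraper stores for each question."""
    root = ET.fromstring(page)
    return {
        "test": texts(root, "title"),
        "answer": texts(root, "answer"),
        "right": texts(root, "right"),
    }


question = extract_question(SAMPLE_PAGE)
print(question)
```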
URL of the next question: http://ccit.91job.org.cn/contest/question?page=i Note: the i in page=i is a number, and that number is the index of the question; there are 530 questions in total.
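Since the page number is the only part that changes, the full list of question URLs can be generated up front. A minimal sketch, using the URL pattern and the count of 530 stated above (the names `BASE` and `question_urls` are only illustrative):

```python
BASE = "http://ccit.91job.org.cn/contest/question?page={}"
TOTAL_QUESTIONS = 530  # number of questions, as stated in the post


def question_urls(total=TOTAL_QUESTIONS):
    """Build one URL per question, for page=1 up to page=total."""
    return [BASE.format(page) for page in range(1, total + 1)]


urls = question_urls()
print(urls[0])
print(urls[-1])
print(len(urls))
```

Note that `range(1, total + 1)` is needed to include the last page, because the upper bound of `range` is exclusive.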
- Writing the code
# coding=utf-8
import requests
from lxml import etree
import json
import time
import random


class Job():
    """Innovation and entrepreneurship quiz questions"""

    def __init__(self, url):
        self.url = url
        self.comment_url = "http://ccit.91job.org.cn/contest/question?page={}"  # URL template for each question page
        self.USER_AGENT_LIST = [
"Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
"Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
"Mozilla/2.02E (Win95; U)",
"Mozilla/3.01Gold (Win95; I)",
"Mozilla/4.8 [en] (Windows NT 5.1; U)",
"Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
"HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
"Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        ]  # lets the program pose as different devices when visiting the site
        self.headers = {
            "User-Agent": random.choice(self.USER_AGENT_LIST),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Cookie": "<Cookie header copied from your own logged-in browser session>",
        }  # Accept, Accept-Encoding and Accept-Language are optional; User-Agent and Cookie are enough

    def parse_url(self, url):
        """Send the request and return the page source"""
        html = requests.get(url, headers=self.headers)
        return html.content.decode()

    def get_content_list(self, html):
        """Extract the data"""
        html_xml = etree.HTML(html)
        item = dict()
        item["test"] = [i.replace("\n", "") for i in html_xml.xpath("//div[@class='title']//text()")]
        item["answer"] = [i.replace("\r\n", "") for i in html_xml.xpath("//div[@class='answer']//text()")]
        item["right"] = html_xml.xpath("//div[@class='right']//text()")
        return item

    def save_content(self, content):
        """Save the data"""
        file_path = "job.json"
        with open(file_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(content, ensure_ascii=False, indent=2))
            f.write('\n')
        print("Saved")

    def run(self):
        """Main control flow"""
        for i in range(1, 531):
            # 1. Build the URL of question i (page=1 is the first question)
            next_url = self.comment_url.format(i)
            # 2. Send the request and get the page
            html = self.parse_url(next_url)
            # 3. Extract the data
            content = self.get_content_list(html)
            # 4. Save the data
            self.save_content(content)
            print(i)
            time.sleep(5)


if __name__ == '__main__':
    url = "http://ccit.91job.org.cn/contest/question"  # kept for reference; the loop builds every URL from page=1
    job = Job(url)
    job.run()
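One detail of the script above: the User-Agent is picked once, when the headers dict is created, so every request in a run sends the same value. If the goal is to vary it from request to request, the headers can be rebuilt before each call. A minimal sketch, with a shortened list and a hypothetical helper named `build_headers`:

```python
import random

# Two entries taken from the list above; the full list works the same way
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
]


def build_headers(cookie):
    """Return fresh headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Cookie": cookie,
    }


# Placeholder cookie value, not a real session
headers = build_headers("PHPSESSID2=example")
print(headers["User-Agent"])
```

In the scraper this would be used as `requests.get(url, headers=build_headers(cookie))` inside `parse_url`. Whether the site looks at the User-Agent at all is not something this post establishes, so treat this as optional.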
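- With the code above, scraping turned out to be very slow. I looked into it and found that the site itself responds very slowly, so I rewrote the code to use multiple threads, which sped things up a lot. The code is as follows: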
# coding=utf-8
import requests
from lxml import etree
import json
import time
import random
from queue import Queue
import threading


class Job():
    """Innovation and entrepreneurship quiz questions"""

    def __init__(self, url):
        """
        Constructor; sets up the instance
        :param url:
        """
        self.url = url  # not used
        self.session = requests.session()
        self.comment_url = "http://ccit.91job.org.cn/contest/question?page={}"
        self.USER_AGENT_LIST = [
"Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
"Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
"Mozilla/2.02E (Win95; U)",
"Mozilla/3.01Gold (Win95; I)",
"Mozilla/4.8 [en] (Windows NT 5.1; U)",
"Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
"HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
"Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        ]
        self.headers = {
            "User-Agent": random.choice(self.USER_AGENT_LIST),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Cookie": "<Cookie header copied from your own logged-in browser session>",
        }
        self.url_queue = Queue()
        self.html_queue = Queue()
        self.content_queue = Queue()

    def get_url_list(self):
        """
        Build the URLs and put them on the queue
        :return: None
        """
        for i in range(1, 531):
            self.url_queue.put(self.comment_url.format(i))

    def get_url(self):
        """
        Request each URL and queue the response
        :return: None
        """
        while True:
            url = self.url_queue.get()
            print(url)
            response = self.session.get(url, headers=self.headers)
            self.html_queue.put(response.content.decode())
            self.url_queue.task_done()  # one fewer unfinished task on this queue

    def get_content_list(self):
        """
        Extract the data and format it
        :return: None
        """
        while True:
            html_xml = etree.HTML(self.html_queue.get())
            item = dict()
            item["test"] = [i.replace("\n", "") for i in html_xml.xpath("//div[@class='title']//text()")]
            item["answer"] = [i.replace("\r\n", "") for i in html_xml.xpath("//div[@class='answer']//text()")]
            item["right"] = html_xml.xpath("//div[@class='right']//text()")
            self.content_queue.put(item)
            self.html_queue.task_done()  # one fewer unfinished task on this queue

    def save_content(self):
        """
        Save the data
        :return: None
        """
        file_path = "job1.json"
        while True:
            content = self.content_queue.get()
            with open(file_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(content, ensure_ascii=False, indent=2))
                f.write('\n')
            self.content_queue.task_done()
            print("Saved")

    def run(self):
        """
        Main control flow
        :return: None
        """
        # Build the URLs first, so the queue is already filled when q.join() is called below
        self.get_url_list()
        thread_list = list()
        for i in range(5):
            # send the requests
            get_url = threading.Thread(target=self.get_url)
            thread_list.append(get_url)
        for i in range(4):
            # extract the data
            content = threading.Thread(target=self.get_content_list)
            thread_list.append(content)
        # save
        save = threading.Thread(target=self.save_content)
        thread_list.append(save)
        for t in thread_list:
            t.daemon = True  # daemon threads: they are stopped when the main thread ends
            t.start()
        for q in [self.url_queue, self.html_queue, self.content_queue]:
            q.join()  # block the main thread until every task on the queue is marked done
        print("Main thread finished")


if __name__ == '__main__':
    url = "http://ccit.91job.org.cn/contest/question"  # not used in the end; the first question is page=1
    job = Job(url)
    job.run()
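The core of the multi-threaded version is the pairing of `Queue.get()` with `task_done()`, plus `join()` in the main thread. The pattern is easier to see without the network. Below is a minimal sketch with a stand-in `fake_fetch` function (hypothetical, it only wraps the URL in a string) and two queues instead of three. It fills the URL queue before any worker starts, so `join()` cannot return while the queue is still empty.

```python
import threading
from queue import Queue


def fake_fetch(url):
    """Stand-in for the network request (hypothetical)."""
    return f"<html>{url}</html>"


def run_pipeline(urls, workers=3):
    """Push urls through a fetch stage and a collect stage using daemon threads."""
    url_queue, html_queue = Queue(), Queue()
    results = []
    lock = threading.Lock()

    def fetch_worker():
        while True:
            url = url_queue.get()
            try:
                # put the result on the next queue BEFORE marking this task done
                html_queue.put(fake_fetch(url))
            finally:
                url_queue.task_done()

    def collect_worker():
        while True:
            html = html_queue.get()
            try:
                with lock:
                    results.append(html)
            finally:
                html_queue.task_done()

    # Fill the first queue before starting any worker
    for url in urls:
        url_queue.put(url)

    for target in [fetch_worker] * workers + [collect_worker]:
        threading.Thread(target=target, daemon=True).start()

    # join() returns once every item that was put() has been task_done()
    url_queue.join()
    html_queue.join()
    return results


pages = run_pipeline([f"u{i}" for i in range(10)])
print(len(pages))
```

Because several fetch workers run at once, results arrive in no particular order; the same applies to the order of questions in the output file of the multi-threaded scraper.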
- Crawl results
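Both scripts append one pretty-printed JSON object per question, so the output file is a sequence of JSON objects rather than a single JSON document, and `json.load` on the whole file would fail. A minimal sketch of reading such a file back, using `json.JSONDecoder.raw_decode` to take one object at a time (the helper name `iter_json_objects` and the sample data are hypothetical):

```python
import json


def iter_json_objects(text):
    """Yield each JSON object from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # skip the whitespace and newlines between objects
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample in the same shape the scripts write
sample = (
    json.dumps({"test": ["Q1"], "right": ["A"]}, ensure_ascii=False, indent=2) + "\n"
    + json.dumps({"test": ["Q2"], "right": ["B"]}, ensure_ascii=False, indent=2) + "\n"
)
questions = list(iter_json_objects(sample))
print(len(questions))
```

For the real file, read it with `open("job.json", encoding="utf-8").read()` and pass the text to `iter_json_objects`.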
If this is not detailed enough, see the notes below:
Crawler notes