The test code in this article uses the data crawled in the previous post (link: Crawler: fetching dynamically loaded data (selenium) (a certain site)). This time the goal is to crawl the topic keywords attached to questions on a certain Q&A site.
1. Multiprocessing syntax
1.1 Syntax 1
```python
import multiprocessing
import time

def func(x):
    print(x * x)

if __name__ == '__main__':
    start = time.time()
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=func, args=(i,))
        jobs.append(p)
        p.start()
    end = time.time()
    print(end - start)
```
Screenshot below. Note that the elapsed time is printed before the squares: `p.start()` returns immediately without waiting for the child, so the parent process reaches `end = time.time()` while the children are still running.
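If the intent is for the timing to cover the children's work, each process needs to be joined before the clock is read. A minimal variant of the snippet above (same `func`, only the `join` loop added):

```python
import multiprocessing
import time

def func(x):
    print(x * x)

if __name__ == '__main__':
    start = time.time()
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=func, args=(i,))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()  # wait for every child to finish before stopping the clock
    end = time.time()
    print(end - start)
```

With the `join` loop in place, the elapsed time is printed last and actually measures how long the five workers took.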
1.2 Syntax 2
```python
from multiprocessing import Pool
import time

def func(x, y):
    print(x + y)

if __name__ == '__main__':
    pool = Pool(5)
    start = time.time()
    for i in range(100):
        pool.apply_async(func=func, args=(i, 3))
    pool.close()
    pool.join()
    end = time.time()
    print(end - start)
```
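`apply_async` also returns an `AsyncResult`, so if the worker returns a value instead of printing it, the parent can collect the results with `get()`. A small sketch (the `add` function here is my own illustration, not part of the original code):

```python
from multiprocessing import Pool

def add(x, y):
    # return instead of print, so the parent can collect the value
    return x + y

if __name__ == '__main__':
    with Pool(5) as pool:
        results = [pool.apply_async(add, args=(i, 3)) for i in range(10)]
        values = [r.get() for r in results]  # get() blocks until each result is ready
    print(values)  # → [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

Unlike the print-based version, the order of `values` is deterministic: `get()` is called on the results in submission order, regardless of which worker finished first.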
2. Practical test code
```python
import requests
from bs4 import BeautifulSoup
import time
from requests.exceptions import RequestException
from pymongo import MongoClient
from multiprocessing import Pool

client = MongoClient('localhost')
db = client['test_db']


def get_page_keyword(url, word):
    headers = {
        'cookie': '',  # replace with your own cookie
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    }
    try:
        html = requests.get(url, headers=headers, timeout=5)
        html = BeautifulSoup(html.text, "html.parser")
        key_words = html.find("div", {'class': 'QuestionPage'}).find("meta", {'itemprop': 'keywords'}).attrs['content']
        print(key_words)
        with open(r'女性話題連結.txt', 'a') as file:
            file.write(key_words + '\n')
        db[u'' + word + 'keyword'].insert_one({"link": url, "key_words": key_words, "time": time.ctime()})
    except RequestException:
        print('Request failed')


if __name__ == '__main__':
    input_word = input('Enter the topic the link file belongs to (e.g. 女性): ')
    f = open(r'女性2021-5-16-3-8.txt')  # path to the link file you crawled earlier
    lines = []
    for i in f.readlines():
        lines.append(i.strip())  # the links were saved with a trailing line terminator, so strip it
    f.close()
    # multiprocessing test
    pool = Pool(2)  # a bigger pool is faster, but my machine has two cores, and too many workers soon triggers the site's "account anomaly" warning
    start = time.time()
    for link in lines:
        pool.apply_async(func=get_page_keyword, args=(link, input_word))
    pool.close()
    pool.join()
    end = time.time()
    print(end - start)
```
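One fragile spot in `get_page_keyword` is the chained `find(...).find(...)`: if the site returns an unexpected page (for example the "account anomaly" page mentioned in the comment), the first `find` returns `None` and the chain raises `AttributeError`, which the `except RequestException` clause does not catch, so the worker dies silently. A defensive extraction helper might look like this (`extract_keywords` is a name I am introducing for illustration, not part of the original code):

```python
from bs4 import BeautifulSoup

def extract_keywords(page_text):
    """Return the question's keyword string, or None when the page
    lacks the expected QuestionPage/meta structure (e.g. the site
    served an account-anomaly or login page instead)."""
    soup = BeautifulSoup(page_text, "html.parser")
    container = soup.find("div", {'class': 'QuestionPage'})
    if container is None:
        return None
    meta = container.find("meta", {'itemprop': 'keywords'})
    if meta is None:
        return None
    return meta.attrs.get('content')

sample = '<div class="QuestionPage"><meta itemprop="keywords" content="女性,心理"></div>'
print(extract_keywords(sample))  # → 女性,心理
```

In `get_page_keyword`, a `None` return can then be logged and skipped instead of crashing the worker mid-crawl.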
Screenshot: this is from an earlier run; I am not going to rerun the script.