Recursively traversing all URLs on a website

Posted by lyyyyyyy on 2020-12-31

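I want to write a script that walks every URL under a given domain and checks whether any of them returns an abnormal status code. I've hit a problem: the content returned by requests contains no a tags.
In the browser, all of the page's content sits inside this div, but in the data returned by requests that div is empty.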
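One likely explanation (an assumption, since the post does not show the page source) is that the page is rendered client-side: the server sends an empty container div plus a script, and requests never executes that script. The sketch below uses only the standard library to count the a tags in a piece of static HTML. `shell_html` and `rendered_html` are made-up samples for illustration, not the real site's markup.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1


def count_anchors(html):
    """Return how many <a href=...> tags appear in the given HTML text."""
    parser = AnchorCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical static HTML of a client-side-rendered page:
# an empty container and a script that would fill it in the browser.
shell_html = (
    '<html><body><div id="app"></div>'
    '<script src="/static/app.js"></script></body></html>'
)
# Hypothetical server-rendered page, for comparison.
rendered_html = (
    '<html><body><div id="app"><a href="/about">About</a></div></body></html>'
)

print("shell:", count_anchors(shell_html))
print("rendered:", count_anchors(rendered_html))
```

If the same check on `r.text` gives zero while the browser shows links, the links are being added after page load. In that case the script would need a rendered DOM (for example from a headless browser) or whatever data endpoint the page itself calls.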

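import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
}
BASE_URL = "https://www.mxc.ai/"
resource_list = set()  # URLs that have already been checked


def get_urls(url):
    # Pass the headers that were defined above but never used
    r = requests.get(url, headers=headers, timeout=10)
    print(url)
    print(r.text)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Only consider <a> tags that actually have an href attribute
    for a in soup.find_all("a", href=True):
        href = urljoin(url, a['href'])  # resolve relative links against the current page
        if href in resource_list:
            continue
        resource_list.add(href)  # mark as seen before fetching, so cycles cannot loop forever
        try:
            status_code = requests.get(href, headers=headers, timeout=10).status_code
        except requests.RequestException as exc:
            # Report the failure instead of silently swallowing it
            print(href, exc)
            continue
        if status_code != 200:  # print the URL if the status code is wrong
            print(href, status_code)
        elif href.startswith(BASE_URL):  # only recurse into URLs on this domain
            get_urls(href)


resource_list.add(BASE_URL)
get_urls(BASE_URL)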
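For a large site, recursion can run into Python's recursion limit, and every in-domain page ends up being requested twice (once for the status check, once for parsing). Below is a minimal sketch of an alternative, not the original poster's method: a breadth-first crawl with a queue and a seen-set. The names `crawl`, `extract_links`, `fake_fetch`, and the `example.test` pages are all hypothetical, and `fetch` is passed in as a parameter so the logic can be tried without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, with #fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=500):
    """Breadth-first crawl. fetch(url) must return (status_code, html_text).

    Returns {url: status_code} for every fetched URL whose status is not 200.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    bad = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        status, html = fetch(url)  # each URL is fetched exactly once
        fetched += 1
        if status != 200:
            bad[url] = status
            continue
        if urlparse(url).netloc != host:
            continue  # external links are checked but not crawled further
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return bad


# Hypothetical in-memory "site" used in place of real HTTP requests.
pages = {
    "https://example.test/": (200, '<a href="/a">a</a><a href="/missing">m</a>'
                                   '<a href="https://other.test/x">x</a><a href="#top">t</a>'),
    "https://example.test/a": (200, '<a href="/">home</a><a href="mailto:x@y.z">mail</a>'),
    "https://example.test/missing": (404, ""),
    "https://other.test/x": (200, '<a href="https://other.test/y">y</a>'),
}
calls = []


def fake_fetch(url):
    calls.append(url)
    return pages.get(url, (404, ""))


bad = crawl("https://example.test/", fake_fetch)
print(bad)
```

To run it against a real site, `fetch` could be a small wrapper that calls `requests.get(url, headers=headers, timeout=10)` and returns `(r.status_code, r.text)`. It still only sees links present in the static HTML.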
