Scrapy: scraping returns an empty result
DEBUG: Redirecting (301) to <GET https://edu.csdn.net/> from <GET http://edu.csdn.net>
import scrapy


class S1Spider(scrapy.Spider):
    name = 's1'  # Name of the spider
    allowed_domains = ['blog.csdn.net']  # Requests whose URL host is outside allowed_domains are filtered out
    start_urls = ['http://blog.csdn.net/']  # URLs requested when the spider starts

    def parse(self, response):
        # Called with the response for each URL in start_urls; response is the response object
        print('-' * 90)
        # print(response.body)
        # print('-' * 40)
        # Run the XPath query against the response
        print(response.xpath('//div[@class="nav_com"]/ul/li/a/text()').extract())
        print('-' * 90)
        # print('-' * 40)

    # Method run when the spider starts; takes the place of start_urls
    # def start_requests(self):  # Send a Request object to the scheduler
    #     yield scrapy.Request(
    #         url=('http://edu.csdn.net'),  # Request URL; the method defaults to GET
    #         callback=self.parse2,  # Function called once the response arrives
    #     )

    # def parse2(self, response):  # Function called once the response arrives
    #     print(response.xpath('//div[@id="nav_com"]//li/a/text()'))  # Returns a SelectorList; call .extract() to get strings
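The comment on allowed_domains describes a host check. A simplified sketch of that kind of check is below; it is an illustration only, not Scrapy's own implementation, and the helper name host_is_allowed is made up for this example. It shows that the edu.csdn.net URL in the commented-out start_requests does not fall under blog.csdn.net.

```python
from urllib.parse import urlparse


def host_is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain.

    Hypothetical helper for illustration; Scrapy's real check may differ in details.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


allowed = ["blog.csdn.net"]
print(host_is_allowed("https://blog.csdn.net/", allowed))  # True
print(host_is_allowed("http://edu.csdn.net", allowed))     # False
```

If the edu.csdn.net page is the real target, adding that host (or the parent domain csdn.net) to allowed_domains would probably be needed for links followed from it.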
D:\爬蟲\scrapy_spider1\myscrapy1>scrapy crawl s1
2020-10-03 08:25:48 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: myscrapy1)
2020-10-03 08:25:48 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-03 08:25:48 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-03 08:25:48 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'myscrapy1',
'NEWSPIDER_MODULE': 'myscrapy1.spiders',
'SPIDER_MODULES': ['myscrapy1.spiders'],
'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like '
'Gecko) Chrome/41.0.2228.0 Safari/537.36'}
2020-10-03 08:25:49 [scrapy.extensions.telnet] INFO: Telnet Password: b9d51701061fdac5
2020-10-03 08:25:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-10-03 08:25:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-03 08:25:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-03 08:25:49 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-03 08:25:49 [scrapy.core.engine] INFO: Spider opened
2020-10-03 08:25:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-03 08:25:49 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-03 08:25:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://blog.csdn.net/> from <GET http://blog.csdn.net/>
2020-10-03 08:25:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://blog.csdn.net/> (referer: None)
------------------------------------------------------------------------------------------
[]
------------------------------------------------------------------------------------------
2020-10-03 08:25:50 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-03 08:25:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 558,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 14583,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 0.527588,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 3, 0, 25, 50, 372656),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 10, 3, 0, 25, 49, 845068)}
2020-10-03 08:25:50 [scrapy.core.engine] INFO: Spider closed (finished)
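The page was fetched (200), yet the XPath printed []. One way to narrow this down is to check whether "nav_com" appears in the downloaded HTML at all, and on which attribute. The spider queries @class="nav_com" while the commented-out parse2 queries @id="nav_com", so the two are worth comparing. The sketch below is a rough diagnostic under stated assumptions: find_attribute_usage is a hypothetical helper, the regex only handles quoted attribute values, and the sample markup is invented, not the real CSDN page.

```python
import re


def find_attribute_usage(html, value):
    """Count which attributes (class, id, ...) carry `value` in raw HTML.

    An empty dict suggests the markup is not in the downloaded page at all,
    for example because it is built later by JavaScript.
    """
    pattern = re.compile(r'([\w-]+)\s*=\s*["\']([^"\']*)["\']')
    usage = {}
    for attr, attr_value in pattern.findall(html):
        # Split on whitespace so multi-valued attributes such as class="a b" are handled
        if value in attr_value.split():
            usage[attr] = usage.get(attr, 0) + 1
    return usage


# Invented sample markup for illustration only
sample = '<div id="nav_com"><ul><li><a href="/a">Home</a></li></ul></div>'
print(find_attribute_usage(sample, "nav_com"))  # {'id': 1}
print(find_attribute_usage(sample, "missing"))  # {}
```

Inside parse, the same helper could be called as find_attribute_usage(response.text, "nav_com"). A result keyed by id rather than class would point to the predicate in the XPath; an empty result would point to content that is absent from the raw response.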