Scraping GitHub: content behind the login (two methods)

weixin_34148340, published 2018-11-10

Original URL: https://blog.csdn.net/weixin_34148340/article/details/88296124

GitHub

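Note: Scrapy can send POST requests too. Rather than building the request body by hand with scrapy.Request, use scrapy.FormRequest, which form-encodes the data for you.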

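1. Open GitHub, go to https://github.com/login to reach the login page, and open the browser's developer tools (Inspect).

Analysis: in the Network tab, the form data sent with the login request contains commit, utf8, authenticity_token, login, and password.

(Screenshot: 3.png)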
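To make the request body concrete, here is a minimal sketch of how those five fields end up in a form-encoded POST body, using only the standard library. All field values are placeholders chosen for illustration (the real token comes from the login page), and the snippet only illustrates the encoding; it is not a description of Scrapy's internals.

```python
from urllib.parse import urlencode, parse_qs

# Placeholder values for illustration; the real token is read from the page.
form_fields = {
    "commit": "Sign in",
    "utf8": "✓",
    "authenticity_token": "TOKEN_FROM_PAGE",
    "login": "your_username",
    "password": "your_password",
}

# Build an application/x-www-form-urlencoded body from the fields.
body = urlencode(form_fields)
print(body)

# Decoding the body gives back the original fields.
decoded = {key: values[0] for key, values in parse_qs(body).items()}
print(decoded == form_fields)
```

Passing the same dictionary as formdata to scrapy.FormRequest should produce an equivalent body without the manual encoding step.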

2. On the GitHub login page, find where commit, utf8, and authenticity_token appear in the Elements panel, and write an XPath expression for each.

(Screenshot: 2.png)

The screenshot above shows where authenticity_token is located:

//input[@name='authenticity_token']/@value

(Screenshot: 1.png)

The screenshot above shows where utf8 is located:

//input[@name='utf8']/@value

Likewise, the location of commit:

//input[@name='commit']/@value
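The three expressions can be tried offline before running a spider. The sketch below uses the standard library's ElementTree against a simplified, hypothetical copy of the login form (the real page has far more markup). ElementTree supports only a subset of XPath, so the trailing /@value step is replaced by a call to .get("value"); input_value is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified login form used only for this example.
SAMPLE_FORM = """
<form action="/session" method="post">
  <input type="hidden" name="utf8" value="✓" />
  <input type="hidden" name="authenticity_token" value="TOKEN_FROM_PAGE" />
  <input type="text" name="login" />
  <input type="password" name="password" />
  <input type="submit" name="commit" value="Sign in" />
</form>
"""

root = ET.fromstring(SAMPLE_FORM)


def input_value(name):
    """Return the value attribute of the <input> with this name, or None."""
    node = root.find(f".//input[@name='{name}']")
    return None if node is None else node.get("value")


for field in ("authenticity_token", "utf8", "commit"):
    print(field, "=", input_value(field))
```

In the spider itself, response.xpath(...).extract_first() plays the same role and likewise returns None when a field is not present on the page.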

Source code of the login spider:

# -*- coding: utf-8 -*-
import re
import scrapy


class GhSpider(scrapy.Spider):
    name = 'gh'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # Read the hidden form fields from the login page.
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()
        post_data = dict(
            commit=commit,
            utf8=utf8,
            authenticity_token=authenticity_token,
            login="your_username",
            password="your_github_password",
        )
        # Drop fields that were not found on the page; FormRequest cannot encode None.
        post_data = {k: v for k, v in post_data.items() if v is not None}
        # POST the form to the session endpoint.
        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata=post_data,
            callback=self.after_login,
        )

    def after_login(self, response):
        # "ycw" is presumably the author's username; replace it with your own.
        print(re.findall("ycw", response.text))
        # Save the page so the result can be inspected in a browser.
        with open("a.html", "w", encoding="utf-8") as f:
            f.write(response.text)


if __name__ == '__main__':
    # Allows the spider to be started by running this file directly.
    from scrapy import cmdline
    cmdline.execute("scrapy crawl gh".split())

To extract data from the logged-in page, or to process it in any other way, continue writing code in after_login().

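Even when the password is wrong, the regular expression can still match part of the username, so a match alone does not prove that the login succeeded. Save the HTML and inspect it to check the result:

with open("a.html", "w", encoding="utf-8") as f:
    f.write(response.body.decode())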
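Because a bare match on the username is weak evidence, one option is to combine it with a second signal. The sketch below is only a heuristic, under the assumption that a failed login returns a page that still contains the password field; looks_logged_in is a hypothetical helper and both HTML samples are made up for the example.

```python
import re


def looks_logged_in(html, username):
    """Heuristic: the username appears and no password input remains."""
    has_username = re.search(re.escape(username), html) is not None
    has_password_field = (
        re.search(r'<input[^>]*name=["\']password["\']', html) is not None
    )
    return has_username and not has_password_field


# Made-up pages: a login form echoing the username, and a logged-in header.
failed_page = (
    '<form><input name="login" value="ycw">'
    '<input type="password" name="password"></form>'
)
success_page = '<header><a href="/ycw">ycw</a></header>'

print(looks_logged_in(failed_page, "ycw"))   # False
print(looks_logged_in(success_page, "ycw"))  # True
```

Inside after_login, the same check could be applied to response.text before any further parsing; the saved a.html remains the most direct way to confirm what the server actually returned.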

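Method 2:

When the login page contains only one form, scrapy.FormRequest.from_response can locate the form in the response automatically and pre-fill its existing fields, so only login and password need to be supplied:

yield scrapy.FormRequest.from_response(
            response,   # the form is located automatically in the response
            # formdata and callback are passed here as well
)

(Screenshot: 1.png)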
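The idea behind from_response can be illustrated with the standard library alone: collect the fields already present in the form, then overlay the values you supply. This is a rough sketch of the concept, not Scrapy's implementation (which also handles selects, textareas, and submit buttons more carefully); FirstFormInputs, build_form_data, and the sample page are all hypothetical names and data.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FirstFormInputs(HTMLParser):
    """Collect name/value pairs of <input> tags inside the first <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.done = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "form" and not self.done:
            self.in_form = True
        elif tag == "input" and self.in_form:
            attributes = dict(attrs)
            if attributes.get("name"):
                self.fields[attributes["name"]] = attributes.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self.in_form:
            self.in_form = False
            self.done = True  # ignore any later forms


def build_form_data(html, overrides):
    """Return the first form's fields with the caller's values applied."""
    parser = FirstFormInputs()
    parser.feed(html)
    data = dict(parser.fields)
    data.update(overrides)
    return data


# Simplified, made-up login page.
SAMPLE_PAGE = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="TOKEN_FROM_PAGE">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
"""

data = build_form_data(
    SAMPLE_PAGE, {"login": "your_username", "password": "your_password"}
)
print(urlencode(data))
```

When a page contains several forms, Scrapy's from_response accepts arguments such as formnumber, formid, formname, and formxpath to choose which one to submit.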

Generate the spider inside the existing login1 project:
D:\scrapy_1>cd login1
D:\scrapy_1\login1>scrapy genspider gt2 github.com

# -*- coding: utf-8 -*-
import re
import scrapy


class Gt2Spider(scrapy.Spider):
    name = 'gt2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response,   # the form is located automatically in the response
            formdata={"login": "your_username", "password": "your_github_password"},
            callback=self.after_login
        )

    def after_login(self, response):
        # "ycw" is presumably the author's username; replace it with your own.
        print(re.findall("ycw", response.body.decode()))

Run:

scrapy crawl gt2

Result of running the code