Scraping Jay Chou's Instagram with Python

Posted by 冰山劉 on 2018-07-08

Original URL: https://flycode.co/archives/169486

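Overall architecture

An app engine hosted abroad but reachable from mainland China scrapes Jay's Instagram and displays the result. A domestic SAE (Sina App Engine) instance then visits that site, scrapes it a second time, and posts the content to a secondary Weibo account.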

bs4

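When fetching Instagram with requests I did not set any request headers. Instagram is fairly friendly to robots and returns the page data directly, without requiring JS to run. bs4 quickly locates the photo and video data. ~~A regular expression then extracts the links, which are downloaded and displayed.~~ Regular expressions give me a headache, so I use str.split(' ') to turn the string into a list instead.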

from bs4 import BeautifulSoup

# Raw string so the backslashes in the Windows path are kept as-is
filepath = r'C:\Users\hndx\Desktop\ins.html'
with open(filepath) as f:
    soup = BeautifulSoup(f, 'lxml')

script = soup.select('script')

print(str(script[2]).split('},{"node":')[1])

Inspecting ins.html shows that each node is one post. That makes str(script[2]).split('},{"node":')[1] the Unicode string holding Jay's latest post. The json module can convert it directly into a dict, as follows:

import json
# info is the string produced by the split in the previous step
i = json.loads(info)
print(i["edge_media_to_caption"]['edges'])
"""
[{u'node': {u'text': u'Just finished now \u8b1d\u8b1d\u91d1\u83ef\u7684\u670b\u53cb\u5011 high\u7684\u4e0d\u8981\u4e0d\u8981\u7684 #\u91d1\u83ef #\u96d9\u7bc0\u68cd#\u96d9\u622a\u68cd'}}]
"""

Saving images

Reference: "How to download images elegantly with requests?" This is probably the most concise answer.

import requests
s = requests.session()
ss = s.get('https://www.baidu.com/img/bd_logo1.png')
# Write the raw response bytes; the with block closes the file afterwards
with open('logo.png', 'wb') as f:
    f.write(ss.content)
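
For larger files it can help to stream the response instead of holding it all in memory, and to fail on HTTP errors rather than save an error page as an image. The following is only a sketch under assumptions: `download` is a hypothetical helper name, the 10-second timeout and 8192-byte chunk size are arbitrary choices, and `session` is expected to behave like a `requests.Session`.

```python
def download(session, url, path, chunk_size=8192):
    """Stream `url` to `path` using a requests-style session.

    `session` is assumed to offer get(url, stream=..., timeout=...) and to
    return a response with raise_for_status() and iter_content(), as
    requests.Session does.
    """
    resp = session.get(url, stream=True, timeout=10)
    resp.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    with open(path, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            fh.write(chunk)
    return path
```

With requests installed, a call would look like `download(requests.Session(), 'https://www.baidu.com/img/bd_logo1.png', 'logo.png')`.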

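Database ORM: Flask-SQLAlchemy

For learning, see the official Flask-SQLAlchemy documentation.
For create, read, update, and delete operations, see "Flask SQLAlchemy database operations".
Database model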

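class photo(db.Model):  # db is the SQLAlchemy instance defined in flask_app
    __tablename__ = "photoid"

    # primary_key=True marks a column whose value must be unique;
    # the model needs at least one primary_key=True column.
    id = db.Column(db.Integer)
    url = db.Column(db.String(4096), primary_key=True)
    text = db.Column(db.String(4096))

    # text is optional so that two-argument calls such as photo(id, url) work
    def __init__(self, id, url, text=None):
        self.id = id
        self.url = url
        self.text = text  # three columns in total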

Creating and querying


In [1]: import flask_app

In [2]: con = ['1','2']

In [3]: flask_app.photo(con[0],con[1])
Out[3]: <flask_app.photo at 0x7fa109b5ccd0>

In [4]: flask_app.db.session.add(flask_app.photo(con[0],con[1]))

In [5]: flask_app.db.session.commit()

In [6]: flask_app.photo.query.filter_by(id='1').first()
Out[6]: <flask_app.photo at 0x7fa11179e890>

In [7]: p1 =flask_app.photo.query.filter_by(id='1').first()

In [8]: p1.id
Out[8]: 1

In [9]: p1.url
Out[9]: u'2'

MySQL statements used

DROP TABLE table_name ;

Deletes the table.
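
The table layout and the DROP TABLE statement can be tried out without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module and an in-memory database, so it is SQLite rather than MySQL and the column types are approximations of the model above; the `IF EXISTS` clause is an addition that avoids an error when the table is already gone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk

# Approximation of the photoid table: url is the primary key, as in the model
conn.execute("CREATE TABLE photoid (id INTEGER, url TEXT PRIMARY KEY, text TEXT)")
conn.execute(
    "INSERT INTO photoid (id, url, text) VALUES (?, ?, ?)", (1, "2", None)
)
conn.commit()

# Look up the stored row by id, similar to filter_by(id=...).first()
row = conn.execute("SELECT id, url FROM photoid WHERE id = ?", (1,)).fetchone()
print(row)

# Drop the table, then confirm it is no longer listed in the schema
conn.execute("DROP TABLE IF EXISTS photoid")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'photoid'"
).fetchall()
print(tables)
```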

insta.py

# -*- coding: utf-8 -*-
"""
insta scraper
Created on Fri May 04 09:02:26 2018

@author: aubucuo
"""
import re
from json import loads

import requests
from bs4 import BeautifulSoup

s = requests.session()
u = 'https://www.instagram.com/jaychou/'

def ins(pid):
    """Return [True, id, url, text] for a new photo post, otherwise [False]."""
    rt = []
    c1 = s.get(u)
    soup = BeautifulSoup(c1.content, 'lxml')
    script = soup.select('script')

    ls = script[2].contents
    # Note: the non-greedy (.+?) stops at the first ';' after the assignment
    ls1 = re.findall(r'window\._sharedData = (.+?);', str(ls[0]))
    js = loads(ls1[0])['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']
    # js[0] is the newest post on the profile
    j_id = js[0]['node']['id']
    is_video = js[0]['node']['is_video']
    j_url = js[0]['node']['display_url']
    j_text = js[0]['node']['edge_media_to_caption']['edges'][0]['node']['text']

    if j_id != pid and not is_video:  # the id is new and the post is not a video
        rt.append(True)
        rt.append(j_id)
        rt.append(j_url)
        rt.append(j_text)

        # Download the image and overwrite the single cached copy
        c2 = s.get(j_url)
        with open('mysite/static/jay.jpg', 'wb') as f:
            f.write(c2.content)
        return rt
    else:
        rt.append(False)
        return rt

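Here pid is the id of the newest image from the previous run; it is used to decide whether there is anything new. In practice bs4 did not help me much here. ~~The program above only saves the latest image (jay.jpg). In testing it always saved the second image, possibly because of the regular expression. It does not affect the functionality, so I did not dig any further.~~
The json result is then indexed level by level, N times, to pin down each value precisely (note the js variable).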
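
One way to make this step less fragile is sketched below, assuming the page still assigns a JSON object to `window._sharedData`. `json.JSONDecoder.raw_decode` parses a single JSON value and ignores whatever follows it, so a `;` inside the data cannot cut the match short, and a small lookup helper returns a default when a post has no caption. The names `extract_shared_data` and `dig` and the sample data are hypothetical, made up for illustration; real Instagram markup may differ.

```python
import json
import re

def extract_shared_data(html):
    """Return the object assigned to window._sharedData, or None if absent."""
    match = re.search(r'window\._sharedData\s*=\s*', html)
    if match is None:
        return None
    # raw_decode reads one JSON value starting at the given index and
    # ignores the trailing text (here ";</script>").
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data

def dig(data, *path, default=None):
    """Follow keys and list indexes through nested dicts and lists."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Hypothetical sample: one post without a caption, with ';' inside the URL
sample_data = {"entry_data": {"ProfilePage": [{"graphql": {"user": {
    "edge_owner_to_timeline_media": {"edges": [{"node": {
        "id": "1",
        "is_video": False,
        "display_url": "https://example.invalid/a;b.jpg",
        "edge_media_to_caption": {"edges": []},
    }}]}}}}]}}
SAMPLE_HTML = "<script>window._sharedData = " + json.dumps(sample_data) + ";</script>"

shared = extract_shared_data(SAMPLE_HTML)
node = dig(shared, "entry_data", "ProfilePage", 0, "graphql", "user",
           "edge_owner_to_timeline_media", "edges", 0, "node")
caption = dig(node, "edge_media_to_caption", "edges", 0, "node", "text", default="")
print(node["display_url"])
print(repr(caption))
```

Returning a default instead of raising keeps a single caption-less post from stopping the whole run; whether an empty caption should still be posted is a separate decision.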

For posting to Weibo, see "Weibo API study notes".
