Purpose
Create a crawler to download a profile's details into a local folder, given a username or userid.
Source code: https://github.com/kengsley1993/instagram_...
Feedback and suggestions for improvement are welcome. Thank you.
Requirements
- Python 3.6
- urllib
- pyquery
- PyMySQL
- MySQL
- pyspider
- Scrapy
Setup
Create a project folder:
scrapy startproject instagram_user
Create a spider inside the project folder:
scrapy genspider user_crawler www.instagram.com
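This generates a spider skeleton roughly like the following (the exact template depends on the Scrapy version); the request and parsing logic described below is then filled into this class:

import scrapy

class UserCrawlerSpider(scrapy.Spider):
    name = 'user_crawler'
    allowed_domains = ['www.instagram.com']
    start_urls = ['http://www.instagram.com/']

    def parse(self, response):
        pass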
Information
The information to crawl:
- user's id
- username
- post's id
- post's like count
- post's caption
- post's comment count
- post's images and videos
Project
1. Spider (user_crawler)
1.1 Send Requests
Capture the user information from 'https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables={"id":"{userid}","first":12,"after":"{after_string}"}'.
The endpoint returns JSON data.
import re
import requests
from urllib.parse import urlencode
from scrapy import Request

# 'settings' and 'headers' are module-level objects defined elsewhere in the project
base_url = 'https://www.instagram.com/graphql/query/?'

if settings.USERID == '':
    # Only a username was given, so resolve the userid from the profile page
    find_id_url = 'https://www.instagram.com/' + settings.USERNAME
    response = requests.get(find_id_url, headers=headers)
    result = re.search('"profilePage_(.*?)"', response.text)
    settings.USERID = result[1]

param = {
    'query_hash': 'e769aa130647d2354c40ea6a439bfc08',
    'variables': '{"id":"' + settings.USERID + '","first":12}',
}

def start_requests(self):
    url = self.base_url + urlencode(self.param)
    yield Request(url, headers=headers, callback=self.parse)
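For illustration, with a hypothetical userid of 123456, the URL built by start_requests would look like this (urlencode percent-encodes the variables JSON):

https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables=%7B%22id%22%3A%22123456%22%2C%22first%22%3A12%7D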
1.2 Data Collection
Decode the response as JSON:
data_json = json.loads(response.text)
data = data_json.get('data').get('user').get('edge_owner_to_timeline_media')
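For reference, the part of the response consumed here has roughly this shape (abridged, with omitted values shown as ...):

{
  "data": {
    "user": {
      "edge_owner_to_timeline_media": {
        "edges": [
          {"node": {"id": "...", "owner": {...}, "is_video": false, "display_url": "...", ...}}
        ],
        "page_info": {"has_next_page": true, "end_cursor": "..."}
      }
    }
  }
}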
1.3 Store Item
Map the collected JSON into the project's item format, then store it in MySQL and download the images and videos into a local folder.
for user_detail in data.get('edges'):
    user_node = user_detail.get('node')
    item = InstagramUserItem()
    item['postid'] = user_node.get('id')
    item['username'] = user_node.get('owner').get('username')
    if settings.USERNAME == '':
        settings.USERNAME = user_node.get('owner').get('username')
    item['userid'] = user_node.get('owner').get('id')
    item['liked'] = user_node.get('edge_media_preview_like').get('count')
    try:
        item['caption'] = user_node.get('edge_media_to_caption').get('edges')[0].get('node').get('text')
    except (IndexError, AttributeError):
        # Posts without a caption have an empty edges list
        item['caption'] = ''
    item['comment'] = user_node.get('edge_media_to_comment').get('count')
    video_link = ''
    images_link = ''
    if user_node.get('edge_sidecar_to_children'):
        # Multi-media post: walk every child node
        child_edges = user_node.get('edge_sidecar_to_children').get('edges')
        for child in child_edges:
            node = child.get('node')
            if node.get('is_video'):  # check each child, not the parent post
                video_link += node.get('video_url') + ';'
            else:
                images_link += node.get('display_url') + ';'
    else:
        # Single-media post
        if user_node.get('is_video'):
            video_link = user_node.get('video_url') + ';'
        else:
            images_link = user_node.get('display_url') + ';'
    item['image_list'] = images_link
    item['video_list'] = video_link
    yield item
Get the next page cursor and issue the request again:
page_info = data.get('page_info')
if page_info.get('has_next_page'):
    # Put the end cursor into the query variables to fetch the next page
    temp_variables = json.loads(self.param.get('variables'))
    temp_variables['after'] = page_info.get('end_cursor')
    self.param['variables'] = json.dumps(temp_variables)
    url = self.base_url + urlencode(self.param)
    yield Request(url, headers=headers, callback=self.parse)
2. Items
Define the data fields for storage and for use in the item pipelines.
from scrapy import Item, Field

class InstagramUserItem(Item):
    collection = table = 'user_post'
    postid = Field()
    userid = Field()
    username = Field()
    liked = Field()
    caption = Field()
    comment = Field()
    image_list = Field()
    video_list = Field()
3. Pipelines
3.1 FilePipeline
Download the videos and images and store them in a local folder.
import os
from urllib.parse import urlparse
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline

class FilePipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # Save each file under a folder named after the username
        return settings.USERNAME + '/' + os.path.basename(urlparse(request.url).path)

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem('File download failed')
        return item

    def get_media_requests(self, item, info):
        # split(';') leaves a trailing empty string, so skip blank URLs
        for url in item['image_list'].split(';'):
            if url:
                yield Request(url)
        for url in item['video_list'].split(';'):
            if url:
                yield Request(url)
3.2 MySQL Pipeline (Future)
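Not implemented yet. As a rough sketch of where this is heading, a pipeline built on PyMySQL (already listed under Requirements) might look like the following; the connection parameters and the 'instagram' database name are placeholders, and a user_post table with columns matching the item fields is assumed:

import pymysql

class MysqlPipeline(object):
    def open_spider(self, spider):
        # Placeholder connection parameters; adjust to the local MySQL setup
        self.db = pymysql.connect(host='localhost', user='root', password='',
                                  db='instagram', charset='utf8mb4')
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        # Insert one row per post into the assumed user_post table
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        sql = 'INSERT INTO user_post ({}) VALUES ({})'.format(keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item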
4. Settings
Enter a username or userid to identify the profile to capture:
USERID = '[Enter instagram userid]'
USERNAME = '[Enter instagram username]'
Set up the item pipelines:
ITEM_PIPELINES = {
    'instagram_user.pipelines.FilePipeline': 301,
}
FILES_STORE = './user'
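The spider code above also references a headers dict that is not shown in this write-up; a minimal, assumed definition would just set a browser User-Agent:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/68.0.3440.106 Safari/537.36',
}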
Run
Start the crawler to store the images and information locally:
scrapy crawl user_crawler
This work is licensed under the CC License. Reposts must credit the author and link to the original article.