Python Data Analysis: Qiushibaike

Posted by weixin_34124651 on 2017-05-26

Original URL: https://blog.csdn.net/weixin_34124651/article/details/87059845

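I've been busy writing material lately and haven't written up any homework case studies for you. The students in the second cohort are impressive, all scrambling for homework to do, haha. So today I'll write up some extensions to the crawler plus a bit of data analysis, so the keen students have something to learn from.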

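Code

This time, besides the fields required by the teacher's assignment, I also scraped some information about the users, as shown in the figure.

Classmate Liang has already explained the earlier assignment in detail, so today I'll just paste my code: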

import requests
from lxml import etree
import pymongo
import time

client = pymongo.MongoClient('localhost', 27017)
qiushi = client['qiushi']
qiushi_info = qiushi['qiushi_info']
user_info = qiushi['user_info']

header = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
}

def get_info(url):
    """Scrape one listing page: store each joke and visit its author's profile."""
    html = requests.get(url,headers=header)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//div[@class="col1"]/div')
    base_url = 'https://www.qiushibaike.com'
    for info in infos:
        # Author name; posts without a name node get '匿名使用者' ("anonymous user")
        id = info.xpath('div[1]/a[2]/h2/text()')[0] if len(info.xpath('div[1]/a[2]/h2/text()'))==1 else '匿名使用者'
        # The class of the gender icon encodes the sex; its text is the age
        jug_sex = info.xpath('div[1]/div/@class')
        if len(jug_sex)==0:
            sex = '不詳'   # "unknown"
            age = '不詳'
        elif jug_sex[0]=='articleGender manIcon':
            sex = '男'     # "male"
            age = info.xpath('div[1]/div/text()')[0]
        else:
            sex = '女'     # "female"
            age = info.xpath('div[1]/div/text()')[0]
        content = info.xpath('a[1]/div/span[1]/text()')[0]
        laugh = info.xpath('div[2]/span[1]/i/text()')[0]
        comment = info.xpath('div[2]/span[2]/a/i/text()')[0] if info.xpath('div[2]/span[2]/a/i/text()') else None
        user_url = (base_url + info.xpath('div[1]/a[2]/@href')[0]) if info.xpath('div[1]/a[2]/@href') else None
        data ={
            'id':id,
            'sex':sex,
            'age':age,
            'laugh':laugh,
            'comment':comment,
            'user_url':user_url,
            'content':content
        }
        qiushi_info.insert_one(data)
        # Posts without a profile link have no user page to visit
        if user_url is not None:
            get_user_info(user_url)
    time.sleep(1)  # pause one second between listing pages

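def get_user_info(url):
    """Scrape one user's profile page and store the counters and personal details."""
    html = requests.get(url,headers=header)
    selector = etree.HTML(html.text)
    # Skip pages that contain this block (apparently profiles with no public details)
    if selector.xpath('//div[@class="user-block user-setting clearfix"]'):
        pass
    else: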
        fans = selector.xpath('//div[2]/div[3]/div[1]/ul/li[1]/text()')[0] if selector.xpath('//div[2]/div[3]/div[1]/ul/li[1]/text()') else None
        topic = selector.xpath('//div[2]/div[3]/div[1]/ul/li[2]/text()')[0] if selector.xpath('//div[2]/div[3]/div[1]/ul/li[2]/text()') else None
        qiushi = selector.xpath('//div[2]/div[3]/div[1]/ul/li[3]/text()')[0] if selector.xpath('//div[2]/div[3]/div[1]/ul/li[3]/text()') else None
        comment_1 = selector.xpath('//div[2]/div[3]/div[1]/ul/li[4]/text()')[0] if selector.xpath('//div[2]/div[3]/div[1]/ul/li[4]/text()') else None
        favour = selector.xpath('//div[2]/div[3]/div[1]/ul/li[5]/text()')[0] if selector.xpath('//div[2]/div[3]/div[1]/ul/li[5]/text()') else None
        handpick = selector.xpath('//div[2]/div[3]/div[1]/ul/li[6]/text()')[0] if selector.xpath('//div[2]/div[3]/div[1]/ul/li[6]/text()') else None
        martial_status = selector.xpath('//div[2]/div[3]/div[2]/ul/li[1]/text()')[0] if selector.xpath('//div[2]/div[3]/div[2]/ul/li[1]/text()') else '不詳'
        constellation = selector.xpath('//div[2]/div[3]/div[2]/ul/li[2]/text()')[0] if selector.xpath('//div[2]/div[3]/div[2]/ul/li[2]/text()') else '不詳'
        profession = selector.xpath('//div[2]/div[3]/div[2]/ul/li[3]/text()')[0] if selector.xpath('//div[2]/div[3]/div[2]/ul/li[3]/text()') else '不詳'
        home = selector.xpath('//div[2]/div[3]/div[2]/ul/li[4]/text()')[0] if selector.xpath('//div[2]/div[3]/div[2]/ul/li[4]/text()') else '不詳'
        qiushi_age = selector.xpath('//div[2]/div[3]/div[2]/ul/li[5]/text()')[0] if selector.xpath('//div[2]/div[3]/div[2]/ul/li[5]/text()') else '不詳'
        # print(fans,topic,qiushi,comment_1,favour,handpick,martial_status,constellation,profession,home,qiushi_age)
        data ={
            'fans':fans,
            'topic':topic,
            'qiushi':qiushi,
            'comment_1':comment_1,
            'favour':favour,
            'handpick':handpick,
            'martial_status':martial_status,
            'constellation':constellation,
            'profession':profession,
            'home':home,
            'qiushi_age':qiushi_age,
            'user_url':url
        }
        user_info.insert_one(data)


if __name__ == '__main__':
    urls = ['https://www.qiushibaike.com/text/page/{}/'.format(str(i)) for i in range(1,36)]
    for url in urls:
        get_info(url)
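
Nearly every field above is extracted with the pattern `xpath(...)[0] if xpath(...) else default`, which evaluates each XPath expression twice. A minimal sketch of a helper that evaluates it only once; `first_or_default` and the sample HTML are illustrative and not part of the original code:

```python
from lxml import etree

def first_or_default(node, path, default=None):
    """Return the first XPath match under node, or default when nothing matches."""
    matches = node.xpath(path)
    return matches[0] if matches else default

# Made-up fragment standing in for a downloaded page
sample = etree.HTML('<div><ul><li>12</li><li>3</li></ul></div>')

fans = first_or_default(sample, '//ul/li[1]/text()')          # node exists
home = first_or_default(sample, '//ul/li[4]/text()', '不詳')   # node missing
print(fans, home)
```

With such a helper, each of the long lines in `get_user_info` would shrink to a single call with the path and the fallback value.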

The data is stored in a MongoDB database, as shown in the figure:

Data preprocessing

First, import the libraries and the data:

import pandas as pd
import pymongo
import jieba.analyse
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
client = pymongo.MongoClient('localhost',port = 27017)
qiushi = client['qiushi']
qiushi_info = qiushi['qiushi_info']
data = pd.DataFrame(list(qiushi_info.find()))
data

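Field type conversion
Some fields were missing and were filled with '不詳' ("unknown") or None, so the age and comment fields are stored as text and need to be converted to integers. Values such as None cannot be converted, so they first have to be replaced with "0". The conversion code follows. (Why can't I keep my hands to myself? Null values can be converted directly, and filling missing values is simple too.)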

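# Replace the placeholders with '0' so that the columns can be cast to integers
data['age'] = data['age'].replace('不詳','0')     # '不詳' means "unknown"
data['comment'] = data['comment'].fillna('0')     # None: no comment count was scraped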
data['age'] = data['age'].astype('int64')
data['comment'] = data['comment'].astype('int64')
data['laugh'] = data['laugh'].astype('int64')
data.dtypes
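
The aside above notes that null values could have been converted directly. A minimal sketch of that route, assuming `pd.to_numeric` with `errors='coerce'`; the sample values are made up:

```python
import pandas as pd

# Made-up sample mimicking the scraped columns: numbers stored as text,
# with '不詳' ("unknown") and None marking missing values
sample = pd.DataFrame({
    'age': ['25', '不詳', '31'],
    'comment': ['4', None, '10'],
})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
sample['age'] = pd.to_numeric(sample['age'], errors='coerce')
sample['comment'] = pd.to_numeric(sample['comment'], errors='coerce')

print(sample.dtypes)
print(sample)
```

The columns are then numeric with the missing entries marked as NaN, so no sentinel value such as 0 is needed.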

Filling missing values
I replaced some values with 0 above; now we simply fill them with the mean of the column's non-zero values.

# Replace the 0 placeholders with the integer mean of the non-zero values
data['age'] = data['age'].replace(0,int(data[data['age']!=0]['age'].mean()))
data['comment'] = data['comment'].replace(0,int(data[data['comment']!=0]['comment'].mean()))
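
If missing values are kept as NaN rather than as 0, the same fill reduces to a single `fillna` call, because `mean()` skips NaN by default. A sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry stored as NaN
ages = pd.Series([20.0, np.nan, 31.0, 27.0])

# mean() ignores NaN, so this is the mean of the known ages only
mean_age = int(ages.mean())

# Fill the gap and cast to int64, matching the integer column used in the post
filled = ages.fillna(mean_age).astype('int64')
print(filled.tolist())
```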

Age of Qiushibaike users

Take a look with describe:

data.describe()

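As you can see, the average age is 34. Hang on, aren't we young people supposed to be the main force among joke writers? I went back to the data and found that many users entered an age above 100, which is fake information. Because the dataset is small, these values pull the mean up. The joke writers belong to us!!!! (I won't tell you I'm only 17.)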
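
The effect of such implausible ages can be checked directly. A rough sketch on made-up values, treating ages of 100 and above as invalid (the exact cutoff is an assumption) and comparing the mean with the median, which is less sensitive to outliers:

```python
import pandas as pd

# Made-up ages: mostly young users plus two obviously fake entries
ages = pd.Series([17, 22, 25, 28, 30, 105, 120])

# Keep only ages below the assumed plausibility cutoff
plausible = ages[ages < 100]

print('mean with fake ages:   ', ages.mean())
print('mean without fake ages:', plausible.mean())
print('median of all ages:    ', ages.median())
```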

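Who are the joke writers

By sorting, we find the users behind the ten most-commented jokes and the ten funniest jokes, to see who the real joke writers are.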

# Top 10 jokes by number of comments
data1 = data.sort_values(['comment'],ascending=False)[0:10]
plt.figure(figsize=(20,10),dpi=80)
labels=list(data1['id'])
plt.bar(range(len(labels)),data1['comment'],tick_label=labels)

# Top 10 jokes by number of laughs
data2 = data.sort_values(['laugh'],ascending=False)[0:10]
plt.figure(figsize=(20,10),dpi=80)
labels=list(data2['id'])
plt.bar(range(len(labels)),data2['laugh'],tick_label=labels)
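
Sorting the whole frame and slicing works; pandas also provides `DataFrame.nlargest`, which expresses "top N by a column" directly. A small sketch with made-up rows (each row is one joke, so the same user can appear more than once in such a ranking):

```python
import pandas as pd

# Made-up rows in the same shape as the scraped data (one row per joke)
jokes = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'laugh': [120, 950, 430, 15],
    'comment': [3, 40, 12, 0],
})

# Top 2 jokes by laughs, largest first
top_laugh = jokes.nlargest(2, 'laugh')
print(top_laugh[['id', 'laugh']])
```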

Gender ratio of joke writers

Let's look at the ratio of men to women among the joke writers:

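# data3 is never defined in the post; it is assumed here to be the number of jokes per sex
data3 = data['sex'].value_counts()
plt.figure(figsize=(8,6),dpi=80)
labels = list(data3.index)
sizes = list(data3)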
colors = ['red','yellowgreen','lightskyblue']

explode = (0.05,0,0)

patches,l_text,p_text = plt.pie(sizes,explode=explode,labels=labels,colors=colors,
                                labeldistance = 1.1,autopct = '%3.1f%%',shadow = False,
                                startangle = 90,pctdistance = 0.6)
plt.axis('equal')
plt.legend()

There are more men, haha. Girls with a naughty sense of humor are the cutest!!!
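
To read the proportions as numbers rather than off the chart, `value_counts(normalize=True)` returns the share of each category. A sketch on made-up labels, using the same values the crawler stores ('男' male, '女' female, '不詳' unknown):

```python
import pandas as pd

# Made-up sample of the 'sex' column
sex = pd.Series(['男', '男', '女', '不詳', '男'])

# normalize=True returns fractions that sum to 1 instead of raw counts
share = sex.value_counts(normalize=True)
print((share * 100).round(1))
```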

Joke word cloud

We've covered making word clouds many times already, so here are just the code and the picture.

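# Concatenate the text of every joke into one string
duanzi = ''.join(data['content'])
# '停用詞表路徑' is a placeholder: replace it with the path to your stop-word file
jieba.analyse.set_stop_words('停用詞表路徑')
# Extract the 50 highest-weighted keywords together with their TF-IDF weights
tags = jieba.analyse.extract_tags(duanzi, topK=50, withWeight=True)
for item in tags:
    # Print "word<TAB>weight", with the weight scaled to an integer
    print(item[0]+'\t'+str(int(item[1]*1000)))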

Jokes, after all, are mostly guys talking about girls and girls talking about guys.
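
The keyword loop prints one `word<TAB>weight` line per keyword, presumably for pasting into a word-cloud tool. A sketch of the same scaling collected into a dictionary instead, which is easier to hand on to other code; `tags_to_freq` and the sample tags are illustrative and are not output from the actual data:

```python
def tags_to_freq(tags, scale=1000):
    """Map (word, weight) pairs to {word: integer weight}, scaled as in the post."""
    return {word: int(weight * scale) for word, weight in tags}

# Made-up (word, weight) pairs in the shape returned with withWeight=True
sample_tags = [('男朋友', 0.125), ('老婆', 0.0625)]

freq = tags_to_freq(sample_tags)
print(freq)
```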

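Summary

The holiday is here. If you have finished your homework, have a go at some data analysis. The detailed user information hasn't been analyzed yet; we'll do that next time!! Happy Dragon Boat Festival.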
