January Work Log

Posted by weixin_33728268 on 2018-01-17

Keywords
K_means, ARIMA


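Preface

The main work in January was as follows:

Finer-grained data preprocessing
Filtered out MACs seen at only a single place, filtered out MACs that appeared on fewer than 10 days, and further subdivided the place list.

Data indexing
Kept two copies of the raw data, stored under different indexes, to make later lookups easier:
a. timestamp, place -> mac
b. date, mac -> time range: place

Statistics on the distribution of the number of people

Clustering preparation
Represented each person's time distribution over places as an ndarray (after data processing).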
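
The two filtering rules in the preprocessing item above (drop MACs seen at only one place, drop MACs seen on fewer than 10 days) could be sketched roughly as follows. This is only an illustration, not the code actually used: the function name `filter_macs`, the `min_days` parameter and the in-memory row format are assumptions, and the field order follows the data fragment shown in section 1.

```python
from collections import defaultdict

def filter_macs(rows, min_days=10):
    """Return the MACs seen at more than one place and on at least min_days days.

    rows: iterable of (timestamp, timerange, pid, ptype, mac, isholiday) tuples.
    """
    places = defaultdict(set)   # mac -> set of place ids
    days = defaultdict(set)     # mac -> set of dates (YYYY-MM-DD)
    for timestamp, _, pid, _, mac, _ in rows:
        places[mac].add(pid)
        days[mac].add(timestamp[:10])
    return {m for m in places if len(places[m]) > 1 and len(days[m]) >= min_days}

# Made-up sample: 'aaa' moves between two places on two days, 'bbb' never moves.
sample = [
    ('2017-09-11 00:00:00', '0', '141', 'dormitory', 'aaa', 'N'),
    ('2017-09-12 08:00:00', '0', '158', 'edu', 'aaa', 'N'),
    ('2017-09-11 00:00:00', '0', '142', 'dormitory', 'bbb', 'N'),
    ('2017-09-12 00:00:00', '0', '142', 'dormitory', 'bbb', 'N'),
]
kept = filter_macs(sample, min_days=2)
print(kept)
```

The threshold is lowered to 2 days here only so that the tiny sample keeps one MAC.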


1.

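The first part of the work was only a small modification of earlier code and is not very interesting, so it is not recorded in detail here.
For data indexing, here is a detailed record of how a date and a mac are mapped to place ids:

Input parameters:

start_time  start time
end_time    end time
mac         the mac address to look up

Output

stime1,etime1,pid1  stay period 1
stime2,etime2,pid2  stay period 2
...
stimen,etimen,pidn  stay period n

Data fragment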

2017-09-11 00:00:00,0,141,dormitory,382dd1da2381,N
2017-09-11 00:00:00,0,142,dormitory,5844988f54a5,N
2017-09-11 00:00:00,0,145,dormitory,c8f23075fa06,N
2017-09-11 00:00:00,0,148,dormitory,1c77f6ab931e,N
2017-09-11 00:00:00,0,149,dormitory,10f681e38ca9,N
2017-09-11 00:00:00,0,149,dormitory,4c49e3406f61,N
2017-09-11 00:00:00,0,150,dormitory,bc201040b118,N
2017-09-11 00:00:00,0,150,dormitory,6021013f5b85,N
2017-09-11 00:00:00,0,150,dormitory,a444d1108e48,N
2017-09-11 00:00:00,0,151,dormitory,c8f230a5c86f,N
2017-09-11 00:00:00,0,151,dormitory,483c0cc230cc,N
2017-09-11 00:00:00,0,151,dormitory,bc7574a0e1fa,N
2017-09-11 00:00:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:00:00,0,168,edu,74042bcb3a77,N
2017-09-11 00:00:00,0,181,canteen,40f02f4c670d,N
2017-09-11 00:00:00,0,193,edu,8844773c62e3,N
2017-09-11 00:00:00,0,240,canteen,4c1a3d3f0f21,N
2017-09-11 00:01:00,0,141,dormitory,382dd1da2381,N
2017-09-11 00:01:00,0,142,dormitory,5844988f54a5,N
2017-09-11 00:01:00,0,145,dormitory,c8f23075fa06,N
2017-09-11 00:01:00,0,148,dormitory,1c77f6ab931e,N
2017-09-11 00:01:00,0,149,dormitory,10f681e38ca9,N
2017-09-11 00:01:00,0,150,dormitory,bc201040b118,N
2017-09-11 00:01:00,0,150,dormitory,6021013f5b85,N
2017-09-11 00:01:00,0,150,dormitory,a444d1108e48,N
2017-09-11 00:01:00,0,151,dormitory,483c0cc230cc,N
2017-09-11 00:01:00,0,151,dormitory,bc7574a0e1fa,N
2017-09-11 00:01:00,0,151,dormitory,c8f230a5c86f,N
2017-09-11 00:01:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:01:00,0,168,edu,74042bcb3a77,N

Python code:

# -*- coding: UTF-8 -*-

import pandas as pd

__author__ = 'SuZibo'
"""
Data re-indexing, part 2
date, mac -> time range: place

Format of the re-indexed data:
start time 1, end time 1, place id 1
start time 2, end time 2, place id 2
start time 3, end time 3, place id 3
...
Also yields the trajectory array for the given period:
[142, 202, 142, 202, 200, 202, 200, 142, 142] (example)
Input parameters: mac address, start time, end time
"""

start_time = '2017-09-11 00:00:00'
end_time = '2017-09-18 00:00:00'
mac = '205d4717e6de'

def findpathByMacDate(mac, start_time, end_time):

    records = pd.read_csv('./macdata/normalinfo_trans.txt', names=['timestamp', 'timerange', 'pid', 'ptype', 'mac', 'isholiday'])
    # Read the source data and name the columns (timestamp, time range, place id, place type, mac, holiday flag)
    records_select = records[(records['mac'] == mac) & (records['timestamp'] > start_time) & (records['timestamp'] < end_time)]
    # Keep only this mac's records inside the time window
    records_select = records_select.reset_index(drop=True)
    # Re-index the rows from 0 to n-1

    if records_select.empty:
        # Nothing to segment for this mac in the window
        print('no records for', mac)
        return []

    filepath = './macdata/path/' + mac + '_pathinfo' + '.txt'
    timestamps = records_select['timestamp']
    pids = records_select['pid'].astype(int)
    n = len(records_select)

    change = [i + 1 for i in range(n - 1) if pids[i] != pids[i + 1]]
    # change list: row indices at which this mac's place changes

    starts = [0] + change
    ends = [c - 1 for c in change] + [n - 1]
    # First and last row of every stay; covers the head, middle and tail segments uniformly,
    # including the case where the place never changes

    place = []
    # place list: this mac's place trajectory

    with open(filepath, 'w') as rs:
        for s, e in zip(starts, ends):
            segment = str(timestamps[s]) + ',' + str(timestamps[e]) + ',' + str(pids[s])
            print(segment)
            rs.write(segment + '\n')
            place.append(int(pids[s]))

    print(place)
    return place

findpathByMacDate(mac, start_time, end_time)

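Re-indexing the data looks tedious at first, but with pandas aggregation and filtering used well, it turned out to take little code and was easy to implement.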
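
As an illustration of how compact this can become with pandas, the change-point detection can also be expressed with `shift` and `cumsum`. The sketch below is an alternative formulation rather than the script above; the helper name `stay_segments` and the sample rows are made up for the example.

```python
import pandas as pd

def stay_segments(df):
    """Collapse consecutive rows with the same pid into (stime, etime, pid) stays.

    df must be sorted by timestamp and have 'timestamp' and 'pid' columns.
    """
    # A new run starts wherever pid differs from the previous row.
    run_id = (df['pid'] != df['pid'].shift()).cumsum().rename('run')
    grouped = df.groupby(run_id)
    return pd.DataFrame({
        'stime': grouped['timestamp'].first(),
        'etime': grouped['timestamp'].last(),
        'pid': grouped['pid'].first(),
    }).reset_index(drop=True)

sample = pd.DataFrame({
    'timestamp': ['2017-09-11 00:00:00', '2017-09-11 00:01:00',
                  '2017-09-11 00:02:00', '2017-09-11 00:03:00'],
    'pid': [142, 142, 202, 142],
})
segments = stay_segments(sample)
print(segments)
path = segments['pid'].tolist()  # trajectory array
```

Each run of identical place ids becomes one row, and the `pid` column read top to bottom is the trajectory array.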


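2. People distribution statistics

Work done:

Count the number of MACs along three dimensions: date, time period, and place type (place). The bar chart shows two dimensions at a time with the third held fixed, and the third dimension can be switched when viewing to make patterns easier to spot.

Input: start_time, end_time
Per-day output: number of MACs for each place type
Per-period output: number of MACs for each place type

Columns of the output file:
dormitory, canteen, teaching building, stadium / student activity center

Python code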

# -*- coding: UTF-8 -*-

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime

__author__ = 'SuZibo'

"""
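Count the number of MACs along three dimensions: date, time period, and place type (place). The bar chart shows two dimensions at a time (with the third fixed); the third dimension can be switched when viewing to make patterns easier to spot

Input: start_time, end_time
Per-day output: number of MACs for each place type
Per-period output: number of MACs for each place type

Columns of the output file:
dormitory, canteen, teaching building, stadium / student activity center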
"""

dormitory =[141,142,145,146,148,149,150,151,152,153]
canteen =[171,172,173,174,175,176,177,178,179,180,181,182,240]
edu =[54,60,133,134,136,154,155,156,157,158,159,160,161,162,164,165,166,167,168,169,193,194,195,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,227,230,231,233,234,235,236,237,238,239]
stadium =[183,184,185,186,187,188,189,190,191,232]

stime ='2017-09-11 00:00:00'
# etime ='2017-09-12 00:00:00'
etime ='2017-09-12 00:00:00'
# Lower and upper time bounds used inside the inner loop

weekdaylist =[]
start_date = '2017-09-11'
# end_date = '2017-11-13'
end_date='2017-11-13'
# Lower and upper date bounds of the outer loop

sdate = datetime.datetime.strptime(start_date,'%Y-%m-%d')
edate = datetime.datetime.strptime(end_date,'%Y-%m-%d')
while sdate<edate:
    weekdaylist.append(sdate.strftime('%Y-%m-%d'))
    sdate += datetime.timedelta(days=1)

def getMacCountInfoByDay(stime,etime):
    # Count the distribution of people between stime and etime
    dic_dormitory =dict()
    dic_canteen =dict()
    dic_edu =dict()
    dic_stadium =dict()
    mac_count=[]

    with open('../macinfo/macdata/normalinfo_trans.txt') as file:

        for line in file:
            line = line.split(',')
            line[-1] = line[-1].strip('\n')
            day = line[0][5:10]

            if stime<line[0]<etime:

                if line[3] =='dormitory':
                    if line[4] not in dic_dormitory:
                        dic_dormitory[line[4]] = dict()
                    dic_dormitory[line[4]][day] = dic_dormitory[line[4]].get(day, 0) + 1
                if line[3] =='canteen':
                    if line[4] not in dic_canteen:
                        dic_canteen[line[4]] = dict()
                    dic_canteen[line[4]][day] = dic_canteen[line[4]].get(day, 0) + 1
                if line[3] =='edu':
                    if line[4] not in dic_edu:
                        dic_edu[line[4]] = dict()
                    dic_edu[line[4]][day] = dic_edu[line[4]].get(day, 0) + 1
                if line[3] =='stadium':
                    if line[4] not in dic_stadium:
                        dic_stadium[line[4]] = dict()
                    dic_stadium[line[4]][day] = dic_stadium[line[4]].get(day, 0) + 1

        for mac in dic_dormitory:
            dic_dormitory[mac] = len(dic_dormitory[mac])
        for mac in dic_canteen:
            dic_canteen[mac] = len(dic_canteen[mac])
        for mac in dic_edu:
            dic_edu[mac] = len(dic_edu[mac])
        for mac in dic_stadium:
            dic_stadium[mac] = len(dic_stadium[mac])

        mac_count.append(stime)
        mac_count.append(len(dic_dormitory))
        mac_count.append(len(dic_canteen))
        mac_count.append(len(dic_edu))
        mac_count.append(len(dic_stadium))
        # print mac_count
        return mac_count
        # Return the mac_count list

rs = open('./plotdata/maccountbyday.txt','w')

for i in range(len(weekdaylist)):
    # Run getMacCountInfoByDay once per day to get the mac counts by day and place type (between sdate and edate)
    counts = getMacCountInfoByDay(stime,etime)
    # print(counts)
    # print(counts[0])
    rs.write(str(counts[0][0:10])+','+str(counts[1])+','+str(counts[2])+','+str(counts[3])+','+str(counts[4])+'\n')
    # Write the returned mac_count list to the file
    stime = datetime.datetime.strptime(stime, '%Y-%m-%d %H:%M:%S')
    etime = datetime.datetime.strptime(etime, '%Y-%m-%d %H:%M:%S')
    stime += datetime.timedelta(days=1)
    etime += datetime.timedelta(days=1)
    stime = str(stime)
    etime = str(etime)
rs.close()
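
Because only the number of distinct MACs per day and place type ends up in the output, the counts could also be gathered in a single pass over the file rather than re-reading it for every day. The following is a rough sketch under that assumption; `count_macs_by_day` and the in-memory sample are hypothetical. Unlike the loop above, it groups by calendar day and does not apply the strict `stime < timestamp < etime` window.

```python
import io
from collections import defaultdict

PLACE_TYPES = ['dormitory', 'canteen', 'edu', 'stadium']

def count_macs_by_day(lines):
    """Return {day: [n_dormitory, n_canteen, n_edu, n_stadium]} of distinct MACs."""
    seen = defaultdict(set)  # (day, place type) -> set of MACs
    for line in lines:
        fields = line.rstrip('\n').split(',')
        day, ptype, mac = fields[0][:10], fields[3], fields[4]
        seen[(day, ptype)].add(mac)
    days = sorted({day for day, _ in seen})
    # get() with a default avoids creating empty entries for unseen combinations.
    return {day: [len(seen.get((day, p), ())) for p in PLACE_TYPES] for day in days}

sample = io.StringIO(
    '2017-09-11 00:00:00,0,141,dormitory,382dd1da2381,N\n'
    '2017-09-11 00:01:00,0,141,dormitory,382dd1da2381,N\n'
    '2017-09-11 00:00:00,0,142,dormitory,5844988f54a5,N\n'
    '2017-09-11 00:00:00,0,158,edu,8056f2ea0cd9,N\n'
    '2017-09-12 00:00:00,0,181,canteen,40f02f4c670d,N\n'
)
counts = count_macs_by_day(sample)
for day, row in counts.items():
    print(day + ',' + ','.join(str(n) for n in row))
```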

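For the people statistics you need to be comfortable with the get method of Python dictionaries.
A brief summary of dict.get:

Syntax
The syntax of get() is:

dict.get(key, default=None)

Parameters
key -- the key to look up in the dictionary.
default -- the value to return if the key does not exist.

Return value
Returns the value for the given key; if the key is not in the dictionary, returns default (None unless another default is supplied).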
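
A small example of the counting idiom used in the scripts here, with made-up data:

```python
# Count occurrences per key without checking for the key first.
visits = ['edu', 'canteen', 'edu', 'library', 'edu']
counter = dict()
for place in visits:
    counter[place] = counter.get(place, 0) + 1
print(counter)

# get() never inserts the key, and falls back to None when no default is given.
missing = counter.get('stadium')
print(missing)
```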


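3. Building the per-person time distribution matrix

Work done:
Using male_dor, famale_dor, postgraduate_dor, net, hospital, canteen, edu, lab, stadium, activity, administration and library as attributes,
build a matrix of how long each person was present (indexed by mac).

Python code: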

# -*- coding: UTF-8 -*-

import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time

__author__ = 'SuZibo'

"""
Per-person time feature matrix (distribution over places)

Place lists
male_dor=[141,145,146,149,151]
# male dormitories
famale_dor=[148,150,152,153]
# female dormitories
postgraduate_dor=[142]
# postgraduate dormitory
net=[217,229]
# network center
hospital=[192]
# campus hospital
canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
# canteens
edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
# teaching buildings
lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
# laboratories
stadium=[189,190,191]
# stadium
activity=[183,184,185,186,187,188,232]
# student activity center
administration=[221,222,223]
# administration building
library=[193,194,195,227]
# library
"""
mac_time_dic =dict()
# Dict holding the time statistics for each mac. The source data has one record per minute, so summing the record counts gives the duration directly (in minutes)

# start_time ='2017-09-11 00:00:00'
# end_time ='2017-11-13 00:00:00'

# frame_data = pd.read_csv('../macinfo/macdata/normalinfo_trans_v2.txt',header=None)
# print frame_data.tail()

with open('../macinfo/macdata/normalinfo_trans_v2.txt') as file:

    for line in file:
        # print line
        line = line.split(',')
        line[-1] = line[-1].strip('\n')

        if line[4] not in mac_time_dic:
            mac_time_dic[line[4]] = dict()
        mac_time_dic[line[4]][line[3]] = mac_time_dic[line[4]].get(line[3], 0) + 1
#{mac1:{place1:m,place2:n,...},...}
#{'10b1f8f3a4d0': {'famale_dor': 10, 'male_dor': 507, 'hospital': 10, 'activity': 10, 'library': 4, 'edu': 41, 'canteen': 86...},...}

# print mac_time_dic
# print list(mac_time_dic.iteritems())
# print list(mac_time_dic.values())
# list1 = list(mac_time_dic.values())
# print list(mac_time_dic.keys())

frame = DataFrame(list(mac_time_dic.values()),columns=['male_dor','famale_dor','postgraduate_dor','net','hospital','canteen','edu','lab','stadium','activity','administration','library'],index=list(mac_time_dic.keys()))
# Convert to a DataFrame indexed by mac
frame = frame.dropna(how='all')
# Drop rows in which every column is NA
frame = frame.fillna(0)
# Fill the remaining NA values with 0
frame.to_csv('./data/user_time_array_includex.csv')
frame.to_csv('./data/user_time_array.csv',index=False,header=False)
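
For comparison, the same mac-by-place-type duration matrix could be sketched with `pandas.crosstab`, which counts rows per (mac, place type) pair and fills missing combinations with 0. This is an alternative formulation under the same one-record-per-minute assumption, not the script above; the sample rows and the shortened column list are made up.

```python
import pandas as pd

records = pd.DataFrame({
    'mac': ['aaa', 'aaa', 'aaa', 'bbb', 'bbb'],
    'ptype': ['male_dor', 'male_dor', 'edu', 'canteen', 'edu'],
})

# One row per mac, one column per place type, each cell = number of records (minutes).
frame = pd.crosstab(records['mac'], records['ptype'])

# Force a fixed column order; place types that never occur become all-zero columns.
columns = ['male_dor', 'edu', 'canteen', 'library']
frame = frame.reindex(columns=columns, fill_value=0)
print(frame)

matrix = frame.to_numpy()  # plain ndarray, e.g. as input for clustering
```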

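4. Building the per-person frequency distribution matrix

Continuing from 3: Android and iOS behave differently. The former stays connected to Wi-Fi after the screen locks, while the latter drops the wireless connection shortly after locking. Duration is therefore not an accurate measure of a person's behavior, so the goal here is to build the person-by-place vector matrix in units of frequency instead (the number of distinct days on which a mac appears at a place type).

P.S. I also want to split certain areas by time period to tell groups of people apart, for example teaching buildings during 07:00-22:00 versus all other hours.
So some extra data operations were added on top of the above.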
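
The time-period split could be sketched as a small labelling step applied to each record before counting. The helper `bucket_place`, the `_extra` suffix convention and the window table are assumptions made for this illustration.

```python
# Windows as (first hour inclusive, last hour exclusive); the values are assumptions.
WINDOWS = {'edu': (7, 22), 'canteen': (7, 23)}

def bucket_place(timestamp, ptype, windows=WINDOWS):
    """Return ptype inside its window, ptype + '_extra' outside it.

    Place types without a window are returned unchanged.
    """
    if ptype not in windows:
        return ptype
    hour = int(timestamp[11:13])  # 'YYYY-MM-DD HH:MM:SS' -> HH
    start, end = windows[ptype]
    return ptype if start <= hour < end else ptype + '_extra'

print(bucket_place('2017-09-11 08:30:00', 'edu'))
print(bucket_place('2017-09-11 23:10:00', 'edu'))
print(bucket_place('2017-09-11 23:10:00', 'male_dor'))
```

Counting distinct days per relabelled place type then yields the in-window and out-of-window frequencies in one pass.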

Python code 1:
Frequency counts for places that do not need a time-period split

# -*- coding: UTF-8 -*-

import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time

__author__ = 'SuZibo'

"""
Per-person frequency feature matrix (distribution over places)

Place lists
male_dor=[141,145,146,149,151]
famale_dor=[148,150,152,153]
postgraduate_dor=[142]
net=[217,229]
hospital=[192]
canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
stadium=[189,190,191]
activity=[183,184,185,186,187,188,232]
administration=[221,222,223]
library=[193,194,195,227]


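Final data structure: {'teaching building (07:00-22:00)': 1, 'teaching building (other hours)': 0, 'male dormitory': 0, 'postgraduate dormitory': 0, 'female dormitory': 0, 'student activity center (07:00-21:00)': 0, 'student activity center (other hours)': 0, 'administration building (07:00-21:00)': 0, 'administration building (other hours)': 0, 'laboratory building (07:00-21:00)': 0, 'laboratory building (other hours)': 0, 'canteen (07:00-23:00)': 0, 'canteen (other hours)': 0}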
edu,edu1,male_dor,postgraduate_dor,famale_dor,activity,activity1,administration,administration1,lab,lab1,canteen,canteen1,library,hospital,stadium
"""

mac_count_dic = dict()
with open('../macinfo/macdata/normalinfo_trans_v2.txt') as file:
    for line in file:
        # print line
        line = line.split(',')
        line[-1] = line[-1].strip('\n')
        day = line[0][5:10]

        if line[4] not in mac_count_dic:
            mac_count_dic[line[4]] = dict()

        if line[3] not in mac_count_dic[line[4]]:
            mac_count_dic[line[4]][line[3]] = dict()

        mac_count_dic[line[4]][line[3]][day] = mac_count_dic[line[4]][line[3]].get(day,0)+1
# Build the nested dict
# mac_count_dic[mac][place type] = {day: record count}
# print mac_count_dic

rs = open('./data/user_count_array_includex.csv','w')

for key in mac_count_dic:
    # Iterate over the nested dict, one mac at a time
    mac = key
    dis = mac_count_dic[key]
    # Unpack one level: dis maps place type -> {day: record count}

    # Frequency = number of distinct days the mac was seen at that place type (0 if never seen)
    male_dor_count = len(dis.get('male_dor', {}))
    famale_dor_count = len(dis.get('famale_dor', {}))
    postgraduate_dor_count = len(dis.get('postgraduate_dor', {}))
    net_count = len(dis.get('net', {}))
    hospital_count = len(dis.get('hospital', {}))
    stadium_count = len(dis.get('stadium', {}))

    # Write one row per mac
    rs.write(str(mac)+','+str(male_dor_count)+','+str(famale_dor_count)+','+str(postgraduate_dor_count)+','+str(net_count)+','+str(hospital_count)+','+str(stadium_count)+'\n')
rs.close()
# Row format: mac,male_dor_count,famale_dor_count,...
# mac serves as the index

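The frequency dicts for the 07:00-22:00 window and for the extra window (all other hours) are obtained in the same way.
Three DataFrame objects are then created, named df1, df2 and df3.

Python code 2:
Merging the DataFrame objects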

# -*- coding: UTF-8 -*-

import csv
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from dateutil.parser import parse
import datetime
import time

__author__ = 'SuZibo'

df1 = pd.read_csv('./data/user_count_array_includex_1.csv',names=['canteen','edu','lab','activity','administration','library'])
df2 = pd.read_csv('./data/user_count_array_includex_1_extra.csv',names=['canteen_extra','edu_extra','lab_extra','activity_extra','administration_extra','library_extra'])
df3 = pd.read_csv('./data/user_count_array_includex.csv',names=['male_dor','famale_dor','postgraduate_dor','net','hospital','stadium'])

# print len(df1)
# print len(df2)
# print len(df3)

df = df2.join(df1)
# print df
df = df.join(df3)
df = df.dropna(how='all')
df = df.fillna(0)
# print df
df.to_csv('./data/user_TimeArray_includex.csv')
# Write the CSV with the index
df.to_csv('./data/user_TimeArray.csv',index=False,header=False)
# Write the CSV without the index

That completes the generation of the per-person frequency vector matrix.

Matrix fragment:

,canteen_extra,edu_extra,lab_extra,activity_extra,administration_extra,library_extra,canteen,edu,lab,activity,administration,library,male_dor,famale_dor,postgraduate_dor,net,hospital,stadium
483b38cac86d,15,10,0,0,0,3,15.0,9.0,0.0,0.0,0.0,3.0,0,12,0,3,1,1
786256354ae3,9,1,1,0,0,0,9.0,1.0,1.0,0.0,0.0,0.0,5,0,0,0,1,0
908d6c7faa0c,7,13,0,0,0,2,7.0,13.0,0.0,0.0,0.0,2.0,0,6,0,0,0,0
4c49e31c7c69,20,10,3,6,13,19,20.0,10.0,3.0,6.0,13.0,19.0,0,1,22,0,4,3
58449877c1c5,3,7,8,0,2,1,3.0,7.0,8.0,0.0,2.0,1.0,0,0,4,0,0,2
64cc2e771dd3,21,10,6,0,2,6,21.0,10.0,6.0,0.0,2.0,6.0,3,38,0,0,4,1
9cb2b2c7ad65,3,10,2,0,10,0,3.0,10.0,2.0,0.0,10.0,0.0,0,0,0,1,0,0
742344e4ff39,10,3,1,0,0,1,10.0,3.0,1.0,0.0,0.0,1.0,5,0,0,0,0,0
1ccde57a678a,7,4,0,0,0,6,7.0,4.0,0.0,0.0,0.0,6.0,11,0,0,0,1,0
ecdf3ad00c44,15,9,3,0,0,0,15.0,9.0,3.0,0.0,0.0,0.0,0,13,0,1,0,0
f431c39cf8cc,8,4,0,0,0,0,8.0,4.0,0.0,0.0,0.0,0.0,12,0,0,2,1,1
f40e22420be9,18,32,14,0,11,8,17.0,32.0,12.0,0.0,11.0,8.0,5,0,0,0,0,1
68fb7eee63e9,13,6,0,0,1,1,13.0,6.0,0.0,0.0,1.0,1.0,0,15,0,0,2,0
205d47642a4c,17,12,4,2,0,1,17.0,12.0,4.0,2.0,0.0,1.0,27,4,0,0,2,12
b0e235c341d5,13,11,0,1,0,1,13.0,11.0,0.0,1.0,0.0,1.0,0,18,0,0,0,0
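
The `.0` suffix on the canteen through library columns in the fragment is consistent with how a left `join` behaves: index values missing from the right-hand frame produce NaN, which turns integer columns into floats before `fillna(0)` runs. A minimal illustration with made-up frames (not the actual data):

```python
import pandas as pd

left = pd.DataFrame({'edu_extra': [10, 1]}, index=['aaa', 'bbb'])
right = pd.DataFrame({'edu': [9]}, index=['aaa'])  # 'bbb' is missing on purpose

joined = left.join(right)   # left join on the index; 'bbb' gets NaN for edu
filled = joined.fillna(0)
print(filled)
print(filled.dtypes)

# Optional: cast back to integers once the NaNs are gone.
as_int = filled.astype(int)
```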

The next post will describe and study the ARIMA model.
