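January Work Log
Keywords
K-means, ARIMA
Introduction
The main work in January was as follows:
Refined data preprocessing
Filtered out MACs seen at only one place, filtered out MACs that appeared on fewer than 10 days, and further subdivided the place list.
Data indexing
Kept two copies of the raw data, stored under different indexes, to make later lookups easier:
a. timestamp, place -> mac
b. date, mac -> time period: place
People-count distribution statistics
Clustering preparation
Represented each person's time distribution over places as an ndarray (after data processing).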
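1. Data preprocessing and data indexing
The first part of the work was just a simple modification of earlier code and does not carry much meaning, so it is not recorded in detail here.
For data indexing, here is a detailed record of how a date range and a MAC are mapped to place ids:
Input parameters:
start_time: start time
end_time: end time
mac: the MAC address to look up
Output:
stime1,etime1,pid1 stay period 1
stime2,etime2,pid2 stay period 2
...
stimen,etimen,pidn stay period n
Data snippet: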
2017-09-11 00:00:00,0,141,dormitory,382dd1da2381,N
2017-09-11 00:00:00,0,142,dormitory,5844988f54a5,N
2017-09-11 00:00:00,0,145,dormitory,c8f23075fa06,N
2017-09-11 00:00:00,0,148,dormitory,1c77f6ab931e,N
2017-09-11 00:00:00,0,149,dormitory,10f681e38ca9,N
2017-09-11 00:00:00,0,149,dormitory,4c49e3406f61,N
2017-09-11 00:00:00,0,150,dormitory,bc201040b118,N
2017-09-11 00:00:00,0,150,dormitory,6021013f5b85,N
2017-09-11 00:00:00,0,150,dormitory,a444d1108e48,N
2017-09-11 00:00:00,0,151,dormitory,c8f230a5c86f,N
2017-09-11 00:00:00,0,151,dormitory,483c0cc230cc,N
2017-09-11 00:00:00,0,151,dormitory,bc7574a0e1fa,N
2017-09-11 00:00:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:00:00,0,168,edu,74042bcb3a77,N
2017-09-11 00:00:00,0,181,canteen,40f02f4c670d,N
2017-09-11 00:00:00,0,193,edu,8844773c62e3,N
2017-09-11 00:00:00,0,240,canteen,4c1a3d3f0f21,N
2017-09-11 00:01:00,0,141,dormitory,382dd1da2381,N
2017-09-11 00:01:00,0,142,dormitory,5844988f54a5,N
2017-09-11 00:01:00,0,145,dormitory,c8f23075fa06,N
2017-09-11 00:01:00,0,148,dormitory,1c77f6ab931e,N
2017-09-11 00:01:00,0,149,dormitory,10f681e38ca9,N
2017-09-11 00:01:00,0,150,dormitory,bc201040b118,N
2017-09-11 00:01:00,0,150,dormitory,6021013f5b85,N
2017-09-11 00:01:00,0,150,dormitory,a444d1108e48,N
2017-09-11 00:01:00,0,151,dormitory,483c0cc230cc,N
2017-09-11 00:01:00,0,151,dormitory,bc7574a0e1fa,N
2017-09-11 00:01:00,0,151,dormitory,c8f230a5c86f,N
2017-09-11 00:01:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:01:00,0,168,edu,74042bcb3a77,N
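Each record in this format has six comma-separated fields: timestamp, time range, place id, place type, mac, and a holiday flag. As a rough sketch of index (a), timestamp, place -> mac, the records could be grouped into an in-memory dictionary like this. The function name `build_place_index` and the use of a plain dict of sets are my own assumptions for illustration, not the storage format actually used.

```python
from collections import defaultdict

# A few rows copied from the data snippet above
SAMPLE = """\
2017-09-11 00:00:00,0,150,dormitory,bc201040b118,N
2017-09-11 00:00:00,0,150,dormitory,6021013f5b85,N
2017-09-11 00:00:00,0,158,edu,8056f2ea0cd9,N
2017-09-11 00:01:00,0,150,dormitory,bc201040b118,N
"""


def build_place_index(lines):
    """Map (timestamp, place id) -> set of MACs seen there at that minute.

    Hypothetical helper: the name and the dict-of-sets layout are assumptions.
    """
    index = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        timestamp, timerange, pid, ptype, mac, isholiday = line.split(',')
        index[(timestamp, int(pid))].add(mac)
    return index


place_index = build_place_index(SAMPLE.splitlines())
print(sorted(place_index[('2017-09-11 00:00:00', 150)]))
```

Looking up a (timestamp, place id) pair then returns every MAC present at that place in that minute.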
Python code:
# -*- coding: UTF-8 -*-
import pandas as pd

__author__ = 'SuZibo'

"""
Data re-indexing 2
date, mac -> time period: place
Data format after re-indexing:
start time 1, end time 1, place id 1
start time 2, end time 2, place id 2
start time 3, end time 3, place id 3
...
Also produces the trajectory array within the given time range:
[142, 202, 142, 202, 200, 202, 200, 142, 142] (example)
Input parameters: MAC address, start time, end time
"""

start_time = '2017-09-11 00:00:00'
end_time = '2017-09-18 00:00:00'
mac = '205d4717e6de'


def findpathByMacDate(mac, start_time, end_time):
    # Read the source data and name the columns (timestamp, time range,
    # place id, place type, mac, holiday flag)
    records = pd.read_csv('./macdata/normalinfo_trans.txt',
                          names=['timestamp', 'timerange', 'pid', 'ptype', 'mac', 'isholiday'])
    # Select this MAC's records inside the time range (both bounds exclusive)
    records_select = records[(records['mac'] == mac) &
                             (records['timestamp'] > start_time) &
                             (records['timestamp'] < end_time)]
    # Re-index the rows from 0 to n-1
    records_select = records_select.reset_index(drop=True)
    n_rows = records_select.shape[0]
    # place: the place trajectory of this MAC
    place = []
    if n_rows == 0:
        return place
    timestamps = records_select['timestamp'].tolist()
    pids = [int(p) for p in records_select['pid']]
    # change: row indexes at which the place differs from the previous record
    change = [i + 1 for i in range(n_rows - 1) if pids[i] != pids[i + 1]]
    # Each stay runs from one boundary to the row just before the next one;
    # this covers the head, middle and tail segments in a single loop
    bounds = [0] + change + [n_rows]
    filepath = './macdata/path/' + mac + '_pathinfo' + '.txt'
    with open(filepath, 'w') as rs:
        for start, stop in zip(bounds[:-1], bounds[1:]):
            row = timestamps[start] + ',' + timestamps[stop - 1] + ',' + str(pids[start])
            print(row)
            rs.write(row + '\n')
            place.append(pids[start])
    print(place)
    return place


findpathByMacDate(mac, start_time, end_time)
Re-indexing the data looks tedious at first, but with pandas' filtering and selection operations it takes very little code and is easy to implement.
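For comparison, the stay-period extraction can also be sketched with the pandas shift/cumsum idiom, which numbers each run of consecutive identical place ids and then aggregates per run. This is an alternative sketch rather than the code used for this post; the helper name `stay_segments` and the tiny demo frame are made up for illustration (the place ids 142 and 202 are borrowed from the example trajectory).

```python
import pandas as pd


def stay_segments(df):
    """Collapse consecutive rows with the same pid into (stime, etime, pid) rows.

    Hypothetical helper; expects 'timestamp' and 'pid' columns sorted by time.
    """
    df = df.reset_index(drop=True)
    # A new segment starts wherever pid differs from the previous row;
    # the cumulative sum turns those start flags into segment numbers
    segment_id = (df['pid'] != df['pid'].shift()).cumsum().rename('segment')
    grouped = df.groupby(segment_id)
    return pd.DataFrame({
        'stime': grouped['timestamp'].first(),
        'etime': grouped['timestamp'].last(),
        'pid': grouped['pid'].first(),
    }).reset_index(drop=True)


# Made-up demo records for a single MAC, one row per minute
demo = pd.DataFrame({
    'timestamp': ['2017-09-11 08:00:00', '2017-09-11 08:01:00',
                  '2017-09-11 08:02:00', '2017-09-11 08:03:00',
                  '2017-09-11 08:04:00'],
    'pid': [142, 142, 202, 202, 142],
})
segments = stay_segments(demo)
print(segments)
print(segments['pid'].tolist())
```

The `pid` column of the result is the trajectory array, and each row is one stay period.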
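2. People distribution statistics
Work content:
Count MACs along three dimensions: date, time period, and place type (place). A bar chart shows two of the dimensions at once with the third held fixed; the third dimension can be switched while viewing to make patterns easier to spot.
Input: start_time, end_time
Daily output: number of MACs per place type
Per-period output: number of MACs per place type
Columns of the returned file:
dormitory, canteen, teaching building, stadium / student activity center
Python code: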
# -*- coding: UTF-8 -*-
import datetime

__author__ = 'SuZibo'

"""
Count MACs along three dimensions: date, time period, and place type (place).
A bar chart shows two dimensions at once (third fixed); the third dimension
can be switched while viewing to make patterns easier to spot.
Input: start_time, end_time
Daily output: number of MACs per place type
Per-period output: number of MACs per place type
Columns of the returned file:
dormitory, canteen, teaching building, stadium / student activity center
"""

# Place ids belonging to each place type (kept for reference)
dormitory = [141,142,145,146,148,149,150,151,152,153]
canteen = [171,172,173,174,175,176,177,178,179,180,181,182,240]
edu = [54,60,133,134,136,154,155,156,157,158,159,160,161,162,164,165,166,167,168,169,193,194,195,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,227,230,231,233,234,235,236,237,238,239]
stadium = [183,184,185,186,187,188,189,190,191,232]

# Lower and upper time bounds of the inner loop (a one-day window)
stime = '2017-09-11 00:00:00'
etime = '2017-09-12 00:00:00'

# Lower and upper date bounds of the outer loop
start_date = '2017-09-11'
end_date = '2017-11-13'

weekdaylist = []
sdate = datetime.datetime.strptime(start_date, '%Y-%m-%d')
edate = datetime.datetime.strptime(end_date, '%Y-%m-%d')
while sdate < edate:
    weekdaylist.append(sdate.strftime('%Y-%m-%d'))
    sdate += datetime.timedelta(days=1)


def getMacCountInfoByDay(stime, etime):
    # Count the distinct MACs seen per place type between stime and etime
    macs_by_type = {'dormitory': set(), 'canteen': set(), 'edu': set(), 'stadium': set()}
    with open('../macinfo/macdata/normalinfo_trans.txt') as file:
        for line in file:
            # Fields: timestamp, time range, place id, place type, mac, holiday flag
            line = line.strip('\n').split(',')
            # Both time bounds are exclusive
            if stime < line[0] < etime and line[3] in macs_by_type:
                macs_by_type[line[3]].add(line[4])
    # Return [stime, dormitory count, canteen count, edu count, stadium count]
    return [stime,
            len(macs_by_type['dormitory']),
            len(macs_by_type['canteen']),
            len(macs_by_type['edu']),
            len(macs_by_type['stadium'])]


fmt = '%Y-%m-%d %H:%M:%S'
with open('./plotdata/maccountbyday.txt', 'w') as rs:
    # Run getMacCountInfoByDay once per day between start_date and end_date
    # to get the MAC counts by day and by place type
    for _ in weekdaylist:
        counts = getMacCountInfoByDay(stime, etime)
        # Write one line per day: date, dormitory, canteen, edu, stadium
        rs.write(counts[0][0:10] + ',' + ','.join(str(c) for c in counts[1:]) + '\n')
        # Move the one-day window forward by a day
        stime = str(datetime.datetime.strptime(stime, fmt) + datetime.timedelta(days=1))
        etime = str(datetime.datetime.strptime(etime, fmt) + datetime.timedelta(days=1))
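The script above re-reads the source file once per day. If the data fits in memory, the same kind of per-day, per-place-type count of distinct MACs can be sketched in a single pass with pandas `groupby` and `nunique`. This is only a sketch under that assumption: the helper name `daily_mac_counts` and the demo rows are hypothetical, and unlike the script it counts every record of a calendar day, including the one stamped exactly 00:00:00.

```python
import pandas as pd


def daily_mac_counts(records):
    """Return a day x place-type table of distinct MAC counts.

    Hypothetical helper; expects 'timestamp', 'ptype' and 'mac' columns.
    """
    # The first 10 characters of the timestamp are the date (YYYY-MM-DD)
    day = records['timestamp'].str[:10].rename('day')
    counts = records.groupby([day, 'ptype'])['mac'].nunique()
    # Pivot place types into columns; missing combinations become 0
    return counts.unstack(fill_value=0)


# Made-up demo records
demo = pd.DataFrame({
    'timestamp': ['2017-09-11 00:00:00', '2017-09-11 00:01:00',
                  '2017-09-11 00:01:00', '2017-09-11 12:00:00',
                  '2017-09-12 08:00:00'],
    'ptype': ['dormitory', 'dormitory', 'dormitory', 'canteen', 'edu'],
    'mac': ['aa', 'aa', 'bb', 'aa', 'bb'],
})
table = daily_mac_counts(demo)
print(table)
```

Each row of the table corresponds to one line of the per-day output file, with one column per place type present in the data.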
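For the people statistics you need to be comfortable with the get method of Python dictionaries.
A brief description of the dictionary get method:
Syntax
Syntax of get():
dict.get(key, default=None)
Parameters
key -- the key to look up in the dictionary.
default -- the value to return if the given key does not exist.
Return value
Returns the value for the given key; if the key is not in the dictionary, returns the default value, which is None unless specified otherwise.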
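As a small illustration of the counting pattern that get makes possible, here is a sketch with made-up place names: each occurrence adds 1 to the current count, and a key seen for the first time starts from the default 0.

```python
# Made-up sequence of place types seen for one MAC, one entry per record
places_seen = ['canteen', 'edu', 'canteen', 'library', 'canteen']

place_count = {}
for place in places_seen:
    # get returns 0 for a place not counted yet, so no membership test is needed
    place_count[place] = place_count.get(place, 0) + 1

print(place_count)                    # {'canteen': 3, 'edu': 1, 'library': 1}
print(place_count.get('stadium'))     # None (key missing, no default given)
print(place_count.get('stadium', 0))  # 0 (key missing, default returned)
```

Note that get never inserts the missing key; it only returns the default.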
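3. Building the per-person time distribution matrix
Work content:
Using male_dor, famale_dor, postgraduate_dor, net, hospital, canteen, edu, lab, stadium, activity, administration and library as attributes,
build a matrix of how long each person appears at each kind of place (indexed by mac).
Python code: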
# -*- coding: UTF-8 -*-
from pandas import DataFrame

__author__ = 'SuZibo'

"""
Build each person's time feature matrix (distribution over places)
Place lists
male_dor=[141,145,146,149,151]
# male dormitories
famale_dor=[148,150,152,153]
# female dormitories
postgraduate_dor=[142]
# postgraduate dormitory
net=[217,229]
# network center
hospital=[192]
# campus hospital
canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
# canteens
edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
# teaching buildings
lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
# laboratories
stadium=[189,190,191]
# stadium
activity=[183,184,185,186,187,188,232]
# student activity center
administration=[221,222,223]
# administration building
library=[193,194,195,227]
# library
"""

# Dictionary holding the time statistics of each MAC. The source data has one
# record per minute, so counting records gives the duration directly (in minutes).
mac_time_dic = dict()

with open('../macinfo/macdata/normalinfo_trans_v2.txt') as file:
    for line in file:
        # Fields: timestamp, time range, place id, place type, mac, holiday flag
        line = line.strip('\n').split(',')
        if line[4] not in mac_time_dic:
            mac_time_dic[line[4]] = dict()
        mac_time_dic[line[4]][line[3]] = mac_time_dic[line[4]].get(line[3], 0) + 1
# {mac1: {place1: m, place2: n, ...}, ...}
# {'10b1f8f3a4d0': {'famale_dor': 10, 'male_dor': 507, 'hospital': 10, 'activity': 10, 'library': 4, 'edu': 41, 'canteen': 86...},...}

# Convert to a DataFrame indexed by mac, one column per place type
frame = DataFrame(list(mac_time_dic.values()),
                  columns=['male_dor', 'famale_dor', 'postgraduate_dor', 'net', 'hospital', 'canteen',
                           'edu', 'lab', 'stadium', 'activity', 'administration', 'library'],
                  index=list(mac_time_dic.keys()))
# Drop rows in which every value is NA
frame = frame.dropna(how='all')
# Fill the remaining NA values with 0
frame = frame.fillna(0)
# One CSV with the mac index and header, one with the bare matrix
frame.to_csv('./data/user_time_array_includex.csv')
frame.to_csv('./data/user_time_array.csv', index=False, header=False)
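If the records are already loaded in a DataFrame, a similar mac-by-place-type matrix of minute counts can be sketched with `pd.crosstab`, followed by `reindex` to fix the column order and fill place types that never occur with 0. This is an alternative sketch, not the code used for this post; `time_matrix`, `PLACE_TYPES` and the demo rows are names and data I made up for illustration.

```python
import pandas as pd

# Column order taken from the attribute list of this section
PLACE_TYPES = ['male_dor', 'famale_dor', 'postgraduate_dor', 'net', 'hospital',
               'canteen', 'edu', 'lab', 'stadium', 'activity',
               'administration', 'library']


def time_matrix(records):
    """Count records per (mac, place type); one record stands for one minute.

    Hypothetical helper; expects 'mac' and 'ptype' columns.
    """
    table = pd.crosstab(records['mac'], records['ptype'])
    # Force a fixed column order and add all-zero columns for unseen place types
    return table.reindex(columns=PLACE_TYPES, fill_value=0)


# Made-up demo records
demo = pd.DataFrame({
    'mac': ['aa', 'aa', 'aa', 'bb'],
    'ptype': ['canteen', 'canteen', 'edu', 'library'],
})
matrix = time_matrix(demo)
print(matrix)
```

Calling `matrix.to_numpy()` then gives the plain ndarray form needed for clustering.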
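4. Generating the per-person frequency distribution matrix
Continuing from 3: Android and iOS behave differently. An Android device stays connected to Wi-Fi after the screen locks, while an iOS device drops the wireless connection shortly after locking. Measuring people's characteristics by duration is therefore not accurate enough, so I want to build the person-by-place vector matrix from visit frequency instead, counted as the number of distinct days on which a MAC is seen at a place.
P.S.: I also want to split specific areas into time periods to distinguish groups of people, for example splitting teaching buildings into 7:00-22:00 and all other times.
So some additional data operations were added on top of the above.
Python code 1:
Place frequency counts that do not need a time-period split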
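A minimal sketch of the time-period split, assuming each record is relabeled before counting: records inside a place type's daytime window keep the place type, and records outside get an `_extra` suffix. The hour windows are the ones recorded in this post (for example 07:00-22:00 for teaching buildings); the helper name `period_label`, the `WINDOWS` table, and treating the end hour as exclusive are my assumptions.

```python
# Daytime windows per place type as (start hour, end hour).
# Place types not listed here are not split by time period.
WINDOWS = {
    'edu': (7, 22),
    'activity': (7, 21),
    'administration': (7, 21),
    'lab': (7, 21),
    'canteen': (7, 23),
}


def period_label(timestamp, ptype):
    """Return ptype for in-window records, ptype + '_extra' otherwise.

    Hypothetical helper; timestamp is a 'YYYY-MM-DD HH:MM:SS' string.
    """
    if ptype not in WINDOWS:
        return ptype
    start_hour, end_hour = WINDOWS[ptype]
    hour = int(timestamp[11:13])
    # Assumption: the window includes start_hour and excludes end_hour
    if start_hour <= hour < end_hour:
        return ptype
    return ptype + '_extra'


print(period_label('2017-09-11 08:30:00', 'edu'))       # edu
print(period_label('2017-09-11 23:10:00', 'edu'))       # edu_extra
print(period_label('2017-09-11 03:00:00', 'male_dor'))  # male_dor
```

Counting distinct days per relabeled place type then yields the in-window and extra frequencies in one pass.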
# -*- coding: UTF-8 -*-
__author__ = 'SuZibo'

"""
Build each person's time feature matrix (distribution over places)
Place lists
male_dor=[141,145,146,149,151]
famale_dor=[148,150,152,153]
postgraduate_dor=[142]
net=[217,229]
hospital=[192]
canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
stadium=[189,190,191]
activity=[183,184,185,186,187,188,232]
administration=[221,222,223]
library=[193,194,195,227]
Final data structure: {'teaching building (07:00-22:00)': 1, 'teaching building (other times)': 0, 'male dormitory': 0, 'postgraduate dormitory': 0, 'female dormitory': 0, 'student activity center (07:00-21:00)': 0, 'student activity center (other times)': 0, 'administration building (07:00-21:00)': 0, 'administration building (other times)': 0, 'lab building (07:00-21:00)': 0, 'lab building (other times)': 0, 'canteen (07:00-23:00)': 0, 'canteen (other times)': 0}
edu,edu1,male_dor,postgraduate_dor,famale_dor,activity,activity1,administration,administration1,lab,lab1,canteen,canteen1,library,hospital,stadium
"""

mac_count_dic = dict()

with open('../macinfo/macdata/normalinfo_trans_v2.txt') as file:
    for line in file:
        # Fields: timestamp, time range, place id, place type, mac, holiday flag
        line = line.strip('\n').split(',')
        # MM-DD part of the timestamp
        day = line[0][5:10]
        if line[4] not in mac_count_dic:
            mac_count_dic[line[4]] = dict()
        if line[3] not in mac_count_dic[line[4]]:
            mac_count_dic[line[4]][line[3]] = dict()
        mac_count_dic[line[4]][line[3]][day] = mac_count_dic[line[4]][line[3]].get(day, 0) + 1
# Nested dictionary: mac_count_dic[mac][place][day] = number of records

# Place types that are not split by time period, in output column order
places = ['male_dor', 'famale_dor', 'postgraduate_dor', 'net', 'hospital', 'stadium']

with open('./data/user_count_array_includex.csv', 'w') as rs:
    # Walk the dictionary; dis is the inner {place: {day: count}} mapping
    for mac, dis in mac_count_dic.items():
        # Frequency = number of distinct days the MAC was seen at the place,
        # 0 if it was never seen there
        counts = [len(dis.get(place, {})) for place in places]
        rs.write(str(mac) + ',' + ','.join(str(c) for c in counts) + '\n')
# Output rows: mac,male_dor_count,famale_dor_count,...
# mac serves as the index
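In the same way, obtain the frequency dictionary for the 7:00-22:00 window and the frequency dictionary for the remaining ("extra") time.
Create three DataFrame objects, named df1, df2 and df3.
Python code 2:
Merging the DataFrame objects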
# -*- coding: UTF-8 -*-
import pandas as pd

__author__ = 'SuZibo'

# Each file has one more column (the mac, first) than the names listed here,
# so pandas uses that first column as the index
df1 = pd.read_csv('./data/user_count_array_includex_1.csv',names=['canteen','edu','lab','activity','administration','library'])
df2 = pd.read_csv('./data/user_count_array_includex_1_extra.csv',names=['canteen_extra','edu_extra','lab_extra','activity_extra','administration_extra','library_extra'])
df3 = pd.read_csv('./data/user_count_array_includex.csv',names=['male_dor','famale_dor','postgraduate_dor','net','hospital','stadium'])

# Join on the mac index: df2 first, then df1, then df3
df = df2.join(df1)
df = df.join(df3)
# Drop rows in which every value is NA, then fill the remaining NA values with 0
df = df.dropna(how='all')
df = df.fillna(0)
# CSV with index and header
df.to_csv('./data/user_TimeArray_includex.csv')
# CSV without index or header
df.to_csv('./data/user_TimeArray.csv',index=False,header=False)
This completes the generation of the per-person frequency vector matrix.
Matrix snippet:
,canteen_extra,edu_extra,lab_extra,activity_extra,administration_extra,library_extra,canteen,edu,lab,activity,administration,library,male_dor,famale_dor,postgraduate_dor,net,hospital,stadium
483b38cac86d,15,10,0,0,0,3,15.0,9.0,0.0,0.0,0.0,3.0,0,12,0,3,1,1
786256354ae3,9,1,1,0,0,0,9.0,1.0,1.0,0.0,0.0,0.0,5,0,0,0,1,0
908d6c7faa0c,7,13,0,0,0,2,7.0,13.0,0.0,0.0,0.0,2.0,0,6,0,0,0,0
4c49e31c7c69,20,10,3,6,13,19,20.0,10.0,3.0,6.0,13.0,19.0,0,1,22,0,4,3
58449877c1c5,3,7,8,0,2,1,3.0,7.0,8.0,0.0,2.0,1.0,0,0,4,0,0,2
64cc2e771dd3,21,10,6,0,2,6,21.0,10.0,6.0,0.0,2.0,6.0,3,38,0,0,4,1
9cb2b2c7ad65,3,10,2,0,10,0,3.0,10.0,2.0,0.0,10.0,0.0,0,0,0,1,0,0
742344e4ff39,10,3,1,0,0,1,10.0,3.0,1.0,0.0,0.0,1.0,5,0,0,0,0,0
1ccde57a678a,7,4,0,0,0,6,7.0,4.0,0.0,0.0,0.0,6.0,11,0,0,0,1,0
ecdf3ad00c44,15,9,3,0,0,0,15.0,9.0,3.0,0.0,0.0,0.0,0,13,0,1,0,0
f431c39cf8cc,8,4,0,0,0,0,8.0,4.0,0.0,0.0,0.0,0.0,12,0,0,2,1,1
f40e22420be9,18,32,14,0,11,8,17.0,32.0,12.0,0.0,11.0,8.0,5,0,0,0,0,1
68fb7eee63e9,13,6,0,0,1,1,13.0,6.0,0.0,0.0,1.0,1.0,0,15,0,0,2,0
205d47642a4c,17,12,4,2,0,1,17.0,12.0,4.0,2.0,0.0,1.0,27,4,0,0,2,12
b0e235c341d5,13,11,0,1,0,1,13.0,11.0,0.0,1.0,0.0,1.0,0,18,0,0,0,0
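The `.0` values in the middle columns of the snippet are consistent with how `DataFrame.join` works: by default it is a left join on the index, so a MAC missing from the right-hand frame gets NaN, which turns that integer column into floats even after `fillna(0)`. The following sketch reproduces the effect with made-up MACs and counts; it illustrates the pandas behavior and is not a claim about which rows were actually missing in the real data.

```python
import pandas as pd

# Made-up frames: 'cc' exists only in the left frame
left = pd.DataFrame({'canteen_extra': [15, 9, 7]}, index=['aa', 'bb', 'cc'])
right = pd.DataFrame({'canteen': [15, 9]}, index=['aa', 'bb'])

# Left join on the index: 'cc' has no match, so its 'canteen' value is NaN
joined = left.join(right)
filled = joined.dropna(how='all').fillna(0)

print(filled)
print(filled.dtypes)
```

If integer output matters, casting with `filled.astype(int)` after `fillna(0)` is one option.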
The next post will describe and study the ARIMA model.