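Data preprocessing demo

Posted by 娃哈哈店長 on 2020-02-19

Original URL: https://learnku.com/articles/40418?order_by=created_at&

Preprocess the loaded data: for example, strip the year from movie titles, encode user gender as 0/1, and encode user age groups as 0 to 6.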

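Data preprocessing

UserID, Occupation, and MovieID are left unchanged.
Gender field: 'F' and 'M' are converted to 0 and 1.
Age field: converted to 7 consecutive integers, 0 to 6.
Genres field: a categorical field that must be converted to numbers. First build a string-to-number dictionary from the categories in Genres, then convert each movie's Genres field into a list of numbers, because some movies combine several genres.
Title field: handled the same way as Genres. First build a word-to-number dictionary, then convert the words in each Title into a list of numbers. The year in the Title also needs to be removed.
Genres and Title must be padded to a uniform length so that they are easy to handle in the neural network. Empty positions are filled with the number that corresponds to '<PAD>'.

Build an object for the data processing. Its default initializer stores the paths of the three datasets so they are easy to access: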

import re
import pandas as pd

class MovieRecommend():
    def __init__(self):
        self.UserDataPath = './ml-1m/users.dat'
        self.MoviesDataPath = './ml-1m/movies.dat'
        self.RatingsDataPath = './ml-1m/ratings.dat'

Preprocessing the user data

As above:

Gender field: 'F' and 'M' are converted to 0 and 1.
Age field: converted to 7 consecutive integers, 0 to 6.

    def load_user_data(self):
        '''
        load & preprocess the user dataset from file
        '''
        users_title = ['UserID', 'Gender', 'Age', 'JobID', 'Zip-code']

        # the file uses '::' as separator, which needs the python parsing engine
        users = pd.read_csv(self.UserDataPath, sep='::', header=None, names=users_title, engine='python')

        # map 'Gender' from 'F'/'M' to 0/1
        gender_map = {'F': 0, 'M': 1}
        users['Gender'] = users['Gender'].map(gender_map)
        # map 'Age' to consecutive codes: {1: 0, 18: 1, 25: 2, 35: 3, 45: 4, 50: 5, 56: 6}
        age_map = {val: ii for ii, val in enumerate(sorted(set(users['Age'])))}
        users['Age'] = users['Age'].map(age_map)

        return users
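
To see the two mappings in isolation, here is a minimal sketch that uses only the standard library instead of pandas. The sample lines are assumed to follow the 'UserID::Gender::Age::JobID::Zip-code' layout of users.dat, with values taken from the table shown below; `parse_user_line` is a hypothetical helper, not part of the original class.

```python
# The seven age codes listed in the comment of load_user_data.
AGE_CODES = [1, 18, 25, 35, 45, 50, 56]

GENDER_MAP = {'F': 0, 'M': 1}
# Sorted age codes are numbered 0..6 in ascending order.
AGE_MAP = {val: ii for ii, val in enumerate(sorted(AGE_CODES))}


def parse_user_line(line):
    """Parse one 'UserID::Gender::Age::JobID::Zip-code' record (hypothetical helper)."""
    user_id, gender, age, job_id, zip_code = line.strip().split('::')
    return {
        'UserID': int(user_id),
        'Gender': GENDER_MAP[gender],
        'Age': AGE_MAP[int(age)],
        'JobID': int(job_id),
        'Zip-code': zip_code,  # kept as a string so leading zeros survive
    }


# Sample records, assumed to match the rows printed in the tables below.
sample_lines = [
    '1::F::1::10::48067',
    '2::M::56::16::70072',
    '4::M::45::7::02460',
]
users = [parse_user_line(line) for line in sample_lines]
for user in users:
    print(user)
```

Keeping the zip code as a string is a deliberate choice in this sketch: converting it to an integer would drop the leading zero of values such as 02460.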

Data before processing:

(env) PS F:\movie_recommender> & f:/movie_recommender/env/Scripts/python.exe f:/movie_recommender/test.py
      UserID Gender  Age  JobID Zip-code
0          1      F    1     10    48067
1          2      M   56     16    70072
2          3      M   25     15    55117
3          4      M   45      7    02460
4          5      M   25     20    55455
...      ...    ...  ...    ...      ...
6035    6036      F   25     15    32603
6036    6037      F   45      1    76006
6037    6038      F   56      1    14706
6038    6039      F   45      0    01060
6039    6040      M   25      6    11106

[6040 rows x 5 columns]

Data after processing:

(env) PS F:\movie_recommender> & f:/movie_recommender/env/Scripts/python.exe f:/movie_recommender/test.py
      UserID  Gender  Age  JobID Zip-code
0          1       0    0     10    48067
1          2       1    6     16    70072
2          3       1    2     15    55117
3          4       1    4      7    02460
4          5       1    2     20    55455
...      ...     ...  ...    ...      ...
6035    6036       0    2     15    32603
6036    6037       0    4      1    76006
6037    6038       0    6      1    14706
6038    6039       0    4      0    01060
6039    6040       1    2      6    11106

[6040 rows x 5 columns]

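Preprocessing the movie data

Title field: handled the same way as Genres. First build a word-to-number dictionary, then convert the words in each Title into a list of numbers. The year in the Title also needs to be removed.
Genres field: a categorical field that must be converted to numbers. First build a string-to-number dictionary from the categories in Genres, then convert each movie's Genres field into a list of numbers, because some movies combine several genres.
Genres and Title must be padded to a uniform length so that they are easy to handle in the neural network. Empty positions are filled with the number that corresponds to '<PAD>'.

Remove the year from the title: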

# Remove the 'year' from Title: group(1) is everything before the trailing "(digits)"
pattern = re.compile(r'^(.*)\((\d+)\)$')
title_map = {val: pattern.match(val).group(1) for val in set(movies['Title'])}
movies['Title'] = movies['Title'].map(title_map)
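
The behaviour of this pattern is easier to see on a few strings. The sketch below reuses the same regular expression; the sample titles and the `strip_year` helper are assumptions for illustration. Two things differ from the snippet above and are additions of this sketch: the captured text is stripped of the space left in front of the year, and a title without a year is returned unchanged instead of raising an error on `None`.

```python
import re

# Same pattern as above: everything before a trailing "(digits)" group.
pattern = re.compile(r'^(.*)\((\d+)\)$')


def strip_year(title):
    """Return the title without its trailing '(year)' (hypothetical helper)."""
    match = pattern.match(title)
    if match is None:
        # No trailing year: keep the title as it is.
        return title.strip()
    # group(1) still ends with the space that preceded '(year)'.
    return match.group(1).strip()


# Sample titles, assumed to resemble entries in movies.dat.
titles = ['Toy Story (1995)', 'Seven (Se7en) (1995)', 'No Year Here']
cleaned = [strip_year(t) for t in titles]
print(cleaned)
```

Because `(.*)` is greedy, only the last parenthesised number is removed, so an alternative name in parentheses stays in the title.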

Convert the movie genres into a number dictionary:

genres_set = set()

for val in movies['Genres'].str.split('|'):  # movie genres are separated by '|'
    genres_set.update(val)  # store all genres

genres_set.add('<PAD>')
genres2int = {val: ii for ii, val in enumerate(genres_set)}

All genres:

{'Horror', 'Action', 'Mystery', 'Film-Noir', 'Western', 'Drama', 'Crime', 'Fantasy', 'Musical', "Children's", 'Documentary', '<PAD>', 'Romance', 'Sci-Fi', 'Animation', 'Comedy', 'Adventure', 'Thriller', 'War'}

Converted into the number dictionary genres2int:

{'Adventure': 0, 'Documentary': 1, 'Film-Noir': 2, "Children's": 3, 'War': 4, 'Sci-Fi': 5, 'Romance': 6, 'Comedy': 7, '<PAD>': 8, 'Thriller': 9, 'Horror': 10, 'Western': 11, 'Drama': 12, 'Action': 13, 'Crime': 14, 'Animation': 15, 'Fantasy': 16, 'Mystery': 17, 'Musical': 18}

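Convert each movie's genres into a fixed-length list of 18, padding with <PAD> where needed. Because the dictionary is built from a set, the numbering is not stable: the indices 0 to 18 are assigned differently on each run, so <PAD> may map to 8 in one run and 10 in the next. The movies dataset after this change (partial data):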

      MovieID                         Title                                             Genres
0           1                    Toy Story   [11, 17, 16, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
1           2                      Jumanji   [14, 17, 10, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
2           3             Grumpier Old Men   [16, 1, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
3           4            Waiting to Exhale   [16, 18, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
4           5  Father of the Bride Part II   [16, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
...       ...                           ...                                                ...
3878     3948             Meet the Parents   [16, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
3879     3949          Requiem for a Dream   [18, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
3880     3950                    Tigerland   [18, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
3881     3951             Two Family House   [18, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
3882     3952               Contender, The   [18, 15, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
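
The post shows the padded result but not the code that produced it. The following is a minimal sketch of one way to do it, not necessarily the author's. The helper names `build_vocab` and `encode_and_pad` and the three sample genre strings are assumptions. Sorting the tokens before numbering them is a substitution made here: Python randomizes string hashes per process by default, so iterating a set of strings can yield a different order on each run, while a sorted list always yields the same numbering.

```python
def build_vocab(token_lists, pad_token='<PAD>'):
    """Map every token (plus the pad token) to an index, in sorted order."""
    tokens = set()
    for toks in token_lists:
        tokens.update(toks)
    tokens.add(pad_token)
    # sorted() makes the numbering identical on every run.
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}


def encode_and_pad(tokens, vocab, length, pad_token='<PAD>'):
    """Convert tokens to indices, truncate to `length`, then pad on the right."""
    ids = [vocab[t] for t in tokens][:length]
    return ids + [vocab[pad_token]] * (length - len(ids))


# Sample values in the '|'-separated format of the Genres column (assumed).
raw_genres = [
    "Animation|Children's|Comedy",
    "Adventure|Children's|Fantasy",
    "Comedy|Romance",
]
genre_lists = [g.split('|') for g in raw_genres]

genres2int = build_vocab(genre_lists)
encoded = [encode_and_pad(g, genres2int, 18) for g in genre_lists]

print(genres2int)
for row in encoded:
    print(row)
```

With a pandas DataFrame, the same idea could be applied by mapping each value of the Genres column through `encode_and_pad` after splitting it on '|'.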

For the title, because a movie name is made up of several words, we start with a simple approach:

Collect every word that appears in any title into a set:

, 'Jerry', 'Prince', 'Joe,', 'Richie', 'Mutters', 'Darkness,', 'Vanities', 'Bell,', 'Traffic', 'Jakob', 'Pride', 'Short', 'Romy', 'Got', 'Supper,', 'Wicked', 'Tender', 'del', 'European', 'Deuce', 'Criminels)', 'Crockett,', 'Monte', 'Tigerland', 'belle,', 'Fool', '蒫ole', 'Mall', 'Ozona', 'Purple,', 'Dudley', 'Gigolo', 'Duck', 'Troopers', 'Sleepwalkers', 'Wolves', 'Nuts', '(Zerkalo)', 'Dollhouse', 'Inheritors,', 'Tetsuo', 'Plutonium', 'Good,', 'Cujo',
'(Rekopis', 'Railroad', "It's", 'FairyTale:', 'Paulie', 'Couple,', 'Mighty', 'Just', 'Brothers,', 'Logo', 'Associate,', 'Babyfever', 'Cruise', "Can't",
'Cats', 'Sensibility', 'Half', 'Transit', 'Lives', 'Conformista)', 'Plan', 'Faith', 'Saboteur', 'Bowl,', 'pianista', 'May', 'Turkish', 'Hush', 'Thunderball', 'Alaska', 'Farm', 'Hedd', 'ni', "Fishin'", 'Witness', 'Ruby', 'Scent', '(Se7en)', 'Sunday', 'Passi', 'Swing', 'Assignment,', 'Bits', 'Time', 'Phantoms', 'Verdict,', 'Chapter,', "I'll", 'peor', 'Bloody', 'Mercury', 'Endurance', 'Face', 'Stonewall', 'Stock', 'John', 'Les', 'Place', 'Turtle', 'Prisoner,', 'Ice', 'Love!', 'Choice', 'Fletch', 'Romancing', 'Happy,', 'Gloria', 'Notorious', 'Me,', '鰃a)', 'Lifeboat', 'Peak', 'Ruling', 'Crush,', 'Portrait', 'Extremities', 'SS', 'Rage', 'Sorority', 'Tron', '(Karakter)', 'Ed', 'Africa,', 'Roberts', '(Jing', 'Mifune', 'Jimmy,', 'Consequences,', 'Gambler,'
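
The code further down uses a dictionary named title2int whose construction is not shown in this post. A minimal sketch of how such a dictionary might be built from the word set, assuming titles are split on whitespace only (which is why tokens such as 'Joe,' above keep their punctuation). The sample titles and the `encode_title` helper are assumptions, and sorting the words is an addition that keeps the numbering the same across runs.

```python
# Sample titles with the year already removed (assumed examples).
titles = ['Toy Story', 'Contender, The', 'Father of the Bride Part II']

# Collect every whitespace-separated word; punctuation stays attached.
title_set = set()
for title in titles:
    title_set.update(title.split())
title_set.add('<PAD>')

# Number the words in sorted order so the mapping is reproducible.
title2int = {word: idx for idx, word in enumerate(sorted(title_set))}

title_count = 15  # fixed length used for titles in this post


def encode_title(title):
    """Convert a title to word indices, padded on the right to title_count."""
    ids = [title2int[word] for word in title.split()][:title_count]
    return ids + [title2int['<PAD>']] * (title_count - len(ids))


print(title2int)
print(encode_title('Toy Story'))
```

A splitter that strips punctuation would produce a smaller vocabulary, at the cost of departing from the simple approach described here.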

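As with the genres, we generate a list of length 15 for each title, padding with <PAD> where needed. Likewise, because the dictionary is built dynamically, the index assigned to each word differs on every run; in the run shown below, <PAD> maps to 1792.

Code: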

# (continues inside the movie-loading method)
# convert each title to a fixed-length list of word indices, 15 in length
title_count = 15
title_map = {val: [title2int[row] for row in val.split()] for val in set(movies['Title'])}

for key in title_map:
    # append the <PAD> index until the list reaches title_count entries
    for cnt in range(title_count - len(title_map[key])):
        title_map[key].insert(len(title_map[key]) + cnt, title2int['<PAD>'])

movies['Title'] = movies['Title'].map(title_map)

return movies

Result:

(env) PS F:\movie_recommender> & f:/movie_recommender/env/Scripts/python.exe f:/movie_recommender/test.py
      MovieID                                              Title                                             Genres
0           1  [4581, 1639, 1792, 1792, 1792, 1792, 1792, 179...  [17, 18, 5, 13, 13, 13, 13, 13, 13, 13, 13, 13...
1           2  [3920, 1792, 1792, 1792, 1792, 1792, 1792, 179...  [16, 18, 8, 13, 13, 13, 13, 13, 13, 13, 13, 13...
2           3  [3661, 537, 1011, 1792, 1792, 1792, 1792, 1792...  [5, 11, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13...
3           4  [4026, 2962, 5003, 1792, 1792, 1792, 1792, 179...  [5, 1, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,...
4           5  [3996, 2973, 4394, 4317, 3618, 2947, 1792, 179...  [5, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13...
...       ...                                                ...                                                ...
3878     3948  [1800, 4394, 5012, 1792, 1792, 1792, 1792, 179...  [5, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13...
3879     3949  [298, 4143, 3656, 5191, 1792, 1792, 1792, 1792...  [1, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13...
3880     3950  [3178, 1792, 1792, 1792, 1792, 1792, 1792, 179...  [1, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13...
3881     3951  [1513, 2135, 2105, 1792, 1792, 1792, 1792, 179...  [1, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13...
3882     3952  [3063, 2235, 1792, 1792, 1792, 1792, 1792, 179...  [1, 2, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,...

[3883 rows x 3 columns]

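This work is licensed under a CC license; reposts must credit the author and link to this article.

This article was first published on my blog, Stray_Camel(＾Ｕ＾)ノ~ＹＯ.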
