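Quantitative data is numeric in nature and should measure the quantity of something.
Qualitative data is categorical in nature and should describe the quality of something.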
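The distinction can be sketched on a tiny table. The DataFrame below is hypothetical toy data (the column names `height_cm` and `eye_color` are made up for illustration and are not part of the salary dataset); it only shows which kind of operation suits each kind of column.

```python
import pandas as pd

# Hypothetical toy data, not taken from the salary dataset.
toy = pd.DataFrame({
    "height_cm": [172.5, 160.0, 181.2],        # quantitative: measures an amount
    "eye_color": ["brown", "blue", "brown"],   # qualitative: names a category
})

# Arithmetic is meaningful for the quantitative column ...
mean_height = toy["height_cm"].mean()

# ... while counting how often each category occurs suits the qualitative one.
color_counts = toy["eye_color"].value_counts()

print(mean_height)
print(color_counts)
```

Averaging `eye_color` would make no sense, which is a quick practical test for whether a column is qualitative.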
The first rows of the dataset are shown below with all of its columns; some are qualitative and some are quantitative:
import pandas as pd
pd.options.display.max_columns = None
pd.set_option('expand_frame_repr', False)
salary_ranges = pd.read_csv('./data/Salary_Ranges_by_Job_Classification.csv')
print(salary_ranges.head())
# SetID JobCode Eff Date SalEndDate SalarySetID SalPlan Grade Step BiweeklyHighRate BiweeklyLowRate UnionCode ExtendedStep PayType
# 0 COMMN 109 07/01/2009 12:00:00 AM 06/30/2010 12:00:00 AM COMMN SFM 0 1 $0.00 $0.00 330 0 C
# 1 COMMN 110 07/01/2009 12:00:00 AM 06/30/2010 12:00:00 AM COMMN SFM 0 1 $15.00 $15.00 323 0 D
# 2 COMMN 111 07/01/2009 12:00:00 AM 06/30/2010 12:00:00 AM COMMN SFM 0 1 $25.00 $25.00 323 0 D
# 3 COMMN 112 07/01/2009 12:00:00 AM 06/30/2010 12:00:00 AM COMMN SFM 0 1 $50.00 $50.00 323 0 D
# 4 COMMN 114 07/01/2009 12:00:00 AM 06/30/2010 12:00:00 AM COMMN SFM 0 1 $100.00 $100.00 323 0 M
The .info() method reports each column's name and type, along with the number of non-null rows in it:
print(salary_ranges.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1356 entries, 0 to 1355
# Data columns (total 13 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 SetID 1356 non-null object
# 1 Job Code 1356 non-null object
# 2 Eff Date 1356 non-null object
# 3 Sal End Date 1356 non-null object
# 4 Salary SetID 1356 non-null object
# 5 Sal Plan 1356 non-null object
# 6 Grade 1356 non-null object
# 7 Step 1356 non-null int64
# 8 Biweekly High Rate 1356 non-null object
# 9 Biweekly Low Rate 1356 non-null object
# 10 Union Code 1356 non-null int64
# 11 Extended Step 1356 non-null int64
# 12 Pay Type 1356 non-null object
# dtypes: int64(3), object(10)
# memory usage: 137.8+ KB
# None
The following gives a quicker count of the missing values in each column:
print(salary_ranges.isnull().sum())
# SetID 0
# Job Code 0
# Eff Date 0
# Sal End Date 0
# Salary SetID 0
# Sal Plan 0
# Grade 0
# Step 0
# Biweekly High Rate 0
# Biweekly Low Rate 0
# Union Code 0
# Extended Step 0
# Pay Type 0
# dtype: int64
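Because the salary data has no missing values, the output above is all zeros. As a minimal sketch of what the same call reports when values are missing, the toy frame below (hypothetical data, assumed only for illustration) contains a few gaps; `isnull().mean()` is shown as well, since it gives the share of missing rows rather than the count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with deliberately missing entries.
toy = pd.DataFrame({
    "Step": [1, 2, np.nan, 4],
    "Grade": ["A", None, None, "B"],
})

# Number of missing values per column.
missing_counts = toy.isnull().sum()

# Fraction of missing values per column (the mean of a boolean mask).
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```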
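The describe method shows descriptive statistics for quantitative data. By default it summarizes only numeric columns, so pandas treats just three columns as quantitative: Step, Union Code, and Extended Step. Setting Step and Extended Step aside, Union Code is clearly not quantitative. Although the column holds numbers, they do not measure a quantity; they only identify a particular union: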
print(salary_ranges.describe())
# Step Union Code Extended Step
# count 1356.000000 1356.000000 1356.000000
# mean 1.294985 392.676991 0.150442
# std 1.045816 338.100562 1.006734
# min 1.000000 1.000000 0.000000
# 25% 1.000000 21.000000 0.000000
# 50% 1.000000 351.000000 0.000000
# 75% 1.000000 790.000000 0.000000
# max 5.000000 990.000000 11.000000
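One way to stop a numeric code from being summarized as if it were a quantity is to store it as a string. The sketch below uses a hypothetical toy frame (the values are made up, only the column names echo the dataset) and is one possible approach, not necessarily the one the original analysis takes.

```python
import pandas as pd

# Hypothetical toy frame: "Union Code" holds identifiers, not amounts.
toy = pd.DataFrame({
    "Step": [1, 1, 2, 5],
    "Union Code": [21, 351, 351, 790],
})

# Store the codes as strings so pandas no longer treats them as numbers.
toy["Union Code"] = toy["Union Code"].astype(str)

# describe() now summarizes only the remaining numeric column.
numeric_summary = toy.describe()

# For a qualitative column, frequencies are the meaningful summary.
code_counts = toy["Union Code"].value_counts()

print(numeric_summary)
print(code_counts)
```

A mean of 392.68 for Union Code, as in the output above, has no interpretation; a frequency table of the codes does.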
The most noteworthy features are one quantitative column, Biweekly High Rate (the highest biweekly pay), and one qualitative column, Grade (the type of job):
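# .copy() yields an independent frame, so later column assignments do not trigger SettingWithCopyWarning
salary_ranges = salary_ranges[['BiweeklyHighRate', 'Grade']].copy()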
print(salary_ranges.head())
# BiweeklyHighRate Grade
# 0 $0.00 0
# 1 $15.00 0
# 2 $25.00 0
# 3 $50.00 0
# 4 $100.00 0
Check the types of the two columns:
print(salary_ranges.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1356 entries, 0 to 1355
# Data columns (total 2 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 BiweeklyHighRate 1356 non-null object
# 1 Grade 1356 non-null object
# dtypes: object(2)
# memory usage: 21.3+ KB
# None
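Let's clean the data by removing the dollar sign in front of each salary, so that the column can be given the correct type. Quantitative data is generally stored as integers or floats (preferably floats), while qualitative data is generally stored as strings or Unicode objects.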
salary_ranges['BiweeklyHighRate'] = salary_ranges['BiweeklyHighRate'].map(lambda value:value.replace('$',''))
print(salary_ranges.head())
# BiweeklyHighRate Grade
# 0 0.00 0
# 1 15.00 0
# 2 25.00 0
# 3 50.00 0
# 4 100.00 0
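The data types have not changed; removing the dollar sign still leaves strings, so both columns are reported as object: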
print(salary_ranges.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1356 entries, 0 to 1355
# Data columns (total 2 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 BiweeklyHighRate 1356 non-null object
# 1 Grade 1356 non-null object
# dtypes: object(2)
# memory usage: 21.3+ KB
# None
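Why the remaining object dtype matters can be seen in a minimal sketch on hypothetical values (three made-up rates, not rows from the dataset): while the numbers are still strings, comparisons are lexicographic, so the "largest" value is not the largest amount.

```python
import pandas as pd

# Hypothetical rates in the same "$amount" format as the dataset.
rates = pd.Series(["$15.00", "$100.00", "$25.00"])

# Same cleaning step as above: strip the dollar sign, values stay strings.
stripped = rates.map(lambda value: value.replace("$", ""))

# Strings compare character by character, so "25.00" beats "100.00".
string_max = max(stripped.tolist())

# After converting to float, the comparison is numeric.
numeric_max = stripped.astype(float).max()

print(pd.api.types.is_numeric_dtype(stripped))  # False
print(string_max)    # 25.00
print(numeric_max)   # 100.0
```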
Convert the BiweeklyHighRate column to floats and the Grade column to strings (string columns are reported below with the object dtype):
salary_ranges['BiweeklyHighRate'] = salary_ranges['BiweeklyHighRate'].astype(float)
salary_ranges['Grade'] = salary_ranges['Grade'].astype(str)
print(salary_ranges.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1356 entries, 0 to 1355
# Data columns (total 2 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 BiweeklyHighRate 1356 non-null float64
# 1 Grade 1356 non-null object
# dtypes: float64(1), object(1)
# memory usage: 21.3+ KB
# None
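The two cleaning steps can also be folded into one defensive helper. The sketch below is an illustration under stated assumptions: `parse_currency` is a hypothetical name, the toy rows are made up, and the handling of thousands separators and unparseable entries is an assumption, since the rows shown above contain neither.

```python
import pandas as pd

def parse_currency(column: pd.Series) -> pd.Series:
    """Hypothetical helper: turn strings such as '$1,250.50' into floats."""
    cleaned = (
        column.astype(str)
        .str.replace("$", "", regex=False)   # literal dollar sign, not a regex anchor
        .str.replace(",", "", regex=False)   # assumed thousands separator
    )
    # errors="coerce" turns unparseable entries into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")

# Hypothetical toy rows, not taken from the salary dataset.
toy = pd.DataFrame({
    "BiweeklyHighRate": ["$0.00", "$15.00", "$1,250.50", "n/a"],
    "Grade": [0, 0, 7450, 7450],
})

toy["BiweeklyHighRate"] = parse_currency(toy["BiweeklyHighRate"])
toy["Grade"] = toy["Grade"].astype(str)

print(toy)
print(toy.dtypes)
```

Compared with a plain `astype(float)`, which raises on the first bad value, coercing to NaN keeps the conversion running and leaves the problem rows visible to `isnull().sum()`.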