機器學習基礎04DAY

ThankCAT發表於2023-03-25

scikit-learn資料集

我們將介紹sklearn中的資料集類,模組包括用於載入資料集的實用程式,包括載入和獲取流行參考資料集的方法。它還具有一些人工資料生成器。

sklearn.datasets

(1)datasets.load_*()

獲取小規模資料集,資料包含在datasets裡

(2)datasets.fetch_*()

獲取大規模資料集,需要從網路上下載,函式的第一個引數是data_home,表示資料集下載的目錄,預設是 ~/scikit_learn_data/,要修改預設目錄,可以修改環境變數SCIKIT_LEARN_DATA

(3)datasets.make_*()

本地生成資料集

load*和 fetch* 函式返回的資料型別是 datasets.base.Bunch,本質上是一個 dict,它的鍵值對可用透過物件的屬性方式訪問。主要包含以下屬性:

  • data:特徵資料陣列,是 n_samples * n_features 的二維 numpy.ndarray 陣列
  • target:標籤陣列,是 n_samples 的一維 numpy.ndarray 陣列
  • DESCR:資料描述
  • feature_names:特徵名
  • target_names:標籤名

資料集目錄可以透過datasets.get_data_home()獲取,clear_data_home(data_home=None)刪除所有下載資料

  • datasets.get_data_home(data_home=None)

返回scikit學習資料目錄的路徑。這個資料夾被一些大的資料集裝載器使用,以避免下載資料。預設情況下,資料目錄設定為使用者主資料夾中名為“scikit_learn_data”的資料夾。或者,可以透過“SCIKIT_LEARN_DATA”環境變數或透過給出顯式的資料夾路徑以程式設計方式設定它。'〜'符號擴充套件到使用者主資料夾。如果資料夾不存在,則會自動建立。

  • sklearn.datasets.clear_data_home(data_home=None)

刪除儲存目錄中的資料

獲取小資料集

用於分類

  • sklearn.datasets.load_iris
class sklearn.datasets.load_iris(return_X_y=False)
  """
  載入並返回虹膜資料集

  :param return_X_y: 如果為True,則返回而不是Bunch物件,預設為False

  :return: Bunch物件,如果return_X_y為True,那麼返回tuple,(data,target)
  """
In [12]: from sklearn.datasets import load_iris
    ...: data = load_iris()
    ...:

In [13]: data.target
Out[13]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [14]: data.feature_names
Out[14]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [15]: data.target_names
Out[15]:
array(['setosa', 'versicolor', 'virginica'],
      dtype='|S10')

In [17]: data.target[[1,10, 100]]
Out[17]: array([0, 0, 2])
名稱 數量
類別 3
特徵 4
樣本數量 150
每個類別數量 50
  • sklearn.datasets.load_digits
class sklearn.datasets.load_digits(n_class=10, return_X_y=False)
    """
    載入並返回數字資料集

    :param n_class: 整數,介於0和10之間,可選(預設= 10,要返回的類的數量

    :param return_X_y: 如果為True,則返回而不是Bunch物件,預設為False

    :return: Bunch物件,如果return_X_y為True,那麼返回tuple,(data,target)
    """
In [20]: from sklearn.datasets import load_digits

In [21]: digits = load_digits()

In [22]: print(digits.data.shape)
(1797, 64)

In [23]: digits.target
Out[23]: array([0, 1, 2, ..., 8, 9, 8])

In [24]: digits.target_names
Out[24]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [25]: digits.images
Out[25]:
array([[[  0.,   0.,   5., ...,   1.,   0.,   0.],
        [  0.,   0.,  13., ...,  15.,   5.,   0.],
        [  0.,   3.,  15., ...,  11.,   8.,   0.],
        ...,
        [  0.,   4.,  11., ...,  12.,   7.,   0.],
        [  0.,   2.,  14., ...,  12.,   0.,   0.],
        [  0.,   0.,   6., ...,   0.,   0.,   0.]],

        [[  0.,   0.,  10., ...,   1.,   0.,   0.],
        [  0.,   2.,  16., ...,   1.,   0.,   0.],
        [  0.,   0.,  15., ...,  15.,   0.,   0.],
        ...,
        [  0.,   4.,  16., ...,  16.,   6.,   0.],
        [  0.,   8.,  16., ...,  16.,   8.,   0.],
        [  0.,   1.,   8., ...,  12.,   1.,   0.]]])
名稱 數量
類別 10
特徵 64
樣本數量 1797

用於迴歸

  • sklearn.datasets.load_boston
class  sklearn.datasets.load_boston(return_X_y=False)
  """
  載入並返回波士頓房價資料集

  :param return_X_y: 如果為True,則返回而不是Bunch物件,預設為False

  :return: Bunch物件,如果return_X_y為True,那麼返回tuple,(data,target)
  """
In [34]: from sklearn.datasets import load_boston

In [35]: boston = load_boston()

In [36]: boston.data.shape
Out[36]: (506, 13)

In [37]: boston.feature_names
Out[37]:
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'],
      dtype='|S7')

In [38]:
名稱 數量
目標類別 5-50
特徵 13
樣本數量 506
  • sklearn.datasets.load_diabetes
class sklearn.datasets.load_diabetes(return_X_y=False)
  """
  載入和返回糖尿病資料集

  :param return_X_y: 如果為True,則返回而不是Bunch物件,預設為False

  :return: Bunch物件,如果return_X_y為True,那麼返回tuple,(data,target)
  """
In [13]:  from sklearn.datasets import load_diabetes

In [14]: diabetes = load_diabetes()

In [15]: diabetes.data
Out[15]:
array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286377, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04687948,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452837, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00421986,  0.00306441]])
名稱 數量
目標範圍 25-346
特徵 10
樣本數量 442

獲取大資料集

  • sklearn.datasets.fetch_20newsgroups
class sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True)
  """
  載入20個新聞組資料集中的檔名和資料

  :param subset: 'train'或者'test','all',可選,選擇要載入的資料集:訓練集的“訓練”,測試集的“測試”,兩者的“全部”,具有洗牌順序


  :param data_home: 可選,預設值:無,指定資料集的下載和快取資料夾。如果沒有,所有scikit學習資料都儲存在'〜/ scikit_learn_data'子資料夾中

  :param categories: 無或字串或Unicode的集合,如果沒有(預設),載入所有類別。如果不是無,要載入的類別名稱列表(忽略其他類別)

  :param shuffle: 是否對資料進行洗牌

  :param random_state: numpy隨機數生成器或種子整數

  :param download_if_missing: 可選,預設為True,如果False,如果資料不在本地可用而不是嘗試從源站點下載資料,則引發IOError

  :param remove: 元組
  """
In [29]: from sklearn.datasets import fetch_20newsgroups

In [30]: data_test = fetch_20newsgroups(subset='test',shuffle=True, random_sta
    ...: te=42)

In [31]: data_train = fetch_20newsgroups(subset='train',shuffle=True, random_s
    ...: tate=42)
  • sklearn.datasets.fetch_20newsgroups_vectorized
class sklearn.datasets.fetch_20newsgroups_vectorized(subset='train', remove=(), data_home=None)
  """
  載入20個新聞組資料集並將其轉換為tf-idf向量,這是一個方便的功能; 使用sklearn.feature_extraction.text.Vectorizer的預設設定完成tf-idf 轉換。對於更高階的使用(停止詞過濾,n-gram提取等),將fetch_20newsgroup與自定義Vectorizer或CountVectorizer組合在一起

  :param subset: 'train'或者'test','all',可選,選擇要載入的資料集:訓練集的“訓練”,測試集的“測試”,兩者的“全部”,具有洗牌順序

  :param data_home: 可選,預設值:無,指定資料集的下載和快取資料夾。如果沒有,所有scikit學習資料都儲存在'〜/ scikit_learn_data'子資料夾中

  :param remove: 元組
  """
In [57]: from sklearn.datasets import fetch_20newsgroups_vectorized

In [58]: bunch = fetch_20newsgroups_vectorized(subset='all')

In [59]: from sklearn.utils import shuffle

In [60]: X, y = shuffle(bunch.data, bunch.target)
    ...: offset = int(X.shape[0] * 0.8)
    ...: X_train, y_train = X[:offset], y[:offset]
    ...: X_test, y_test = X[offset:], y[offset:]
    ...:

獲取本地生成資料

生成本地分類資料:

  • sklearn.datasets.make_classification

    class make_classification(n_samples=100, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
    """
    生成用於分類的資料集
    
    :param n_samples:int,optional(default = 100),樣本數量
    
    :param n_features:int,可選(預設= 20),特徵總數
    
    :param n_classes:int,可選(default = 2),類(或標籤)的分類問題的數量
    
    :param random_state:int,RandomState例項或無,可選(預設=無)
      如果int,random_state是隨機數生成器使用的種子; 如果RandomState的例項,random_state是隨機數生成器; 如果沒有,隨機數生成器所使用的RandomState例項np.random
    
    :return :X,特徵資料集;y,目標分類值
    """
    
from sklearn.datasets.samples_generator import make_classification
X,y= datasets.make_classification(n_samples=100000, n_features=20,n_informative=2, n_redundant=10,random_state=42)

生成本地迴歸資料:

  • sklearn.datasets.make_regression
class make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)
  """
  生成用於迴歸的資料集

  :param n_samples:int,optional(default = 100),樣本數量

  :param  n_features:int,optional(default = 100),特徵數量

  :param  coef:boolean,optional(default = False),如果為True,則返回底層線性模型的係數

  :param random_state:int,RandomState例項或無,可選(預設=無)
    如果int,random_state是隨機數生成器使用的種子; 如果RandomState的例項,random_state是隨機數生成器; 如果沒有,隨機數生成器所使用的RandomState例項np.random

  :return :X,特徵資料集;y,目標值
  """
from sklearn.datasets.samples_generator import make_regression
X, y = make_regression(n_samples=200, n_features=5000, random_state=42)

資料的分類與劃分

資料集返回的型別

  • load和fetch返回的資料型別datasets.base.Bunch(字典格式)
  • data: 特徵資料陣列,是[n_samples * n_features]的二維numpy.ndarray陣列
  • target:標籤陣列,是n_samples的一維numpy.ndarray陣列
  • DESCR:資料描述
  • feature_names: 特徵名,新聞資料,手寫數字、迴歸資料集沒有
  • target_names:標籤名,迴歸資料集沒有

分類資料集

鳶尾花資料集

In [ ]:

from sklearn.datasets import load_iris

# 例項化資料集
iris = load_iris()

#檢視特徵值
iris.data

Out[ ]:

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.2],
       [5. , 3.2, 1.2, 0.2],
       [5.5, 3.5, 1.3, 0.2],
       [4.9, 3.6, 1.4, 0.1],
       [4.4, 3. , 1.3, 0.2],
       [5.1, 3.4, 1.5, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [4.5, 2.3, 1.3, 0.3],
       [4.4, 3.2, 1.3, 0.2],
       [5. , 3.5, 1.6, 0.6],
       [5.1, 3.8, 1.9, 0.4],
       [4.8, 3. , 1.4, 0.3],
       [5.1, 3.8, 1.6, 0.2],
       [4.6, 3.2, 1.4, 0.2],
       [5.3, 3.7, 1.5, 0.2],
       [5. , 3.3, 1.4, 0.2],
       [7. , 3.2, 4.7, 1.4],
       [6.4, 3.2, 4.5, 1.5],
       [6.9, 3.1, 4.9, 1.5],
       [5.5, 2.3, 4. , 1.3],
       [6.5, 2.8, 4.6, 1.5],
       [5.7, 2.8, 4.5, 1.3],
       [6.3, 3.3, 4.7, 1.6],
       [4.9, 2.4, 3.3, 1. ],
       [6.6, 2.9, 4.6, 1.3],
       [5.2, 2.7, 3.9, 1.4],
       [5. , 2. , 3.5, 1. ],
       [5.9, 3. , 4.2, 1.5],
       [6. , 2.2, 4. , 1. ],
       [6.1, 2.9, 4.7, 1.4],
       [5.6, 2.9, 3.6, 1.3],
       [6.7, 3.1, 4.4, 1.4],
       [5.6, 3. , 4.5, 1.5],
       [5.8, 2.7, 4.1, 1. ],
       [6.2, 2.2, 4.5, 1.5],
       [5.6, 2.5, 3.9, 1.1],
       [5.9, 3.2, 4.8, 1.8],
       [6.1, 2.8, 4. , 1.3],
       [6.3, 2.5, 4.9, 1.5],
       [6.1, 2.8, 4.7, 1.2],
       [6.4, 2.9, 4.3, 1.3],
       [6.6, 3. , 4.4, 1.4],
       [6.8, 2.8, 4.8, 1.4],
       [6.7, 3. , 5. , 1.7],
       [6. , 2.9, 4.5, 1.5],
       [5.7, 2.6, 3.5, 1. ],
       [5.5, 2.4, 3.8, 1.1],
       [5.5, 2.4, 3.7, 1. ],
       [5.8, 2.7, 3.9, 1.2],
       [6. , 2.7, 5.1, 1.6],
       [5.4, 3. , 4.5, 1.5],
       [6. , 3.4, 4.5, 1.6],
       [6.7, 3.1, 4.7, 1.5],
       [6.3, 2.3, 4.4, 1.3],
       [5.6, 3. , 4.1, 1.3],
       [5.5, 2.5, 4. , 1.3],
       [5.5, 2.6, 4.4, 1.2],
       [6.1, 3. , 4.6, 1.4],
       [5.8, 2.6, 4. , 1.2],
       [5. , 2.3, 3.3, 1. ],
       [5.6, 2.7, 4.2, 1.3],
       [5.7, 3. , 4.2, 1.2],
       [5.7, 2.9, 4.2, 1.3],
       [6.2, 2.9, 4.3, 1.3],
       [5.1, 2.5, 3. , 1.1],
       [5.7, 2.8, 4.1, 1.3],
       [6.3, 3.3, 6. , 2.5],
       [5.8, 2.7, 5.1, 1.9],
       [7.1, 3. , 5.9, 2.1],
       [6.3, 2.9, 5.6, 1.8],
       [6.5, 3. , 5.8, 2.2],
       [7.6, 3. , 6.6, 2.1],
       [4.9, 2.5, 4.5, 1.7],
       [7.3, 2.9, 6.3, 1.8],
       [6.7, 2.5, 5.8, 1.8],
       [7.2, 3.6, 6.1, 2.5],
       [6.5, 3.2, 5.1, 2. ],
       [6.4, 2.7, 5.3, 1.9],
       [6.8, 3. , 5.5, 2.1],
       [5.7, 2.5, 5. , 2. ],
       [5.8, 2.8, 5.1, 2.4],
       [6.4, 3.2, 5.3, 2.3],
       [6.5, 3. , 5.5, 1.8],
       [7.7, 3.8, 6.7, 2.2],
       [7.7, 2.6, 6.9, 2.3],
       [6. , 2.2, 5. , 1.5],
       [6.9, 3.2, 5.7, 2.3],
       [5.6, 2.8, 4.9, 2. ],
       [7.7, 2.8, 6.7, 2. ],
       [6.3, 2.7, 4.9, 1.8],
       [6.7, 3.3, 5.7, 2.1],
       [7.2, 3.2, 6. , 1.8],
       [6.2, 2.8, 4.8, 1.8],
       [6.1, 3. , 4.9, 1.8],
       [6.4, 2.8, 5.6, 2.1],
       [7.2, 3. , 5.8, 1.6],
       [7.4, 2.8, 6.1, 1.9],
       [7.9, 3.8, 6.4, 2. ],
       [6.4, 2.8, 5.6, 2.2],
       [6.3, 2.8, 5.1, 1.5],
       [6.1, 2.6, 5.6, 1.4],
       [7.7, 3. , 6.1, 2.3],
       [6.3, 3.4, 5.6, 2.4],
       [6.4, 3.1, 5.5, 1.8],
       [6. , 3. , 4.8, 1.8],
       [6.9, 3.1, 5.4, 2.1],
       [6.7, 3.1, 5.6, 2.4],
       [6.9, 3.1, 5.1, 2.3],
       [5.8, 2.7, 5.1, 1.9],
       [6.8, 3.2, 5.9, 2.3],
       [6.7, 3.3, 5.7, 2.5],
       [6.7, 3. , 5.2, 2.3],
       [6.3, 2.5, 5. , 1.9],
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]])

In [ ]:

#檢視資料形狀
iris.data.shape
# 4個特徵,150個樣本

Out[ ]:

(150, 4)

In [ ]:

# 檢視標籤陣列
iris.target

Out[ ]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [ ]:

# 檢視資料的描述
print(iris.DESCR)
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

In [ ]:

# 檢視特徵名
iris.feature_names

Out[ ]:

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [ ]:

# 檢視目標名
iris.target_names

Out[ ]:

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

迴歸資料集

糖尿病資料集

In [ ]:

from sklearn.datasets import load_diabetes

#例項化
diabetes = load_diabetes()
print(diabetes.DESCR)
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

In [ ]:

diabetes.data

Out[ ]:

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286131, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04688253,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452873, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00422151,  0.00306441]])

In [ ]:

diabetes.target

Out[ ]:

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 288.,  88., 292.,  71.,
       197., 186.,  25.,  84.,  96., 195.,  53., 217., 172., 131., 214.,
        59.,  70., 220., 268., 152.,  47.,  74., 295., 101., 151., 127.,
       237., 225.,  81., 151., 107.,  64., 138., 185., 265., 101., 137.,
       143., 141.,  79., 292., 178.,  91., 116.,  86., 122.,  72., 129.,
       142.,  90., 158.,  39., 196., 222., 277.,  99., 196., 202., 155.,
        77., 191.,  70.,  73.,  49.,  65., 263., 248., 296., 214., 185.,
        78.,  93., 252., 150.,  77., 208.,  77., 108., 160.,  53., 220.,
       154., 259.,  90., 246., 124.,  67.,  72., 257., 262., 275., 177.,
        71.,  47., 187., 125.,  78.,  51., 258., 215., 303., 243.,  91.,
       150., 310., 153., 346.,  63.,  89.,  50.,  39., 103., 308., 116.,
       145.,  74.,  45., 115., 264.,  87., 202., 127., 182., 241.,  66.,
        94., 283.,  64., 102., 200., 265.,  94., 230., 181., 156., 233.,
        60., 219.,  80.,  68., 332., 248.,  84., 200.,  55.,  85.,  89.,
        31., 129.,  83., 275.,  65., 198., 236., 253., 124.,  44., 172.,
       114., 142., 109., 180., 144., 163., 147.,  97., 220., 190., 109.,
       191., 122., 230., 242., 248., 249., 192., 131., 237.,  78., 135.,
       244., 199., 270., 164.,  72.,  96., 306.,  91., 214.,  95., 216.,
       263., 178., 113., 200., 139., 139.,  88., 148.,  88., 243.,  71.,
        77., 109., 272.,  60.,  54., 221.,  90., 311., 281., 182., 321.,
        58., 262., 206., 233., 242., 123., 167.,  63., 197.,  71., 168.,
       140., 217., 121., 235., 245.,  40.,  52., 104., 132.,  88.,  69.,
       219.,  72., 201., 110.,  51., 277.,  63., 118.,  69., 273., 258.,
        43., 198., 242., 232., 175.,  93., 168., 275., 293., 281.,  72.,
       140., 189., 181., 209., 136., 261., 113., 131., 174., 257.,  55.,
        84.,  42., 146., 212., 233.,  91., 111., 152., 120.,  67., 310.,
        94., 183.,  66., 173.,  72.,  49.,  64.,  48., 178., 104., 132.,
       220.,  57.])

In [ ]:

print("糖尿病資料集的特徵名:{0}\n糖尿病資料集的目標值檔名:{1}".format(diabetes.feature_names, diabetes.target_filename))
糖尿病資料集的特徵名:['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
糖尿病資料集的目標值檔名:diabetes_target.csv.gz

資料集分割(訓練集與測試集)

In [ ]:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.25)

Out[ ]:

(111,)

相關文章