Udacity Deep Learning Course: Assignment 1

Posted by 天才XLM on 2017-08-17

Udacity's Deep Learning course is an online course created by Google and built around TensorFlow. It is short and to the point, consisting of 4 chapters (intro to ML/DL, DNNs, CNNs, RNNs), 6 small assignments (delivered as ipynb notebooks, which is very convenient), and 1 final project (building a real-time camera application).

If you already have an ML/DL background, you can get through the videos quickly, so the real value of the course lies in its hands-on projects, which are quite interesting. Coming from Google, it is arguably one of the more authoritative TensorFlow tutorials.

Course link: here
Assignment link: here

Below is my code for Assignment 1.

Problem 1

Use IPython.display to visualize a few samples:

import os
import numpy as np
from IPython.display import display, Image

def visualize(folders):
    # Show one randomly chosen image from each letter folder.
    for folder_path in folders:
        fnames = os.listdir(folder_path)
        fname = fnames[np.random.randint(len(fnames))]
        display(Image(filename=os.path.join(folder_path, fname)))

print("train_folders")
visualize(train_folders)
print("test_folders")
visualize(test_folders)

Problem 2

Use matplotlib.pyplot to visualize samples:

import pickle
import numpy as np
import matplotlib.pyplot as plt

def visualize_datasets(datasets):
    # Show one randomly chosen image from each pickled letter dataset.
    for dataset in datasets:
        with open(dataset, 'rb') as f:
            letter = pickle.load(f)
        sample_idx = np.random.randint(len(letter))
        plt.figure()
        plt.imshow(letter[sample_idx, :, :])
        plt.show()

visualize_datasets(train_datasets)
visualize_datasets(test_datasets)

Problem 3

Check whether the classes are balanced (i.e., each class has roughly the same number of samples):

import pickle

def check_dataset_is_balanced(datasets, title=None):
    print(title)
    for pickle_file in datasets:
        with open(pickle_file, 'rb') as f:
            ds = pickle.load(f)
        print("label {} has {} samples".format(pickle_file, len(ds)))

check_dataset_is_balanced(train_datasets, "training set")
check_dataset_is_balanced(test_datasets, "test set")

Problem 5

Count the samples that are duplicated across the training, test, and validation sets:

import hashlib
import pickle

def count_duplicates(dataset1, dataset2):
    # Hash every image in dataset1 once; a set makes each membership test O(1).
    hashes = {hashlib.sha1(x.tobytes()).hexdigest() for x in dataset1}
    return sum(1 for x in dataset2
               if hashlib.sha1(x.tobytes()).hexdigest() in hashes)

with open('notMNIST.pickle', 'rb') as f:
    data = pickle.load(f)
print(count_duplicates(data['test_dataset'], data['valid_dataset']))
print(count_duplicates(data['valid_dataset'], data['train_dataset']))
print(count_duplicates(data['test_dataset'], data['train_dataset']))

Problem 6

Train an off-the-shelf model on 50, 100, 1000, 5000, and all of the training samples, using LogisticRegression from sklearn.linear_model:

from sklearn.linear_model import LogisticRegression

def train_and_predict(X_train, y_train, X_test, y_test):
    lr = LogisticRegression()
    # Flatten each 28x28 image into a 784-dimensional feature vector.
    lr.fit(X_train.reshape(X_train.shape[0], 28 * 28), y_train)
    print(lr.score(X_test.reshape(X_test.shape[0], 28 * 28), y_test))

def main():
    # `data` was loaded from notMNIST.pickle above.
    X_train, y_train = data["train_dataset"], data["train_labels"]
    X_test, y_test = data["test_dataset"], data["test_labels"]
    for size in [50, 100, 1000, 5000, None]:
        # Train on the first `size` samples (None means the full set),
        # but always evaluate on the complete test set.
        train_and_predict(X_train[:size], y_train[:size], X_test, y_test)

main()
