Classic Datasets in Artificial Intelligence

Posted by EulerAndGauss on 2018-06-12

Original URL : https://blog.csdn.net/yang9649/article/details/80668694

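Classic datasets in artificial intelligence, reposted from medium.com. The original page cannot be opened in mainland China, so it is reproduced here.
Original page: https://medium.com/startup-grind/fueling-the-ai-gold-rush-7ae438505bc2
Some links may not work properly on a phone; opening this page on a desktop computer to download the data is recommended.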

Computer Vision

MNIST : the most commonly used sanity check. A dataset of 28x28, centered, B&W handwritten digits. It is an easy task - just because something works on MNIST does not mean it works in general.

CIFAR 10 & CIFAR 100 : 32x32 color images. Not commonly used anymore, though once again, can be an interesting sanity check.

ImageNet : the de facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1000-category WordNet hierarchy from ImageNet.

LSUN : Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition.

PASCAL VOC : Generic image segmentation / classification - not terribly useful for building real-world image annotation, but great for baselines.

SVHN : House numbers from Google Street View. Think of this as recurrent MNIST in the wild.

MS COCO : Generic image understanding / captioning, with an associated competition.

Visual Genome : Very detailed visual knowledge base with deep captioning of ~100K images.

Labeled Faces in the Wild : Cropped faces (using Viola-Jones) that have been labeled with a name identifier. A subset of the people present have two images in the dataset - it's quite common for people to train matching systems here.

Natural Language

Text Classification Datasets (Google Drive Link) from Zhang et al., 2015 : An extensive set of eight datasets for text classification. These are the most commonly reported baselines for new text classification work. Sample sizes range from 120K to 3.6M, and tasks range from binary to 14-class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo!, Sogou, and AG.

WikiText : large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind.

Question Pairs : first dataset release from Quora containing duplicate / semantic similarity labels.

SQuAD : The Stanford Question Answering Dataset - a broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a span, or segment of text.

CMU Q/A Dataset : Manually-generated factoid question / answer pairs with difficulty ratings from Wikipedia articles.

Maluuba Datasets : Sophisticated, human-generated datasets for stateful natural language understanding research.

Billion Words : large, general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec or GloVe.

Common Crawl : Petabyte-scale crawl of the web - most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dataset for its crawl of the WWW.

bAbI : synthetic reading comprehension and question answering dataset from Facebook AI Research (FAIR).

The Children's Book Test (download link) : Baseline of (Question + context, Answer) pairs extracted from children's books available through Project Gutenberg. Useful for question answering, reading comprehension, and factoid look-up.

Stanford Sentiment Treebank : standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence's parse tree.

20 Newsgroups : one of the classic datasets for text classification, useful as a benchmark for pure classification or as validation for any IR / indexing algorithm.

Reuters : older, purely classification-based dataset with text from the newswire. Commonly used in tutorials.

IMDB : an older, relatively small dataset for binary sentiment classification. It has fallen out of favor as a benchmark in the literature, displaced by larger datasets.

UCI's Spambase : Older, classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering.
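
The span formulation mentioned for SQuAD above can be illustrated with a small sketch: the answer is stored as character offsets into the context and recovered by slicing. The class name, function names, and example passage are hypothetical, chosen for illustration only, and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass


@dataclass
class SpanAnswer:
    """An answer expressed as a character span inside a context passage."""
    start: int  # index of the first character of the answer
    end: int    # index one past the last character of the answer


def find_span(context: str, answer_text: str) -> SpanAnswer:
    """Locate answer_text in context and return its character span."""
    start = context.find(answer_text)
    if start < 0:
        raise ValueError("answer text does not occur in the context")
    return SpanAnswer(start=start, end=start + len(answer_text))


def extract(context: str, span: SpanAnswer) -> str:
    """Recover the answer string by slicing the context with the span."""
    return context[span.start:span.end]


# Hypothetical passage and answer, made up for this sketch.
context = "The Children's Book Test draws its passages from Project Gutenberg."
span = find_span(context, "Project Gutenberg")
print(extract(context, span))
```

Because the answer is always a slice of the passage, a model only has to predict two positions instead of generating free text, which is what makes evaluation straightforward.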

Speech

Most speech recognition datasets are proprietary - the data holds a lot of value for the company that curates it. Most datasets available in the field are quite old.

2000 HUB5 English : English-only speech data used most recently in the Deep Speech paper from Baidu.

LibriSpeech : Audiobook dataset of text and speech. Nearly 500 hours of clean speech from various audiobooks read by multiple speakers, organized by chapters of the book and containing both the text and the speech.

VoxForge : Clean speech dataset of accented English, useful for instances in which you expect to need robustness to different accents or intonations.

TIMIT : English-only speech recognition dataset.

CHiME : Noisy speech recognition challenge dataset. The dataset contains real, simulated, and clean voice recordings. The real data consists of actual recordings of 4 speakers in nearly 9000 recordings over 4 noisy locations, the simulated data is generated by combining multiple environments with speech utterances, and the clean data consists of non-noisy recordings.

TED-LIUM : Audio transcription of TED talks. Audio recordings of 1495 TED talks along with full text transcriptions of those recordings.

Recommendation and ranking systems

Netflix Challenge : first major Kaggle-style data challenge. Only available unofficially, as privacy issues arose.

MovieLens : various sizes of movie review data - commonly used for collaborative filtering baselines.

Million Song Dataset : large, metadata-rich, open source dataset on Kaggle that can be good for people experimenting with hybrid recommendation systems.

Last.fm : music recommendation dataset with access to underlying social network and other metadata that can be useful for hybrid systems.
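
As a rough idea of what a baseline on MovieLens-style ratings can look like, the sketch below fits a mean-plus-bias predictor (global mean, per-user offset, per-movie offset). This is a simple starting point that is often run before a real collaborative filtering model, not a collaborative filtering algorithm itself. The rating triples and function names are hypothetical and made up for illustration.

```python
from collections import defaultdict

# Hypothetical (user, movie, rating) triples, made up for this sketch.
ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 3.0),
    ("u2", "m1", 4.0), ("u2", "m3", 2.0),
    ("u3", "m2", 4.0), ("u3", "m3", 3.0),
]


def fit_baseline(triples):
    """Estimate the global mean, then user offsets, then movie offsets."""
    global_mean = sum(r for _, _, r in triples) / len(triples)

    # User bias: average deviation of a user's ratings from the global mean.
    user_dev = defaultdict(list)
    for user, _, rating in triples:
        user_dev[user].append(rating - global_mean)
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}

    # Movie bias: average residual left after removing mean and user bias.
    item_dev = defaultdict(list)
    for user, movie, rating in triples:
        item_dev[movie].append(rating - global_mean - user_bias[user])
    item_bias = {m: sum(d) / len(d) for m, d in item_dev.items()}

    return global_mean, user_bias, item_bias


def predict(model, user, movie):
    """Predict a rating; unseen users or movies fall back to a zero offset."""
    global_mean, user_bias, item_bias = model
    return global_mean + user_bias.get(user, 0.0) + item_bias.get(movie, 0.0)


model = fit_baseline(ratings)
print(predict(model, "u3", "m1"))
```

Any collaborative filtering model trained on the same data should at least beat this predictor on held-out ratings, which is why it is a convenient reference point.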

Networks and Graphs

Amazon Co-Purchasing and Amazon Reviews : crawled data from the "users who bought this also bought..." section of Amazon, as well as Amazon review data for related products. Good for experimenting with recommender systems in networks.

Friendster Social Network Dataset : Before their pivot as a gaming website, Friendster released anonymized data in the form of friends lists for 103,750,348 users.
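
Co-purchasing data of the kind described above is naturally a graph, with products as nodes and "also bought" links as edges. The sketch below builds an adjacency list and suggests products two hops away, ranked by how many neighbors they share with the query product. The edge list, the product names, and the choice to treat links as undirected are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical "also bought" pairs, made up for this sketch.
edges = [
    ("book_a", "book_b"), ("book_a", "book_c"),
    ("book_b", "book_c"), ("book_b", "book_d"),
    ("book_c", "book_d"),
]


def build_graph(pairs):
    """Build an adjacency list; links are treated as undirected (assumption)."""
    graph = defaultdict(set)
    for src, dst in pairs:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph


def recommend(graph, product, k=3):
    """Rank products two hops away by the number of shared neighbors."""
    direct = graph.get(product, set())
    scores = Counter()
    for neighbor in direct:
        for candidate in graph.get(neighbor, set()):
            # Skip the query product and anything already linked to it.
            if candidate != product and candidate not in direct:
                scores[candidate] += 1
    return [item for item, _ in scores.most_common(k)]


graph = build_graph(edges)
print(recommend(graph, "book_a"))
```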

Geospatial data

OpenStreetMap : Vector data for the entire planet under a free license. It includes (an older version of) the US Census Bureau's TIGER data.

Landsat 8 : Satellite images of the Earth's entire surface, updated every several weeks.

NEXRAD : Doppler radar scans of atmospheric conditions in the US.
