Source | Links | Data Stored | Free/Paid |
---|---|---|---|
100,000 Faces | https://generated.photos/ | 100,000 Faces Generated by AI. We have built an original machine learning dataset, and used StyleGAN (an amazing resource by NVIDIA) to construct a realistic set of 100,000 faces. Our dataset has been built by taking 29,000+ photos of 69 different models over the last 2 years in our studio.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
3D60 | https://vcl3d.github.io/3D60/ | 3D60 is a collective dataset generated in the context of various 360 vision research works. It comprises multi-modal (i.e. color, depth and normal) omnidirectional stereo renders (i.e. horizontal and vertical) of scenes from realistic and synthetic large-scale 3D datasets (Matterport3D, Stanford2D3D, SunCG). Contains 224,406 spherical panoramas.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
3DPeople Dataset | https://cv.iri.upc-csic.es/ | First dataset for computer vision research of dressed humans with specific geometry representation for the clothes. It contains ~2 Million images with 40 male/40 female performing 70 actions.;Attribution-NonCommercial 4.0 International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes. | Free |
A dataset of English plaintext jokes | https://github.com/taivop/joke-dataset | There are about 208,000 jokes in this database scraped from three sources (reddit, stupidstuff.org, wocka.com).;Parts of the dataset could be under different licenses, check the dataset web page for more information | Free |
A*3D | https://github.com/I2RDL2/ASTAR-3D | A*3D dataset is a step forward to make autonomous driving safer for pedestrians and the public in the real world. 230K human-labeled 3D object annotations in 39,179 LiDAR point cloud frames and corresponding frontal-facing RGB images. Captured at different times (day, night) and weathers (sun, cloud, rain).;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Academic Torrents | http://academictorrents.com/ | Mostly used by researchers, these datasets cover topics such as medicine, security, biology and several others. | Free |
ACTIVITYNET | http://activity-net.org/ | ActivityNet is a new large-scale video benchmark for human activity understanding. ActivityNet aims at covering a wide range of complex human activities that are of interest to people in their daily living. In its current version, ActivityNet provides samples from 203 activity classes with an average of 137 untrimmed videos per class and 1.41 activity instances per video, for a total of 849 video hours.;License information not found | Free |
ActivityNet-QA | https://github.com/MILVLG/activitynet-qa | The ActivityNet-QA dataset contains 58,000 human-annotated QA pairs on 5,800 videos derived from the popular ActivityNet dataset. The dataset provides a benchmark for testing the performance of VideoQA models on long-term spatio-temporal reasoning.;MIT - You are Free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work. | Free |
ADE20K | http://groups.csail.mit.edu/vision/datasets/ADE20K/ | A dataset for scene parsing. There are 20,210 images in the training set, 2,000 images in the validation set, and 3,000 images in the testing set. All the images are exhaustively annotated with objects. Many objects are also annotated with their parts. For each object there is additional information about whether it is occluded or cropped, and other attributes.;License information not found | Free |
Agriculture-Vision | https://www.agriculture-vision.com/dataset | Agriculture-Vision: a large-scale aerial farmland image dataset for semantic segmentation of agricultural patterns. We collected 94,986 high-quality aerial images from 3,432 farmlands across the US, where each image consists of RGB and Near-infrared (NIR) channels with resolution as high as 10 cm per pixel.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
AmbigQA | https://nlp.cs.washington.edu/ambigqa/ | AmbigQA, a new open-domain question answering task which involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question. A dataset covering 14,042 questions from NQ-open.;Attribution-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Free |
ApolloScape | http://apolloscape.auto/ | ApolloScape is an order of magnitude bigger and more complex than existing similar datasets such as Kitti and CityScapes. ApolloScape offers 10 times more high-resolution images with pixel-by-pixel annotations, and includes 26 different recognizable objects such as cars, bicycles, pedestrians and buildings. The dataset offers several levels of scene complexity with increasing number of pedestrians and vehicles, up to 100 vehicles in a given scene, as well as a wider set of challenging environments such as heavy weather or extreme lighting conditions.;Non-commercial and commercial licenses available | Free |
Argoverse | https://www.argoverse.org/ | Argoverse is a research collection with three distinct types of data. The first is a dataset with sensor data from 113 scenes observed by our fleet, with 3D tracking annotations on all objects. The second is a dataset of 300,000-plus scenarios observed by our fleet, wherein each scenario contains motion trajectories of all observed objects. The third is a set of HD maps of several neighborhoods in Pittsburgh and Miami, to add rich context for all of the data mentioned above.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Robust Reading | https://rrc.cvc.uab.es/ | "Robust Reading" refers to the research area dealing with the interpretation of written communication in unconstrained settings. | Free |
AssetMacro | http://www.assetmacro.com/ | AssetMacro is a data provider for 35,000+ stocks, bonds, commodities, credit default swaps and currencies of 10 market exchanges. | Free/Paid |
Astyx HiRes2019 | https://www.astyx.com/development/astyx-hires2019-dataset.html | A radar-centric automotive dataset based on radar, lidar and camera data for the purpose of 3D object detection.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Paid |
AU-AIR | https://bozcani.github.io/auairdataset | AU-AIR dataset is the first multi-modal UAV dataset for object detection. It meets vision and robotics for UAVs having the multi-modal data from different on-board sensors, and pushes forward the development of computer vision and robotic algorithms targeted at autonomous aerial surveillance. >2 hours raw videos, 32,823 labelled frames, 132,034 object instances.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
Audi A2D2 | https://www.a2d2.audi/a2d2/en.html | The dataset features 2D semantic segmentation, 3D point clouds, 3D bounding boxes, and vehicle bus data. Dataset includes more than 40,000 frames with semantic segmentation image and point cloud labels, of which more than 12,000 frames also have annotations for 3D bounding boxes. In addition, we provide unlabelled sensor data (approx. 390,000 frames) for sequences with several loops, recorded in three cities. A2D2 is around 2.3 TB in total.;Attribution No Derivatives 4.0 International (CC BY ND 4.0) - You are Free to: Share - copy and redistribute, Under the following terms: Attribution - you must give appropriate credit, NoDerivatives - you may not redistribute the modified material. | Free |
AVID | https://github.com/piergiaj/AViD | AViD is a large-scale video dataset with 467k videos and 887 action classes. The collected videos have a creative-commons license.;MIT - You are Free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work. | Free |
AVSpeech | https://looking-to-listen.github.io/avspeech/ | AVSpeech is a new, large-scale audio-visual dataset comprising speech video clips with no interfering background noises. The segments are 3-10 seconds long, and in each clip the audible sound in the soundtrack belongs to a single speaking person, visible in the video. In total, the dataset contains roughly 4700 hours of video segments, from a total of 290k YouTube videos, spanning a wide variety of people, languages and face poses.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
Awesome Public Datasets | https://github.com/caesar0301/awesome-public-datasets | Categorized list of large datasets (available for public use) | Free |
AWS Public Dataset Program | https://aws.amazon.com/opendata/public-datasets/ | Need to set up an account. Includes Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Project, and Space Telescope Science Institute. | Free/Paid |
Baidu DuReader 2.0 | https://ai.baidu.com/broad/subordinate?dataset=dureader | DuReader 2.0 is a large-scale open-domain Chinese dataset for Machine Reading Comprehension (MRC) and Question Answering (QA). It contains more than 300K questions, 1.4M evident documents and corresponding human generated answers.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Baidu Large-scale Street View Text with Partial Labeling (LSVT) | https://ai.baidu.com/broad/subordinate?dataset=lsvt | A new large-scale scene text dataset, namely Large-scale Street View Text with Partial Labeling (LSVT), with 30,000 training data and 20,000 testing images in full annotations, and 400,000 training data in weak annotations, which are referred to as partial labels.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Berkeley Deep Drive (BDD100K) | http://bdd-data.berkeley.edu/ | The dataset contains over 100k videos of driving experience, each running 40 seconds at 30 frames per second. The total image count is 800 times larger than Baidu ApolloScape (released March 2018), 4,800 times larger than Mapillary and 8,000 times larger than KITTI.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
BigML big list of public data sources | http://blog.bigml.com/2013/02/28/data-data-data-thousands-of-public-data-sources/#comment-7538 | Best place to explore, sell and buy datasets | Free/Paid |
Billion Words | http://www.statmt.org/lm-benchmark/ | The purpose of the project is to make available a standard training and test setup for language modeling experiments.;License information not found | Free |
BIMCV-COVID19+ | https://bimcv.cipf.es/bimcv-projects/bimcv-covid19/ | BIMCV-COVID19+: a large annotated dataset of RX and CT images of COVID19 patients. This first iteration of the database includes 1380 CX, 885 DX and 163 CT studies.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
BLiMP | https://github.com/alexwarstadt/blimp | The Benchmark of Linguistic Minimal Pairs. BLiMP is a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars.;License information not found | Free |
Bosch Small Traffic Lights Dataset | https://hci.iwr.uni-heidelberg.de/node/6132 | This dataset contains 13,427 camera images at a resolution of 1280x720 pixels and contains about 24,000 annotated traffic lights. The annotations include bounding boxes of traffic lights as well as the current state (active light) of each traffic light.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Boulder County Open Data | https://www.bouldercounty.org/government/open-data/ | Datasets such as parcels, zoning, property information, open space, trails and more. | Free |
Break | https://allenai.github.io/Break/ | Break is a question understanding dataset, aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example has the natural question along with its QDMR representation.;The dataset contains data from several sources, check the links on the website for individual licenses | Free |
Brno Urban Dataset | https://github.com/Robotics-BUT/Brno-Urban-Dataset | A new dataset recorded in Brno, Czech Republic. It offers data from four WUXGA cameras, two 3D LiDARs, inertial measurement unit, infrared camera and especially differential RTK GNSS receiver with centimetre accuracy which, to the best knowledge of the authors, is not available from any other public dataset so far. In addition, all the data are precisely timestamped with sub-millisecond precision to allow wider range of applications. At the time of publishing of the paper, it contains recordings of more than 350 km of rides in varying environments.;MIT - You are Free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work. | Free |
Broad Institute Cancer Program Datasets | http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi | All data sets on cancer disease for research | Free |
Bureau of Justice Statistics | https://www.bjs.gov/index.cfm?ty=dca#262 | Dataset from 40 urban counties used to describe the characteristics of more than 7,000 juveniles charged with felonies in State courts. | Free |
Bureau of Labor Statistics | https://www.bls.gov/data/ | Conveniently search multiple data sets all at once. Users can extract specific data by searching by keyword or by filtering through multiple topics, measures, and attributes. | Free |
BuzzFeed Data | https://github.com/BuzzFeedNews/everything | Has libraries, tools, guides and datasets. Datasets range from news-related topics to climate change and politics. | Free |
Canada Open Data | https://open.canada.ca/en/open-data | Pilot project with many government and geospatial datasets | Free |
Canadian Adverse Driving Conditions Dataset | http://cadcd.uwaterloo.ca/ | Open-source dataset for autonomous driving in wintry weather. The CADC dataset aims to promote research to improve self-driving in adverse weather conditions. This is the first public dataset to focus on real world driving data in snowy weather conditions. It features: 56,000 camera images, 7,000 LiDAR sweeps, 75 scenes of 50-100 frames each.;Attribution-NonCommercial 4.0 International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes. | Paid |
Causality Workbench | http://www.causality.inf.ethz.ch/repository.php | Data sets covering medicine, marketing, signaling and various other categories | Free/Paid |
CCMatrix | https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix | A billion-scale bitext data set for training translation models. CCMatrix is the largest data set of high-quality, web-based bitexts for training translation models with more than 4.5 billion parallel sentences in 576 language pairs pulled from snapshots of the CommonCrawl public data set.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
CelebA | http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html | CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Celeb-DF | http://www.cs.albany.edu/~lsw/celeb-deepfakeforensics.html | DeepFake Forensics (Celeb-DF) dataset contains real and DeepFake synthesized videos having similar visual quality on par with those circulated online. The Celeb-DF dataset includes 408 original videos collected from YouTube with subjects of different ages, ethnic groups and genders, and 795 DeepFake videos synthesized from these real videos.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Charles Stewart Congressional Data | http://web.mit.edu/17.251/www/data_page.html | Data file that corresponds with the hard copy version of Nelson's two-volume set Committees in the U.S. Congress, 1947-1992, CQ Press. Corrections of the data set to Charles Stewart at MIT. Note: the House committee data set for the 96th-102nd Congress is in the same format as the data set below that starts with the 103rd Congress. | Paid |
CheXpert | https://stanfordmlgroup.github.io/competitions/chexpert/ | CheXpert is a large public dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Chinese Text in the Wild | https://ctwdataset.github.io/ | A dataset of Chinese text with about 1 million Chinese characters annotated by experts in over 30 thousand street view images.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
CIFAR-100 | https://www.cs.toronto.edu/~kriz/cifar.html | This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).;License information not found | Paid |
CityCam | https://www.citycam-cmu.com/ | CITYCAM aims to understand the city by analyzing the vehicles. We collected and annotated 60,000 frames with rich information, leading to about 900,000 annotated objects.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Paid |
Cityscapes | https://www.cityscapes-dataset.com/ | Large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5 000 frames in addition to a larger set of 20 000 weakly annotated frames.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
ClarQ | https://github.com/vaibhav4595/ClarQ | ClarQ: A large-scale and diverse dataset for Clarification Question Generation. Consists of ~2M examples distributed across 173 domains of stackexchange.;Attribution-NonCommercial 4.0 International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes. | Free |
CMU-MOSEI | http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset/ | CMU-MOSEI is the largest in-the-wild dataset of multimodal sentiment analysis and emotion recognition in NLP. It consists of 23,500 sentences from more than 1000 youtube identities and 200 topics. Sentences are annotated for sentiment and emotion intensity. The dataset also contains unsupervised data (unannotated sentences).;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
CNN and Daily Mail summarization | https://cs.nyu.edu/~kcho/DMQA/ | Two datasets using news articles for Q&A research. Each dataset contains many documents (90k and 197k each), and each document is accompanied by approximately 4 questions on average. Each question is a sentence with one missing word/phrase which can be found from the accompanying document/context.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
COCO | http://cocodataset.org/ | COCO is a large-scale object detection, segmentation, and captioning dataset. It contains: 330K images (>200K labeled), 1.5 million object instances, 80 object categories.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
COCO-Text | https://bgshih.github.io/cocotext/ | A Large-Scale Scene Text Dataset, Based on MSCOCO. COCO-Text V2.0 contains 63,686 images with 239,506 annotated text instances. Segmentation mask is annotated for every word, allowing fine-level detection. Three attributes are labeled for every word: machine-printed vs. handwritten, legible vs. illegible, and English vs. non-English.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
CODAH | https://github.com/Websail-NU/CODAH | CODAH is an adversarially-constructed evaluation dataset with 2.8k questions for testing common sense. CODAH forms a challenging extension to the SWAG dataset, which tests commonsense knowledge using sentence-completion questions that describe situations observed in video.;License information not found | Free |
Colorado Open Data | https://data.colorado.gov/ | Colorado's open data at your fingertips | Free |
Comma 2k19 | https://github.com/commaai/comma2k19 | comma.ai presents comma2k19, a dataset of over 33 hours of commute in California's 280 highway. This means 2019 segments, 1 minute long each, on a 20km section of highway driving between California's San Jose and San Francisco. comma2k19 is a fully reproducible and scalable dataset.;MIT - You are Free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work. | Free |
comma.ai | https://github.com/commaai/research | 7 and a quarter hours of largely highway driving.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
CommonsenseQA | https://www.tau-nlp.org/commonsenseqa | CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing set splits: "Random split" which is the main evaluation split, and "Question token split".;License information not found | Paid |
CompCars | http://mmlab.ie.cuhk.edu.hk/datasets/comp_cars/index.html | The Comprehensive Cars (CompCars) dataset contains data from two scenarios, including images from web-nature and surveillance-nature. The web-nature data contains 163 car makes with 1,716 car models. There are a total of 136,726 images capturing the entire cars and 27,618 images capturing the car parts. The full car images are labeled with bounding boxes and viewpoints. Each car model is labeled with five attributes, including maximum speed, displacement, number of doors, number of seats, and type of car.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Condensed Movies | https://www.robots.ox.ac.uk/~vgg/research/condensed-movies/ | A large-scale video dataset, featuring clips from movies with detailed captions. Over 3,000 diverse movies from a variety of genres, countries and decades.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
CoQA | https://stanfordnlp.github.io/coqa/ | CoQA is a large-scale dataset for building Conversational Question Answering systems. CoQA contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains.;CoQA contains passages from seven domains. We make five of these public under the following licenses: Literature and Wikipedia passages are shared under CC BY-SA 4.0 license. Children's stories are collected from MCTest which comes with MSR-LA license. Middle/High school exam passages are collected from RACE which comes with its own license. News passages are collected from the DeepMind CNN dataset which comes with Apache license. | Free |
CORNELL NEWSROOM | https://summari.es/ | CORNELL NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017 and use a variety of summarization strategies combining extraction and abstraction.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
COVID-19 image data collection | https://github.com/ieee8023/covid-chestxray-dataset | A database of COVID-19 cases with chest X-ray or CT images.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
COVID-CT | https://github.com/UCSD-AI4H/COVID-CT | The COVID-CT-Dataset has 275 CT images containing clinical findings of COVID-19.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
COVIDx | https://github.com/lindawangg/COVID-Net | A dataset with 16,756 chest radiography images across 13,645 patient cases. The current COVIDx dataset is constructed from other open source chest radiography datasets.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
CQ500 | http://headctstudy.qure.ai/dataset | We have made the CQ500 dataset of 491 scans with 193,317 slices publicly available so that others can compare and build upon the results we have achieved in the paper. We provide anonymized DICOMs for all the 491 scans and the corresponding radiologists' reads. The scans in the CQ500 dataset were generously provided by Centre for Advanced Research in Imaging, Neurosciences and Genomics (CARING), New Delhi, IN. The reads were done by three radiologists with an experience of 8, 12 and 20 years in cranial CT interpretation respectively.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Paid |
Credit Risk Analytics Dat | http://www.creditriskanalytics.net | Home equity loans credit data set, mortgage loan level data set, Loss Given Default (LGD) data set and corporate ratings data set. | Free/Paid |
CrowdFix | https://github.com/MemoonaTahira/CrowdFix | Dataset of Human Eye Fixation over Crowd Videos. CrowdFix includes 434 videos with diverse crowd scenes, containing a total of 37,493 frames and 1,249 seconds. The diverse content refers to different crowd activities under three distinct categories - Sparse, Dense Free Flowing and Dense Congested. All videos are at 720p resolution and 30 Hz frame rate.;License information not found | Free |
CrowdFlower Data for Everyone | http://www.crowdflower.com/data-for-everyone | A repository of some of the data sets collected or enhanced by their contributors. | Free/Paid |
CULane | https://xingangpan.github.io/projects/CULane.html | CULane is a large scale challenging dataset for academic research on traffic lane detection. It is collected by cameras mounted on six different vehicles driven by different drivers in Beijing. More than 55 hours of videos were collected and 133,235 frames were extracted. Data examples are shown above. We have divided the dataset into 88880 for training set, 9675 for validation set, and 34680 for test set. The test set is divided into normal and 8 challenging categories, which correspond to the 9 examples above.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
CURE-TSD | https://github.com/olivesgatech/CURE-TSD | CURE-TSD: Challenging Unreal and Real Environments for Traffic Sign Detection. The video sequences in the CURE-TSD dataset are grouped into two classes: real data and unreal data. Real data correspond to processed versions of sequences acquired from the real world. Unreal data correspond to synthesized sequences generated in a virtual environment. There are 49 real sequences and 49 unreal sequences that do not include any specific challenge. We have 34 training videos and 15 test videos in both real and unreal sequences that are challenge-free. There are 300 frames in each video sequence. There are 49 challenge-free real video sequences processed with 12 different types of effects and 5 different challenge levels. Moreover, there are 49 synthesized video sequences processed with 11 different types of effects and 5 different challenge levels. In total, there are 5,733 video sequences, which include around 1.72 million frames.;License information not found | Free |
Danbooru2018 | https://www.gwern.net/Danbooru2018 | Danbooru2018 is a large-scale anime image database with 3.33m+ images annotated with 99.7m+ tags. It can be useful for machine learning purposes such as image recognition and generation.;License information not found | Free |
Data Planet | http://www.data-planet.com | The largest repository of standardized and structured statistical data. | Paid |
Data.gov | https://catalog.data.gov/dataset#sec-groups | Over 200K Datasets. Topics include Agriculture, Climate, Ecosystems, Energy, Local Government, Maritime | Free |
Data.gov.uk | http://data.gov.uk | Data published by central government, local authorities and public bodies to help you build products and services | Free |
Data.world | https://data.world/ | Datasets from NASA, Twitter, Geospatial, Finance, Sports, Census, Transportation, Environment and more. | Free/Paid |
Datacatalogs.org | http://datacatalogs.org | Aims to be the most comprehensive list of open data catalogs in the world. | Free |
Datahub.io | https://datahub.io/ | Find, Share & Publish Data. Healthcare, Inflation, Education, GeoJSON, Demographics, Football, Climate Change, Stockmarket and more. | Free/Paid |
DataSF.org | https://datasf.org | Datasets from the City and County of San Francisco. | Free |
DBPedia | https://wiki.dbpedia.org/develop/datasets | DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. | Free |
DBPedia, Amazon, Yelp, Yahoo!, Sogou, and AG | https://drive.google.com/drive/u/0/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M | An extensive set of eight datasets for text classification. Datasets from DBPedia, Amazon, Yelp, Yahoo!, Sogou, and AG. Sample size of 120K to 3.6M, ranging from binary to 14 class problems.;Parts of the dataset are under different licenses, check the dataset web page for more information | Free |
DDAD | https://github.com/TRI-ML/DDAD | DDAD (Dense Depth for Autonomous Driving) is a new autonomous driving benchmark from TRI (Toyota Research Institute) for long range (up to 250m) and dense depth estimation in challenging and diverse urban conditions. It contains monocular videos and accurate ground-truth depth (across a full 360 degree field of view) generated from high-density LiDARs mounted on a fleet of self-driving cars operating in a cross-continental setting.;Attribution-NonCommercial-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
DeepFashion2 | https://github.com/switchablenorms/DeepFashion2 | It is a versatile benchmark of four tasks including clothes detection, pose estimation, segmentation, and retrieval. It has 801K clothing items where each item has rich annotations such as style, scale, viewpoint, occlusion, bounding box, dense landmarks and masks. There are also 873K Commercial-Consumer clothes pairs.;License information not found | Free |
Delve | https://delvedatabase.org/ | Delve makes it possible for users to compare their learning methods with other methods on many datasets | Free |
DensePose | http://densepose.org/ | Dense human pose estimation aims at mapping all human pixels of an RGB image to the 3D surface of the human body. We introduce DensePose-COCO, a large-scale ground-truth dataset with image-to-surface correspondences manually annotated on 50K COCO images.;Attribution-NonCommercial 2.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes. | Free |
Dept of Education Data | https://catalog.data.gov/dataset?groups=education2168#topic=education_navigation | Guide for education data resources including high-value data sets, data visualization tools, resources for the classroom, applications created from open data and more. | Free |
DIODE: A Dense Indoor and Outdoor DEpth Dataset | https://diode-dataset.org/ | DIODE (Dense Indoor and Outdoor DEpth) is a dataset that contains diverse high-resolution color images with accurate, dense, wide-range depth measurements. It is the first public dataset to include RGBD images of indoor and outdoor scenes obtained with one sensor suite.;MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work. | Paid |
DoQA | http://ixa.eus/node/12931 | DoQA is a dataset for accessing Domain Specific FAQs via conversational QA that contains 2,437 information-seeking question/answer dialogues (10,917 questions in total) on three different domains: cooking, travel and movies.;Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Free |
DramaQA | https://dramaqa.snu.ac.kr/Dataset | Dataset is built upon the TV drama "Another Miss Oh" and it contains 16,191 QA pairs from 23,928 various length video clips, with each QA pair belonging to one of four difficulty levels. We provide 217,308 annotated images with rich character-centered annotations.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Dreyeve | http://imagelab.ing.unimore.it/dreyeve | Composed of 74 video sequences of 5 minutes each, with more than 500,000 frames captured and annotated. The labeling contains drivers’ gaze fixations and their temporal integration providing task-specific saliency maps. Geo-referenced locations, driving speed and course complete the set of released data.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
DROP | https://allennlp.org/drop | DROP is a crowdsourced, adversarially-created, 96k-question benchmark, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets.;Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Free |
DublinCity: Annotated LiDAR Point Cloud | https://v-sense.scss.tcd.ie/DublinCity/ | The Urban Modelling Group at University College Dublin (UCD) scanned a major area of Dublin city centre (around 5.6 km^2 including partially covered areas) with an ALS device carried by helicopter in 2015. The actual focus area was around 2 km^2, which contains the densest LiDAR point cloud and imagery dataset. The flight altitude was mostly around 300m and the survey was flown in 41 flight path strips. The dataset is made up of over 260 million laser scanning points labelled into 100,000 objects.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
DVQA | https://github.com/kushalkafle/DVQA_dataset | DVQA: Understanding Data Visualizations via Question Answering, a dataset that tests many aspects of bar chart understanding in a question answering framework. Contains over 3 million image-question pairs about bar charts. It tests three forms of diagram understanding: a) structure understanding; b) data retrieval; and c) reasoning.;Attribution-NonCommercial 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes. | Free |
Earthdata | https://earthdata.nasa.gov | Earthdata is part of NASA’s Earth Science Data Systems Program, specifically the Earth Observing System Data and Information System (EOSDIS) | Free |
Enron Email Dataset | https://www.cs.cmu.edu/~./enron/ | The dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). | Paid |
ETH3D | https://www.eth3d.net/ | A multi-view stereo / 3D reconstruction benchmark covering a variety of indoor and outdoor scenes. Ground truth geometry has been obtained using a high-precision laser scanner. Contains 13 / 12 DSLR datasets for training / testing, 5 / 5 multi-cam rig videos for training / testing, 27 / 20 frames for two-view stereo training / testing.;Attribution-NonCommercial-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
EuroCity Persons Dataset | https://eurocity-dataset.tudelft.nl/ | With over 238,200 person instances manually labeled in over 47,300 images, EuroCity Persons is nearly one order of magnitude larger than person datasets used previously for benchmarking. Diversity is gained by recording this dataset throughout Europe. All objects were annotated with tight bounding boxes delineating their full extent. If objects were partly occluded, their full extents were estimated (this is useful for later processing steps such as tracking) and the level of occlusion was annotated.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
ExDARK Dataset | https://github.com/cs-chan/Exclusively-Dark-Image-Dataset | The Exclusively Dark (ExDARK) dataset is a collection of 7,363 low-light images from very low-light environments to twilight (i.e. 10 different conditions) with 12 object classes (similar to PASCAL VOC) annotated at both the image class level and with local object bounding boxes.;BSD 3-Clause "New" or "Revised" License - A permissive license similar to the BSD 2-Clause License, but with a 3rd clause that prohibits others from using the name of the project or its contributors to promote derived products without written consent. | Free |
Facebook bAbI | https://research.fb.com/downloads/babi/ | A set of datasets for automatic text understanding and reasoning.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Facebook BISON | http://hexianghu.com/bison/ | Facebook BISON (Binary Image Selection) dataset complements the COCO Captions dataset. BISON-COCO is not a training dataset, but rather an evaluation dataset that can be used to test existing models’ ability for pairing visual content with appropriate text descriptions.;License information not found | Free |
FaceForensics Benchmark | http://kaldir.vc.in.tum.de/faceforensics_benchmark/ | FaceForensics++ is a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation methods: Deepfakes, Face2Face, FaceSwap and NeuralTextures. The data has been sourced from 977 YouTube videos, and all videos contain a trackable, mostly frontal face without occlusions, which enables automated tampering methods to generate realistic forgeries. As we provide binary masks, the data can be used for image and video classification as well as segmentation. In addition, we provide 1000 Deepfakes models to generate and augment new data.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Factivated | https://www.factivated.com/ | Open data from all over the world. For example, EUROSTAT, FAO, USDA, EIA, JODI, UK Land Registry | Free |
Fashion IQ | https://github.com/XiaoxiaoGuo/fashion-iq | A new dataset for natural language based fashion image retrieval. Unlike previous fashion datasets, we provide natural language annotations to facilitate the training of interactive image retrieval systems, as well as the commonly used attribute based labels.;The CDLA agreement is similar to permissive open source licenses in that the publisher of data allows anyone to use, modify and do what they want with the data with no obligations to share any of their changes or modifications. | Free |
Fashion MNIST | https://github.com/zalandoresearch/fashion-mnist | Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.;MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work. | Free |
fastMRI Dataset | http://fastmri.org/ | A collaborative research project from Facebook AI Research (FAIR) and NYU Langone Health to investigate the use of AI to make MRI scans up to 10 times faster. The dataset includes more than 1.5 million anonymous MRI images of the knee, drawn from 10,000 scans, and raw measurement data from nearly 1,600 scans.;MIT - You are free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work. | Paid |
FBI Data | https://www.fbi.gov/services/cjis/ucr | Criminal Justice Information Services | Free |
FEDSTATS | https://www.cs.umd.edu/hcil/govstat/fedstats/fedstats3.htm | Gateway to statistics from over 100 United States Federal government agencies. Link directly to statistical data from agencies | Free/Paid |
FIMI repository for frequent itemset mining | http://fimi.cs.helsinki.fi/ | FIMI repository containing the source codes of all implementations that were accepted at the FIMI workshops together with several publicly available datasets. | Free |
Financial Data Finder at OSU | http://fisher.osu.edu/fin/fdf/osudata.htm | Catalog of financial data sets. | Free |
FIVE Project - Dartmouth | http://five.dartmouth.edu/ | Free time-series data sets include: historical workstation sales, photolithography, breweries, and shipbuilding. | Free/Paid |
FiveThirtyEight | https://github.com/fivethirtyeight/data | Data sets are available under the Creative Commons Attribution 4.0 International License, and the code is available under the MIT License. | Free |
Flickr1024 | https://yingqianwang.github.io/Flickr1024/ | Flickr1024 is a large stereo dataset, which consists of 1024 high-quality image pairs and covers diverse scenarios. This dataset can be employed for stereo image super-resolution (SR).;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Flickr30k | http://hockenmaier.cs.illinois.edu/DenotationGraph/ | An image caption corpus consisting of 158,915 crowd-sourced captions describing 31,783 images. This is an extension of the Flickr 8k Dataset. The new images and captions focus on people involved in everyday activities and events.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
FMA: A Dataset For Music Analysis | https://github.com/mdeff/fma | We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community's growing interest in feature and end-to-end learning is however restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies.;Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
Ford Autonomous Vehicle Dataset | https://avdata.ford.com/ | A challenging multi-agent seasonal dataset collected by a fleet of Ford autonomous vehicles at different days and times during 2017-18. Each log in the dataset is time-stamped and contains raw data from all the sensors, calibration values, pose trajectory, ground truth pose, and 3D maps.;Attribution-NonCommercial-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
GDELT | http://www.guardian.co.uk/news/datablog/2013/apr/12/gdelt-global-database-events-location | Global Data on Events, Location and Tone, described by the Guardian as "a big data history of life, the universe and everything". | Paid |
GEO (Gene Expression Omnibus) | http://www.ncbi.nlm.nih.gov/geo/ | Provides a robust, versatile database in which to efficiently store high-throughput functional genomic data. | Paid |
GeoDa Center | http://geodacenter.asu.edu/datalist/ | GeoDa is a free and open source software tool that serves as an introduction to spatial data analysis. | Free |
Global Health Observatory data | http://www.who.int/gho/en/ | In line with its core goal of better health information worldwide, the World Health Organization makes its data on global health publicly available through the Global Health Observatory (GHO). | Free |
GoEmotions | https://github.com/google-research/google-research/tree/master/goemotions | GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Google Audioset | https://research.google.com/audioset/ | AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.;Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
Google Cloud Platform Datasets | https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset | Over 180 datasets in all categories. Charge for only large queries and certain use cases. | Free/Paid |
Google Conceptual Captions | https://ai.google.com/research/ConceptualCaptions | We make available Conceptual Captions, a new dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Caption images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the current version of the captions, we have developed an automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.;License information not found | Paid |
Google Datasets | https://datasetsearch.research.google.com/ | Provides datasets in collaboration with many other websites. | Free/Paid |
Google Landmarks V2 | https://github.com/cvdfoundation/google-landmark | This is the second version of the Google Landmarks dataset, which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test.;Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
Google Natural Questions | https://ai.google.com/research/NaturalQuestions | Natural Questions (NQ), a new, large-scale corpus for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions. NQ is large, consisting of 300,000 naturally occurring questions, along with human annotated answers from Wikipedia pages, to be used in training QA systems. We have additionally included 16,000 examples where answers (to the same questions) are provided by 5 different annotators, useful for evaluating the performance of the learned QA systems.;Attribution-ShareAlike International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Google Open Images V5 | https://storage.googleapis.com/openimages/web/index.html | Open Images is a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships. It contains a total of 16M bounding boxes for 600 object classes on 1.9M images, making it the largest existing dataset with object location annotations. Open Images V5 features segmentation masks for 2.8 million object instances in 350 categories. Unlike bounding-boxes, which only identify regions in which an object is located, segmentation masks mark the outline of objects, characterizing their spatial extent to a much higher level of detail.;Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
Google sentence compression | https://github.com/google-research-datasets/sentence-compression | Large corpus of uncompressed and compressed sentences from news articles. Contains over 200,000 sentence compression pairs.;License information not found | Free |
Google Trends | https://trends.google.com/trends/explore | This is one of the widest and most interesting public data sets | Free |
GOT-10k (Generic Object Tracking Benchmark) | http://got-10k.aitestunion.com/ | A large, high-diversity, one-shot database for generic object tracking in the wild. The dataset contains more than 10,000 video segments of real-world moving objects and over 1.5 million manually labeled bounding boxes. The dataset is backboned by WordNet and it covers a majority of 560+ classes of real-world moving objects and 80+ classes of motion patterns. The test set embodies 84 object classes and 32 motion classes with only 180 video segments, allowing for efficient evaluation.;License information not found | Paid |
GQA | https://cs.stanford.edu/people/dorarad/gqa/ | The dataset consists of 22M questions about various day-to-day images. Each image is associated with a scene graph of the image's objects, attributes and relations, a new cleaner version based on Visual Genome.;Attribution 4.0 International (CC BY 4.0) - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
GTA dataset | https://download.visinf.tu-darmstadt.de/data/from_games/ | The dataset consists of 24,966 densely labelled frames split into 10 parts for convenience. The class labels are compatible with the CamVid and CityScapes datasets.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
HACS | http://hacs.csail.mit.edu/ | This project introduces a novel video dataset, named HACS (Human Action Clips and Segments). It consists of two kinds of manual annotations. HACS Clips contains 1.55M 2-second clip annotations; HACS Segments has complete action segments (from action start to end) on 50K videos. The large-scale dataset is effective for pretraining action recognition and localization models, and also serves as a new benchmark for temporal action localization.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
HASY | https://zenodo.org/record/259444 | HASY is a publicly available, free of charge dataset of single symbols similar to MNIST. It contains 168,233 instances of 369 classes.;License information not found | Free |
Hateful Memes Challenge | https://ai.facebook.com/hatefulmemes | A new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
HD1K Benchmark Suite | http://hci-benchmark.org/ | An autonomous driving dataset and benchmark for optical flow. > 1000 frames at 2560x1080 with diverse lighting and weather scenarios, reference data with error bars for optical flow, evaluation masks for dynamic objects, specific robustness evaluation on challenging scenes. The dataset includes: 110,500 vehicles 44,500 driven kilometers 147 driven hours;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Healthdata.gov | https://healthdata.gov/ | 125 years of US healthcare data including claim-level Medicare data, epidemiology and population statistics. | Free |
HHS.gov | https://www.hhs.gov/about/agencies/omha/about/health-data-sets/index.html | Data from the Office of Medicare Hearings and Appeals (OMHA). | Free |
HighD - The Highway Drone Dataset | https://www.highd-dataset.com/ | The highD dataset is a new dataset of naturalistic vehicle trajectories recorded on German highways. Using a drone, typical limitations of established traffic data collection methods such as occlusions are overcome by the aerial perspective. Traffic was recorded at six different locations and includes more than 110 500 vehicles.;Non-commercial and commercial licenses available | Paid |
HistData.com | http://histdata.com | Forex historical data. | Free |
HitCompanies Datasets | http://www.grainmarketresearch.com/ | The UK Companies Dataset contains information on 10,000 UK companies randomly sampled from the aiHit database. | Free/Paid |
Holopix50k | https://leiainc.github.io/holopix50k/ | A novel in-the-wild stereo image dataset, comprising 49,368 image pairs contributed by users of the Holopix™ mobile social platform.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
HotpotQA | https://hotpotqa.github.io/ | HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. The dataset is composed of 113,000 QA pairs based on Wikipedia.;Attribution-ShareAlike 4.0 International - You are free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Human Activity Knowledge Engine (HAKE) | http://hake-mvig.cn/home/ | Human Activity Knowledge Engine (HAKE) aims at promoting the human activity/action understanding. As a large-scale knowledge base, HAKE is built upon existing activity datasets, and supplies human instance action labels and corresponding body part level atomic action labels (Part States). Dataset contains 104 K+ images, 154 activity classes, 677 K+ human instances.;License information not found | Free |
Humans in the Loop | https://humansintheloop.org/datasets/ | Regularly updated datasets covering current situations. | Free |
IBM Diversity in Faces Dataset | https://www.research.ibm.com/artificial-intelligence/trusted-ai/diversity-in-faces/ | Diversity in Faces (DiF) is a large and diverse dataset that seeks to advance the study of fairness and accuracy in facial recognition technology. The first of its kind available to the global research community, DiF provides a dataset of annotations of 1 million human facial images.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
ICWSM-2009 dataset | http://www.icwsm.org/2009/data/ | ICWSM 2009 is making a dataset available to researchers in the blog and social media fields | Free |
ImageMonkey | https://imagemonkey.io/ | ImageMonkey is a free, public open source dataset. ImageMonkey provides a platform where users can drop their photos, tag them with a label, and put them into the public domain. Contains over 100,000 images.;CC-0 - No Copyright | Free |
ImageNet | http://www.image-net.org/ | ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
iMat Fashion 2019 | https://github.com/visipedia/imat_comp | While early work in computer vision addressed related clothing recognition tasks, these are not designed with fashion insiders’ needs in mind, possibly due to the research gap in fashion design and computer vision. To address this, we first propose a fashion taxonomy built by fashion experts, informed by product description from the internet. To capture the complex structure of fashion objects and ambiguity in descriptions obtained from crawling the web, our standardized taxonomy contains 46 apparel objects (27 main apparel items and 19 apparel parts), and 92 related fine-grained attributes. Secondly, a total of around 50K clothing images (10K with both segmentation and fine-grained attributes, 40K with apparel instance segmentation) in daily-life, celebrity events, and online shopping are labeled by both domain experts and crowd workers for fine-grained segmentation.;License information not found | Free |
IMDB-WIKI faces | https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/ | Built from the list of the 100,000 most popular actors on the IMDb website, with date of birth, name, gender and all related images (automatically) crawled from their profiles. 460,723 face images from 20,284 celebrities from IMDb and 62,328 from Wikipedia, thus 523,051 in total.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Infochimps | http://infochimps.org/ | open catalog and market place for datasets | Free |
INTERACTION Dataset | https://interaction-dataset.com/ | The INTERACTION dataset contains naturalistic motions of various traffic participants in a variety of highly interactive driving scenarios. Using drones and traffic cameras, trajectories were captured from different countries, including the US, Germany, China and other countries.;Research and commercial licenses available. | Free |
International Macroeconomic Data Set - U.S. Dept of Agriculture Economic Research Service | http://www.ers.usda.gov/data-products/international-macroeconomic-data-set.aspx | Useful for projections, the USDA's International Macroeconomic Data Set "provides data from 1969 through 2030 for real (adjusted for inflation) gross domestic product (GDP), population, real exchange rates, and other variables for the 190 countries and 34 regions that are most important for U.S. agricultural trade." | Free/Paid |
International Monetary Fund | http://data.imf.org/ | IMF Data - Macroeconomic and Financial Data | Free |
Intersection Drone Dataset | https://www.ind-dataset.com/ | The inD dataset is a new dataset of naturalistic vehicle trajectories recorded at German intersections. Using a drone, typical limitations of established traffic data collection methods like occlusions are overcome. Traffic was recorded at four different locations. The trajectory for each road user and its type is extracted.;Research and commercial licenses available. | Paid |
Investor Links | http://www.investorlinks.com/ | Financial data sets available | Free/Paid |
JHU-CROWD++ | http://www.crowd-counting.com/ | A large-scale unconstrained crowd counting dataset. A comprehensive dataset with 4,372 images and 1.51 million annotations. In comparison to existing datasets, the proposed dataset is collected under a variety of diverse scenarios and environmental conditions.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
JMP Public featured datasets | https://public.jmp.com/featured?utm_source=kdnuggets&utm_medium=advertisement&utm_campaign=datasetlisting | Data sets provided to connect people and spread information | Free |
JRDB | https://jrdb.stanford.edu/dataset/about | JRDB is the largest benchmark data for 2D-3D person tracking, including: Over 60K frames (67 minutes) of sensor data captured from 5 stereo cameras and two LiDAR sensors, 54 sequences from different locations, during day and night time, indoors and outdoors in a university campus environment. Around 2 million high-quality 2D bounding box annotations on 360° cylindrical video streams generated from 5 stereo cameras.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Kaggle Datasets | https://www.kaggle.com/datasets | Find and use datasets or complete tasks | Free/Paid |
KeypointNet | https://github.com/qq456cvb/KeypointNet | KeypointNet is a large-scale and diverse 3D keypoint dataset that contains 83,231 keypoints and 8,329 3D models from 16 object categories, by leveraging numerous human annotations, based on ShapeNet models.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
KITTI | http://www.cvlibs.net/datasets/kitti/index.php | A novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research. In total, 6 hours of traffic scenarios recorded at 10-100 Hz. The scenarios are diverse, capturing real-world traffic situations, and range from freeways through rural areas to inner-city scenes with many static and dynamic objects.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
KnowIT VQA | https://knowit-vqa.github.io/ | KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which can only be answered with experience gained from watching the series.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
KONECT | http://www.sigkdd.org/kddcup/index.php | A project to collect large network datasets of all types in order to perform research in network science and related fields, collected by the Institute of Web Science and Technologies at the University of Koblenz-Landau. | Free/Paid |
Large Movie Review Dataset | http://ai.stanford.edu/~amaas/data/sentiment/ | This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided.;License information not found | Free |
Large Scale Chinese Corpus for NLP | https://github.com/brightmart/nlp_chinese_corpus | Dataset contents: 1. Wikipedia (wiki2019zh), 1 million well-formed Chinese entries 2. News corpus (news2016zh), 2.5 million news, including keywords, description 3. Encyclopedia question and answer (baike2018qa), 1.5 million questions and answers with question types 4. Community Q&A json version (webtext2019zh), 4.1 million high quality community Q&A, suitable for training oversized models 5. Translation corpus (translation2019zh), 5.2 million pairs of Chinese and English sentences;The dataset contains data from several sources, check the links on the website for individual licenses | Free |
LibriSpeech | http://www.openslr.org/12/ | Large-scale (1000 hours) corpus of read English speech.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
Logo-2k+ | https://github.com/msn199959/Logo-2k-plus-Dataset | A Large-Scale Logo Dataset for Scalable Logo Classification. Our resulting logo dataset contains 167,140 images with 10 root categories and 2,341 categories.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
LSUN | http://www.yf.io/p/lsun | LSUN contains around one million labeled images for each of 10 scene categories and 20 object categories.;License information not found | Free |
LVIS | https://www.lvisdataset.org/ | LVIS is a new dataset for long tail object instance segmentation. 1000+ Categories: found by data-driven object discovery in 164k images. More than 2.2 million high quality instance segmentation masks.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
Lyft Level 5 | https://level5.lyft.com/dataset/ | A comprehensive, large-scale dataset featuring the raw sensor camera and LiDAR inputs as perceived by a fleet of multiple, high-end, autonomous vehicles in a bounded geographic area. This dataset also includes high quality, human-labelled 3D bounding boxes of traffic agents and an underlying HD spatial semantic map. Contains over 55,000 human-labeled 3D annotated frames; data from 7 cameras and up to 3 lidars; a drivable surface map; and an underlying HD spatial semantic map. A semantic map provides context to reason about the presence and motion of the agents in the scenes. The provided map has over 4000 lane segments (2000 road segment lanes and about 2000 junction lanes), 197 pedestrian crosswalks, 60 stop signs, 54 parking zones, 8 speed bumps, 11 speed humps.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Paid |
Mapillary Street-level Sequences Dataset | https://www.mapillary.com/dataset/places | Mapillary Street-Level Sequences (MSLS) is the largest, most diverse dataset for place recognition, containing 1.6 million images in a large number of short sequences.;Research and commercial licenses available. | Paid |
Mapillary Traffic Sign Dataset | https://www.mapillary.com/dataset/trafficsign | A diverse street-level imagery dataset with bounding box annotations for detecting and classifying traffic signs around the world. 100,000 high-resolution images from all over the world with bounding box annotations of over 300 classes of traffic signs. The fully annotated set of the Mapillary Traffic Sign Dataset (MTSD) includes a total of 52,453 images with 257,543 traffic sign bounding boxes. The additional, partially annotated dataset contains 47,547 images with more than 80,000 signs that are automatically labeled with correspondence information from 3D reconstruction.;Research and commercial licenses available. | Paid |
Mapillary Vistas | https://www.mapillary.com/dataset/vistas | The Mapillary Vistas Dataset is the most diverse publicly available dataset of manually annotated training data for semantic segmentation of street scenes. 25,000 images pixel-accurately labeled into 152 object categories, 100 of those instance-specific.;Research and commercial licenses available. | Paid |
MeasuringWorth.com | https://www.measuringworth.com/ | This site offers calculators and data sets related to measures of worth over long time periods. | Free/Paid |
Medicare | https://data.medicare.gov/ | A federal government website managed by the Centers for Medicare & Medicaid Services | Free/Paid |
MegaFace | http://megaface.cs.washington.edu/ | The MF2 training dataset is the largest (in number of identities) publicly available facial recognition dataset with 4.7 million faces, 672K identities, and their respective bounding boxes. All images obtained from Flickr (Yahoo's dataset) and licensed under Creative Commons.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Mid-Air | https://midair.ulg.ac.be/ | Mid-Air is a multi-modal synthetic dataset for low altitude drone flights in unstructured environments. It contains synchronized data captured by multiple sensors for a total of 54 trajectories and more than 420k video frames simulated in various climate conditions.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Paid |
MIMIC | https://mimic.physionet.org/ | MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more. The latest version of MIMIC is MIMIC-III v1.4, which comprises over 58,000 hospital admissions for 38,645 adults and 7,875 neonates. The data spans June 2001 - October 2012. The database, although de-identified, still contains detailed information regarding the clinical care of patients, so must be treated with appropriate care and respect.;License information not found | Free |
MIMIC-CXR | https://www.physionet.org/physiobank/database/mimiccxr/ | MIMIC-CXR is a large, publicly-available database comprising de-identified chest radiographs from patients admitted to the Beth Israel Deaconess Medical Center between 2011 and 2016. The dataset contains 371,920 chest x-rays associated with 227,943 imaging studies. Each imaging study can pertain to one or more images, but is most often associated with two images: a frontal view and a lateral view. Images are provided with 14 labels derived from a natural language processing tool applied to the corresponding free-text radiology reports.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
ML Data | https://www.mldata.io/ | High quality datasets to use in your favorite Machine | Free/Paid |
MongoDB | https://www.mongodb.com/ | MongoDB is a document-oriented NoSQL database used for high-volume data storage. It came to light around the mid-2000s. | Free |
MoVi | https://www.biomotionlab.ca/movi/ | MoVi is the first human motion dataset to contain synchronized pose, body meshes and video recordings. Dataset contains 9 hours of motion capture data, 17 hours of video data from 4 different points of view (including one hand-held camera), and 6.6 hours of IMU data.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Mozilla Common Voice | https://voice.mozilla.org/ | Mozilla crowdsources the largest dataset of human voices available for use, including 18 different languages, adding up to almost 1,400 hours of recorded voice data from more than 42,000 contributors.;CC-0 - No Copyright | Paid |
MRNet | https://stanfordmlgroup.github.io/competitions/mrnet/ | The MRNet dataset consists of 1,370 knee MRI exams performed at Stanford University Medical Center. The dataset contains 1,104 (80.6%) abnormal exams, with 319 (23.3%) ACL tears and 508 (37.1%) meniscal tears; labels were obtained through manual extraction from clinical reports.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
MS MARCO | http://www.msmarco.org/ | Microsoft Machine Reading Comprehension (MS MARCO) is a new large scale dataset for reading comprehension and question answering. In MS MARCO, all questions are sampled from real anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated if they could summarize the answer. It contains 1,010,916 user queries and 182,669 natural language answers.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
MSeg | https://github.com/mseg-dataset/mseg-api | MSeg: A Composite Dataset for Multi-domain Semantic Segmentation. More than 220,000 object masks in more than 80,000 images.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
MultiNLI | https://www.nyu.edu/projects/bowman/multinli/ | The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.;The majority of the corpus is released under the OANC’s license, which allows all content to be freely used, modified, and shared under permissive terms. The data in the FICTION section falls under several permissive licenses; Seven Swords is available under a Creative Commons Share-Alike 3.0 Unported License, and with the explicit permission of the author, Living History and Password Incorrect are available under Creative Commons Attribution 3.0 Unported Licenses; the remaining works of fiction are in the public domain in the United States (but may be licensed differently elsewhere). | Free |
MultiWOZ | https://www.repository.cam.ac.uk/handle/1810/280608 | The MultiWOZ dataset is a fully-labeled collection of human-human written conversations spanning multiple domains and topics. At a size of 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora. The dialogues are set between a tourist and an information clerk. It spans 7 domains.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
MURA | https://stanfordmlgroup.github.io/competitions/mura/ | MURA (musculoskeletal radiographs) is a large dataset of bone X-rays that can be used to train algorithms tasked with detecting abnormalities in X-rays. MURA is believed to be the world’s largest public radiographic image dataset with 40,561 labeled images.;Stanford University School of Medicine MURA Dataset Research Use Agreement (see website for license) | Free |
MySQL | https://www.mysql.com/ | MySQL is an open-source relational database which runs on a number of different platforms such as Windows, Linux, and Mac OS, etc. | Free/Paid |
NarrativeQA | https://github.com/deepmind/narrativeqa | NarrativeQA is a dataset built to encourage deeper comprehension of language. This dataset involves reasoning over reading entire books or movie scripts. This dataset contains approximately 45K question answer pairs in free-form text. There are two modes of this dataset (1) reading comprehension over summaries and (2) reading comprehension over entire books/scripts.;Apache License 2.0 - A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code. | Free |
NASA Earth Data | http://search.earthdata.nasa.gov/search | Atmosphere, solar radiance, the cryosphere (arctic/frozen areas), the ocean, land surface (gravity, geomagnetism, tectonics), and human environments. | Free |
NASDAQ Data Store | https://data.nasdaq.com/ | Provision of access to market datasets | Paid |
National Climatic Data Center | https://www.ncdc.noaa.gov/data-access/quick-links | Here you can find an archive of climate and weather data sets across the US | Free |
National Government Statistical Web Sites | http://www.archive-it.org/ | Datasets, reports, statistical yearbooks, press releases | Free/Paid |
National Institute of Health | https://dash.nichd.nih.gov/ | NIH related only | Free |
National Institute on Alcohol Abuse and Alcoholism | https://www.niaaa.nih.gov/research/niaaa-data-archive | Must be associated with NIH to access. | Free |
National Science Foundation | https://www.nsf.gov/statistics/data.cfm | NSF data sets | Free/Paid |
National Space Science Data Center | http://nssdc.gsfc.nasa.gov/ | The NASA Space Science Data Coordinated Archive serves as the permanent archive for NASA space science mission data | Paid |
NetworkRepository: Interactive Data Repository | http://www.networkrepository.com/ | The first interactive data and network data repository with real-time visual analytics | Paid |
NewsQA | https://www.microsoft.com/en-us/research/project/newsqa-dataset | The purpose of the NewsQA dataset is to help the research community build algorithms that are capable of answering questions requiring human-level comprehension and reasoning skills. Leveraging CNN articles from the DeepMind Q&A Dataset, we prepared a crowd-sourced machine reading comprehension dataset of 120K Q&A pairs.;Parts of the dataset are under different licenses, check the dataset web page for more information | Free |
NHSA Health and Social Care Information Centre | http://www.forbes.com/health/ | Health data sets from the UK National Health Service. | Free/Paid |
Niderhoff's NLP datasets | https://github.com/niderhoff/nlp-datasets | Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) | Free/Paid |
NIH Open Data Portal | https://opendata.ncats.nih.gov/covid19/index.html | COVID-19 datasets and the assay protocols used to generate them. | Free |
NOAA.gov | https://www.ncdc.noaa.gov/cdo-web/datasets | Climate data. | Free |
NSynth | https://magenta.tensorflow.org/datasets/nsynth | A large-scale and high-quality dataset of annotated musical notes. The NSynth Dataset is an audio dataset containing ~300k musical notes, each with a unique pitch, timbre, and envelope. Each note is annotated with three additional pieces of information based on a combination of human evaluation and heuristic algorithms: the method of sound production for the note's instrument, the high-level family of which the note's instrument is a member, and sonic qualities of the note.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
nuScenes | https://www.nuscenes.org/ | The nuScenes dataset is a large-scale autonomous driving dataset. It features a full sensor suite (1x LIDAR, 5x RADAR, 6x camera, IMU, GPS), 1000 scenes of 20s each, 1,440,000 camera images, 400,000 lidar sweeps, and two diverse cities (Boston and Singapore).;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Paid |
Nvidia DG-Market | https://github.com/NVlabs/DG-Net#dg-market | Generated human image dataset. We provide our generated images and make a large-scale synthetic dataset called DG-Market. This dataset is generated by our DG-Net and consists of 128,307 images (613MB), about 10 times larger than the training set of original Market-1501.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
NVIDIA Flickr-Faces-HQ Dataset | https://github.com/NVlabs/ffhq-dataset | Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GAN). The dataset consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity and image background. It also has good coverage of accessories such as eyeglasses, sunglasses, hats, etc.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
ObjectNet | https://objectnet.dev/ | ObjectNet is a large real-world test set for object recognition with control where object backgrounds, rotations, and imaging viewpoints are random. Collected to intentionally show objects from new viewpoints on new backgrounds. 50,000 image test set, same as ImageNet, with controls for rotation, background, and viewpoint. 313 object classes with 113 overlapping ImageNet.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
Objects365 | https://www.objects365.org/overview.html | Objects365 is a brand new dataset, designed to spur object detection research with a focus on diverse objects in the wild: 365 categories, 600k images, 10 million bounding boxes.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Paid |
Open Data Census | http://census.okfn.org/ | Used to compare the progress made by different cities and local areas in releasing Open Data | Free/Paid |
Open Images V6 | https://storage.googleapis.com/openimages/web/index.html?v6 | Open Images V6 expands the annotation of the Open Images dataset with a large set of new visual relationships, human action annotations, and image-level labels. This release also adds localized narratives, a completely new form of multimodal annotations that consist of synchronized voice, text, and mouse traces over the objects being described. In Open Images V6, these localized narratives are available for 500k of its images. It also includes localized narratives annotations for the full 123k images of the COCO dataset.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
OpenBookQA | https://github.com/allenai/OpenBookQA | OpenBookQA is modeled after open book exams for assessing human understanding of a subject. The open book that comes with our questions is a set of 1329 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations.;Apache License 2.0 - A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code. | Free |
OpenData from Socrata | http://opendata.socrata.com/ | 10,000 datasets including business, education, government etc | Paid |
OpenWebText | https://skylion007.github.io/OpenWebTextCorpus/ | Open WebText – an open source effort to reproduce OpenAI’s WebText dataset. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University. Dataset was created by extracting all Reddit post urls from the Reddit submissions dataset. These links were deduplicated, filtered to exclude non-html content, and then shuffled randomly. The links were then distributed to several machines in parallel for download, and all web pages were extracted using the newspaper python package. Documents were hashed into sets of 5-grams and all documents that had a similarity threshold of greater than 0.5 were removed. The remaining documents were tokenized, and documents with fewer than 128 tokens were removed. This left 38GB of text data (40GB using SI units) from 8,013,769 documents.;Dataset packaging is licensed under CC-0 but contains content that can have a different license, check the dataset download for more details. | Free |
OPIEC | https://www.uni-mannheim.de/en/dws/research/resources/opiec/ | OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia. It contains more than 341M triples. Each triple from the corpus is composed of rich meta-data: each token from the subj / obj / rel along with NLP annotations (POS tag, NER tag, ...), provenance sentence (along with its dependency parse, sentence order relative to the article), original (golden) links contained in the Wikipedia articles, space / time, etc.;Attribution-ShareAlike 4.0 International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Free |
OrientDB | https://orientdb.com/ | OrientDB is an open-source NoSQL multi-model database which helps organizations to unlock the power of graph databases without deploying multiple systems to handle other data types | Free |
Oxford Radar RobotCar Dataset | https://dbarnes.github.io/radar-robotcar-dataset/ | The Oxford Radar RobotCar Dataset is a radar extension to The Oxford RobotCar Dataset. We provide data from a Navtech CTS350-X Millimetre-Wave FMCW radar and Dual Velodyne HDL-32E LIDARs with optimised ground truth radar odometry for 280 km of driving around Oxford, UK (in addition to all sensors in the original Oxford RobotCar Dataset).;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Oxford RobotCar Dataset | http://robotcar-dataset.robots.ox.ac.uk/ | The Oxford RobotCar Dataset contains over 100 repetitions of a consistent route through Oxford, UK, captured over a period of over a year. The dataset captures many different combinations of weather, traffic and pedestrians, along with longer term changes such as construction and roadworks.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Paid |
PANDA | http://www.panda-dataset.com/ | PANDA is the first gigaPixel-level humAN-centric viDeo dAtaset, for large-scale, long-term, and multi-object visual analysis. The scenes may contain 4k head counts with over 100× scale variation. PANDA provides enriched and hierarchical ground-truth annotations, including 15,974.6k bounding boxes, 111.8k fine-grained attribute labels, 12.7k trajectories, 2.2k groups and 2.9k interactions.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Paid |
PandaSet | https://scale.com/open-datasets/pandaset | PandaSet combines Hesai’s best-in-class LiDAR sensors with Scale AI’s high-quality data annotation. PandaSet features data collected using a forward-facing LiDAR with image-like resolution (PandarGT) as well as a mechanical spinning LiDAR (Pandar64). The collected data was annotated with a combination of cuboid and segmentation annotation (Scale 3D Sensor Fusion Segmentation). 48,000 camera images and 16,000 LiDAR sweeps.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Paid |
Paris500k | https://www.vision.rwth-aachen.de/page/paris500k | The Paris500k dataset consists of 501,356 geotagged images collected from Flickr and Panoramio. The dataset was collected from a geographic bounding box rather than using keyword queries. Thus, the images have a "natural" distribution. The dataset is very challenging due to the presence of duplicates and near-duplicates, as well as a large fraction of unrelated images, such as photos of parties, pets, etc.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
PASCAL VOC 2012 | http://host.robots.ox.ac.uk/pascal/VOC/ | PASCAL VOC (2012 version) has 20 classes. The train/val data has 11,530 images containing 27,450 ROI annotated objects and 6,929 segmentations.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
P-DESTRE | http://p-destre.di.ubi.pt/index.html | P-DESTRE is a multi-session dataset of videos of pedestrians in outdoor public environments, fully annotated at the frame level.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
PedX | http://pedx.io/ | PedX is a large-scale multi-modal collection of pedestrians at complex urban intersections. The dataset provides high-resolution stereo images and LiDAR data with manual 2D and automatic 3D annotations. The data was captured using two pairs of stereo cameras and four Velodyne LiDAR sensors.;MIT - You are Free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work. | Free |
PETA | http://mmlab.ie.cuhk.edu.hk/projects/PETA.html | Pedestrian Attribute Recognition At Far Distance dataset. The PETA dataset consists of 19000 images, with resolution ranging from 17-by-39 to 169-by-365 pixels. Those 19000 images include 8705 persons, each annotated with 61 binary and 4 multi-class attributes.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Pew Research | https://www.pewresearch.org/download-datasets | Policy, Journalism, American Trends, Internet & Tech and more. | Free |
Places2 | http://places2.csail.mit.edu/ | Places contains more than 10 million images comprising 400+ unique scene categories. The dataset features 5,000 to 30,000 training images per class, consistent with real-world frequencies of occurrence.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
PostgreSQL | https://www.postgresql.org/ | PostgreSQL is an open-source relational database that can also link to other data stores, including NoSQL systems, and so act as a federated hub for polyglot databases. | Free |
Projectdatasets | https://perso.telecom-paristech.fr/eagan/class/igr204/datasets | Simple multidimensional datasets that are for the most part classic infovis datasets. | Free |
ProPublica Data Store | https://www.propublica.org/datastore/datasets | Browse data sets about Health, Criminal Justice, Education, Politics, Business, Transportation, Military, Environment, Finance, or Religion. | Free/Paid |
QASC | https://github.com/allenai/qasc | QASC is a question-answering dataset with a focus on sentence composition. It consists of 9,980 8-way multiple-choice questions about grade school science (8,134 train, 926 dev, 920 test), and comes with a corpus of 17M sentences.;License information not found | Free |
Qlik DataMarket | https://www.qlik.com/us/products/qlik-data-market | Connect multiple data sets with your existing data from one place, in one format. | Paid |
QMUL-OpenLogo | https://qmul-openlogo.github.io/ | QMUL-OpenLogo contains 27,083 images from 352 logo classes, built by aggregating and refining 7 existing datasets and establishing an open logo detection evaluation protocol.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Quandl | https://www.quandl.com/ | The premier source for financial, economic, and alternative datasets, serving investment professionals | Paid |
Quantopian | https://www.quantopian.com/docs/data-reference/overview | The Data Reference provides an overview of the data available on Quantopian as well as documentation for each dataset. | Free/Paid |
Question Answering in Context (QuAC) | https://quac.ai/ | QuAC, a dataset for Question Answering in Context that contains 14K information-seeking QA dialogs (100K questions in total). Question Answering in Context is a dataset for modeling, understanding, and participating in information seeking dialog. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of free-form questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context.;Attribution-ShareAlike 4.0 International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Recipe1M | http://pic2recipe.csail.mit.edu/ | Recipe1M, a new large-scale, structured corpus of over one million cooking recipes and 13 million food images. As the largest publicly available collection of recipe data, Recipe1M affords the ability to train high-capacity models on aligned, multi-modal data.;License information not found | Paid |
RecipeQA | https://hucvl.github.io/recipeqa/ | RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.;RecipeQA contains question-answer pairs generated from copyright-free recipes found online under a variety of licences. The corresponding licence for each recipe is also provided in the dataset, see recipes.json. | Free |
Reddit comments | https://www.reddit.com/r/datasets/comments/65o7py/updated_reddit_comment_dataset_as_torrents/ | Reddit Comments from 2005-12 to 2017-03. Downloaded from https://files.pushshift.io/comments.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Reddit Datasets | https://www.reddit.com/r/datasets/ | Has dataset listings and requests for data. | Free |
Rejustify | https://rejustify.com/ | More than 600 million data series from more than 60 trusted statistical sources, and counting. | Paid |
Replica | https://github.com/facebookresearch/Replica-Dataset | The Replica Dataset is a dataset of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean dense geometry, high resolution and high dynamic range textures, glass and mirror surface information, planar segmentation as well as semantic class and instance segmentation.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
RISE Video Dataset | https://github.com/CMU-CREATE-Lab/deep-smoke-machine#dataset | We introduce RISE, the first large-scale video dataset for Recognizing Industrial Smoke Emissions. Our dataset contains 12,567 clips with 19 distinct views from cameras on three sites that monitored three different industrial facilities.;License information not found | Free |
RoadText-1K | http://cvit.iiit.ac.in/research/projects/cvit-projects/roadtext-1k | Dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. Our dataset comprises 1000 video clips of driving without any bias towards text and with annotations for text bounding boxes and transcriptions in every frame. Each video is from the BDD100K dataset.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Robert Shiller data | http://www.econ.yale.edu/~shiller/data.htm | Housing, stock market, and more from his book Irrational Exuberance. | Paid |
ScanNet | http://www.scan-net.org/ | ScanNet is an RGB-D video dataset containing 2.5 million views in more than 1500 scans, annotated with 3D camera poses, surface reconstructions, and instance-level semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
SceneNet RGB-D | https://robotvault.bitbucket.io/scenenet-rgbd.html | It provides pixel-perfect ground truth for scene understanding problems such as semantic segmentation, instance segmentation, and object detection, and also for geometric computer vision problems such as optical flow, depth estimation, camera pose estimation, and 3D reconstruction. A set of 5M rendered RGB-D images from over 15K trajectories in synthetic layouts with random but physically simulated object poses.;GPL - You are Free to: copy, distribute and modify the software as long as you track changes/dates in source files. Under the following terms: any modifications to or software including (via compiler) GPL-licensed code must also be made available under the GPL along with build & install instructions. | Free |
Schema-Guided Dialogue | https://github.com/google-research-datasets/dstc8-schema-guided-dialogue | Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains. Our dataset exceeds the existing task-oriented dialogue corpora in scale, while also highlighting the challenges associated with building large-scale virtual assistants. It provides a challenging testbed for a number of tasks including language understanding, slot filling, dialogue state tracking and response generation.;License information not found | Free |
SciTLDR | https://github.com/allenai/scitldr | A dataset of almost 4,000 TLDRs written about AI research papers hosted on the 'OpenReview' publishing platform. SciTLDR includes at least two high-quality TLDRs for each paper.;Apache License 2.0 - A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code. | Free |
Semantic Drone Dataset | http://dronedataset.icg.tugraz.at/ | The Semantic Drone Dataset focuses on semantic understanding of urban scenes for increasing the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from nadir (bird's eye) view acquired at an altitude of 5 to 30 meters above ground. A high resolution camera was used to acquire images at a size of 6000x4000px (24Mpx). The training set contains 400 publicly available images and the test set is made up of 200 private images.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
SemanticKITTI | http://www.semantic-kitti.org/index.html | SemanticKITTI is based on the KITTI Vision Benchmark and we provide semantic annotation for all sequences of the Odometry Benchmark. The dataset contains 28 classes including classes distinguishing non-moving and moving objects.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Paid |
SEN12MS | https://www.isprs-ann-photogramm-remote-sens-spatial-inf-sci.net/IV-2-W7/153/2019/ | SEN12MS is a dataset consisting of 180,748 corresponding image triplets containing Sentinel-1 dual-pol SAR data, Sentinel-2 multi-spectral imagery, and MODIS-derived land cover maps.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
ShapeNet | https://shapenet.org/ | ShapeNet is an ongoing effort to establish a richly-annotated, large-scale dataset of 3D shapes. ShapeNet is organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them being nouns (80,000+).;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
SIDD | https://www.eecs.yorku.ca/~kamel/sidd/ | The Smartphone Image Denoising Dataset (SIDD) consists of ~30,000 noisy images from 10 scenes under different lighting conditions, captured using five representative smartphone cameras, together with generated ground truth images.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
SNLI | https://nlp.stanford.edu/projects/snli/ | The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE).;Attribution-ShareAlike 4.0 International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Paid |
Social-IQ | https://www.thesocialiq.com/ | The dataset contains rigorously annotated and validated videos, questions and answers, as well as annotations for the complexity level of each question and answer. Social-IQ brings novel challenges to the field of artificial intelligence which sparks future research in social intelligence modeling, visual reasoning, and multimodal question answering. 1,250 videos, 7,500 questions, 33,000 correct answers, 22,500 incorrect answers.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Sociometrics | https://www.socio.com/products/data | Datasets for secondary analysis by researchers | Paid |
Socrata | https://opendata.socrata.com/ | The Socrata data platform enables governments to use data as a strategic asset in the design, management, and delivery of programs. Data flows easily between staff and departments leading to more efficient programs and better decision making. | Free |
Spacecraft Pose Estimation Dataset (SPEED) | https://kelvins.esa.int/satellite-pose-estimation-challenge/home/ | SPEED consists of synthetic as well as actual camera images of a mock-up of the Tango spacecraft from the PRISMA mission. The synthetic images are created by fusing OpenGL-based renderings of the spacecraft’s 3D model with actual images of the Earth captured by the Himawari-8 meteorological satellite. Dataset contains over 12,000 images with a resolution of 1920×1200 pixels.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Spacenet | http://explore.digitalglobe.com/spacenet | SpaceNet is an online repository of freely available satellite imagery, co-registered map data to train algorithms, and a series of public challenges designed to accelerate innovation in machine learning using geospatial data. This first-of-its-kind open innovation project for the geospatial industry is a collaboration between CosmiQ Works, DigitalGlobe and NVIDIA. In the first year, over 5,700 km² of very high-resolution imagery and more than 520,000 vectors were released through SpaceNet on AWS.;Parts of the dataset are under different licenses, check the dataset web page for more information | Free |
Spider 1.0 | https://yale-lily.github.io/spider | Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset. Spider consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains.;Attribution-ShareAlike 4.0 International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Sports Statistics | https://sports-statistics.com/ | This website provides data for soccer, NBA, NFL, NHL, and more. | Free |
SQLite | https://www.sqlite.org/ | SQLite is an open-source, embedded relational database management system, designed circa 2000. It requires zero configuration and no server or installation. | Free |
Stanford cars | https://ai.stanford.edu/~jkrause/cars/car_dataset.html | Stanford Cars dataset contains 16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, with each class split roughly 50-50. Classes are typically at the level of Make, Model, Year, e.g. 2012 Tesla Model S or 2012 BMW M3 coupe.;License information not found | Free |
Stanford Drone Dataset | http://cvgl.stanford.edu/projects/uav_data/ | A large scale dataset that collects images and videos of various types of agents (not just pedestrians, but also bicyclists, skateboarders, cars, buses, and golf carts) that navigate in a real world outdoor environment such as a university campus. In the sample images on the project page, pedestrians are labeled in pink, bicyclists in red, skateboarders in orange, and cars in green. 60 videos of 8 distinct scenes.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Stanford Sentiment Treebank | https://nlp.stanford.edu/sentiment/code.html | A dataset for sentiment analysis that includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality.;License information not found | Free |
StatLib | http://lib.stat.cmu.edu/datasets/ | Carnegie Mellon University Datasets Archive | Free |
StreetHazards | https://github.com/hendrycks/anomaly-seg | We leverage a simulated driving environment to create a dataset for anomaly segmentation, which we call StreetHazards. It contains 5,125 training images and 1,500 test images containing 250 anomaly types.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
SuperGLUE benchmark | https://super.gluebenchmark.com/ | SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. Full citation list of the datasets contained: The CommitmentBank: Investigating projection in naturally occurring discourse, Choice of plausible alternatives: An evaluation of commonsense causal reasoning, Looking beyond the surface: A challenge set for reading comprehension over multiple sentences, The PASCAL recognising textual entailment challenge, The second PASCAL recognising textual entailment challenge, The third PASCAL recognizing textual entailment challenge, The Fifth PASCAL Recognizing Textual Entailment Challenge, WiC: The Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations, The Winograd schema challenge.;The dataset contains data from several sources, check the links on the website for individual licenses | Paid |
Supervisely Person | http://supervise.ly/ | Dataset consists of 5,711 images with 6,884 high-quality annotated person instances. Can be found on Supervise.ly under “Datasets library”.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
SVHN Street View House Numbers | http://ufldl.stanford.edu/housenumbers/ | SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
SVIRO | https://sviro.kl.dfki.de/ | SVIRO is a Synthetic dataset for Vehicle Interior Rear seat Occupancy detection and classification. The dataset consists of 25,000 sceneries across ten different vehicles and we provide several simulated sensor inputs and ground truth data.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Paid |
SWAG | https://rowanzellers.com/swag/ | Situations With Adversarial Generations is a large-scale dataset for the task of grounded commonsense inference, unifying natural language inference and physically grounded reasoning. The dataset consists of 113k multiple choice questions about grounded situations. Each question is a video caption from LSMDC or ActivityNet Captions, with four answer choices about what might happen next in the scene. The correct answer is the (real) video caption for the next event in the video; the three incorrect answers are adversarially generated and human verified, so as to fool machines but not humans.;MIT - You are Free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work. | Free |
Synscapes | https://7dlabs.com/synscapes-overview | A photorealistic synthetic dataset for street scene parsing. The images in the dataset do not follow a driven path through a single virtual world. Instead, an entirely unique scene was procedurally generated for each of the 25,000 images. As a result, the dataset contains a wide range of variations and unique combinations of features.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
SYNTHIA | http://synthia-dataset.net/ | The SYNTHetic collection of Imagery and Annotations is a dataset generated to aid semantic segmentation and related scene understanding problems in the context of driving scenarios. SYNTHIA consists of a collection of photo-realistic frames rendered from a virtual city and comes with precise pixel-level semantic annotations. It contains: 200,000+ HD images from video streams and 20,000+ HD images from independent snapshots. Scene diversity: European style town, modern city, highway and green areas. Variety of dynamic objects: cars, pedestrians and cyclists.;Attribution-ShareAlike 4.0 International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Synthinel-1 | https://github.com/timqqt/Synthinel/blob/master/README.md | A collection of high resolution synthetic overhead imagery for building segmentation. Synthinel-1 consists of 2,108 synthetic images generated in nine distinct building styles within a simulated city. These images are paired with "ground truth" annotations that segment each of the buildings. Synthinel also has a subset dataset called Synth-1, which contains 1,640 images spread across six styles.;License information not found | Free |
TabFact: A Large-scale Dataset for Table-based Fact Verification | https://tabfact.github.io/ | We introduce a large-scale dataset called TabFact (website: https://tabfact.github.io/), which consists of 117,854 manually annotated statements about 16,573 Wikipedia tables, with their relations classified as ENTAILED or REFUTED.;MIT - You are Free to: use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the work. Under the following terms: the work is provided "as is", you must include copyright and the license in all copies or substantial uses of the work. | Free |
TACO (Trash Annotations in Context) | http://tacodataset.org/ | TACO is an open image dataset of waste in the wild. It contains photos of litter taken under diverse environments, from tropical beaches to London streets. These images are manually labeled and segmented according to a hierarchical taxonomy to train and evaluate object detection algorithms.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
TAO | http://taodataset.org/ | TAO is a federated dataset for Tracking Any Object, containing 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Taskmaster-1 | https://ai.google/tools/datasets/taskmaster-1 | The dataset consists of 13,215 task-based dialogs, including 5,507 spoken and 7,708 written dialogs created with two distinct procedures. Each conversation falls into one of six domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations.;Attribution-ShareAlike 4.0 International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Ten Thousand German News Articles Dataset | https://tblock.github.io/10kGNAD/ | The 10kGNAD dataset is intended as the first German topic classification dataset. It consists of 10,273 German-language news articles from an Austrian online newspaper categorized into nine topics. These articles are a previously unused part of the One Million Posts Corpus.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
Tencent ML-Images | https://github.com/Tencent/tencent-ml-images | Tencent ML-Images is the largest open-source multi-label image dataset, including 17,609,752 training and 88,739 validation image URLs which are annotated with up to 11,166 categories.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
TextVQA | https://textvqa.org/ | TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions. Dataset contains 28,408 images from OpenImages, 45,336 questions, 453,360 ground truth answers.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
The Boxy Vehicles Dataset | https://boxy-dataset.com/boxy/ | A large dataset of almost two million annotated vehicles for training and evaluating object detection methods. 200,000 images. 1,990,000 annotated vehicles. 5 Megapixel resolution.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
The Cancer Imaging Archive | https://www.cancerimagingarchive.net/ | TCIA is a service which de-identifies and hosts a large archive of medical images of cancer accessible for public download. The data are organized as “Collections”, typically patients related by a common disease (e.g. lung cancer), image modality (MRI, CT, etc.) or research focus. DICOM is the primary file format used by TCIA for image storage.;Datasets are under different licenses, check the dataset web page for more information | Paid |
The DriveU Traffic Light Dataset (DTLD) | http://www.traffic-light-data.com/ | DTLD contains more than 230,000 annotated traffic lights in camera images with a resolution of 2 megapixels. The dataset was recorded in 11 cities in Germany with a frequency of 15 Hz. Due to additional annotation attributes such as the traffic light pictogram, orientation or relevancy, 344 unique classes exist. In addition to camera images and labels, we provide stereo information in the form of disparity images, allowing stereo-based detection and depth-dependent evaluations.;License information not found | Free |
The German Traffic Sign Recognition Benchmark | http://benchmark.ini.rub.de/?section=gtsrb | The German Traffic Sign Benchmark is a multi-class, single-image classification challenge held at the IJCNN 2011. The dataset contains: more than 40 classes, more than 50,000 images in total.;License information not found | Free |
The Massively Multilingual Image Dataset (MMID) | http://multilingual-images.org/ | MMID is a large-scale, massively multilingual dataset of images paired with the words they represent, collected at the University of Pennsylvania. By far the largest dataset of its kind, it has 98 languages (including English) and up to 10,000 words per language (and many more for English).;License information not found | Free |
The Quick, Draw! Dataset | https://github.com/googlecreativelab/quickdraw-dataset | The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw! The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located. You can browse the recognized drawings on quickdraw.withgoogle.com/data.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
The Stanford Question Answering Dataset (SQuAD) 2.0 | https://rajpurkar.github.io/SQuAD-explorer/ | Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.;Attribution-ShareAlike 4.0 International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Free |
The Unsupervised Labeled Lane Markers Dataset | https://unsupervised-llamas.com/llamas/ | The Unsupervised Llamas dataset was annotated by creating high definition maps for automated driving including lane markers based on Lidar. The automated vehicle can be localized against these maps and the lane markers are projected into the camera frame. The 3D projection is optimized by minimizing the difference between already detected markers in the image and projected ones. Further improvements can likely be achieved by using better detectors, optimizing difference metrics, and adding some temporal consistency. Over 100,000 annotated images. Annotations of over 100 meters. Resolution of 1276 x 717 pixels.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Total Text | https://github.com/cs-chan/Total-Text-Dataset | The Total-Text consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.;BSD 3-Clause "New" or "Revised" License - A permissive license similar to the BSD 2-Clause License, but with a 3rd clause that prohibits others from using the name of the project or its contributors to promote derived products without written consent. | Free |
Toyota Smarthome dataset | https://project.inria.fr/toyotasmarthome/ | Smarthome has been recorded in an apartment equipped with 7 Kinect v1 cameras. It contains 31 daily living activities and 18 subjects. The videos were clipped per activity, resulting in a total of 16,115 video samples.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
TrackingNet | https://tracking-net.org/ | A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. > 30K Video Sequences, > 14M Bounding Boxes. Diversity ensured by YouTube.;License information not found | Free |
Tsinghua-Tencent 100k (traffic signs) | https://cg.cs.tsinghua.edu.cn/traffic-sign/ | It provides 100,000 images containing 30,000 traffic-sign instances. These images cover large variations in illuminance and weather conditions. Each traffic-sign in the benchmark is annotated with a class label, its bounding box and pixel mask.;License information not found | Free/Paid |
TVQA+ | http://tvqa.cs.unc.edu/ | TVQA is a large-scale video QA dataset based on 6 popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey's Anatomy, Castle). It consists of 152.5K QA pairs from 21.8K video clips, spanning over 460 hours of video. TVQA+ contains 310.8k bounding boxes, linking depicted objects to visual concepts in questions and answers.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Twitter100k | https://github.com/huyt16/Twitter100k | Twitter100k dataset is characterized by two aspects: 1) it has 100,000 image-text pairs randomly crawled from Twitter and thus has no constraint in the image categories; 2) text in Twitter100k is written in informal language by the users.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
TyDi QA | https://github.com/google-research-datasets/tydiqa | TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology -- the set of linguistic features that each language expresses -- such that we expect models performing well on this set to generalize across a large number of the languages in the world. It contains language phenomena that would not be found in English-only corpora.;Apache License 2.0 - A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code. | Free |
U.S. Census | https://www.census.gov/data/data-tools.html | The Census Bureau's mission is to serve as the nation's leading provider of quality data about its people and economy. | Free |
UC Irvine ML Repository | https://archive.ics.uci.edu/ml/index.php | They manage a list of over 450 datasets for Machine Learning projects. Various subjects. | Free |
UCI KDD Database Repository | http://kdd.ics.uci.edu/ | This is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas. | Free |
UCI Machine Learning Repository | http://archive.ics.uci.edu/ml/ | This project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged. | Free |
UMD Faces | http://www.umdfaces.io/ | The dataset contains 367,888 face annotations for 8,277 subjects divided into 3 batches. Contains bounding boxes, the estimated pose (yaw, pitch, and roll), locations of twenty-one keypoints, and gender information generated by a pre-trained neural network. The second part contains 3,735,476 annotated video frames extracted from a total of 22,075 videos for 3,107 subjects.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
UN Data | http://data.un.org/ | World Data from United Nations Databases | Free |
UNICEF Data | https://data.unicef.org | UNICEF data on children and women around the world | Free |
United States Environmental Protection Agency | https://edg.epa.gov/metadata/catalog/main/home.page | Environmental datasets of all types for research | Free |
University of Missouri | https://edg.epa.gov/metadata/catalog/main/home.page | Many datasets available on different areas of research | Free/Paid |
University-1652 | https://github.com/layumi/University1652-Baseline | A Multi-view Multi-source Benchmark for Drone-based Geo-localization annotates 1652 buildings in 72 universities around the world.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
US Department of Justice | https://www.justice.gov/open/open-data | Dept. of Justice Open Data Sets | Free |
UTKFace | https://susanqq.github.io/UTKFace/ | UTKFace dataset is a large-scale face dataset with long age span (range from 0 to 116 years old). The dataset consists of over 20,000 face images with annotations of age, gender, and ethnicity. The images cover large variation in pose, facial expression, illumination, occlusion, resolution, etc. This dataset could be used on a variety of tasks, e.g., face detection, age estimation, age progression/regression, landmark localization, etc.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
VCR (Visual Commonsense Reasoning) | https://visualcommonsense.com/ | Visual Commonsense Reasoning (VCR) is a new task and large-scale dataset for cognition-level visual understanding. It contains 290k multiple-choice questions, 290k correct answers and rationales (one per question), and 110k images. Counterfactual choices are obtained with minimal bias via our new Adversarial Matching approach. Answers are 7.5 words on average and rationales are 16 words. High human agreement (>90%). Scaffolded on top of 80 object categories from COCO.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Vehicle-1M | http://www.nlpr.ia.ac.cn/iva/homepage/jqwang/Vehicle1M.htm | The Vehicle-1M dataset is constructed by National Laboratory of Pattern Recognition, Institute of Automation, University of Chinese Academy of Sciences (NLPR, CASIA). This dataset involves vehicle images captured across day and night, from head or rear, by multiple surveillance cameras installed in several cities in China. There are 936,051 images in total, from 55,527 vehicles and 400 vehicle models. Each image carries a vehicle ID label denoting its real-world identity as well as a vehicle model label indicating the make, model and year of the vehicle (e.g. "Audi-A6-2013").;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
VERI-Wild | https://github.com/PKU-IMRE/VERI-Wild | A large-scale vehicle ReID dataset in the wild (VERI-Wild) is captured from a large CCTV surveillance system consisting of 174 cameras across one month (30 × 24 h) under unconstrained scenarios. The cameras are distributed in a large urban district of more than 200 km². After data cleaning and annotation, 416,314 vehicle images of 40,671 identities are collected.;License information not found | Free |
VGG-Sound | http://www.robots.ox.ac.uk/~vgg/data/vggsound/ | VGG-Sound is an audio-visual correspondent dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube. 200,000+ videos, 550+ hours, 310+ classes.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
Vhinny | https://www.vhinny.com/about | Vhinny is an investment research platform created by Vitalii Dodonov as his personal project in Spring 2019. As a young individual investor, developer, and a data scientist, he wanted to build a system that would assist him in screening various investment opportunities. With most commercial data providers being too expensive for individual investors, he has built his own tool which has become www.vhinny.com. | Paid |
VIOLIN | https://github.com/jimmy646/violin | Violin (VIdeO-and-Language INference), consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video (YouTube and TV shows).;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
VisDrone2019 | http://www.aiskyeye.com/ | The VisDrone2019 dataset is collected by the AISKYEYE team at Lab of Machine Learning and Data Mining, Tianjin University, China. The benchmark dataset consists of 288 video clips formed by 261,908 frames and 10,209 static images, captured by various drone-mounted cameras, covering a wide range of aspects including location (taken from 14 different cities separated by thousands of kilometers in China), environment (urban and country), objects (pedestrian, vehicles, bicycles, etc.), and density (sparse and crowded scenes).;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Visual Genome | http://visualgenome.org/ | Visual Genome is a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to language. It contains 108,077 images, 5.4 million region descriptions, 1.7 million visual question answers, and 3.8 million object instances.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
VisualData.io | https://www.visualdata.io/ | Place where you can share and find computer vision related datasets. | Free |
Visualisingdata | https://www.visualisingdata.com/ | A big collection of sites and services for accessing data | Free |
VOiCES | https://voices18.github.io/ | The Voices Obscured in Complex Environmental settings (VOiCES) corpus presents audio recorded in acoustically challenging conditions. Source Material: a total of 15 hours (3,903 audio files).;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
VoxCeleb | http://www.robots.ox.ac.uk/~vgg/data/voxceleb/ | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. It contains data from 7,000+ speakers, 1 million+ utterances, 2,000+ hours. VoxCeleb consists of both audio and video. Each segment is at least 3 seconds long.;Attribution-ShareAlike 4.0 International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit, ShareAlike - if you make changes, you must distribute your contributions. | Paid |
VQA Visual Question Answering | http://www.visualqa.org/ | VQA is a dataset containing open-ended questions about images. These questions require an understanding of vision and language. It contains 265,016 images (COCO and abstract scenes), at least 3 questions (5.4 questions on average) per image, 10 ground truth answers per question.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
Waymo Open Dataset | https://waymo.com/open/ | The Waymo Open Dataset comprises high resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions. We are releasing this dataset publicly to aid the research community in making advancements in machine perception and self-driving technology. The Waymo Open Dataset currently contains lidar and camera data from 1,000 segments of 20s each, collected at 10Hz (200,000 frames) in diverse geographies and conditions, with labels for 4 object classes (Vehicles, Pedestrians, Cyclists, Signs), 12M 3D bounding box labels with tracking IDs on lidar data, 1.2M 2D bounding box labels with tracking IDs on camera data...;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Webhose free datasets | https://webhose.io/datasets | Include data from a range of different sources, languages and categories | Free/Paid |
WebLogo-2M | http://www.eecs.qmul.ac.uk/~hs308/WebLogo-2M.html | The WebLogo-2M dataset is a weakly labelled (at image level rather than object bounding box level) logo detection dataset. The dataset was constructed automatically by sampling the Twitter stream data. It contains 194 unique logo classes and over 2 million logo images.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
WIDER | http://yjxiong.me/event_recog/WIDER/ | WIDER is a dataset for complex event recognition from static images. As of v0.1, it contains 61 event categories and around 50,574 images annotated with event class labels. We provide a split of 50% for training and 50% for testing.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
WIDER Face | http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/ | WIDER FACE dataset is a face detection benchmark dataset, of which images are selected from the publicly available WIDER dataset. We choose 32,203 images and label 393,703 faces with a high degree of variability in scale, pose and occlusion as depicted in the sample images. WIDER FACE dataset is organized based on 61 event classes.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
WiderPerson | http://www.cbsr.ia.ac.cn/users/sfzhang/WiderPerson/ | The WiderPerson dataset is a pedestrian detection benchmark dataset in the wild, of which images are selected from a wide range of scenarios, no longer limited to the traffic scenario. We choose 13,382 images and label about 400K annotations with various kinds of occlusions. We randomly select 8000/1000/4382 images as training, validation and testing subsets.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
WikiHow-Dataset | https://github.com/mahnazkoupaee/WikiHow-Dataset | WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base. Please refer to the paper for more information regarding the dataset and its properties. Each article consists of multiple paragraphs and each paragraph starts with a sentence summarizing it. By merging the paragraphs to form the article and the paragraph outlines to form the summary, the resulting version of the dataset contains more than 200,000 long-sequence pairs.;Attribution-NonCommercial-ShareAlike International - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, Under the following terms: Attribution - you must give appropriate credit, NonCommercial - you may not use the material for commercial purposes, ShareAlike - if you make changes, you must distribute your contributions. | Free |
WildDash | http://www.wilddash.cc/ | The main focus of this dataset is testing. It contains data recorded under real world driving situations. Its aims are to compile and provide standard data which can be used for evaluation, to establish accepted evaluation protocols, data and measures, and to boost algorithm development for driving applications using computer vision techniques. The WildDash dataset does not offer enough material to train algorithms by itself.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
World Bank | https://data.worldbank.org/ | Free and open access to global development data | Free |
xBD | https://xview2.org/dataset | A dataset for assessing building damage from satellite imagery. With over 850,000 building polygons from six different types of natural disaster around the world, covering a total area of over 45,000 square kilometers, the xBD dataset is one of the largest and highest quality public datasets of annotated high-resolution satellite imagery.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Yahoo Flickr Creative Commons 100M | https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67 | This dataset contains a list of photos and videos. This list is compiled from data available on Yahoo! Flickr. All the photos and videos provided in the list are licensed under one of the Creative Commons copyright licenses.;Parts of the dataset are under different licenses, check the dataset web page for more information | Paid |
Yahoo Sandbox datasets | http://webscope.sandbox.yahoo.com/catalog.php | The Yahoo Webscope Program is a reference library of interesting and scientifically useful datasets for non-commercial use by academics and other scientists. Categories: Language, Graph, Ratings, Advertising and Marketing, Competition. | Paid |
Yelp open dataset | https://www.yelp.com/dataset | The Yelp dataset contains data about businesses, reviews, and user data for use in personal, educational, and academic purposes. Available in both JSON and SQL files.;Can only be used for research and educational purposes. Commercial use is prohibited. | Free |
Yoga-82 | https://sites.google.com/view/yoga-82/home | Yoga-82: A New Dataset for Fine-grained Classification of Human Poses. A dataset for yoga pose classification with 3 level hierarchy based on body pose. It is constructed from web images and consists of 82 yoga poses.;Can only be used for research and educational purposes. Commercial use is prohibited. | Paid |
Youtube-8M 2018 | https://research.google.com/youtube8m/index.html | YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs and associated labels from a diverse vocabulary of 4700+ visual entities. It comes with precomputed state-of-the-art audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Paid |
Youtube-BoundingBoxes | https://research.google.com/youtube-bb/index.html | YouTube-BoundingBoxes is a large-scale data set of video URLs with densely-sampled high-quality single-object bounding box annotations. The data set consists of approximately 380,000 15-20s video segments extracted from 240,000 different publicly visible YouTube videos, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera.;Attribution 4.0 International (CC BY 4.0) - You are Free to: Share - copy and redistribute, Adapt - remix, transform, and build upon, even commercially, Under the following terms: Attribution - you must give appropriate credit. | Free |
OpenCorporates | https://opencorporates.com/ | The largest open database of companies | Paid |
Uppsala Conflict Data Program | https://www.pcr.uu.se/research/UCDP/ | The Uppsala Conflict Data Program (UCDP) is the world’s main provider of data on organized violence and the oldest ongoing data collection project for civil war, with a history of almost 40 years. | Free |
Instagram Graph API | https://developers.facebook.com/docs/instagram-api | The Instagram Graph API allows Instagram Professional accounts — Businesses and Creators — to use your app to manage their presence on Instagram. | Free |
Atlas of Economic Complexity | https://atlas.cid.harvard.edu/ | Harvard Growth Lab’s research and data visualization tool used to understand the economic dynamics and new growth opportunities for every country worldwide. | Free |
BFI Statistical Yearbook | https://www.bfi.org.uk/education-research/film-industry-statistics-research | Provide research data and market intelligence to anyone with an interest in the UK film industry and film culture. | Free |
Statista | https://www.statista.com/ | Statista has established itself as a leading provider of market and consumer data. | Free/Paid |
London Datastore | https://data.london.gov.uk/ | The London Datastore is a free and open data-sharing portal where anyone can access data relating to the capital. Whether you’re a citizen, business owner, researcher or developer, the site provides over 700 datasets to help you understand the city and develop solutions to London’s problems. | Free |
UK Data Service | https://www.ukdataservice.ac.uk/ | The UK Data Service collection includes major UK government-sponsored surveys, cross-national surveys, longitudinal studies, UK census data, international aggregate, business data, and qualitative data. | Free |
BNONews | https://bnonews.com/index.php/2020/02/yesterdays-data-tracking-coronavirus-in-the-u-s/ | Tracking Coronavirus in the US (Covid-19) | Free |
Johns Hopkins Univ. Data | https://github.com/CSSEGISandData/COVID-19 | COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. | Free |
DXY.cn | https://docs.google.com/spreadsheets/d/1jS24DjSPVWa4iuxuD4OAXrE3QeI8c9BC1hSlqr-NMiU/edit#gid=1187587451 | DXY is a leading connector and digital service provider in the healthcare industry of China. Provides patient Covid-19 data including gender, age, symptom data, and location. | Free |
Nextstrain | https://nextstrain.org/sars-cov-2/ | Nextstrain SARS-CoV-2 (Covid-19) resources. Around the world, people are sequencing and sharing SARS-CoV-2 genomic data. This page lists publicly available SARS-CoV-2 analyses that use Nextstrain from groups all over the world. | Free |
Kaggle | https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset | Daily-level information on the number of Covid-19 cases across the globe. | Free |
ECDC | https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide | European Centre for Disease Prevention and Control historical data (to 14 December 2020) on the daily number of new reported COVID-19 cases and deaths worldwide. | Free |
Data.world | https://data.world/datasets/bitcoin | Bitcoin and cryptocurrency datasets. | Free/Paid |
Kaggle | https://www.kaggle.com/search?q=bitcoin | Bitcoin data sources. | Free |
CryptoDataDownload | https://www.cryptodatadownload.com/data/ | Free data for cryptocurrency enthusiasts or risk analysts to do their own research or practice their skills. Bitcoin. | Free |
Coinmetrics | https://coinmetrics.io/community-network-data/ | Download historical cryptocurrency community data in CSV format for any supported asset. Bitcoin. | Free |
CryptoDatum.io | https://cryptodatum.io/csv_downloads | Cryptocurrency data for machine learning. | Free |
IEEE.org | https://ieee-dataport.org/open-access/bitcoin-hacked-transactions-2010-2013 | Bitcoin hacked transactions, 2010-2013. | Free |
IEEE.org | https://ieee-dataport.org/topic-tags/covid-19 | Datasets related to Covid-19. | Free |
IEEE.org | https://ieee-dataport.org/topic-tags/computer-vision | Datasets for computer vision. | Free |
IEEE.org | https://ieee-dataport.org/topic-tags/iot | Datasets for IoT - Internet of Things. | Free |
HUD | https://www.huduser.gov/portal/pdrdatas_landing.html#dataset-title | HUD provides interested researchers with access to the original data sets generated by PD&R-sponsored data collection efforts, including the American Housing Survey, median family incomes and income limits, as well as microdata from research initiatives on topics such as housing discrimination, the HUD-insured multifamily housing stock, and the public housing population. | Free |
FFHQ | https://github.com/NVlabs/ffhq-dataset | Flickr-Faces-HQ Dataset. The dataset consists of 70,000 high-quality PNG images at 1024×1024 resolution and contains considerable variation in terms of age, ethnicity and image background. | Free |
Google Facial Recognition | https://research.google/tools/datasets/google-facial-expression/ | This dataset is a large-scale facial expression dataset consisting of face image triplets along with human annotations that specify which two faces in each triplet form the most similar pair in terms of facial expression. | Free |