Tuomas Siipola

High-quality photograph datasets

Many times I've found myself without good test images while developing image-related software or websites. The web is full of images but their quality varies greatly and most of them are distributed under unclear or restrictive licenses. Furthermore, most websites don't have any option to easily download a larger collection of images to my hard drive.

To solve this problem, I've created a couple of datasets based on Wikimedia Commons. The site is full of freely licensed content, but its quality varies, and it also hosts many types of media, like illustrations and videos, which I don't need here. Luckily, there are categories such as Featured pictures and Quality images, which are curated collections of high-quality photographs. Here's a sneak peek of the quality images dataset:

Preview of the dataset

The datasets were created from images in the mentioned categories matching the additional criteria:

Using Wikimedia Commons as a source with these criteria has some drawbacks. Overall, the photographs may not be very interesting or aesthetically pleasing because, unlike stock photographs for instance, they have an educational focus. Moreover, photographs of people are mostly excluded from the datasets because of personality rights.

Here are the final datasets. These files are licensed under CC BY-SA 3.0 as they're derived from Wikimedia Commons.

Here's a smaller dataset of winners from the Picture of the Year competition. Unlike the other datasets, the images are released under various licenses other than CC0.

After downloading one of the CSV files, you can download ten random photographs:

wget -P images $(tail -n +2 quality.csv | cut -d, -f1 | shuf -n10)

The datasets contain some metadata, such as filenames and categories. You can use this information to download, for instance, three random photographs of birds:

wget -P birds $(tail -n +2 quality.csv | grep Animals/Birds | cut -d, -f1 | shuf -n3)
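The `cut -d,` approach above splits on every comma, so it would misbehave if a field ever contained a quoted comma. As a rough alternative, here is a minimal Python sketch of the same selection step using the standard `csv` module. It assumes only what the shell commands imply: the first row is a header and the first column holds the image URL. The function name `pick_urls` and its parameters are made up for illustration and are not part of the dataset scripts.

```python
import csv
import random


def pick_urls(csv_path, count, keyword=None, seed=None):
    """Return up to `count` random URLs from a dataset CSV.

    Assumption: the first row is a header and the first column is the
    image URL, mirroring `tail -n +2 ... | cut -d, -f1`.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        rows = [row for row in reader if row]  # ignore blank lines

    if keyword is not None:
        # Roughly what `grep keyword` does: keep rows where any field
        # contains the keyword as a substring.
        rows = [row for row in rows if any(keyword in field for field in row)]

    # A seed makes the selection reproducible; None gives a fresh sample.
    rng = random.Random(seed)
    chosen = rng.sample(rows, min(count, len(rows)))
    return [row[0] for row in chosen]
```

Calling `pick_urls("quality.csv", 3, keyword="Animals/Birds")` should return a list of up to three URLs. If you print them one per line, you can pipe the output to `wget -P birds -i -`, which reads URLs from standard input.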

The Python scripts used to create the datasets are available here. You can modify them to create your own dataset for a specific search term or with different criteria, such as a wider range of accepted licenses.