Christopher Markert

✍️ Author Biography

Christopher Markert

🌍 American 📚 1 free book

The MNIST database, a foundational dataset for machine learning, was derived from various US government sources.

The MNIST database, a widely recognized collection of handwritten digits, was developed by combining and modifying datasets originally compiled by U.S. government agencies. Its creation aimed to provide a more suitable resource for training and testing machine learning algorithms compared to the raw NIST datasets. The original NIST data had a perceived bias due to its sources (Census Bureau employees versus high school students) and underwent processing, including normalization into 28x28 pixel images and the introduction of grayscale, which altered the original black and white format.

The MNIST dataset comprises 60,000 training images and 10,000 testing images, with samples drawn equally from NIST's original training and testing collections. The development process involved significant re-mixing and normalization of data from NIST's Special Databases, which themselves originated from various government initiatives like the Census Bureau's need for automated digitization and the US Postal Service's zip code data. Early versions and related datasets, such as the USPS database and NIST's Special Databases 1, 3, and 7, laid the groundwork for MNIST's structure and content.

Origins and Development

The MNIST database originated from a need to create a more balanced and standardized dataset for machine learning, particularly for image recognition tasks. It was constructed by re-mixing and processing samples from the National Institute of Standards and Technology (NIST) original datasets. The creators noted that the NIST datasets, one sourced from Census Bureau employees and another from high school students, presented a potential imbalance for machine learning experiments. To address this, images were normalized into 28x28 pixel bounding boxes, and anti-aliasing was applied, introducing grayscale variations to the original black and white images. This process resulted in a dataset containing 60,000 training and 10,000 testing images, with equal proportions drawn from NIST's original training and testing collections.

Foundational Datasets

The MNIST database's foundation lies in several precursor datasets developed by U.S. government entities. A significant source was the USPS database, created in 1988, which contained digitized grayscale images of handwritten zip codes. Further crucial components came from NIST's 'Special Databases' (SD-1, SD-3, and SD-7), developed in the late 1980s and early 1990s for evaluating optical character recognition (OCR) systems. These databases were derived from forms filled out by Census Bureau employees and high school students. SD-3, in particular, provided a substantial set of digitized alphanumeric characters, while SD-7 served as a challenging test set. The MNIST dataset was compiled by combining and processing images from these NIST Special Databases.

Evolution and Variants

Following the creation of the original MNIST database, several subsequent versions and related datasets have been developed. In 2019, the QMNIST dataset was introduced, restoring the full 60,000-image test set that had been partially discarded in the original MNIST construction. More significantly, NIST released the Extended MNIST (EMNIST) dataset in 2017, intended as a successor, which incorporates a broader range of alphanumeric characters from the compiled NIST Special Databases. Another notable variant, Fashion MNIST, also released in 2017, offers a more challenging alternative by replacing digits with images of fashion products across ten categories, maintaining the same 28x28 grayscale format.

Key Ideas

Dataset creation through re-mixing and normalization of existing government data.
Addressing bias in training data by combining diverse sources.
Image processing techniques including resizing, anti-aliasing, and grayscale conversion.
Development of successor and alternative datasets for machine learning benchmarks.