- Feb 25, 2019

Building an Image Data Set

This blog post is adapted from a lecture I watched by Chris Burns, AI/ML Senior Solution Architect [1]. There are a few keys to building an image data set for training with the SageMaker ResNet algorithm. First, collect images. Many options are available here; hopefully the client already has a large collection of images of the objects in question. Image classification depends on a clean, abundant data source, and one thousand images per class is a satisfactory minimum. Second, discard unusable images. This includes corrupted images, images whose viewpoint varies too much, images at an inappropriate scale, occluded images, deformed images, poorly illuminated images, and images with too much background clutter. Third, remove duplicate images. Fourth, label the images. "This is one of the soul destroying parts of building a data set," according to Chris Burns. Instead of tediously labeling every image, label the folders the images are in. Fifth, convert all images to a standard format (e.g., JPEG). Don't mix black-and-white and color images, and pay attention to color depth. Sixth, resize all images to a standard dimension, cropping and adding padding where necessary. Seventh, delete unrelated images. Finally, split the images into training, validation, and test sets. The ratio is 70% for the training channel, 29% for the validation channel, and 1% for the test channel.
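The duplicate-removal, folder-labeling, and splitting steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the method from the lecture: the function names `find_duplicates` and `split_by_class` are hypothetical, the layout of one sub-folder per class is an assumption, and duplicates are detected by exact byte match only, so resized or re-encoded copies would not be caught.

```python
import hashlib
import random
from pathlib import Path


def find_duplicates(image_dir):
    """Return paths whose bytes exactly match an earlier file under image_dir."""
    seen = set()
    duplicates = []
    for path in sorted(Path(image_dir).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)  # same content as a file seen earlier
        else:
            seen.add(digest)
    return duplicates


def split_by_class(image_dir, ratios=(0.70, 0.29), seed=0):
    """Split each class folder into train/validation/test lists of (path, label).

    Assumes one sub-folder per class; the folder name is used as the label.
    ratios holds the train and validation fractions; test gets the remainder.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "validation": [], "test": []}
    class_dirs = sorted(p for p in Path(image_dir).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(files)
        # round() avoids float truncation (int(100 * 0.29) would give 28)
        n_train = round(len(files) * ratios[0])
        n_val = round(len(files) * ratios[1])
        labeled = [(p, class_dir.name) for p in files]
        splits["train"] += labeled[:n_train]
        splits["validation"] += labeled[n_train:n_train + n_val]
        splits["test"] += labeled[n_train + n_val:]
    return splits
```

Splitting inside each class folder, rather than over the whole collection at once, keeps the 70/29/1 ratio roughly the same for every class. With 1,000 images per class, this gives 700 training, 290 validation, and 10 test images per class.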