Datasets List

General Description of a Dataset

Dataset Composition

A Benzina dataset is, in essence, an index over a concatenation of inputs, targets and, optionally, filenames

Dataset Structure

A Benzina dataset is structured using the mp4 format

ftyp:

Defines the compatibilities of the mp4 container

mdat:

Concatenation in 2-3 blocks of the inputs, targets and possibly filenames

moov:

Contains the metadata needed to load and present the raw data of mdat

mvhd:

Defines the timescale and the duration of the container

timescale:How many units elapse in 1 second
duration:Duration of the container in timescale units
next_track_id:The id of the next track that could be appended to moov
trak:
  • Benzina input samples track: This is the first track and it references
    all the input samples
  • Benzina target track
  • Benzina filename track: This track is optional
  • Video track: This track is optional. If present it should be
    positioned last

Each track can have train, validation and test variants to reference the corresponding sets

tkhd:

Defines the resolution of the video and if the track should be displayed by an mp4 player

flags:Defines if the track should be displayed
width:Width of the video
height:Height of the video
mdia:

Contains definitions related to the media type of the data

mdhd:

Redefines the timescale and the duration for the track

timescale:How many units elapse in 1 second
duration:Duration of the track in timescale units
hdlr:

Defines the media type of the track

handler_type:Defines the type of handler that should be used to decode the data referenced by the track
name:Human readable name for the track type (used for debugging)
minf:

Defines the characteristics of the media in the track

stbl:

Defines the data indexing of the media samples in the track along with coding information, if needed, to decode them

stsd:

Provides the information needed to decode the media samples

stts:

Defines the mapping from decoding time to sample number

sample_count:The number of samples in the track
sample_delta:The interval, in timescale units, between the decoding of two consecutive samples
stsz:

Defines the size of each sample

sample_count:Number of samples in the track
entry_size:Size of the sample. This field is repeated for each sample
stsc:

Defines how the samples are split into chunks

stco:

Defines the offset of each chunk

entry_count:Number of chunks
chunk_offset:The chunk offset. This field is repeated for each chunk
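
The box layout described above can be explored with a few lines of Python. The following is a minimal sketch that relies only on the generic mp4 (ISO base media file format) box header, a 32-bit big-endian size followed by a 4-character type. The names `iter_boxes` and `make_box` are illustrative helpers and are not part of Benzina; the demo bytes are synthetic, not a real dataset.

```python
import io
import struct


def iter_boxes(stream, end=None):
    """Yield (box_type, payload_offset, payload_size) for each box found."""
    while end is None or stream.tell() < end:
        start = stream.tell()
        header = stream.read(8)
        if len(header) < 8:
            return  # no more complete box headers
        size, box_type = struct.unpack(">I4s", header)
        header_size = 8
        if size == 1:
            # A 64-bit "largesize" follows the type field
            size = struct.unpack(">Q", stream.read(8))[0]
            header_size = 16
        elif size == 0:
            # The box extends to the end of the stream
            stream.seek(0, io.SEEK_END)
            size = stream.tell() - start
        if size < header_size:
            raise ValueError("malformed box header at offset %d" % start)
        yield box_type.decode("ascii"), start + header_size, size - header_size
        stream.seek(start + size)


def make_box(box_type, payload=b""):
    """Build a box with a 32-bit size header (hypothetical helper for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic container with the three top-level boxes of a dataset
demo = io.BytesIO(
    make_box(b"ftyp", b"isom" + struct.pack(">I", 0) + b"bzna" + b"isom")
    + make_box(b"mdat", b"\x00" * 16)
    + make_box(b"moov", make_box(b"mvhd"))
)
top_level = [(name, size) for name, _, size in iter_boxes(demo)]
print(top_level)  # [('ftyp', 16), ('mdat', 16), ('moov', 8)]
```

The same function can list the children of a container box such as moov by seeking to its payload offset and passing the end of the payload as `end`.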

Dataset’s Input Sample Structure

A Benzina dataset’s input sample can also be structured using the mp4 format. Its structure is roughly the same as the dataset’s, except that mdat contains the raw concatenation of a single input, its target, optionally its filename and optionally a 512 x 512 thumbnail stream.

ImageNet 2012

ImageNet 2012 classification dataset. It contains two sizes of the images along with their classification targets and filenames:

  • Resized high resolution images each with a smaller edge of at most 512 while preserving the aspect ratio. This set is accessed by referencing the bzna_input track of the input samples.
  • Resized images each with a longer edge of at most 512 while preserving the aspect ratio. This set is accessed by referencing the bzna_thumb track of the input samples.

The dataset is represented by ImageNet, which simplifies iterating over the data as a classification dataset.

Warning

81 images are currently missing from the dataset and 111 first had to be transcoded to PNG before being encoded to the final H.265 format. More details can be found in the dataset’s README.

Warning

High resolution images stored in the bzna_input track of the input samples are currently not available through the DataLoader. Their widely varying sizes prevent them from being decoded using a single hardware decoder configuration. The selected solution is to represent the images in the HEIF format, support for which will be completed in future development.

Dataset Composition

The dataset is composed of a train set, followed by a validation set, then a test set, for a total of 1 431 167 entries. Targets and filenames are provided for each set:

  • Train set
    Entries 1 to 1281167 (1 281 167 entries)
  • Validation set
    Entries 1281168 to 1331167 (50 000 entries)
  • Test set
    Entries 1331168 to 1431167 (100 000 entries)
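
The entry ranges above translate directly into index ranges. The sketch below assumes 0-based indexing on the Python side, so entry n in the list above maps to index n - 1; the variable names are illustrative only.

```python
# Entry numbers in the text are 1-based and inclusive; Python ranges are
# 0-based and half-open, so entry n maps to index n - 1.
TRAIN_COUNT = 1_281_167
VALID_COUNT = 50_000
TEST_COUNT = 100_000

train_indices = range(0, TRAIN_COUNT)
valid_indices = range(TRAIN_COUNT, TRAIN_COUNT + VALID_COUNT)
test_indices = range(TRAIN_COUNT + VALID_COUNT,
                     TRAIN_COUNT + VALID_COUNT + TEST_COUNT)

total = len(train_indices) + len(valid_indices) + len(test_indices)
print(total)  # 1431167

# Converting back to the 1-based entry numbers used in the text
print(valid_indices[0] + 1, valid_indices[-1] + 1)  # 1281168 1331167
print(test_indices[0] + 1, test_indices[-1] + 1)    # 1331168 1431167
```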

Dataset Structure

ilsvrc2012.bzna

ftyp:

Defines the compatibilities of the mp4 container

major_brand:isom
minor_version:0
compatible_brands:
 bzna, isom
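
As an illustration of the ftyp fields listed above, the following sketch parses an ftyp payload (the bytes after the 8-byte box header) using the standard mp4 layout: a 4-character major brand, a 32-bit minor version, then a list of 4-character compatible brands. `parse_ftyp` is a hypothetical helper, and the payload is rebuilt here from the documented values rather than read from the real file.

```python
import struct


def parse_ftyp(payload):
    """Return (major_brand, minor_version, compatible_brands)."""
    major_brand, minor_version = struct.unpack(">4sI", payload[:8])
    brands = [payload[i:i + 4].decode("ascii")
              for i in range(8, len(payload), 4)]
    return major_brand.decode("ascii"), minor_version, brands


# Payload built from the values documented for ilsvrc2012.bzna
payload = b"isom" + struct.pack(">I", 0) + b"bzna" + b"isom"
major_brand, minor_version, compatible_brands = parse_ftyp(payload)
print(major_brand, minor_version, compatible_brands)
# isom 0 ['bzna', 'isom']
```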
mdat:

Raw concatenation in 3 blocks of the images, targets and filenames

  • Concatenation of .mp4 files containing a single image, a thumbnail of a maximum size of 512 x 512 if the image does not already fit this resolution, the image’s original filename and the target associated with the image
  • Concatenation of images’ targets as little-endian int64
  • Concatenation of the images’ original filenames
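
Since the targets block is a plain concatenation of little-endian int64 values, it can be decoded with the standard `struct` module. This is a minimal sketch that assumes the raw bytes of the block have already been read from mdat; `decode_targets` and the sample values are illustrative and do not come from the dataset.

```python
import struct


def decode_targets(block):
    """Decode a concatenation of little-endian int64 targets."""
    if len(block) % 8:
        raise ValueError("block length must be a multiple of 8 bytes")
    count = len(block) // 8
    return list(struct.unpack("<%dq" % count, block))


# Three made-up targets packed the way the block stores them
block = struct.pack("<3q", 0, 417, 999)
targets = decode_targets(block)
print(targets)  # [0, 417, 999]

# Little-endian means the least significant byte comes first
print(struct.pack("<q", 1))  # b'\x01\x00\x00\x00\x00\x00\x00\x00'
```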
moov:

Contains the metadata needed to load and present the raw data of mdat

mvhd:

Defines the timescale and the duration of the container

timescale:20
duration:20 * 1 431 167
next_track_id:The id of the next track that could be appended to moov
trak:

Benzina input samples track

This track references all the images of the dataset

tkhd:

Defines the resolution of the video and if the track should be displayed by an mp4 player

flags:000000 – This value informs that the track is not for display purpose
width:0.0 – This value reflects the variance in size of the frames
height:0.0 – This value reflects the variance in size of the frames
mdia:

Contains definitions related to the media type of the data

mdhd:

Redefines the timescale and the duration for the track

timescale:20
duration:20 * 1 431 167
hdlr:

Defines the media type of the track

handler_type:meta
name:bzna_input
minf:

Defines the characteristics of the media in the track

nmhd:

No specific media header is identified for the track

stbl:

Defines the data indexing of the media samples in the track along with coding information, if needed, to decode them

stsd:

Provides the information needed to decode the media samples

mett:

Defines the metadata as being text based

mime_format:application/octet-stream
stts:

Defines the mapping from decoding time to sample number

sample_count:1 431 167
sample_delta:20
stsz:

Defines the size of each sample

sample_count:1 431 167
entry_size:Size of the sample. This field is repeated for each sample
stsc:

Defines how the samples are split into chunks

first_chunk:1
samples_per_chunk:
 1
sample_description_index:
 1

This definition means that each sample is contained in its own chunk

stco:

Defines the offset of each chunk

entry_count:1 431 167
chunk_offset:The chunk offset. This field is repeated for each chunk, i.e. for each sample
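
Because every sample sits in its own chunk, locating a sample reduces to a pair of table lookups: the stco entry gives its offset in the file and the stsz entry gives its size. The sketch below assumes the two tables have already been parsed into Python lists; `read_sample` and the demo bytes are illustrative only.

```python
import io


def read_sample(stream, chunk_offsets, entry_sizes, index):
    """Read sample `index`, assuming exactly one sample per chunk."""
    stream.seek(chunk_offsets[index])
    data = stream.read(entry_sizes[index])
    if len(data) != entry_sizes[index]:
        raise EOFError("sample %d is truncated" % index)
    return data


# Synthetic file: 8 bytes standing in for a box header, then three samples
stream = io.BytesIO(b"\x00" * 8 + b"aaaa" + b"bb" + b"cccccc")
chunk_offsets = [8, 12, 14]  # stco: offsets from the start of the file
entry_sizes = [4, 2, 6]      # stsz: size of each sample in bytes

samples = [read_sample(stream, chunk_offsets, entry_sizes, i)
           for i in range(3)]
print(samples)  # [b'aaaa', b'bb', b'cccccc']
```

With more than one sample per chunk, the stsc table would also be needed to find which chunk holds a given sample; the layout described here avoids that step.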
trak:

Benzina target track

This track is roughly the same as the Benzina input track with the following differences

mdia:

Contains definitions related to the media type of the data

hdlr:

Defines the media type of the track

handler_type:meta
name:bzna_target
trak:

Benzina filename track

This track is roughly the same as the Benzina input track with the following differences

tkhd:

Defines the resolution of the video and if the track should be displayed by an mp4 player

flags:000003 – This value informs that the track is enabled and can be used in the presentation
width:0.0 – This value informs that no width has been predefined for this track
height:0.0 – This value informs that no height has been predefined for this track
mdia:

Contains definitions related to the media type of the data

hdlr:

Defines the media type of the track

handler_type:meta
name:bzna_fname
minf:

Defines the characteristics of the media in the track

stbl:

Defines the data indexing of the media samples in the track along with coding information, if needed, to decode them

stsd:

Provides the information needed to decode the media samples

mett:

Defines the metadata as being text based

mime_format:text/plain
trak:

Video track

This track allows the thumbnails of the dataset’s frames to be played

tkhd:

Defines the resolution of the video and if the track should be displayed by an mp4 player

flags:000003 – This value informs that the track is enabled and can be used in the presentation
width:512.0
height:512.0
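
The tkhd values used across the tracks can be interpreted as follows. In the mp4 format, the tkhd flags are a bit field (track_enabled, track_in_movie, track_in_preview) and width and height are stored as 16.16 fixed-point numbers. The helper names below are illustrative, not part of Benzina.

```python
# Flag bits of the tkhd box as defined by the mp4 format
TRACK_ENABLED = 0x000001
TRACK_IN_MOVIE = 0x000002
TRACK_IN_PREVIEW = 0x000004


def describe_flags(flags):
    """Split the tkhd flags into their individual bits."""
    return {
        "enabled": bool(flags & TRACK_ENABLED),
        "in_movie": bool(flags & TRACK_IN_MOVIE),
        "in_preview": bool(flags & TRACK_IN_PREVIEW),
    }


def float_to_fixed_16_16(value):
    """Encode a width or height as a 16.16 fixed-point integer."""
    return int(round(value * 65536))


def fixed_16_16_to_float(raw):
    """Decode a 16.16 fixed-point integer."""
    return raw / 65536.0


print(describe_flags(0x000003))
# {'enabled': True, 'in_movie': True, 'in_preview': False}
print(describe_flags(0x000000))
# {'enabled': False, 'in_movie': False, 'in_preview': False}
print(hex(float_to_fixed_16_16(512.0)))  # 0x2000000
```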
mdia:

Contains definitions related to the media type of the data

mdhd:

Redefines the timescale and the duration for the track

timescale:20
duration:1 431 167
hdlr:

Defines the media type of the track

handler_type:vide
name:VideoHandler
minf:

Defines the characteristics of the media in the track

vmhd:

Video media header is identified for the track

stbl:

Defines the data indexing of the media samples in the track along with coding information, if needed, to decode them

stsd:

Provides the information needed to decode the media samples

avc1:

Defines the AVC coding information

width:512
height:512
horizresolution:
 72
vertresolution:
 72
stts:

Defines the mapping from decoding time to sample number

sample_count:1 431 167
sample_delta:1
stsz:

Defines the size of each sample

sample_count:1 431 167
entry_size:Size of the sample. This field is repeated for each sample
stsc:

Defines how the samples are split into chunks

first_chunk:1
samples_per_chunk:
 1
sample_description_index:
 1

This definition means that each sample is contained in its own chunk

stco:

Defines the offset of each chunk

entry_count:1 431 167
chunk_offset:The chunk offset. This field is repeated for each chunk, i.e. for each sample
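
The timing values of the tracks follow from simple arithmetic: a track's duration in timescale units is sample_count * sample_delta, and dividing by the timescale gives seconds. The sketch below checks the documented values for the Benzina input samples track and for the video track; `track_seconds` is an illustrative helper.

```python
from fractions import Fraction


def track_seconds(sample_count, sample_delta, timescale):
    """Duration of a track in seconds, kept exact with a Fraction."""
    return Fraction(sample_count * sample_delta, timescale)


SAMPLES = 1_431_167

# Benzina input samples track: sample_delta 20 at timescale 20,
# so each sample lasts one second
input_seconds = track_seconds(SAMPLES, sample_delta=20, timescale=20)
print(input_seconds)  # 1431167

# Video track: sample_delta 1 at timescale 20, so 20 thumbnails per second
video_seconds = track_seconds(SAMPLES, sample_delta=1, timescale=20)
print(float(video_seconds))  # 71558.35
```

Played as a video, the thumbnails therefore run for a little under 20 hours.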

Dataset’s Input Samples Structure

A Benzina ImageNet dataset’s input sample is structured using the mp4 format.

ftyp:

Defines the compatibilities of the mp4 container

major_brand:isom
minor_version:0
compatible_brands:
 bzna, isom
mdat:

Raw concatenation of the image, thumbnail, target and filename:

  • A single image in H.265 format. The image is placed in a frame whose dimensions are both multiples of 512. The padding needed to fill the frame is a smear of the image’s borders
  • A thumbnail in H.265 format. The image is first resized so that its longest side is 512, then placed in a frame of size 512 x 512. The padding needed to fill the frame is a smear of the image’s borders. There is no explicit thumbnail if the image already fits the thumbnail’s frame
  • The image’s target in a little-endian int64
  • The image’s original filename
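
The frame and thumbnail sizes described above can be sketched as follows. This is an illustration under assumptions: the frame dimensions are taken to be the smallest multiples of 512 that contain the image, and the shorter side of the thumbnail is rounded to the nearest integer, which may differ from the rounding used when the dataset was built. The function names are hypothetical.

```python
def padded_frame_size(width, height, step=512):
    """Round each dimension up to the next multiple of `step`."""
    def round_up(value):
        return -(-value // step) * step  # ceiling division, then scale back
    return round_up(width), round_up(height)


def needs_thumbnail(width, height, frame=512):
    """An explicit thumbnail is only needed if the image exceeds the frame."""
    return max(width, height) > frame


def thumbnail_size(width, height, longest=512):
    """Resize so that the longest side is `longest`, keeping the aspect ratio.

    The rounding of the other side is an assumption of this sketch.
    """
    side = max(width, height)
    return (max(1, round(width * longest / side)),
            max(1, round(height * longest / side)))


print(padded_frame_size(1600, 1200))  # (2048, 1536)
print(needs_thumbnail(1600, 1200))    # True
print(thumbnail_size(1600, 1200))     # (512, 384)

print(padded_frame_size(500, 375))    # (512, 512)
print(needs_thumbnail(500, 375))      # False
```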
moov:

Contains the metadata needed to load and present the raw data of mdat

mvhd:

Defines the timescale and the duration of the container

timescale:20
duration:20
next_track_id:The id of the next track that could be appended to moov
trak:

Benzina input track

This track references an image

tkhd:

Defines the resolution of the video and if the track should be displayed by an mp4 player

flags:000000 – This value informs that the track is not for display purpose
width:Width of the image without padding
height:Height of the image without padding
mdia:

Contains definitions related to the media type of the data

mdhd:

Redefines the timescale and the duration for the track

timescale:20
duration:20
hdlr:

Defines the media type of the track

handler_type:vide
name:bzna_input
minf:

Defines the characteristics of the media in the track

vmhd:

Video media header is identified for the track

stbl:

Defines the data indexing of the media samples in the track along with coding information, if needed, to decode them

stsd:

Provides the information needed to decode the media samples

avc1:

Defines the AVC coding information

width:Width of the image’s frame. This is a multiple of 512
height:Height of the image’s frame. This is a multiple of 512
horizresolution:
 72
vertresolution:
 72

clap:

Defines the clean aperture of the image to remove the padding

clean_aperture_width_n:
 Width of the image without padding
clean_aperture_width_d:
 1
clean_aperture_height_n:
 Height of the image without padding
clean_aperture_height_d:
 1
horiz_off_n:The negative value of the width’s padding
horiz_off_d:2
vert_off_n:The negative value of the height’s padding
vert_off_d:2
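
The clap values above can be derived from the image and frame sizes. In the mp4 format, the offsets locate the centre of the clean aperture relative to the centre of the frame, so an offset of minus half the padding corresponds to an image stored in the top-left corner with the padding on the right and bottom. That placement is an interpretation of the documented values, and `clap_fields` and `crop_origin` are illustrative helpers.

```python
from fractions import Fraction


def clap_fields(image_w, image_h, frame_w, frame_h):
    """Build the clap fields for an image padded up to the frame size."""
    return {
        "clean_aperture_width_n": image_w, "clean_aperture_width_d": 1,
        "clean_aperture_height_n": image_h, "clean_aperture_height_d": 1,
        "horiz_off_n": -(frame_w - image_w), "horiz_off_d": 2,
        "vert_off_n": -(frame_h - image_h), "vert_off_d": 2,
    }


def crop_origin(clap, frame_w, frame_h):
    """Top-left pixel of the clean aperture inside the frame."""
    horiz_off = Fraction(clap["horiz_off_n"], clap["horiz_off_d"])
    vert_off = Fraction(clap["vert_off_n"], clap["vert_off_d"])
    width = Fraction(clap["clean_aperture_width_n"],
                     clap["clean_aperture_width_d"])
    height = Fraction(clap["clean_aperture_height_n"],
                      clap["clean_aperture_height_d"])
    # Centre of the aperture, minus half of its extent
    left = horiz_off + Fraction(frame_w - 1, 2) - (width - 1) / 2
    top = vert_off + Fraction(frame_h - 1, 2) - (height - 1) / 2
    return left, top


# A 500 x 375 image stored in a 512 x 512 frame
clap = clap_fields(500, 375, 512, 512)
print(clap["horiz_off_n"], clap["vert_off_n"])  # -12 -137
origin = crop_origin(clap, 512, 512)
print(origin == (0, 0))  # True
```

Cropping the decoded frame to the clean aperture width and height starting at that origin removes the padding.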
stts:

Defines the mapping from decoding time to sample number

sample_count:1
sample_delta:20
stsz:

Defines the size of each sample

sample_count:1
entry_size:Size of the input
stsc:

Defines how the samples are split into chunks

first_chunk:1
samples_per_chunk:
 1
sample_description_index:
 1
stco:

Defines the offset of each chunk

entry_count:1
chunk_offset:The chunk offset
trak:

Benzina thumbnail track

This track references an image’s thumbnail. If the image already fits a thumbnail’s frame, then this track will reference the same data as in the Benzina input track. In any case, it is roughly the same as the Benzina input track with the following differences

tkhd:

Defines the resolution of the video and if the track should be displayed by an mp4 player

flags:000003 – This value informs that the track is enabled and can be used in the presentation
width:Width of the thumbnail without padding
height:Height of the thumbnail without padding
mdia:

Contains definitions related to the media type of the data

hdlr:

Defines the media type of the track

handler_type:vide
name:bzna_thumb
trak:

Benzina target track

tkhd:

Defines the resolution of the video and if the track should be displayed by an mp4 player

flags:000000 – This value informs that the track is not for display purpose
width:0.0 – This value informs that no width has been predefined for this track
height:0.0 – This value informs that no height has been predefined for this track
mdia:

Contains definitions related to the media type of the data

mdhd:

Redefines the timescale and the duration for the track

timescale:20
duration:20
hdlr:

Defines the media type of the track

handler_type:meta
name:bzna_target
minf:

Defines the characteristics of the media in the track

nmhd:

No specific media header is identified for the track

stbl:

Defines the data indexing of the media samples in the track along with coding information, if needed, to decode them

stsd:

Provides the information needed to decode the media samples

mett:

Defines the metadata as being text based

mime_format:application/octet-stream
trak:

Benzina filename track

This track is roughly the same as the Benzina target track with the following differences

tkhd:

Defines the resolution of the video and if the track should be displayed by an mp4 player

flags:000003 – This value informs that the track is enabled and can be used in the presentation
width:0.0 – This value informs that no width has been predefined for this track
height:0.0 – This value informs that no height has been predefined for this track
mdia:

Contains definitions related to the media type of the data

hdlr:

Defines the media type of the track

handler_type:meta
name:bzna_fname
minf:

Defines the characteristics of the media in the track

stbl:

Defines the data indexing of the media samples in the track along with coding information, if needed, to decode them

stsd:

Provides the information needed to decode the media samples

mett:

Defines the metadata as being text based

mime_format:text/plain