
Python API

Jeff Johnson edited this page Apr 22, 2022 · 4 revisions

There are three main APIs. The simple API takes a list of tensors to compress (or decompress) and returns a list of compressed (or decompressed) tensors. The split_size API is passed, for compression, a single input tensor and a list of split sizes that define the batch (for decompression, a single output tensor and a list of split sizes that define where the output goes). The main API is more complicated and allows pre-allocation of output tensors.

The first argument to each of the functions is a boolean value: True to compress floating point data (with special handling of the exponent), and False to use the raw byte-wise ANS compressor. The ANS compressor can take any PyTorch dtype (byte, int, float, double, etc.), while the float compressor at the moment supports torch.float16, torch.bfloat16 and torch.float32.

The general ANS compressor, which can compress anything (floating point or not), only operates at the byte level and does not deal well with data structured above the byte level, since the statistics for entropy encoding are calculated across all bytes. This compressor is appropriate to use on, say, int4 or int8 quantized data.

All tensors used in the API are required to be CUDA tensors, and to be resident on the same device on which computation is taking place.

Reviewing the Python tests (e.g., float_test.py) can give some idea on how to use the API.
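As a minimal sketch of the simple API, here is a float round trip. The import path is an assumption (the wiki does not state how the extension is exposed; depending on the build you may instead need torch.ops.load_library and the torch.ops namespace), and a CUDA device is required:

```python
import torch

# Assumption: the extension exposes these functions from a `dietgpu` module.
from dietgpu import compress_data_simple, decompress_data_simple

# A batch of independently compressed float16 tensors, all on the same GPU
ts = [
    torch.randn(1000, dtype=torch.float16, device="cuda"),
    torch.randn(10, 20, dtype=torch.float16, device="cuda"),
]

# True selects the float compressor (special exponent handling)
compressed = compress_data_simple(True, ts)

# The simple decompressor returns newly allocated tensors with the original data
out = decompress_data_simple(True, compressed)
for a, b in zip(ts, out):
    assert torch.equal(a.view(-1), b.view(-1))
```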

Compression

compress_data(bool compress_as_float, Tensor[] ts_in, Tensor? temp_mem=None, Tensor? out_compressed=None, Tensor? out_compressed_bytes=None) -> (Tensor, Tensor, int)

compress_data_split_size(bool compress_as_float, Tensor t_in, Tensor t_in_split_sizes, Tensor? temp_mem=None, Tensor? out_compressed=None, Tensor? out_compressed_bytes=None) -> (Tensor[], Tensor, int)

compress_data_simple(bool compress_as_float, Tensor[] ts_in, int? temp_mem=67108864) -> Tensor[]

The first argument selects between float and byte-wise ANS compression. The second argument is either a list of tensors to be independently compressed (ts_in) for a batch of independent tensors, or, for the split_size version, a single tensor plus a list of split sizes (as int64).

Each tensor can be of a different size, and for the ts_in versions may have any number of dimensions (the compressor treats all data as 1-dimensional), but must be contiguous. The split_size input must be 1-dimensional. These tensors must be resident on the same GPU device on which the compressor runs.

For ANS compression, the start address of each tensor or split must be 4-byte aligned. For float compression, only word (natural) alignment is required; e.g., 2-byte alignment for float16/bfloat16 or 4-byte alignment for float32, which should be naturally satisfied.

ANS compression can take any tensor dtype, or even mixed dtypes (they are just treated as sequences of bytes). Float compression requires that all tensors presented have the same floating point dtype.
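For the split_size variant, a single contiguous tensor is carved into batch members by a list of split sizes. A sketch, assuming the function is importable from a `dietgpu` module (the wiki does not state the import path):

```python
import torch
from dietgpu import compress_data_split_size  # assumed module name

# One contiguous 1D float16 buffer; the splits [4096, 2048, 1024] define
# three independently compressed members of the batch
t = torch.randn(7168, dtype=torch.float16, device="cuda")
sizes = torch.tensor([4096, 2048, 1024], dtype=torch.int64)

# Per the signature, this returns (list of per-split compressed tensors,
# per-split compressed sizes, temporary memory bytes used)
comp_list, comp_bytes, temp_used = compress_data_split_size(True, t, sizes)
```

Whether the split-size tensor must live on CPU or GPU is not stated here; check the tests (e.g., float_test.py) for the expected placement.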

temp_mem is the first optional argument. The compressor and decompressor require a region of temporary memory on the same device, usually around 64-256 MB, to use as a scratchpad. Because the PyTorch allocator and/or cudaMalloc/cudaFree are slow, and because several variable-sized allocations are required inside the codec, it is preferable for the user to pre-allocate this region once and pass it each time the API is used, reducing allocator overhead. If multiple streams are used concurrently with the compressor, a separate region of temporary memory is required per stream (or stream synchronization is required to reuse a single region).
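A sketch of reusing a pre-allocated scratch region across calls (import path assumed, as the wiki does not state it):

```python
import torch
from dietgpu import compress_data  # assumed module name

# Allocate scratch once and reuse it on every call to avoid allocator
# overhead; 256 MB is the large end of the range suggested above
temp_mem = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda")

ts = [torch.randn(1 << 20, dtype=torch.float16, device="cuda") for _ in range(4)]

for _ in range(10):
    # The same temp_mem is safe to reuse here because the calls are issued
    # on a single stream; concurrent streams each need their own region
    comp, comp_bytes, temp_used = compress_data(True, ts, temp_mem=temp_mem)
```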

out_compressed is the second optional argument. If provided, the compressed data is written here; otherwise, the API allocates a new tensor and returns it. If the API is being called many times, it is preferable to cache an output tensor and keep passing it, rather than depending on the PyTorch caching allocator. (If the sizes differ between calls, you can bound the size and slice the tensor to a 2D sub-slice that fits the compressed data.) If provided, this argument is returned as the first return value of the function. The tensor is 2D: the number of rows is the batch size (the number of tensors in the ts_in list), and the column size (in bytes) bounds the maximum compressed size of an input tensor. The required number of rows and columns is reported by max_float_compressed_output_size for float compression or max_any_compressed_output_size for ANS compression. Note that this bound is in fact larger than the input tensor itself, since in the worst case the input is incompressible, and the required metadata and compression overhead expand it.

out_compressed_bytes is the third optional argument. If provided, it is a 1D tensor of the same size as the batch (i.e., len(ts_in)). The actual compressed size (in bytes) of each compressed tensor in the batch is reported here; this is the number of bytes actually used in each row of out_compressed / the first return value. If not provided, a tensor is allocated internally and returned.

max_float_compressed_output_size(Tensor[] ts) -> (int, int)

max_any_compressed_output_size(Tensor[] ts) -> (int, int)

max_float_compressed_output_size takes as input a collection of tensors intended for floating point compression, and returns the (rows, cols) size in bytes of the 2D matrix needed to hold the compressed data; max_any_compressed_output_size does the same for ANS compression. Note that this is an upper bound; the actual number of bytes used in each row is provided by out_compressed_bytes / the second return value of the compression function.
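A sketch of pre-allocating both output tensors from this bound (import path assumed; the integer dtype of the sizes tensor is also an assumption, check the tests for the exact dtype expected):

```python
import torch
from dietgpu import compress_data, max_float_compressed_output_size  # assumed module name

ts = [torch.randn(1 << 16, dtype=torch.float16, device="cuda") for _ in range(8)]

# Upper-bound shape of the compressed output: (batch size, max bytes per member)
rows, cols = max_float_compressed_output_size(ts)
out_compressed = torch.empty(rows, cols, dtype=torch.uint8, device="cuda")
out_bytes = torch.empty(rows, dtype=torch.int64, device="cuda")  # dtype assumed

# The pre-allocated tensors come back as the first two return values
comp, comp_bytes, temp_used = compress_data(
    True, ts, out_compressed=out_compressed, out_compressed_bytes=out_bytes
)
```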

Return arguments for compression

The simple version of the compressor returns exactly-sized tensors containing the compressed data. These are actually views into the compression matrix described above, so to realize the space savings, each view should be copied into an independent exactly-sized tensor. The simple version also requires a device-to-host copy to get the compressed sizes back to the CPU.

For the other two APIs, the first return value is a 2D matrix of dtype torch.uint8 containing the compressed tensor data. This is the same tensor as out_compressed if that was passed by the user; otherwise a new tensor is allocated.

The compressed tensors are guaranteed to begin on a 16 byte aligned boundary.

The second return value is a 1D array giving the length in bytes of each compressed tensor in the batch. This is the same tensor as out_compressed_bytes if that was passed by the user; otherwise a new tensor is allocated. In other words, the actual compressed data for ts_in[i] is given by out_compressed[i][0:out_compressed_bytes[i]].

The third return value is the actual temporary memory size, in bytes, used for the compression job.
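The row-slicing convention can be illustrated without the library: the valid compressed bytes for batch member i are the first out_compressed_bytes[i] bytes of row i of the output matrix. A tiny pure-Python stand-in (plain lists instead of CUDA tensors):

```python
def valid_rows(out_compressed, out_compressed_bytes):
    """Return the used prefix of each row of a 2D compressed-output matrix.

    out_compressed: list of equal-length rows (stand-in for the uint8 matrix)
    out_compressed_bytes: actual compressed size of each row, in bytes
    """
    return [row[:n] for row, n in zip(out_compressed, out_compressed_bytes)]

# Three rows padded to the upper-bound width of 6 bytes; only each prefix is valid
matrix = [
    [1, 2, 3, 0, 0, 0],
    [4, 5, 0, 0, 0, 0],
    [6, 7, 8, 9, 0, 0],
]
print(valid_rows(matrix, [3, 2, 4]))  # -> [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
```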

Decompression

decompress_data(bool compress_as_float, Tensor[] ts_in, Tensor[] ts_out, Tensor? temp_mem=None, Tensor? out_status=None, Tensor? out_decompressed_words=None) -> (int)

decompress_data_split_size(bool compress_as_float, Tensor[] ts_in, Tensor t_out, Tensor t_out_split_sizes, Tensor? temp_mem=None, Tensor? out_status=None, Tensor? out_decompressed_words=None) -> (int)

decompress_data_simple(bool compress_as_float, Tensor[] ts_in, int? temp_mem=67108864) -> Tensor[]

ts_in is a list of byte tensors containing the compressed data. Each need not be resized exactly to the real compressed size; i.e., each row of the 2D matrix output by the compression functions (with columns equal to the upper bound on compressed size) can be passed as a member of ts_in, even though not all data in the row is necessarily valid. Each compressed tensor is required to begin on a 16-byte aligned boundary.

Just as the CPU does not know the compressed tensor size (only the device knows that, and it is computed in the act of compressing the tensor), the CPU need not know the true decompressed size of a tensor either.

However, ts_out contains the destinations for the decompressed data, and each must be at least as large as the true decompressed data. If it is not, the kernel will fail to decompress that tensor, with out_status containing 0 for the tensors that failed due to insufficient space. out_decompressed_words (reported in floating point words for the float compressor, bytes for the ANS compressor) contains either the successfully decompressed size or, upon failure, the size required to decompress successfully.

temp_mem behaves much as it does for the compression functions. The space required for decompression is approximately (not exactly) equal to that needed during compression, and will be around 64-256 MB in normal use. Warnings are reported via stderr if it is undersized.

out_status, if passed, is an optional int tensor containing the success (1) or failure (0) status of decompression for each tensor in the batch.

out_decompressed_words, if passed, is an optional int tensor containing the size (in floating point words for float compression, bytes for ANS) either successfully decompressed (if out_status is 1 for the given tensor) or required for successful decompression if ts_out[i] is not large enough.
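A sketch of a compress/decompress round trip with failure handling through out_status (import path and the integer dtypes of the status/size tensors are assumptions, as the wiki does not state them):

```python
import torch
from dietgpu import compress_data, decompress_data  # assumed module name

ts = [torch.randn(1 << 18, dtype=torch.float16, device="cuda") for _ in range(4)]
comp_matrix, comp_bytes, _ = compress_data(True, ts)

# Rows of the output matrix can be passed directly as ts_in; slicing to the
# actual compressed size is optional for decompression
comp = [comp_matrix[i] for i in range(len(ts))]

# Destinations must be at least as large as the true decompressed data
ts_out = [torch.empty_like(t) for t in ts]
out_status = torch.empty(len(ts), dtype=torch.int32, device="cuda")  # dtype assumed
out_words = torch.empty(len(ts), dtype=torch.int32, device="cuda")   # dtype assumed

temp_used = decompress_data(
    True, comp, ts_out, out_status=out_status, out_decompressed_words=out_words
)

# A zero status means the destination was too small; out_words then reports
# the size (in float16 words here) needed for a successful retry
for i, ok in enumerate(out_status.cpu().tolist()):
    if not ok:
        needed = out_words[i].item()
        # reallocate ts_out[i] with `needed` elements and decompress again
```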
