Introduce a fluent API to construct tensors from external data. #54530

Closed

Conversation

@cbalioglu (Contributor) commented Mar 23, 2021

Summary:
This diff introduces the following changes and improvements:

  • Introduces a new fluent API to construct tensors from external data as an alternative to from_blob overloads. See below for an example.
  • Leverages several small-buffer optimizations, which result in a 50% reduction in tensor construction times.
  • Exposes a new (lightweight) way to construct tensors by passing a naked context and context_deleter pair as an alternative to the existing deleter parameter.
  • Updates the existing from_blob overloads to internally use the fluent API.

```
// Example 1
at::Tensor tensor = at::for_blob(data, sizes)
  .strides(strides)
  .context(context, [](void *ctx) { delete static_cast<Ctx*>(ctx); })
  .options(...)
  .target_device(...)
  .make_tensor();

// Example 2
at::Tensor tensor = at::for_blob(data, sizes).make_tensor();

// Example 3
at::Tensor tensor = at::for_blob(data, sizes)
  .deleter(...)
  .make_tensor();
```

Test Plan:
Below are the folly Benchmark results for the following two equivalent operations:

```
// The fluent API
at::Tensor tensor = at::for_blob(data, sizes)
  .deleter([buffer](void*) mutable { buffer.reset(); })
  .options(dtype(c10::ScalarType::Float))
  .make_tensor();

// The original `from_blob` overload
at::Tensor tensor = at::from_blob(
  data,
  sizes,
  [buffer](void*) mutable { buffer.reset(); },
  dtype(c10::ScalarType::Float));
```

```
============================================================================
scripts/balioglu/from_blob_exp/main.cpp         relative  time/iter  iters/s
============================================================================
fluent                                                     298.34ns    3.35M
from_blob                                         55.19%   540.51ns    1.85M
============================================================================
```

Various similar experiments show an approximately 50% reduction in tensor construction times.
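
For reference, a sketch of what such a folly Benchmark harness could look like. The internal scripts/balioglu/from_blob_exp/main.cpp is not part of this PR, so the buffer setup, sizes, and dtype below are assumptions rather than the actual benchmark code:

```
#include <ATen/ATen.h>
#include <folly/Benchmark.h>
#include <folly/init/Init.h>

#include <memory>
#include <vector>

namespace {
constexpr int64_t kNumel = 1024;
const std::vector<int64_t> sizes{kNumel};
} // namespace

// Fluent API path (the baseline row in the table above).
BENCHMARK(fluent, iters) {
  for (unsigned i = 0; i < iters; ++i) {
    auto buffer = std::make_shared<std::vector<float>>(kNumel);
    at::Tensor tensor = at::for_blob(buffer->data(), sizes)
        .deleter([buffer](void*) mutable { buffer.reset(); })
        .options(at::dtype(c10::ScalarType::Float))
        .make_tensor();
    folly::doNotOptimizeAway(tensor);
  }
}

// Original from_blob overload, reported relative to the fluent path.
BENCHMARK_RELATIVE(from_blob, iters) {
  for (unsigned i = 0; i < iters; ++i) {
    auto buffer = std::make_shared<std::vector<float>>(kNumel);
    at::Tensor tensor = at::from_blob(
        buffer->data(),
        sizes,
        [buffer](void*) mutable { buffer.reset(); },
        at::dtype(c10::ScalarType::Float));
    folly::doNotOptimizeAway(tensor);
  }
}

int main(int argc, char** argv) {
  folly::init(&argc, &argv);
  folly::runBenchmarks();
  return 0;
}
```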

Differential Revision: D27269344

@facebook-github-bot (Contributor) commented Mar 23, 2021

💊 CI failures summary and remediations

As of commit fcf0927 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)


@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D27269344

@ezyang (Contributor) commented Mar 23, 2021

> Leverages several small-buffer optimizations, which result in a 50% reduction in tensor construction times.

What are the small buffer optimizations? I skimmed through the PR and didn't see anything that jumped out.

@cbalioglu (Contributor, Author):

> > Leverages several small-buffer optimizations, which result in a 50% reduction in tensor construction times.
>
> What are the small buffer optimizations? I skimmed through the PR and didn't see anything that jumped out.

  • Instead of zero_sizes(), which returns a std::vector<int64_t>, makeTempSizes() returns a SmallVector<int64_t, 5>.
  • We no longer call detail::defaultStrides(), which returns another heap-allocated vector that gets discarded right after the tensor is constructed. If no strides are specified, we call set_sizes_contiguous() instead.
  • The deleter path requires a heap-allocated context to be passed to DataPtr. With withContext() we avoid that by passing a naked context pointer and a raw function pointer (see the sketch below).
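
A minimal sketch of that last point, contrasting the two DataPtr construction paths. It assumes the c10::DataPtr constructor and the InefficientStdFunctionContext helper from c10/core/Allocator.h; it illustrates the cost difference and is not code from this PR:

```
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>

#include <functional>
#include <utility>

// Deleter path: the std::function has to live somewhere, so a context object
// is heap-allocated to own it and is destroyed together with the data.
c10::DataPtr data_ptr_with_std_function_deleter(
    void* data,
    std::function<void(void*)> deleter) {
  return c10::InefficientStdFunctionContext::makeDataPtr(
      data, std::move(deleter), c10::Device(c10::DeviceType::CPU));
}

// Context path: the caller already owns a context, so DataPtr can store the
// naked context pointer and a plain function pointer directly, with no extra
// heap allocation.
c10::DataPtr data_ptr_with_raw_context(
    void* data,
    void* ctx,
    c10::DeleterFnPtr ctx_deleter) {
  return c10::DataPtr(data, ctx, ctx_deleter, c10::Device(c10::DeviceType::CPU));
}
```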

@codecov (bot) commented Mar 24, 2021

Codecov Report

Merging #54530 (fcf0927) into master (556fc8d) will decrease coverage by 0.00%.
The diff coverage is n/a.

```
@@            Coverage Diff             @@
##           master   #54530      +/-   ##
==========================================
- Coverage   77.46%   77.45%   -0.01%
==========================================
  Files        1893     1893
  Lines      185939   185939
==========================================
- Hits       144041   144026      -15
- Misses      41898    41913      +15
```

```
TensorOptions opts_{};
};

inline TensorMaker forBlob(void* data, IntArrayRef sizes) noexcept {
```
A Contributor commented on the lines above:

camel case function name is inconsistent with other APIs (case in point: it's at::from_blob)

@ezyang (Contributor) commented Mar 24, 2021

Thanks, the optimizations are much appreciated.

I think my primary commentary is about naming bikeshed: function name (forBlob versus for_blob) as well as naming of methods (withStrides versus strides). We have a preexisting set of helpers for defining fluent interfaces for NN modules, see torch/csrc/api/include/torch/arg.h and torch/csrc/api/include/torch/nn/options/linear.h; while you don't have to actually directly use the code here, it would be good if we lined up the conventions here.
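
For readers unfamiliar with that convention, here is a minimal hand-rolled sketch of the pattern the arg.h helpers encode: snake_case setters that share the field's name and return *this for chaining. ExampleOptions and its fields are illustrative, not taken from arg.h or linear.h:

```
#include <cstdint>

// Hand-rolled approximation of the fluent-options convention used by the C++
// frontend (see torch/csrc/api/include/torch/arg.h); names are illustrative.
struct ExampleOptions {
  // Setter: snake_case, same name as the field, returns *this for chaining.
  ExampleOptions& in_features(int64_t value) {
    in_features_ = value;
    return *this;
  }
  // Getter: same name, no get_ prefix.
  int64_t in_features() const noexcept {
    return in_features_;
  }

  ExampleOptions& bias(bool value) {
    bias_ = value;
    return *this;
  }
  bool bias() const noexcept {
    return bias_;
  }

 private:
  int64_t in_features_ = 0;
  bool bias_ = true;
};

// Usage mirrors how options objects are consumed in the C++ frontend:
//   auto opts = ExampleOptions().in_features(128).bias(false);
```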

@cbalioglu (Contributor, Author):

> Thanks, the optimizations are much appreciated.

You are welcome!

> I think my primary commentary is about naming bikeshed: function name (forBlob versus for_blob) as well as naming of methods (withStrides versus strides). We have a preexisting set of helpers for defining fluent interfaces for NN modules, see torch/csrc/api/include/torch/arg.h and torch/csrc/api/include/torch/nn/options/linear.h; while you don't have to actually directly use the code here, it would be good if we lined up the conventions here.

Thanks a lot for the pointers. I am fairly new to the code base; good to know that we have an existing convention for fluent APIs. Let me check them out. I hope to have a second revision later today.

@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D27269344

@ezyang (Contributor) left a comment:

yey

@facebook-github-bot (Contributor):

This pull request has been merged in 9029d0d.

@swolchok (Contributor) commented Apr 1, 2021

How does this new implementation ensure that the requested device, the device of the passed-in data void pointer, and the device specified in TensorOptions all match?

@swolchok (Contributor) commented Apr 1, 2021

Also it seems unfortunate that for CPU we first make an empty CPU tensor with its own allocated Storage (see the auto storage_impl = c10::make_intrusive<StorageImpl>(...) call) and then throw that Storage away and provide our own. If we really did assume CPU tensors, we could avoid all that by using detail::make_tensor directly and providing our own Storage from the beginning.

Perhaps we could/should provide a dispatched way to make a Tensor with a given Storage? Not sure on that solution; there seems to be an extra degree of freedom in here somewhere that maybe we don't need...
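
A rough sketch of the CPU-only shortcut described above: wrap the caller's DataPtr in a Storage and hand it straight to a TensorImpl, instead of allocating an empty tensor first and swapping its Storage out. It assumes the Storage/TensorImpl constructors and the at::detail::make_tensor helper as they exist around this time; it is not code from this PR:

```
#include <ATen/ATen.h>

// Rough sketch, for discussion only: build a Storage around an externally
// owned DataPtr and hand it directly to a TensorImpl, instead of allocating
// an empty CPU tensor and then replacing its Storage.
at::Tensor cpu_tensor_from_external_data_sketch(
    void* data,
    at::IntArrayRef sizes,
    void* ctx,
    c10::DeleterFnPtr ctx_deleter,
    caffe2::TypeMeta dtype) {
  size_t size_bytes = dtype.itemsize();
  for (int64_t s : sizes) {
    size_bytes *= static_cast<size_t>(s);
  }
  c10::Storage storage(
      c10::Storage::use_byte_size_t(),
      size_bytes,
      c10::DataPtr(data, ctx, ctx_deleter, c10::Device(c10::DeviceType::CPU)),
      /*allocator=*/nullptr,
      /*resizable=*/false);
  at::Tensor tensor = at::detail::make_tensor<c10::TensorImpl>(
      std::move(storage), c10::DispatchKeySet(c10::DispatchKey::CPU), dtype);
  // The blob is assumed to be contiguous; strides are derived from the sizes.
  tensor.unsafeGetTensorImpl()->set_sizes_contiguous(sizes);
  return tensor;
}
```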

@cbalioglu (Contributor, Author):

Thanks for the feedback @swolchok. I deliberately did not touch the actual tensor construction logic; I just copied and cleaned up the majority of the existing implementation. Besides the optimizations I mentioned above and the new API, the internal machinery is more or less the same. As I am fairly new to the code base and this is such a core API, I did not want to accidentally break anything.

> How does this new implementation ensure that the requested device, the device of the passed-in data void pointer, and the device specified in TensorOptions all match?

I found the target_device parameter particularly confusing and am still not sure about its use case; still, I kept the existing logic intact. Similarly, I am not sure why we compare target_device with the TensorOptions device only when a device index is specified.

As you already mentioned, I am fairly confident that there is room for improvement in how we construct tensors from external data. I would be happy to work further on this if deemed necessary.

@ezyang (Contributor) commented Apr 1, 2021

Yes, an easy further improvement is to factor empty_generic further so that you can pass in a DataPtr directly instead of calling Allocate on allocator, then making for_blob use it.

@cbalioglu (Contributor, Author):

> Yes, an easy further improvement is to factor empty_generic further so that you can pass in a DataPtr directly instead of calling Allocate on allocator, then making for_blob use it.

See #55705.

facebook-github-bot pushed a commit that referenced this pull request Apr 11, 2021
Summary:
This PR optimizes the way tensors are constructed from external data. It avoids allocating an empty tensor beforehand and directly constructs the target tensor by passing the newly-initialized `DataPtr`. Running some Facebook-internal benchmarks showed that combined with #54530 this PR achieves performance parity with Caffe2 tensor construction. (Overall ~2x speed improvement over the original `at::from_blob()` implementation.)

Testing is done with the existing unit and integration tests as there is no user-observable API change.

Pull Request resolved: #55705

Reviewed By: ezyang

Differential Revision: D27686043

Pulled By: cbalioglu

fbshipit-source-id: b365c614476bcf0567797dfaf2add1b76fb6c272
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021