Introduce a fluent API to construct tensors from external data. #54530

Closed

Conversation

@cbalioglu (Contributor) commented Mar 23, 2021

Summary:
This diff introduces the following changes and improvements:

  • Introduces a new fluent API to construct tensors from external data as an alternative to from_blob overloads. See below for an example.
  • Leverages several small-buffer optimizations, which result in a 50% reduction in tensor construction times.
  • Exposes a new (lightweight) way to construct tensors by passing a naked context and context_deleter pair as an alternative to the existing deleter parameter.
  • Updates the existing from_blob overloads to internally use the fluent API.

```
// Example 1
at::Tensor tensor = at::for_blob(data, sizes)
  .strides(strides)
  .context(context, [](void *ctx) { delete static_cast<Ctx*>(ctx); })
  .options(...)
  .target_device(...)
  .make_tensor();

// Example 2
at::Tensor tensor = at::for_blob(data, sizes).make_tensor();

// Example 3
at::Tensor tensor = at::for_blob(data, sizes)
  .deleter(...)
  .make_tensor();
```

Test Plan:
Below are the folly Benchmark results for the following two equivalent operations:

```
// The fluent API
at::Tensor tensor = at::for_blob(data, sizes)
  .deleter([buffer](void*) mutable { buffer.reset(); })
  .options(dtype(c10::ScalarType::Float))
  .make_tensor();

// The original `from_blob` overload
at::Tensor tensor = at::from_blob(
  data,
  sizes,
  [buffer](void*) mutable { buffer.reset(); },
  dtype(c10::ScalarType::Float));
```

```
============================================================================
scripts/balioglu/from_blob_exp/main.cpp         relative  time/iter  iters/s
============================================================================
fluent                                                     298.34ns    3.35M
from_blob                                         55.19%   540.51ns    1.85M
============================================================================
```

Various similar experiments show an approximately 50% reduction in tensor construction times.
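
For reference, a sketch of what such a folly Benchmark harness could look like. The internal scripts/balioglu/from_blob_exp/main.cpp is not part of this PR, so the buffer setup, sizes, and dtype below are assumptions rather than the actual benchmark code:

```
#include <ATen/ATen.h>
#include <folly/Benchmark.h>
#include <folly/init/Init.h>

#include <memory>
#include <vector>

namespace {
constexpr int64_t kNumel = 1024;
const std::vector<int64_t> sizes{kNumel};
} // namespace

// Fluent API path (the baseline row in the table above).
BENCHMARK(fluent, iters) {
  for (unsigned i = 0; i < iters; ++i) {
    auto buffer = std::make_shared<std::vector<float>>(kNumel);
    at::Tensor tensor = at::for_blob(buffer->data(), sizes)
        .deleter([buffer](void*) mutable { buffer.reset(); })
        .options(at::dtype(c10::ScalarType::Float))
        .make_tensor();
    folly::doNotOptimizeAway(tensor);
  }
}

// Original from_blob overload, reported relative to the fluent path.
BENCHMARK_RELATIVE(from_blob, iters) {
  for (unsigned i = 0; i < iters; ++i) {
    auto buffer = std::make_shared<std::vector<float>>(kNumel);
    at::Tensor tensor = at::from_blob(
        buffer->data(),
        sizes,
        [buffer](void*) mutable { buffer.reset(); },
        at::dtype(c10::ScalarType::Float));
    folly::doNotOptimizeAway(tensor);
  }
}

int main(int argc, char** argv) {
  folly::init(&argc, &argv);
  folly::runBenchmarks();
  return 0;
}
```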

Differential Revision: D27269344

@facebook-github-bot (Contributor) commented Mar 23, 2021

💊 CI failures summary and remediations

As of commit fcf0927 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)


@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D27269344

@ezyang (Contributor) commented Mar 23, 2021

> Leverages several small-buffer optimizations, which result in a 50% reduction in tensor construction times.

What are the small buffer optimizations? I skimmed through the PR and didn't see anything that jumped out.

@cbalioglu (Contributor, Author):

> > Leverages several small-buffer optimizations, which result in a 50% reduction in tensor construction times.
>
> What are the small buffer optimizations? I skimmed through the PR and didn't see anything that jumped out.

  • Instead of zero_sizes(), which returns a std::vector<int64_t>, makeTempSizes() returns a SmallVector<int64_t, 5>.
  • We no longer call detail::defaultStrides(), which returns another heap-allocated vector that gets discarded right after the tensor is constructed. If no strides are specified, we call set_sizes_contiguous() instead.
  • The deleter path requires a heap-allocated context to be passed to DataPtr. With withContext() we avoid that by passing a naked context pointer and a raw function pointer (see the sketch below).
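
A minimal sketch of that last point, contrasting the two DataPtr construction paths. It assumes the c10::DataPtr constructor and the InefficientStdFunctionContext helper from c10/core/Allocator.h; it illustrates the cost difference and is not code from this PR:

```
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>

#include <functional>
#include <utility>

// Deleter path: the std::function has to live somewhere, so a context object
// is heap-allocated to own it and is destroyed together with the data.
c10::DataPtr data_ptr_with_std_function_deleter(
    void* data,
    std::function<void(void*)> deleter) {
  return c10::InefficientStdFunctionContext::makeDataPtr(
      data, std::move(deleter), c10::Device(c10::DeviceType::CPU));
}

// Context path: the caller already owns a context, so DataPtr can store the
// naked context pointer and a plain function pointer directly, with no extra
// heap allocation.
c10::DataPtr data_ptr_with_raw_context(
    void* data,
    void* ctx,
    c10::DeleterFnPtr ctx_deleter) {
  return c10::DataPtr(data, ctx, ctx_deleter, c10::Device(c10::DeviceType::CPU));
}
```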

@codecov (bot) commented Mar 24, 2021

Codecov Report

Merging #54530 (fcf0927) into master (556fc8d) will decrease coverage by 0.00%.
The diff coverage is n/a.

```
@@            Coverage Diff             @@
##           master   #54530      +/-   ##
==========================================
- Coverage   77.46%   77.45%   -0.01%
==========================================
  Files        1893     1893
  Lines      185939   185939
==========================================
- Hits       144041   144026      -15
- Misses      41898    41913      +15
```

```
TensorOptions opts_{};
};

inline TensorMaker forBlob(void* data, IntArrayRef sizes) noexcept {
```
A Contributor commented on the lines above:

camel case function name is inconsistent with other APIs (case in point: it's at::from_blob)

@ezyang (Contributor) commented Mar 24, 2021

Thanks, the optimizations are much appreciated.

I think my primary commentary is about naming bikeshed: function name (forBlob versus for_blob) as well as naming of methods (withStrides versus strides). We have a preexisting set of helpers for defining fluent interfaces for NN modules, see torch/csrc/api/include/torch/arg.h and torch/csrc/api/include/torch/nn/options/linear.h; while you don't have to actually directly use the code here, it would be good if we lined up the conventions here.
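
For readers unfamiliar with that convention, here is a minimal hand-rolled sketch of the pattern the arg.h helpers encode: snake_case setters that share the field's name and return *this for chaining. ExampleOptions and its fields are illustrative, not taken from arg.h or linear.h:

```
#include <cstdint>

// Hand-rolled approximation of the fluent-options convention used by the C++
// frontend (see torch/csrc/api/include/torch/arg.h); names are illustrative.
struct ExampleOptions {
  // Setter: snake_case, same name as the field, returns *this for chaining.
  ExampleOptions& in_features(int64_t value) {
    in_features_ = value;
    return *this;
  }
  // Getter: same name, no get_ prefix.
  int64_t in_features() const noexcept {
    return in_features_;
  }

  ExampleOptions& bias(bool value) {
    bias_ = value;
    return *this;
  }
  bool bias() const noexcept {
    return bias_;
  }

 private:
  int64_t in_features_ = 0;
  bool bias_ = true;
};

// Usage mirrors how options objects are consumed in the C++ frontend:
//   auto opts = ExampleOptions().in_features(128).bias(false);
```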

@cbalioglu (Contributor, Author):

> Thanks, the optimizations are much appreciated.

You are welcome!

> I think my primary commentary is about naming bikeshed: function name (forBlob versus for_blob) as well as naming of methods (withStrides versus strides). We have a preexisting set of helpers for defining fluent interfaces for NN modules, see torch/csrc/api/include/torch/arg.h and torch/csrc/api/include/torch/nn/options/linear.h; while you don't have to actually directly use the code here, it would be good if we lined up the conventions here.

Thanks a lot for the pointers. I am fairly new to the code base; good to know that we have an existing convention for fluent APIs. Let me check them out. I hope to have a second revision later today.

@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D27269344

@ezyang (Contributor) left a comment:

yey

@facebook-github-bot (Contributor):

This pull request has been merged in 9029d0d.

@swolchok (Contributor) commented Apr 1, 2021

How does this new implementation ensure that the requested device, the device of the passed-in data void pointer, and the device specified in TensorOptions all match?

@swolchok (Contributor) commented Apr 1, 2021

Also it seems unfortunate that for CPU we first make an empty CPU tensor with its own allocated Storage (see the auto storage_impl = c10::make_intrusive<StorageImpl>(...) call) and then throw that Storage away and provide our own. If we really did assume CPU tensors, we could avoid all that by using detail::make_tensor directly and providing our own Storage from the beginning.

Perhaps we could/should provide a dispatched way to make a Tensor with a given Storage? Not sure on that solution; there seems to be an extra degree of freedom in here somewhere that maybe we don't need...
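
A rough sketch of the CPU-only shortcut described above: wrap the caller's DataPtr in a Storage and hand it straight to a TensorImpl, instead of allocating an empty tensor first and swapping its Storage out. It assumes the Storage/TensorImpl constructors and the at::detail::make_tensor helper as they exist around this time; it is not code from this PR:

```
#include <ATen/ATen.h>

// Rough sketch, for discussion only: build a Storage around an externally
// owned DataPtr and hand it directly to a TensorImpl, instead of allocating
// an empty CPU tensor and then replacing its Storage.
at::Tensor cpu_tensor_from_external_data_sketch(
    void* data,
    at::IntArrayRef sizes,
    void* ctx,
    c10::DeleterFnPtr ctx_deleter,
    caffe2::TypeMeta dtype) {
  size_t size_bytes = dtype.itemsize();
  for (int64_t s : sizes) {
    size_bytes *= static_cast<size_t>(s);
  }
  c10::Storage storage(
      c10::Storage::use_byte_size_t(),
      size_bytes,
      c10::DataPtr(data, ctx, ctx_deleter, c10::Device(c10::DeviceType::CPU)),
      /*allocator=*/nullptr,
      /*resizable=*/false);
  at::Tensor tensor = at::detail::make_tensor<c10::TensorImpl>(
      std::move(storage), c10::DispatchKeySet(c10::DispatchKey::CPU), dtype);
  // The blob is assumed to be contiguous; strides are derived from the sizes.
  tensor.unsafeGetTensorImpl()->set_sizes_contiguous(sizes);
  return tensor;
}
```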

@cbalioglu (Contributor, Author):

Thanks for the feedback @swolchok. I deliberately did not touch the actual tensor construction logic; I just copied and cleaned up the majority of the existing implementation. Besides the optimizations I mentioned above and the new API, the internal machinery is more or less the same. As I am fairly new to the code base and this is such a core API, I did not want to accidentally break anything.

> How does this new implementation ensure that the requested device, the device of the passed-in data void pointer, and the device specified in TensorOptions all match?

I found the target_device parameter particularly confusing and am still not sure about its use case; still, I kept the existing logic intact. Similarly, I am not sure why we compare target_device with the TensorOptions device only when a device index is specified.

As you already mentioned, I am fairly confident that there is room for improvement in how we construct tensors from external data. I would be happy to work further on this if deemed necessary.

@ezyang (Contributor) commented Apr 1, 2021

Yes, an easy further improvement is to factor empty_generic further so that you can pass in a DataPtr directly instead of calling Allocate on allocator, then making for_blob use it.

@cbalioglu (Contributor, Author):

> Yes, an easy further improvement is to factor empty_generic further so that you can pass in a DataPtr directly instead of calling Allocate on allocator, then making for_blob use it.

See #55705.

facebook-github-bot pushed a commit that referenced this pull request Apr 11, 2021
Summary:
This PR optimizes the way tensors are constructed from external data. It avoids allocating an empty tensor beforehand and directly constructs the target tensor by passing the newly-initialized `DataPtr`. Running some Facebook-internal benchmarks showed that combined with #54530 this PR achieves performance parity with Caffe2 tensor construction. (Overall ~2x speed improvement over the original `at::from_blob()` implementation.)

Testing is done with the existing unit and integration tests as there is no user-observable API change.

Pull Request resolved: #55705

Reviewed By: ezyang

Differential Revision: D27686043

Pulled By: cbalioglu

fbshipit-source-id: b365c614476bcf0567797dfaf2add1b76fb6c272
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021