Introduce a fluent API to construct tensors from external data. #54530
Conversation
This pull request was exported from Phabricator. Differential Revision: D27269344
What are the small buffer optimizations? I skimmed through the PR and didn't see anything that jumped out.
Codecov Report

```
@@            Coverage Diff             @@
##           master   #54530      +/-   ##
==========================================
- Coverage   77.46%   77.45%   -0.01%
==========================================
  Files        1893     1893
  Lines      185939   185939
==========================================
- Hits       144041   144026      -15
- Misses      41898    41913      +15
```
aten/src/ATen/templates/Functions.h (Outdated)

```
  TensorOptions opts_{};
};

inline TensorMaker forBlob(void* data, IntArrayRef sizes) noexcept {
```
The camel-case function name is inconsistent with other APIs (case in point: it's `at::from_blob`).
Thanks, the optimizations are much appreciated. I think my primary commentary is about the naming bikeshed: the function name (…)
You are welcome!
Thanks a lot for the pointers. I am fairly new to the code base; good to know that we have an existing convention for fluent APIs. Let me check them. Hope to have a second revision later today.
…rch#54530)

Summary:
Pull Request resolved: pytorch#54530

This diff introduces the following changes and improvements:

- Introduces a new fluent API to construct tensors from external data as an alternative to the `from_blob` overloads. See below for an example.
- Leverages several small-buffer optimizations which result in a 50% reduction in tensor construction times.
- Exposes a new (lightweight) way to construct tensors by passing a naked `context` and `context_deleter` pair as an alternative to the existing `deleter` parameter.
- Updates the existing `from_blob` overloads to internally use the fluent API.

```
// Example 1
at::Tensor tensor = at::for_blob(data, sizes)
    .strides(strides)
    .context(context, [](void* ctx) { delete static_cast<Ctx*>(ctx); })
    .options(...)
    .target_device(...)
    .make_tensor();

// Example 2
at::Tensor tensor = at::for_blob(data, sizes).make_tensor();

// Example 3
at::Tensor tensor = at::for_blob(data, sizes)
    .deleter(...)
    .make_tensor();
```

Test Plan:
Below are the folly Benchmark results for the following two equivalent operations:

```
// The fluent API
at::Tensor tensor = at::for_blob(data, sizes)
    .deleter([buffer](void*) mutable { buffer.reset(); })
    .options(dtype(c10::ScalarType::Float))
    .make_tensor();

// The original `from_blob` overload
at::Tensor tensor = at::from_blob(
    data,
    sizes,
    [buffer](void*) mutable { buffer.reset(); },
    dtype(c10::ScalarType::Float));
```

```
============================================================================
scripts/balioglu/from_blob_exp/main.cpp      relative  time/iter  iters/s
============================================================================
fluent                                                 298.34ns     3.35M
from_blob                                     55.19%   540.51ns     1.85M
============================================================================
```

Various similar experiments show an approximate 50% reduction in tensor construction times.

Differential Revision: D27269344

fbshipit-source-id: 26cad33986836ec6576bacb485c2a01c10ff8005
This pull request was exported from Phabricator. Differential Revision: D27269344
Force-pushed from 30c6125 to fcf0927
yey
This pull request has been merged in 9029d0d.
How does this new implementation ensure that the requested device, the device of the passed-in …
Also it seems unfortunate that for CPU we first make an empty CPU tensor with its own allocated `Storage` (pytorch/aten/src/ATen/Utils.cpp, line 66 at 2309173) only to throw that `Storage` away and provide our own. If we really did assume CPU tensors, we could avoid all that by using `detail::make_tensor` directly and providing our own `Storage` from the beginning. Perhaps we could/should provide a dispatched way to make a `Tensor` with a given `Storage`? Not sure on that solution; there seems to be an extra degree of freedom in here somewhere that maybe we don't need...
Thanks for the feedback @swolchok. I deliberately did not touch the actual tensor construction logic; I just copied and cleaned up the majority of the existing implementation. Besides the optimizations I mentioned above and the new API, the internal machinery is more or less the same. As I am fairly new to the code base and this is such a core API, I did not want to accidentally break anything.
As you already mentioned, I am fairly confident that there is room for improvement in how we construct tensors from external data. I would be happy to work further on this if deemed necessary.
Yes, an easy further improvement is to factor …
See #55705. |
Summary: This PR optimizes the way tensors are constructed from external data. It avoids allocating an empty tensor beforehand and directly constructs the target tensor by passing the newly-initialized `DataPtr`. Running some Facebook-internal benchmarks showed that, combined with #54530, this PR achieves performance parity with Caffe2 tensor construction. (Overall ~2x speed improvement over the original `at::from_blob()` implementation.) Testing is done with the existing unit and integration tests as there is no user-observable API change.

Pull Request resolved: #55705
Reviewed By: ezyang
Differential Revision: D27686043
Pulled By: cbalioglu
fbshipit-source-id: b365c614476bcf0567797dfaf2add1b76fb6c272