Add sharding constraint for the output tensor in the model #18536

Merged: 11 commits into keras-team:master on Oct 6, 2023

Conversation

@qlzh727 (Member) commented Oct 2, 2023:

This will be used for sharding intermediate states (e.g. activations).

Updates to the API:

  1. keras.distribute.relayout to relayout / set a sharding constraint on a tensor value.
  2. LayoutMap now supports using the layer name as the key for the output layout of that layer.

The unit tests have been updated for JAX as a demonstration.

This should fix #18521
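
For illustration, a minimal sketch of how an output-sharding constraint could be declared through LayoutMap (the mesh shape, layer name, and the exact key convention for layer outputs are assumptions here; the key format is discussed further below in the review):

import keras
from keras import distribution

# Hypothetical sketch only: names and key convention are assumptions,
# not taken verbatim from this PR.
devices = distribution.list_devices()
mesh = distribution.DeviceMesh(
    shape=(1, len(devices)), axis_names=("batch", "model"), devices=devices
)

layout_map = distribution.LayoutMap(mesh)
# Variable sharding (pre-existing behavior): shard the kernel over "model".
layout_map["dense_1/kernel"] = (None, "model")
# Intermediate/output sharding (what this PR adds): key on the layer path.
layout_map["dense_1"] = ("batch", "model")

distribution.set_distribution(
    distribution.ModelParallel(mesh, layout_map, batch_dim_name="batch")
)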

@codecov-commenter commented Oct 2, 2023:

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (25e4fa6) 78.00% compared to head (2fc70fc) 78.03%.
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #18536      +/-   ##
==========================================
+ Coverage   78.00%   78.03%   +0.02%     
==========================================
  Files         334      334              
  Lines       32351    32406      +55     
  Branches     6313     6322       +9     
==========================================
+ Hits        25237    25287      +50     
- Misses       5546     5548       +2     
- Partials     1568     1571       +3     
Flag Coverage Δ
keras 77.93% <93.93%> (+0.02%) ⬆️
keras-jax 63.41% <93.93%> (-0.01%) ⬇️
keras-numpy 57.49% <42.42%> (-0.04%) ⬇️
keras-tensorflow 63.38% <33.33%> (-0.05%) ⬇️
keras-torch 64.26% <33.33%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
keras/backend/jax/core.py 89.13% <100.00%> (ø)
keras/backend/jax/trainer.py 95.51% <100.00%> (ø)
keras/distribution/distribution_lib.py 95.37% <100.00%> (+0.31%) ⬆️
keras/layers/layer.py 88.40% <100.00%> (+0.16%) ⬆️
keras/backend/jax/distribution_lib.py 88.57% <81.81%> (-4.29%) ⬇️

... and 1 file with indirect coverage changes


@fchollet (Member) left a comment:

Thanks for the PR! Two points of discussion to finalize the API.

@@ -43,6 +43,13 @@ def distribute_value(value, tensor_layout):
return jax.device_put(value, tensor_layout)


def relayout(value, tensor_layout):
Member:

Right now the distinction between "distribute_value" and "relayout" is not clear -- according to the docstrings, both are about setting a layout on a tensor. I wonder if we could use function names that make the difference clearer. When would you use one and when would you use the other?

Member Author:

Indeed. I think the major difference here is that one of them is supposed to work in a jitted function, and the other one works in eager mode. Currently I don't think we have a good way to detect which mode the user code is in and automatically choose the proper API for them.

I think the JAX with_sharding_constraint name gives a good indication that it applies the constraint to an intermediate tensor/state within the function. Maybe we should just align with that?

Member:

Q: what's the difference between with_sharding_constraint applied to the input tensor (i.e. the data array), vs calling device_put on the input tensor? Right now we use device_put for data distribution. Does with_sharding_constraint work for that?

Member Author:

So I had some chats in the JAX user group, and the conclusion is that with_sharding_constraint is designed to be used only in jitted functions. It might not work in a lot of cases outside of jax.jit.

I think we should rely on device_put in the pure eager context, for input data as well as variable initialization.
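
A minimal pure-JAX illustration of the distinction being discussed (the mesh and shapes here are made up for the example):

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-dimensional mesh over whatever devices are available.
mesh = Mesh(np.array(jax.devices()), ("model",))
sharding = NamedSharding(mesh, P("model"))

# Eager: device_put shards a concrete array immediately.
x = jax.device_put(jnp.ones((8, 16)), sharding)

@jax.jit
def f(t):
    t = t * 2.0
    # Traced: with_sharding_constraint tells the compiler how this
    # intermediate value should be sharded; it is meant for use inside jit.
    return jax.lax.with_sharding_constraint(t, sharding)

y = f(x)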

Member:

Let's do this, I think:

  • Have separate distribute_variable and distribute_tensor APIs. Apparently this may be needed for TF?
  • In JAX distribute_tensor, check if we're in a tracing context. If so, use sharding_constraint. If not, use device_put (see the sketch after this list).
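
A hedged sketch of that dispatch for the JAX backend (the jax_utils.is_in_jax_tracing_scope() helper referenced below is the internal check this PR ends up adding; the import path and exact shape are assumptions):

import jax

from keras.backend.jax import jax_utils  # internal helper; path assumed


def distribute_tensor(tensor, layout):
    """Sketch: apply `layout` to a tensor value in eager or jit contexts."""
    if jax_utils.is_in_jax_tracing_scope():
        # Inside jit tracing: record a sharding constraint for the compiler,
        # since data cannot be physically moved while tracing.
        return jax.lax.with_sharding_constraint(tensor, layout)
    # Eager: physically place/shard the concrete array.
    return jax.device_put(tensor, layout)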

Member Author:

Done. I have added a check in distribute_tensor for whether it's in the jitted context, and I think it might not be a cheap check if we have to do it very often. I also didn't find a proper way to do this kind of check (see google/jax#9241). We might want to check with the JAX team on this.

if isinstance(value, KerasTensor):
# keras tensor is only used for building functional model, and can't be
# used to alter layout/sharding.
return value
Member:

Should KerasTensors still have a layout attribute? In case we need to read it on the tensor? Or is that not useful.

Member Author:

Great question. The issue I hit when using KerasTensor with the JAX sharding API is that it always tries to convert the KerasTensor to a jax array, which results in an error. It might make sense to add a layout attribute only when KerasTensor is a subclass of jax array or tf.Tensor.

distribution = distribution_lib.distribution()
if distribution is not None:
current_layer_path = current_path()
layout = distribution.get_tensor_layout(current_layer_path)
@fchollet (Member) commented Oct 3, 2023:

So for setting the layout of the output of a Dense layer in a subclassed model, you'd do like layout_map["model/layers/dense_1"] = (...)?

  1. Is there a risk of confusing variable layouts and intermediate tensor layouts?
  2. Should we be more specific, e.g. layout_map["model/layers/dense_1/output"] = (...) ? This could also leave the door open for input if ever needed.
  3. Is the full path too much information? What about layout_map["dense_1/output"] = (...)? Is that confusing?

@qlzh727 (Member Author) commented Oct 3, 2023:

So for setting the layout of the output of a Dense layer in a subclassed model, you'd do like layout_map["model/layers/dense_1"] = (...)?

Correct, it's based on the path/name scope of the subclassed model.

  1. Is there a risk of confusing variable layouts and intermediate tensor layouts?

It does. And my original intent was actually the same as option 2.

  2. Should we be more specific, e.g. layout_map["model/layers/dense_1/output"] = (...)? This could also leave the door open for input if ever needed.

That will definitely make it more explicit. It also opens the option of mapping to any intermediate Keras operation within the layer body.

  3. Is the full path too much information? What about layout_map["dense_1/output"] = (...)? Is that confusing?

Matt had the same question, and he proposed using regex.search instead of regex.match so that users can skip the prefix. My original implementation was trying to be a bit strict, so that a layout won't accidentally map to unwanted weights. In the case of overlapping rules that apply to the same weights, currently the first one wins. Maybe we can take the regex.search approach and raise an error when multiple rules map to the same weights/tensor. (I will probably do this in a separate PR.)

Member:

Maybe we can take the regex.search approach and raise an error when multiple rules map to the same weights/tensor

I think that's a good idea: we can use search and then make sure that each variable matches at most one rule.
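
A small sketch of that matching rule (illustrative only; the helper below is not the code in this PR):

import re


def resolve_layout(path, layout_rules):
    """Match `path` against layout rules with re.search, and fail loudly
    if more than one rule matches the same variable/tensor path."""
    matches = [key for key in layout_rules if re.search(key, path)]
    if len(matches) > 1:
        raise ValueError(
            f"Path '{path}' matches multiple layout rules: {matches}"
        )
    return layout_rules[matches[0]] if matches else None


# The short key still matches the full path, so users can skip the prefix.
rules = {"dense_1/output": ("batch", "model")}
assert resolve_layout("model/layers/dense_1/output", rules) == ("batch", "model")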

Member Author:

Ack, let me do this in a separate PR.

@qlzh727 (Member Author) commented Oct 3, 2023:

@mattdangerw

@qlzh727 qlzh727 requested a review from fchollet October 3, 2023 20:37
@qlzh727 (Member Author) commented Oct 5, 2023:

PTAL again.

@fchollet (Member) left a comment:

Thanks for the updates!

return jax.device_put(value, tensor_layout)

# TODO(scottzhu): This might not be a cheap check, we should consider
# have some proper JAX API for doing this check.
Member:

I consulted mattjj and that is not something they are considering.

Member Author:

Ack. Thanks for the confirmation.

# have some proper JAX API for doing this check.
if jax_utils.is_in_jax_tracing_scope():
return jax.lax.with_sharding_constraint(tensor, tensor_layout)
else:
Member:

You can remove the else block and just do a plain return.
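
A sketch of the suggested shape, using the variable names from the diff above (an assumption, not the final code):

if jax_utils.is_in_jax_tracing_scope():
    return jax.lax.with_sharding_constraint(tensor, tensor_layout)
return jax.device_put(tensor, tensor_layout)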

Member Author:

Done.

@@ -27,8 +29,33 @@ def list_devices(device_type=None):
return [f"{device.device_kind}:{device.id}" for device in jax_devices]


def distribute_value(value, tensor_layout):
"""Distribute the value based on the layout.
def distribute_variable(value, tensor_layout):
Member:

it's a bit weird to have distribute_variable(value, tensor_layout) if we also have distribute_tensor(value, tensor_layout). I suggest switching to distribute_variable(value, layout) and distribute_tensor(value, layout)

Member Author:

Done.

def distribute_variable(value, tensor_layout):
"""Create a distributed variable for JAX.

Since JAX doesn't have variable class, this will just return a jax.Array
Member:

a variable class

Member Author:

Done.

def distribute_variable(value, tensor_layout):
"""Create a distributed variable for JAX.

Since JAX doesn't have variable class, this will just return a jax.Array
Member:

Use backticks for code keywords.

Member Author:

Done.

"""Change the layout of a Tensor value in the jit function execution.

Note that this might not work outside of the jitted function for certain
backend. To change the layout of a value eagerly, please use
Member:

It should work in both situations in JAX, right?

Member Author:

It works for JAX, but might not for all the other backends, e.g. tf.dtensor.relayout() only works with DTensor instances and inside a tf.function, and the PyTorch behavior is unknown for now.

I think we will update the docstring once we have all the backends implemented, to reflect the final behavior.

@qlzh727 qlzh727 requested a review from fchollet October 6, 2023 17:20
@fchollet (Member) left a comment:

LGTM -- Thank you!

@google-ml-butler bot added the kokoro:force-run and ready to pull (Ready to be merged into the codebase) labels on Oct 6, 2023
@fchollet fchollet merged commit c57e454 into keras-team:master Oct 6, 2023
7 checks passed
@google-ml-butler bot removed the awaiting review, ready to pull (Ready to be merged into the codebase), and kokoro:force-run labels on Oct 6, 2023
@qlzh727 qlzh727 deleted the sharding_constraint branch October 10, 2023 23:35
Status: Merged

Successfully merging this pull request may close these issues:

[Distribution] Support sharding for intermediate tensor within the model

4 participants