Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[7/n] [reconfigurator] combine inventory and omicron zones #6739

Conversation

sunshowers
Copy link
Contributor

@sunshowers sunshowers commented Oct 1, 2024

Currently, when the collection process makes two separate queries to sled-agent: one to obtain basic inventory, and the other to obtain Omicron zones. With this approach, it is possible for one of the two queries to fail -- in that case, the sled_agents and omicron_zones maps will not have the same set of keys.

In general, consumers of collections must be able to handle these kinds of failing cases, because collections contain information from a variety of sources, not all of which are guaranteed to work all the time. But this specific case (sled-agent inventory vs Omicron zones) is not intrinsic to the system -- it is an artifact of our APIs.

Change the API so that the /inventory endpoint also returns Omicron zones. The format remains the same, so that PUT /omicron-zones mirrors GET /inventory | jq .omicron_zones.

I had to make a few changes to our tests, since we now have a unified notion of Omicron zones and sled agents. But this exposed what feels to me like a modeling deficiency in the way our tests work. I hope to fix that up at some point soon.

There is also one behavior change to the reconfigurator CLI -- now, inventory-generate preserves zone information passed into it from the system. This matches the behavior of inventory generation within tests themselves.

Depends on:

Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@sunshowers
Copy link
Contributor Author

sunshowers commented Oct 2, 2024

(realized we can simplify this a lot by creating a explicit series of system descriptions along with their derived outputs -- will take a look at that tomorrow)

(will tackle this separately)

Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Comment on lines +934 to +935
// TODO: mutating example.system doesn't automatically update
// example.collection -- this should be addressed via API improvements.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate -- users must remember to do this themselves. This is what drove me to investigate how to make feedback from tests to SystemDescription work.

@sunshowers sunshowers marked this pull request as ready for review October 4, 2024 00:38
Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the title [reconfigurator] combine inventory and omicron zones [3/n] [reconfigurator] combine inventory and omicron zones Oct 4, 2024
Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

usable_physical_ram: s.usable_physical_ram.into(),
reservoir_size: s.reservoir_size.into(),
omicron_zones,
// TODO: the unwrap_or_defaults here look incorrect -- all the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to fix this in this PR? If not, mind opening an issue?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see two possibilities here:

  • If disks, zpools, datasets are like Omicron zones, where there's a "parent" record for the list itself (the inv_sled_omicron_zones), then I expect that this TODO is probably right and instead of unwrap_or_default() we probably want ok_or_else(|| Error::internal_error(&format!("sled {}: missing physical disks that we expected to see", sled_id))?.
  • If disks, zpools, and datasets don't have a separate record like this, then unwrap_or_default() seems correct? We would have only populated the maps for the rows that we found, but if a sled just had 0 disks, zpools, or datasets, then we wouldn't have found any rows, it wouldn't have shown up in the map, and then the empty Vec is correct here, right?

All of it makes me wonder if we have test coverage for serializing and deserializing a collection containing a sled with no disks, zpools, or datasets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, it does look like the latter -- so yes, the unwrap_or_default is correct. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this in a comment.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

Reviewing this prompted me to remember one reason I think I had separated these initially, which is that in the sled agent, the Omicron zones information was protected by a lock, and that lock gets held for the entire duration of the zone reconfiguration that happens when you PUT /omicron-zones. That could be a while. I was worried this would prevent us from getting any information about the sled while zone reconfiguration was happening. I'm not sure if the locking still works that way, but either way, it doesn't seem to have been a meaningful problem. (If it were, presumably we'd have a lot of partially failed inventory collections. I don't think we've noticed that.) If it is or becomes a problem, we can always fix the locking. All that's to say: this looks good.

nexus/db-queries/src/db/datastore/inventory.rs Outdated Show resolved Hide resolved
nexus/db-queries/src/db/datastore/inventory.rs Outdated Show resolved Hide resolved
nexus/db-queries/src/db/datastore/inventory.rs Outdated Show resolved Hide resolved
usable_physical_ram: s.usable_physical_ram.into(),
reservoir_size: s.reservoir_size.into(),
omicron_zones,
// TODO: the unwrap_or_defaults here look incorrect -- all the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see two possibilities here:

  • If disks, zpools, datasets are like Omicron zones, where there's a "parent" record for the list itself (the inv_sled_omicron_zones), then I expect that this TODO is probably right and instead of unwrap_or_default() we probably want ok_or_else(|| Error::internal_error(&format!("sled {}: missing physical disks that we expected to see", sled_id))?.
  • If disks, zpools, and datasets don't have a separate record like this, then unwrap_or_default() seems correct? We would have only populated the maps for the rows that we found, but if a sled just had 0 disks, zpools, or datasets, then we wouldn't have found any rows, it wouldn't have shown up in the map, and then the empty Vec is correct here, right?

All of it makes me wonder if we have test coverage for serializing and deserializing a collection containing a sled with no disks, zpools, or datasets.

Comment on lines 313 to 321
/// # Notes
///
/// While bootstrapping an example system, the order is:
///
/// 1. Create a collection with sleds with no zones.
/// 2. From the collection, generate a blueprint with zones.
/// 3. Seed the system with zones from the blueprint.
///
/// This method is used in step 3.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a little confused by this comment. What's the purpose of having it here? I feel like maybe you're trying to say that there's some assumptions it makes that are only true in this situation, and so it wouldn't be valid if you were calling it for some other reason? Or maybe that it wasn't suitable for other situations?

I do feel like it could use a comment explaining what it means to set the zones for a particular sled. I assume it means "these zones will be reported in the inventory for this sled" and that the caller is allowed to change this however it wants (e.g., even going backwards or transitions in other illegal ways, even though that shouldn't happen in a real system) -- nothing internal to SystemDescription will break. (An alternative reading would be: make these the zones that the sled should be running, though it might not yet be, which probably doesn't make sense in the context of SystemDescription.)

(My confusion is probably not relevant but in case it's useful: when I first read it, I thought it was going to describe the three-step process this function itself was going to carry out. Then I got to the end and realized it wasn't that. Then I thought "method" meant "approach". It took me a few readings to parse this as as "context about how this function is used".)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about not being clear here! Yes, it means "when Self::to_collection_builder is called next, it will contain these Omicron zones for the sled".

I agree that describing the bootstrap process isn't very useful (you can discover this by looking for the callers of sled_set_omicron_zones), so I've dropped it.

sim.system.to_collection_builder().context("generating inventory")?;
// For an inventory we just generated from thin air, pretend like each sled
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change the behavior here?

Copy link
Contributor Author

@sunshowers sunshowers Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does -- and I think what I was intending to do here was to try and figure out what this really means. I'll work on reconfigurator-cli in another PR that this depends on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put up a few more PRs in the series, and established this behavior change via new tests. I think it makes sense -- the system now carries around zone information so it's weird for the inventory to ignore it.

Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the title [3/n] [reconfigurator] combine inventory and omicron zones [7/n] [reconfigurator] combine inventory and omicron zones Oct 6, 2024
@sunshowers sunshowers changed the base branch from sunshowers/spr/main.reconfigurator-combine-inventory-and-omicron-zones to main October 11, 2024 21:30
Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) October 11, 2024 21:31
@sunshowers sunshowers merged commit c9fc674 into main Oct 12, 2024
17 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/reconfigurator-combine-inventory-and-omicron-zones branch October 12, 2024 01:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants