[gateway] ingest sensor measurements from SPs into oximeter #6354

Merged
merged 78 commits into from
Aug 24, 2024
Changes from 51 commits
a7271cc
sketch schema for sled sensor measurements
hawkw Aug 14, 2024
7d49243
add errors to schema
hawkw Aug 15, 2024
41c071e
wip
hawkw Aug 15, 2024
358bd96
also add chrono dep (whoops)
hawkw Aug 15, 2024
8f13131
change schema to use component as target
hawkw Aug 15, 2024
0e8a1c4
TOML syntax is bizarre...
hawkw Aug 15, 2024
71de57f
add switch_component target
hawkw Aug 15, 2024
d7ee2a4
urgh you can't have two targets in a schema
hawkw Aug 15, 2024
8c64069
basically all the plumbing
hawkw Aug 15, 2024
38de166
oh, okay, that's how you talk to the SPs!
hawkw Aug 15, 2024
12d660b
schema munging
hawkw Aug 15, 2024
f584881
more schema munging
hawkw Aug 15, 2024
7fd33a7
oops i forgot to do watts
hawkw Aug 16, 2024
0bd914b
draw the rest of the owl
hawkw Aug 16, 2024
f9ff0f2
actually update our understanding
hawkw Aug 16, 2024
f38bf1b
remove dbgs
hawkw Aug 16, 2024
4b03501
more schema tweaks
hawkw Aug 16, 2024
a84f394
sp_sim_config.test.toml has no sensors
hawkw Aug 16, 2024
a41dc72
add dev-only MGS args instead of hardcoding
hawkw Aug 16, 2024
0e82aa5
thread through metrics dev args
hawkw Aug 16, 2024
af5b827
actually finish plumbing through metrics dev args
hawkw Aug 16, 2024
9e13ee7
misc cleanup
hawkw Aug 16, 2024
2c7a029
rough pass on sensor errors
hawkw Aug 16, 2024
7ae1ab6
add some more simulated devices
hawkw Aug 16, 2024
0f66a39
include MGS uuids in metrics
hawkw Aug 16, 2024
9d4b53c
remove obnoxious log line
hawkw Aug 16, 2024
b96f9c9
give the other sim gimlet some sensors also
hawkw Aug 16, 2024
45f54ef
rewrite the whole thing to use less memory etc
hawkw Aug 16, 2024
c4402b9
i guess we have to increment cumulative metrics?
hawkw Aug 17, 2024
3de80cb
start doing poll error metrics too
hawkw Aug 17, 2024
74d82e1
redo error handling, record poll error metrics
hawkw Aug 17, 2024
7f5c1e2
move metrics stuff to the config file
hawkw Aug 18, 2024
8b25a89
whoops fix nexus-test
hawkw Aug 19, 2024
52b8d74
update omdb output
hawkw Aug 19, 2024
21597a3
add more sim components to test
hawkw Aug 19, 2024
2869fb0
lol oops
hawkw Aug 19, 2024
5b56dbd
blergh
hawkw Aug 19, 2024
8fa9ad8
don't churn restarting pollers for non-present SPs
hawkw Aug 20, 2024
b01d0fb
use `sp_addr_watch` to wait for SPs to appear
hawkw Aug 20, 2024
8f3eae6
smallish logging tweaks
hawkw Aug 20, 2024
6efcdf7
i forgot to add the new producer kind to the db
hawkw Aug 20, 2024
2c5f83b
post rebase remove explicit simulator sensor IDs
hawkw Aug 20, 2024
72ccdd8
GAH i hate sql syntax
hawkw Aug 20, 2024
5443e65
update OMDB success cases again
hawkw Aug 20, 2024
4fd919b
rename most of the schema fields
hawkw Aug 20, 2024
ce9817d
add component descriptions to target
hawkw Aug 21, 2024
5b69235
discard samples if SP state changes mid-poll
hawkw Aug 21, 2024
c11992a
add a bit more config validation
hawkw Aug 21, 2024
830bb20
way less complex poller manager
hawkw Aug 21, 2024
23ab449
add a simple test that metrics make it to oximeter
hawkw Aug 21, 2024
3c766b3
that was supposed to be milliseconds
hawkw Aug 22, 2024
8a6252a
SP poller tasks should never need to be restarted
hawkw Aug 22, 2024
7aa3f77
schema grammar/wording suggestions from @bnaeker
hawkw Aug 22, 2024
90f950a
comment edits
hawkw Aug 22, 2024
6b45732
i before e, except after c
hawkw Aug 22, 2024
0d33578
rename schema file to match the name of the target
hawkw Aug 22, 2024
084d465
record missing samples on sensor errors
hawkw Aug 22, 2024
817e257
don't panic if the runtime is going away
hawkw Aug 22, 2024
3b89711
pretty-print toml parse errors when loading MGS config
hawkw Aug 22, 2024
ac86ecf
just use serde's duration parser in the config file
hawkw Aug 22, 2024
7f55277
get rid of poll interval configurability
hawkw Aug 22, 2024
64ae3a0
reduce panickiness
hawkw Aug 22, 2024
a62b2d4
oh, right: you have to actually remove the Result
hawkw Aug 22, 2024
d130a5b
i guess we need to delete the MGS logs
hawkw Aug 22, 2024
f49c0ec
if we're removing metrics stuff from the config, it needs to be optional
hawkw Aug 22, 2024
b49acd9
bleh
hawkw Aug 22, 2024
67e2a71
you have to delete it here too
hawkw Aug 22, 2024
681e4a2
disable metrics in `test_sp_updater_delivers_progress`
hawkw Aug 23, 2024
d497472
`get_or_insert_with` is nicer
hawkw Aug 23, 2024
d13aff5
gah, typo
hawkw Aug 23, 2024
66b8d11
GAH borrow
hawkw Aug 23, 2024
8ea3a29
gah typechecker
hawkw Aug 23, 2024
8705d65
make the test way fancier
hawkw Aug 23, 2024
d4d305b
simplify metrics config
hawkw Aug 23, 2024
aa68ed9
if you change the config format, you have to change the config files
hawkw Aug 23, 2024
55490d1
learn to type good, idiot
hawkw Aug 23, 2024
af300d0
Merge branch 'main' into eliza/sensor-metric
hawkw Aug 24, 2024
e7d2430
celsius
hawkw Aug 24, 2024
4 changes: 4 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 4 additions & 0 deletions clients/nexus-client/src/lib.rs
@@ -213,6 +213,7 @@ impl From<omicron_common::api::internal::nexus::ProducerKind>
fn from(kind: omicron_common::api::internal::nexus::ProducerKind) -> Self {
use omicron_common::api::internal::nexus::ProducerKind;
match kind {
ProducerKind::ManagementGateway => Self::ManagementGateway,
ProducerKind::SledAgent => Self::SledAgent,
ProducerKind::Service => Self::Service,
ProducerKind::Instance => Self::Instance,
@@ -390,6 +391,9 @@ impl From<types::ProducerKind>
fn from(kind: types::ProducerKind) -> Self {
use omicron_common::api::internal::nexus::ProducerKind;
match kind {
types::ProducerKind::ManagementGateway => {
ProducerKind::ManagementGateway
}
types::ProducerKind::SledAgent => ProducerKind::SledAgent,
types::ProducerKind::Instance => ProducerKind::Instance,
types::ProducerKind::Service => ProducerKind::Service,
1 change: 1 addition & 0 deletions clients/oximeter-client/src/lib.rs
@@ -26,6 +26,7 @@ impl From<omicron_common::api::internal::nexus::ProducerKind>
fn from(kind: omicron_common::api::internal::nexus::ProducerKind) -> Self {
use omicron_common::api::internal::nexus;
match kind {
nexus::ProducerKind::ManagementGateway => Self::ManagementGateway,
nexus::ProducerKind::Service => Self::Service,
nexus::ProducerKind::SledAgent => Self::SledAgent,
nexus::ProducerKind::Instance => Self::Instance,
2 changes: 2 additions & 0 deletions common/src/api/internal/nexus.rs
@@ -223,6 +223,8 @@ pub enum ProducerKind {
Service,
/// The producer is a Propolis VMM managing a guest instance.
Instance,
/// The producer is a management gateway service.
ManagementGateway,
}

/// Information announced by a metric server, used so that clients can contact it and collect
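The hunks above add the same `ManagementGateway` arm to every `ProducerKind` conversion. A minimal sketch of why those `From` impls have to change together follows. The two enums here (`InternalProducerKind`, `ClientProducerKind`) are hypothetical local stand-ins, not the real omicron or generated client types; only the variant names are taken from the diff.

```rust
// Hypothetical stand-in for `omicron_common::api::internal::nexus::ProducerKind`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum InternalProducerKind {
    SledAgent,
    Service,
    Instance,
    ManagementGateway,
}

// Hypothetical stand-in for a generated client's `types::ProducerKind`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ClientProducerKind {
    SledAgent,
    Service,
    Instance,
    ManagementGateway,
}

impl From<InternalProducerKind> for ClientProducerKind {
    fn from(kind: InternalProducerKind) -> Self {
        // No wildcard arm: adding a variant to the source enum makes this
        // match non-exhaustive, so the compiler flags every conversion
        // that still needs the new arm.
        match kind {
            InternalProducerKind::ManagementGateway => Self::ManagementGateway,
            InternalProducerKind::SledAgent => Self::SledAgent,
            InternalProducerKind::Service => Self::Service,
            InternalProducerKind::Instance => Self::Instance,
        }
    }
}

impl From<ClientProducerKind> for InternalProducerKind {
    fn from(kind: ClientProducerKind) -> Self {
        match kind {
            ClientProducerKind::ManagementGateway => Self::ManagementGateway,
            ClientProducerKind::SledAgent => Self::SledAgent,
            ClientProducerKind::Service => Self::Service,
            ClientProducerKind::Instance => Self::Instance,
        }
    }
}

/// Convert to the client type and back again.
fn round_trip(kind: InternalProducerKind) -> InternalProducerKind {
    let wire: ClientProducerKind = kind.into();
    wire.into()
}
```

Leaving out a `_ =>` arm is the design choice that matters here: the conversions fail to compile until the new producer kind is handled everywhere.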
1 change: 1 addition & 0 deletions dev-tools/mgs-dev/Cargo.toml
@@ -14,6 +14,7 @@ futures.workspace = true
gateway-messages.workspace = true
gateway-test-utils.workspace = true
libc.workspace = true
omicron-gateway.workspace = true
omicron-workspace-hack.workspace = true
signal-hook-tokio.workspace = true
tokio.workspace = true
23 changes: 21 additions & 2 deletions dev-tools/mgs-dev/src/main.rs
@@ -8,6 +8,7 @@ use clap::{Args, Parser, Subcommand};
use futures::StreamExt;
use libc::SIGINT;
use signal_hook_tokio::Signals;
use std::net::SocketAddr;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
@@ -36,7 +37,12 @@ enum MgsDevCmd {
}

 #[derive(Clone, Debug, Args)]
-struct MgsRunArgs {}
+struct MgsRunArgs {
+    /// Override the address of the Nexus instance to use when registering the
+    /// Oximeter producer.
+    #[clap(long)]
+    nexus_address: Option<SocketAddr>,
+}

impl MgsRunArgs {
async fn exec(&self) -> Result<(), anyhow::Error> {
@@ -46,9 +52,22 @@ impl MgsRunArgs {
         let mut signal_stream = signals.fuse();

         println!("mgs-dev: setting up MGS ... ");
-        let gwtestctx = gateway_test_utils::setup::test_setup(
+        let (mut mgs_config, sp_sim_config) =
+            gateway_test_utils::setup::load_test_config();
+        if let Some(addr) = self.nexus_address {
+            mgs_config.metrics.dev =
+                Some(omicron_gateway::metrics::DevConfig {
+                    bind_loopback: true,
+                    nexus_address: Some(addr),
+                });
+        }
+
+        let gwtestctx = gateway_test_utils::setup::test_setup_with_config(
             "mgs-dev",
             gateway_messages::SpPort::One,
+            mgs_config,
+            &sp_sim_config,
+            None,
         )
         .await;
         println!("mgs-dev: MGS is running.");
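The `mgs-dev` change above only touches the metrics dev settings when `--nexus-address` is given. A self-contained sketch of that override logic follows, under some assumptions: `MetricsConfig` and `apply_nexus_override` are hypothetical names, and `DevConfig` is a local stand-in that mirrors only the two fields visible in the diff, not the real `omicron_gateway::metrics::DevConfig`.

```rust
use std::net::SocketAddr;

/// Local stand-in mirroring the two `DevConfig` fields shown in the diff.
#[derive(Clone, Debug, Default, PartialEq)]
struct DevConfig {
    bind_loopback: bool,
    nexus_address: Option<SocketAddr>,
}

/// Hypothetical, simplified metrics section of the MGS config.
#[derive(Clone, Debug, Default, PartialEq)]
struct MetricsConfig {
    dev: Option<DevConfig>,
}

/// Apply an optional command-line override; with no address given,
/// the loaded config is returned unchanged.
fn apply_nexus_override(
    mut metrics: MetricsConfig,
    nexus_address: Option<SocketAddr>,
) -> MetricsConfig {
    if let Some(addr) = nexus_address {
        metrics.dev = Some(DevConfig {
            bind_loopback: true,
            nexus_address: Some(addr),
        });
    }
    metrics
}
```

An address such as `[::1]:12221` (an arbitrary illustrative port) parses into a `SocketAddr` with `"[::1]:12221".parse()`, which is the kind of value an `Option<SocketAddr>` argument would carry.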
25 changes: 20 additions & 5 deletions dev-tools/omdb/tests/successes.out
@@ -141,9 +141,16 @@ SP DETAILS: type "Sled" slot 0

COMPONENTS

-NAME          DESCRIPTION              DEVICE           PRESENCE  SERIAL
-sp3-host-cpu  FAKE host cpu            sp3-host-cpu     Present   None
-dev-0         FAKE temperature sensor  fake-tmp-sensor  Failed    None
+NAME          DESCRIPTION                               DEVICE           PRESENCE  SERIAL
+sp3-host-cpu  FAKE host cpu                             sp3-host-cpu     Present   None
+dev-0         FAKE temperature sensor                   fake-tmp-sensor  Failed    None
+dev-1         FAKE temperature sensor                   tmp117           Present   None
+dev-2         FAKE Southeast temperature sensor         tmp117           Present   None
+dev-6         FAKE U.2 Sharkfin A VPD                   at24csw080       Present   None
+dev-7         FAKE U.2 Sharkfin A hot swap controller   max5970          Present   None
+dev-8         FAKE U.2 A NVMe Basic Management Command  nvme_bmc         Present   None
+dev-39        FAKE T6 temperature sensor                tmp451           Present   None
+dev-53        FAKE Fan controller                       max31790         Present   None

CABOOSES: none found

@@ -167,8 +174,16 @@ SP DETAILS: type "Sled" slot 1

COMPONENTS

-NAME          DESCRIPTION    DEVICE        PRESENCE  SERIAL
-sp3-host-cpu  FAKE host cpu  sp3-host-cpu  Present   None
+NAME          DESCRIPTION                               DEVICE        PRESENCE  SERIAL
+sp3-host-cpu  FAKE host cpu                             sp3-host-cpu  Present   None
+dev-0         FAKE temperature sensor                   tmp117        Present   None
+dev-1         FAKE temperature sensor                   tmp117        Present   None
+dev-2         FAKE Southeast temperature sensor         tmp117        Present   None
+dev-6         FAKE U.2 Sharkfin A VPD                   at24csw080    Present   None
+dev-7         FAKE U.2 Sharkfin A hot swap controller   max5970       Present   None
+dev-8         FAKE U.2 A NVMe Basic Management Command  nvme_bmc      Present   None
+dev-39        FAKE T6 temperature sensor                tmp451        Present   None
+dev-53        FAKE Fan controller                       max31790      Present   None

CABOOSES: none found

11 changes: 11 additions & 0 deletions gateway-test-utils/configs/config.test.toml
@@ -88,6 +88,17 @@ addr = "[::1]:0"
ignition-target = 3
location = { switch0 = ["sled", 1], switch1 = ["sled", 1] }

#
# Configuration for SP sensor metrics polling
#
[metrics]
# Bryan wants to try polling SP sensors at 1Hz.
sp_poll_interval_ms = 1000
# Tell Oximeter to collect our metrics every 10 seconds.
oximeter_collection_interval_secs = 10
# Allow binding the metrics server on localhost.
dev = { bind_loopback = true }

#
# NOTE: for the test suite, if mode = "file", the file path MUST be the sentinel
# string "UNUSED". The actual path will be generated by the test suite for each
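With the `[metrics]` values in this hunk, SPs are polled once per second while Oximeter collects every ten seconds, so roughly ten polls happen between collections. The arithmetic can be sketched as below. `MetricsIntervals` and its methods are hypothetical names for illustration, not the types MGS actually uses to load this table.

```rust
use std::time::Duration;

/// Hypothetical holder for the two intervals in the `[metrics]` table.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct MetricsIntervals {
    sp_poll_interval: Duration,
    oximeter_collection_interval: Duration,
}

impl MetricsIntervals {
    /// Build from the raw integers as they appear in the TOML:
    /// milliseconds for SP polling, seconds for Oximeter collection.
    fn from_raw(
        sp_poll_interval_ms: u64,
        oximeter_collection_interval_secs: u64,
    ) -> Self {
        Self {
            sp_poll_interval: Duration::from_millis(sp_poll_interval_ms),
            oximeter_collection_interval: Duration::from_secs(
                oximeter_collection_interval_secs,
            ),
        }
    }

    /// Whole SP polls that fit in one Oximeter collection interval.
    /// Returns `None` for a zero poll interval instead of dividing by zero.
    fn polls_per_collection(&self) -> Option<u128> {
        self.oximeter_collection_interval
            .as_millis()
            .checked_div(self.sp_poll_interval.as_millis())
    }
}
```

Converting both fields to `Duration` up front keeps the differing units (milliseconds and seconds) from leaking into later arithmetic.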
166 changes: 166 additions & 0 deletions gateway-test-utils/configs/sp_sim_config.test.toml
@@ -20,13 +20,19 @@ device = "fake-tmp-sensor"
description = "FAKE temperature sensor 1"
capabilities = 0x2
presence = "Present"
sensors = [
{name = "Southwest", kind = "Temperature", last_data.value = 41.7890625, last_data.timestamp = 1234 },
]

[[simulated_sps.sidecar.components]]
id = "dev-1"
device = "fake-tmp-sensor"
description = "FAKE temperature sensor 2"
capabilities = 0x2
presence = "Failed"
sensors = [
{ name = "South", kind = "Temperature", last_error.value = "DeviceError", last_error.timestamp = 1234 },
]

[[simulated_sps.sidecar]]
multicast_addr = "::1"
@@ -56,6 +62,82 @@ device = "fake-tmp-sensor"
description = "FAKE temperature sensor"
capabilities = 0x2
presence = "Failed"
sensors = [
{ name = "Southwest", kind = "Temperature", last_error.value = "DeviceError", last_error.timestamp = 1234 },
]
[[simulated_sps.gimlet.components]]
id = "dev-1"
device = "tmp117"
description = "FAKE temperature sensor"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "South", kind = "Temperature", last_data.value = 42.5625, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-2"
device = "tmp117"
description = "FAKE Southeast temperature sensor"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "Southeast", kind = "Temperature", last_data.value = 41.570313, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-6"
device = "at24csw080"
description = "FAKE U.2 Sharkfin A VPD"
capabilities = 0x0
presence = "Present"

[[simulated_sps.gimlet.components]]
id = "dev-7"
device = "max5970"
description = "FAKE U.2 Sharkfin A hot swap controller"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "V12_U2A_A0", kind = "Current", last_data.value = 0.45898438, last_data.timestamp = 1234 },
{ name = "V3P3_U2A_A0", kind = "Current", last_data.value = 0.024414063, last_data.timestamp = 1234 },
{ name = "V12_U2A_A0", kind = "Voltage", last_data.value = 12.03125, last_data.timestamp = 1234 },
{ name = "V3P3_U2A_A0", kind = "Voltage", last_data.value = 3.328125, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-8"
device = "nvme_bmc"
description = "FAKE U.2 A NVMe Basic Management Command"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "U2_N0", kind = "Temperature", last_data.value = 56.0, last_data.timestamp = 1234 },
]
[[simulated_sps.gimlet.components]]
id = "dev-39"
device = "tmp451"
description = "FAKE T6 temperature sensor"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "t6", kind = "Temperature", last_data.value = 70.625, last_data.timestamp = 1234 },
]
[[simulated_sps.gimlet.components]]
id = "dev-53"
device = "max31790"
description = "FAKE Fan controller"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "Southeast", kind = "Speed", last_data.value = 2607.0, last_data.timestamp = 1234 },
{ name = "Northeast", kind = "Speed", last_data.value = 2476.0, last_data.timestamp = 1234 },
{ name = "South", kind = "Speed", last_data.value = 2553.0, last_data.timestamp = 1234 },
{ name = "North", kind = "Speed", last_data.value = 2265.0, last_data.timestamp = 1234 },
{ name = "Southwest", kind = "Speed", last_data.value = 2649.0, last_data.timestamp = 1234 },
{ name = "Northwest", kind = "Speed", last_data.value = 2275.0, last_data.timestamp = 1234 },
]


[[simulated_sps.gimlet]]
multicast_addr = "::1"
@@ -72,6 +154,90 @@ capabilities = 0
presence = "Present"
serial_console = "[::1]:0"


[[simulated_sps.gimlet.components]]
id = "dev-0"
device = "tmp117"
description = "FAKE temperature sensor"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "Southwest", kind = "Temperature", last_data.value = 41.3629, last_data.timestamp = 1234 },
]
[[simulated_sps.gimlet.components]]
id = "dev-1"
device = "tmp117"
description = "FAKE temperature sensor"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "South", kind = "Temperature", last_data.value = 42.5625, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-2"
device = "tmp117"
description = "FAKE Southeast temperature sensor"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "Southeast", kind = "Temperature", last_data.value = 41.570313, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-6"
device = "at24csw080"
description = "FAKE U.2 Sharkfin A VPD"
capabilities = 0x0
presence = "Present"

[[simulated_sps.gimlet.components]]
id = "dev-7"
device = "max5970"
description = "FAKE U.2 Sharkfin A hot swap controller"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "V12_U2A_A0", kind = "Current", last_data.value = 0.41893438, last_data.timestamp = 1234 },
{ name = "V3P3_U2A_A0", kind = "Current", last_data.value = 0.025614603, last_data.timestamp = 1234 },
{ name = "V12_U2A_A0", kind = "Voltage", last_data.value = 12.02914, last_data.timestamp = 1234 },
{ name = "V3P3_U2A_A0", kind = "Voltage", last_data.value = 3.2618, last_data.timestamp = 1234 },
]

[[simulated_sps.gimlet.components]]
id = "dev-8"
device = "nvme_bmc"
description = "FAKE U.2 A NVMe Basic Management Command"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "U2_N0", kind = "Temperature", last_data.value = 56.0, last_data.timestamp = 1234 },
]
[[simulated_sps.gimlet.components]]
id = "dev-39"
device = "tmp451"
description = "FAKE T6 temperature sensor"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "t6", kind = "Temperature", last_data.value = 70.625, last_data.timestamp = 1234 },
]
[[simulated_sps.gimlet.components]]
id = "dev-53"
device = "max31790"
description = "FAKE Fan controller"
capabilities = 0x2
presence = "Present"
sensors = [
{ name = "Southeast", kind = "Speed", last_data.value = 2510.0, last_data.timestamp = 1234 },
{ name = "Northeast", kind = "Speed", last_data.value = 2390.0, last_data.timestamp = 1234 },
{ name = "South", kind = "Speed", last_data.value = 2467.0, last_data.timestamp = 1234 },
{ name = "North", kind = "Speed", last_data.value = 2195.0, last_data.timestamp = 1234 },
{ name = "Southwest", kind = "Speed", last_data.value = 2680.0, last_data.timestamp = 1234 },
{ name = "Northwest", kind = "Speed", last_data.value = 2212.0, last_data.timestamp = 1234 },
]


#
# NOTE: for the test suite, the [log] section is ignored; sp-sim logs are rolled
# into the gateway logfile.
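Each simulated sensor entry above carries a `name`, a `kind`, and either a `last_data` or a `last_error` pair of `value` and `timestamp`. One way to model that shape is sketched below. All type and function names are hypothetical and are not the simulator's actual config types; only the kinds and field names come from the TOML.

```rust
/// Sensor kinds that appear in the simulator config above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SensorKind {
    Temperature,
    Current,
    Voltage,
    Speed,
}

/// The most recent state of a sensor: a reading or an error, never both.
#[derive(Clone, Debug, PartialEq)]
enum LastReading {
    Data { value: f64, timestamp: u64 },
    Error { value: String, timestamp: u64 },
}

/// Hypothetical in-memory form of one `sensors = [...]` entry.
#[derive(Clone, Debug, PartialEq)]
struct SimSensor {
    name: String,
    kind: SensorKind,
    last: LastReading,
}

impl SimSensor {
    /// The latest measured value, or `None` if the sensor last reported an error.
    fn latest_value(&self) -> Option<f64> {
        match &self.last {
            LastReading::Data { value, .. } => Some(*value),
            LastReading::Error { .. } => None,
        }
    }
}

/// Collect the values of all sensors of one kind that have a valid reading.
fn healthy_values(sensors: &[SimSensor], kind: SensorKind) -> Vec<f64> {
    sensors
        .iter()
        .filter(|s| s.kind == kind)
        .filter_map(|s| s.latest_value())
        .collect()
}
```

Using an enum for the last reading makes "data and error at once" unrepresentable, which matches how the entries above set exactly one of the two.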