[gateway] ingest sensor measurements from SPs into oximeter #6354

Merged

merged 78 commits into main from eliza/sensor-metric on Aug 24, 2024

Conversation

hawkw
Member

@hawkw hawkw commented Aug 16, 2024

This branch adds code to the Management Gateway Service for periodically
polling sensor measurements from SPs and emitting them to Oximeter. In
particular, this consists of:

  • a task for managing the metrics endpoint, waiting until MGS knows its
    underlay network address to bind the endpoint and register it with the
    control plane,
  • tasks for polling sensor measurements from each individual SP that MGS
    knows about,
  • a task that waits until SP discovery has completed and the rack ID is
    known, and then spawns a poller task for every discovered SP slot (see
    the sketch below)
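
A loose sketch of that task structure, assuming tokio (the `discover_sps`,
`run_poller`, `RackId`, and `Sample` names here are hypothetical stand-ins,
not the actual MGS types):

```rust
use tokio::sync::broadcast;

#[derive(Clone)]
struct Sample;
struct RackId; // stand-in for the real rack ID type

// Stand-in for MGS's discovery step: resolves once SP discovery has
// completed and the rack ID is known.
async fn discover_sps() -> (RackId, Vec<u16>) {
    unimplemented!()
}

// Stand-in for an individual SP poller, which polls the SP in `slot` at
// 1Hz and sends its samples down the shared channel.
async fn run_poller(_slot: u16, _tx: broadcast::Sender<Sample>) {}

async fn start_pollers(tx: broadcast::Sender<Sample>) {
    let (_rack_id, slots) = discover_sps().await;
    for slot in slots {
        // One poller task per discovered SP slot.
        tokio::spawn(run_poller(slot, tx.clone()));
    }
}
```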

The SP poller tasks send samples to the Oximeter producer endpoint using
a tokio::sync::broadcast channel, which I've chosen primarily because
it can be used as a bounded ring buffer that actually overwrites the
oldest value when the buffer is full. This way, we use a bounded
amount of memory for samples, but prioritize the most recent samples if
we have to throw anything away because Oximeter hasn't come along to
collect them recently.
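
The overwrite-on-full behavior looks like this (a minimal, self-contained
illustration of tokio::sync::broadcast, not code from this branch):

```rust
use tokio::sync::broadcast::{self, error::RecvError};

#[tokio::main]
async fn main() {
    // A capacity-two channel: the third send evicts the oldest value.
    let (tx, mut rx) = broadcast::channel(2);
    tx.send(1).unwrap();
    tx.send(2).unwrap();
    tx.send(3).unwrap(); // overwrites 1

    // A lagging receiver is told how many values it missed, then sees
    // only the most recent ones.
    assert!(matches!(rx.recv().await, Err(RecvError::Lagged(1))));
    assert_eq!(rx.recv().await.unwrap(), 2);
    assert_eq!(rx.recv().await.unwrap(), 3);
}
```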

The poller tasks cache the component inventory and identifying
information from the SP, so that we don't have to re-read all this data
from the SP on every poll. While MGS, running on a host, would probably
be fine with doing this, it seems better to avoid making the SP do
unnecessary work at a 1Hz poll frequency, especially when both switch
zones are polling them. Instead, every time we poll sensor data from an
SP, we first ask it for its current state, and only invalidate our
cached understanding of the SP when the state changes. This way, if an SP
starts reporting new metrics due to a firmware update, or gets replaced
with a different chassis with a new serial number, revision, etc, we
won't continue to report metrics for stale targets, but we don't have to
reload all of that once per second. To detect scenarios where the SP's
state and/or identity has changed in the midst of polling its sensors
(which may result in mislabeled metrics), we check whether the SP's
state at the end of the poll matches its state at the beginning, and if
it doesn't, we poll again immediately with its new identity.
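
In rough pseudocode, each poll then looks something like this sketch (the
types and functions are hypothetical stand-ins, not the actual
gateway/src/metrics.rs code):

```rust
#[derive(Clone, PartialEq)]
struct SpState; // stand-in for the SP's state and identity (serial, revision, ...)
struct Inventory; // stand-in for the cached component inventory
struct Sample;

async fn read_state() -> SpState { SpState }
async fn read_inventory() -> Inventory { Inventory }
async fn read_sensors(_inv: &Inventory) -> Vec<Sample> { Vec::new() }

async fn poll_once(cache: &mut Option<(SpState, Inventory)>) -> Vec<Sample> {
    loop {
        let before = read_state().await;
        // Re-read the (comparatively expensive) inventory only when the
        // SP's state differs from what we have cached.
        if cache.as_ref().map(|(s, _)| s) != Some(&before) {
            *cache = Some((before.clone(), read_inventory().await));
        }
        let samples = read_sensors(&cache.as_ref().unwrap().1).await;
        // If the SP's state changed while we were reading sensors, the
        // samples may carry a stale identity: discard them and try again.
        if read_state().await == before {
            return samples;
        }
    }
}
```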

At present, the timestamps for these metric samples are generated by MGS
--- it's the time when MGS received the sensor data from the SP, as MGS
understands it. Because we don't currently collect data that was
recorded prior to the switch zone coming up, we don't need to worry
about figuring out timestamps for data recorded by the SP prior to the
existence of a wall clock. Figuring out the SP/MGS timebase
synchronization is probably a lot of additional work, although it would
be nice to do in the future. At present, metrics emitted by sled-agent
prior to NTP sync will also be from 1987, so I think it's fine to do
something similar here, especially because the potential solutions to
that also have their fair share of tradeoffs.

The new metrics use a schema in
oximeter/oximeter/schema/hardware-component.toml. The target of these
metrics is a hardware_component that includes:

  • the rack ID and the identity of the MGS instance that collected the
    metric,
  • information identifying the chassis¹ and the SP that recorded them
    (its serial number, model number, revision, and whether it's a
    switch, a sled, or a power shelf),
  • the SP's Hubris archive version (since the reported sensor data may
    change in future firmware releases)
  • the SP's ID for the hardware component (e.g. "dev-7"), the kind of
    device (e.g. "tmp117", "max5970"), and the human-readable description
    (e.g. "Southeast temperature sensor", "U.2 Sharkfin A hot swap
    controller", etc.) reported by the SP

Each kind of sensor reading has an individual metric
(hardware_component:temperature, hardware_component:current,
hardware_component:voltage, and so on). These metrics are labeled with
the SP-reported name of the individual sensor measurement channel. For
instance, a MAX5970 hotswap controller on sharkfin will have a voltage
and current metric named "V12_U2A_A0" for the 12V rail, and a voltage
and current metric named "V3P3_U2A_A0" for the 3.3V rail. Finally, a
hardware_component:sensor_errors metric records sensor errors reported
by the SP, labeled with the sensor name, what kind of sensor it is, and
a string representation of the error.

Footnotes

  1. I'm using "chassis" as a generic term to refer to "switch, sled,
    or power shelf".

hawkw added a commit that referenced this pull request Aug 19, 2024
While working on #6354, it turned out that having to manually assign
these was really easy to mess up when you want to simulate a lot of
sensors, and there wasn't an easy way to get "real" ID values from an
inventory of a real life SP, so the config writer has to assign them
sequentially anyway. #6354 ended up using `component_details`
rather than `read_sensor`, so being able to assign sensor IDs identical
to those from Hubris turned out to not be that important, since (IIUC)
the only way to get the `SensorId`s is to look at the Hubris archive,
while *component* IDs can be discovered dynamically from the inventory
APIs.
@hawkw
Member Author

hawkw commented Aug 19, 2024

Hmm, testing on london suggests that this code does not nicely handle cases where only some sleds are present:

23:56:25.063Z INFO 7522d643-1901-4ded-b5de-d68eb2cb2051 (sensor-poller): uh-oh! a known SP's poller task has gone AWOL. restarting it...
    chassis_type = Sled
    file = gateway/src/metrics.rs:461
    sp_slot = 1
23:56:25.063Z INFO 7522d643-1901-4ded-b5de-d68eb2cb2051 (sensor-poller): our SP seems to no longer be present; giving up.
    chassis_type = Sled
    file = gateway/src/metrics.rs:569
    sp_slot = 19
23:56:25.063Z INFO 7522d643-1901-4ded-b5de-d68eb2cb2051 (sensor-poller): our SP seems to no longer be present; giving up.
    chassis_type = Sled
    file = gateway/src/metrics.rs:569
    sp_slot = 5
23:56:25.063Z INFO 7522d643-1901-4ded-b5de-d68eb2cb2051 (sensor-poller): uh-oh! a known SP's poller task has gone AWOL. restarting it...
    chassis_type = Sled
    file = gateway/src/metrics.rs:461
    sp_slot = 19
23:56:25.063Z INFO 7522d643-1901-4ded-b5de-d68eb2cb2051 (sensor-poller): our SP seems to no longer be present; giving up.
    chassis_type = Sled
    file = gateway/src/metrics.rs:569
    sp_slot = 0
23:56:25.063Z INFO 7522d643-1901-4ded-b5de-d68eb2cb2051 (sensor-poller): our SP seems to no longer be present; giving up.
    chassis_type = Sled
    file = gateway/src/metrics.rs:569
    sp_slot = 28
23:56:25.063Z INFO 7522d643-1901-4ded-b5de-d68eb2cb2051 (sensor-poller): uh-oh! a known SP's poller task has gone AWOL. restarting it...
    chassis_type = Sled
    file = gateway/src/metrics.rs:461
    sp_slot = 5
23:56:25.063Z INFO 7522d643-1901-4ded-b5de-d68eb2cb2051 (sensor-poller): uh-oh! a known SP's poller task has gone AWOL. restarting it...
    chassis_type = Sled
    file = gateway/src/metrics.rs:461
    sp_slot = 28
23:56:25.063Z INFO 7522d643-1901-4ded-b5de-d68eb2cb2051 (sensor-poller): uh-oh! a known SP's poller task has gone AWOL. restarting it...
    chassis_type = Sled
    file = gateway/src/metrics.rs:461
    sp_slot = 0

It appears that the inventory contains non-present SPs for every slot and we are endlessly starting new pollers for them. Whoopsie.

@hawkw
Member Author

hawkw commented Aug 20, 2024

Also, it looks like we're getting 500 errors from Nexus whilst attempting to register the Oximeter producer. That seems weird...

23:48:03.194Z WARN 7522d643-1901-4ded-b5de-d68eb2cb2051 (producer-server): failed to register as a producer with Nexus, will retry
    delay = 56.00724361s
    error = "Communication Error: error sending request for url (http://[fd00:1122:3344:103::4]:12221/metrics/producers): operation timed out"
    file = oximeter/producer/src/lib.rs:424
23:48:59.295Z WARN 7522d643-1901-4ded-b5de-d68eb2cb2051 (producer-server): failed to register as a producer with Nexus, will retry
    delay = 131.265804997s
    error = "Error Response: status: 500 Internal Server Error; headers: {\\"content-type\\": \\"application/json\\", \\"x-request-id\\": \\"69d6cd88-4c4e-4d7b-abf6-def8fe0308c2\\", \\"content-length\\": \\"124\\", \\"date\\": \\"Mon, 19 Aug 2024 23:48:58 GMT\\"}; value: Error { error_code: Some(\\"Internal\\"), message: \\"Internal Server Error\\", request_id: \\"69d6cd88-4c4e-4d7b-abf6-def8fe0308c2\\" }"
    file = oximeter/producer/src/lib.rs:424
23:51:11.278Z WARN 7522d643-1901-4ded-b5de-d68eb2cb2051 (producer-server): failed to register as a producer with Nexus, will retry
    delay = 256.140367109s
    error = "Error Response: status: 500 Internal Server Error; headers: {\\"content-type\\": \\"application/json\\", \\"x-request-id\\": \\"906de26f-0ce2-402a-b130-745e44d35b99\\", \\"content-length\\": \\"124\\", \\"date\\": \\"Mon, 19 Aug 2024 23:51:10 GMT\\"}; value: Error { error_code: Some(\\"Internal\\"), message: \\"Internal Server Error\\", request_id: \\"906de26f-0ce2-402a-b130-745e44d35b99\\" }"
    file = oximeter/producer/src/lib.rs:424
BRM42220036 #

@hawkw
Member Author

hawkw commented Aug 20, 2024

Second thing appears to be because Nexus doesn't understand the "management_gateway" producer type. Seems like an oversight on my part:

23:51:11.277Z INFO 4024c343-cc04-4f48-a661-3c78e23b73d4 (dropshot_internal): request completed
    error_message_external = Internal Server Error
    error_message_internal = database error (kind = Unknown): error in argument for $4: invalid input value for enum producer_kind: "management_gateway"\n
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/52d900a/dropshot/src/server.rs:902
    latency_us = 670663
    local_addr = [fd00:1122:3344:103::4]:12221
    method = POST
    remote_addr = [fd00:1122:3344:101::2]:36083
    req_id = 906de26f-0ce2-402a-b130-745e44d35b99
    response_code = 500
    uri = /metrics/producers

@hawkw
Member Author

hawkw commented Aug 20, 2024

Hmm, it looks like the OpenAPI specs for both nexus-internal and oximeter do properly have a "management_gateway" producer kind variant:

{
  "description": "The producer is a management gateway service.",
  "type": "string",
  "enum": [
    "management_gateway"
  ]
}

{
  "description": "The producer is a management gateway service.",
  "type": "string",
  "enum": [
    "management_gateway"
  ]
}

So, I'm not really sure what's gone wrong there...

@hawkw hawkw force-pushed the eliza/sensor-metric branch 2 times, most recently from a38faf5 to fe35438 on August 20, 2024 17:28
@hawkw
Member Author

hawkw commented Aug 20, 2024

hey look, it works! (on london)

0x〉\l
collection_target:cpus_provisioned
collection_target:ram_provisioned
collection_target:virtual_disk_space_provisioned
component:current
component:fan_speed
component:sensor_error_count
component:temperature
component:voltage

# ...

0x〉get component:temperature | filter serial == "BRM31230004" | last 1

component:temperature

 chassis_type: switch
 component: dev-15
 device: adm1272
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V54_FAN3
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.391445694: [35.69047546386719]

 chassis_type: switch
 component: dev-24
 device: tmp117
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: Southwest
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.393100615: [29.9453125]

 chassis_type: switch
 component: dev-27
 device: tmp451
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: vsc7448
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.395108446: [45.875]

# long output snipped ...
All the output from that OxQL query:
0x〉get component:temperature | filter serial == "BRM31230004" | last 1

component:temperature

 chassis_type: switch
 component: dev-15
 device: adm1272
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V54_FAN3
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.391445694: [35.69047546386719]

 chassis_type: switch
 component: dev-24
 device: tmp117
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: Southwest
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.393100615: [29.9453125]

 chassis_type: switch
 component: dev-27
 device: tmp451
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: vsc7448
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.395108446: [45.875]

 chassis_type: switch
 component: dev-6
 device: tps546b24a
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V3P3_SYS
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.399078761: [43.25]

 chassis_type: switch
 component: dev-11
 device: tmp451
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: tf2
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.462607264: [56.125]

 chassis_type: switch
 component: dev-27
 device: tmp451
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: vsc7448
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.471691881: [45.75]

 chassis_type: switch
 component: dev-12
 device: raa229618
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V0P9_TF2_VDDT
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.390284328: [55]

 chassis_type: switch
 component: dev-2
 device: tmp117
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: NNE
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.543751523: [41.984375]

 chassis_type: switch
 component: dev-6
 device: tps546b24a
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V3P3_SYS
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.469581886: [43.5]

 chassis_type: switch
 component: dev-22
 device: tmp117
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: South
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.353400438: [31.5703125]

 chassis_type: switch
 component: dev-12
 device: raa229618
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V1P5_TF2_VDDA
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.390344773: [53]

 chassis_type: switch
 component: dev-3
 device: raa229618
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V0P8_TF2_VDD_CORE
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.469059794: [55]

 chassis_type: switch
 component: dev-8
 device: adm1272
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V54_HSC
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.389696096: [46.64285659790039]

 chassis_type: switch
 component: dev-25
 device: tps546b24a
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V1P0_MGMT
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.464285851: [42]

 chassis_type: switch
 component: dev-8
 device: adm1272
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V54_HSC
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.461097518: [46.880950927734375]

 chassis_type: switch
 component: dev-22
 device: tmp117
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: South
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.468429588: [31.5625]

 chassis_type: switch
 component: dev-12
 device: raa229618
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V1P5_TF2_VDDA
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.466104043: [53]

 chassis_type: switch
 component: dev-5
 device: adm1272
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V54_FAN0
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.397080083: [41.16666793823242]

 chassis_type: switch
 component: dev-7
 device: tmp117
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: Northeast
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.470065384: [37.6796875]

 chassis_type: switch
 component: dev-16
 device: tmp117
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: Northwest
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.394568385: [37.640625]

 chassis_type: switch
 component: dev-26
 device: tps546b24a
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V1P8_SYS
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.471057126: [40.5]

 chassis_type: switch
 component: dev-13
 device: bmr491
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V12P0_SYS
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.467794101: [65.75]

 chassis_type: switch
 component: dev-16
 device: tmp117
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: Northwest
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.470431529: [37.640625]

 chassis_type: switch
 component: dev-23
 device: tmp117
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: Southeast
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.394042330: [32.2578125]

 chassis_type: switch
 component: dev-19
 device: isl68224
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V1P8_TF2_VDD
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.541982133: [45]

 chassis_type: switch
 component: dev-0
 device: adm1272
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V54_FAN1
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.460359487: [44.02381134033203]

 chassis_type: switch
 component: dev-25
 device: tps546b24a
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V1P0_MGMT
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.399577534: [42]

 chassis_type: switch
 component: dev-19
 device: isl68224
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V1P8_TF2_VDD
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.392028385: [45]

 chassis_type: switch
 component: dev-7
 device: tmp117
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: Northeast
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.354287184: [37.6953125]

 chassis_type: switch
 component: dev-15
 device: adm1272
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V54_FAN3
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.544164256: [35.4523811340332]

 chassis_type: switch
 component: dev-26
 device: tps546b24a
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V1P8_SYS
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.353821634: [40.75]

 chassis_type: switch
 component: dev-3
 device: raa229618
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V0P8_TF2_VDD_CORE
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.392649472: [54]

 chassis_type: switch
 component: dev-9
 device: tps546b24a
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V5P0_SYS
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.393606386: [45.75]

 chassis_type: switch
 component: dev-24
 device: tmp117
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: Southwest
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.461559575: [29.96875]

 chassis_type: switch
 component: dev-2
 device: tmp117
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: NNE
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.355226700: [41.953125]

 chassis_type: switch
 component: dev-23
 device: tmp117
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: Southeast
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.544611165: [32.2578125]

 chassis_type: switch
 component: dev-10
 device: tmp117
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: NNW
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.398668606: [43.921875]

 chassis_type: switch
 component: dev-14
 device: adm1272
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V54_FAN2
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.466801687: [43.5476188659668]

 chassis_type: switch
 component: dev-9
 device: tps546b24a
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V5P0_SYS
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.463152179: [45.5]

 chassis_type: switch
 component: dev-10
 device: tmp117
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: NNW
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.467341853: [43.9296875]

 chassis_type: switch
 component: dev-19
 device: isl68224
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V1P8_TF2_VDDA
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.542077294: [43]

 chassis_type: switch
 component: dev-12
 device: raa229618
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V0P9_TF2_VDDT
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.466043047: [54]

 chassis_type: switch
 component: dev-14
 device: adm1272
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V54_FAN2
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.354749037: [43.78571319580078]

 chassis_type: switch
 component: dev-5
 device: adm1272
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V54_FAN0
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:12.541270292: [41.16666793823242]

 chassis_type: switch
 component: dev-0
 device: adm1272
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V54_FAN1
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.395622668: [43.78571319580078]

 chassis_type: switch
 component: dev-13
 device: bmr491
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V12P0_SYS
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.397629431: [66]

 chassis_type: switch
 component: dev-19
 device: isl68224
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: V1P8_TF2_VDDA
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.392084551: [43]

 chassis_type: switch
 component: dev-11
 device: tmp451
 gateway_id: 1ed5fbb8-c1c8-40f9-a68e-4af633dbfb76
 hubris_archive_id: 23683d5ff14cc576
 model: 913-0000006
 name: tf2
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 10
 serial: BRM31230004
 slot: 0
   2024-08-20 19:55:18.388714881: [56.25]
0x〉
0x〉get component:sensor_error_count | last 1

component:sensor_error_count

 chassis_type: sled
 component: dev-48
 device: raa229618
 error: no_reading
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: f64c4a1710ed59b2
 model: 913-0000019
 name: VDD_VCORE
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 6
 sensor_kind: power
 serial: BRM42220030
 slot: 16
   [1986-12-28 00:03:51.764049710, 2024-08-20 19:58:03.497049640]: [1672]

 chassis_type: sled
 component: dev-48
 device: raa229618
 error: no_reading
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: f64c4a1710ed59b2
 model: 913-0000019
 name: VDD_MEM_ABCD
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 6
 sensor_kind: power
 serial: BRM42220030
 slot: 16
   [1986-12-28 00:03:51.764108682, 2024-08-20 19:58:03.497096159]: [1672]

 chassis_type: sled
 component: dev-49
 device: raa229618
 error: no_reading
 gateway_id: 8de7c4ab-ee46-4d3f-af5c-3cc053f4c142
 hubris_archive_id: 7146a896fc8c4525
 model: 913-0000019
 name: VDD_MEM_EFGH
 rack_id: d37c9a19-5bc1-4c5d-8c45-2f3633ad661c
 revision: 6
 sensor_kind: power
 serial: BRM42220062
 slot: 15
   [1986-12-28 00:03:51.702766313, 2024-08-20 19:58:03.472449264]: [1672]

# ... long output snipped ...

@bnaecker
Collaborator

I meant to mention this after your wonderful demo on Friday, so thanks for the demo and apologies for the delay! I think it'd help to have a more descriptive target name than component. I understand why it's called that, given the SP code we're reading from. I also think it'll make things much easier to interpret and understand with a more descriptive name. That will be less important once we have descriptions plumbed up through the API, but until then the name is all we have. I'm not sure the best alternative, but here are a few options:

  • hardware_component
  • chassis_component
  • board_component (not great, not all these are part of a single board)
  • hardware_device
  • hardware_sensor ("sensor" might be too specific, if you feel we'll add more non-sensor metrics under the same target)
  • hardware_resource or maybe chassis_resource

Happy to bikeshed a name with you if that's helpful, and I'm also sure others have better ideas than me, especially @rmustacc. The naming scheme in fmtopo output might also be a useful guide, as might the FMA programmer's manual.

@hawkw
Member Author

hawkw commented Aug 20, 2024

@bnaecker I think you're right that a more descriptive name is in order --- I had originally called it sled_component, as I had started off with totally separate schemas for sled, switch, and power shelf sensor readings, and then realized we actually didn't need to do that.

Of your recommendations, I'm inclined to go with hardware_component. I'd prefer to not use the name "sensor" in the target, not only because I do think the same target could also be used for things that are not strictly speaking sensor measurements in the future, but also because at present, an individual component may have multiple sensors. For instance, a MAX5970 component reports readings from four separate sensors, as it has both a 12V rail and a 3.3V rail, both of which have current and voltage sensors.

Incidentally, you'll note that in the current schema, the actual sensor name is a field on the metrics, rather than the component, so that components that provide multiple sensor readings share a target. Now that I think of it, perhaps the name field on those metrics ought to be renamed to sensor (or sensor_name?) --- what do you think?

@hawkw
Member Author

hawkw commented Aug 20, 2024

I might also rename the serial and model fields on hardware_component to chassis_serial and chassis_model, to make it clear that they refer to the serial and model number of the sled/switch/power shelf, rather than the individual hardware device.

@hawkw hawkw marked this pull request as ready for review August 20, 2024 21:43
@hawkw hawkw changed the title [WIP] ingest sensor measurements from SPs into oximeter [gateway] ingest sensor measurements from SPs into oximeter Aug 20, 2024
@bnaecker
Collaborator

Of your recommendations, I'm inclined to go with hardware_component. I'd prefer to not use the name "sensor" in the target, not only because I do think the same target could also be used for things that are not strictly speaking sensor measurements in the future, but also because at present, an individual component may have multiple sensors. For instance, a MAX5970 component reports readings from four separate sensors, as it has both a 12V rail and a 3.3V rail, both of which have current and voltage sensors.

I think that makes sense! It gives us flexibility to add non-sensor metrics, and is also more accurate right now.

Incidentally, you'll note that in the current schema, the actual sensor name is a field on the metrics, rather than the component, so that components that provide multiple sensor readings share a target. Now that I think of it, perhaps the name field on those metrics ought to be renamed to sensor (or sensor_name?) --- what do you think?

Yeah, I agree with you, and would vote for sensor_name. Just name is a bit too vague, and could apply to a few different parts of the data.

@hawkw
Member Author

hawkw commented Aug 20, 2024

Yeah, I agree with you, and would vote for sensor_name. Just name is a bit too vague, and could apply to a few different parts of the data.

At present, it's just sensor: we have a component_kind/component and a sensor_kind/sensor in the schema. I'm happy to change both, or either, of them to _name if you prefer.

@bnaecker
Collaborator

My inclination would be to add _name to both. E.g. component_name and sensor_name. Possibly device_name too, but I'm still unclear about that.

I asked in chat during your demo, but didn't get a good answer. What does the component field tell us that the other fields don't?

@hawkw
Member Author

hawkw commented Aug 20, 2024

I asked in chat during your demo, but didn't get a good answer. What does the component field tell us that the other fields don't?

The component is a unique identifier of the individual component on the board. For instance, if we look at the components discovered from a Real Life Gimlet on madrid, we see this:

eliza@jeeves ~ $ faux-mgs --interface=madrid_sw0tp0 --discovery-addr='[fe80::aa40:25ff:fe04:147%15]:11111' inventory
Aug 20 22:02:51.993 INFO creating SP handle on interface madrid_sw0tp0, component: faux-mgs
Aug 20 22:02:52.025 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:147%15]:11111, interface: madrid_sw0tp0, component: faux-mgs
COMPONENT        STATUS       DEVICE           DESCRIPTION (CAPABILITIES)
sp               Present      sp               Service Processor (DeviceCapabilities(1))
sp3-host-cpu     Present      sp3-host-cpu     Gimlet SP3 host cpu (DeviceCapabilities(4))
host-boot-flash  Present      host-boot-flash  Gimlet host boot flash (DeviceCapabilities(1))
system-led       Present      system-led       System attention LED (DeviceCapabilities(8))
dev-0            Present      tmp117           Southwest temperature sensor (DeviceCapabilities(2))
dev-1            Present      tmp117           South temperature sensor (DeviceCapabilities(2))
dev-2            Present      tmp117           Southeast temperature sensor (DeviceCapabilities(2))
dev-3            Present      pca9545          U.2 ABCD mux (DeviceCapabilities(0))
dev-4            Present      pca9545          U.2 EFGH mux (DeviceCapabilities(0))
dev-5            Present      pca9545          U.2 IJ/FRUID mux (DeviceCapabilities(0))
dev-6            Present      at24csw080       U.2 Sharkfin A VPD (DeviceCapabilities(0))
dev-7            Present      max5970          U.2 Sharkfin A hot swap controller (DeviceCapabilities(2))
dev-8            Present      nvme_bmc         U.2 A NVMe Basic Management Command (DeviceCapabilities(2))
dev-9            Present      at24csw080       U.2 Sharkfin B VPD (DeviceCapabilities(0))
dev-10           Present      max5970          U.2 Sharkfin B hot swap controller (DeviceCapabilities(2))
dev-11           Present      nvme_bmc         U.2 B NVMe Basic Management Control (DeviceCapabilities(2))
dev-12           Present      at24csw080       U.2 Sharkfin C VPD (DeviceCapabilities(0))
dev-13           Present      max5970          U.2 Sharkfin C hot swap controller (DeviceCapabilities(2))
dev-14           Present      nvme_bmc         U.2 C NVMe Basic Management Control (DeviceCapabilities(2))
dev-15           Present      at24csw080       U.2 Sharkfin D VPD (DeviceCapabilities(0))
dev-16           Present      max5970          U.2 Sharkfin D hot swap controller (DeviceCapabilities(2))
dev-17           Present      nvme_bmc         U.2 D NVMe Basic Management Control (DeviceCapabilities(2))
dev-18           Present      at24csw080       U.2 Sharkfin E VPD (DeviceCapabilities(0))
dev-19           Present      max5970          U.2 Sharkfin E hot swap controller (DeviceCapabilities(2))
dev-20           Present      nvme_bmc         U.2 E NVMe Basic Management Control (DeviceCapabilities(2))
dev-21           Present      at24csw080       U.2 Sharkfin F VPD (DeviceCapabilities(0))
dev-22           Present      max5970          U.2 Sharkfin F hot swap controller (DeviceCapabilities(2))
dev-23           Present      nvme_bmc         U.2 F NVMe Basic Management Control (DeviceCapabilities(2))
dev-24           Present      at24csw080       U.2 Sharkfin G VPD (DeviceCapabilities(0))
dev-25           Present      max5970          U.2 Sharkfin G hot swap controller (DeviceCapabilities(2))
dev-26           Present      nvme_bmc         U.2 G NVMe Basic Management Control (DeviceCapabilities(2))
dev-27           Present      at24csw080       U.2 Sharkfin H VPD (DeviceCapabilities(0))
dev-28           Present      max5970          U.2 Sharkfin H hot swap controller (DeviceCapabilities(2))
dev-29           Present      nvme_bmc         U.2 H NVMe Basic Management Control (DeviceCapabilities(2))
dev-30           Present      at24csw080       U.2 Sharkfin I VPD (DeviceCapabilities(0))
dev-31           Present      max5970          U.2 Sharkfin I hot swap controller (DeviceCapabilities(2))
dev-32           Present      nvme_bmc         U.2 I NVMe Basic Management Control (DeviceCapabilities(2))
dev-33           Present      at24csw080       U.2 Sharkfin J VPD (DeviceCapabilities(0))
dev-34           Present      max5970          U.2 Sharkfin J hot swap controller (DeviceCapabilities(2))
dev-35           Present      nvme_bmc         U.2 J NVMe Basic Management Control (DeviceCapabilities(2))
dev-36           Present      at24csw080       Gimlet VPD (DeviceCapabilities(0))
dev-37           Present      pca9545          M.2 mux (DeviceCapabilities(0))
dev-38           Present      at24csw080       Fan VPD (DeviceCapabilities(0))
dev-39           Present      tmp451           T6 temperature sensor (DeviceCapabilities(2))
dev-40           Present      tps546b24a       A2 3.3V rail (DeviceCapabilities(2))
dev-41           Present      tps546b24a       A0 3.3V rail (DeviceCapabilities(2))
dev-42           Present      tps546b24a       A2 5V rail (DeviceCapabilities(2))
dev-43           Present      tps546b24a       A2 1.8V rail (DeviceCapabilities(2))
dev-44           Present      max5970          M.2 hot plug controller (DeviceCapabilities(2))
dev-45           Present      sbrmi            CPU via SB-RMI (DeviceCapabilities(0))
dev-46           Present      sbtsi            CPU temperature sensor (DeviceCapabilities(2))
dev-47           Present      idt8a34003       Clock generator (DeviceCapabilities(0))
dev-48           Present      raa229618        CPU power controller (DeviceCapabilities(2))
dev-49           Present      raa229618        SoC power controller (DeviceCapabilities(2))
dev-50           Present      isl68224         DIMM/SP3 1.8V A0 power controller (DeviceCapabilities(2))
dev-51           Present      adm1272          Fan hot swap controller (DeviceCapabilities(2))
dev-52           Present      adm1272          Sled hot swap controller (DeviceCapabilities(2))
dev-53           Present      max31790         Fan controller (DeviceCapabilities(2))
dev-54           Present      tps546b24a       T6 power controller (DeviceCapabilities(2))
dev-55           Present      tmp117           Northeast temperature sensor (DeviceCapabilities(2))
dev-56           Present      tmp117           North temperature sensor (DeviceCapabilities(2))
dev-57           Present      tmp117           Northwest temperature sensor (DeviceCapabilities(2))
dev-58           Present      bmr491           Intermediate bus converter (DeviceCapabilities(2))
dev-59           Present      tse2004av        DIMM A0 (DeviceCapabilities(2))
dev-60           Present      tse2004av        DIMM A1 (DeviceCapabilities(2))
dev-61           Present      tse2004av        DIMM B0 (DeviceCapabilities(2))
dev-62           Present      tse2004av        DIMM B1 (DeviceCapabilities(2))
dev-63           Present      tse2004av        DIMM C0 (DeviceCapabilities(2))
dev-64           Present      tse2004av        DIMM C1 (DeviceCapabilities(2))
dev-65           Present      tse2004av        DIMM D0 (DeviceCapabilities(2))
dev-66           Present      tse2004av        DIMM D1 (DeviceCapabilities(2))
dev-67           Present      tse2004av        DIMM E0 (DeviceCapabilities(2))
dev-68           Present      tse2004av        DIMM E1 (DeviceCapabilities(2))
dev-69           Present      tse2004av        DIMM F0 (DeviceCapabilities(2))
dev-70           Present      tse2004av        DIMM F1 (DeviceCapabilities(2))
dev-71           Present      tse2004av        DIMM G0 (DeviceCapabilities(2))
dev-72           Present      tse2004av        DIMM G1 (DeviceCapabilities(2))
dev-73           Present      tse2004av        DIMM H0 (DeviceCapabilities(2))
dev-74           Present      tse2004av        DIMM H1 (DeviceCapabilities(2))
dev-75           Unavailable  m2_hp_only       M.2 A NVMe Basic Management Command (DeviceCapabilities(2))
dev-76           Unavailable  m2_hp_only       M.2 B NVMe Basic Management Command (DeviceCapabilities(2))
eliza@jeeves ~ $

Note that there are a bunch of entries with the same device (e.g. tmp117 temperature sensors, or max5970 hotswap controllers), but each one has an individual component identifier. Without including the component in the target, sensor measurements from (for example) all ten of a Gimlet's sharkfins would have the same target.

@bnaecker
Collaborator

Thanks @hawkw, that's some helpful context! I'm looking at the output you shared. Under the column labeled "Description," I see labels, which do uniquely identify all the entries. Are those something we have access to, and could provide in a field for these timeseries? Or is that awkward or impossible to get at the point we generate the data?

This isn't a huge issue. If we need a unique-ifying key, and that's what we have, then let's use it. But it does appear to me that the dev-N naming is a bit of an implementation detail in Hubris, and it's not clear what someone could usefully do with it. My understanding is they'd need to look up something like the TOML that the Hubris application was built with to see what that actually refers to, is that right? Or is there another way one could connect this with, say, an inventory collection that Nexus collects?

Thanks for entertaining my questions :)

@hawkw
Member Author

hawkw commented Aug 21, 2024

@bnaecker These are great questions, thanks!

Thanks @hawkw, that's some helpful context! I'm looking at the output you shared. Under the column labeled "Description," I see labels, which do uniquely identify all the entries. Are those something we have access to, and could provide in a field for these timeseries? Or is that awkward or impossible to get at the point we generate the data?

This isn't a huge issue. If we need a unique-ifying key, and that's what we have, then let's use it. But it does appear to me that the dev-N naming is a bit of an implementation detail in Hubris, and it's not clear what someone could usefully do with it. My understanding is they'd need to look up something like the TOML that the Hubris application was built with to see what that actually refers to, is that right? Or is there another way one could connect this with, say, an inventory collection that Nexus collects?

We do also have access to the "description" field in MGS, and we could also add it to the timeseries labels. I didn't originally, because I figured descriptions could be hydrated from the component ID later using the inventory collected by Nexus. And, since the component ID is how the SP identifies the component in its RPC interface, it's guaranteed to be unique, while the description fields just happen to be unique.

I agree with your point that the component IDs are less user-friendly, since they don't actually describe what the thing is. But, they are also included in Nexus' inventory collection. And, I would argue they're not a Hubris implementation detail, in the sense that they're not totally internal to Hubris; they are an implementation detail of the Hubris <--> MGS communication protocol. For instance, if you wanted to use the faux-mgs CLI (or the pilot command that wraps it) to ask an SP for component details, you would use these IDs. Similarly, they're used in the MGS HTTP API when requesting information about a component.

I'd like to keep the component IDs in the schema, since they are guaranteed to be unique relative to the SP's other components (while the description string is not), and because they might be cross-referenced against inventory collection or the MGS HTTP API. But, with that said, I think it would be fine to include the description field in the metric target as well! Including it is probably a good idea. I didn't because I thought it was extra data to store in Clickhouse, and it could theoretically be hydrated from the inventory collection later...but we don't currently have a mechanism for doing that, so maybe we're better off throwing it in? I guess, since Clickhouse is a TSDB, target fields don't comprise most of the data it stores anyway and maybe it doesn't matter that much?

@bnaecker
Collaborator

bnaecker commented Aug 21, 2024

I'd like to keep the component IDs in the schema, since they are guaranteed to be unique relative to the SP's other components (while the description string is not), and because they might be cross-referenced against inventory collection or the MGS HTTP API.

This is a good point! If the descriptions are not guaranteed to be unique, then we definitely should not use them to identify the timeseries. And the fact that we already collect them in the inventory is really good to know! That means someone can always correlate it with the collection to get more information about the component itself. I agree that it's not a Hubris implementation detail anymore, we're pulling it up into the control plane.

But, with that said, I think it would be fine to include the description field in the metric target as well! Including it is probably a good idea. I didn't because I thought it was extra data to store in Clickhouse, and it could theoretically be hydrated from the inventory collection later...but we don't currently have a mechanism for doing that, so maybe we're better off throwing it in? I guess, since Clickhouse is a TSDB, target fields don't comprise most of the data it stores anyway and maybe it doesn't matter that much?

I think this is a good idea too. Correlating the data today would have to be done by hand, using omdb or some other Oxide-internal mechanism. And without them, it's not really possible to "just see" in the OxQL output what the component is actually measuring without referring to either that inventory collection, looking at Hubris, or some other onerous process.

And yeah, these fields will all be deduplicated over time. ClickHouse will periodically merge parts of the table to sort and compress them, and for these tables, it takes the most recent row with the same sorting key. So we will ultimately end up with one row per timeseries, per field, a minor cost!

@hawkw
Member Author

hawkw commented Aug 21, 2024

@bnaecker regarding naming, do you have a preference between the following schemes:

  1. component_kind = "tmp117"
    component = "dev-0"
    component_description = "Southwest temperature sensor"
  2. component_kind = "tmp117"
    component_id = "dev-0"
    component_description = "Southwest temperature sensor"
  3. component_kind = "tmp117"
    component_id = "dev-0"
    component = "Southwest temperature sensor"

@hawkw
Member Author

hawkw commented Aug 21, 2024

I went with component_kind, component_id, description in bd86ce7, but I'm open to alternative suggestions.

@bnaecker
Collaborator

bnaecker commented Aug 21, 2024

I went with component_kind, component_id, description in bd86ce7, but I'm open to alternative suggestions.

I've got a mild preference for (2), but this also seems fine to me since it's unlikely to matter if one mentally applies the description to something other than the component itself! I'm also an outlier on the willingness-to-type-long-names axis :)

I'll take a more detailed look at the code soon, though I trust your code is all great. The fact that we see the data we want in OxQL is excellent!

@jgallagher
Contributor

hmm, well, this is what's happening during that failing test

oh no

Ok, so, this test is super finicky, and I've gone back and reworked it a handful of times to make it not flaky. But the implementation is still super fragile, and it makes assumptions about who's talking to the simulated SP at various points during the test, so I suspect that if this branch causes some "extra" (from that test's perspective) traffic to the sim SP, it's throwing off the bits that are trying to step through progress values.

Given that the test, by lines-of-code volume, is probably like 30% test, 70% "make the test not flaky", maybe we should just delete it, or maybe replace it with a unit test? Alternatively, if it were easy to tell mgs_setup::test_setup() to not initialize the metrics subsystem at all, I suspect that would fix the failures, since it would go back to the somewhat-carefully-tuned non-flakiness it has on main?

@hawkw
Member Author

hawkw commented Aug 23, 2024

Given the test by lines of code volume is probably like 30% test, 70% "make the test not flaky", maybe we should just delete the test, or maybe replace it with a unit test? Alternatively, if it were easy to tell mgs_setup::test_setup() to not initialize the metrics subsystem at all, I suspect that would fix the failures, since it would go back to the somewhat-carefully-tuned-non-flakiness it has on main?

So, I'm happy to add the option to disable the metrics subsystem, and set that option for this test. I'm a bit concerned that if the metrics stuff does break update behavior in real life, we wouldn't catch it there. But, for now, I'll go ahead and do that.

@jgallagher
Contributor

So, I'm happy to add the option to disable the metrics subsystem, and set that option for this test. I'm a bit concerned that if the metrics stuff does break update behavior in real life, we wouldn't catch it there. But, for now, I'll go ahead and do that.

If this turns out to be more painful than it sounds, I'd lean toward removing the test (and filing an issue to add some unit tests to cover what it was checking; SpUpdater is unlikely to change in any major way any time soon, so I don't think it'd be very urgent). At some point other things are going to cause the test to start flaking again, and it just won't be worth working around that fragility indefinitely.

@hawkw
Member Author

hawkw commented Aug 23, 2024

So, I'm happy to add the option to disable the metrics subsystem, and set that option for this test. I'm a bit concerned that if the metrics stuff does break update behavior in real life, we wouldn't catch it there. But, for now, I'll go ahead and do that.

If this turns out to be more painful than it sounds, I'd lean toward removing the test (and filing an issue to add some unit tests to cover what it was checking; SpUpdater is unlikely to change in any major way any time soon, so I don't think it'd be very urgent). At some point other things are going to cause the test to start flaking again, and it just won't be worth working around that fragility indefinitely.

Nah, disabling the metrics stuff from the config file is pretty easy, so I'm going to do that for now. My concern is more that, if there's a possibility that having the metrics stuff running can break SP updates (e.g. because the SP is kept too busy to handle the update requests), I'd really like to be able to catch that...

@jgallagher
Contributor

If this turns out to be more painful than it sounds, I'd lean toward removing the test (and filing an issue to add some unit tests to cover what it was checking; SpUpdater is unlikely to change in any major way any time soon, so I don't think it'd be very urgent). At some point other things are going to cause the test to start flaking again, and it just won't be worth working around that fragility indefinitely.

Nah, disabling the metrics stuff from the config file is pretty easy, so I'm going to do that for now. My concern is more that, if there's a possibility that having the metrics stuff running can break SP updates (e.g. because the SP is kept too busy to handle the update requests), I'd really like to be able to catch that...

For the specific example there (keeping the SP too busy), I don't think this test is valid anyway, because it's hitting the simulated SP which has a lot more oomph than a real one.

I realized the test could read the actual SP simulator config file,
and figure out how many metrics should be present. That way, if someone
changes the config file later, it'll still work.
@hawkw
Member Author

hawkw commented Aug 23, 2024

For the specific example there (keeping the SP too busy), I don't think this test is valid anyway, because it's hitting the simulated SP which has a lot more oomph than a real one.

Yeah, I was kinda worried when I saw it was failing, but that quickly went from "worried" to "surprised" because the SP sim shouldn't get overloaded...unless the actual UDP packets were getting dropped? I would still like to know what's going on there. But, for now, I just made it turn off metrics collection.

@jgallagher
Contributor

Yeah, I was kinda worried when I saw it was failing, but that quickly went from "worried" to "surprised" because the SP sim shouldn't get overloaded...unless the actual UDP packets were getting dropped? I would still like to know what's going on there. But, for now, I just made it turn off metrics collection.

When I said the test was fragile, I really meant it... the test hooks into the simulated SP and limits the number of packets it can receive so that it can inspect intermediate state, and it does that based on the number of packets the test expects the SP to receive at various points. If there's extra traffic the test doesn't know about, the simulated SP will certainly give the appearance of dropped packets (because it will ignore them!).

@hawkw
Member Author

hawkw commented Aug 23, 2024

When I said the test was fragile, I really meant it... the test hooks into the simulated SP and limits the number of packets it can receive so that it can inspect intermediate state, and it does that based on the number of packets the test expects the SP to receive at various points. If there's extra traffic the test doesn't know about, the simulated SP will certainly give the appearance of dropped packets (because it will ignore them!).

Ah, okay, yeah, that explains why I would have broken it (and is kinda ghastly!). I feel okay about just turning off the metrics stuff during that test.

there's really no reason to have a separate `dev` struct inside another
layer of Option...
@hawkw
Member Author

hawkw commented Aug 24, 2024

Poking around a bit on madrid, I did encounter at least one sample claiming to have been recorded in 1984 (or, really, before NTP sync). Did we ever decide whether it was important to handle that before this gets in? Is it worthwhile to try and rig up some way to wait for NTP to sync before collecting samples, or something?

@bnaecker
Collaborator

Is it worthwhile to try and rig up some way to wait for NTP to sync before collecting samples, or something?

I'd only deal with this if it's very easy. We already produce data from 1986 from the sled agents, so this won't be the only thing. Also, many of these are gauges, and we're almost always interested in the very recent past, in which case we'll likely only see timestamps from after sync.

OTOH, if you want to put a tiny check somewhere that's, "If not synced, or this sample has a time < 2000, don't produce anything," then far be it from me to stand in your way :)
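
A minimal sketch of that cutoff check, for concreteness (hedged: `plausibly_synced` is a hypothetical helper using `chrono`, not code from this PR):

```rust
use chrono::{DateTime, TimeZone, Utc};

/// Heuristic from the suggestion above: treat any sample stamped before
/// the year 2000 as predating NTP sync, and drop it. (Hypothetical
/// helper for illustration only.)
fn plausibly_synced(ts: DateTime<Utc>) -> bool {
    ts >= Utc.with_ymd_and_hms(2000, 1, 1, 0, 0, 0).unwrap()
}

fn main() {
    let pre_sync = Utc.with_ymd_and_hms(1986, 3, 14, 0, 0, 0).unwrap();
    assert!(!plausibly_synced(pre_sync)); // would be filtered out
    assert!(plausibly_synced(Utc::now())); // kept
}
```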

@hawkw
Member Author

hawkw commented Aug 24, 2024

I'd only deal with this if it's very easy. We already produce data from 1986 from the sled agents, so this won't be the only thing.

I think if we're not being fastidious about avoiding samples from The Land Before Timesync in other components, my preference would be to not worry about it, because the potential solutions I can think of involve collecting fewer data points. The easiest solution, IMO, is to make the metrics code wait until it's not 1986 anymore before it starts polling SPs, which seems like a bit of a bummer, because then we're just not collecting data at all. Alternatively, we could buffer all the data points along with a monotonic timestamp until we see a date that's in this millennium, and then go back and update their timestamps like `timestamp_utc = now_utc - (now_monotonic - timestamp_monotonic)`. But, because we don't know exactly how long it takes for NTP to sync, we're either buffering an unbounded amount of data (which I went out of my way to avoid), or letting the earliest samples drop off the end of the ring buffer, which ends up being morally equivalent to "just don't collect samples until it's not 1986".
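
For concreteness, a minimal sketch of that rebasing arithmetic (hedged: `BufferedSample` and `rebase` are hypothetical names using only `std::time`, not code from this PR):

```rust
use std::time::{Duration, Instant, SystemTime};

/// A sample stamped with a monotonic timestamp because the wall clock
/// wasn't trustworthy when it was recorded. (Hypothetical type, for
/// illustration only.)
struct BufferedSample {
    recorded_at: Instant,
    value: f64,
}

/// Once NTP has synced, rebase a buffered sample onto wall-clock time:
/// timestamp_utc = now_utc - (now_monotonic - timestamp_monotonic).
fn rebase(sample: &BufferedSample) -> SystemTime {
    SystemTime::now() - sample.recorded_at.elapsed()
}

fn main() {
    let sample = BufferedSample { recorded_at: Instant::now(), value: 42.0 };
    std::thread::sleep(Duration::from_millis(10));
    println!("value {} rebased to {:?}", sample.value, rebase(&sample));
}
```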

So, therefore, I think my preference for now is to accept that samples from prior to NTP sync will be listening to "Rock Me Amadeus" and watching Top Gun, if the alternative is either buffering them forever or throwing them away. We can always go back and implement one of those strategies later, if data from 1986 turns out to be too egregious, but we should probably also do something similar for sled-agent's metrics...

@hawkw hawkw merged commit 5afa0de into main Aug 24, 2024
24 checks passed
@hawkw hawkw deleted the eliza/sensor-metric branch August 24, 2024 18:16
hawkw added a commit that referenced this pull request Aug 24, 2024
Now that #6354 has added an Oximeter producer endpoint to MGS for
publishing SP sensor metrics, it seemed like a nice idea to also
instrument the MGS HTTP server, similar to the existing instrumentation
for other control plane services. I don't think we'll be doing a lot of
tuning of MGS performance, but the metrics seem like they could still be
useful because they also include the distribution of HTTP status codes,
and in many cases, the latency measurements serve as a proxy for
how long it takes the *SP* to perform a certain operation, which could
be a valuable signal.

This commit adds an `oximeter_instruments::http::LatencyTracker` to the
MGS HTTP servers.
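
For intuition, here's a generic sketch of the technique (time each request, label the measurement with the operation and status code); hedged: this is not the `oximeter_instruments` API, and `RequestTimer` is a hypothetical name:

```rust
use std::time::Instant;

/// Generic illustration of request-latency tracking, NOT the
/// `oximeter_instruments` API: time each handler and record the result
/// keyed by (operation_id, status_code). A real tracker folds these
/// observations into per-key histograms like the ones shown below.
struct RequestTimer {
    observations: Vec<(&'static str, u16, f64)>,
}

impl RequestTimer {
    fn new() -> Self {
        Self { observations: Vec::new() }
    }

    /// Run a handler, recording its latency and the status it returned.
    fn observe<F: FnOnce() -> u16>(&mut self, operation_id: &'static str, handler: F) -> u16 {
        let start = Instant::now();
        let status = handler();
        self.observations
            .push((operation_id, status, start.elapsed().as_secs_f64()));
        status
    }
}

fn main() {
    let mut timer = RequestTimer::new();
    timer.observe("sp_get", || 200);
    timer.observe("sp_get", || 400);
    for (op, status, secs) in &timer.observations {
        println!("{op} -> {status} in {secs:.6}s");
    }
}
```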

To test that it works, I started a local Clickhouse and a standalone
Oximeter, and ran MGS and the SP simulator using `cargo xtask mgs-dev
run`. Then, I made a few HTTP requests to various MGS APIs using `curl`,
most of which were expected to succeed, and a few for SP slots where the
simulator wasn't configured to simulate an SP (to ensure that the
request would fail). We can see the new metrics in OxQL:

```
0x〉\d
hardware_component:current
hardware_component:fan_speed
hardware_component:sensor_error_count
hardware_component:temperature
hardware_component:voltage
http_service:request_latency_histogram
oximeter_collector:collections
0x〉get http_service:request_latency_histogram | last 1

http_service:request_latency_histogram

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: sp_get
 status_code: 200
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125731231]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 1, 0.0006000000000000001: 1, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000556233, max: 0.000603704, mean: 0.0005799685000000001, std_dev: 0.00002373549999999997, p50: 0, p90: 0.000603704, p99: 0.000603704]

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: ignition_list
 status_code: 200
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125290346]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 1, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000427249, max: 0.000427249, mean: 0.000427249, std_dev: 0, p50: 0, p90: 0.000427249, p99: 0.000427249]

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: sp_get
 status_code: 400
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.126114126]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 2, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000020368, max: 0.000021581, mean: 0.0000209745, std_dev: 0.0000006064999999999992, p50: 0, p90: 0.000021581, p99: 0.000021581]
0x〉exit
```
hawkw added a commit that referenced this pull request Aug 26, 2024
nickryand pushed a commit to nickryand/omicron that referenced this pull request Aug 27, 2024