
[RFC] Pulse #57108

Merged
merged 14 commits into from
Mar 11, 2020

Conversation

afharo
Member

@afharo afharo commented Feb 7, 2020

Summary

RFC for the new Pulse service.
Rendered RFC

Edit history:

  • 2020.02.07 Initial commit

[skip ci]

@elasticmachine
Contributor

Pinging @elastic/pulse (Team:Pulse)

@afharo afharo added Feature:Telemetry release_note:skip Skip the PR/issue when compiling release notes v8.0.0 labels Feb 7, 2020
@afharo afharo changed the title [RFC] Pulse [RFC][skip-ci] Pulse Feb 7, 2020
@kibanamachine
Contributor

💚 Build Succeeded

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@afharo
Member Author

afharo commented Feb 10, 2020

@elasticmachine merge upstream

@afharo afharo added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:Security Team focused on: Auth, Users, Roles, Spaces, Audit Logging, and more! labels Feb 12, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-platform (Team:Platform)

@elasticmachine
Contributor

Pinging @elastic/kibana-security (Team:Security)

It covers the following scenarios:

- Track the behaviour of our users in the UI, reporting UI events throughout our platform.
- Report to Elastic when an unexpected error occurs and keep track of it. When it's fixed, it lets the user know, encouraging them to update their deployment to the latest release (PR [#56724](https://github.com/elastic/kibana/pull/56724)).
Contributor

Detailed error messages can frequently contain sensitive information. Are there any mitigations in place to prevent end-users from viewing the details of these error messages? Are we at all concerned that these error messages contain information sensitive enough that it shouldn't even be sent to Elastic?

Member Author

@kobelb that's a very valid concern. For the purposes of the POC we didn't tackle that issue, but we are aware of it and plan to do something about it.

It's not only a security concern, but we also need a way to group similar errors, removing the user-specific data.

For instance:

  • In the stack trace, we can show relative paths vs. the actual absolute ones (see the sketch after this list).
  • The Pulse Remote collector for this specific channel can be updated and maintained with logic to clean up error messages before storing them. Sure, every newly-found error will report the full message until we come up with logic to clean it up, but we can do it remotely and update the entries at any given time. This will also force us to categorise any new error and create repo issues to deal with them, so we'll need some manual labour to deal with every new error. But that's part of the purpose of this channel :)
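As a rough illustration of the relative-path idea above, a minimal sketch (the helper name and the `<kibana-root>` placeholder are hypothetical, not part of the RFC):

```ts
// Minimal sketch: strip deployment-specific absolute paths from a stack
// trace before it leaves the deployment.
export function scrubStackTrace(stack: string, appRoot: string): string {
  return stack
    .split('\n')
    // Replace any occurrence of the Kibana install path with a placeholder,
    // turning absolute paths into deployment-agnostic relative ones.
    .map((line) => line.split(appRoot).join('<kibana-root>'))
    .join('\n');
}

// Usage: scrubStackTrace(error.stack ?? '', process.cwd())
```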

I didn't include all this in the RFC because it doesn't affect the Pulse service design as such. It's just one of the possible use-cases for the new channel-based telemetry. But I'm happy to include it in the RFC if you think it should be detailed.


#### Data storage

For the purpose of transparency, we want the user to be able to retrieve the telemetry we send at any point, so we should store the information we send for each channel in its own local internal indices (similar to a copy of the `pulse-raw-*` and `pulse-instructions-*` indices in our remote service). In the same spirit, we could even provide some _dashboards_ in Kibana for specific roles in the cluster to understand more about their deployment.
Contributor

For these local internal indices, will they be "dot indices"? Otherwise, we'll want some kind of acknowledgement from an appropriately authorized end-user before we begin creating documents in the "non dot indices", because the user might intend them to be used for something else or might already have roles/users with access to these indices.

Member Author

Yes! That's what I meant by internal indices. I'll change it to make it clearer.

We should also limit access to an admin role initially, but let the user grant access to other users who may want to read from it.

I wonder if it's valuable to have a config parameter so the administrator can override the index prefix?

Comment on lines +179 to +181
The telemetry will be sent, preferably, from the server, only falling back to the browser if we detect that the server is behind a firewall and cannot reach the service, or if the user explicitly sets that behaviour in the config.

Periodically, the process (either in the server or the browser) will retrieve the telemetry to be sent by the channels, compile it into one bulk payload and send it encrypted to the [ingest endpoint](#inject-telemetry) explained earlier.
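A rough sketch of that periodic flow, assuming channels expose a method to drain their pending records and that encryption is delegated to something like request-crypto (all names here are hypothetical):

```ts
interface Channel {
  id: string;
  // Assumed: returns and clears the records queued since the last flush.
  drainRecords(): Promise<object[]>;
}

async function flushTelemetry(
  channels: Channel[],
  ingestUrl: string,
  // Assumed: application-level encryption, e.g. backed by request-crypto.
  encrypt: (payload: unknown) => Promise<string>
): Promise<void> {
  // Compile every channel's pending records into one bulk payload.
  const bulk = await Promise.all(
    channels.map(async (channel) => ({
      channel: channel.id,
      records: await channel.drainRecords(),
    }))
  );
  const encrypted = await encrypt(bulk);
  // fetch: global in modern runtimes; substitute node-fetch if needed.
  await fetch(ingestUrl, { method: 'POST', body: encrypted });
}
```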
Contributor
@jportner jportner Feb 13, 2020

A few observations:

  1. By "send it encrypted", do you mean application-level encryption before the data is sent via HTTPS? I think this is a must if the data is going to be routed through an end-user's browser.
  2. The same goes for receiving instructions through an end-user's browser.
  3. If an end-user's browser is going to be routing anything, we should limit that behavior to authenticated users only.

Member Author

  1. Yes, the current telemetry applies some application-level encryption before sending it over HTTPS.
  2. Good point. I'll add a mention of it.
  3. Right! I didn't mention it, but that's how telemetry already works nowadays. I'll make it clear in the docs.

Thank you! @jportner


# Unresolved questions

- Pending: define a proper handshake in the authentication mechanism to reduce the chance of a man-in-the-middle attack or DDoS.
Contributor

We should configure clients (both Kibana server and browser) to utilize public-key pinning (not HPKP) for the Pulse server certificates to mitigate MITM attack threats. This can be accomplished in both cases using node-forge.

At a high level this involves generating a list of key pairs -- preferably EC as it is more future-proof -- and creating public key hashes of those keypairs to hard-code in clients. Perhaps 5 years' worth of key pairs / hashes is enough, with the idea that new key pairs would be added and old ones would be dropped on a regular (annual) basis. In this scheme, an "active" key pair is used with the Pulse server certificate, while the "backup" key pairs are waiting to be used in the future. A client version that is released today would trust all Pulse server certificates for the next 5 years.

Contributor

After discussing with @jportner, we came to the conclusion that the biggest disadvantage of request-crypto compared to public-key pinning is that request-crypto does not support key rotation. Our use of request-crypto prevents a MITM, as the telemetry data can only be decrypted with the private key.

Neither of these necessarily solves the problem of uniquely identifying/authenticating the "sender" of the telemetry information. Some further thought about how to accomplish this is warranted.

Member Author

Taking it that the encryption/decryption for payloads from the client to the remote service is safe, let's focus on finding a way to ensure the encryption in the opposite direction is safe as well:

Neither of these necessarily solves the problem of uniquely identifying/authenticating the "sender" of the telemetry information. Some further thought about how to accomplish this is warranted.

What about handling this at the authenticate endpoint? We could follow a handshake authentication process like this (a sketch follows after the steps):

If Kibana starts for the very first time:

  1. The client checks it doesn't have a local private key of its own nor an assigned deployment ID.
  2. The client generates the local private key (with expiration X).
  3. POST /authenticate: in the body (encrypted with request-crypto, using the public key of the remote service) the client provides a public key generated from the private key in 2 and the cluster_uuid.
  4. In the response, the remote service will decrypt the payload with its own private key and assign an anonymised deploymentID (common across the entire cluster UUID), storing the relation between the deploymentID, cluster UUID and public key.
    NB: This relation cannot be unique (aka, we can have multiple documents with the same deploymentID and cluster UUID but different public keys) because we may have a cluster with multiple Kibana instances. Is that OK? Or should we store the private key in a Saved Object so all the Kibana instances can share the same private key? Isn't that risky?
  5. Using the provided public key to encrypt the response, the remote service replies back, providing the assigned deploymentID.
  6. The client locally stores that deploymentID (in a local file and as a Saved Object)
  7. The client keeps sending information with the deploymentID reference (as in POST /send_telemetry/{deploymentID}), encrypted with the public key from the remote service, and the remote service encrypts its responses with the public key from the client.
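A sketch of that first-run flow in TypeScript; the endpoint URL and the encryption helpers are assumptions layered on the steps above:

```ts
import { generateKeyPairSync } from 'crypto';

async function registerDeployment(
  clusterUuid: string,
  // Assumed helpers, e.g. backed by request-crypto with Pulse's public key.
  encryptForPulse: (payload: unknown) => Promise<string>,
  decryptWithLocalKey: (payload: string, privateKey: string) => Promise<{ deploymentID: string }>
) {
  // Step 2: generate the local key pair.
  const { publicKey, privateKey } = generateKeyPairSync('ec', {
    namedCurve: 'prime256v1',
    publicKeyEncoding: { type: 'spki', format: 'pem' },
    privateKeyEncoding: { type: 'pkcs8', format: 'pem' },
  });

  // Step 3: POST /authenticate with our public key and the cluster_uuid,
  // encrypted with the remote service's public key. URL is hypothetical.
  const res = await fetch('https://pulse.example.com/authenticate', {
    method: 'POST',
    body: await encryptForPulse({ publicKey, cluster_uuid: clusterUuid }),
  });

  // Steps 4-5: the response, encrypted with our public key, carries the
  // anonymised deploymentID assigned by the remote service.
  const { deploymentID } = await decryptWithLocalKey(await res.text(), privateKey);

  // Step 6: persisting the key and deploymentID (file + Saved Object) omitted.
  return { deploymentID, publicKey, privateKey };
}
```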

Kibana restarts

  1. The client finds it has a local private key and an assigned deployment ID.
  2. It simply proceeds to send the telemetry and receive the data the same way as in step 7 in the previous use case.

Key-rotation/refresh

We may want to update the public and private keys from time to time. In that case, we need a mechanism to notify the other end about this update in a safe way:

Updating the client-generated keys

  1. The client decides it's time for a change so it generates a new pair of private and public keys (without dropping the old ones just yet).
  2. It requests POST /authenticate/{deploymentID}/refresh, providing in the encrypted (with the remote service's public keys) body the old public key and the new public key.
  3. The remote service replies with one challenge encrypted with the old key and another challenge encrypted with the new key.
  4. The client pushes both unencrypted challenges to PUT /authenticate/{deploymentID}/refresh (but the payload is encrypted as usual with the remote service's public key)
  5. The remote service checks both challenges are correct and replies OK, overwriting the old public key with the new one.
  6. The client will keep the old key as a backup for 30 more days (in case of disaster recovery in the remote service that forces us to roll back to a previous backup). If the backup key is used at any point, this whole process will be repeated.

Updating the remote service's generated keys

In order to push the update, we can do it via instruction handling. But we still need a process to confirm we are not dealing with a MITM attack:

  1. When the instruction is received, the client will confirm by encrypting one challenge with the old public key and another one with the new one and posting them to POST /authenticate/{deploymentID}/challenge.
  2. The remote service will include in the (encrypted with the client's public key) response both unencrypted challenges.
  3. The client will check the challenges are correct and update the public key locally.

The remote service needs to maintain the old keys as backups for up to 5 years(??) to ensure old installations are supported.

??: Maybe we want to maintain the shipped keys for up to 5 years, but not the rotated ones? What would be a good amount of time for those, then? We may need to take into consideration those installations where the server can't reach the remote service and we are sending via the browser.

Member Author

Oh! I forgot to mention: at the moment, request-crypto uses different public keys to identify the requests from different clients (Kibana vs. Beats vs. APM vs. ...). But we could extend this usage to one per actual client. There may be some memory concerns with that approach, though.

Contributor
@jportner jportner Feb 18, 2020

Taking it that the encryption/decryption for payloads from the client to the remote service is safe, let's focus on finding a way to ensure the encryption in the opposite direction is safe as well:

Neither of these necessarily solves the problem of uniquely identifying/authenticating the "sender" of the telemetry information. Some further thought about how to accomplish this is warranted.

What about handling this at the authenticate endpoint? We could follow a handshake authentication process like this:

It seems like you put a lot of thought into this! This is a great first draft, but I think we can refine it a bit more.

Let's think of it from a threat modeling perspective as a simple exercise. We have five components with two trust boundaries that separate them.

                                         +                    +
                +--------+               |                    |
                | Kibana |               |                    |
+---------+  A  | server |  B  +----+    |     +--------+     |    +--------+
|         | <-+ |   1    | <-+ |    |    |     |        |     |    |        |
|   ES    |     +--------+     |    |    |  B  | Kibana |  C  |    | Pulse  |
| cluster |        ...         | LB | <------+ | client | +------> | server |
|         |     +--------+     |    |    |     |        |     |    |        |
|         | <-+ | Kibana | <-+ |    |    |     +--------+     |    +--------+
+---------+     | server |     +----+    |                    |
                |   n    |               |                    |
                +--------+               |                    |
                                         +                    +
                                       TRUST                TRUST
                                      BOUNDARY             BOUNDARY

We have two basic data flows (not including auth steps):

  • Send telemetry data (Kibana payload): B, A, C, B, A
  • Receive Pulse updates (Pulse payload): B, A, C, B, A

The threat agents we are primarily concerned with are anyone who has access to the Kibana client, and anyone who may be able to intercept data-in-transit to the Kibana server and Pulse server. Our primary attack vectors are the Kibana server API and Pulse server API.

Enumerating threats (without ranking them):

  • TK1. Spoofing, Tampering, & Elevation of Privilege: impersonate the Pulse server -- forge news/updates/etc., deface Kibana, conduct phishing, inject a malicious payload (XSS, etc.), trick Kibana into adding or removing data in Elasticsearch indices
  • TK2. Spoofing, Tampering, & Denial of Service: replay attacks against Kibana
  • TK3. Information Disclosure: read telemetry data -- obtain info on the Elastic stack installation
  • TK4. Denial of Service: algorithmic complexity attack on Kibana -- reduce performance / cause a crash
  • TK5. Denial of Service: exhaust Elasticsearch storage
  • TP1. Spoofing & Tampering: impersonate the Kibana server -- forge telemetry data, prevent valid telemetry data from being transmitted/received, inject a malicious payload
  • TP2. Spoofing, Tampering, & Denial of Service: replay attacks against Pulse
  • TP3. Information Disclosure: read Pulse updates -- obtain info on the Elastic stack installation
  • TP4. Denial of Service: algorithmic complexity attack on Pulse -- reduce performance / cause a crash

Potential countermeasures (mapped to threats):

  • CK1 (TK1, TK3): Application-level encryption for Pulse payloads -- encrypt data in the Pulse server, and decrypt data in the Kibana server; requires Kibana public key registration flow
  • CK2 (TK1, TK5): Application-level signatures for Pulse payloads -- sign-then-encrypt data in the Pulse server, and verify the decrypted signature in the Kibana server; requires distributed Pulse public key (request-crypto uses this today)
  • CK3 (TK2): Generate a nonce and include it in each encrypted Kibana payload, and verify the nonce with each decrypted Pulse payload
  • CK4 (TK4): Require proof-of-work before processing payloads -- before accepting a payload, use a KDF (such as scrypt) on a nonce, send the nonce to the client, and require the solution hash with the payload
  • CK5 (TK4): Limit resources used for cryptographic operations -- use a queue with a separate process
  • CP1 (TP1, TP3): Application-level encryption for Kibana payloads -- encrypt data in the Kibana server, and decrypt data in the Pulse server; requires distributed Pulse public key (request-crypto uses this today)
  • CP2 (TP1): Application-level signatures for Kibana payloads -- sign-then-encrypt data in the Kibana server, and verify the decrypted signature in the Pulse server; requires Kibana public key registration flow
  • CP3 (TP2): Generate a nonce and include it in each encrypted Pulse payload, and verify the nonce with each decrypted Kibana payload
  • CP4 (TP4): Require proof-of-work before processing payloads -- before accepting a payload, use a KDF (such as scrypt) on a nonce, send the nonce to the client, and require the solution hash with the payload
  • CP5 (TP4): Limit resources used for cryptographic operations -- use a queue with a separate process
  • CP6 (TP1, TP3): Configure Pulse server to use TLS 1.2/1.3 with strong ciphers, configure Kibana client to use hostname verification

Note regarding the "sign-then-encrypt" process: it should prepend the recipient's identifier (maybe their public key) before signing, so that the message can't be re-encrypted and sent to someone else.
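A sketch of that recipient-bound sign-then-encrypt construction (the helpers and the framing are simplified assumptions; a real implementation would use a proper envelope format):

```ts
import { createSign } from 'crypto';

function signThenEncrypt(
  message: Buffer,
  recipientId: string, // e.g. a fingerprint of the recipient's public key
  senderPrivateKeyPem: string,
  // Assumed: encrypts for the recipient (hybrid encryption in practice).
  encryptForRecipient: (plaintext: Buffer) => Buffer
): Buffer {
  // Bind the recipient's identity into the signed bytes so the message
  // can't be stripped, re-encrypted, and replayed to a different recipient.
  const bound = Buffer.concat([Buffer.from(`to:${recipientId}\n`), message]);
  const signature = createSign('sha256').update(bound).sign(senderPrivateKeyPem);
  // A real envelope would length-prefix the signature; simplified here.
  return encryptForRecipient(Buffer.concat([signature, bound]));
}
```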

If Kibana starts for the very first time:

  1. The client checks it doesn't have a local private key of its own nor an assigned deployment ID.
  2. The client generates the local private key (with expiration X).
  3. POST /authenticate: in the body (encrypted with request-crypto, using the public key of the remote service) the client provides a public key generated from the private key in 2 and the cluster_uuid.
  4. In the response, the remote service will decrypt the payload with its own private key and assign an anonymised deploymentID (common across the entire cluster UUID), storing the relation between the deploymentID, cluster UUID and public key.
    NB: This relation cannot be unique (aka, we can have multiple documents with the same deploymentID and cluster UUID but different public keys) because we may have a cluster with multiple Kibana instances. Is that OK? Or should we store the private key in a Saved Object so all the Kibana instances can share the same private key? Isn't that risky?

I think we need to share the private key among all the Kibana instances. In an HA Kibana deployment, if we have a challenge/response flow that uses a browser as an intermediary, we don't have any way of guaranteeing that a Pulse response payload goes back to the same Kibana instance that the request payload came from. At any rate, we can use encrypted saved objects to protect the private key.

I don't think it's a good idea to allow anyone who knows the cluster_uuid to register a public key with the common deploymentID. The cluster_uuid is not necessarily secret, and is potentially accessible to many Kibana/ES users.

I think a good approach here would be to generate a separate pulse UUID/secret and store it in an encrypted saved object where all authorized Kibana instances can access it. We can derive the deploymentID from the combination of the cluster UUID and the pulse UUID.

If the pulse UUID ever got deleted for some reason, the cluster would "unpair" itself from the deploymentID (and get a new deploymentID), but I think that's an acceptable risk / has negligible impact.

  1. Using the provided public key to encrypt the response, the remote service replies back, providing the assigned deploymentID.
  2. The client locally stores that deploymentID (in a local file and as a Saved Object)
  3. The client keeps sending information with the deploymentID reference (as in POST /send_telemetry/{deploymentID}), encrypted with the public key from the remote service, and the remote service encrypts its responses with the public key from the client.

Sounds good.

Kibana restarts

  1. The client finds it has a local private key and an assigned deployment ID.
  2. It simply proceeds to send the telemetry and receive the data the same way as in step 7 in the previous use case.

Sounds good.

Key-rotation/refresh

We may want to update the public and private keys from time to time. In that case, we need a mechanism to notify the other end about this update in a safe way:

Updating the client-generated keys

  1. The client decides it's time for a change so it generates a new pair of private and public keys (without dropping the old ones just yet).
  2. It requests POST /authenticate/{deploymentID}/refresh, providing in the encrypted (with the remote service's public keys) body the old public key and the new public key.
  3. The remote service replies with one challenge encrypted with the old key and another challenge encrypted with the new key.
  4. The client pushes both unencrypted challenges to PUT /authenticate/{deploymentID}/refresh (but the payload is encrypted as usual with the remote service's public key)

I think steps 2-4 would work, but it is a bit over-complicated. At this point the Kibana server can just send a payload with the new public key, that is signed with the old private key and the new private key.

  1. The remote service checks both challenges are correct and replies OK, overwriting the old public key with the new one.
  2. The client will keep the old key as a backup for 30 more days (in case of disaster recovery in the remote service that forces us to roll back to a previous backup). If the backup key is used at any point, this whole process will be repeated.

The Pulse server will need to keep the old public key on hand for 30 days too.

Updating the remote service's generated keys

In order to push the update, we can do it via instruction handling. But we still need a process to confirm we are not dealing with a MITM attack:

  1. When the instruction is received, the client will confirm by encrypting one challenge with the old public key and another one with the new one and posting them to POST /authenticate/{deploymentID}/challenge.
  2. The remote service will include in the (encrypted with the client's public key) response both unencrypted challenges.
  3. The client will check the challenges are correct and update the public key locally.

The remote service needs to maintain the old keys as backups for up to 5 years(??) to ensure old installations are supported.

??: Maybe we want to maintain the shipped keys for up to 5 years, but not the rotated ones? What would be a good amount of time for those, then? We may need to take into consideration those installations where the server can't reach the remote service and we are sending via the browser.

Rather than try to design some mechanism to update/distribute the remote service's generated keypair, we should probably just generate a bunch of keypairs ahead of time and ship each client with a batch of public keys to support them for a window of time (maybe 5 years is appropriate). This is similar to the certificate pinning approach that I mentioned above.

We should also consider using Elliptic Curve keys/algorithms instead of RSA as request-crypto currently does. This will be more performant, more future-proof, and simpler (once the client and server each have the others' public key, they can use ECDH to generate a shared secret).

Oh! I forgot to mention: at the moment, request-crypto uses different public keys to identify the requests from different clients (Kibana vs. Beats vs. APM vs. ...). But we could extend this usage to one per actual client. There may be some memory concerns with that approach, though.

I don't see any practical reason to use a different Pulse keypair for each client. Unnecessary complexity is the death of security; let's avoid doing this.

Contributor

TP1. Spoofing & Tampering: impersonate the Kibana server -- forge telemetry data, prevent valid telemetry data from being transmitted/received, inject a malicious payload

At the moment, anyone or anything could pretend to be a Kibana server. Pulse's public key is readily available. CP1 is supposed to control for this threat, but the current use of request-crypto doesn't. Am I missing something here?

CK2 (TK1, TK5): Application-level signatures for Pulse payloads -- sign encrypted data in the Pulse server, and verify encrypted data in the Kibana server; requires distributed Pulse public key (request-crypto uses this today)

I think I'm missing something here... Are you referring to request-crypto's use of an encryption algorithm which supports "authenticated encryption"? If so, how is this different than CK1?

CK3 (TK2): Generate a nonce and include it in each encrypted Kibana payload, and verify the nonce with each decrypted Pulse payload

This would require us to keep track of all previously seen nonces, correct?

CK4 (TK4): Require proof-of-work before processing payloads -- before accepting a payload, use a KDF (such as scrypt) on a nonce, send the nonce to the client, and require the solution hash with the payload

For this recommendation, is the "Kibana client" the "client"? In the situations where the Kibana server can access Pulse directly, I assume we would just be skipping this step?

CP4 (TP4): Require proof-of-work before processing payloads -- before accepting a payload, use a KDF (such as scrypt) on a nonce, send the nonce to the client, and require the solution hash with the payload

Similar question to above, is the "Kibana client" the "client"? In the situations where the Kibana server can access Pulse directly, do we force the Kibana server to perform the proof-of-work?

CP6 (TP1, TP3): Configure Pulse server to use TLS 1.2/1.3 with strong ciphers, configure Kibana client to use hostname verification

I'm not following how this would be a control for TP1.

At any rate, we can use encrypted saved objects to protect the private key.

We can, with a bunch of asterisks. All of our encryption keys, including the encrypted saved objects encryption keys, are by default randomly generated and temporary. As soon as a Kibana instance restarts, the temporary encryption key is lost. This also means that HA deployments without synchronized persistent encryption keys can't decrypt each other's data.

We could consider storing the private key in the .kibana index and not encrypting it. This would mean that if someone got read access to the full .kibana index they could call any Pulse service endpoint as Kibana.

I don't think it's a good idea to allow anyone who knows the cluster_uuid to register a public key with the common deploymentID. The cluster_uuid is not necessarily secret, and is potentially accessible to many Kibana/ES users.

Agreed

I think a good approach here would be to generate a separate pulse UUID/secret and store it in an encrypted saved object where all authorized Kibana instances can access it. We can derive the deploymentID from the combination of the cluster UUID and the pulse UUID.

Would Kibana be generating the Pulse UUID? Would the Pulse service still be generating the deployment ID?

The following assumes we're still planning on using the originally proposed Authenticate endpoint. When the Kibana client is mediating the communication, we can easily lose a response from the Pulse service. Should we consider supporting multiple subsequent calls to authenticate with the same parameters until we begin receiving telemetry data and then stop? Would we provide the public certificate as an additional parameter?

Contributor

TP1. Spoofing & Tampering: impersonate the Kibana server -- forge telemetry data, prevent valid telemetry data from being transmitted/received, inject a malicious payload

At the moment, anyone or anything could pretend to be a Kibana server. Pulse's public key is readily available. CP1 is supposed to control for this threat, but the current use of request-crypto doesn't. Am I missing something here?

For better or for worse, in my attempt to simplify this, TP1 "impersonate the Kibana server" is a sort of catch-all for anything an attacker might do to that effect. This might be directly attacking the Kibana server APIs, or it might be attempts to MITM a legitimate connection.

CP1 partially controls for this threat by encrypting payloads to prevent them from being viewed or modified. The other (arguably more important) control for TP1 is CP2, which we use to prove the Kibana server's identity for a given deploymentID.

CK2 (TK1, TK5): Application-level signatures for Pulse payloads -- sign encrypted data in the Pulse server, and verify encrypted data in the Kibana server; requires distributed Pulse public key (request-crypto uses this today)

I think I'm missing something here... Are you referring to request-crypto's use of an encryption algorithm which supports "authenticated encryption"? If so, how is this different than CK1?

Encryption (with the Kibana server's public key) doesn't guarantee that a payload came from the Pulse server. Anyone who could view that public key could then generate a payload that would be trusted by Kibana. We should sign payloads using the Pulse private key, and verify them with the Pulse public key.

Note: I mention "distributed Pulse public key", but to practice good hygiene we really should have two separate Pulse key pairs (one for encryption/decryption, one for signing/verification).

CK3 (TK2): Generate a nonce and include it in each encrypted Kibana payload, and verify the nonce with each decrypted Pulse payload

This would require us to keep track of all previously seen nonces, correct?

No, we would just have to keep track of any currently valid nonce that we generate, until we receive a Pulse payload with that nonce and forget it. Perhaps each nonce auto-expires after a short time (10 minutes?)
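A sketch of that bookkeeping, assuming a single-instance, in-memory registry (an HA deployment would need shared state):

```ts
import { randomBytes } from 'crypto';

class NonceRegistry {
  private readonly pending = new Map<string, NodeJS.Timeout>();

  // Issue a nonce that auto-expires if no matching payload arrives in time.
  issue(ttlMs = 10 * 60 * 1000): string {
    const nonce = randomBytes(16).toString('hex');
    this.pending.set(nonce, setTimeout(() => this.pending.delete(nonce), ttlMs));
    return nonce;
  }

  // Accept each nonce at most once; unknown or expired nonces are rejected.
  consume(nonce: string): boolean {
    const timer = this.pending.get(nonce);
    if (!timer) return false;
    clearTimeout(timer);
    this.pending.delete(nonce);
    return true;
  }
}
```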

CK4 (TK4): Require proof-of-work before processing payloads -- before accepting a payload, use a KDF (such as scrypt) on a nonce, send the nonce to the client, and require the solution hash with the payload

For this recommendation, is the "Kibana client" the "client"? In the situations where the Kibana server can access pulse directly, I assume we would just be skipping this step?

Well, we don't know who the client is. The threat is that the Kibana server might receive unsolicited (potentially automated) malicious requests.

I think we can skip this step when the Kibana server can access the Pulse server directly, yes. We only need this proof-of-work on the Kibana server API that would be receiving requests from a client.

CP4 (TP4): Require proof-of-work before processing payloads -- before accepting a payload, use a KDF (such as scrypt) on a nonce, send the nonce to the client, and require the solution hash with the payload

Similar question to above, is the "Kibana client" the "client"? In the situations where the Kibana server can access pulse directly, do we force the Kibana server to perform the proof-of-work?

We don't know who the client is until we decrypt the payload, so I don't see how we can skip this step. The decryption part is the primary cause of the ACA threat.
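A sketch of the nonce-plus-KDF exchange described in CK4/CP4 (parameters and the salt are illustrative; note the verifier pays the same scrypt cost, so the control rate-limits rather than asymmetrically punishes the sender):

```ts
import { scrypt, timingSafeEqual } from 'crypto';
import { promisify } from 'util';

const scryptAsync = promisify(scrypt) as (
  password: string,
  salt: string,
  keylen: number,
  options: { N: number; r: number; p: number }
) => Promise<Buffer>;

// Cost parameters are illustrative; N=2^14, r=8 keeps memory usage under
// Node's default 32 MiB scrypt limit.
const POW = { N: 2 ** 14, r: 8, p: 1 };

// Sender side: solve the challenge for the server-issued nonce.
export const solveChallenge = (nonce: string): Promise<Buffer> =>
  scryptAsync(nonce, 'pulse-pow', 32, POW);

// Receiver side: recompute and compare in constant time.
export async function verifySolution(nonce: string, solution: Buffer): Promise<boolean> {
  const expected = await scryptAsync(nonce, 'pulse-pow', 32, POW);
  return solution.length === expected.length && timingSafeEqual(solution, expected);
}
```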

CP6 (TP1, TP3): Configure Pulse server to use TLS 1.2/1.3 with strong ciphers, configure Kibana client to use hostname verification

I'm not following how this would be a control for TP1.

As I mentioned above, in my attempt to simplify this, TP1 "impersonate the Kibana server" is a sort of catch-all for anything an attacker might do to that effect, including MITM attacks. An additional layer of TLS encryption would help prevent a MITM if someone had stolen the Kibana server's private key -- if they could intercept traffic over the wire, they would also need to break TLS.

Unfortunately we can't guarantee how the Kibana server is configured in non-SaaS installations.

At any rate, we can use encrypted saved objects to protect the private key.

We can, with a bunch of asterisks. All of our encryption keys, including the encrypted saved objects encryption keys, are by default randomly generated and temporary. As soon as a Kibana instance restarts, the temporary encryption key is lost. This also means that HA deployments without synchronized persistent encryption keys can't decrypt each other's data.

We could consider storing the private key in the .kibana index and not encrypting it. This would mean that if someone got read access to the full .kibana index they could call any Pulse service endpoint as Kibana.

Note: obtaining the Kibana private key would also allow an attacker to decrypt Pulse payloads. Maybe storing it in the .kibana index is "good enough", though. Generally speaking, very few people should be able to access that index.

Perhaps in Kibana, we can detect if the encrypted saved objects encryption key is not ephemeral, and use an encrypted saved object instead in that scenario?

I think a good approach here would be to generate a separate pulse UUID/secret and store it in an encrypted saved object where all authorized Kibana instances can access it. We can derive the deploymentID from the combination of the cluster UUID and the pulse UUID.

Would Kibana be generating the Pulse UUID? Would the Pulse service still be generating the deployment ID?

I don't see why we couldn't allow Kibana to generate its own pulseUUID.

Sorry, that last sentence was ambiguous. We could either allow the Pulse service to generate the deploymentID as @afharo proposed, or we could derive with a one-way hash like sha256(pulseUUID + clusterUUID). I don't see any particular reason not to derive it, since on the Kibana side we would store the deploymentID adjacent to the pulseUUID -- if the pulseUUID was compromised, the deploymentID would be too.
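For illustration, that derivation is essentially a one-liner (a sketch; the helper name is made up):

```ts
import { createHash } from 'crypto';

// deploymentID = sha256(pulseUUID + clusterUUID); one-way, so the
// deploymentID alone reveals neither UUID.
const deriveDeploymentId = (pulseUuid: string, clusterUuid: string): string =>
  createHash('sha256').update(pulseUuid).update(clusterUuid).digest('hex');
```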

The following assumes we're still planning on using the originally proposed Authenticate endpoint. When the Kibana client is mediating the communication, we can easily lose a response from the Pulse service. Should we consider supporting multiple subsequent calls to authenticate with the same parameters until we begin receiving telemetry data and then stop? Would we provide the public certificate as an additional parameter?

You mean reuse the same proof-of-work, and Kibana payload that includes a nonce? I don't think so. That defeats the purpose of both of those controls. Perhaps I'm misunderstanding what you are trying to say here?


On another note: I made a mistake in my comment above, in CK2 and CP2 I stated "sign encrypted data" (encrypt-then-sign), a better practice would be to prepend the recipient's name and sign-then-encrypt. I'll update the comment accordingly.

Contributor

No, we would just have to keep track of any currently valid nonce that we generate, until we receive a Pulse payload with that nonce and forget it. Perhaps each nonce auto-expires after a short time (10 minutes?)

Gotcha. So prior to the mediator making any requests to the Pulse service, the Kibana client should use some Kibana API which would generate the nonce, and then we'd expect to see that same nonce for the conceptual reply?

We don't know who the client is until we decrypt the payload, so I don't see how we can skip this step. The decryption part is the primary cause of the ACA threat.

So, the Pulse service would require the proof-of-work along with the nonce and doesn't necessarily care who calculates it, so the proof-of-work could be done by either the Kibana server or the Kibana client? In the case where the Kibana server is communicating directly with the Pulse service, it would have to be done on the Kibana server itself. Being Node.js, in all its glory, if the proof-of-work is CPU intensive it could potentially block other operations.
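Worth noting: Node's built-in `crypto.scrypt`, used asynchronously, already runs its work on the libuv threadpool, so the event loop isn't blocked; if other CPU-bound crypto had to run in JS, a worker thread would be one way out (a sketch; the inline worker script and salt are hypothetical):

```ts
import { Worker } from 'worker_threads';

// Run a scrypt proof-of-work off the main thread.
function solveInWorker(nonce: string): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(
      `const { parentPort, workerData } = require('worker_threads');
       const { scryptSync } = require('crypto');
       parentPort.postMessage(scryptSync(workerData, 'pulse-pow', 32));`,
      { eval: true, workerData: nonce }
    );
    worker.once('message', (hash) => resolve(Buffer.from(hash)));
    worker.once('error', reject);
  });
}
```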

Perhaps in Kibana, we can detect if the encrypted saved objects encryption key is not ephemeral, and use an encrypted saved object instead in that scenario?

I think this is a good compromise.

You mean reuse the same proof-of-work, and Kibana payload that includes a nonce? I don't think so. That defeats the purpose of both of those controls. Perhaps I'm misunderstanding what you are trying to say here?

Apologies for not being clear. Ignoring the nonce, should we allow multiple calls to authenticate with whatever other parameters we deem necessary? It seems like, conceptually, we're trying to use the authenticate endpoint to set up whatever we need to uniquely identify a deployment and the subsequent communication from that deployment. Once the deployment has effectively been registered, we want to prevent others from re-registering and somehow authenticating as the original deployment. However, given that the Kibana server might not see a reply, we'll potentially need to accommodate the Pulse service seeing multiple calls to authenticate and treating those as valid, while still protecting against the abuse/takeover of a deploymentID.

Contributor

Perhaps in Kibana, we can detect if the encrypted saved objects encryption key is not ephemeral, and use an encrypted saved object instead in that scenario?

I think this is a good compromise.

This should probably come with a big ol' warning log message (if there isn't already one for encrypted saved objects w/o a configured key)

Member Author

I've been told the warning already exists, because it might affect the existing reporting and alerting features.


##### Retrieve instructions

This `GET` endpoint should return the list of instructions generated for that deployment. To control the likely ever-growing list of instructions for each deployment, it will accept a `since` query parameter so the requester can specify a timestamp and retrieve only the values generated after it.
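A hypothetical client call against such an endpoint (the path and response shape are made up for illustration; the `since` parameter is the one described above):

```ts
async function fetchInstructions(baseUrl: string, deploymentId: string, since?: string) {
  const url = new URL(`${baseUrl}/instructions/${deploymentId}`);
  // Only ask for instructions generated after the given timestamp.
  if (since) url.searchParams.set('since', since);
  const res = await fetch(url.toString()); // auth/encryption as discussed elsewhere
  return res.json();
}
```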
Contributor

Retrieving instructions when there is a mediator in play has the same limitations as the authenticate endpoint.

@afharo afharo added the RFC/final-comment-period If no concerns are raised in 3 business days this RFC will be accepted label Mar 5, 2020
1. Remote Pulse Service (RPS)
2. Local Pulse Service (LPS)

After that, it explains how we invision the architecture and design of each of those components.
Member

nit: invision -> envision


#### Exposing channels to the plugins

The channels will be exposed to the channels as part of the `coreContext` in the `setup` and `start` lifecycle methods in a fashion like (types to be properly defined when implementing it):
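The code block that followed in the RFC is not reproduced in this excerpt; as a purely illustrative sketch (the `pulse` property and `getChannel` are assumed shapes; the conversation below only confirms a `sendToChannel` method, and the RFC leaves the types undefined):

```ts
import type { CoreSetup } from 'src/core/server';

export class MyPlugin {
  public setup(core: CoreSetup) {
    // `pulse` on core and `getChannel` are assumptions; only `sendToChannel`
    // is mentioned elsewhere in this conversation.
    const errorsChannel = (core as any).pulse.getChannel('errors');
    errorsChannel.sendToChannel({ message: 'Unexpected error', hash: 'abc123' });
  }
}
```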
Member

The plugins will be exposed to the channels

# Unresolved questions

- Pending: define a proper handshake in the authentication mechanism to reduce the chance of a man-in-the-middle attack or DDoS. => We already have some ideas thanks to @jportner and @kobelb, but it will be resolved during the _Phase 2_ design.
- Opt-in/out per channel?
Member

Sounds like the ILM of the new system indices is also unresolved.

Member Author

Nobody objected to the question, so I rephrased it as an affirmative statement. Hopefully, that won't raise any concerns.


Only those specific roles (admin?) should have access to these local indices, unless they grant permissions to other users they want to share this information with.

The users should be able to control how long they want to keep that information (via ILM?).
Member

Anyone know if there are any shared / core services in Kibana that can help with ILM of system indices? We have quite a few system indices now, and I'm wondering if there are reusable patterns for these that the Kibana teams can share.

Member Author

In the master build for Kibana there are 3 ILM policies created by default, 2 of them for system indices:
[screenshot: the default ILM policies]


#### Data model

The storage of each of the documents will be based on monthly-rolling indices split by channel. This means we'll have indices like `pulse-raw-{CHANNEL_NAME}-YYYY.MM` and `pulse-instructions-{CHANNEL_NAME}-YYYY.MM` (final names TBD).
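For illustration, the monthly index name could be derived like this (a sketch; the RFC leaves final names TBD):

```ts
type PulseIndexKind = 'raw' | 'instructions';

function pulseIndexFor(kind: PulseIndexKind, channel: string, date = new Date()): string {
  const yyyy = date.getUTCFullYear();
  const mm = String(date.getUTCMonth() + 1).padStart(2, '0');
  return `pulse-${kind}-${channel}-${yyyy}.${mm}`;
}

// pulseIndexFor('raw', 'errors', new Date('2020-03-11')) === 'pulse-raw-errors-2020.03'
```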
Member

Will the data model be based on Saved Objects, or will this service query and handle raw data in ES?

Is the date suffix necessary for managing storage? I think Elasticsearch can support ILM even if it's a single index: deleting old data, moving it to cold storage, and the other things we often think we need time-based indices for.

Member Author

Initially, we are thinking of system (dot) indices instead of Saved Objects. This is internally handled by the sendToChannel method, so it should be transparent to the plugins.

Regarding the time-based indices: for time-based data (such as this new granular telemetry), our recommendation (in trainings) is to always use time-based indices.
Do you think we should use static indices instead?

Member
@tsullivan tsullivan left a comment

Overall this looks great! It's clear how it will benefit users, and it really helps me understand how this will improve the quality of Pulse.

One thing I would add is to make clearer the impact on developers, if any changes need to happen in plugin code. Right now we use the UsageCollector service, which will soon be considered a "legacy" collection channel. Will developers need to change their integrations to support the new collection? What would a before & after example look like?

@afharo
Member Author

afharo commented Mar 10, 2020

@tsullivan thank you for your comments! They are super helpful!
I've amended the wording as per your suggestions and added an explanation of how we'll deal with the legacy usage collection we currently report.

@epixa epixa removed their request for review March 10, 2020 18:27
@afharo afharo changed the title [RFC][skip-ci] Pulse [RFC] Pulse Mar 10, 2020
Contributor
@TinaHeiligers TinaHeiligers left a comment

LGTM

@afharo afharo merged commit babf81b into elastic:master Mar 11, 2020
@afharo afharo deleted the rfc/pulse branch March 11, 2020 09:36
gmmorris added a commit to gmmorris/kibana that referenced this pull request Mar 11, 2020
* master:
  [Metrics Alerts] Fix error when a metric reports no data (elastic#59810)
  Vislib legend toggle broken (elastic#59736)
  [RFC] Pulse (elastic#57108)
@kibanamachine kibanamachine added the backport missing Added to PRs automatically when the are determined to be missing a backport. label Mar 12, 2020
@kibanamachine
Contributor

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 57108 or prevent reminders by adding the backport:skip label.

afharo added a commit to afharo/kibana that referenced this pull request Mar 12, 2020
* [RFC][skip-ci] Pulse

* Add drawback

* Add Opt-In/Out endpoint

* Add clarification about synched local internal indices

* Update rfcs/text/0008_pulse.md

Co-Authored-By: Josh Dover <me@joshdover.com>

* Add Phased implementation intentions, Security and Integrity challenges and example of use

* Refer to a follow up RFC to talk about security in the future

* Fix wording + add Legacy behaviour

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Josh Dover <me@joshdover.com>
afharo added a commit that referenced this pull request Mar 12, 2020
@kibanamachine kibanamachine removed the backport missing Added to PRs automatically when the are determined to be missing a backport. label Mar 12, 2020
simianhacker pushed a commit to simianhacker/kibana that referenced this pull request Mar 12, 2020
jkelastic pushed a commit to jkelastic/kibana that referenced this pull request Mar 12, 2020