TF continually wants to update nomad_job resource between plan/apply #290

Closed
kian opened this issue Sep 23, 2022 · 7 comments · Fixed by #356

Comments

kian commented Sep 23, 2022

Terraform Version

Terraform v1.0.0
Nomad provider version 1.4.17

Nomad Version

1.1.0

Provider Configuration

nomad = {
  source = "hashicorp/nomad"
  version = "1.4.17"
}

provider "nomad" {
  address = "https://nomad.service./${var.datacenter}.internal:4646"
  region  = "global"
}

Environment Variables

None.

Affected Resource(s)

  • nomad_job

Terraform Configuration Files

Terraform:

variable "datacenter" {
  type        = string
  description = "Datacenter ID to deploy the job to"
}

resource "nomad_job" "cassandra_stage" {
  jobspec = templatefile("${path.module}/cassandra-stage.hcl",
            { datacenter = var.datacenter })
}

Job spec template:

job "cassandra_stage" {
  datacenters = ["${datacenter}"]
  type = "service"
  update {
    max_parallel = 1
    stagger      = "1m"
  }
  group "cassandra-1.50" {
    constraint {
      attribute = "$${attr.unique.network.ip-address}"
      value     = "10.10.1.50"
    }
    restart {
      attempts = 10
      delay    = "30s"
      interval = "30m"
      mode     = "delay"
    }
    task "cassandra-1-50" {
      driver = "docker"
      config {
        image = "ecr-proxy.service.xxx.internal/internal/cassandra:master-20220105-39f4ce4a0"
        port_map = {
          rpc    = 9160
          gossip = 7000
        }
        network_mode = "host"
        logging {
          type = "gelf"
          config {
            gelf-address = "udp://$${node.unique.name}:12201"
            tag = "cassandra"
          }
        }
        volumes = [
          "/mnt/ebs/cassandra/:/srv/var/"
        ]
      }
      service {
        name = "cassandra"
        port = "rpc"
      }
      env {
        CLUSTER_NAME                       = "${datacenter}"
        SEEDS                              = "10.10.2.50,10.10.2.51,10.10.3.51,10.10.3.52"
        LISTEN_ADDRESS                     = "$${NOMAD_IP_rpc}"
        HEAP                               = "8G"
        KEY_CACHE_MB                       = "1024"
        COMPACTION_THROUGHPUT_MB           = "8"
        STREAM_THROUGHPUT_MEGABITS_PER_SEC = "400"
        ENABLE_HINTED_HANDOFF              = "true"
      }
      kill_timeout = "300s"
      resources {
        cpu    = 10000
        memory = 18432
        network {
          mbits = 100
          port "rpc" {
            static = 9160
          }
          port "gossip" {
            static = 7000
          }
        }
      }
    }
  }
  group "cassandra-2.50" {
    constraint {
      attribute = "$${attr.unique.network.ip-address}"
      value     = "10.10.2.50"
    }
    restart {
      attempts = 10
      delay    = "30s"
      interval = "30m"
      mode     = "delay"
    }
    task "cassandra-2-50" {
      driver = "docker"
      config {
        image = "ecr-proxy.service.xxx.internal/internal/cassandra:master-20220105-39f4ce4a0"
        port_map = {
          rpc    = 9160
          gossip = 7000
        }
        network_mode = "host"
        logging {
          type = "gelf"
          config {
            gelf-address = "udp://$${node.unique.name}:12201"
            tag = "cassandra"
          }
        }
        volumes = [
          "/mnt/ebs/cassandra/:/srv/var/"
        ]
      }
      service {
        name = "cassandra"
        port = "rpc"
      }
      env {
        CLUSTER_NAME                       = "${datacenter}"
        SEEDS                              = "10.10.1.50,10.10.2.51,10.10.3.51,10.10.3.52"
        LISTEN_ADDRESS                     = "$${NOMAD_IP_rpc}"
        HEAP                               = "8G"
        KEY_CACHE_MB                       = "1024"
        COMPACTION_THROUGHPUT_MB           = "8"
        STREAM_THROUGHPUT_MEGABITS_PER_SEC = "400"
        ENABLE_HINTED_HANDOFF              = "true"
      }
      kill_timeout = "300s"
      resources {
        cpu    = 10000
        memory = 18432
        network {
          mbits = 100
          port "rpc" {
            static = 9160
          }
          port "gossip" {
            static = 7000
          }
        }
      }
    }
  }
  group "cassandra-1.51" {
    constraint {
      attribute = "$${attr.unique.network.ip-address}"
      value     = "10.10.1.51"
    }
    restart {
      attempts = 10
      delay    = "30s"
      interval = "30m"
      mode     = "delay"
    }
    task "cassandra-1-51" {
      driver = "docker"
      config {
        image = "ecr-proxy.service.xxx.internal/internal/cassandra:master-20220105-39f4ce4a0"
        port_map = {
          rpc    = 9160
          gossip = 7000
        }
        network_mode = "host"
        logging {
          type = "gelf"
          config {
            gelf-address = "udp://$${node.unique.name}:12201"
            tag = "cassandra"
          }
        }
        volumes = [
          "/mnt/ebs/cassandra/:/srv/var/"
        ]
      }
      service {
        name = "cassandra"
        port = "rpc"
      }
      env {
        CLUSTER_NAME                       = "${datacenter}"
        SEEDS                              = "10.10.1.50,10.10.2.50,10.10.2.51"
        LISTEN_ADDRESS                     = "$${NOMAD_IP_rpc}"
        HEAP                               = "8G"
        KEY_CACHE_MB                       = "1024"
        COMPACTION_THROUGHPUT_MB           = "8"
        STREAM_THROUGHPUT_MEGABITS_PER_SEC = "400"
        ENABLE_HINTED_HANDOFF              = "true"
      }
      kill_timeout = "300s"
      resources {
        cpu    = 10000
        memory = 18432
        network {
          mbits = 100
          port "rpc" {
            static = 9160
          }
          port "gossip" {
            static = 7000
          }
        }
      }
    }
  }
  group "cassandra-2.51" {
    constraint {
      attribute = "$${attr.unique.network.ip-address}"
      value     = "10.10.2.51"
    }
    restart {
      attempts = 10
      delay    = "30s"
      interval = "30m"
      mode     = "delay"
    }
    task "cassandra-2-51" {
      driver = "docker"
      config {
        image = "ecr-proxy.service.xxx.internal/internal/cassandra:master-20220105-39f4ce4a0"
        port_map = {
          rpc    = 9160
          gossip = 7000
        }
        network_mode = "host"
        logging {
          type = "gelf"
          config {
            gelf-address = "udp://$${node.unique.name}:12201"
            tag = "cassandra"
          }
        }
        volumes = [
          "/mnt/ebs/cassandra/:/srv/var/"
        ]
      }
      service {
        name = "cassandra"
        port = "rpc"
      }
      env {
        CLUSTER_NAME                       = "${datacenter}"
        SEEDS                              = "10.10.1.50,10.10.2.50,10.10.3.51,10.10.3.52"
        LISTEN_ADDRESS                     = "$${NOMAD_IP_rpc}"
        HEAP                               = "8G"
        KEY_CACHE_MB                       = "1024"
        COMPACTION_THROUGHPUT_MB           = "8"
        STREAM_THROUGHPUT_MEGABITS_PER_SEC = "400"
        ENABLE_HINTED_HANDOFF              = "true"
      }
      kill_timeout = "300s"
      resources {
        cpu    = 10000
        memory = 18432
        network {
          mbits = 100
          port "rpc" {
            static = 9160
          }
          port "gossip" {
            static = 7000
          }
        }
      }
    }
  }
  group "cassandra-3.51" {
    constraint {
      attribute = "$${attr.unique.network.ip-address}"
      value     = "10.10.3.51"
    }
    restart {
      attempts = 10
      delay    = "30s"
      interval = "30m"
      mode     = "delay"
    }
    task "cassandra-3-51" {
      driver = "docker"
      config {
        image = "ecr-proxy.service.xxx.internal/internal/cassandra:master-20220105-39f4ce4a0"
        port_map = {
          rpc    = 9160
          gossip = 7000
        }
        network_mode = "host"
        logging {
          type = "gelf"
          config {
            gelf-address = "udp://$${node.unique.name}:12201"
            tag = "cassandra"
          }
        }
        volumes = [
          "/mnt/ebs/cassandra/:/srv/var/"
        ]
      }
      service {
        name = "cassandra"
        port = "rpc"
      }
      env {
        CLUSTER_NAME                       = "${datacenter}"
        SEEDS                              = "10.10.1.50,10.10.2.50,10.10.2.51,10.10.3.52"
        LISTEN_ADDRESS                     = "$${NOMAD_IP_rpc}"
        HEAP                               = "8G"
        KEY_CACHE_MB                       = "1024"
        COMPACTION_THROUGHPUT_MB           = "8"
        STREAM_THROUGHPUT_MEGABITS_PER_SEC = "400"
        ENABLE_HINTED_HANDOFF              = "true"
      }
      kill_timeout = "300s"
      resources {
        cpu    = 10000
        memory = 18432
        network {
          mbits = 100
          port "rpc" {
            static = 9160
          }
          port "gossip" {
            static = 7000
          }
        }
      }
    }
  }
  group "cassandra-3.52" {
    constraint {
      attribute = "$${attr.unique.network.ip-address}"
      value     = "10.10.3.52"
    }
    restart {
      attempts = 10
      delay    = "30s"
      interval = "30m"
      mode     = "delay"
    }
    task "cassandra-3-52" {
      driver = "docker"
      config {
        image = "ecr-proxy.service.xxx.internal/internal/cassandra:master-20220105-39f4ce4a0"
        port_map = {
          rpc    = 9160
          gossip = 7000
        }
        network_mode = "host"
        logging {
          type = "gelf"
          config {
            gelf-address = "udp://$${node.unique.name}:12201"
            tag = "cassandra"
          }
        }
        volumes = [
          "/mnt/ebs/cassandra/:/srv/var/"
        ]
      }
      service {
        name = "cassandra"
        port = "rpc"
      }
      env {
        CLUSTER_NAME                       = "${datacenter}"
        SEEDS                              = "10.10.1.50,10.10.2.50,10.10.2.51,10.10.3.51"
        LISTEN_ADDRESS                     = "$${NOMAD_IP_rpc}"
        HEAP                               = "8G"
        KEY_CACHE_MB                       = "1024"
        COMPACTION_THROUGHPUT_MB           = "8"
        STREAM_THROUGHPUT_MEGABITS_PER_SEC = "400"
        ENABLE_HINTED_HANDOFF              = "true"
      }
      kill_timeout = "300s"
      resources {
        cpu    = 10000
        memory = 18432
        network {
          mbits = 100
          port "rpc" {
            static = 9160
          }
          port "gossip" {
            static = 7000
          }
        }
      }
    }
  }
}

Expected Behavior

terraform plan and terraform apply do not continually report a difference in nomad_job attributes such as allocation_ids and region.

Actual Behavior

Running terraform plan followed by terraform apply constantly shows a change in the nomad_job resource.

First plan, followed by apply:

# module.cassandra_stage.nomad_job.cassandra_stage will be updated in-place
  ~ resource "nomad_job" "cassandra_stage" {
      ~ allocation_ids          = [
          - "d7ead8f3-6668-316b-2593-d8e3c02a725d",
          - "01820573-0c0d-0321-9a32-bd8966f6e366",
          - "874ec969-80c1-a344-1f3c-0d174ed9ec02",
          - "0c7e412d-cc7d-832b-e1d9-d4f345104d37",
          - "02d16058-65d5-3a68-502d-051a58c4b6e9",
          - "ad8a8744-f776-3636-bd4c-31ee82db23b4",
        ] -> (known after apply)
        id                      = "cassandra_stage"
      ~ modify_index            = "5289351" -> (known after apply)
        name                    = "cassandra_stage"
      ~ region                  = "global" -> (known after apply)
        # (8 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

After applying, second plan:

# module.cassandra_stage.nomad_job.cassandra_stage will be updated in-place
  ~ resource "nomad_job" "cassandra_stage" {
      ~ allocation_ids          = [
          - "d7ead8f3-6668-316b-2593-d8e3c02a725d",
          - "01820573-0c0d-0321-9a32-bd8966f6e366",
          - "874ec969-80c1-a344-1f3c-0d174ed9ec02",
          - "0c7e412d-cc7d-832b-e1d9-d4f345104d37",
          - "02d16058-65d5-3a68-502d-051a58c4b6e9",
          - "ad8a8744-f776-3636-bd4c-31ee82db23b4",
        ] -> (known after apply)
        id                      = "cassandra_stage"
      ~ modify_index            = "5289351" -> (known after apply)
        name                    = "cassandra_stage"
      ~ region                  = "global" -> (known after apply)
        # (8 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Steps to Reproduce

  1. terraform apply && terraform apply

lgfa29 commented Dec 1, 2022

Hi @kian 👋

I have not been able to reproduce this issue. Could you test with a smaller job and see if the problem still happens?

goatmale commented Dec 22, 2022

I'm having a similar issue, but with periodic jobs. Some thoughts I had on the possible cause:

  1. I'm not sure if it's because we restart allocations quite frequently, or because the state keeps track of allocation IDs, but maybe this is part of the issue.
  2. We are also using HCL2; maybe that is related to the issue?

Here is an example job:

job "uat.util.some.job.name" {
  region      = "someregion"
  datacenters = ["SOMEDC"]
  type        = "batch"
  meta {
    description = "Some job."
  }


  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "somehost"
  }


  periodic {
    cron             = " 0 15 7 1 JAN ? 2099"
    prohibit_overlap = true
    time_zone        = "America/Chicago"
  }

  group "uat.util.some.job.name.group" {
    network {
    }
    restart {
      attempts = "10"
      interval = "15m"
      delay    = "30s"
      mode     = "delay"
    }

    task "uat.util.some.job.name.task" {


      artifact {
        source      = "git:somerepohere"
        destination = "local/repo"
        options {
          ref    = "someref"
          sshkey = "somekey"
        }
      }

      driver = "raw_exec"


      config {
        command = "/bin/bash"
        args    = ["-c", "bash local/somescript.sh"]
      }
    }
  }
}

Here is the resulting plan:

  # nomad_job.uat_util_some_job_name will be updated in-place
  ~ resource "nomad_job" "uat_util_some_job_name" {
      ~ allocation_ids          = [] -> (known after apply)
        id                      = "uat.util.some.job.name"
      ~ modify_index            = "181804" -> (known after apply)
        name                    = "uat.util.some.job.name"

We are using Nomad 1.4.1
Terraform v1.3.5
terraform-provider-nomad_v1.4.19_x5

arcenik commented Jan 5, 2023

Hi,

I have the same issue. I think allocation_ids should be ignored, as it is managed by Nomad.

~ resource "nomad_job" "this" {
      ~ allocation_ids          = [
          - "4992e7e2-9b6f-adce-a2e4-fd007d042eaf",
          - "09c0473b-9adf-ae2b-85e4-724dfdc324cb",
          - "ed1a45ed-f3a3-e722-1fe5-20ac966047f3",
          - "31bbd497-401f-9cb9-986d-1e31860810b3",
          - "96a4f90c-b675-265a-ee23-208b9d476a14",
          - "2b63d096-ea89-bddc-50d6-151d02b187aa",
          - "65864468-b42c-8bb4-f802-7a0bf7edc221",
          - "880f23ed-8000-92e0-beee-11127f54e8e0",
          - "d138bed9-1cfa-1030-8a51-34162ccda3e9",
        ] -> (known after apply)
        id                      = "cassandra-main"
      ~ modify_index            = "1572279" -> (known after apply)
        name                    = "cassandra-main"
        # (9 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

the-nando commented Apr 22, 2023

I've run into the same issue on a running system and I've tracked it down to an inconsistency between the job spec in the state (left) and the one on the filesystem (right):
[Screenshot, 2023-04-22: job spec in the Terraform state (left) vs. on the filesystem (right)]
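The gist of the difference (reconstructed from the repro steps below, since the screenshot itself isn't reproduced here):

# job spec as recorded in the Terraform state
variable "image" {
  type = string
}

# job spec on the filesystem
variable "image" {
  type    = string
  default = "amazon/aws-efs-csi-driver:v1.4.7"
}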
Default values were added to the job spec after the initial apply and, because they are overridden by the variables passed in by the provider, they aren't persisted in the state on update. Subsequent plan/apply runs then trigger the behaviour described by the OP.

The reference to allocation_ids and modify_index is a red herring, as both are computed fields:

// We know that applying changes here will change the modify index
// _somehow_, but we won't know how much it will increment until
// after we complete registration.
d.SetNewComputed("modify_index")
// similarly, we won't know the allocation ids until after the job registration eval
d.SetNewComputed("allocation_ids")

To reproduce the issue, with:

Terraform v1.4.5
on darwin_arm64
+ provider registry.terraform.io/hashicorp/nomad v1.4.20

  • Create a Nomad job file with a variable and a matching resource definition that sets that variable:
variable "image" {
  type    = string
}
[...]
resource "nomad_job" "efs-nodes" {
  hcl2 {
    enabled = true
    vars = {
      datacenters   = join(",", var.datacenters)
      region        = var.region
      namespace     = var.namespace
      image         = var.driver_image
    }
  }
  jobspec = file("${path.module}/efs-nodes.hcl")
}
  • Run an apply
  • Run another apply (no changes as expected)
  • Update the job spec with a default value for the variable:
variable "image" {
  type    = string
  default = "amazon/aws-efs-csi-driver:v1.4.7"
}
  • Run an apply
Terraform will perform the following actions:

  # module.nomad-efs.nomad_job.efs-nodes will be updated in-place
  ~ resource "nomad_job" "efs-nodes" {
      ~ allocation_ids          = [] -> (known after apply)
        id                      = "csi-efs-nodes"
      ~ modify_index            = "461678" -> (known after apply)
        name                    = "csi-efs-nodes"
        # (9 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.
  • Run another apply
Terraform will perform the following actions:

  # module.nomad-efs.nomad_job.efs-nodes will be updated in-place
  ~ resource "nomad_job" "efs-nodes" {
      ~ allocation_ids          = [] -> (known after apply)
        id                      = "csi-efs-nodes"
      ~ modify_index            = "461678" -> (known after apply)
        name                    = "csi-efs-nodes"
        # (9 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Interestingly enough, if I remove the setting of the variable from the hcl2.vars block of the resource, the job spec and state are updated as expected (a sketch of the trimmed resource block follows the plan output below). It looks like something is off with the way the computed state is persisted after the HCL2 templating has run.

Terraform will perform the following actions:

  # module.nomad-efs.nomad_job.efs-nodes will be updated in-place
  ~ resource "nomad_job" "efs-nodes" {
      ~ allocation_ids          = [] -> (known after apply)
        id                      = "csi-efs-nodes"
      ~ jobspec                 = <<-EOT
          - variable "image" { type = string }
          + variable "image" {
          +   type    = string
          +   default = "amazon/aws-efs-csi-driver:v1.4.7"
          + }
            variable "region" { type = string }
            variable "datacenters" { type = string }
[...]
        EOT
      ~ modify_index            = "461678" -> (known after apply)
        name                    = "csi-efs-nodes"
        # (8 unchanged attributes hidden)

      ~ hcl2 {
          ~ vars     = {
              - "image"         = "amazon/aws-efs-csi-driver:v1.5.4" -> null
                # (5 unchanged elements hidden)
            }
            # (2 unchanged attributes hidden)
        }
    }
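For reference, this is roughly the trimmed resource block being described above, i.e. the earlier snippet with only the image entry dropped from hcl2.vars (a sketch, not the exact file):

resource "nomad_job" "efs-nodes" {
  hcl2 {
    enabled = true
    vars = {
      datacenters = join(",", var.datacenters)
      region      = var.region
      namespace   = var.namespace
      # "image" is intentionally no longer set here, so the default declared
      # in the job spec ("amazon/aws-efs-csi-driver:v1.4.7") takes effect.
    }
  }
  jobspec = file("${path.module}/efs-nodes.hcl")
}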

stevecn commented Apr 22, 2023

Hi @lgfa29,
For me, this problem can be reproduced easily by removing (or adding) a blank line in the Nomad job file.

ttaghavi commented Jun 21, 2023

Hi @lgfa29: For me, this problem can be reproduced easily by removing (or adding) a blank line in the Nomad job file.

I see similar behaviour, but when adding comments to the jobspec file.

  • apply a new jobspec
  • add a comment to the jobspec
  • apply again
  • Terraform shows it will do an in-place update of the job
  • I am guessing nothing actually changes (as Nomad ignores comments)
  • the Terraform state is not updated with the changed jobspec content (the added comments)
  • thus apply keeps reporting changes it needs to make.

It looks like the comments are ignored by Nomad, so Nomad reports "nothing changed", but Terraform would still need to update the state with the comments/lines that Nomad ignores.
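As a minimal illustration (the job name and values are invented), a comment-only change alters the raw jobspec text that Terraform stores, but not the job Nomad actually registers:

job "example" {
  datacenters = ["dc1"]
  type        = "batch"

  # NOTE: comment added after the first apply. Nomad drops comments when it
  # parses the jobspec, so the cluster reports no change, while the raw text
  # in the Terraform state still differs from the file on disk.

  group "example" {
    task "example" {
      driver = "raw_exec"
      config {
        command = "/bin/true"
      }
    }
  }
}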

lgfa29 commented Jul 26, 2023

Thanks for all the extra info, everyone.

The analysis from @the-nando, @stevecn, and @ttaghavi about the whitespace or variable default changes seems like the root cause. #356 uses a semantic jobspec diff to prevent problems like these.

I'm also planning on deprecating the allocation_ids field. While not the root cause, it is a weird attribute that smells more like a data source.
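If you consume allocation_ids today, a data source lookup is the more natural shape. A sketch, assuming the nomad_allocations data source planned for provider 2.x (its name and schema here are assumptions, not confirmed API):

# Hypothetical: look up a job's allocations via a data source instead of the
# computed allocation_ids attribute on nomad_job.
data "nomad_allocations" "cassandra" {
  filter = "JobID == \"cassandra_stage\""
}

output "cassandra_allocation_ids" {
  value = data.nomad_allocations.cassandra.allocations[*].id
}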

lgfa29 added this to the 2.0.0 milestone on Aug 15, 2023