This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

Separate "Kibana server is not ready yet" issue #140

Closed
edmitchellVS opened this issue Jul 25, 2022 · 19 comments

Comments

@edmitchellVS

Hi Team,

I have been off work on annual leave and came back to a broken LME server, and I am getting the status message in the title ("Kibana server is not ready yet"). When I go to url:9200 I get the following, which shows Elasticsearch is running, and I have attached tail log files of all three nodes. They seem to suggest incomplete shards and that I have to enable partial searches to get it working again, though I may be completely off the mark. Can anyone help? I have 33% disk space available, so that rules out disk space as the issue.

https://IP_Address:9200
{
  "name" : "es01",
  "cluster_name" : "loggingmadeeasy-es",
  "cluster_uuid" : "kYlAt-N3RdyKN0dHg7Wohg",
  "version" : {
    "number" : "7.17.4",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "79878662c54c886ae89206c685d9f1051a9d6411",
    "build_date" : "2022-05-18T18:04:20.964345128Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
Elastic log July 22.txt
Kibana log July 22.txt
logatash log July 22.txt

Any help to get this working again would be greatly appreciated.

Cheers

@duncan-ncc
Contributor

Hello @edmitchellVS

Can you please provide the Elasticsearch logs with a larger number in the --tail option?

Thanks.
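For reference, pulling a longer tail from the LME containers looks roughly like this (a sketch only: the service names below assume a default LME Docker Swarm deployment, so check yours first with docker service ls):

    # List the running LME services to confirm their names
    sudo docker service ls

    # Grab the last 500 lines from each service and save them to files
    sudo docker service logs --tail 500 lme_elasticsearch > elasticsearch-tail-500.log 2>&1
    sudo docker service logs --tail 500 lme_kibana > kibana-tail-500.log 2>&1
    sudo docker service logs --tail 500 lme_logstash > logstash-tail-500.log 2>&1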

@edmitchellVS
Author

Thanks for this; --tail 100 logs attached.

ElasticSearch log 100 July 22.txt
logatash log 100 July 22.txt
Kibana log 100 July 22.txt

@duncan-ncc
Contributor

Hello,

Have you attempted to restart Kibana/Elasticsearch/Logstash? If not, I'd suggest that as a good step to ensure that it wasn't just an issue with the boot order, or with Elasticsearch not being available while Kibana was booting.

Thanks.

@edmitchellVS
Author

Hi Duncan,

Apologies, I was out of the office. Yes, I have restarted the containers in that order; however, it is still not working. Are there any more logs I can get?

@edmitchellVS
Author

Also, I can get to the cluster health via the URL and got the following (I have just rebooted but it is still not working):
{
  "cluster_name" : "loggingmadeeasy-es",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 152,
  "active_shards" : 152,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 398,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 6,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 15284,
  "active_shards_percent_as_number" : 27.436823104693143
}

@edmitchellVS
Author

OK, so it seems there are 9 problematic shards that can't be assigned:
{
  "cluster_name" : "loggingmadeeasy-es",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 545,
  "active_shards" : 545,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 9,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.37545126353791
}

{
  "note" : "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
  "index" : "winlogbeat-10.07.2022",
  "shard" : 1,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2022-07-28T08:48:25.343Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [NtN4pZu3R0qYYbyaYuyPFQ]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/usr/share/elasticsearch/data/nodes/0/indices/7vouHeY4SCGQTjXnyyRnAA/1/translog/translog-3.tlog] is corrupted, operation size is corrupted must be [0..59492954] but was: 2065851766]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "yes",
  "allocate_explanation" : "can allocate the shard",
  "target_node" : {
    "id" : "NtN4pZu3R0qYYbyaYuyPFQ",
    "name" : "es01",
    "transport_address" : "10.0.0.4:9300",
    "attributes" : {
      "ml.machine_memory" : "16771620864",
      "xpack.installed" : "true",
      "transform.node" : "true",
      "ml.max_open_jobs" : "512",
      "ml.max_jvm_size" : "11811160064"
    }
  },
  "allocation_id" : "LrkiJI4YQCOaMGK0f_jZHA",
  "node_allocation_decisions" : [
    {
      "node_id" : "NtN4pZu3R0qYYbyaYuyPFQ",
      "node_name" : "es01",
      "transport_address" : "10.0.0.4:9300",
      "node_attributes" : {
        "ml.machine_memory" : "16771620864",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "512",
        "ml.max_jvm_size" : "11811160064"
      },
      "node_decision" : "yes",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "LrkiJI4YQCOaMGK0f_jZHA"
      }
    }
  ]
}
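(The explanation above comes from the allocation explain API; a rough sketch of the call, with the authentication and certificate options left as placeholders to fill in:)

    # Explain a randomly chosen unassigned shard
    curl --insecure -u elastic "https://YOUR_SERVER_IP:9200/_cluster/allocation/explain?pretty"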

I have identified one random shard from the URL. Would it be OK to just delete these unassigned shards one by one, or will it cause further issues?

Sorry for the flurry of posts.

Thanks

@edmitchellVS
Author

OK, after further investigation I now firmly believe this is the shard causing issues, as we noticed the problem on the 18th. There does not seem to be any data associated with the .kibana_task_manager index, which must be the culprit, as LME was still working when the winlogbeat index was created. I will look into the yellow status issues later, but I do not believe they would break LME. Any idea on how to proceed?

red | open | winlogbeat-10.07.2022 | 7vouHeY4SCGQTjXnyyRnAA | 4 | 0 | 777496 | 0 | 685.2mb | 685.2mb

red | open | .kibana_task_manager_7.17.4_001 | m-NhVW-GRl6TVdglxpTXDA | 1 | 0 |   

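(The red indices above can be listed directly with the cat indices API filtered to red health; a sketch, with the authentication and certificate options left as placeholders:)

    # Show only indices whose health is red
    curl --insecure -u elastic "https://YOUR_SERVER_IP:9200/_cat/indices?v&health=red"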

@edmitchellVS
Author

OK, after a great deal of research, learning and pulling my hair out, I have discovered this may not be fixable. It seems the translog is corrupted and we can't fix it because it is containerised (Elasticsearch needs to be stopped to run the elasticsearch-shard utility, and when I stop the service in the Docker container it kicks me out and kills Elasticsearch). Does anyone have any ideas on possible next steps?

  1. Is there any way to run the elasticsearch-shard tool in dockerised Elasticsearch?
  2. Should I just build a new non-Docker node, add it in, then fix it that way? (time consuming but doable)
  3. Can I export the current shards, import them into a temporary non-Docker instance, fix them, then re-import? (I only have VM backups of the Docker server, no snapshots)
  4. Is there any other option I haven't discovered yet?

Thanks

@duncan-ncc
Contributor

Hi @edmitchellVS

I think this is a rather unique situation that hasn't been commented on before for LME, so I would only be able to provide suggestions rather than recommendations.

You could avoid needing to run the elasticsearch-shard tool in Docker, or building a new node, by installing the tool on the Docker host and pointing it (using the --dir option) at the Elasticsearch data store. The Elasticsearch Docker container stores its data in the "/usr/share/elasticsearch/data" volume on the host.

I think that might be the easiest option, to save you having to mount volumes into a new custom Docker container.
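A rough sketch of what that could look like on the Docker host (everything here is an assumption based on a typical LME install: the volume name, host paths and index/shard directory all need checking against your own system, Elasticsearch should be stopped first, and the data directory should be backed up before running the tool):

    # Find where the Elasticsearch data volume actually lives on the host
    # (the volume name "lme_esdata" is an assumption; check with `docker volume ls`)
    sudo docker volume inspect lme_esdata --format '{{ .Mountpoint }}'

    # With Elasticsearch stopped, point a locally installed copy of the tool at the
    # corrupted shard's translog directory under that mountpoint (path is illustrative)
    sudo elasticsearch-shard remove-corrupted-data \
      --dir /var/lib/docker/volumes/lme_esdata/_data/nodes/0/indices/7vouHeY4SCGQTjXnyyRnAA/1/translog/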

Thanks,
Duncan

@edmitchellVS
Author

Hi Duncan,

Thanks for this. So if I run it using the --dir option, I can access the Elasticsearch shards even though the container is down?

I will look in to this shortly and get back to you.

Thanks again for all your help on this.

Cheers

@edmitchellVS
Author

Hi Duncan,

Sorry, this is not possible, as I need to run the elasticsearch-shard tool whilst the service is stopped, which in turn kills the Elasticsearch container. The path specified only exists in the container and not on the host. We are going to try to restore from backup and take the 10 days' data loss hit.

Thanks again for your help on this it is really appreciated.

@edmitchellVS
Author

Hi Duncan,

Quick question: what would happen if I were to delete the corrupted shards? Would it break LME even more, or would it just try to recreate the deleted shards from the translog? If it does try to recreate them, would I get access to Kibana for a period of time before it broke again? I am just thinking that if I can get access to Kibana I could set up a snapshot of the current data, import it into a non-Docker instance, repair the data and re-import it. Is there a command to send via curl XPOST to set up the backup in Elasticsearch?
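(For reference, snapshots are set up through the standard Elasticsearch snapshot APIs; a minimal sketch, assuming a filesystem repository whose path has already been added to path.repo in elasticsearch.yml and mounted into the container. The repository name, path and authentication options below are placeholders:)

    # Register a filesystem snapshot repository
    curl --insecure -u elastic -X PUT "https://YOUR_SERVER_IP:9200/_snapshot/my_fs_backup" \
      -H 'Content-Type: application/json' \
      -d '{ "type": "fs", "settings": { "location": "/mount/backups" } }'

    # Take a snapshot of the cluster into that repository
    curl --insecure -u elastic -X PUT "https://YOUR_SERVER_IP:9200/_snapshot/my_fs_backup/snapshot_1?wait_for_completion=true"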

@adam-ncc
Collaborator

adam-ncc commented Aug 1, 2022

Hi @edmitchellVS, apologies for coming into this thread a bit late. You've hit the nail on the head with regards to the problem, I think: the two red status indices above are in ill health, whereas the yellow SIEM signals indices can be safely ignored, as the yellow status is just from their inability to create a replica (LME being a single-node cluster). The logs you posted earlier also seem to confirm that Kibana's inability to properly access the .kibana_task_manager index is what's causing Kibana to refuse to start. I'm not sure what would have caused the translog corruption you're seeing, but it does seem like recovering these shards without any data loss is unlikely to be simple.

As you've suggested, I'm hoping that deleting the indices themselves should resolve the issue if you are willing to accept the inherent data loss as a result, which may be the quickest solution for you. I can't see any reason why Elasticsearch would try to recreate these from the translog, barring an unknown underlying issue which may cause the problem to re-appear, although this is not something we've been able to test on our end. As far as I'm aware, deleting the .kibana_task_manager_7.17.4_001 index is unlikely to have any significant downsides in your case, and Kibana should recreate it for you the next time it's restarted. You'll lose any saved objects already stored within this index as part of migration, but as it's not currently accessible this is unlikely to cause any data loss in practice.

If you're happy to proceed with this then the delete index API is relatively straightforward and can be done via curl as you suggested (note you'll need to add credentials and the correct root CA file to the command to get it to complete successfully). The documentation is available here: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-delete-index.html

DELETE /.kibana_task_manager_7.17.4_001
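As a sketch only (the certificate path is a placeholder; substitute your own, or use --insecure for a quick test):

    curl -X DELETE "https://YOUR_SERVER_IP:9200/.kibana_task_manager_7.17.4_001" \
      -u elastic \
      --cacert /path/to/root-ca.crt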

Deleting the red status winlogbeat index is likely to have a more significant downside, as obviously you'll lose any log events for that day which were stored in that index, but it might still be something that you're willing to do to get back up and running quickly, depending on how critical the logs for that date are. I suspect you'll be able to get Kibana to boot simply by removing the task manager index, but you may still experience other problems searching and using the ELK stack until the unassigned winlogbeat index is resolved.

If you're unable to lose the data within the winlogbeat index then you may be able to simply delete the broken Kibana index, and then either use the (hopefully) now-loading Kibana, or the relevant Elasticsearch API, to snapshot and restore this specific index; however, the process for restoring only this single index may be somewhat complicated.

@edmitchellVS
Author

Hi Adam,

No problem. I have now deleted all the unassigned shards and the system is now fully functional again (bar 650 MB of data), so thanks to both of you for all your help with this. I feel more knowledgeable now with LME and Docker, and more confident in dealing with issues when they arise.

My next task is to look at DR clustering; is this supported with LME or is it something that may come in the future?

Thanks again

Ed

@divadiow

divadiow commented Aug 3, 2022

hi @edmitchellVS

How did you identify which shard was causing you an issue, ref my new problem #143?

@edmitchellVS
Author

> hi @edmitchellVS
>
> How did you identify which shard was causing you an issue, ref my new problem #143?

Hi Divadiow,

You can use some of the curl commands directly in the browser, e.g. https://YOUR_SERVER_IP:9200/_cluster/health?pretty for cluster health. This URL will show you all the indices and their RAG status: https://YOUR_SERVER_IP:9200/_cat/indices. You will need to authenticate with the elastic account to get access to the API.

You can also use the curl commands in PuTTY by using the --insecure switch and including https:// in the URL rather than http://, as the latter is disabled by X-Pack (I think).
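Concrete examples of what that looks like from a shell (a sketch; --insecure skips certificate checking, so prefer --cacert with the proper root CA where possible):

    # Overall cluster health
    curl --insecure -u elastic "https://YOUR_SERVER_IP:9200/_cluster/health?pretty"

    # All indices with their health (RAG) status
    curl --insecure -u elastic "https://YOUR_SERVER_IP:9200/_cat/indices?v"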

Hope this helps

Ed

@divadiow

divadiow commented Aug 3, 2022

> Hello,
>
> Have you attempted to restart Kibana/Elasticsearch/Logstash? If not, I'd suggest that as a good step to ensure that it wasn't just an issue with the boot order, or with Elasticsearch not being available while Kibana was booting.
>
> Thanks.

thanks @edmitchellVS

my statuses are all yellow or green, so a different issue perhaps, though there are 5 unassigned shards:

{
  "cluster_name" : "loggingmadeeasy-es",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 995,
  "active_shards" : 995,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.5
}

@edmitchellVS
Author

> my statuses are all yellow or green, so a different issue perhaps, though there are 5 unassigned shards

Hi Divadiow,

No problem, glad I could help in some way. It shouldn't really matter if the shards are unassigned, but if you start getting error messages about missing shards when you are looking at your data or running searches in Kibana, then I would delete them or try reassigning them.
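One quick way to see which shards are unassigned and why (a sketch; add credentials and certificate options as above):

    curl --insecure -u elastic "https://YOUR_SERVER_IP:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED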

Cheers

Ed

@edmitchellVS
Author

Thanks Adam and Duncan, issue resolved, closing off now.

Ed
