Separate "Kibana server is not ready yet" issue #140
Comments
Hello @edmitchellVS, can you please provide the elastic logs with a larger number in the --tail option? Thanks.
Thanks for this, tail 100 logs attached: ElasticSearch log 100 July 22.txt
Hello, have you attempted to restart kibana/elasticsearch/logstash? If not, I'd suggest that as a good first step, to ensure it wasn't just an issue with the boot order or Elasticsearch not being available while Kibana was booting. Thanks.
Hi Duncan, apologies, I was out of the office... Yes, I have restarted the containers in that order, however it is still not working. Are there any more logs I can get?
Also, I can get into the cluster health via the URL and got the following (have just rebooted but still not working):
OK, so it seems there are 9 problematic shards that can't be assigned. I have identified 1 random shard from the URL; would it be OK to just delete these unassigned shards 1 by 1, or will it cause further issues? Sorry for the flurry of posts. Thanks
OK, after further investigation I now firmly believe this is the shard causing issues, as we noticed the issue on the 18th. There does not seem to be any data associated with the kibana_task_manager task, which must be the culprit, as LME was still working when the winlogbeat index was created. I will look into the yellow status issues later, but I do not believe they would break LME. Any idea on how to proceed?

red | open | winlogbeat-10.07.2022 | 7vouHeY4SCGQTjXnyyRnAA | 4 | 0 | 777496 | 0 | 685.2mb | 685.2mb
red | open | .kibana_task_manager_7.17.4_001 | m-NhVW-GRl6TVdglxpTXDA | 1 | 0 | |
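The red indices above can also be picked out of the `_cat/indices` output programmatically rather than by eye. A minimal sketch in Python; the first and last sample lines are taken from this thread, while the green line is invented purely for illustration:

```python
# Find red-status indices in the text output of GET /_cat/indices.
# Sample data: two real lines from this thread plus one hypothetical green line.
sample = """\
red open winlogbeat-10.07.2022 7vouHeY4SCGQTjXnyyRnAA 4 0 777496 0 685.2mb 685.2mb
green open .kibana_7.17.4_001 AbCdEfGhIjKlMnOpQrStUv 1 0 120 0 2.1mb 2.1mb
red open .kibana_task_manager_7.17.4_001 m-NhVW-GRl6TVdglxpTXDA 1 0"""

def red_indices(cat_output: str) -> list:
    """Return the names of indices whose health column (first field) is 'red'."""
    names = []
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "red":
            names.append(fields[2])  # third column is the index name
    return names

print(red_indices(sample))
# → ['winlogbeat-10.07.2022', '.kibana_task_manager_7.17.4_001']
```

In practice you would feed in the body of a request to https://YOUR_SERVER_IP:9200/_cat/indices instead of the hard-coded sample.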
OK, after a great deal of research, learning and pulling my hair out, I have discovered this may not be fixable. It seems the translog is corrupted and we can't fix it due to the fact it is containerised (Elastic needs to be stopped to run the elasticsearch-shard utility, and when I stop the service on the docker container it kicks me out and kills Elasticsearch). Does anyone have any ideas on possible next steps?
Thanks
I think this is a rather unique situation that hasn't been commented on before for LME, so I would only be able to provide suggestions rather than recommendations. You could avoid needing to run the elasticsearch-shard tool in docker, or building a new container, by installing the tool on the docker host and pointing it (using the --dir option) at the elasticsearch data store. I think that might be the easiest option, to save you having to mount volumes into a new custom docker container. Thanks,
Hi Duncan, thanks for this, so if I run it using the --dir option I can access the Elasticsearch shards even though the container is down? I will look into this shortly and get back to you. Thanks again for all your help on this. Cheers
Hi Duncan, sorry, this is not possible, as I need to run the elasticsearch-shard tool whilst the service is stopped, which in turn kills the Elasticsearch container. The path specified only exists in the container and not on the host. We are going to try to restore from backup and take the 10 days' data loss hit. Thanks again for your help on this, it is really appreciated.
Hi Duncan, quick question: what would happen if I were to delete the corrupted shards? Would it break LME even more, or would it just try to recreate the deleted shards from the translog? If it does try to recreate them, would I get access to Kibana for a period of time before it broke again? I am just thinking that if I can get access to Kibana I could set up a snapshot of current data, import it to a non-Docker version, repair the data and re-import. Is there a command to send via curl XPOST to set up the backup in Elastic?
Hi @edmitchellVS, apologies for coming into this thread a bit late. You've hit the nail on the head with regards to the problem, I think: the two red status indices above are in ill health, whereas the yellow SIEM signals indices can be safely ignored, as the yellow status is just from their inability to make a replica (as LME is a single node cluster). The logs you posted earlier also seem to confirm that Kibana's inability to properly access the .kibana_task_manager index is what's causing Kibana to refuse to start.

I'm not sure what would have caused the translog corruption you're seeing, but it does seem like recovering these shards without any data loss is unlikely to be simple. As you've suggested, I'm hoping that deleting the indices themselves should resolve the issue if you are willing to accept the inherent data loss as a result, which may be the quickest solution for you. I can't see any reason why Elasticsearch would try to recreate these from the translog, barring an unknown underlying issue which may cause the problem to re-appear, although this is not something we've been able to test on our end.

As far as I'm aware, deleting the .kibana_task_manager_7.17.4_001 index is unlikely to have any significant downsides in your case, and Kibana should recreate it for you the next time it's restarted. You'll lose any saved objects already stored within this index as part of migration, but as it's not currently accessible it's unlikely to cause any data loss in practice. If you're happy to proceed with this then the Delete API is relatively straightforward and can be done via XPOST as you suggested (note you'll need to add creds and the correct root CA file to the command to get it to complete successfully). The documentation is available here: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-delete-index.html.
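For illustration, the Delete Index call described above can be built with nothing but the Python standard library; a sketch, where the host, credentials, and CA path are all placeholders to substitute with your own:

```python
# Build (but do not yet send) an authenticated DELETE request for one index.
# Host, user, and password below are placeholders, not real values.
import base64
import urllib.request

def build_delete_request(host, index, user, password):
    """Return a urllib Request deleting `index`, with HTTP Basic auth attached."""
    req = urllib.request.Request(f"{host}/{index}", method="DELETE")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

req = build_delete_request(
    "https://YOUR_SERVER_IP:9200",
    ".kibana_task_manager_7.17.4_001",
    "elastic", "PASSWORD",
)
# To actually send it you would also supply the root CA mentioned above, e.g.:
#   ctx = ssl.create_default_context(cafile="/path/to/root-ca.crt")
#   urllib.request.urlopen(req, context=ctx)
```

This is just the stdlib equivalent of the curl XPOST approach from the thread; curl with -X DELETE, -u for credentials and --cacert for the root CA achieves the same thing.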
Deleting the red status winlogbeat index is likely to have a more significant downside, as obviously you'll lose any log events for that day which were stored in that index, but it might still be something you're willing to do to get back up and running quickly, depending on how critical the logs for that date are. I suspect you'll be able to get Kibana to boot simply by removing the task manager index, but you may still experience other problems searching and using the ELK stack until the unassigned winlogbeat index is resolved. If you're unable to lose the data within the winlogbeat index then you may be able to simply delete the broken kibana index, and then either use the (hopefully) now loading Kibana, or the relevant Elasticsearch API, to snapshot and restore this specific index; however, the process for restoring only this single index may be somewhat complicated.
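The snapshot-then-restore route for a single index boils down to three API calls. A sketch of the JSON bodies involved; the repository name, location, and snapshot name here are hypothetical, and the exact settings depend on your setup:

```python
# JSON bodies for snapshotting and restoring a single index via the
# Elasticsearch snapshot API. Names and paths are placeholders.
import json

# PUT /_snapshot/my_backup -- register a shared-filesystem repository
register_repo = {"type": "fs", "settings": {"location": "/mount/backups"}}

# PUT /_snapshot/my_backup/snap1?wait_for_completion=true -- snapshot one index
create_snapshot = {
    "indices": "winlogbeat-10.07.2022",
    "include_global_state": False,
}

# POST /_snapshot/my_backup/snap1/_restore -- restore only that index
restore = {"indices": "winlogbeat-10.07.2022"}

print(json.dumps(create_snapshot))
```

Each body would be sent with curl (plus the credentials and root CA flags mentioned above) against the corresponding endpoint.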
Hi Adam, no problem, I have now deleted all the unassigned shards and the system is now fully functional again (bar 650mb of data). So thanks to both of you for all your help with this. I feel more knowledgeable now with LME and Docker, and more confident in dealing with issues when they arise. My next task is to look at DR clustering; is this supported with LME or is it something that may come in the future? Thanks again Ed
How did you identify which shard was causing you an issue, ref my new problem #143?
Hi Divadiow, you can use some of the curl commands directly in the browser, e.g. https://YOUR_SERVER_IP:9200/_cluster/health?pretty. This URL will show you all the shards and their RAG status: https://YOUR_SERVER_IP:9200/_cat/indices. You will need to authenticate with the elastic account to get access to the API. You can also use the curl commands in PuTTY by using the --insecure switch and including https:// in the URL instead of http, as the latter is disabled in xpack (I think). Hope this helps Ed
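For what it's worth, the --insecure curl switch Ed mentions corresponds to disabling certificate verification while keeping the connection encrypted. A stdlib Python sketch of the same thing (YOUR_SERVER_IP is a placeholder; only do this when you trust the network, since the server's identity is not checked):

```python
# Build an SSL context equivalent to `curl --insecure`: TLS is still used,
# but the server certificate and hostname are not verified.
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False          # must be disabled before verify_mode
ctx.verify_mode = ssl.CERT_NONE     # skip certificate validation entirely

url = "https://YOUR_SERVER_IP:9200/_cluster/health?pretty"
# body = urllib.request.urlopen(url, context=ctx).read()  # auth also required
```

Passing this context to urllib.request.urlopen would let the request through a self-signed certificate, just as --insecure does for curl.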
Thanks @edmitchellVS, my statuses are all yellow or green. A different issue perhaps, though there are 5 unassigned shards.
Hi Divadiow, no problem, glad I could help in some way... It shouldn't really matter if the shards are unassigned, but if you start getting error messages about missing shards when you are looking at your data or running searches in Kibana, then I would delete or try reassigning them. Cheers Ed
Thanks Adam and Duncan, issue resolved, closing off now. Ed
Hi Team,
I have been off work on AL and came back to a broken LME server, and I am getting the status message as above. When I go to URL:9200 I get the following, which shows Elastic is running, and I have attached tail log files of all three nodes. It seems to suggest incomplete shards and that I have to enable partial searches to get it working again, though I may be completely off the mark. Can anyone help? I have 33% disk space available, so that rules out disk space as the issue.
https://IP_Address:9200
{
  "name" : "es01",
  "cluster_name" : "loggingmadeeasy-es",
  "cluster_uuid" : "kYlAt-N3RdyKN0dHg7Wohg",
  "version" : {
    "number" : "7.17.4",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "79878662c54c886ae89206c685d9f1051a9d6411",
    "build_date" : "2022-05-18T18:04:20.964345128Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
Elastic log July 22.txt
Kibana log July 22.txt
logatash log July 22.txt
Any help to get this working again would be greatly appreciated.
Cheers