Kubernetes Troubleshooting

Troubleshoot a Kubernetes-based deployment

The Zus network, which runs on Kubernetes, is set up for System Tests. If a test fails for any reason, the failure needs to be debugged.

  • All alerts for tests are set up in the devops-0chain Slack channel.

  • For test failures, check the GitHub Actions annotations first.

  • System Tests mainly fail in one of two ways:

    • "One or more 0Chain components (listed below) crashed during the test run, therefore the build is NOT STABLE"

      • This error means the deployed 0chain providers crashed/failed/restarted while the System Test was running.

    • "System tests failed. Ensure tests are running against the correct images/branches and rule out any possible code issues before attempting a re-run"

      • The System Tests themselves are failing, so report the failure in the QA channel.

TROUBLESHOOTING

  1. 0chain crashed during the test run -

  • Check which services failed (sharder, miner, blobber, 0box, etc.) - the sketch after this list gives a quick cluster-wide sweep.

  • After confirming the service name, check the GitHub Actions artifacts.

    • Download the crash logs or the particular service's logs from the artifacts.

  • For more detail, log in to Rancher ( https://rancher.dev-[1-9].devnet-0chain.net/ ) and check each service's logs. Read the Rancher doc.

  • For detailed monitoring & logging, check the Grafana & Loki docs.
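
  Before drilling into a specific provider, a quick cluster-wide sweep can show which pods crashed or are restarting. A minimal sketch, assuming kubectl access to the dev cluster (dev-1 is the example namespace used throughout this doc):

    # Pods anywhere in the cluster that are not in a healthy state
    kubectl get po -A | grep -Ev 'Running|Completed'

    # Status and restart counts for one environment's namespace
    kubectl get po -n dev-1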

PROVIDERS TROUBLESHOOTING

  1. Sharder - There are four reasons why a sharder can get stuck or crash -

    1. The sharder panicked while the network was being deployed

    2. The sharder restarted after a successful deployment (see the sketch after this list)

    3. The sharder could not connect to its databases

    4. The sharder was OOMKilled

    In all of the above cases, these Kubernetes commands can be useful -

    • Fetch the sharder pod name - kubectl get po -A | grep sharder

    • Check the sharder logs - kubectl logs << pod_name >> -n << namespace_name >> -c << container_name >>

      example - kubectl logs helm-sharder-01-68ccbd65dd-g6zbw -n dev-1 -c helm-sharder-01

    • For OOMKilled, describe the pod - kubectl describe po << pod_name >> -n << namespace_name >>

      example - kubectl describe po helm-sharder-01-68ccbd65dd-g6zbw -n dev-1

    • For detailed logging, use the Loki query - {tag="<< container_name >>"} |= ``

      example - {tag="helm-sharder-01"} |= ``

    • For a quick check, use the Rancher logs.
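
    For the restart case, the current container's logs may look clean because they begin after the crash. A minimal sketch, reusing the example pod name above, for pulling the logs of the previous (crashed) container instance:

      # Logs from the previous container instance - useful after a restart,
      # since the current container's logs start after the crash
      kubectl logs helm-sharder-01-68ccbd65dd-g6zbw -n dev-1 -c helm-sharder-01 --previous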

  2. Miner - There are four reasons why a miner can get stuck or crash -

    1. The miner panicked while the network was being deployed

    2. The miner restarted after a successful deployment

    3. The miner could not connect to its databases

    4. The miner was OOMKilled (see the sketch after this list)

    In all of the above cases, these Kubernetes commands can be useful -

    • Fetch the miner pod name - kubectl get po -A | grep miner

    • Check the miner logs - kubectl logs << pod_name >> -n << namespace_name >> -c << container_name >>

      example - kubectl logs helm-miner-01-68ccbd65dd-g6zbw -n dev-1 -c helm-miner-01

    • For OOMKilled, describe the pod - kubectl describe po << pod_name >> -n << namespace_name >>

      example - kubectl describe po helm-miner-01-68ccbd65dd-g6zbw -n dev-1

    • For detailed logging, use the Loki query - {tag="<< container_name >>"} |= ``

      example - {tag="helm-miner-01"} |= ``

    • For a quick check, use the Rancher logs.
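
    To confirm an OOMKill without reading the full describe output, a small sketch using kubectl's JSONPath output (the pod name is reused from the example above):

      # Print the last termination reason for each container in the pod;
      # "OOMKilled" here confirms the miner hit its memory limit
      kubectl get po helm-miner-01-68ccbd65dd-g6zbw -n dev-1 \
        -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'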

  3. Blobber - There are three reasons why a blobber can get stuck or crash -

    1. The blobber panicked while the network was being deployed

    2. The blobber restarted after a successful deployment

    3. The blobber could not connect to its databases (see the sketch after this list)

    In all of the above cases, these Kubernetes commands can be useful -

    • Fetch the blobber pod name - kubectl get po -A | grep blobber

    • Check the blobber logs - kubectl logs << pod_name >> -n << namespace_name >> -c << container_name >>

      example - kubectl logs helm-blobber-01-68ccbd65dd-g6zbw -n dev-1 -c helm-blobber-01

    • For detailed logging, use the Loki query - {tag="<< container_name >>"} |= ``

      example - {tag="helm-blobber-01"} |= ``

    • For a quick check, use the Rancher logs.
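
    For the database-connection case, first check whether the blobber's database pod is up at all. A minimal sketch - the "postgres" grep pattern and the same-namespace assumption are guesses, so adjust them to the actual deployment:

      # Look for the blobber's database pod (the "postgres" pattern is an assumption)
      kubectl get po -n dev-1 | grep postgres

      # If the database pod exists, check its readiness and logs
      kubectl describe po << db_pod_name >> -n dev-1
      kubectl logs << db_pod_name >> -n dev-1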

  4. 0Box - There are three reasons why 0box can crash -

    1. 0box panicked while the network was being deployed

    2. 0box restarted after a successful deployment

    3. 0box could not connect to its databases

    In all of the above cases, these Kubernetes commands can be useful -

    • Fetch the 0box pod name - kubectl get po -A | grep zbox

    • Check the 0box logs - kubectl logs << pod_name >> -n << namespace_name >>

      example - kubectl logs helm-zbox-01-68ccbd65dd-g6zbw -n dev-1

    • For detailed logging, use the Loki query (see the filtering sketch after this list) - {tag="<< container_name >>"} |= ``

      example - {tag="helm-zbox"} |= ``

    • For a quick check, use the Rancher logs.
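
    When a Loki stream is noisy, line filters can narrow it down. A small sketch using standard LogQL filter operators on the example stream above:

      # Only lines containing "error"
      {tag="helm-zbox"} |= `error`

      # Case-insensitive match for panics or fatal errors (regex filter)
      {tag="helm-zbox"} |~ `(?i)panic|fatal`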

  5. Authorizer - There are a few reasons why the authorizer can crash -

    1. The authorizer panicked while the network was being deployed

    2. The authorizer restarted after a successful deployment

    In both of the above cases, these Kubernetes commands can be useful -

    • Fetch the authorizer pod name - kubectl get po -A | grep authorizer

    • Check the authorizer logs - kubectl logs << pod_name >> -n << namespace_name >>

      example - kubectl logs helm-authorizer-01-68ccbd65dd-g6zbw -n dev-1

    • For detailed logging, use the Loki query - {tag="<< container_name >>"} |= ``

      example - {tag="helm-authorizer-01"} |= ``

    • For a quick check, use the Rancher logs; the events sketch below can also help.
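
    If the logs alone do not explain the crash, Kubernetes events often record scheduling problems, failed probes, or kills. A hedged sketch, assuming the dev-1 namespace from the examples above:

      # Recent events in the namespace, newest last, filtered for the authorizer
      kubectl get events -n dev-1 --sort-by=.lastTimestamp | grep -i authorizer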
