**Describe the bug**
While the Nacos cluster was running, one of the PODs hit `Failed operation in LogStorage`, which eventually brought the whole cluster down so it could no longer serve requests.
**Expected behavior**
The cluster keeps running normally.
**Actual behavior**
The cluster went down several times while running.
**How to Reproduce**
1. One of the PODs in the cluster reported:
```
org.rocksdb.RocksDBException: While fdatasync: /home/nacos/data/protocol/raft/naming_service_metadata/log/000156.log: Bad file descriptor
    at org.rocksdb.RocksDB.put(Native Method)
    at org.rocksdb.RocksDB.put(RocksDB.java:591)
    at com.alipay.sofa.jraft.storage.impl.RocksDBLogStorage.saveFirstLogIndex(RocksDBLogStorage.java:291)
    at com.alipay.sofa.jraft.storage.impl.RocksDBLogStorage.truncatePrefix(RocksDBLogStorage.java:563)
    at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:527)
    at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:496)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
    at java.lang.Thread.run(Thread.java:748)
2021-11-13 08:17:21,734 ERROR Fail to truncatePrefix 403.
org.rocksdb.RocksDBException: While fdatasync: /home/nacos/data/protocol/raft/naming_service_metadata/log/000156.log: Bad file descriptor
    at org.rocksdb.RocksDB.deleteRange(Native Method)
    at org.rocksdb.RocksDB.deleteRange(RocksDB.java:1991)
    at com.alipay.sofa.jraft.storage.impl.RocksDBLogStorage.lambda$truncatePrefixInBackground$2(RocksDBLogStorage.java:584)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2021-11-13 08:17:21,739 ERROR Encountered an error=Status[EIO<1014>: Failed operation in LogStorage] on StateMachine com.alibaba.nacos.core.distributed.raft.NacosStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
```
(A hedged sketch of the `onError` hook this log recommends implementing is shown after this list.)
2. As the cluster kept running, other nodes gradually developed problems as well, with no obvious errors logged, until the cluster finally became unavailable.
3. Below is the timeline of one Nacos cluster failure:
> 5 Nacos nodes in total, nacos-0 ~ nacos-4
![image](https://user-images.githubusercontent.com/30346811/141773262-ef8c4d10-6fa9-4088-bb29-754ea9ff0ae7.png)
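The second ERROR line above appears to be printed by JRaft's default `StateMachineAdapter.onError`, which explicitly asks implementors to override it. Below is a minimal, hypothetical sketch of what such an override could look like; this is not Nacos' actual `NacosStateMachine`, and the alerting behavior is an assumption:

```java
import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.core.StateMachineAdapter;
import com.alipay.sofa.jraft.error.RaftException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Hypothetical state machine illustrating the onError hook the JRaft
 * log above recommends implementing. Not Nacos' actual code.
 */
public class AlertingStateMachine extends StateMachineAdapter {

    private static final Logger LOG = LoggerFactory.getLogger(AlertingStateMachine.class);

    @Override
    public void onApply(Iterator iter) {
        // Drain committed entries; a real state machine applies them here.
        while (iter.hasNext()) {
            iter.next();
        }
    }

    @Override
    public void onError(RaftException e) {
        // Keep JRaft's default logging of the failure.
        super.onError(e);
        // Once a LogStorage error (e.g. EIO from RocksDB) is reported,
        // this node stops participating in raft until repaired, so
        // surface the failure loudly instead of only logging it, e.g.
        // fail the pod's health check so K8S restarts or reschedules it.
        LOG.error("Raft stopped on this node, status={}", e.getStatus(), e);
    }
}
```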
**Desktop (please complete the following information):**
- OS: Ubuntu
- Version: nacos:2.0.3
- Module: naming/config
- SDK: spring-cloud-alibaba-nacos:2021.1
- K8S: V1.17.9
- storage: Azurefile
**Additional context**
**1. Cluster deployment**
The Nacos cluster is deployed from the official **nacos-K8S** template, with only the **storage** part replaced by our existing cloud storage (Azurefile, an NFS-like network storage). **It runs on a cloud-hosted K8S cluster, 5 PODs in total.**
**2. For JRaft's instruction log, could writes fail because of network jitter, cloud-storage performance, or similar issues?** (A probe sketch for this question is at the end of this issue.)
**3. Of the content Nacos mounts, files under the Data directory need read & write access, while files under Logs only need write access. Is that understanding correct?**
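Regarding question 2: NFS/SMB-style network mounts can invalidate open file handles on reconnect, which would surface exactly as the `fdatasync ... Bad file descriptor` in the stack traces above. A minimal, hypothetical probe (the probe path and loop parameters are assumptions) that exercises write + fdatasync on the same Azurefile mount could help tell whether the storage itself is dropping descriptors:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/**
 * Hypothetical probe: repeatedly write and fdatasync a file on the
 * Azurefile mount to see whether the storage ever fails the sync the
 * way RocksDB did in the report above.
 */
public class FdatasyncProbe {

    public static void main(String[] args) throws IOException, InterruptedException {
        // Same volume the raft log lives on (directory taken from the stack trace).
        Path probe = Paths.get("/home/nacos/data/protocol/raft/probe.bin");
        ByteBuffer buf = ByteBuffer.allocate(4096);

        try (FileChannel ch = FileChannel.open(probe,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            for (int i = 0; i < 10_000; i++) {
                buf.clear();
                ch.write(buf);
                // force(false) is the JDK equivalent of fdatasync, the
                // syscall that failed with "Bad file descriptor" above.
                ch.force(false);
                Thread.sleep(10);
            }
        }
        System.out.println("No sync failure observed on this mount.");
    }
}
```

If this probe throws an `IOException` during `force(false)` while the cluster is otherwise idle, that would point at the Azurefile mount rather than at Nacos or JRaft.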