spring-cloud-starter-dubbo不设置心跳间隔导致生产者重启no Provider问题记录

版本

spring-cloud-starter-dubbo-2.2.4.RELEASE

问题描述

生产者重启后，正常注册到注册中心，但是消费者调用接口是no provider，偶现，频繁出现

解决办法

先说原因和解决办法，有兴趣可以看下问题的排查过程。

原因

dubbo在建立连接后会起一个任务，检查连接的是否有效，如果已经时间，会重新连接。问题出在时间间隔上面。
从元数据读取heartbeat这个key，如果没有，那么使用默认的60秒，我们项目没有设置这个心跳时间，那么默认就是60秒。
而重试时间间隔默认为这个时间的三倍，也就是3分钟。此时也就问题就已经很明显，重连时间间隔太长。生产者重新启动后，还没有重新建立连接。此时调用DubboMetadataService.getExportedURLs的方法获取服务原数据还使用已经关闭的那个时效的连接，会失败报错

Caused by: org.apache.dubbo.remoting.RemotingException: message can not send, because channel is closed .

表现

查看日志，发现生产者有下线后，消费者会去重连，但是有时能重连成功，有时重连失败
连接失败

连接成功

经过观察，每次生产者启动后，都是因为重连失败会导致No Provider。

解决办法

dubbo:protocol:name: dubboheartbeat: 1000

这里设置1秒，那么3秒会进行一次检查，已经足够了在生产者启动暴露服务期间与生产者建立连接。
此时再查看元数据已经有了timeout

在这里插入图片描述
设置心跳时间后，重启生产者和消费者。问题解决，以后重启生产者不会再出现 No Provider的问题

问题排查过程

服务刷新角度排查

首先排查是不是因为是不是因为生产者注册到nacos服务变动，没有触发消费端的服务刷新。经过排查，正常触发了DubboMetaDataService的服务刷新，也正常触发了Dubbo Invoker的刷新。这俩监听器分别是com.alibaba.cloud.dubbo.registry.DubboCloudRegistry#subscribeDubboMetadataServiceURLs(org.apache.dubbo.common.URL, org.apache.dubbo.registry.NotifyListener)

com.alibaba.cloud.dubbo.registry.DubboCloudRegistry#subscribeURLs(org.apache.dubbo.common.URL, org.apache.dubbo.registry.NotifyListener)
执行顺序上没问题，因为spring cloud alibaba只注册DubboMetadataService到注册中心，消费者需要引用的生产者接口，是用过DubboMetadataService.getExportedURLs，做rpc调用生产者获取到的。因此需要先刷新DubboMetadataService对应的invoker再刷新消费者引用的的那些 invoker
但是在触发获取getExportedURLs时，发现有些情况获取到的结果是空

	private List<URL> getTemplateExportedURLs(URL subscribedURL,List<ServiceInstance> serviceInstances) {DubboMetadataService dubboMetadataService = getProxy(serviceInstances);List<URL> templateExportedURLs = emptyList();if (dubboMetadataService != null) {templateExportedURLs = getExportedURLs(dubboMetadataService, subscribedURL);}else {if (logger.isWarnEnabled()) {logger.warn("The metadata of Dubbo service[key : {}] still can't be found, it could effect the further "+ "Dubbo service invocation",subscribedURL.getServiceKey());}}return templateExportedURLs;}

生产者服务暴露时机排查

消费者正常通过DubboMetadataService.getExportedURL获取服务，返回空。首先怀疑生产者逻辑有问题。
经过排查，生产者保证了首先暴露所有的服务后才注册元数据到注册中心

消费者rpc调用。在生产者DubboMetadataService的实现IntrospectiveDubboMetadataService上断电观察，发现这里返回的数据是没问题的

@Overridepublic String getExportedURLs(String serviceInterface, String group, String version) {List<URL> urls = getRepository().getExportedURLs(serviceInterface, group,version);return jsonUtils.toJSON(urls);}

消费者调用生产者排查

不得不说这个问题真的难查，在不断点的情况下很容易出现，但是加上断点，导致程序执行速度变慢，很难复现。、
最终查看日志发现在生产者重启后的报错

Caused by: org.apache.dubbo.remoting.RemotingException: message can not send, because channel is closed .

但是我生产已经启动了，dubbo端口也起来了。为什么还报这个错。

Caused by: org.apache.dubbo.remoting.RemotingException: message can not send, because channel is closed .

看到这里，下意识猜测是不是因为生产者下线并上线后，消费者用的还是旧链接，而没有重新建立连接。
查看日志，发现生产者有下线后，消费者会去重连，但是有时能重连成功，有时重连失败
连接失败

连接成功

经过观察，每次生产者启动后，都是因为重连失败会导致No Provider。
那么问题就找到了。至于怎么解决，看下这个ReconnectTimerTask的逻辑是怎么样的

public class ReconnectTimerTask extends AbstractTimerTask {private static final Logger logger = LoggerFactory.getLogger(ReconnectTimerTask.class);private final int idleTimeout;public ReconnectTimerTask(ChannelProvider channelProvider, Long heartbeatTimeoutTick, int idleTimeout) {super(channelProvider, heartbeatTimeoutTick);this.idleTimeout = idleTimeout;}@Overrideprotected void doTask(Channel channel) {try {Long lastRead = lastRead(channel);Long now = now();// Rely on reconnect timer to reconnect when AbstractClient.doConnect fails to init the connectionif (!channel.isConnected()) {try {logger.info("Initial connection to " + channel);((Client) channel).reconnect();} catch (Exception e) {logger.error("Fail to connect to " + channel, e);}// check pong at client} else if (lastRead != null && now - lastRead > idleTimeout) {logger.warn("Reconnect to channel " + channel + ", because heartbeat read idle time out: "+ idleTimeout + "ms");try {((Client) channel).reconnect();} catch (Exception e) {logger.error(channel + "reconnect failed during idle time.", e);}}} catch (Throwable t) {logger.warn("Exception when reconnect to remote channel " + channel.getRemoteAddress(), t);}}
}

发现这个任务会检查连接是否有效，如果连接无效，那么会重新连接。
这个任务的执行时机是通过dubbo的时间轮调用的。
关于时间轮的这里不展开了。看下这个定时任务的执行间隔是多少
在HeaderExchangeClient中建立连接后。会开启一个重试连接的任务。

    private void startReconnectTask(URL url) {if (shouldReconnect(url)) {AbstractTimerTask.ChannelProvider cp = () -> Collections.singletonList(HeaderExchangeClient.this);int idleTimeout = getIdleTimeout(url);long heartbeatTimeoutTick = calculateLeastDuration(idleTimeout);this.reconnectTimerTask = new ReconnectTimerTask(cp, heartbeatTimeoutTick, idleTimeout);IDLE_CHECK_TIMER.newTimeout(reconnectTimerTask, heartbeatTimeoutTick, TimeUnit.MILLISECONDS);}}

其中heartbeatTimeoutTick标识了重连检查的时间间隔

String HEARTBEAT_KEY = "heartbeat";
int DEFAULT_HEARTBEAT = 60 * 1000;
public static int getIdleTimeout(URL url) {int heartBeat = getHeartbeat(url);// idleTimeout should be at least more than twice heartBeat because possible retries of client.int idleTimeout = url.getParameter(Constants.HEARTBEAT_TIMEOUT_KEY, heartBeat * 3);if (idleTimeout < heartBeat * 2) {throw new IllegalStateException("idleTimeout < heartbeatInterval * 2");}return idleTimeout;}public static int getHeartbeat(URL url) {return url.getParameter(Constants.HEARTBEAT_KEY, Constants.DEFAULT_HEARTBEAT);}

可以看到超时时间是从，dubbo元数据读取heartbeat这个key，如果没有，那么使用默认的60秒，我们项目没有设置这个心跳时间，那么默认就是60秒。
而重试时间间隔默认为这个时间的三倍，3分钟。此时也就问题就已经很明显，重连时间间隔太长。生产者重新启动后，还没有重新建立连接。此时调用DubboMetadataService.getExportedURLs的方法获取服务原数据会失败，报错

Caused by: org.apache.dubbo.remoting.RemotingException: message can not send, because channel is closed .

等到了时间，重连成功后，又因为此时的nacos中的数据不再变化，不再触发服务变动，导致一直都是No Provider的状态。
那么解决这个办法也很简单，那就是设置心跳时间小一些。

本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若转载，请注明出处：http://www.rhkb.cn/news/181850.html

如若内容造成侵权/违法违规/事实不符，请联系长河编程网进行投诉反馈email:809451989@qq.com，一经查实，立即删除！