resilience4j Retry: Source Code Analysis and Retry Metrics Collection

屠和洽

2023-12-01

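Preface

Requirements

To guard against transient network jitter, failed calls need to be retried. Once the retries reach the configured threshold, an alert should be sent so that the problem gets a timely response.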

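Technology selection

Library	Sync / async	Declarative calls (annotations)	Built-in monitoring
resilience4j-retry	Sync	Yes	Yes
Guava Retry	Sync	No	No; monitoring statistics can be implemented with a listener
Spring Retry	Sync	Yes	No; monitoring statistics can be implemented with a listener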

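Based on this comparison, we chose resilience4j-retry, mainly for two reasons:

It provides monitoring data out of the box, and that data integrates cleanly with Prometheus.
Besides retry, resilience4j offers the same kinds of capabilities as Hystrix, including circuit breaker, bulkhead, rate limiter, and cache. It also ships a Spring Boot integration dependency, which greatly reduces the integration cost. (A later migration from Hystrix to resilience4j can be considered.)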

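Questions

How is resilience4j-retry integrated into a project, and how is it used?
How can the retry interval be customized?
How does resilience4j-retry work internally?
How is the monitoring data collected, and how does Prometheus scrape it?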

Analysis

How to use resilience4j-retry

Add the resilience4j-spring-boot2 dependency with Maven

<dependency>
		<groupId>io.github.resilience4j</groupId>
		<artifactId>resilience4j-spring-boot2</artifactId>
		<version>1.7.1</version>
</dependency>

Configure the retry instance

# "sendConfirmEmail" corresponds to the name attribute of the @Retry annotation
resilience4j.retry.instances.sendConfirmEmail.max-attempts=3

Add the @Retry annotation to the method that should be retried

@Retry(name = "sendConfirmEmail", fallbackMethod = "sendConfirmEmailFallback")
public void sendConfirmEmail(SsoSendConfirmEmailDTO ssoSendConfirmEmail) {
    // method body omitted
    throw new ServiceException("send confirm email error");
}

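Define the fallbackMethod
4.1 Keep in mind that the fallbackMethod must be placed in the same class and must have the same method signature, plus one extra parameter for the target exception
4.2 If there are multiple fallbackMethod methods, the one with the closest match is invoked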
```
public void sendConfirmEmailFallback(SsoSendConfirmEmailDTO ssoSendConfirmEmail, ServiceException e) {
    // send a notification email
}
```
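The "closest match" rule in 4.2 can be pictured as a walk up the exception's class hierarchy. The following is a minimal stdlib-only sketch under that assumption; ClosestFallbackResolver and its methods are hypothetical names for illustration and are not the library's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper, not part of resilience4j: picks the handler whose
// exception type is nearest to the thrown exception in the class hierarchy.
class ClosestFallbackResolver {

    private final Map<Class<?>, Function<Throwable, String>> handlers = new HashMap<>();

    ClosestFallbackResolver register(Class<? extends Throwable> type,
                                     Function<Throwable, String> handler) {
        handlers.put(type, handler);
        return this;
    }

    // Walk from the thrown type up through its superclasses and use the first
    // registered handler, so the nearest ancestor wins. Returns an empty
    // Optional when no handler matches.
    Optional<String> handle(Throwable thrown) {
        for (Class<?> c = thrown.getClass(); c != null; c = c.getSuperclass()) {
            Function<Throwable, String> handler = handlers.get(c);
            if (handler != null) {
                return Optional.of(handler.apply(thrown));
            }
        }
        return Optional.empty();
    }
}
```

With handlers registered for both Exception and IllegalStateException, a thrown IllegalStateException (or one of its subclasses) is routed to the IllegalStateException handler, while any other Exception falls through to the generic one.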

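Customizing the retry interval

By default, retries use a fixed interval. If the interval should grow with each attempt, for example 1s -> 2s -> 3s, a custom interval function is needed.

Implement the IntervalBiFunction interface in a custom interval class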

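import io.github.resilience4j.core.IntervalBiFunction;
import io.vavr.control.Either; // Either comes from Vavr in resilience4j 1.7.x

import java.time.Duration;

public class SendEmailIntervalBiFunction implements IntervalBiFunction<Integer> {

    private final Duration waitDuration = Duration.ofSeconds(1);

    // Linear backoff: the wait in milliseconds is the attempt number times 1s.
    @Override
    public Long apply(Integer numOfAttempts, Either<Throwable, Integer> either) {
        return numOfAttempts * waitDuration.toMillis();
    }
}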

Point the configuration at the custom interval class
3.1 The custom interval class is loaded by name via Class.forName

resilience4j.retry.instances.sendConfirmEmail.interval-bi-function=com.xxx.xxx.retry.SendEmailIntervalBiFunction
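To make the resulting schedule concrete, here is a small stdlib-only sketch of the same arithmetic. It assumes, based on the throwOrSleepAfterException source quoted in this article, that no wait happens after the final attempt. LinearIntervalSchedule and its methods are hypothetical names and do not depend on resilience4j.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Stdlib-only sketch that mirrors the arithmetic of SendEmailIntervalBiFunction.
// Class and method names are hypothetical.
class LinearIntervalSchedule {

    // Wait before the next attempt, given how many attempts have already failed.
    static long waitMillis(int failedAttempts, Duration base) {
        return failedAttempts * base.toMillis();
    }

    // All waits for a call that keeps failing. Assumes there is no wait after
    // the final attempt, so maxAttempts attempts produce maxAttempts - 1 waits.
    static List<Long> waits(int maxAttempts, Duration base) {
        List<Long> result = new ArrayList<>();
        for (int failed = 1; failed < maxAttempts; failed++) {
            result.add(waitMillis(failed, base));
        }
        return result;
    }
}
```

Under that assumption, max-attempts=3 with a 1s base gives waits of 1000 ms and 2000 ms; a 3000 ms wait would only appear with max-attempts=4 or higher.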

resilience4j-retry source code analysis

Create a test method for debugging

@RunWith(SpringRunner.class)
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
public class RetryTest {

    @Resource
    private UserApiService userApiService;

    @Test
    public void testRetryThreeTimes() throws InterruptedException {
        SsoSendConfirmEmailDTO ssoSendConfirmEmailDTO = null;
        userApiService.sendConfirmEmail(ssoSendConfirmEmailDTO);
    }
}

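Retry aspect: RetryAspect intercepts classes or methods marked with the @Retry annotation
2.1 Creates the Retry implementation, RetryImpl, from the name of the @Retry annotation
2.2 Creates a FallbackMethod from the fallbackMethod of the @Retry annotation (the matching method is looked up by reflection from the method, its arguments, and the exception)
2.3 Performs the retry (the work is ultimately done by the retry implementation: RetryImpl#executeCheckedSupplier)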

@Around(value = "matchAnnotatedClassOrMethod(retryAnnotation)", argNames = "proceedingJoinPoint, retryAnnotation")
public Object retryAroundAdvice(ProceedingJoinPoint proceedingJoinPoint,
    @Nullable Retry retryAnnotation) throws Throwable {
    // Create the Retry implementation (RetryImpl) by name ---> Retry retry = retryRegistry.retry(backend)
    io.github.resilience4j.retry.Retry retry = getOrCreateRetry(methodName, backend);

    // Create the FallbackMethod from the fallbackMethod of @Retry --> FallbackMethod#create
    FallbackMethod fallbackMethod = FallbackMethod
        .create(fallbackMethodValue, method, proceedingJoinPoint.getArgs(),
            proceedingJoinPoint.getTarget());

    // Retry handling: RetryAspect#proceed --> eventually triggers RetryImpl#executeCheckedSupplier
    return fallbackDecorators.decorate(fallbackMethod,
        () -> proceed(proceedingJoinPoint, methodName, retry, returnType)).apply();
}

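Retry handling

Core method: Retry#decorateCheckedSupplier (a do...while(true) loop)
1.1 Obtain the retry context: RetryImpl$ContextImpl
1.2 Invoke the business method annotated with @Retry
1.3 Handle the result (and, if an exception was thrown, handle the exception)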

static <T> CheckedFunction0<T> decorateCheckedSupplier(Retry retry,
                                                       CheckedFunction0<T> supplier) {
    return () -> {
        // Obtain the retry context: RetryImpl$ContextImpl
        Retry.Context<T> context = retry.context();
        do {
            try {
                // Invoke the business method annotated with @Retry
                T result = supplier.apply();
                final boolean validationOfResult = context.onResult(result);
                if (!validationOfResult) {
                    context.onComplete();
                    return result;
                }
            } catch (Exception exception) {
                context.onError(exception);
            }
        } while (true);
    };
}

Retry handling after an exception: RetryImpl$ContextImpl#onError
2.1 If the exception is retryable, the retry is handled by RetryImpl$ContextImpl#throwOrSleepAfterException

private void throwOrSleepAfterException() throws Exception {
    int currentNumOfAttempts = numOfAttempts.incrementAndGet();
    Exception throwable = lastException.get();
    // If the number of attempts has reached the limit, rethrow the exception
    if (currentNumOfAttempts >= maxAttempts) {
        failedAfterRetryCounter.increment();
        publishRetryEvent(
            () -> new RetryOnErrorEvent(getName(), currentNumOfAttempts, throwable));
        throw throwable;
    } else {
        // Still within the retry limit: sleep for the configured interval
        waitIntervalAfterFailure(currentNumOfAttempts, Either.left(throwable));
    }
}
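The loop and the threshold check described above can be combined into one simplified, stdlib-only sketch. It is not the library's code: SimpleRetry, its constructor parameters, and the injectable sleeper are hypothetical, and it retries on every Exception instead of consulting a retryable-exception predicate.

```java
import java.util.concurrent.Callable;
import java.util.function.IntToLongFunction;
import java.util.function.LongConsumer;

// Stdlib-only sketch of a retry loop; all names here are hypothetical.
class SimpleRetry {

    private final int maxAttempts;
    private final IntToLongFunction intervalMillis; // failed attempts -> wait in ms
    private final LongConsumer sleeper;             // injectable so tests need not sleep

    SimpleRetry(int maxAttempts, IntToLongFunction intervalMillis, LongConsumer sleeper) {
        this.maxAttempts = maxAttempts;
        this.intervalMillis = intervalMillis;
        this.sleeper = sleeper;
    }

    <T> T execute(Callable<T> call) throws Exception {
        int attempts = 0;
        while (true) {
            try {
                return call.call(); // success: leave the loop
            } catch (Exception e) {
                attempts++;
                if (attempts >= maxAttempts) {
                    throw e; // attempts exhausted: rethrow the last exception
                }
                // Still within the limit: wait, then loop for another attempt.
                sleeper.accept(intervalMillis.applyAsLong(attempts));
            }
        }
    }
}
```

In real use the sleeper would wrap Thread.sleep and handle InterruptedException; passing it in keeps the loop itself easy to test without waiting.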

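Retry data collection

Purpose of the data: the stability of a service can be judged from how many calls succeeded after retry, failed after retry, succeeded without retry, and failed without retry

During retry handling, the statistics are stored in fields of RetryImpl
2.1 RetryImpl$ContextImpl#onComplete updates succeededAfterRetryCounter, failedAfterRetryCounter, and succeededWithoutRetryCounter
2.2 RetryImpl$ContextImpl#onError updates failedWithoutRetryCounter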

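// Number of calls that succeeded after retrying
private final LongAdder succeededAfterRetryCounter;
// Number of calls that failed after retrying (still failing once the limit was reached)
private final LongAdder failedAfterRetryCounter;
// Number of calls that succeeded without any retry
private final LongAdder succeededWithoutRetryCounter;
// Number of calls that failed without any retry (the exception was not retryable)
private final LongAdder failedWithoutRetryCounter;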
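As a rough illustration of how call outcomes map onto these four counters, here is a stdlib-only sketch. RetryStats, its methods, and the failure ratio are hypothetical and simplified; the real bookkeeping lives in RetryImpl as described above.

```java
import java.util.concurrent.atomic.LongAdder;

// Stdlib-only sketch of the four outcome counters; RetryStats is a
// hypothetical name and the classification is a simplified assumption.
class RetryStats {

    final LongAdder succeededAfterRetry = new LongAdder();
    final LongAdder failedAfterRetry = new LongAdder();
    final LongAdder succeededWithoutRetry = new LongAdder();
    final LongAdder failedWithoutRetry = new LongAdder();

    // failedAttempts: number of attempts that failed before the call succeeded.
    void recordSuccess(int failedAttempts) {
        if (failedAttempts > 0) {
            succeededAfterRetry.increment();
        } else {
            succeededWithoutRetry.increment();
        }
    }

    // retried: false when the exception was not retryable, so the call
    // failed on its first attempt without any retry.
    void recordFailure(boolean retried) {
        if (retried) {
            failedAfterRetry.increment();
        } else {
            failedWithoutRetry.increment();
        }
    }

    // Share of calls that ultimately failed; 0.0 when nothing was recorded.
    double failureRatio() {
        long failed = failedAfterRetry.sum() + failedWithoutRetry.sum();
        long total = failed + succeededAfterRetry.sum() + succeededWithoutRetry.sum();
        return total == 0 ? 0.0 : (double) failed / total;
    }
}
```

A ratio like this is one possible input for the alerting mentioned in the requirements; the threshold to alert on is left to the reader.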

Scraping retry data with Prometheus
3.1 Add the Prometheus-related dependencies and expose the scrape endpoint

<dependency>
	<groupId>io.micrometer</groupId>
	<artifactId>micrometer-registry-prometheus</artifactId>
	<version>1.7.1</version>
</dependency>
<dependency>
	<groupId>io.micrometer</groupId>
	<artifactId>micrometer-core</artifactId>
	<version>1.7.1</version>
</dependency>

3.2 Configure actuator to expose the Prometheus scrape endpoint
3.2.1 Prometheus scrape endpoint: PrometheusScrapeEndpoint#scrape
3.2.2 A request to /actuator/prometheus triggers the collection: AbstractRetryMetrics#registerMetrics

management.server.port=9099
management.endpoint.health.show-details=always
management.endpoints.web.exposure.include=health,prometheus
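Once the endpoint is exposed, the scrape output contains many unrelated metrics. A small stdlib-only helper like the sketch below can pick out the retry lines when checking the output by hand. It assumes the retry metrics are exposed under the name prefix resilience4j_retry; that prefix and the RetryMetricLines class are assumptions to verify against the actual output of your service.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Stdlib-only sketch; RetryMetricLines is a hypothetical helper.
class RetryMetricLines {

    // Assumption: retry metrics appear under this name prefix in the scrape output.
    static final String PREFIX = "resilience4j_retry";

    // Keeps only the sample lines for retry metrics. Comment lines
    // (# HELP, # TYPE) start with '#', so the prefix check drops them too.
    static List<String> extract(String scrapeBody) {
        return Arrays.stream(scrapeBody.split("\\R"))
                .filter(line -> line.startsWith(PREFIX))
                .collect(Collectors.toList());
    }
}
```

The scrape body itself can be fetched with any HTTP client from /actuator/prometheus on the management port configured above.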

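Closing remarks

As I understand it, retries can only address network failures; retrying does not fix business exceptions
If the call is triggered by a page interaction, retrying this way lengthens the interaction time (which is not acceptable)
2.1 Adding the @Async annotation makes the retried method asynchronous so the page does not wait (but what happens if the application crashes before the method runs?)
Everyone is welcome to join the discussion and suggest better solutions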
