Overall Structure
To get a sense of the overall structure of the delay-server source, let's follow the code from initialization. The initialization work lives in the startup package, which contains a single class, ServerWrapper; combined with the previous article, this one class basically reveals the structure of the delay source. delay-server is built on Netty. The init method performs the initialization (default port 20801, heartbeat, wheel, and so on). The register method sends a request to meta-server to obtain this node's role (delay) and starts the heartbeat with meta-server. The startServer method starts the HashWheel ticking, resumes message_log replay from where it last stopped, and brings up the Netty server. We also know from the groundwork that QMQ runs one master, one slave, and one backup; the sync method listens on a port to answer sync-pull requests, and if the node is a slave it additionally starts sync-pulling from the master. Once all of that is done, the online method runs, marking the delay server online and serving. To sum up two key points: QMQ communicates over Netty, and it deploys as one master, one slave, one backup.
Storage
We touched on storage earlier: when delay-server receives a delayed message, it appends it sequentially to message_log, then replays message_log to generate schedule_log. So there are two things to examine: how message_log is stored, and how schedule_log is generated.
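Summarized as a comment block (reconstructed from the description above), the write path is:
// producer -> delay-server Receiver
//   -> sequential append to message_log
//   -> MessageLogReplayer replays message_log
//   -> schedule_log (records grouped by schedule time into DelaySegments)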
message_log
Generating message_log is actually simple: a sequential append. The main logic is in the class qunar.tc.qmq.delay.receiver.Receiver; the rough flow is deserializing QMQ's custom protocol and then storing each individual message, which happens in doInvoke:
private void doInvoke(ReceivedDelayMessage message) {
// ...
try {
// Note: this is where the append to message_log happens
ReceivedResult result = facade.appendMessageLog(message);
offer(message, result);
} catch (Throwable t) {
error(message, t);
}
}
The delay storage-layer logic all lives in the facade class, including initialization-time work such as message validation; the message_log operations themselves are in messageLog.
@Override
public AppendMessageRecordResult append(RawMessageExtend record) {
AppendMessageResult<Long> result;
// Note: the current latest segment
LogSegment segment = logManager.latestSegment();
if (null == segment) {
segment = logManager.allocNextSegment();
}
if (null == segment) {
return new AppendMessageRecordResult(PutMessageStatus.CREATE_MAPPED_FILE_FAILED, null);
}
// Note: the actual append is performed by messageAppender
result = segment.append(record, messageAppender);
switch (result.getStatus()) {
case MESSAGE_SIZE_EXCEEDED:
return new AppendMessageRecordResult(PutMessageStatus.MESSAGE_ILLEGAL, null);
case END_OF_FILE:
if (null == logManager.allocNextSegment()) {
return new AppendMessageRecordResult(PutMessageStatus.CREATE_MAPPED_FILE_FAILED, null);
}
return append(record);
case SUCCESS:
return new AppendMessageRecordResult(PutMessageStatus.SUCCESS, result);
default:
return new AppendMessageRecordResult(PutMessageStatus.UNKNOWN_ERROR, result);
}
}
// Now look at the appender; here you can also see the definition of QMQ's delay message format
private class DelayRawMessageAppender implements MessageAppender<RawMessageExtend, Long> {
private final ReentrantLock lock = new ReentrantLock();
private final ByteBuffer workingBuffer = ByteBuffer.allocate(1024);
@Override
public AppendMessageResult<Long> doAppend(long baseOffset, ByteBuffer targetBuffer, int freeSpace, RawMessageExtend message) {
// the lock has little impact here
lock.lock();
try {
workingBuffer.clear();
final String messageId = message.getHeader().getMessageId();
final byte[] messageIdBytes = messageId.getBytes(StandardCharsets.UTF_8);
final String subject = message.getHeader().getSubject();
final byte[] subjectBytes = subject.getBytes(StandardCharsets.UTF_8);
final long startWroteOffset = baseOffset + targetBuffer.position();
final int recordSize = recordSizeWithCrc(messageIdBytes.length, subjectBytes.length, message.getBodySize());
if (recordSize > config.getSingleMessageLimitSize()) {
return new AppendMessageResult<>(AppendMessageStatus.MESSAGE_SIZE_EXCEEDED, startWroteOffset, freeSpace, null);
}
workingBuffer.flip();
if (recordSize != freeSpace && recordSize + MIN_RECORD_BYTES > freeSpace) {
// padding: fill the remaining free space with an empty record
workingBuffer.limit(freeSpace);
workingBuffer.putInt(MESSAGE_LOG_MAGIC_V1);
workingBuffer.put(MessageLogAttrEnum.ATTR_EMPTY_RECORD.getCode());
workingBuffer.putLong(System.currentTimeMillis());
targetBuffer.put(workingBuffer.array(), 0, freeSpace);
return new AppendMessageResult<>(AppendMessageStatus.END_OF_FILE, startWroteOffset, freeSpace, null);
} else {
int headerSize = recordSize - message.getBodySize();
workingBuffer.limit(headerSize);
workingBuffer.putInt(MESSAGE_LOG_MAGIC_V2);
workingBuffer.put(MessageLogAttrEnum.ATTR_MESSAGE_RECORD.getCode());
workingBuffer.putLong(System.currentTimeMillis());
// note: this is schedule_time, i.e. the delivery time of the delayed message
workingBuffer.putLong(message.getScheduleTime());
// sequence; it should be unique within each brokerGroup
workingBuffer.putLong(sequence.incrementAndGet());
workingBuffer.putInt(messageIdBytes.length);
workingBuffer.put(messageIdBytes);
workingBuffer.putInt(subjectBytes.length);
workingBuffer.put(subjectBytes);
workingBuffer.putLong(message.getHeader().getBodyCrc());
workingBuffer.putInt(message.getBodySize());
targetBuffer.put(workingBuffer.array(), 0, headerSize);
targetBuffer.put(message.getBody().nioBuffer());
final long payloadOffset = startWroteOffset + headerSize;
return new AppendMessageResult<>(AppendMessageStatus.SUCCESS, startWroteOffset, recordSize, payloadOffset);
}
} finally {
lock.unlock();
}
}
}
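Reading doAppend above, the message_log record layout can be written out as:
// message_log record layout (a V2 record), as written by doAppend above:
//   int32  magic            // MESSAGE_LOG_MAGIC_V2
//   int8   attributes       // ATTR_MESSAGE_RECORD (ATTR_EMPTY_RECORD for padding)
//   int64  timestamp        // wall-clock write time, in millis
//   int64  scheduleTime     // delivery time of the delayed message
//   int64  sequence         // unique within a brokerGroup
//   int32  messageIdLength
//   bytes  messageId
//   int32  subjectLength
//   bytes  subject
//   int64  bodyCrc
//   int32  bodySize
//   bytes  body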
That is essentially the storage side of message_log. Next, let's look at how replaying message_log generates schedule_log.
schedule_log
The MessageLogReplayer class is what drives the replay. Now consider a question: when the server restarts, where should replay resume from? QMQ keeps a replay offset, periodically flushed to disk, and on restart replay resumes from that offset. A sketch of such a periodic flush follows; the actual replay loop is the code block after it.
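A minimal sketch of the periodic checkpoint flush, assuming hypothetical names (offsetStore, iterateOffset) and an assumed flush interval; QMQ's actual wiring may differ:
ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
flusher.scheduleAtFixedRate(
        () -> offsetStore.flush(iterateOffset.get()), // persist the replay offset to disk
        1, 1, TimeUnit.SECONDS);                      // interval is an assumption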
final LogVisitor<LogRecord> visitor = facade.newMessageLogVisitor(iterateFrom.longValue());
adjustOffset(visitor);
while (true) {
final Optional<LogRecord> recordOptional = visitor.nextRecord();
if (recordOptional.isPresent() && recordOptional.get() == DelayMessageLogVisitor.EMPTY_LOG_RECORD) {
break;
}
recordOptional.ifPresent((record) -> {
// post the record so it gets stored
dispatcher.post(record);
long checkpoint = record.getStartWroteOffset() + record.getRecordSize();
this.cursor.addAndGet(record.getRecordSize());
facade.updateIterateOffset(checkpoint);
});
}
iterateFrom.add(visitor.visitedBufferSize());
try {
TimeUnit.MILLISECONDS.sleep(5);
} catch (InterruptedException e) {
LOGGER.warn("message log iterate sleep interrupted");
}
Note that besides the offset there is a cursor. It guards against replay failure: when replay runs again after the 5 ms sleep, it resumes from the cursor position, avoiding duplicated messages. Now let's look at the dispatcher.post method:
@Override
public void post(LogRecord event) {
// this appends to schedule_log
AppendLogResult<ScheduleIndex> result = facade.appendScheduleLog(event);
int code = result.getCode();
if (MessageProducerCode.SUCCESS != code) {
LOGGER.error("appendMessageLog schedule log error,log:{} {},code:{}", event.getSubject(), event.getMessageId(), code);
throw new AppendException("appendScheduleLogError");
}
// look at this first
iterateCallback.apply(result.getAdditional());
}
As the code above shows, let's skip over the schedule_log storage for now and see what that callback is all about:
private boolean iterateCallback(final ScheduleIndex index) {
// the schedule (delivery) time
long scheduleTime = index.getScheduleTime();
// this offset is the startOffset, i.e. where this message starts within its delay_segment
long offset = index.getOffset();
// whether to add it to the in-memory HashWheel
if (wheelTickManager.canAdd(scheduleTime, offset)) {
wheelTickManager.addWHeel(index);
return true;
}
return false;
}
What this means: when delay-server receives a message, it checks whether the message needs to be added to the in-memory wheel, to avoid losing messages. Keep this in mind; we will come back to it in the delivery section. Back to the facade.appendScheduleLog method: the schedule_log operations live in scheduleLog:
@Override
public RecordResult<T> append(LogRecord record) {
long scheduleTime = record.getScheduleTime();
// locate the corresponding delaySegment by the schedule time
DelaySegment<T> segment = locateSegment(scheduleTime);
if (null == segment) {
segment = allocNewSegment(scheduleTime);
}
if (null == segment) {
return new NopeRecordResult(PutMessageStatus.CREATE_MAPPED_FILE_FAILED);
}
// the concrete work happens inside append
return retResult(segment.append(record, appender));
}
Note the locateSegment method: it locates the DelaySegment by schedule time. For example, a schedule time of 2019-03-03 16:00:00 maps to the DelaySegment 201903031600. (Note: the code pasted here is not the latest; in the latest code the DelaySegment granularity is configurable, down to the minute.) A sketch of this time-to-segment mapping follows; after it, the concrete work is again done by the appender, shown below.
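A minimal sketch of the mapping, assuming the minute-level segment naming from the example above (201903031600); QMQ's actual locateSegment/resolveSegment may differ:
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Truncate the schedule time to the segment scale (hourly here) and render it
// as a base offset, e.g. 2019-03-03 16:23 -> 201903031600.
static long resolveSegment(long scheduleTimeMillis) {
    return Long.parseLong(Instant.ofEpochMilli(scheduleTimeMillis)
            .atZone(ZoneId.systemDefault())
            .withMinute(0).withSecond(0)
            .format(DateTimeFormatter.ofPattern("yyyyMMddHHmm")));
}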
@Override
public AppendRecordResult<ScheduleSetSequence> appendLog(LogRecord log) {
workingBuffer.clear();
workingBuffer.flip();
final byte[] subjectBytes = log.getSubject().getBytes(StandardCharsets.UTF_8);
final byte[] messageIdBytes = log.getMessageId().getBytes(StandardCharsets.UTF_8);
int recordSize = getRecordSize(log, subjectBytes.length, messageIdBytes.length);
workingBuffer.limit(recordSize);
long scheduleTime = log.getScheduleTime();
long sequence = log.getSequence();
workingBuffer.putLong(scheduleTime);
// the sequence from message_log
workingBuffer.putLong(sequence);
workingBuffer.putInt(log.getPayloadSize());
workingBuffer.putInt(messageIdBytes.length);
workingBuffer.put(messageIdBytes);
workingBuffer.putInt(subjectBytes.length);
workingBuffer.put(subjectBytes);
workingBuffer.put(log.getRecord());
workingBuffer.flip();
ScheduleSetSequence record = new ScheduleSetSequence(scheduleTime, sequence);
return new AppendRecordResult<>(AppendMessageStatus.SUCCESS, 0, recordSize, workingBuffer, record);
}
Here you can also see the schedule_log message format.
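Reconstructed from appendLog above, a schedule_log record is laid out as:
// schedule_log record layout, as written by appendLog above:
//   int64  scheduleTime     // delivery time
//   int64  sequence         // carried over from message_log
//   int32  payloadSize
//   int32  messageIdLength
//   bytes  messageId
//   int32  subjectLength
//   bytes  subject
//   bytes  payload          // log.getRecord()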
Delivery
Delivery lives in the WheelTickManager class: loading schedule_log ahead of time and delivering the wheel's messages when their schedule time arrives are both handled there. The classes that actually perform the delivery are in the sender package.
wheel
The wheel package holds just three class files: HashWheelTimer, WheelLoadCursor, and WheelTickManager. WheelTickManager is the manager that loads the schedule_log files into the wheel and delivers the wheel's messages when they fall due. WheelLoadCursor is the cursor, mentioned in the previous article, marking how far the schedule_log files have been loaded. HashWheelTimer is a helper utility, roughly analogous to Java's ScheduledExecutorService: a timer that delivers delayed messages according to their schedule time. We won't dig further into that utility here; we care more about the MQ logic.
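For intuition only, the ScheduledExecutorService analogy might look like the sketch below; this uses the plain JDK, not HashWheelTimer's API, and index/sender stand in for the schedule index and sender seen elsewhere in this article:
// Analogy: schedule each loaded message to fire at its schedule time, then
// hand it to the sender. HashWheelTimer does this with a hashed timing wheel
// rather than a heap-based scheduler.
ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
long delayMs = index.getScheduleTime() - System.currentTimeMillis();
timer.schedule(() -> sender.send(index), Math.max(0, delayMs), TimeUnit.MILLISECONDS);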
First, loading schedule_log some time in advance. How far in advance? It is configurable: for example, if the schedule_log granularity is configured as 1h and the advance-load time as 30min, then at 2019-02-10 17:30 the 2019021018 schedule_log should be loaded.
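That arithmetic as a short sketch (resolveSegment is the time-to-segment mapping sketched earlier, here at an hourly scale; the 30min value is the configured advance-load time):
// 2019-02-10 17:30 + 30min = 2019-02-10 18:00, which falls in segment 2019021018.
long next = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(30);
long prepareLoadBaseOffset = resolveSegment(next); // -> 2019021018
// the load loop (shown further below) then loads segments up to this base offset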
@Override
public void start() {
if (!isStarted()) {
sender.init();
// the hash wheel timer, i.e. the in-memory wheel
timer.start();
started.set(true);
// recover: resume delivery from where the dispatch log says it last ended
recover();
// the load thread, used to load schedule_log
loadScheduler.scheduleWithFixedDelay(this::load, 0, config.getLoadSegmentDelayMinutes(), TimeUnit.MINUTES);
LOGGER.info("wheel started.");
}
}
The recover method uses the delivery records in the dispatch log to find where the last delivery ended; when delay-server restarts, the wheel resumes delivery from that position.
private void recover() {
LOGGER.info("wheel recover...");
// the latest dispatch log segment
DispatchLogSegment currentDispatchedSegment = facade.latestDispatchSegment();
if (currentDispatchedSegment == null) {
LOGGER.warn("load latest dispatch segment null");
return;
}
int latestOffset = currentDispatchedSegment.getSegmentBaseOffset();
DispatchLogSegment lastSegment = facade.lowerDispatchSegment(latestOffset);
if (null != lastSegment) doRecover(lastSegment);
// recover delivery based on the latest dispatch log segment
doRecover(currentDispatchedSegment);
LOGGER.info("wheel recover done. currentOffset:{}", latestOffset);
}
private void doRecover(DispatchLogSegment dispatchLogSegment) {
int segmentBaseOffset = dispatchLogSegment.getSegmentBaseOffset();
ScheduleSetSegment setSegment = facade.loadScheduleLogSegment(segmentBaseOffset);
if (setSegment == null) {
LOGGER.error("load schedule index error,dispatch segment:{}", segmentBaseOffset);
return;
}
// build a set of the already-delivered records
LongHashSet dispatchedSet = loadDispatchLog(dispatchLogSegment);
// using that set, add the undelivered messages of this dispatch log segment into the wheel
WheelLoadCursor.Cursor loadCursor = facade.loadUnDispatch(setSegment, dispatchedSet, this::refresh);
int baseOffset = loadCursor.getBaseOffset();
// record the cursor
loadingCursor.shiftCursor(baseOffset, loadCursor.getOffset());
loadedCursor.shiftCursor(baseOffset);
}
That is essentially how recovery works. Next, let's see how loading is done.
private void load() {
// load ahead: resolve the delay segment covering now + the configured advance time
long next = System.currentTimeMillis() + config.getLoadInAdvanceTimesInMillis();
int prepareLoadBaseOffset = resolveSegment(next);
try {
// load up to the prepareLoadBaseOffset delay segment
loadUntil(prepareLoadBaseOffset);
} catch (InterruptedException ignored) {
LOGGER.debug("load segment interrupted");
}
}
private void loadUntil(int until) throws InterruptedException {
// the baseOffset the wheel has currently loaded up to
int loadedBaseOffset = loadedCursor.baseOffset();
// have loaded: already loaded up to until, nothing to do
if (loadedBaseOffset > until) return;
do {
// loading failed: break and wait for the next turn
if (!loadUntilInternal(until)) break;
// loaded successfully (no error) but the until delay segment does not exist yet, i.e. the loading cursor < until
if (loadingCursor.baseOffset() < until) {
// block until thresholdTime: if, blockingExitTime ahead of schedule, no messages
// for the until segment have arrived, shift the cursors and exit
long thresholdTime = System.currentTimeMillis() + config.getLoadBlockingExitTimesInMillis();
// exit in a few minutes in advance
if (resolveSegment(thresholdTime) >= until) {
loadingCursor.shiftCursor(until);
loadedCursor.shiftCursor(until);
break;
}
}
// avoid excessive CPU load
Thread.sleep(100);
} while (loadedCursor.baseOffset() < until);
LOGGER.info("wheel load until {} <= {}", loadedCursor.baseOffset(), until);
}
Driven by the configured advance-load time, the in-memory wheel loads schedule_log ahead of schedule. Loading runs inside a while loop that exits only once the until delay segment has been loaded; if that delay segment does not exist yet, the loop exits at the configured blockingExitTime; and each iteration sleeps 100 ms to keep CPU load down.
Why is loading done in a loop? Why sleep 100 ms, and would 500 ms or 1 s do? And why have a blockingExitTime at all? The analysis below answers these questions; two situations matter. First, delay segments may not exist, or may exist with gaps: with an hourly granularity, for instance, 2019031002 and 2019031003 may be missing between 2019031001 and 2019031004. Second, while a delay segment is being loaded, delayed messages may still be landing in that same segment, and those could be lost. That is why loading runs in a loop, and why there are two cursors, the loading cursor and the loaded cursor: one marks what is being loaded, the other what has been loaded. As for the 100 ms sleep per iteration: 500 ms or 1 s would also work; it is only a question of whether your messages can tolerate an extra 500 ms or 1 s of delay.
private boolean loadUntilInternal(int until) {
int index = resolveStartIndex();
if (index < 0) return true;
try {
while (index <= until) {
ScheduleSetSegment segment = facade.loadScheduleLogSegment(index);
if (segment == null) {
int nextIndex = facade.higherScheduleBaseOffset(index);
if (nextIndex < 0) return true;
index = nextIndex;
continue;
}
// here a specific segment actually gets loaded
loadSegment(segment);
int nextIndex = facade.higherScheduleBaseOffset(index);
if (nextIndex < 0) return true;
index = nextIndex;
}
} catch (Throwable e) {
LOGGER.error("wheel load segment failed,currentSegmentOffset:{} until:{}", loadedCursor.baseOffset(), until, e);
QMon.loadSegmentFailed();
return false;
}
return true;
}
private void loadSegment(ScheduleSetSegment segment) {
final long start = System.currentTimeMillis();
try {
int baseOffset = segment.getSegmentBaseOffset();
long offset = segment.getWrotePosition();
if (!loadingCursor.shiftCursor(baseOffset, offset)) {
LOGGER.error("doLoadSegment error,shift loadingCursor failed,from {}-{} to {}-{}", loadingCursor.baseOffset(), loadingCursor.offset(), baseOffset, offset);
return;
}
WheelLoadCursor.Cursor loadedCursorEntry = loadedCursor.cursor();
// have loaded: already loaded past this segment
if (baseOffset < loadedCursorEntry.getBaseOffset()) return;
long startOffset = 0;
// the last load errored: resume loading from the previous position
if (baseOffset == loadedCursorEntry.getBaseOffset() && loadedCursorEntry.getOffset() > -1)
startOffset = loadedCursorEntry.getOffset();
LogVisitor<ScheduleIndex> visitor = segment.newVisitor(startOffset, config.getSingleMessageLimitSize());
try {
loadedCursor.shiftCursor(baseOffset, startOffset);
long currentOffset = startOffset;
// consider that the current delay segment may still be receiving appends, hence the
// while loop; the loaded cursor's offset is also advanced after each loaded message
while (currentOffset < offset) {
Optional<ScheduleIndex> recordOptional = visitor.nextRecord();
if (!recordOptional.isPresent()) break;
ScheduleIndex index = recordOptional.get();
currentOffset = index.getOffset() + index.getSize();
refresh(index);
loadedCursor.shiftOffset(currentOffset);
}
loadedCursor.shiftCursor(baseOffset);
LOGGER.info("loaded segment:{} {}", loadedCursor.baseOffset(), currentOffset);
} finally {
visitor.close();
}
} finally {
Metrics.timer("loadSegmentTimer").update(System.currentTimeMillis() - start, TimeUnit.MILLISECONDS);
}
}
Recall from the previous article: at storage time, if a message lands in the segment currently being loaded by the wheel, that message should be added to the wheel directly.
private boolean iterateCallback(final ScheduleIndex index) {
long scheduleTime = index.getScheduleTime();
long offset = index.getOffset();
// focus on this canAdd
if (wheelTickManager.canAdd(scheduleTime, offset)) {
wheelTickManager.addWHeel(index);
return true;
}
return false;
}
// this is where the cursors come into play
public boolean canAdd(long scheduleTime, long offset) {
WheelLoadCursor.Cursor currentCursor = loadingCursor.cursor();
int currentBaseOffset = currentCursor.getBaseOffset();
long currentOffset = currentCursor.getOffset();
// determine which segment the message falls in by its schedule time
int baseOffset = resolveSegment(scheduleTime);
// earlier than the current loading cursor: put it into the wheel
if (baseOffset < currentBaseOffset) return true;
// the segment currently being loaded
if (baseOffset == currentBaseOffset) {
// decide by the cursor's offset
return currentOffset <= offset;
}
return false;
}
sender
Delivery is grouped by brokerGroup and sent in batches per group; each group sends on its own threads, so groups do not affect one another, and broker selection also takes each broker's real-time weight into account.
@Override
public void send(ScheduleIndex index) {
if (!BrokerRoleManager.isDelayMaster()) {
return;
}
boolean add;
try {
long waitTime = Math.abs(sendWaitTime);
// enqueue
if (waitTime > 0) {
add = batchExecutor.addItem(index, waitTime, TimeUnit.MILLISECONDS);
} else {
add = batchExecutor.addItem(index);
}
} catch (InterruptedException e) {
return;
}
if (!add) {
reject(index);
}
}
@Override
public void process(List<ScheduleIndex> indexList) {
try {
// the send handling logic is in senderExecutor
senderExecutor.execute(indexList, this, brokerService);
} catch (Exception e) {
LOGGER.error("send message failed,messageSize:{} will retry", indexList.size(), e);
retry(indexList);
}
}
// the following comes from senderExecutor
void execute(final List<ScheduleIndex> indexList, final SenderGroup.ResultHandler handler, final BrokerService brokerService) {
// group by broker
Map<SenderGroup, List<ScheduleIndex>> groups = groupByBroker(indexList, brokerService);
for (Map.Entry<SenderGroup, List<ScheduleIndex>> entry : groups.entrySet()) {
doExecute(entry.getKey(), entry.getValue(), handler);
}
}
private void doExecute(final SenderGroup group, final List<ScheduleIndex> list, final SenderGroup.ResultHandler handler) {
// send per group
group.send(list, sender, handler);
}
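groupByBroker itself is not shown above. A minimal sketch of what it plausibly does, assuming a hypothetical loadGroup lookup that picks a broker group for a message (this is not QMQ's literal code):
// Bucket the batch by the broker group chosen for each message.
private Map<SenderGroup, List<ScheduleIndex>> groupByBroker(
        List<ScheduleIndex> indexList, BrokerService brokerService) {
    Map<SenderGroup, List<ScheduleIndex>> groups = new HashMap<>();
    for (ScheduleIndex index : indexList) {
        // hypothetical: resolve a broker group for this subject, weighing broker load
        SenderGroup group = loadGroup(index, brokerService);
        groups.computeIfAbsent(group, k -> new ArrayList<>()).add(index);
    }
    return groups;
}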
As the code shows, delivery is grouped by server broker. Now look at the SenderGroup class.
Each group delivers on its own threads, independently of the others, so one group's server going down will not block delivery for other groups. If some group cannot deliver, the retry will pick a different server broker. At the same time, group selection weighs each server broker's weight, i.e. how much message volume that broker currently has to send.
// where the actual send happens
private void send(Sender sender, ResultHandler handler, BrokerGroupInfo groupInfo, String groupName, List<ScheduleIndex> list) {
try {
long start = System.currentTimeMillis();
// recover the full message content from schedule log
List<ScheduleSetRecord> records = store.recoverLogRecord(list);
QMon.loadMsgTime(System.currentTimeMillis() - start);
// send the messages
Datagram response = sendMessages(records, sender);
release(records);
monitor(list, groupName);
if (response == null) {
// this triggers retries and related handling
handler.fail(list);
} else {
final int responseCode = response.getHeader().getCode();
final Map<String, SendResult> resultMap = getSendResult(response);
if (resultMap == null || responseCode != CommandCode.SUCCESS) {
if (responseCode == CommandCode.BROKER_REJECT || responseCode == CommandCode.BROKER_ERROR) {
// circuit-break this group
groupInfo.markFailed();
}
monitorSendFail(list, groupInfo.getGroupName());
// retry
handler.fail(list);
return;
}
Set<String> failedMessageIds = new HashSet<>();
boolean brokerRefreshed = false;
for (Map.Entry<String, SendResult> entry : resultMap.entrySet()) {
int resultCode = entry.getValue().getCode();
if (resultCode != MessageProducerCode.SUCCESS) {
failedMessageIds.add(entry.getKey());
}
if (!brokerRefreshed && resultCode == MessageProducerCode.BROKER_READ_ONLY) {
groupInfo.markFailed();
brokerRefreshed = true;
}
}
if (!brokerRefreshed) groupInfo.markSuccess();
// dispatch log records are produced here
handler.success(records, failedMessageIds);
}
} catch (Throwable e) {
LOGGER.error("sender group send batch failed,broker:{},batch size:{}", groupName, list.size(), e);
handler.fail(list);
}
}
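handler.success is where dispatch log records are produced, per the comment above. A rough sketch of the idea, with hypothetical names throughout (appendDispatchLog in particular is an assumption about the facade's API, not a confirmed signature):
// For every message that was sent successfully, append a dispatch log record,
// so that recover() can later skip messages that were already delivered.
public void success(List<ScheduleSetRecord> records, Set<String> failedMessageIds) {
    for (ScheduleSetRecord record : records) {
        if (failedMessageIds.contains(record.getMessageId())) {
            continue; // failed messages are retried, not recorded as dispatched
        }
        facade.appendDispatchLog(record); // hypothetical call shape
    }
}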
That is all for this analysis of the QMQ delay-server source. If the opportunity arises, I may dig into the source of QMQ's other modules. Thanks.