Postagesql + JPA 实现中文全文搜索

杜建章

2023-12-01

1.需求

原本平常常用的数据库是用的MySQL和influxdb，但是了解到Postagsql数据库可以进行全文搜索而且比sql语句的like更优后，想用Postagesql做探测信息的Log的查询，这里用这篇文章记录和总结下。

2. 实现

tsvector类型

这个类型是进行数据库保存分词的类型，我们可以把用储存我们进行的分词语句存放到这个数据库类型里面。这里我把search来保存文本分词后的结构。（当然这里仅仅是对于中文分词，英文分词的话不需要进行数据库存放分词数据，只需要在查询的时候用函数操作分词就好。）

create TABLE  asset_log  (
   id  SERIAL PRIMARY KEY,
   biz_id  varchar(50),`在这里插入代码片`
   title  varchar(500),
   status  varchar(50),
   description  varchar(2000),
   search  tsvector,
   asset_bom_biz_id  varchar(50),
   asset_bom_node_biz_id  varchar(50),
   asset_log_template_id  int,
   content  jsonb DEFAULT NULL,
   start_time  timestamp NULL,
   end_time  timestamp NULL,
  CONSTRAINT  fk_iot_asset_log_asset_log_template_id  FOREIGN KEY ( asset_log_template_id ) REFERENCES  asset_log_template  ( id )
);

中文分词函数

中文分词用的是jieba的库，利用jieba直接进行分词。
首先引入依赖

    compile group: 'com.huaban',name: 'jieba-analysis',version: '1.0.2'

然后写工具类方法，注意【JiebaSegmenter.SegMode】的模式有两种，一种是index，一种是seach，seach分词更粗，index更细点，搜索的话，建议选择seach。

		public List<String> segmente(String text, JiebaSegmenter.SegMode mode) {
				JiebaSegmenter segmenter = new JiebaSegmenter();
				List<SegToken> tokens = segmenter.process(text, mode);
				List<String> result = tokens.stream()
						.map(tk -> tk.word)
						.filter(t -> t != null && !t.isBlank())
						.collect(Collectors.toCollection(LinkedList::new));
				return result;
		}

实体类：

实体类的映射是String，本来我以为是PostgreSQLTSVectorType类型，但是反射出错了，改成Sting就好了。

public class AssetLog {
  @Builder.Default
  private String bizId = UUID.randomUUID().toString();
  @Id
  @GeneratedValue(strategy = GenerationType.IDENTITY)
  private Long id;
  private String title;
  private String status;
  private String description;


  private String search ;
  private String assetBomBizId;
  private String assetBomNodeBizId;
  }

Repository

可以看到里面就是用原生sql写对数据库进行更新seach分词的操作，首先用postagsql的to_tsvector函数进行对录入的数据进行分词，然后用setweight函数进行权重的分配，这里我的log主要是两种信息，一种是title，一种是description。把title设为权重A，description设为权重B。|| 符号是用来做拼接的。

public interface AssetLogRepository extends JpaRepository<AssetLog, Long>, JpaSpecificationExecutor<AssetLog> {
		@Modifying
		@Query(value = "UPDATE asset_log\n" +
				"SET search = setweight(to_tsvector('simple', :title), 'A') || setweight(to_tsvector('simple', :description), 'B') WHERE id = :id", nativeQuery = true)
		void segment(@Param("title") String title,@Param("description") String description,@Param("id")Long id) ;
}

Service

插入数据，在插入数据的时候，我们就对录入的数据进行分词处理和权重处理，并且进行保存到seach里面。

		@Transactional
		public AssetLogDTO createAssetLog(String accountBizId, UpsertAssetLogDTO request) {

				if (StringUtils.isBlank(request.getAssetLogTemplateBizId())) {
						throw new BadRequest(ResponseCode.BAD_REQUEST, "AssetLogTemplateId is not null");
				}



				var assetLog = AssetLog.builder()
						.assetLogTemplate(this.getAssetLogTemplateEntity(request.getAssetLogTemplateBizId(), accountBizId))
						.assetBomBizId(request.getAssetBomBizId())
						.assetBomNodeBizId(request.getAssetBomNodeBizId())
						.description(request.getDescription())
						.status(request.getStatus())
						.title(request.getTitle())
						.build();


				if (request.getFormContent() != null) {
						var newJsonForm = new ObjectMapper().createObjectNode();
						if (assetLog.getContent() != null) {
								assetLog.getContent().fields()
										.forEachRemaining(kv -> newJsonForm.set(kv.getKey(), kv.getValue()));
								request.getFormContent().fields()
										.forEachRemaining(kv -> newJsonForm.set(kv.getKey(), kv.getValue()));
						}
				}

/**
						if (request.getAssetLogIotVariables() != null) {
								var assetLogVariableList = this.assetLogIotVariableRepository.findByBizIdIn(request.getAssetLogIotVariables());
								assetLog.setAssetLogIotVariables(assetLogVariableList);
						}
**/
						assetLog = this.assetLogRepository.save(assetLog);
						var id = assetLog.getId() ;
		        Segment  segment= new Segment() ;

		    LOGGER.info(id.toString()) ;
		   var title =  segment.segmente(request.getTitle(), JiebaSegmenter.SegMode.INDEX);
			 String titleString = String.join(" ",title) ;
		   var description =  segment.segmente(request.getDescription(), JiebaSegmenter.SegMode.SEARCH) ;
			 String descriptionString = String.join(" ",description) ;
		    LOGGER.info( titleString ) ;
		    LOGGER.info(descriptionString) ;
			 this.assetLogRepository.segment(titleString,descriptionString, id ) ;

			 var dataList = this.toAssetLogDtoList(accountBizId, Arrays.asList(assetLog));
			 return dataList.size() > 0 ? dataList.get(0) : null;

		}

查询操作
查询操作的时候，我们就进行想查询的数据放到toTsquery函数参数里面对seach字段查询，然后用to_rank函数进行对匹配的评分，当匹配的分数 > 0 的时候就把数据筛选出来。

	public PaginationDTO<List<AssetLogDTO>> getAssetLogList(String accountBizId,
		                                                        String templateBizId, String templateName,
		                                                        String bizIds,
		                                                        LocalDateTime from,
		                                                        LocalDateTime to,
		                                                        String sort,
		                                                        String status,
		                                                        String search,
		                                                        Integer page, Integer size) {

				var data = this.assetLogRepository.findAll(((Specification<AssetLog>) (root, query, builder) -> {
						List<Predicate> predicates = Lists.newArrayList();
						if (StringUtils.isNotBlank(accountBizId)) {
								predicates.add(builder.equal(root.join("assetLogTemplate").get("accountBizId"), accountBizId));
						}

						if (StringUtils.isNotBlank(templateBizId)) {
								predicates.add(builder.equal(root.join("assetLogTemplate").get("bizId"), templateBizId));
						}

						if (StringUtils.isNotBlank(templateName)) {
								predicates.add(builder.equal(root.join("assetLogTemplate").get("bizName"), templateName));
						}

						if (StringUtils.isNotBlank(bizIds)) {
								var bizIdArray = bizIds.split(",");
								CriteriaBuilder.In<String> in = builder.in(root.get("bizId"));
								for (String bizId : bizIdArray) {
										in.value(bizId);
								}
								predicates.add(in);
						}

						if (StringUtils.isNotBlank(status)) {
								predicates.add(builder.or(Arrays.stream(status.split(","))
										.map(str -> builder.equal(root.get("status"), str)).toArray(Predicate[]::new)));
						}

						if (from != null) {
								predicates.add(builder.greaterThan(root.get("startTime"), from));
						}
						if (to != null) {
								predicates.add(builder.lessThan(root.get("startTime"), to));
						}


						if (StringUtils.isNotBlank(search)) {
								LOGGER.info("here");
								List<Predicate> predicatess = Lists.newArrayList();
								//在程序端分词完毕后使用 ‘|’  拼接传入
										Expression toTsquery = builder.function("to_tsquery",String.class,builder.literal("simple"),builder.literal(search)) ;
										Expression toRank = builder.function("ts_rank", PostgreSQLTSVectorType.class,root.get("search"),toTsquery);

										if ("asc".equalsIgnoreCase(sort)) {
												query.orderBy(builder.asc(toRank));
										} else {
												query.orderBy(builder.desc(toRank));
										}

										predicates.add(builder.or(builder.greaterThan(toRank,"0")));
						}


						return builder.and(predicates.toArray(new Predicate[0]));
				}), PageRequest.of(page, size));

				return PaginationDTO.<List<AssetLogDTO>>builder().total(data.getTotalElements())
						.data(this.toAssetLogDtoList(accountBizId, data.getContent())).build();
		}

参考
https://www.skypyb.com/2020/12/jishu/1705/

Over！！！！！！

Postagesql + JPA 实现中文全文搜索

1.需求

2. 实现

tsvector类型

中文分词函数

实体类：

Repository

Service

相关阅读

相关文章

相关问答

相关文档