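This tutorial series covers the main use cases for txtai, an AI-powered semantic search platform. Each chapter in the series has accompanying code, which can also be run in Colab.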
This article gives an overview of txtai and shows how to run similarity searches.
Install txtai and all dependencies.
pip install txtai
An Embeddings instance is the main entry point for txtai. It defines the method used to tokenize a section of text and convert it into an embeddings vector.
from txtai.embeddings import Embeddings
# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
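As a rough intuition for why turning text into vectors is useful: once each text is a vector, "similar meaning" can be approximated by "small angle between vectors". The sketch below uses cosine similarity on made-up 3-dimensional vectors. The vectors and the `toy_vectors` name are assumptions for illustration only; real model vectors have hundreds of dimensions and come from the model, not from hand-picked numbers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration only (NOT output of any model)
toy_vectors = {
    "lottery win": [0.9, 0.1, 0.0],
    "lucky": [0.8, 0.2, 0.1],
    "ice shelf collapse": [0.0, 0.2, 0.9],
}

# Score every other text against the "lucky" vector and keep the closest one
query = toy_vectors["lucky"]
scores = {text: cosine(query, vec) for text, vec in toy_vectors.items() if text != "lucky"}
best = max(scores, key=scores.get)
print(best)
```

With these hand-picked numbers the "lucky" vector points in nearly the same direction as "lottery win" and almost at right angles to "ice shelf collapse", which is the kind of relationship a trained model is expected to produce on its own.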
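The Embeddings instance relies on the underlying transformers model to build text embeddings. The following example shows how to use a transformers-backed Embeddings instance to run a similarity search over a list of unrelated concepts.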
data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Get index of the section that best matches the query
    # similarity result format: list of (index, score), best match first
    uid = embeddings.similarity(query, data)[0][0]

    print("%-20s %s" % (query, data[uid]))
# ----------------------------------- Results -------------------------
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story  US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky                Maine man wins $1M from $25 lottery ticket
dishonest junk       Make huge profits without work, earn up to $100,000 a day
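The example above shows that for almost all of the queries, the query text itself does not appear anywhere in the list of text sections. This is the real strength of transformer models over token-based search, and it works out of the box.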
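To make the contrast with token-based search concrete, here is a minimal sketch of a naive keyword matcher run over the same headlines and queries. The `tokens` and `keyword_matches` helpers are hypothetical names written for this illustration and are not part of txtai. With this simple lowercase, alphanumeric tokenizer, no query shares even one token with any headline, so a pure keyword lookup has nothing to rank.

```python
import re

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

queries = ("feel good story", "climate change", "public health story", "war",
           "wildlife", "asia", "lucky", "dishonest junk")

def tokens(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_matches(query, texts):
    """Return indices of texts that share at least one token with the query."""
    query_tokens = tokens(query)
    return [i for i, text in enumerate(texts) if query_tokens & tokens(text)]

for query in queries:
    # Prints the list of matching indices for each query
    print("%-20s %s" % (query, keyword_matches(query, data)))
```

Note that "war" does not match "warns" here because whole tokens are compared. A stemmer or substring match would change that, but it would be matching on spelling rather than meaning.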
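For small lists of text, the method above works. For larger document repositories, however, it makes no sense to tokenize every document and convert it to embeddings on each query. txtai supports building precomputed indexes, which significantly improves performance.
Building on the previous example, the example below runs the index method to build and store the text embeddings. With an index in place, only the query is converted to an embeddings vector on each search.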
# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print("%-20s %s" % (query, data[uid]))
# ----------------------------------- Results -------------------------
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story  US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky                Maine man wins $1M from $25 lottery ticket
dishonest junk       Make huge profits without work, earn up to $100,000 a day
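The saving from a precomputed index can be seen by counting how many texts get embedded under each strategy. The sketch below is only an illustration: `toy_embed` is a hypothetical stand-in that builds a character-frequency vector, and the short `texts` and `queries` lists are made up. It is not how txtai or a transformer model computes embeddings, but the call counts follow the same pattern.

```python
import math
from collections import Counter

stats = {"calls": 0}  # counts how many texts get embedded

def toy_embed(text):
    """Hypothetical stand-in for a model: a character-frequency vector."""
    stats["calls"] += 1
    return Counter(text.lower())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

texts = ["ice shelf collapses", "lottery ticket wins", "virus cases rise"]
queries = ["ice", "lottery", "virus"]

# Strategy 1: embed every text again on every query
stats["calls"] = 0
for q in queries:
    qv = toy_embed(q)
    best = max(range(len(texts)), key=lambda i: cosine(qv, toy_embed(texts[i])))
per_query_calls = stats["calls"]

# Strategy 2: embed the texts once up front, then only the query per search
stats["calls"] = 0
index = [toy_embed(t) for t in texts]
for q in queries:
    qv = toy_embed(q)
    best = max(range(len(index)), key=lambda i: cosine(qv, index[i]))
indexed_calls = stats["calls"]

print(per_query_calls, indexed_calls)
```

With N texts and Q queries, the first strategy embeds Q * (N + 1) texts and the second embeds N + Q, so the gap widens quickly as the document collection grows.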
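Embeddings indexes can be saved to disk and reloaded. Note that calling index again rebuilds the whole index from scratch; incremental changes go through the upsert and delete operations covered further down.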
embeddings.save("index")
embeddings = Embeddings()
embeddings.load("index")
uid = embeddings.search("climate change", 1)[0][0]
print(data[uid])
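Embeddings indexes support updates and deletes. The upsert operation inserts new data and updates existing data.
The following section runs a query, then updates a value so that the top result changes, and finally deletes the updated value to revert to the original query result.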
# Run initial query
uid = embeddings.search("feel good story", 1)[0][0]
print("Initial: ", data[uid])
# Update data
data[0] = "See it: baby panda born"
embeddings.upsert([(0, data[0], None)])
uid = embeddings.search("feel good story", 1)[0][0]
print("After update: ", data[uid])
# Remove record just added from index
embeddings.delete([0])
# Ensure value matches previous value
uid = embeddings.search("feel good story", 1)[0][0]
print("After delete: ", data[uid])
# ----------------------------------- Results -------------------------
Initial: Maine man wins $1M from $25 lottery ticket
After update: See it: baby panda born
After delete: Maine man wins $1M from $25 lottery ticket
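The upsert and delete semantics used above can be pictured with a plain dictionary keyed by id. This is a toy model only: the `store`, `upsert`, and `delete` names below are hypothetical helpers for this illustration and say nothing about how txtai stores its index internally.

```python
# Toy id -> text store; a hypothetical stand-in, not txtai's internal structure
store = {}

def upsert(store, rows):
    """Insert rows with new ids and overwrite rows whose id already exists."""
    for uid, text in rows:
        store[uid] = text

def delete(store, ids):
    """Remove the given ids and return the ones that were actually present."""
    return [uid for uid in ids if store.pop(uid, None) is not None]

# Ids 0 and 4 are new, so both rows are inserted
upsert(store, [(0, "US tops 5 million confirmed virus cases"),
               (4, "Maine man wins $1M from $25 lottery ticket")])

# Id 0 already exists, so its text is replaced rather than duplicated
upsert(store, [(0, "See it: baby panda born")])
updated_text = store[0]

# Id 0 is removed; id 99 was never stored and is skipped
removed = delete(store, [0, 99])
print(updated_text, removed, store)
```

This mirrors the walkthrough above: once id 0 is gone, the only remaining candidate for "feel good story" is the lottery headline, which is why the search falls back to the original result.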
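Embeddings supports two methods for creating text vectors: the sentence-transformers library and word embeddings vectors. Both methods have their merits; see the original tutorial series for a comparison: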
https://dev.to/neuml/tutorial-series-on-txtai-ibg