Getting Started with txtai

卫和洽

2023-12-01

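This tutorial series covers the main use cases of txtai, an AI-powered semantic search platform. Each chapter in the series has accompanying code that can also be run in Colab.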

This article gives an overview of txtai and shows how to run similarity searches.

Install dependencies

Install txtai and all of its dependencies.

pip install txtai

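Create an Embeddings instance

The Embeddings instance is the main entry point for txtai. An Embeddings instance defines the method used to tokenize a section of text and convert it into an embeddings vector.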

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
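To make the idea of an embeddings vector concrete, here is a minimal sketch of cosine similarity, a common way to compare two such vectors. The three-dimensional vectors and their names are hypothetical and hand-picked for illustration; a real model produces vectors with hundreds of dimensions, and this is not txtai's own code.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: texts with related meaning point in similar directions
lottery = [0.9, 0.1, 0.0]
jackpot = [0.8, 0.2, 0.1]
iceberg = [0.0, 0.2, 0.9]

related = cosine_similarity(lottery, jackpot)
unrelated = cosine_similarity(lottery, iceberg)
print("lottery vs jackpot: %.3f" % related)
print("lottery vs iceberg: %.3f" % unrelated)
```

A score near 1 means the vectors point the same way, and a score near 0 means they have little in common.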

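Run similarity queries

An Embeddings instance relies on an underlying transformer model to build text embeddings. The following example shows how to use a transformers-backed Embeddings instance to run a similarity search over a list of different concepts.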

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Get index of the section that best matches the query
    uid = embeddings.similarity(query, data)[0][0]

    print("%-20s %s" % (query, data[uid]))

# ----------------------------------- Output -------------------------
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story  US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky                Maine man wins $1M from $25 lottery ticket
dishonest junk       Make huge profits without work, earn up to $100,000 a day

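The example above shows that for almost all of the queries, the words of the query do not appear anywhere in the list of text sections. This is the real advantage of transformer models over token-based search, and it works out of the box.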
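The claim can be checked with a small sketch that counts the words each query shares with the texts. The tokenizer here (lowercase, split on anything that is not a letter or digit) and the helper names are assumptions made for illustration; real keyword search engines tokenize and score differently.

```python
import re

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

queries = ("feel good story", "climate change", "public health story", "war",
           "wildlife", "asia", "lucky", "dishonest junk")

def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# For each query, find the largest number of words it shares with any text
overlaps = {}
for query in queries:
    overlaps[query] = max(len(tokens(query) & tokens(text)) for text in data)
    print("%-20s %d" % (query, overlaps[query]))
```

With this tokenizer every query shares zero words with every text, so a search based on exact token matches would have nothing to rank.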

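Build an embeddings index

For a small list of texts, the method above works. For larger document repositories, however, it makes no sense to tokenize and convert every section to embeddings on each query. txtai supports building precomputed indices, which significantly improves performance.

Building on the previous example, the example below runs the index method to build and store the text embeddings. After that, only the query is converted to an embeddings vector on each search.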

# Create an index for the list of text
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war", "wildlife", "asia", "lucky", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print("%-20s %s" % (query, data[uid]))

# ----------------------------------- Output -------------------------
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story  US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky                Maine man wins $1M from $25 lottery ticket
dishonest junk       Make huge profits without work, earn up to $100,000 a day
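The saving from precomputing can be sketched by counting how often a model would be called under each approach. CountingEmbedder below is a hypothetical stand-in that builds letter-frequency vectors; it is not a txtai class, and the documents and queries are made up for the illustration.

```python
import math

class CountingEmbedder:
    """Toy stand-in for a model: letter-frequency vectors plus a call counter."""

    def __init__(self):
        self.calls = 0

    def embed(self, text):
        self.calls += 1
        vector = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                vector[ord(ch) - ord("a")] += 1.0
        return vector

def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(query_vector, doc_vectors):
    """Return the position of the document vector closest to the query."""
    scores = [cosine(query_vector, vector) for vector in doc_vectors]
    return scores.index(max(scores))

docs = ["ice shelf collapsed", "lottery ticket winner", "bear attack warning"]
queries = ["iceberg", "lottery", "bear"]

# Approach 1: convert every document again for every query
per_query = CountingEmbedder()
results_per_query = [
    best_match(per_query.embed(q), [per_query.embed(d) for d in docs])
    for q in queries
]

# Approach 2: convert the documents once, then only the query on each search
indexed = CountingEmbedder()
doc_vectors = [indexed.embed(d) for d in docs]
results_indexed = [best_match(indexed.embed(q), doc_vectors) for q in queries]

print("embed calls without index:", per_query.calls)
print("embed calls with index:   ", indexed.calls)
```

Both approaches return the same matches, but the first makes queries × (documents + 1) model calls while the second makes documents + queries, a gap that widens as the document store grows.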

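Embeddings load/save

Embeddings indexes can be saved to disk and reloaded. Calling index builds the whole index from scratch; to change an existing index without a full rebuild, use the upsert and delete operations covered in the next section.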

embeddings.save("index")

embeddings = Embeddings()
embeddings.load("index")

uid = embeddings.search("climate change", 1)[0][0]
print(data[uid])

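Embeddings update/delete

Embeddings indexes support updates and deletes. The upsert operation inserts new data and updates existing data.

The following section runs a query, then updates a value so that the top result changes, and finally deletes the updated value to revert to the original query result.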

# Run initial query
uid = embeddings.search("feel good story", 1)[0][0]
print("Initial: ", data[uid])

# Update data
data[0] = "See it: baby panda born"
embeddings.upsert([(0, data[0], None)])

uid = embeddings.search("feel good story", 1)[0][0]
print("After update: ", data[uid])

# Remove record just added from index
embeddings.delete([0])

# Ensure value matches previous value
uid = embeddings.search("feel good story", 1)[0][0]
print("After delete: ", data[uid])

# ----------------------------------- Output -------------------------
Initial:  Maine man wins $1M from $25 lottery ticket
After update:  See it: baby panda born
After delete:  Maine man wins $1M from $25 lottery ticket
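The behaviour above can be mimicked with a small in-memory index keyed by id. ToyIndex and its hand-picked two-dimensional vectors are hypothetical and only illustrate upsert and delete semantics; they say nothing about how txtai stores its data internally.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class ToyIndex:
    """Minimal id -> vector store with upsert, delete and search."""

    def __init__(self):
        self.vectors = {}

    def upsert(self, rows):
        # Insert a new id or overwrite the vector of an existing id
        for uid, vector in rows:
            self.vectors[uid] = vector

    def delete(self, ids):
        # Remove ids that are present and report which ones were removed
        removed = [uid for uid in ids if uid in self.vectors]
        for uid in removed:
            del self.vectors[uid]
        return removed

    def search(self, query_vector, limit=1):
        # Rank the remaining ids by similarity, best first
        scored = [(uid, cosine(query_vector, v)) for uid, v in self.vectors.items()]
        scored.sort(key=lambda row: row[1], reverse=True)
        return scored[:limit]

index = ToyIndex()
index.upsert([(0, [1.0, 0.0]), (4, [0.6, 0.8])])
query = [0.0, 1.0]

initial = index.search(query)[0][0]        # id 4 is closest to the query
index.upsert([(0, [0.0, 1.0])])            # overwrite id 0 with a closer vector
after_update = index.search(query)[0][0]   # id 0 now wins
index.delete([0])                          # id 0 leaves the index
after_delete = index.search(query)[0][0]   # id 4 is the top result again

print(initial, after_update, after_delete)
```

Note that deleting an id only removes it from the index. In the txtai example, data[0] still holds the updated text after the delete; it simply can no longer be returned by a search.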

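Embeddings methods

Embeddings supports two methods for creating text vectors: the sentence-transformers library and word embeddings vectors. Both methods have their merits, as shown below: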

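sentence-transformers
- Creates a single embeddings vector via mean pooling of the vectors generated by the transformers library.
- Supports models stored on the Hugging Face model hub or stored locally.
- See Sentence Transformers for details on how to create custom models, which can be kept local or uploaded to the Hugging Face model hub.
- Base models require significant compute capability (GPU preferred). Smaller, lighter models can be built that trade accuracy for speed.
Word embeddings vectors
- Creates a single embeddings vector via BM25 scoring of each word component. See this Medium article for the logic behind this method.
- Backed by the pymagnitude library. Pre-trained word vectors can be installed from the referenced link.
- See words.py for code that can build word vectors for custom datasets.
- Significantly faster with default models. For larger datasets, it offers a good tradeoff between speed and accuracy.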
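The difference between the two pooling strategies can be sketched in a few lines. The token vectors and weights below are hand-picked assumptions; in practice the vectors come from a model or a word-vector file, and the weights would come from BM25 scores rather than being typed in.

```python
def mean_pool(vectors):
    """Average the token vectors, giving every token the same weight."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def weighted_average(vectors, weights):
    """Average the token vectors, scaled by a per-token weight."""
    total = sum(weights)
    return [
        sum(weight * value for weight, value in zip(weights, column)) / total
        for column in zip(*vectors)
    ]

# Hypothetical 2-dimensional vectors for a three-token text
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical weights standing in for BM25 scores: rare words score high,
# very common words score close to zero
token_weights = [3.0, 1.0, 0.0]

pooled = mean_pool(token_vectors)
weighted = weighted_average(token_vectors, token_weights)
print(pooled)
print(weighted)
```

Mean pooling treats every token alike, while the weighted version lets informative words dominate the final vector and lets common words contribute little or nothing.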

Reference

https://dev.to/neuml/tutorial-series-on-txtai-ibg
