启用 SQLite JSON1 和 FTS5 扩展
SQLite 3.9 版本以后天加了 JSON1 扩展,并且新增了 FTS5 全文搜索。FTS5 提高了复杂搜索的请求效率,并自带了 BM25 算法实现。
安装
首先确保你的 SQLite 安装了 JSON1 和 FTS5 扩展:
如果没有可以重新安装:brew uninstall --ignore-dependencies sqlite3; brew install sqlite3 --with-json1 --with-ft5
。需要注意的是你的 sqlite3 很可能是安装其它应用时作为依赖安装的,需要添加 --ignore-dependencies
参数来强制卸载,重新安装后并不影响这些应用。另外,Homebrew 中 sqlite 和 sqlite3 都对应的 sqlite.rb 这个 Formular,用任何一个名字都没有影响。
安装 peewee 和 pysqlite 用作例子:pip install -U peewee pysqlite
。
JSON1
首先下载一个 JSON 文件作为实验数据:
import json
import urllib2
fh = urllib2.urlopen('http://media.charlesleifer.com/blog/downloads/misc/blogs.json')
data = json.loads(fh.read())
定制数据库连接并创建内存数据库:
from pysqlite2 import dbapi2 as jqlite # 注意是 pysqlite2 而不是 pysqlite
from peewee import *
from playhouse.sqlite_ext import *
class JSQLiteDatabase(SqliteExtDatabase):
def _connect(self, database, **kw):
conn = jqlite.connect(database, **kw)
conn.isolation_level = None
self._add_conn_hooks(conn)
return conn
db = JSQLiteDatabase(':memory:')
定义类 model 并创建对应表:
class Entry(Model):
data = TextField()
class Meta:
database = db
Entry.create_table()
将下载的 json 数据导入数据库:
with db.atomic():
for entry_json in data:
Entry.create(data=json.dumps(entry_json))
title = fn.json_extract(Entry.data, '$.title')
query = (Entry.select(title.alias('title')).order_by('title').limit(5))
print [row for row in query.dicts()]
# 输出:
# [{'title': u'A Tour of Tagging Schemas: Many-to-many, Bitmaps and More'},
# {'title': u'Alternative Redis-Like Databases with Python'},
# {'title': u'Building the SQLite FTS5 Search Extension'},
# {'title': u'Connor Thomas Leifer'},
# {'title': u'Extending SQLite with Python'}]
实际上我们执行的 SQL 是这样的:
SELECT json_extract("t1"."data", '$.title') AS title
FROM "entry" AS t1
ORDER BY json_extract("t1"."data", '$.title')
LIMIT 5
接下来我们按照 tag 来查询相关文章:
from peewee import Entity
tags_src = fn.json_each(Entry.data, '$.metadata.tags').alias('tags')
tags_ref = Entity('tags')
query = (Entry
.select(title.alias('title'))
.from_(Entry, tags_src)
.where(tags_ref.value == 'sqlite')
.order_by(title))
[row for row, in query.tuples()]
'''
输出:
[u'Building the SQLite FTS5 Search Extension',
u'Extending SQLite with Python',
u'Meet Scout, a Search Server Powered by SQLite',
u'My List of Python and SQLite Resources',
u'Querying Tree Structures in SQLite using Python and the Transitive Closure Extension',
u"Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
u'Web-based SQLite Database Browser, powered by Flask and Peewee']
'''
对应的查询 SQL:
SELECT json_extract("t1"."data", '$.title') AS title
FROM
"entry" AS t1,
json_each("t1"."data", '$.metadata.tags') AS tags
WHERE ("tags"."value" = 'sqlite')
ORDER BY json_extract("t1"."data", '$.title')
再举一个 json_each()
的例子:
query = (Entry
.select(
title.alias('title'),
fn.group_concat(tags_ref.value, ', ').alias('tags'))
.from_(Entry, tags_src)
.group_by(title)
.limit(5))
[row for row in query.tuples()]
'''
输出:
[(u'A Tour of Tagging Schemas: Many-to-many, Bitmaps and More',
u'peewee, sql, python'),
(u'Alternative Redis-Like Databases with Python',
u'python, walrus, redis, nosql'),
(u'Building the SQLite FTS5 Search Extension',
u'sqlite, search, python, peewee'),
(u'Connor Thomas Leifer', u'thoughts'),
(u'Extending SQLite with Python', u'peewee, python, sqlite')]
'''
对应的 SQL:
SELECT
json_extract("t1"."data", '$.title') AS title,
group_concat("tags"."value", ', ') AS tags
FROM
"entry" AS t1,
json_each("t1"."data", '$.metadata.tags') AS tags
GROUP BY json_extract("t1"."data", '$.title')
LIMIT 5
json_tree()
函数和 json_each
类似,不过 json_each()
仅返回特定路径直系后代,而 json_tree()
会迭代整个对象及其所有后代并返回。
tree = fn.json_tree(Entry.data, '$').alias('tree')
parent = fn.json_tree(Entry.data, '$').alias('parent')
tree_ref = Entity('tree')
parent_ref = Entity('parent')
query = (Entry
.select(title.alias('title'))
.from_(Entry, tree, parent)
.where(
(tree_ref.parent == parent_ref.id) &
(parent_ref.key == 'tags') &
(tree_ref.value == 'sqlite'))
.order_by(title))
[title for title, in query.tuples()]
'''
输出:
[u'Building the SQLite FTS5 Search Extension',
u'Extending SQLite with Python',
u'Meet Scout, a Search Server Powered by SQLite',
u'My List of Python and SQLite Resources',
u'Querying Tree Structures in SQLite using Python and the Transitive Closure Extension',
u"Using SQLite4's LSM Storage Engine as a Stand-alone NoSQL Database with Python",
u'Web-based SQLite Database Browser, powered by Flask and Peewee']
'''
对应的 SQL 是这样的:
SELECT json_extract("t1"."data", '$.title') AS title
FROM
"entry" AS t1,
json_tree("t1"."data", '$') AS tree,
json_tree("t1"."data", '$') AS parent
WHERE (
("tree"."parent" = "parent"."id") AND
("parent"."key" = 'tags') AND
("tree"."value" = 'sqlite'))
ORDER BY json_extract("t1"."data", '$.title')
关于 SQLite JSON1 扩展支持的所有函数可以查看这里:The JSON1 Extension。
使用 FTS5
继续使用上面的部分代码和数据。Peewee 2.6.5 添加了 FTS5Model 支持,请确保已经安装了对应了 Peewee。
fts5 扩展要求所有列仅包含非空数据,我们唯一需要指出的是非索引列,表明这列数据仅存储而不支持搜索。
现在我们为 Entry
类添加索引,我们将通过搜索标题来获得对应的 URL,因此我们可以将 url
字段设置为 unindexed=True
:
class EntryIndex(FTS5Model):
title = SearchField()
url = SearchField(unindexed=True)
class Meta:
database = db
extension_options = {
'prefix': [2, 3],
'tokenize': 'porter unicode61',
}
EntryIndex.create_table()
extension_options
字典为 fts5 扩展提供了额外的元数据以确定如何解析单词,以及快速前缀搜索支持的前缀长度。索引表的创建 SQL 如下:
CREATE VIRTUAL TABLE "entryindex" USING fts5 (
"title" ,
"url" UNINDEXED,
prefix='2,3',
tokenize="porter unicode61")
接下来,我们可以借助以下 JSON 函数来为 Entry
表中的数据添加索引:
title = fn.json_extract(Entry.data, '$.title').alias('title')
url = fn.json_extract(Entry.data, '$.url').alias('url')
query = Entry.select(title, url).dicts()
with db.atomic():
for entry in query:
EntryIndex.create(**entry)
创建完索引后,可以这样应用查询:
query = EntryIndex.search('sqlite').limit(3)
for result in query:
print result.title
'''
输出:
Extending SQLite with Python
Building the SQLite FTS5 Search Extension
My List of Python and SQLite Resources
'''
对应的 SQL:
SELECT "t1"."title", "t1"."url"
FROM "entryindex" AS t1
WHERE ("entryindex" MATCH 'sqlite')
ORDER BY rank
查询时还可以通过 score
参数同时获得查询结果的近似度得分:
query = EntryIndex.search('sqlite AND python', with_score=True)
for result in query:
print round(result.score, 3), result.title
'''
输出:
-1.259 Extending SQLite with Python
-1.059 My List of Python and SQLite Resources
-0.838 Querying Tree Structures in SQLite using Python and the Transitive Closure Extension
'''
对应 SQL:
SELECT "t1"."title", "t1"."url", rank AS score
FROM "entryindex" AS t1
WHERE ("entryindex" MATCH 'sqlite AND python')
ORDER BY rank
参考:
包含源代码的 IPython Notebook:http://nbviewer.jupyter.org/gist/coleifer/f1fc90c7d4938c73951c