Documents, Fields, and Schema Design

璩正志

2023-12-01

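1. Documents, Fields, and Schema Design

1.1 Overview

1.1.1 Solr's Schema File

Solr stores the field types and field details it needs to know about in a schema file. The name and location of this file depend on how you have configured Solr.
- schema.xml is the traditional name of the schema file.
- managed-schema is the schema file name used when Solr's managed schema feature is enabled.
This feature lets you interact with the schema through the Schema API, and you can also choose a different name for the schema file.
- If you use SolrCloud, you will not find a file with either name on the local file system.
You can only view the schema through the Schema API (if enabled) or through the Cloud page of the Solr Admin UI.

Whichever schema file name you use, the structure of the file is the same. What differs is how you interact with the schema.
With a managed schema, it is assumed that you access the schema file only through the Schema API and never edit it by hand.
Without a managed schema, it is assumed that you never use the Schema API and only edit the schema file by hand.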
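
The choice between the two modes is made in solrconfig.xml. The fragment below is a minimal sketch of both options; the managed variant repeats the settings shown later in these notes, and the classic variant is left commented out because only one schemaFactory should be active at a time.

```xml
<!-- Option A: managed schema, changed only through the Schema API -->
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

<!-- Option B: classic, hand-edited schema.xml (use instead of option A)
<schemaFactory class="ClassicIndexSchemaFactory"/>
-->
```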

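Note: in SolrCloud mode, if you do not use the Schema API, you modify schema.xml by downloading and re-uploading the configuration through ZooKeeper with the downconfig and upconfig commands.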

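1.2 Solr Field Types

1.2.1 Field Type Definitions and Properties

A field type definition can include four kinds of information:
- name (mandatory)
- implementation class (mandatory)
- for a TextField, a description of the field analysis for the type
- field type properties; depending on the implementation class, some properties may be mandatory.

(1) Field type definitions
In schema.xml, field types are defined with <fieldType> elements, which may optionally be grouped inside a <types> element. For example: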
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
</fieldType>

In class names in schema.xml, the solr prefix is shorthand for org.apache.solr.schema or org.apache.solr.analysis.
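
Because a TextField also needs an analysis description, a fuller version of the definition above might look like the following. This is a sketch only: the particular tokenizer and filter are an assumed, deliberately small chain, not the one from any specific configset.

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <!-- analysis applied to field values at index time -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- analysis applied to query text; kept identical here so that terms line up -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```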

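(2) Field type properties
The properties that can be specified for a given field type fall into three categories:
- properties specific to the field type's class
- general properties that Solr supports for any field type
- field default properties, which can be set on the field type and are then inherited by its fields in place of the built-in defaults

(2.1) General Properties
- name: the name of the field type. It is strongly recommended that names contain only alphanumeric characters and underscores and do not start with a digit.
- class: the implementation class. "solr.TextField" is equivalent to "org.apache.solr.schema.TextField"; a third-party class needs its fully qualified name.
- positionIncrementGap (integer): for multiValued fields, the distance between values, which prevents spurious phrase matches across values.
- autoGeneratePhraseQueries (true/false): for text fields. If true, Solr automatically generates phrase queries for adjacent terms; if false, terms must be enclosed in double quotes to be treated as a phrase.
- docValuesFormat: a custom DocValuesFormat for this type. This requires a schema-aware codec, such as SchemaCodecFactory, to be configured in solrconfig.xml.
- postingsFormat: a custom PostingsFormat for this type. This requires a schema-aware codec, such as SchemaCodecFactory, to be configured in solrconfig.xml.

(2.2) Field Default Properties
These properties can be specified on the field type, or on an individual field to override the value inherited from the field type.
The defaults listed below apply to most of the FieldType implementations that Solr provides, assuming schema.xml declares version="1.6". An asterisk (*) marks a default that depends on the field type.

- indexed true
- stored true
- docValues false
- sortMissingFirst/sortMissingLast false
- multiValued false
- omitNorms * If true, omits the norms for the field (this disables length normalization and index-time boosting, and saves memory).
Defaults to true for all primitive (non-analyzed) field types.
- omitTermFreqAndPositions * If true, omits term frequency, positions, and payloads for the field. This improves performance and reduces index size for fields that do not need the information. Queries that rely on positions will fail to match on such a field.
Defaults to true for all primitive (non-analyzed) field types.
- omitPositions * Similar to omitTermFreqAndPositions, but keeps term frequency information.
- termVectors/termPositions/termOffsets/termPayloads false These speed up highlighting and other auxiliary features, at the cost of index size.
- required false
- useDocValuesAsStored true If docValues are enabled, true lets the field be returned as if it were a stored field (even when stored=false) when it is matched by fl="*".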
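
To illustrate how defaults flow from a field type to its fields, here is a small sketch. The field names category and tags are hypothetical; the properties themselves are the ones listed above.

```xml
<!-- the type sets defaults for every field that uses it -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>

<!-- inherits docValues="true" and sortMissingLast="true" from the type -->
<field name="category" type="string" indexed="true" stored="true"/>

<!-- overrides the stored and multiValued defaults on the field itself -->
<field name="tags" type="string" indexed="true" stored="false" multiValued="true"/>
```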

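1.2.2 Field Types Included with Solr

Package org.apache.solr.schema
- BinaryField
- BoolField: values whose first character is "1", "t", or "T" are interpreted as true; any other value is false
- CollationField
- CurrencyField: currencies
- DateRangeField
- ExternalFileField: pulls values from a file on disk
- EnumField: enumerated type; see its configuration in section 1.2.5
- ICUCollationField
- LatLonType: spatial search, a latitude/longitude pair
- PointType: spatial search, an n-dimensional point
- PreAnalyzedField
- RandomSortField: holds no value; sorting on it returns results in random order. Use it through a dynamic field
- SpatialRecursivePrefixTreeFieldType: spatial search
- StrField: a UTF-8 encoded or Unicode string
- TextField
- TrieDateField: a point in time with millisecond precision. precisionStep="0" gives efficient sorting and a smaller index; precisionStep="8" (the default) gives efficient range queries
- TrieDoubleField: 64-bit. precisionStep="0" gives efficient sorting and a smaller index; precisionStep="8" (the default) gives efficient range queries
- TrieFloatField: 32-bit. precisionStep="0" gives efficient sorting and a smaller index; precisionStep="8" (the default) gives efficient range queries
- TrieIntField: 32-bit. precisionStep="0" gives efficient sorting and a smaller index; precisionStep="8" (the default) gives efficient range queries
- TrieLongField: 64-bit. precisionStep="0" gives efficient sorting and a smaller index; precisionStep="8" (the default) gives efficient range queries
- TrieField: the "type" attribute must be specified as integer, long, float, double, or date. precisionStep="0" gives efficient sorting and a smaller index; precisionStep="8" (the default) gives efficient range queries
- UUIDField: pass the value "NEW" to have Solr create a new UUID; a default value of "NEW" is not recommended in SolrCloud mode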
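
As a sketch of how the precisionStep trade-off and the RandomSortField advice translate into schema entries, the following declarations use type names (int, tint, random) that are conventions assumed here rather than anything these notes prescribe.

```xml
<!-- precisionStep="0": compact index and efficient sorting -->
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>

<!-- precisionStep="8": extra indexed terms for faster range queries -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0"/>

<!-- RandomSortField holds no value, so it is exposed through a dynamic field -->
<fieldType name="random" class="solr.RandomSortField" indexed="true"/>
<dynamicField name="random_*" type="random"/>
```

With these in place, a request could sort on any name matching random_*, for example sort=random_1234 desc, to get a random ordering.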

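1.2.3 Working with Currencies and Exchange Rates

1.2.4 Working with Dates

TrieDateField represents a point in time with millisecond precision.
Format: "YYYY-MM-DD'T'hh:mm:ss'Z'"

DateRangeField supports the time points above and also expressions for time ranges:
- 2000-11: the whole of November 2000
- 2000-11T13: the 13:00 hour, November 2000 (hour resolution)
- -0009: the year 10 BC
- [2000-11-01 TO 2014-12-01]: a range at day resolution
- [2014 TO 2014-12-01]: from the start of 2014 to the end of 1 December 2014
- [* TO 2014-12-01]: from the earliest representable time to the end of 1 December 2014

Date math syntax, for example:
NOW+2MONTHS
NOW-1DAY
NOW/HOUR
NOW+6MONTHS+3DAYS/DAY
1972-05-20T17:33:18.772Z+6MONTHS+3DAYS/DAY

Request examples. The first pins NOW to a fixed time given in milliseconds since the epoch; the second filters a DateRangeField with the Contains operator:
q=solr&fq=start_date:[* TO NOW]&NOW=1384387200000
fq={!field f=dateRange op=Contains}[2013 TO 2018]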
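
For the dateRange field queried above to exist, the schema needs a DateRangeField type and a field that uses it. The sketch below assumes the type is also named dateRange; the second part is a Solr XML update message with a hypothetical document id, showing a range value being indexed.

```xml
<!-- schema: a DateRangeField type and a field using it -->
<fieldType name="dateRange" class="solr.DateRangeField"/>
<field name="dateRange" type="dateRange" indexed="true" stored="true"/>

<!-- update message: index a document whose value is a range, not a single instant -->
<add>
  <doc>
    <field name="id">event-1</field>
    <field name="dateRange">[2014-05-01 TO 2014-05-03]</field>
  </doc>
</add>
```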

1.2.5 Working with Enum Fields

<fieldType name="priorityLevel" class="solr.EnumField" enumsConfig="enumsConfig.xml" enumName="priority"/>

<?xml version="1.0" ?>
<enumsConfig>
<enum name="priority">
<value>Not Available</value>
<value>Low</value>
<value>Medium</value>
<value>High</value>
<value>Urgent</value>
</enum>
<enum name="risk">
<value>Unknown</value>
<value>Very Low</value>
<value>Low</value>
<value>Medium</value>
<value>High</value>
<value>Critical</value>
</enum>
</enumsConfig>
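
A field then refers to the enum type like any other type. In this sketch the field name priority and the document id are hypothetical; the value must be one of the entries listed for the priority enum, and sorting on the field follows the order in which the values are declared rather than alphabetical order.

```xml
<!-- schema: a field backed by the priorityLevel enum type -->
<field name="priority" type="priorityLevel" indexed="true" stored="true"/>

<!-- update message: the value must match one of the declared enum values -->
<add>
  <doc>
    <field name="id">ticket-42</field>
    <field name="priority">High</field>
  </doc>
</add>
```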

1.2.6 Working with External Files and Processes

1.2.7 Field Properties by Use Case

- search within field: indexed=true
- retrieve contents: stored=true
- use as unique key: indexed=true, multiValued=false
- sort on field: indexed=true*7, multiValued=false, omitNorms=true*1, docValues=true*7
- use field boosts*5: omitNorms=false
- document boosts affect searches within field: omitNorms=false
- highlighting: indexed=true*4, stored=true, termVectors=true*2, termPositions=true*3
- faceting*5: indexed=true*7, docValues=true*7
- add multiple values, maintaining order: multiValued=true
- field length affects doc score: omitNorms=false
- MoreLikeThis*5: termVectors=true*6

*1 Recommended but not necessary.
*2 Will be used if present, but not necessary.
*3 (if termVectors=true)
*4 A tokenizer must be defined for the field, but the field does not need to be indexed.
*5 Described in Analyzers, Tokenizers, and Filters.
*6 Term vectors are not mandatory. If they are not enabled, the stored field is analyzed instead, so term vectors are recommended, but they are required only if stored=false.
*7 Either indexed or docValues must be true, but both are not required. docValues can be more efficient in many cases.
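
The list above can be read as a recipe for individual field declarations. The following sketch applies it to three hypothetical fields; the type names text_general and string are assumed to be defined elsewhere in the schema.

```xml
<!-- searchable, returned, and prepared for highlighting -->
<field name="description" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- used only for sorting and faceting: docValues without an inverted index -->
<field name="brand_sort" type="string" indexed="false" stored="false"
       multiValued="false" docValues="true"/>

<!-- several values per document, kept in the order they were added -->
<field name="authors" type="string" indexed="true" stored="true" multiValued="true"/>
```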

1.3 Defining Fields

Fields are defined in schema.xml with the field element. For example:
<field name="price" type="float" default="0.0" indexed="true" stored="true"/>

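(1) Field properties
- name: required. Names with both leading and trailing underscores (such as "_version_") are reserved.
- type: required. The name of the field type.
- default: a value added automatically to documents that have no value for this field. If it is not specified, there is no default.

(2) Optional field type override properties
See Field Default Properties.

1.4 Copying Fields

copyField lets you apply more than one field type to the same piece of data by copying a source field into a destination field of a different type. The optional maxChars attribute limits the number of characters copied.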
<copyField source="cat" dest="text" maxChars="30000" />
<copyField source="*_t" dest="text" maxChars="25000" />
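
A copyField rule only works if its destination is itself a defined field. The sketch below shows one possible set of declarations around the first rule above; the field types and the stored/multiValued settings are assumptions chosen for a typical catch-all search field.

```xml
<!-- source: kept as an exact string, for example for faceting -->
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- destination: analyzed text, searched but not returned; multiValued because
     several sources copy into it -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

<copyField source="cat" dest="text" maxChars="30000"/>
```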

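1.5 Dynamic Fields

Dynamic fields let Solr index fields that you did not define explicitly in schema.xml.

A dynamic field is much like a regular field, except that its name contains a wildcard. At index time, a field that does not match any explicitly defined field is matched against the dynamic fields.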

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
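
As an illustration, the update message below would index two fields that are not declared anywhere by name; both end in _i and so would be handled by the *_i rule above. The document id and field names are hypothetical, and an explicitly defined id field is assumed.

```xml
<add>
  <doc>
    <field name="id">item-1</field>
    <!-- no explicit definitions: both names match the *_i dynamic field -->
    <field name="quantity_i">12</field>
    <field name="weight_i">340</field>
  </doc>
</add>
```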

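1.6 Other Schema Elements

1.6.1 Unique Key

The unique identifier of a document, for example: <uniqueKey>id</uniqueKey>

Schema defaults and copyFields cannot be used to populate the uniqueKey field. To have uniqueKey values generated automatically, use UUIDUpdateProcessorFactory in an update processor chain instead.
The uniqueKey field must not be multiValued.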
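
A minimal sketch of a matching pair of declarations, assuming a string type named string exists in the schema:

```xml
<!-- the key field: indexed, single-valued, and required on every document -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<uniqueKey>id</uniqueKey>
```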

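1.6.2 Default Search Field

Deprecated as of Solr 3.6. Use the df or qf request parameter instead.

1.6.3 Query Parser Default Operator

Deprecated as of Solr 3.6. Use the q.op request parameter instead.

1.6.4 Similarity

A <similarity> declaration specifies a custom similarity implementation.

For example, it can reference a class that has a no-argument constructor:
<similarity class="solr.ClassicSimilarity"/>

Or it can reference a SimilarityFactory implementation, optionally with initialization parameters: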
<similarity class="solr.DFRSimilarityFactory">
<str name="basicModel">P</str>
<str name="afterEffect">L</str>
<str name="normalization">H2</str>
<float name="c">7</float>
</similarity>

SchemaSimilarityFactory lets individual field types declare their own similarity, overriding the default behavior.
<similarity class="solr.SchemaSimilarityFactory">
<!-- if this setting is omitted, the default is ClassicSimilarity -->
<str name="defaultSimFromFieldType">text_dfr</str>
</similarity>
<fieldType name="text_dfr" class="solr.TextField">
<analyzer ... />
<similarity class="solr.DFRSimilarityFactory">
<str name="basicModel">I(F)</str>
<str name="afterEffect">B</str>
<str name="normalization">H3</str>
<float name="mu">900</float>
</similarity>
</fieldType>
<fieldType name="text_ib" class="solr.TextField">
<analyzer ... />
<similarity class="solr.IBSimilarityFactory">
<str name="distribution">SPL</str>
<str name="lambda">DF</str>
<str name="normalization">H2</str>
</similarity>
</fieldType>
<fieldType name="text_other" class="solr.TextField">
<analyzer ... />
</fieldType>

Other factories include SweetSpotSimilarityFactory and BM25SimilarityFactory.
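
As one more illustration, a global BM25 declaration could be sketched as follows. The k1 and b values shown are the commonly used defaults, included here only to show where the parameters go.

```xml
<similarity class="solr.BM25SimilarityFactory">
  <!-- k1 controls term frequency saturation -->
  <float name="k1">1.2</float>
  <!-- b controls how strongly document length normalizes the score -->
  <float name="b">0.75</float>
</similarity>
```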

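1.7 Schema API

The Schema API provides read and write access to the schema of each collection.
Read access is supported for all schema elements.
Fields, dynamic fields, field types, and copyField rules can be added, removed, or replaced. Future Solr releases may support write operations on more schema elements.

Note: once the schema has been modified, reindex all of your data.

To modify the schema through the API, the schema must be managed and mutable; see the Managed Schema configuration.
The API supports two output modes: JSON or XML.
When the schema is modified through the API, a core reload happens automatically so that the changes take effect.

The base address of the API is http://<host>:<port>/solr/<collection_name>, for example http://localhost:8983/solr/gettingstarted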

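1.8 Putting the Pieces Together

1.9 DocValues

DocValues are a way of recording field values internally that is more efficient than traditional indexing for some purposes, such as sorting and faceting.

1.9.1 Why Use DocValues?
The traditional inverted index maps terms to documents. That is not very efficient for search features that are common today, such as sorting, faceting, grouping, and highlighting, which need to look up the values of a given document.

A DocValues field builds a document-to-value mapping at index time.

1.9.2 Enabling DocValues
To use DocValues, you only need to enable them for a field in the schema: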
<field name="manu_exact" type="string" indexed="false" stored="false"
docValues="true" />

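DocValues are available only for specific field types, and the type chosen determines the underlying Lucene docValues type that is used.
Available Solr field types:
- StrField and UUIDField
If the field is single-valued, Lucene uses the SORTED type.
If the field is multi-valued, Lucene uses the SORTED_SET type.
- Any Trie* numeric field and EnumField
If the field is single-valued, Lucene uses the NUMERIC type.
If the field is multi-valued, Lucene uses the SORTED_SET type.

When multi-valued DocValues are stored as SORTED_SET, there are two implications to be aware of:
(1) Values are returned in sorted order rather than in their original order.
(2) If a document contains several identical values, only one of them is returned.

An additional option, docValuesFormat, changes how DocValues are held. The default implementation loads part of the data into memory and keeps the rest on disk.
You can also choose to keep everything in memory: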
<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true"
docValuesFormat="Memory" />

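1.9.3 Using DocValues
If a field has docValues="true", DocValues are used automatically whenever the field is used for sorting, faceting, or Function Queries.

Retrieving DocValues at search time
For a query with "fl=*", a non-stored field is returned from its DocValues if useDocValuesAsStored is true (the default).
With useDocValuesAsStored="false", you can still retrieve the docValues by naming the field explicitly in the fl parameter.

1.10 Schemaless Mode

Schemaless mode is a set of Solr features that, used together, let you build a schema quickly just by indexing data, without editing the schema by hand. These features are configured in solrconfig.xml:
(1) Managed schema: the schema is modified through Solr APIs.
(2) Field value class guessing: for unknown fields, a cascading set of value-based parsers is run to guess the Java class of the field values.
Parsers currently exist for Boolean, Integer, Long, Float, Double, and Date.
(3) Automatic schema field addition, based on field value class(es): unknown fields are added to the schema and mapped to a schema field type based on the Java class of their values.

All three features are preconfigured in data_driven_schema_configs. To try them:
bin/solr start -e schemaless

Configuring schemaless mode by hand (solrconfig.xml)
- Enable the managed schema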
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">true</bool>
<str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
- Define an UpdateRequestProcessorChain that guesses field types
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
......
</updateRequestProcessorChain>

Once the UpdateRequestProcessorChain is defined, you must configure the update request handlers to use it when indexing updates.
<initParams path="/update/**">
<lst name="defaults">
<str name="update.chain">add-unknown-fields-to-the-schema</str>
</lst>
</initParams>
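
For orientation, the general shape of such a chain can be sketched as below. This is an abbreviated illustration, not the chain shipped with Solr: it leaves out the date parser and other processors, and the field type names (strings, booleans, tlongs, tdoubles) are assumptions that must match types defined in your schema.

```xml
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <!-- value-based parsers, tried in order, that guess the Java class of each value -->
  <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>

  <!-- adds unknown fields to the schema, mapping value classes to field types -->
  <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
    <str name="defaultFieldType">strings</str>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Boolean</str>
      <str name="fieldType">booleans</str>
    </lst>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Long</str>
      <str name="valueClass">java.lang.Integer</str>
      <str name="fieldType">tlongs</str>
    </lst>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Number</str>
      <str name="fieldType">tdoubles</str>
    </lst>
  </processor>

  <!-- log the update, then hand it to the index -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```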

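2. Walkthrough of the Schema Configuration Shipped with Solr 5

/opt/solr/server/solr/configsets/data_driven_schema_configs/conf/managed-schema

This is the Solr schema file. It should be named "schema.xml" and placed in the conf directory under the Solr home (by default ./solr/conf/schema.xml),
or located where the Solr webapp's classloader can find it.

For more information on how to customize this file, see
http://wiki.apache.org/solr/SchemaXml

Performance note: this schema includes many optional features and should not be used for benchmarking. To improve performance you can:
- Set stored="false" on every field where that is possible (especially large fields) when you only need to search the field and do not need its value returned.
- Set indexed="false" if you do not need to search the field and only return it in search results.
- Remove all unneeded copyField declarations.
- For the best index size and search performance, set indexed="false" on all general text fields, use copyField to copy them into
one catch-all text field, and search on that field.
- For maximum indexing performance, use the ConcurrentUpdateSolrServer Java client.
- Remember to run the JVM in server mode, and use a higher logging level that avoids logging every request.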

<schema name="example-data-driven-schema" version="1.6">
<field>
<copyField>
<dynamicField>
<uniqueKey>id</uniqueKey>

<fieldType>


<dynamicField name="*_txt_cjk" type="text_cjk" indexed="true" stored="true"/>
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>

<filter class="solr.CJKWidthFilterFactory"/>

<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
</analyzer>
</fieldType>

<useDocValuesAsStored>false</useDocValuesAsStored>
</schema>