Kythe: Writing a New Indexer



This document is an overview of the steps to take to add support for a new language to Kythe.
We assume that you have the Kythe release package extracted to /opt/kythe. You can also build the tools from source, but it is not necessary to build Kythe to provide it with graph data. Sample code snippets are written in JavaScript, but this document is not about indexing any particular language.

In the Kythe pipeline, a language’s indexer is responsible for building a subgraph that represents a particular program. Complete indexers usually accept .kzip files that contain a program, all of its dependencies, and the arguments necessary for a compiler or interpreter to understand it. This data is packaged by a separate component called an extractor. Depending on the language and build system involved, it may be possible to use a generic extractor to produce these hermetic compilation units. We will not address extraction here.

For development and testing, it’s useful for the indexer to accept program text directly as input; this is how we will proceed in these instructions. First, we’ll write some scripts to insert file content into a small Kythe graph. From there, we’ll see how to encode Kythe nodes and edges into entries, the unit of exchange between many of our tools. We’ll see that certain kinds of nodes are used to represent common sorts of semantic objects in programming languages and that other nodes are used to represent syntactic spans of text. We will add relationships as edges between these nodes to add cross-reference data to the graph. This allows users to jump between definitions and references in programs we’ve indexed. Finally, we’ll discuss how to write tests for (and how to debug) Kythe indexers.

Bootstrapping Kythe support

Kythe indexers emit directed graph data as a stream of entries that can represent either nodes or edges. Entries have various encodings, but for simplicity we’ll use JSON. To get started, let’s write a script kythe-browse.sh that will turn a stream of JSON-formatted Kythe entries into a format that our example code browser can read. Put it in your Kythe root; it will clobber the directories //graphstore and //tables.

#!/bin/bash -e
set -o pipefail
BROWSE_PORT="${BROWSE_PORT:-8080}"

# You can find prebuilt binaries at
# https://github.com/kythe/kythe/releases/tag/v0.0.30.

# This script assumes that they are installed to /opt/kythe.
# If you build the tools yourself or install them to a different location,
# make sure to pass the correct public_resources directory to http_server.
rm -f -- graphstore/* tables/*
mkdir -p graphstore tables

# Read JSON entries from standard in to a graphstore.
/opt/kythe/tools/entrystream \
  --read_format=json | \
  /opt/kythe/tools/write_entries \
  -graphstore graphstore

# Convert the graphstore to serving tables.
/opt/kythe/tools/write_tables \
  -graphstore graphstore \
  -out tables

echo -e "\nhttp://localhost:${BROWSE_PORT}\n"
# Host the browser UI.
# Listening on ":${BROWSE_PORT}" instead allows access from other machines.
/opt/kythe/tools/http_server \
  -public_resources /opt/kythe/web/ui \
  -serving_table tables \
  -listen "localhost:${BROWSE_PORT}"
The protocol buffer encoding of Kythe facts is more efficient than the JSON encoding we’re using here. The difference only comes into play for languages that emit a large amount of data, like C++. Kythe supports JSON because some languages do not have good support for protocol buffers. The entrystream tool used in kythe-browse.sh is invoked to read a stream of JSON entries from standard input and emit a varint32-delimited stream of kythe.proto.Entry messages on standard output.
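To make "varint32-delimited" concrete, here is a minimal sketch of that style of framing: each message is preceded by its byte length, written as a base-128 varint. The function names (encodeVarint32, frame, unframe) are made up for illustration, and the payload is a stand-in byte string rather than a real serialized kythe.proto.Entry, so treat this as a picture of the wire layout under those assumptions, not as code taken from the Kythe tools.

```javascript
// Sketch only: function names are illustrative, not part of any Kythe API.

// Encode a non-negative integer below 2^32 as a base-128 varint:
// seven bits per byte, least significant group first, with the high
// bit set on every byte except the last.
function encodeVarint32(n) {
  const bytes = [];
  let value = n >>> 0;
  while (value >= 0x80) {
    bytes.push((value & 0x7f) | 0x80);
    value >>>= 7;
  }
  bytes.push(value);
  return Uint8Array.from(bytes);
}

// Prefix one serialized message with its length.
function frame(message) {
  const prefix = encodeVarint32(message.length);
  const out = new Uint8Array(prefix.length + message.length);
  out.set(prefix, 0);
  out.set(message, prefix.length);
  return out;
}

// Split a buffer of framed messages back into payloads.
// No error handling: the input is assumed to be complete and well formed.
function unframe(buffer) {
  const messages = [];
  let offset = 0;
  while (offset < buffer.length) {
    let length = 0;
    let shift = 0;
    let byte;
    do {
      byte = buffer[offset++];
      length |= (byte & 0x7f) << shift;
      shift += 7;
    } while (byte & 0x80);
    length >>>= 0; // reinterpret as unsigned
    messages.push(buffer.subarray(offset, offset + length));
    offset += length;
  }
  return messages;
}

// Stand-in payload; a real stream would carry serialized Entry messages.
const payload = new TextEncoder().encode("stand-in for serialized Entry bytes");
const framed = frame(payload);
console.log(unframe(framed).length);
```

A reader of such a stream never needs a separator character: it reads the length, then exactly that many bytes, and repeats.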

You can test this with a very short entry stream. The only tricky part here is that Kythe fact values, when serialized to JSON, are base64-encoded. This ensures that they can be properly deserialized later, since fact values may contain arbitrary binary data, but JSON strings permit only UTF-8 characters. For example, ZmlsZQ== decodes to file and SGVsbG8sIHdvcmxkIQ== decodes to Hello, world!.
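The base64 step is easy to get wrong by hand, so a small script can do it instead. The sketch below assumes Node.js (for its Buffer type); the helper names encodeFactValue, decodeFactValue, and factEntry are invented for this example and are not part of any Kythe API. The field names and the /kythe/node/kind and /kythe/text fact names follow the JSON entry encoding and the Kythe schema.

```javascript
// Sketch only: encodeFactValue, decodeFactValue, and factEntry are
// illustrative helper names, not part of any Kythe API. Assumes Node.js.

// Base64-encode a fact value (a string or raw bytes) for the JSON encoding.
function encodeFactValue(value) {
  return Buffer.from(value).toString("base64");
}

// Recover the original bytes of a fact value read from a JSON entry.
function decodeFactValue(encoded) {
  return Buffer.from(encoded, "base64");
}

// Build one fact entry about the node named by `source`.
function factEntry(source, factName, value) {
  return { source, fact_name: factName, fact_value: encodeFactValue(value) };
}

// The node's name; fields that do not apply are left as empty strings.
const source = { signature: "", path: "hello", language: "", root: "", corpus: "example" };
const entries = [
  factEntry(source, "/kythe/node/kind", "file"),
  factEntry(source, "/kythe/text", "Hello, world!"),
];

// Print one JSON object per line on standard output.
for (const entry of entries) {
  console.log(JSON.stringify(entry));
}
```

Saved under a hypothetical name such as hello-entries.js, the script's output could be piped straight into the browser script: node hello-entries.js | ./kythe-browse.sh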

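echo '
{"fact_name":"/kythe/node/kind","fact_value":"ZmlsZQ==","source":{"signature":"","path":"hello","language":"","root":"","corpus":"example"}}
{"fact_name":"/kythe/text","fact_value":"SGVsbG8sIHdvcmxkIQ==","source":{"signature":"","path":"hello","language":"","root":"","corpus":"example"}}
' | ./kythe-browse.sh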

You can check that http://localhost:8080/#hello?corpus=example shows ‘Hello, world!’.

Modeling Kythe entries

