12 XML

小牛编辑

2023-12-01

XML is a widely used markup language, intended mainly as a means of serialising data structures as a text document. Go has basic support for XML document processing.

Introduction

XML is now a widespread way of representing complex data structures serialised into text format. It is used to describe documents, as in DocBook and XHTML. It is used in specialised markup languages such as MathML and CML (Chemistry Markup Language). It is used to encode data as SOAP messages for Web Services, and the Web Service itself can be specified using WSDL (Web Services Description Language).

At the simplest level, XML allows you to define your own tags for use in text documents. Tags can be nested and can be interspersed with text. Each tag can also carry attributes with values. For example,


<person>
  <name>
    <family> Newmarch </family>
    <personal> Jan </personal>
  </name>
  <email type="personal">
    jan@newmarch.name
  </email>
  <email type="work">
    j.newmarch@boxhill.edu.au
  </email>
</person>

The structure of any XML document can be described in a number of ways:

A document type definition (DTD) is good for describing structure
XML Schema is good for describing the data types used by an XML document
RELAX NG is proposed as an alternative to both

There is argument over the relative value of each way of defining the structure of an XML document. We won't buy into that, as Go does not support any of them. Go cannot check the validity of a document against a schema; it can only check that the document is well-formed.
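As a rough sketch of what a well-formedness check amounts to, the program below reads every token of a document with the encoding/xml decoder and reports the first syntax error. The helper name wellFormed is hypothetical and not part of the package. A mismatched end tag is rejected, but any properly nested document is accepted whatever its tag names are.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// wellFormed is a hypothetical helper: it reads every token and
// returns the first syntax error, or nil at a clean end of input.
func wellFormed(doc string) error {
	d := xml.NewDecoder(strings.NewReader(doc))
	for {
		_, err := d.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	// Properly nested: accepted, whatever the tag names are.
	fmt.Println(wellFormed("<a><b>text</b></a>"))
	// Mismatched end tag: rejected as not well-formed.
	fmt.Println(wellFormed("<a><b>text</a></b>"))
}
```

Nothing here consults a DTD or schema, so a document using unexpected tags still passes.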

Four topics are discussed in this chapter: parsing an XML stream, marshalling Go data into XML, unmarshalling XML into Go data, and XHTML.

Parsing XML

Go has an XML parser which is created using NewDecoder. This takes an io.Reader as parameter and returns a pointer to a Decoder. The main method of this type is Token, which returns the next token in the input stream. The token is one of the types StartElement, EndElement, CharData, Comment, ProcInst or Directive.

The types are

StartElement

The type StartElement is a structure with two fields:


type StartElement struct {
    Name Name
    Attr []Attr
}

type Name struct {
    Space, Local string
}

type Attr struct {
    Name  Name
    Value string
}
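
A minimal sketch of how these fields might be read in practice: the hypothetical helper firstAttrs below finds the first start element in a document and collects its attributes into a map, using xml.NewDecoder from the encoding/xml package. Namespaces (the Space field) are ignored here for brevity.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// firstAttrs is a hypothetical helper: it returns the attributes of the
// first start element in doc as a map from local name to value.
func firstAttrs(doc string) (map[string]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := d.Token()
		if err != nil {
			return nil, err // includes io.EOF when no element is found
		}
		if se, ok := tok.(xml.StartElement); ok {
			attrs := make(map[string]string)
			for _, a := range se.Attr {
				attrs[a.Name.Local] = a.Value
			}
			return attrs, nil
		}
	}
}

func main() {
	attrs, err := firstAttrs(`<email type="work">j.newmarch@boxhill.edu.au</email>`)
	fmt.Println(attrs["type"], err)
}
```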

EndElement

This is also a structure, holding just the element name


type EndElement struct {
    Name Name
}

CharData

This type represents the text content enclosed by a tag. It is a simple type


type CharData []byte

Comment

Similarly, a comment is a simple type


type Comment []byte

ProcInst

A ProcInst represents an XML processing instruction of the form <?target inst?>


type ProcInst struct {
    Target string
    Inst   []byte
}

Directive

A Directive represents an XML directive of the form <!text>. The bytes do not include the <! and > markers.


type Directive []byte

A program to print out the tree structure of an XML document is


/* Parse XML
 */

package main

import (
  "encoding/xml"
  "fmt"
  "io"
  "os"
)

func main() {
  if len(os.Args) != 2 {
    fmt.Println("Usage: ", os.Args[0], "file")
    os.Exit(1)
  }
  // the decoder reads directly from the open file
  f, err := os.Open(os.Args[1])
  checkError(err)
  defer f.Close()

  parser := xml.NewDecoder(f)
  depth := 0
  for {
    token, err := parser.Token()
    if err == io.EOF {
      break // end of input
    }
    checkError(err) // anything else is a read or syntax error
    switch t := token.(type) {
    case xml.StartElement:
      printElmt(t.Name.Local, depth)
      depth++
    case xml.EndElement:
      depth--
      printElmt(t.Name.Local, depth)
    case xml.CharData:
      printElmt("\""+string(t)+"\"", depth)
    case xml.Comment:
      printElmt("Comment", depth)
    case xml.ProcInst:
      printElmt("ProcInst", depth)
    case xml.Directive:
      printElmt("Directive", depth)
    default:
      fmt.Println("Unknown")
    }
  }
}

// printElmt prints s indented by two spaces per level of depth
func printElmt(s string, depth int) {
  for n := 0; n < depth; n++ {
    fmt.Print("  ")
  }
  fmt.Println(s)
}

func checkError(err error) {
  if err != nil {
    fmt.Println("Fatal error ", err.Error())
    os.Exit(1)
  }
}

Note that the parser returns all CharData, including the whitespace between tags.

If we run this program against the person data structure given earlier, it produces


person
  "
  "
  name
    "
    "
    family
      " Newmarch "
    family
    "
    "
    personal
      " Jan "
    personal
    "
  "
  name
  "
  "
  email
    "
    jan@newmarch.name
  "
  email
  "
  "
  email
    "
    j.newmarch@boxhill.edu.au
  "
  email
  "
"
person
"
"

Note that as no DTD or other XML specification has been used, the tokenizer correctly prints out all the white space. (A DTD may specify that the whitespace can be ignored, but without it that assumption cannot be made.)

There is a potential trap in using this parser. It re-uses the buffer holding a token's bytes, so the data in a token is only valid until the next call to Token. If you want to refer to a token later, you need to copy it. Go has methods such as func (c CharData) Copy() CharData, and the function CopyToken, to make a copy of the data.
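
The following sketch illustrates one way to avoid that trap. The helper collectText is hypothetical; it keeps every piece of character data from a document, calling Copy on each CharData token before asking the decoder for the next one.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// collectText is a hypothetical helper that keeps the character data of
// every text token, copying each one before the next call to Token.
func collectText(doc string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	var saved []xml.CharData
	for {
		tok, err := d.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if cd, ok := tok.(xml.CharData); ok {
			// Copy detaches the bytes from the decoder's buffer.
			saved = append(saved, cd.Copy())
		}
	}
	out := make([]string, len(saved))
	for i, cd := range saved {
		out[i] = string(cd)
	}
	return out, nil
}

func main() {
	texts, err := collectText("<a><b>one</b><b>two</b></a>")
	fmt.Println(texts, err)
}
```

Converting the CharData to a string straight away would also make a copy; the explicit Copy is only needed when the token itself is kept.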

Unmarshalling XML

Go provides a function Unmarshal and a method func (d *Decoder) Decode to unmarshal XML into Go data structures. The unmarshalling is not perfect: Go and XML are different languages.

We consider a simple example before looking at the details. We take the XML document given earlier:


<person>
  <name>
    <family> Newmarch </family>
    <personal> Jan </personal>
  </name>
  <email type="personal">
    jan@newmarch.name
  </email>
  <email type="work">
    j.newmarch@boxhill.edu.au
  </email>
</person>

We would like to map this onto the Go structures


type Person struct {
  Name Name
  Email []Email
}

type Name struct {
  Family string
  Personal string
}

type Email struct {
  Type string
  Address string
}

This requires several comments:

Unmarshalling uses the Go reflection package. This requires that all fields be public, i.e. start with a capital letter. Earlier versions of Go used case-insensitive matching to match fields such as the XML string "name" to the field Name. Now, though, case-sensitive matching is used. To perform a match, the structure fields must be tagged to show the XML string that will be matched against. This changes Person to
```
type Person struct {
  Name  Name    `xml:"name"`
  Email []Email `xml:"email"`
}
```
While tagging of fields can attach XML strings to fields, it can't do so with the names of the structures. An additional field is required, with field name "XMLName". It is conventionally of type xml.Name, so that the element name is also recorded in it. This only affects the top-level struct, Person
```
type Person struct {
  XMLName xml.Name `xml:"person"`
  Name    Name     `xml:"name"`
  Email   []Email  `xml:"email"`
}
```
Repeated tags in the XML map to a slice in Go
Attributes within tags will match fields in a structure only if the Go field has the tag option ",attr". This occurs with the field Type of Email, where matching the attribute "type" of the "email" tag requires `xml:"type,attr"`
If an XML tag has no attributes and only has character data, then it matches a string field with the same name (case-sensitive, though). So the tag `xml:"family"` with character data "Newmarch" maps to the string field Family
But if the tag has attributes, then it must map to a structure. Go assigns the character data to the field with tag ,chardata. This occurs with the "email" data and the field Address with tag ,chardata

A program to unmarshal the document above is


/* Unmarshal
 */

package main

import (
  "encoding/xml"
  "fmt"
  "os"
)

type Person struct {
  XMLName xml.Name `xml:"person"` // checks and records the root element name
  Name    Name     `xml:"name"`
  Email   []Email  `xml:"email"` // repeated <email> tags fill the slice
}

type Name struct {
  Family   string `xml:"family"`
  Personal string `xml:"personal"`
}

type Email struct {
  Type    string `xml:"type,attr"` // the "type" attribute
  Address string `xml:",chardata"` // the text inside <email>
}

func main() {
  str := `<?xml version="1.0" encoding="utf-8"?>
<person>
  <name>
    <family> Newmarch </family>
    <personal> Jan </personal>
  </name>
  <email type="personal">
    jan@newmarch.name
  </email>
  <email type="work">
    j.newmarch@boxhill.edu.au
  </email>
</person>`

  var person Person

  err := xml.Unmarshal([]byte(str), &person)
  checkError(err)

  // now use the person structure e.g.
  fmt.Println("Family name: \"" + person.Name.Family + "\"")
  fmt.Println("Second email address: \"" + person.Email[1].Address + "\"")
}

func checkError(err error) {
  if err != nil {
    fmt.Println("Fatal error ", err.Error())
    os.Exit(1)
  }
}

(Note that the spaces printed inside the quotes are correct: the character data is not trimmed.) The strict rules are given in the package documentation.

Marshalling XML

Go 1 also has support for marshalling data structures into an XML document. The function is

    
func Marshal(v interface{}) ([]byte, error)

This can serve as a check on the previous program: marshalling the person value back into XML should reproduce the elements of the original document.
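
A minimal sketch of Marshal in use, assuming the same tagged structures as in the unmarshalling section with XMLName declared as xml.Name. The helper marshalPerson is hypothetical and only wraps the call so that the result is returned as a string.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Email struct {
	Type    string `xml:"type,attr"`
	Address string `xml:",chardata"`
}

type Name struct {
	Family   string `xml:"family"`
	Personal string `xml:"personal"`
}

type Person struct {
	XMLName xml.Name `xml:"person"`
	Name    Name     `xml:"name"`
	Email   []Email  `xml:"email"`
}

// marshalPerson is a hypothetical wrapper around xml.Marshal.
func marshalPerson(p Person) (string, error) {
	out, err := xml.Marshal(p)
	return string(out), err
}

func main() {
	p := Person{
		Name:  Name{Family: "Newmarch", Personal: "Jan"},
		Email: []Email{{Type: "work", Address: "j.newmarch@boxhill.edu.au"}},
	}
	s, err := marshalPerson(p)
	fmt.Println(s, err)
}
```

The output is a single line with no XML declaration. The package also offers MarshalIndent for indented output, and the constant xml.Header can be prepended when a declaration is wanted.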

XHTML

HTML does not conform to XML syntax. It has unterminated tags such as '<br>'. XHTML is a cleanup of HTML to make it compliant with XML. Documents in XHTML can be managed using the techniques above for XML.

HTML

There is some support in the XML package to handle HTML documents even though they are not XML-compliant. The XML parser discussed earlier can handle many HTML documents if it is configured by

    
  parser := xml.NewDecoder(r)
  parser.Strict = false
  parser.AutoClose = xml.HTMLAutoClose
  parser.Entity = xml.HTMLEntity
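
As a rough sketch of these settings at work, the hypothetical helper elementNames below lists the start tags in an HTML fragment that contains an unterminated br tag and an HTML entity, both of which strict XML parsing would reject.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// elementNames is a hypothetical helper: it returns the names of the
// start tags found in an HTML fragment, using the lenient settings.
func elementNames(html string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(html))
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var names []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return names, err
		}
		if se, ok := tok.(xml.StartElement); ok {
			names = append(names, se.Name.Local)
		}
	}
}

func main() {
	names, err := elementNames("<p>one<br>two&nbsp;three</p>")
	fmt.Println(names, err)
}
```

This only goes so far: real-world HTML is often malformed in ways these settings do not cover, so a dedicated HTML parser may be a better fit for arbitrary pages.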

Conclusion

Go has basic support for dealing with XML strings. It does not as yet have mechanisms for dealing with XML specification languages such as XML Schema or RELAX NG.