11 About HTML


小牛编辑


2023-12-01

Web servers were originally created to serve HTML documents. They now serve all sorts of documents as well as data of different kinds. Nevertheless, HTML is still the main document type delivered over the Web. Go has basic mechanisms for parsing HTML documents, which are covered in this chapter.

Introduction

The Web was originally created to serve HTML documents. Now it is used to serve all sorts of documents as well as data of different kinds. Nevertheless, HTML is still the main document type delivered over the Web.

HTML has been through a large number of versions, and at the time of writing HTML 5 was still under development. There have also been many "vendor" versions of HTML, introducing tags that never made it into the standards.

HTML is simple enough to be edited by hand. Consequently, many HTML documents are "ill formed" and do not follow the syntax of the language. HTML parsers are generally not very strict, and will accept many "illegal" documents.

Earlier versions of Go offered little for handling HTML documents: basically, just a tokenizer. Because the package was incomplete, it was removed from Go 1. It can still be found in the exp (experimental) package if you really need it. No doubt some improved form will become available in a later version of Go, and then it will be added back into this book.

The XML package has limited support for HTML; it is discussed in the next chapter.

<!--

Go has basic parsing mechanisms based on a tokenizer. This lets you process HTML documents as they are read, but if you want to build a parse tree, for example, you have to do that yourself.

This package is currently marked as incomplete, so it will change over time.

Tokenizer

The html package implements a basic tokenizer that can be used to parse HTML. The following program reads a file of HTML text and prints information about the text tokens and the tags:


/* Read HTML
 */

package main

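import (
  "fmt"
  "io/ioutil"
  "os"
  "strings"

  // The tokenizer is not in the standard library "html" package;
  // in current Go it is provided by golang.org/x/net/html.
  "golang.org/x/net/html"
)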

func main() {
  if len(os.Args) != 2 {
    fmt.Println("Usage: ", os.Args[0], "file")
    os.Exit(1)
  }
  file := os.Args[1]
  bytes, err := ioutil.ReadFile(file)
  checkError(err)
  r := strings.NewReader(string(bytes))

  z := html.NewTokenizer(r)

  depth := 0
  for {
    tt := z.Next()

    for n := 0; n < depth; n++ {
      fmt.Print(" ")
    }

    switch tt {
    case html.ErrorToken:
      fmt.Println("Error ", z.Err().Error())
      os.Exit(0)
    case html.TextToken:
      fmt.Println("Text: \"" + z.Token().String() + "\"")
    case html.StartTagToken, html.EndTagToken:
      fmt.Println("Tag: \"" + z.Token().String() + "\"")
      if tt == html.StartTagToken {
        depth++
      } else {
        depth--
      }
    }
  }

}

func checkError(err error) {
  if err != nil {
    fmt.Println("Fatal error ", err.Error())
    os.Exit(1)
  }
}

-->

Conclusion

There is little to say about this package at present, as it is still under development.