11.11 文件处理
我们已经做过一些基本的文件处理工作了: 我们可以打开或者关闭一个文件, 可以通过缓存对一个文件进行读取或者写入。 但是, 当处理文件的时候, UNIX® 提供了更多的功能。 我们将在这个章节中尝试使用一些功能, 并将以一个非常漂亮的文件转换工具作为结束。
当然, 我们是从后往前完成这个文件转换工具的。 因为在通常情况下, 如果我们已经知道我们要实现些什么, 那么我们的程序将完成得更加容易。
我最早为 UNIX 所编写的程序中,有一个叫 tuc 的文件格式转换程序, 可以将其他操作系统中的文本文件转化为 UNIX 的文本文件。 也就是说, 它将不同的行结束符, 转换换行符为文件结束符, 来对应 UNIX 的规范。 也可以将 UNIX 的文本文件转化为 DOS 格式的文本文件。
我虽然广泛得使用我的 tuc 程序, 但是仅仅限于从其他操作系统向 UNIX 转换。 所以, 我希望它能覆盖原来的文件,而不是将结果输出到其他的文件中。 大多数时候, 我是这样使用它的:
% tuc myfile tempfile % mv tempfile myfile
所以最好能有一个叫 ftuc 的程序, 可以这样使用:
% ftuc myfile
接下来, 我们将用汇编完成 ftuc, ( 最开始的 tuc 是C程序 ) 并学习进程中多种面向文件的内核服务。
你可能感觉这个程序很简单,不过是将输车删除而已。
如果你回答是肯定的, 那么想一想: 这个方法虽然对于 MS DOS 的文件, 乃至大多数情况下都适用, 不过有时候也会有失败的时候。
问题是, 并不是所有的非 UNIX 的文本文件一个回车符带一个换行符结尾。 有些只使用回车符, 有些将几个空行合并为一个回车符和若干个换行符, 等等。
一个文件转换程序需要能够处理所有可能出现的结尾:
carriage return / line feed
carriage return
line feed / carriage return
line feed
它也需要在上面的基础上,处理那些使用行合并的文件。 ( 比如, 一个回车带多个换行符)
11.11.1 有限状态自动机
这些问题可以很容易地通过一种名为 有限状态自动机 的技术来解决, 这种技术先前是由设计数字电路的人们发明的。 有限状态自动机 是一种其输出不仅取决于输入, 并且还取决于之前的输入, 也就是其状态的数字电路。 微处理器便是一种 有限状态自动机 的例子: 我们使用的汇编语言代码, 有些会对应于单字节的机器指令, 而另一些则对应于多字节机器指令。 微处理器会从内存中逐个字节地抓取指令, 而在这个过程中, 有些只是简单地改变处理器状态, 而并不产生个输出。 当完成了整个机器指令的抓取之后, 微处理器才会产生输出, 或改变寄存器的值, 等等。
正因为如此, 所有的软件, 对微处理器而言, 其实都是一系列的状态指令。 因而, 有限状态自动机 的概念对于软件设计而言也很有用。
我们将文本文件转换成许设计为一个包含三个状态的 有限状态自动机。 我们可以把这三个状态称作状态 0-2, 不过为了明确起见, 将它们符号化为:
ordinary
cr
lf
我们的程序开始时处于 ordinary
状态。 在这个状态中, 程序的动作取决于其输入:
如果输入是除了回车符和换行符之外的任何其他字符, 则这输入会直接送到输出, 同时程序状态保持不变。
如果输入是回车符, 则程序状态变为
cr
。 而输入则被丢弃, 也就是说不输入任何东西。如果输入是换行符, 则程序状态变为
lf
。 同时丢弃输入内容。
如够程序进入了 cr
状态, 说明之前的输出是回车符, 并且还没有进行处理。 这种情况下的输出仍然取决于当前的输入:
如果输入是任何回车或换行符之外的其它字符, 则首先输出一个换行符, 然后输出输入内容, 并将程序状态变为
ordinary
。如果输入是回车符, 说明我们收到了在同一行中的两个 (或更多) 换行符。 此时应丢弃输入, 并输出一个换行符, 同时保持状态不变。
如果输入是换行符, 则输出一个换行符, 并将程序状态变为
ordinary
。 请注意, 这和前面的地一种情况不同 - 如果我们把它们合并, 则结果将是输出两次换行符, 而不是所希望的一次。
最后, 如果我们收到的输入是一个在前面没有回车符的换行符的话, 则程序会进入 lf
状态。 如果我们的文件已经是 UNIX 格式, 或者一行中一个回车之后跟随了若干换行符, 或者行尾是换行/回车序列时便会发生这种情况。 下面是如何处理这种情况时的输入:
If the input is anything other than a carriage return or line feed, we output a line feed, then output the input, then change the state to
ordinary
. This is exactly the same action as in thecr
state upon receiving the same kind of input.If the input is a carriage return, we discard the input, we output a line feed, then change the state to
ordinary
.If the input is a line feed, we output the line feed, and leave the state unchanged.
11.11.1.1 The Final State
The above finite state machine works for the entire file, but leaves the possibility that the final line end will be ignored. That will happen whenever the file ends with a single carriage return or a single line feed. I did not think of it when I wrote tuc, just to discover that occasionally it strips the last line ending.
This problem is easily fixed by checking the state after the entire file was processed. If the state is not ordinary
, we simply need to output one last line feed.
注意: Now that we have expressed our algorithm as a finite state machine, we could easily design a dedicated digital electronic circuit (a "chip") to do the conversion for us. Of course, doing so would be considerably more expensive than writing an assembly language program.
11.11.1.2 The Output Counter
Because our file conversion program may be combining two characters into one, we need to use an output counter. We initialize it to 0
, and increase it every time we send a character to the output. At the end of the program, the counter will tell us what size we need to set the file to.
11.11.2 Implementing FSM in Software
The hardest part of working with a finite state machine is analyzing the problem and expressing it as a finite state machine. That accomplished, the software almost writes itself.
In a high-level language, such as C, there are several main approaches. One is to use a switch
statement which chooses what function should be run. For example,
switch (state) { default: case REGULAR: regular(inputchar); break; case CR: cr(inputchar); break; case LF: lf(inputchar); break; }
Another approach is by using an array of function pointers, something like this:
(output[state])(inputchar);
Yet another is to have state
be a function pointer, set to point at the appropriate function:
(*state)(inputchar);
This is the approach we will use in our program because it is very easy to do in assembly language, and very fast, too. We will simply keep the address of the right procedure in EBX
, and then just issue:
call ebx
This is possibly faster than hardcoding the address in the code because the microprocessor does not have to fetch the address from the memory──it is already stored in one of its registers. I said possibly because with the caching modern microprocessors do, either way may be equally fast.
11.11.3 Memory Mapped Files
Because our program works on a single file, we cannot use the approach that worked for us before, i.e., to read from an input file and to write to an output file.
UNIX allows us to map a file, or a section of a file, into memory. To do that, we first need to open the file with the appropriate read/write flags. Then we use the mmap
system call to map it into the memory. One nice thing about mmap
is that it automatically works with virtual memory: We can map more of the file into the memory than we have physical memory available, yet still access it through regular memory op codes, such as mov
, lods
, and stos
. Whatever changes we make to the memory image of the file will be written to the file by the system. We do not even have to keep the file open: As long as it stays mapped, we can read from it and write to it.
The 32-bit Intel microprocessors can access up to four gigabytes of memory - physical or virtual. The FreeBSD system allows us to use up to a half of it for file mapping.
For simplicity sake, in this tutorial we will only convert files that can be mapped into the memory in their entirety. There are probably not too many text files that exceed two gigabytes in size. If our program encounters one, it will simply display a message suggesting we use the original tuc instead.
If you examine your copy of syscalls.master, you will find two separate syscalls named mmap
. This is because of evolution of UNIX: There was the traditional BSD mmap
, syscall 71. That one was superseded by the POSIX® mmap
, syscall 197. The FreeBSD system supports both because older programs were written by using the original BSD version. But new software uses the POSIX version, which is what we will use.
The syscalls.master file lists the POSIX version like this:
197 STD BSD { caddr_t mmap(caddr_t addr, size_t len, int prot, \ int flags, int fd, long pad, off_t pos); }
This differs slightly from what mmap(2) says. That is because mmap(2) describes the C version.
The difference is in the long pad
argument, which is not present in the C version. However, the FreeBSD syscalls add a 32-bit pad after push
ing a 64-bit argument. In this case, off_t
is a 64-bit value.
When we are finished working with a memory-mapped file, we unmap it with the munmap
syscall:
提示: For an in-depth treatment of
mmap
, see W. Richard Stevens' Unix Network Programming, Volume 2, Chapter 12.
11.11.4 Determining File Size
Because we need to tell mmap
how many bytes of the file to map into the memory, and because we want to map the entire file, we need to determine the size of the file.
We can use the fstat
syscall to get all the information about an open file that the system can give us. That includes the file size.
Again, syscalls.master lists two versions of fstat
, a traditional one (syscall 62), and a POSIX one (syscall 189). Naturally, we will use the POSIX version:
189 STD POSIX { int fstat(int fd, struct stat *sb); }
This is a very straightforward call: We pass to it the address of a stat
structure and the descriptor of an open file. It will fill out the contents of the stat
structure.
I do, however, have to say that I tried to declare the stat
structure in the .bss
section, and fstat
did not like it: It set the carry flag indicating an error. After I changed the code to allocate the structure on the stack, everything was working fine.
11.11.5 Changing the File Size
Because our program may combine carriage return / line feed sequences into straight line feeds, our output may be smaller than our input. However, since we are placing our output into the same file we read the input from, we may have to change the size of the file.
The ftruncate
system call allows us to do just that. Despite its somewhat misleading name, the ftruncate
system call can be used to both truncate the file (make it smaller) and to grow it.
And yes, we will find two versions of ftruncate
in syscalls.master, an older one (130), and a newer one (201). We will use the newer one:
201 STD BSD { int ftruncate(int fd, int pad, off_t length); }
Please note that this one contains a int pad
again.
11.11.6 ftuc
We now know everything we need to write ftuc. We start by adding some new lines in system.inc. First, we define some constants and structures, somewhere at or near the beginning of the file:
;;;;;;; open flags %define O_RDONLY 0 %define O_WRONLY 1 %define O_RDWR 2 ;;;;;;; mmap flags %define PROT_NONE 0 %define PROT_READ 1 %define PROT_WRITE 2 %define PROT_EXEC 4 ;; %define MAP_SHARED 0001h %define MAP_PRIVATE 0002h ;;;;;;; stat structure struc stat st_dev resd 1 ; = 0 st_ino resd 1 ; = 4 st_mode resw 1 ; = 8, size is 16 bits st_nlink resw 1 ; = 10, ditto st_uid resd 1 ; = 12 st_gid resd 1 ; = 16 st_rdev resd 1 ; = 20 st_atime resd 1 ; = 24 st_atimensec resd 1 ; = 28 st_mtime resd 1 ; = 32 st_mtimensec resd 1 ; = 36 st_ctime resd 1 ; = 40 st_ctimensec resd 1 ; = 44 st_size resd 2 ; = 48, size is 64 bits st_blocks resd 2 ; = 56, ditto st_blksize resd 1 ; = 64 st_flags resd 1 ; = 68 st_gen resd 1 ; = 72 st_lspare resd 1 ; = 76 st_qspare resd 4 ; = 80 endstruc
We define the new syscalls:
%define SYS_mmap 197 %define SYS_munmap 73 %define SYS_fstat 189 %define SYS_ftruncate 201
We add the macros for their use:
%macro sys.mmap 0 system SYS_mmap %endmacro %macro sys.munmap 0 system SYS_munmap %endmacro %macro sys.ftruncate 0 system SYS_ftruncate %endmacro %macro sys.fstat 0 system SYS_fstat %endmacro
And here is our code:
;;;;;;; Fast Text-to-Unix Conversion (ftuc.asm) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;; ;; Started: 21-Dec-2000 ;; Updated: 22-Dec-2000 ;; ;; Copyright 2000 G. Adam Stanislav. ;; All rights reserved. ;; ;;;;;;; v.1 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; %include 'system.inc' section .data db 'Copyright 2000 G. Adam Stanislav.', 0Ah db 'All rights reserved.', 0Ah usg db 'Usage: ftuc filename', 0Ah usglen equ $-usg co db "ftuc: Can't open file.", 0Ah colen equ $-co fae db 'ftuc: File access error.', 0Ah faelen equ $-fae ftl db 'ftuc: File too long, use regular tuc instead.', 0Ah ftllen equ $-ftl mae db 'ftuc: Memory allocation error.', 0Ah maelen equ $-mae section .text align 4 memerr: push dword maelen push dword mae jmp short error align 4 toolong: push dword ftllen push dword ftl jmp short error align 4 facerr: push dword faelen push dword fae jmp short error align 4 cantopen: push dword colen push dword co jmp short error align 4 usage: push dword usglen push dword usg error: push dword stderr sys.write push dword 1 sys.exit align 4 global _start _start: pop eax ; argc pop eax ; program name pop ecx ; file to convert jecxz usage pop eax or eax, eax ; Too many arguments? jne usage ; Open the file push dword O_RDWR push ecx sys.open jc cantopen mov ebp, eax ; Save fd sub esp, byte stat_size mov ebx, esp ; Find file size push ebx push ebp ; fd sys.fstat jc facerr mov edx, [ebx + st_size + 4] ; File is too long if EDX != 0 ... or edx, edx jne near toolong mov ecx, [ebx + st_size] ; ... or if it is above 2 GB or ecx, ecx js near toolong ; Do nothing if the file is 0 bytes in size jecxz .quit ; Map the entire file in memory push edx push edx ; starting at offset 0 push edx ; pad push ebp ; fd push dword MAP_SHARED push dword PROT_READ | PROT_WRITE push ecx ; entire file size push edx ; let system decide on the address sys.mmap jc near memerr mov edi, eax mov esi, eax push ecx ; for SYS_munmap push edi ; Use EBX for state machine mov ebx, ordinary mov ah, 0Ah cld .loop: lodsb call ebx loop .loop cmp ebx, ordinary je .filesize ; Output final lf mov al, ah stosb inc edx .filesize: ; truncate file to new size push dword 0 ; high dword push edx ; low dword push eax ; pad push ebp sys.ftruncate ; close it (ebp still pushed) sys.close add esp, byte 16 sys.munmap .quit: push dword 0 sys.exit align 4 ordinary: cmp al, 0Dh je .cr cmp al, ah je .lf stosb inc edx ret align 4 .cr: mov ebx, cr ret align 4 .lf: mov ebx, lf ret align 4 cr: cmp al, 0Dh je .cr cmp al, ah je .lf xchg al, ah stosb inc edx xchg al, ah ; fall through .lf: stosb inc edx mov ebx, ordinary ret align 4 .cr: mov al, ah stosb inc edx ret align 4 lf: cmp al, ah je .lf cmp al, 0Dh je .cr xchg al, ah stosb inc edx xchg al, ah stosb inc edx mov ebx, ordinary ret align 4 .cr: mov ebx, ordinary mov al, ah ; fall through .lf: stosb inc edx ret
警告: Do not use this program on files stored on a disk formatted by MS-DOS® or Windows®. There seems to be a subtle bug in the FreeBSD code when using
mmap
on these drives mounted under FreeBSD: If the file is over a certain size,mmap
will just fill the memory with zeros, and then copy them to the file overwriting its contents.