The Basic Process of Input and Output in Linux

1. The kernel-level file buffer

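As mentioned earlier when discussing how to understand the "everything is a file" characteristic of Linux, every process should be able to access files in nearly the same way. To make that possible, the file structure holds a pointer to a file_operations structure, which stores function pointers for all of the file's operations (this structure is also called the operation table).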

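Besides its own operation table, each file structure is also associated with a kernel-level file buffer. This buffer is different from the buffers provided at the language level. When the underlying read or write system interface is called, the data is first staged in this buffer and only then moved to its destination.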

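Take read as an example, reading the contents of a file on disk: the operating system first loads the file contents from disk into the buffer, and read then copies them from the buffer into the memory location specified by the caller. For write, the function first copies the content to be written from memory into the buffer, and the operating system later writes it from the buffer to disk. Besides these two simple paths there is a compound one: modifying the contents of a file. This involves both reading and writing, so the operating system first loads the content to be modified into the kernel-level buffer, read copies it from the buffer into memory, the program modifies it while running, write copies the modified content back into the buffer, and finally the operating system writes it from the buffer to disk.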

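Abstracting the behavior of read and write shows that both functions essentially just read from or write to the buffer, so they can be understood as copy functions: read copies data from the buffer to a specified location in memory, and write copies data from memory into the buffer.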

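Throughout this process, read and write only deal with the data in the buffer. When the data in the buffer is actually moved to the device is decided by the operating system.

If the user wants to decide when the data in the kernel-level buffer is flushed to the device, the fsync function can be used. Its prototype is: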

int fsync(int fd);

[Figure: the read and write paths through the kernel-level buffer]

2. What is redirection

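As mentioned earlier, a file descriptor identifies an open file within a process: given a descriptor, exactly one open file is determined. stdin, stdout and stderr are the three files that every C program opens by default, with file descriptors 0, 1 and 2. Descriptors start from 0 because a file descriptor is simply an index into the fd_array array. When the user opens another file, the pointer to its file structure is stored at a slot in fd_array, and since indexes 0, 1 and 2 are already taken, newly opened files can only start from index 3.

Now close the file at slot 0 and observe the effect, for example with the following code: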

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main()
{
    close(0);

    // O_CREAT requires the third (mode) argument
    int fd1 = open("test1.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd2 = open("test2.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd3 = open("test3.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd4 = open("test4.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);

    printf("%d\n", fd1);
    printf("%d\n", fd2);
    printf("%d\n", fd3);
    printf("%d\n", fd4);

    close(fd1);
    close(fd2);
    close(fd3);
    close(fd4);
    return 0;
}

If file 2 is closed instead, observe the effect with the following code:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main()
{
    close(2);

    // O_CREAT requires the third (mode) argument
    int fd1 = open("test1.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd2 = open("test2.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd3 = open("test3.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd4 = open("test4.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);

    printf("%d\n", fd1);
    printf("%d\n", fd2);
    printf("%d\n", fd3);
    printf("%d\n", fd4);

    close(fd1);
    close(fd2);
    close(fd3);
    close(fd4);
    return 0;
}

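Running them: started from an ordinary shell, where only descriptors 0, 1 and 2 are open at startup, the first program should print 0, 3, 4, 5 and the second 2, 3, 4, 5.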

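The code above shows that in Linux, when a file is opened, the descriptor allocation rule is to pick the smallest index in fd_array that is not occupied by another file structure pointer.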

[Figure: fd_array after slot 0 is closed and a new file is opened]

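In the demonstration above, a file with a low descriptor was closed first and then another file was opened, and the newly opened file landed in the slot of the closed one. Any later input or output through that descriptor then goes to the new file instead of the original one. For example, after close(2), anything written to stderr ends up in test1.txt. This is called file redirection. It works because the descriptor number itself never changes, and input and output functions only look at the descriptor; they do not care which file the slot behind it points to.

Linux provides a system call that performs redirection from within a program: dup2. Its prototype is: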

int dup2(int oldfd, int newfd);

The Linux manual describes dup2 as follows:

dup2() makes newfd be the copy of oldfd, closing newfd first if necessary

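Parameters:

oldfd: the file descriptor whose file pointer is copied
newfd: the file descriptor whose file pointer is overwritten

[Figure: dup2 copies the file pointer in slot oldfd into slot newfd]

So with dup2, output from printf that would normally go to stdout can be sent to test1.txt instead. The code is as follows: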

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>

int main()
{
    int fd1 = open("test1.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd2 = open("test2.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd3 = open("test3.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd4 = open("test4.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);

    dup2(fd1, 1);

    printf("%d\n", fd1);
    printf("%d\n", fd2);
    printf("%d\n", fd3);
    printf("%d\n", fd4);

    close(fd1);
    close(fd2);
    close(fd3);
    close(fd4);
    return 0;
}

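Note that stdout was already open before dup2 was called, so the first open could only use slot 3. Inspecting test1.txt therefore shows that dup2 did send the output to test1.txt rather than to stdout. [Figure: fd_array after dup2(fd1, 1)]

One detail deserves attention: although dup2 closes the overwritten file stdout, the copied file test1.txt immediately takes over stdout's slot, so the slot does not become free. Files opened afterwards therefore still start from slot 7, as the following code demonstrates: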

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>

int main()
{
    int fd1 = open("test1.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd2 = open("test2.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd3 = open("test3.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd4 = open("test4.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);

    dup2(fd1, 1);

    // Open a new file after the redirection
    int fd5 = open("test5.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);

    printf("%d\n", fd1);
    printf("%d\n", fd2);
    printf("%d\n", fd3);
    printf("%d\n", fd4);
    printf("%d\n", fd5);

    close(fd1);
    close(fd2);
    close(fd3);
    close(fd4);
    close(fd5);
    return 0;
}

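So far the code has only closed the files at slots 0 and 2. If the file at slot 1 is closed instead and the same four files are opened and their descriptors printed, the pattern above suggests that the output will go to test1.txt, because slot 1 now holds test1.txt. For example: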

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main()
{
    close(1);

    int fd1 = open("test1.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd2 = open("test2.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd3 = open("test3.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd4 = open("test4.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);

    printf("%d\n", fd1);
    printf("%d\n", fd2);
    printf("%d\n", fd3);
    printf("%d\n", fd4);

    close(fd1);
    close(fd2);
    close(fd3);
    close(fd4);
    return 0;
}

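After compiling and running this code, neither test1.txt nor any of the other files contains the expected output, and nothing is printed to the console either.

But if the code is changed as follows: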

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main()
{
    // ...
    fflush(stdout);

    close(fd1);
    close(fd2);
    close(fd3);
    close(fd4);
    return 0;
}

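After compiling and running this version, the descriptor values appear in test1.txt as expected.

This raises a question: why does adding fflush(stdout) produce the expected result?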

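As mentioned earlier, the read and write system interfaces involve a kernel-level file buffer, which exists to reduce the cost of the operating system's I/O operations on the device. When the user works through language-level interfaces such as fread or fwrite, there is a further cost between the user's call and the operating system: every system call has to switch from user mode into the kernel and back, and this adds up if a language-level function makes many small system calls. The language-level buffer exists to reduce that cost by collecting data in user space and handing it to the kernel in fewer, larger calls.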

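However, when implementing the progress bar it was mentioned that \n flushes the language-level buffer, so why did \n not flush it here? The reason is that the language-level buffer has three flushing modes:

Line buffering: flushed on every newline; this is the mode used when stdout refers to a terminal
Full buffering: flushed when the language-level buffer is full; this is the mode used for regular files
No buffering: data is passed on immediately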

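The \n here belongs to line buffering. But the file behind fd 1, which is the descriptor printf writes to, is no longer the terminal that stdout normally refers to, so line buffering no longer applies and the stream falls back to full buffering. That is why fflush(stdout) must be called: it invokes the write function to push the data in the language-level buffer into the kernel-level buffer, and the operating system then flushes it to the output device on its own schedule.

Now change the code as follows and observe the effect: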

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>

int main()
{
    close(1);

    int fd1 = open("test1.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd2 = open("test2.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd3 = open("test3.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);
    int fd4 = open("test4.txt", O_WRONLY | O_TRUNC | O_CREAT, 0666);

    printf("%d\n", fd1);
    printf("%d\n", fd2);
    printf("%d\n", fd3);
    printf("%d\n", fd4);

    return 0;
}

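After compiling and running this code, the output again appears in test1.txt.

Even without fflush(stdout) the content is written to test1.txt correctly. The main reason is that the buffers are flushed automatically when the program is about to exit, which amounts to an implicit fflush(stdout). In the earlier version, calling close(fd1) prevented that flush from working: by the time the program exited, the underlying file had already been closed, so the language-level buffer could no longer be flushed into the kernel-level buffer through a system call.

From these results we can infer that the C file structure FILE must contain a file descriptor and a buffer, and that all C input and output functions are essentially copy functions as well, except that their source or destination is the buffer inside that structure. The corresponding source code is: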

struct _IO_FILE {
    int _flags; /* High-order word is _IO_MAGIC; rest is flags. */
#define _IO_file_flags _flags

    // Buffer-related fields
    /* The following pointers correspond to the C++ streambuf protocol. */
    /* Note: Tk uses the _IO_read_ptr and _IO_read_end fields directly. */
    char* _IO_read_ptr;   /* Current read pointer */
    char* _IO_read_end;   /* End of get area. */
    char* _IO_read_base;  /* Start of putback+get area. */
    char* _IO_write_base; /* Start of put area. */
    char* _IO_write_ptr;  /* Current put pointer. */
    char* _IO_write_end;  /* End of put area. */
    char* _IO_buf_base;   /* Start of reserve area. */
    char* _IO_buf_end;    /* End of reserve area. */

    /* The following fields are used to support backing up and undo. */
    char *_IO_save_base;   /* Pointer to start of non-current get area. */
    char *_IO_backup_base; /* Pointer to first valid character of backup area */
    char *_IO_save_end;    /* Pointer to end of non-current get area. */

    struct _IO_marker *_markers;

    struct _IO_FILE *_chain;

    int _fileno; // The wrapped file descriptor
#if 0
    int _blksize;
#else
    int _flags2;
#endif
    _IO_off_t _old_offset; /* This used to be _offset but it's too small. */

#define __HAVE_COLUMN /* temporary */
    /* 1+column number of pbase(); 0 is unknown. */
    unsigned short _cur_column;
    signed char _vtable_offset;
    char _shortbuf[1];

    /* char* _save_gptr; char* _save_egptr; */

    _IO_lock_t *_lock;
#ifdef _IO_USE_OLD_IO_FILE
};

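3. Child processes and buffers

All the demonstrations so far involved a single process opening and closing files. Now suppose a child process is created before the language-level buffer is flushed automatically. Observe what the following code does: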

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main()
{
    int fd = open("log.txt", O_CREAT | O_WRONLY | O_APPEND, 0666);
    dup2(fd, stdout->_fileno);

    // C library functions
    printf(" hello printf\n");
    fprintf(stdout, " hello fprintf\n");
    const char *message = " hello fwrite\n";
    fwrite(message, 1, strlen(message), stdout);

    // System call
    const char *w = " hello write\n";
    write(1, w, strlen(w));

    // Create a child process
    fork();
    return 0;
}

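Inspect the contents of log.txt:

Every C library function's output appears twice, while the output of the write system call appears only once. The cause lies in how the language-level buffer is flushed into the kernel-level buffer. Because stdout now refers to a regular file, it is fully buffered, so at the time of fork the output of printf, fprintf and fwrite is still sitting in the parent's language-level buffer. fork creates a child that shares everything in the parent, including the language-level buffer, until one of them modifies it. Since the parent had not flushed the buffer before the child was created, both processes flush their own copy when they exit, and both write the same data into the kernel-level buffer through the system call interface. write is not duplicated because it bypasses the language-level buffer: the parent's call had already completed before fork, the data was already in the kernel-level buffer, and there is nothing left for the child to write again. Finally the operating system decides when to flush everything to the file.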