Question:

GZIPInputStream closes prematurely when decompressing an HTTPInputStream

姚海
2023-03-14

Please see the updated question in the edit section below.

I am trying to decompress a large (~300MB) GZIPed file from Amazon S3 on the fly using GZIPInputStream, but it only outputs a portion of the file; however, if I download the file to the filesystem before decompression, GZIPInputStream decompresses the entire file.

How can I get GZIPInputStream to decompress the entire HTTPInputStream instead of just the first part of it?

See the update in the edit section below.

I suspected an HTTP problem, except that no exception is ever thrown, GZIPInputStream returns a fairly consistent chunk of the file each time, and, as far as I can tell, it always breaks on a WET record boundary, although the boundary it picks differs for each URL (which is very strange, since everything is being treated as a binary stream and no parsing of the WET records in the file is happening at all).

The closest question I could find is "GZIPInputStream is prematurely closed when reading from s3". The answer to that question was that some GZIP files are actually multiple appended GZIP files, and that GZIPInputStream does not handle those well. However, if that is the case, why does GZIPInputStream work fine on a local copy of the file?

Below is sample code that demonstrates the problem I am seeing. I have tested it with Java 1.8.0_72 and 1.8.0_112 on two different Linux machines on two different networks, with similar results. I expect the byte count from the decompressed HTTPInputStream to be identical to the byte count from the decompressed local copy of the file, but the decompressed HTTPInputStream is much smaller.

Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 87894 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 1772936 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 89217 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet

import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;

public class GZIPTest {
    public static void main(String[] args) throws Exception {
        // Our three test files from CommonCrawl
        URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

        /*
         * Test the URLs and display the results
         */
        test(url0, "testfile0.wet");
        System.out.println("------");
        test(url40, "testfile40.wet");
        System.out.println("------");
        test(url500, "testfile500.wet");
    }

    public static void test(URL url, String testGZFileName) throws Exception {
        System.out.println("Testing URL "+url.toString());

        // First directly wrap the HTTPInputStream with GZIPInputStream
        // and count the number of bytes we read
        // Go ahead and save the extracted stream to a file for further inspection
        System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
        int bytesFromGZIPDirect = 0;
        URLConnection urlConnection = url.openConnection();
        FileOutputStream directGZIPOutStream = new FileOutputStream("./"+testGZFileName);

        // FIRST TEST - Decompress from HTTPInputStream
        GZIPInputStream gzipishttp = new GZIPInputStream(urlConnection.getInputStream());

        byte[] buffer = new byte[1024];
        int bytesRead = -1;
        while ((bytesRead = gzipishttp.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPDirect += bytesRead;
            directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
        }
        gzipishttp.close();
        directGZIPOutStream.close();

        // Now save the GZIPed file locally
        System.out.println("Testing saving to file before decompression");
        int bytesFromGZIPFile = 0;
        ReadableByteChannel rbc = Channels.newChannel(url.openStream());
        FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
        outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        outputStream.close();

        // SECOND TEST - decompress from FileInputStream
        GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));

        buffer = new byte[1024];
        bytesRead = -1;
        while((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPFile += bytesRead;
        }
        gzipis.close();

        // The Results - these numbers should match but they don't
        System.out.println("Read "+bytesFromGZIPDirect+" bytes from HTTP->GZIP");
        System.out.println("Read "+bytesFromGZIPFile+" bytes from HTTP->file->GZIP");
        System.out.println("Output from HTTP->GZIP saved to file "+testGZFileName);
    }

}

Per @VGR's comment, the streams and the associated channel are closed in the demonstration code.
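
As an aside, a minimal sketch of the same read loop written with try-with-resources (Java 7+) would close both streams automatically even if read() throws; the class and method names here are illustrative and not part of the original code:

import java.io.IOException;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class CloseDemo {
    // Count decompressed bytes from a URL. try-with-resources closes
    // the GZIPInputStream (and the wrapped HTTP stream) automatically,
    // even if read() throws an IOException.
    public static long decompressedLength(URL url) throws IOException {
        long total = 0;
        try (GZIPInputStream in =
                 new GZIPInputStream(url.openConnection().getInputStream())) {
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                total += bytesRead;
            }
        }
        return total;
    }
}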

Update:

The problem does indeed appear to be specific to the file. I pulled the CommonCrawl WET archive down locally (wget), decompressed it (gunzip 1.8), recompressed it (gzip 1.8), and re-uploaded it to S3, and on-the-fly decompression then worked fine. You can see the test if you modify the sample code above to include the following lines:

// Original file from CommonCrawl hosted on S3
URL originals3 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
// Recompressed file hosted on S3
URL rezippeds3 = new URL("https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");

test(originals3, "originalhost.txt");
test(rezippeds3, "rezippedhost.txt");

URL rezippeds3 points to the WET archive file that I downloaded, decompressed, recompressed, and then re-uploaded to S3. You will see the following output:

Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 7212400 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file originals3.txt
------
Testing URL https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file rezippeds3.txt

As you can see, once the file was recompressed I was able to stream it through GZIPInputStream and get the entire file. The original file still shows the usual premature end of decompression. When I downloaded and uploaded the WET file without recompressing it, I got the same incomplete streaming behavior, so it was definitely the recompression that fixed it. I also put both the original and the recompressed files on a traditional Apache web server and was able to reproduce the results, so S3 does not seem to have anything to do with the problem.

So, I have a new question.

Why does a FileInputStream behave differently from an HTTPInputStream when reading identical content? If it is exactly the same file, why does:

new GZIPInputStream(urlConnection.getInputStream());

behave differently from

new GZIPInputStream(new FileInputStream("./test.wet.gz"));

?? Isn't an input stream just an input stream ??

1 Answer

公良泰宁
2023-03-14

It turns out that InputStreams can vary quite a bit. In particular, they differ in how they implement the .available() method. For example, ByteArrayInputStream.available() returns the number of bytes remaining in the InputStream. HTTPInputStream.available(), however, returns the number of bytes that can be read before a blocking IO request has to be made to refill the buffer. (See the Java docs for more information.)
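
A small sketch of that difference (hypothetical, not from the original answer; https://example.com/ is just a placeholder URL):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URL;

public class AvailableDemo {
    public static void main(String[] args) throws Exception {
        // ByteArrayInputStream: available() is the exact number of
        // unread bytes remaining in the backing array.
        InputStream bais = new ByteArrayInputStream(new byte[1024]);
        System.out.println(bais.available()); // prints 1024

        // An HTTP-backed stream: available() only reports bytes already
        // sitting in the local buffer, so it can legitimately be 0 even
        // though the server still has more data to send.
        InputStream http = new URL("https://example.com/").openStream();
        System.out.println(http.available()); // often 0 right after opening
        http.close();
    }
}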

The problem is that GZIPInputStream uses the output of .available() to determine whether an additional GZIP file might be available in the InputStream after it finishes decompressing a complete GZIP file. Here is line 231 from method readTrailer() in the OpenJDK source file GZIPInputStream.java:

   if (this.in.available() > 0 || n > 26) {

If the HTTPInputStream's read buffer happens to empty right at the boundary between two concatenated GZIP files, GZIPInputStream calls .available(), which responds with 0 because it would have to go out to the network to refill the buffer, and so GZIPInputStream treats the file as complete and closes prematurely.

The CommonCrawl .wet archives are hundreds of megabytes of small concatenated GZIP files, so eventually the HTTPInputStream buffer empties at the end of one of the concatenated GZIP files and GZIPInputStream closes prematurely. This explains the problem demonstrated in the question.
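
Here is a self-contained sketch of that concatenated-member layout (illustrative only, not from the original post). Two GZIP members written back to back decompress fully when available() is accurate, as it is for ByteArrayInputStream:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatGzipDemo {
    // Compress a string into one complete GZIP member.
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes("UTF-8"));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Two independent GZIP members back to back -- the same layout
        // as a CommonCrawl WET archive, just much smaller.
        ByteArrayOutputStream concat = new ByteArrayOutputStream();
        concat.write(gzip("first member\n"));
        concat.write(gzip("second member\n"));

        // ByteArrayInputStream.available() is exact, so readTrailer()
        // sees > 0 at the member boundary and keeps decompressing.
        GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(concat.toByteArray()));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        System.out.println(out.toString("UTF-8")); // prints both members
    }
}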

This GIST contains a patch to jdk8u152-b00 revision 12039, plus two jtreg tests that remove what is (in my humble opinion) an incorrect reliance on .available().

If you are unable to patch the JDK, a workaround is to make sure that available() always returns > 0, which forces GZIPInputStream to always check for another GZIP file in the stream.

The output below shows that when the HTTPInputStream is wrapped as discussed, GZIPInputStream produces identical results when reading the concatenated GZIP from a file and when reading it directly from HTTP.

Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 451171329 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 453183600 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet

Below is the demonstration code from the question, modified with the InputStream wrapper.

import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;

public class GZIPTest {
    // Here is a wrapper class that wraps an InputStream
    // but always returns > 0 when .available() is called.
    // This will cause GZIPInputStream to always make another 
    // call to the InputStream to check for an additional 
    // concatenated GZIP file in the stream.
    public static class AvailableInputStream extends InputStream {
        private InputStream is;

        AvailableInputStream(InputStream inputstream) {
            is = inputstream;
        }

        public int read() throws IOException {
            return(is.read());
        }

        public int read(byte[] b) throws IOException {
            return(is.read(b));
        }

        public int read(byte[] b, int off, int len) throws IOException {
            return(is.read(b, off, len));
        }

        public void close() throws IOException {
            is.close();
        }

        public int available() throws IOException {
            // Always say that we have 1 more byte in the
            // buffer, even when we don't
            int a = is.available();
            if (a == 0) {
                return(1);
            } else {
                return(a);
            }
        }
    }



    public static void main(String[] args) throws Exception {
        // Our three test files from CommonCrawl
        URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
        URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");

        /*
         * Test the URLs and display the results
         */
        test(url0, "testfile0.wet");
        System.out.println("------");
        test(url40, "testfile40.wet");
        System.out.println("------");
        test(url500, "testfile500.wet");
    }

    public static void test(URL url, String testGZFileName) throws Exception {
        System.out.println("Testing URL "+url.toString());

        // First directly wrap the HTTP inputStream with GZIPInputStream
        // and count the number of bytes we read
        // Go ahead and save the extracted stream to a file for further inspection
        System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
        int bytesFromGZIPDirect = 0;
        URLConnection urlConnection = url.openConnection();
        // Wrap the HTTPInputStream in our AvailableInputStream
        AvailableInputStream ais = new AvailableInputStream(urlConnection.getInputStream());
        GZIPInputStream gzipishttp = new GZIPInputStream(ais);
        FileOutputStream directGZIPOutStream = new FileOutputStream("./"+testGZFileName);
        int buffersize = 1024;
        byte[] buffer = new byte[buffersize];
        int bytesRead = -1;
        while ((bytesRead = gzipishttp.read(buffer, 0, buffersize)) != -1) {
            bytesFromGZIPDirect += bytesRead;
            directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
        }
        gzipishttp.close();
        directGZIPOutStream.close();

        // Save the GZIPed file locally
        System.out.println("Testing saving to file before decompression");
        ReadableByteChannel rbc = Channels.newChannel(url.openStream());
        FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
        outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
        outputStream.close();

        // Now decompress the local file and count the number of bytes
        int bytesFromGZIPFile = 0;
        GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));

        buffer = new byte[1024];
        while((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
            bytesFromGZIPFile += bytesRead;
        }
        gzipis.close();

        // The Results
        System.out.println("Read "+bytesFromGZIPDirect+" bytes from HTTP->GZIP");
        System.out.println("Read "+bytesFromGZIPFile+" bytes from HTTP->file->GZIP");
        System.out.println("Output from HTTP->GZIP saved to file "+testGZFileName);
    }

}