Reading 30 million user IDs one by one from a large file

华旭

2023-03-14

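Question:

I am trying to read a very large file in Java. Each line of the file contains a single user ID.

The file holds roughly 30 million user IDs. I want to read all of them from the file exactly once, meaning each user ID should be picked up only once. For example, if the file has 30 million user IDs, the multithreaded code should print exactly 30 million user IDs.

Below is the code I have: a multithreaded program that runs 10 threads. With this program, however, I cannot guarantee that each user ID is picked up only once.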

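import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReadingFile {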


    public static void main(String[] args) {

        // create thread pool with given size
        ExecutorService service = Executors.newFixedThreadPool(10);

        for (int i = 0; i < 10; i++) {
            service.submit(new FileTask());
        }
    }
}

class FileTask implements Runnable {

    @Override
    public void run() {

        BufferedReader br = null;
        try {
            br = new BufferedReader(new FileReader("D:/abc.txt"));
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
                //do things with line
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                // br is still null if the file could not be opened
                if (br != null) {
                    br.close();
                }
            } catch (IOException e) {

                e.printStackTrace();
            }
        }
    }
}

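Can anyone help me? What am I doing wrong? And what is the fastest way to do this?

Answer:

Assuming you have not done something like striping the file across multiple disks, you really cannot improve on having a single thread read the file sequentially. With one thread, you do one seek followed by one long sequential read. With multiple threads, each thread causes another seek whenever it gains control of the disk head.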
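
As a baseline, the single-threaded sequential read described above can be sketched as follows. This is only an illustration: the class name `SequentialRead` and the helper `countLines` are hypothetical, and a `StringReader` stands in for the real file so the sketch runs anywhere. With the actual file, one would pass `new FileReader("D:/abc.txt")` instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class SequentialRead {

    // Reads every line exactly once on the calling thread and returns the line count.
    static long countLines(Reader source) {
        try (BufferedReader br = new BufferedReader(source)) {
            long count = 0;
            while (br.readLine() != null) {
                count++; // per-line work would go here
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: an in-memory reader replaces new FileReader("D:/abc.txt")
        System.out.println(countLines(new StringReader("u1\nu2\nu3\n"))); // prints 3
    }
}
```

Because only one reader ever walks the file, each line is seen once by construction. The question's code visits lines repeatedly because each of its 10 tasks opens its own reader on the same file.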

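Edit: Here is a way to process the lines in parallel while still reading them with serial I/O. It uses a BlockingQueue to communicate between threads: the FileTask adds lines to the queue, and the CPUTasks take them off and process them. BlockingQueue is a thread-safe data structure, so no extra synchronization is needed. Lines are added with put(E e), so if the queue is full (it holds at most 200 strings, as set where it is declared in ReadingFile), the FileTask blocks until space frees up. Likewise, items are removed with take(), so a CPUTask blocks until an item is available.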

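import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ReadingFile {
    // get() and awaitTermination() throw checked exceptions, so main declares them
    public static void main(String[] args) throws InterruptedException, ExecutionException {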

        final int threadCount = 10;

        // BlockingQueue with a capacity of 200
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(200);

        // create thread pool with given size
        ExecutorService service = Executors.newFixedThreadPool(threadCount);

        for (int i = 0; i < (threadCount - 1); i++) {
            service.submit(new CPUTask(queue));
        }

        // Wait til FileTask completes
        service.submit(new FileTask(queue)).get();

        service.shutdownNow();  // interrupt CPUTasks

        // Wait til CPUTasks terminate
        service.awaitTermination(365, TimeUnit.DAYS);

    }
}

class FileTask implements Runnable {

    private final BlockingQueue<String> queue;

    public FileTask(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        // try-with-resources closes the reader even if opening or reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("D:/abc.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                // block if the queue is full
                queue.put(line);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            // put() was interrupted: restore the flag and stop reading
            Thread.currentThread().interrupt();
        }
    }
}

class CPUTask implements Runnable {

    private final BlockingQueue<String> queue;

    public CPUTask(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        String line;
        while(true) {
            try {
                // block if the queue is empty
                line = queue.take(); 
                // do things with line
            } catch (InterruptedException ex) {
                break; // FileTask has completed
            }
        }
        // poll() returns null if the queue is empty
        while((line = queue.poll()) != null) {
            // do things with line;
        }
    }
}
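
To check that this producer/consumer arrangement hands each line to exactly one consumer, here is a minimal self-contained sketch. It makes several assumptions that are not in the answer above: the class name `ExactlyOnceDemo` and the method `process` are hypothetical, an in-memory list replaces `D:/abc.txt`, the calling thread plays the role of FileTask, and a sentinel value (a "poison pill") signals the end of input instead of `shutdownNow()`. The sentinel is a swapped-in technique, shown only because it makes the line count easy to verify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class ExactlyOnceDemo {

    // End-of-input marker, compared by identity so it cannot collide with a real line.
    private static final String POISON = new String("<<EOF>>");

    // Sends each line through a bounded queue to `workers` consumers and
    // returns how many lines the consumers handled in total.
    static long process(List<String> lines, int workers) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(200);
        AtomicLong processed = new AtomicLong();
        ExecutorService service = Executors.newFixedThreadPool(workers);
        try {
            for (int i = 0; i < workers; i++) {
                service.submit(() -> {
                    try {
                        String line;
                        // take() removes the element, so no other consumer can see it
                        while ((line = queue.take()) != POISON) {
                            processed.incrementAndGet(); // per-line work would go here
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            for (String line : lines) {
                queue.put(line); // single producer; blocks while the queue is full
            }
            for (int i = 0; i < workers; i++) {
                queue.put(POISON); // one marker per consumer
            }
            service.shutdown();
            service.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            service.shutdownNow();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add("user" + i);
        }
        System.out.println(process(ids, 4)); // prints 1000
    }
}
```

The count of handled lines matches the count of lines produced because there is one producer and `take()` hands each queued element to a single consumer.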
