Question:

How to do a recursive sub-folder search and return files in a list?

漆雕深

2023-03-14

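I am writing a script to recursively go through the subfolders of a main folder and build a list of files of a certain type. I am having an issue with the script. It is currently set up as follows: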

for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,subFolder,item))

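The problem is that the subFolder variable is pulling in a list of subfolders rather than the folder in which the item file is located. I was thinking of running a for loop over the subfolders first and joining the first part of the path, but I figured I would double-check whether anyone has suggestions before doing that.

3 answers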

申屠英韶

2023-03-14

Anonymous user

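This seems to be the fastest solution I could come up with. It is faster than os.walk and a lot faster than any glob solution.

It will also give you a list of all nested subfolders at basically no cost.
You can search for several different extensions.
You can also choose to return either full paths or just the file names by changing f.path to f.name (do not change it for subfolders!).

The function returns two lists: subfolders, files.

See below for a detailed speed analysis.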

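import os


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    subfolders, files = [], []

    # Collect the direct subfolders and the matching files of this folder
    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)

    # Recurse into each direct subfolder; list() takes a copy because
    # subfolders grows while the loop runs
    for subfolder in list(subfolders):
        sf, f = run_fast_scandir(subfolder, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files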


subfolders, files = run_fast_scandir(folder, [".jpg"])

If you need the file size, you can also create a sizes list and add f.stat().st_size to it, like this, to display MiB:

sizes.append(f"{f.stat().st_size/1024/1024:.0f} MiB")
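The size idea can be sketched as a small standalone variant. The names `scan_with_sizes` and `format_mib` and the (path, size) tuple layout are assumptions made for illustration and are not part of the original function. Unlike the recursive function above, this sketch uses an explicit stack and does not descend into symlinked directories.

```python
import os


def scan_with_sizes(top, ext):
    """Return (path, size_in_bytes) pairs for files under top whose
    lowercased extension is in ext (hypothetical helper)."""
    results = []
    pending = [top]  # directories that still need to be scanned
    while pending:
        current = pending.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
                elif entry.is_file():
                    if os.path.splitext(entry.name)[1].lower() in ext:
                        results.append((entry.path, entry.stat().st_size))
    return results


def format_mib(size_in_bytes):
    """Format a byte count as whole MiB, using the same f-string as above."""
    return f"{size_in_bytes / 1024 / 1024:.0f} MiB"
```

Keeping the raw byte count in the list and formatting only for display makes it possible to sort or sum the sizes later.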

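Speed analysis

Comparison of various methods for getting all files with a specific file extension inside the main folder and all of its subfolders.

tl;dr:

fast_scandir clearly wins and is roughly twice as fast as all other solutions, except os.walk.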

fast_scandir    took  499 ms. Found files: 16596. Found subfolders: 439
os.walk         took  589 ms. Found files: 16596
find_files      took  919 ms. Found files: 16596
glob.iglob      took  998 ms. Found files: 16596
glob.glob       took 1002 ms. Found files: 16596
pathlib.rglob   took 1041 ms. Found files: 16596
os.walk-glob    took 1043 ms. Found files: 16596

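Tests were done with W7x64, Python 3.8.1, 20 runs. 16596 files in 439 (partially nested) subfolders.
find_files is from https://stackoverflow.com/a/45646357/2441026 and lets you search for several extensions.
fast_scandir was written by myself and will also return a list of subfolders. You can give it a list of extensions to search for (I tested a list with one entry against a simple if ... == ".jpg" and there was no significant difference).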

# -*- coding: utf-8 -*-
# Python 3


import time
import os
from glob import glob, iglob
from pathlib import Path


directory = r"<folder>"
RUNS = 20


def run_os_walk():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [os.path.join(dp, f) for dp, dn, filenames in os.walk(directory) for f in filenames if
                  os.path.splitext(f)[1].lower() == '.jpg']
    print(f"os.walk\t\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_os_walk_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [y for x in os.walk(directory) for y in glob(os.path.join(x[0], '*.jpg'))]
    print(f"os.walk-glob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    print(f"glob.glob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_iglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(iglob(os.path.join(directory, '**', '*.jpg'), recursive=True))
    print(f"glob.iglob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_pathlib_rglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(Path(directory).rglob("*.jpg"))
    print(f"pathlib.rglob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def find_files(files, dirs=[], extensions=[]):
    # https://stackoverflow.com/a/45646357/2441026

    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1].lower() in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    # https://stackoverflow.com/a/59803793/2441026

    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files



if __name__ == '__main__':
    run_os_walk()
    run_os_walk_glob()
    run_glob()
    run_iglob()
    run_pathlib_rglob()


    a = time.time_ns()
    for i in range(RUNS):
        files = []
        find_files(files, dirs=[directory], extensions=[".jpg"])
    print(f"find_files\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}")


    a = time.time_ns()
    for i in range(RUNS):
        subf, files = run_fast_scandir(directory, [".jpg"])
    print(f"fast_scandir\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}. Found subfolders: {len(subf)}")

凌展

2023-03-14

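Changed in Python 3.5: support for recursive globs using "**".

glob.glob() got a new recursive parameter.

If you want to get every .txt file under my_path (recursively, including subdirectories):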

import glob

files = glob.glob(my_path + '/**/*.txt', recursive=True)

# my_path/     the dir
# **/       every file and dir under my_path
# *.txt     every file that ends with '.txt'

If you need an iterator, you can use iglob as an alternative:

for file in glob.iglob(my_path + '/**/*.txt', recursive=True):
    print(file)  # replace with your own processing
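A sketch of the recursive pattern wrapped in a generator, with os.path.join building the pattern so that a trailing separator in my_path does not matter. The helper name `iter_files` and its `pattern` parameter are hypothetical. Note that glob skips names starting with a dot by default.

```python
import glob
import os


def iter_files(my_path, pattern="*.txt"):
    """Lazily yield paths under my_path (recursively) that match pattern."""
    full_pattern = os.path.join(my_path, "**", pattern)
    yield from glob.iglob(full_pattern, recursive=True)
```

Because this is a generator, paths are produced one at a time, which helps when the tree is large and only the first few matches are needed.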

益锦程

2023-03-14

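You should be using the dirpath, which you call root. The dirnames are supplied so that you can prune them if there are folders you do not want os.walk to recurse into.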

import os
result = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH) for f in filenames if os.path.splitext(f)[1] == '.txt']
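Pruning can be sketched as follows: assigning to the dirnames list in place (slice assignment) tells os.walk which subfolders to skip. The function name `walk_pruned` and its `skip` parameter are assumptions for illustration.

```python
import os


def walk_pruned(top, skip, ext=".txt"):
    """Collect files with extension ext under top, without descending
    into any folder whose name is in skip (hypothetical helper)."""
    result = []
    for dirpath, dirnames, filenames in os.walk(top):
        # Modify the list in place; rebinding dirnames to a new list
        # would not affect which folders os.walk visits.
        dirnames[:] = [d for d in dirnames if d not in skip]
        for name in filenames:
            if os.path.splitext(name)[1] == ext:
                result.append(os.path.join(dirpath, name))
    return result
```

This relies on the default top-down traversal of os.walk; with topdown=False the subfolders have already been visited by the time dirnames is seen.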

Edit:

After the latest downvote, it occurred to me that glob is a better tool for selecting by extension.

import os
from glob import glob
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

Also a generator version:

from itertools import chain
result = (chain.from_iterable(glob(os.path.join(x[0], '*.txt')) for x in os.walk('.')))

Edit2 for Python 3.4+

from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))
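The character-class pattern above matches the extension in any letter case. An alternative sketch, assuming a hypothetical helper named `find_by_suffix`, filters on the lowercased suffix instead, which may be easier to extend to several extensions:

```python
from pathlib import Path


def find_by_suffix(top, suffixes):
    """Return files under top whose lowercased suffix is in suffixes."""
    return [
        p for p in Path(top).rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    ]
```

For example, find_by_suffix(".", {".txt", ".md"}) would return Path objects for both extensions in a single pass over the tree.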
