Reading an Excel file with openpyxl is orders of magnitude slower than with xlrd

通远

2023-03-14

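Question:

I have an Excel spreadsheet that I need to import into SQL Server every day. The spreadsheet will contain about 250,000 rows across roughly 50 columns. I have tested both openpyxl and xlrd using nearly identical code.

Here is the code I am using (minus debugging statements):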

import xlrd
import openpyxl

def UseXlrd(file_name):
    workbook = xlrd.open_workbook(file_name, on_demand=True)
    worksheet = workbook.sheet_by_index(0)
    first_row = []
    for col in range(worksheet.ncols):
        first_row.append(worksheet.cell_value(0,col))
    data = []
    for row in range(1, worksheet.nrows):
        record = {}
        for col in range(worksheet.ncols):
            if isinstance(worksheet.cell_value(row,col), str):
                record[first_row[col]] = worksheet.cell_value(row,col).strip()
            else:
                record[first_row[col]] = worksheet.cell_value(row,col)
        data.append(record)
    return data


def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    first_row = []
    for col in range(1,sheet.max_column+1):
        first_row.append(sheet.cell(row=1,column=col).value)
    data = []
    for r in range(2,sheet.max_row+1):
        record = {}
        for col in range(sheet.max_column):
            if isinstance(sheet.cell(row=r,column=col+1).value, str):
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value
        data.append(record)
    return data

# openpyxl cannot read the legacy .xls format, so both readers get the same .xlsx file
xlrd_results = UseXlrd('foo.xlsx')
openpyxl_results = UseOpenpyxl('foo.xlsx')

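Passing the same 3,500-row Excel file to both functions gives drastically different run times. With xlrd I can read the entire file into a list of dictionaries in under 2 seconds. With openpyxl I get the following results: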

Reading Excel File...
Read 100 lines in 114.14509415626526 seconds
Read 200 lines in 471.43183994293213 seconds
Read 300 lines in 982.5288782119751 seconds
Read 400 lines in 1729.3348784446716 seconds
Read 500 lines in 2774.886833190918 seconds
Read 600 lines in 4384.074863195419 seconds
Read 700 lines in 6396.7723388671875 seconds
Read 800 lines in 7998.775000572205 seconds
Read 900 lines in 11018.460735321045 seconds

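While I could use xlrd in the final script, I would have to hard-code a lot of formatting because of various issues (e.g. ints are read as floats, dates are read as ints, datetimes are read as floats). Since I need to reuse this code for several more imports, it makes no sense to hard-code specific columns to format them correctly and then maintain similar code across 4 different scripts.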
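For illustration, the kind of per-value fix-up described above might look like the following minimal sketch. It assumes the workbook uses Excel's default 1900 date system, and the helper names `excel_serial_to_datetime` and `float_to_int_if_whole` are hypothetical, not part of xlrd. xlrd ships its own helper, `xlrd.xldate_as_datetime`, which also takes the workbook's date mode into account and is probably the safer choice in real code.

```python
from datetime import datetime, timedelta

# Assumption: the workbook uses the 1900 date system (Excel's default on Windows).
# Day 0 is 1899-12-30 so that serial numbers after 1900-03-01 match Excel's
# calendar, which wrongly treats 1900 as a leap year.
EXCEL_EPOCH_1900 = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial date (days since the epoch, as a number) to a datetime."""
    return EXCEL_EPOCH_1900 + timedelta(days=serial)


def float_to_int_if_whole(value):
    """Return an int for floats such as 3.0; leave every other value unchanged."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value
```

Having to decide, column by column, which of these conversions applies is the maintenance burden the question is trying to avoid.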

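Any advice on how to proceed?

Answer:

You can simply iterate over the worksheet's rows instead of looking up each cell by its coordinates: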

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [cell.value for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = cell.value
        data.append(record)
    return data

This should scale to large files. If the list data grows too large, you may want to process the results in chunks.
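One way to chunk the results is a generator that yields fixed-size batches of records instead of building one big list. The sketch below is not from the original answer: the name `iter_record_chunks` and the `chunk_size` parameter are hypothetical, and the function works on any iterable of row tuples whose first row holds the column names.

```python
from itertools import islice


def iter_record_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size dict records from an iterable of row tuples.

    The first row is treated as the header; string values are stripped.
    """
    rows = iter(rows)
    try:
        header = next(rows)
    except StopIteration:
        return  # empty input: nothing to yield
    while True:
        chunk = []
        # islice consumes up to chunk_size rows from the shared iterator
        for row in islice(rows, chunk_size):
            record = {}
            for key, value in zip(header, row):
                record[key] = value.strip() if isinstance(value, str) else value
            chunk.append(record)
        if not chunk:
            break
        yield chunk
```

With openpyxl, the rows could come from `sheet.iter_rows(values_only=True)` (assuming a reasonably recent openpyxl release), and each chunk could be written to the database before the next one is read, so only `chunk_size` records are held in memory at a time.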

With this change, the openpyxl version takes roughly twice as long as the xlrd version:

%timeit xlrd_results = UseXlrd('foo.xlsx')
1 loops, best of 3: 3.38 s per loop

%timeit openpyxl_results = UseOpenpyxl('foo.xlsx')
1 loops, best of 3: 6.87 s per loop

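Note that xlrd and openpyxl may differ slightly in what they interpret as an integer and what they interpret as a float. For my test data, I had to add float() to make the outputs comparable: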

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [float(cell.value) for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = float(cell.value)
        data.append(record)
    return data

Now both versions give the same results for my test data:

>>> xlrd_results == openpyxl_results
True
