Question:

Skip every specified row before importing into pandas

万俟嘉珍

2023-03-14

I have a data file in which certain lines need to be skipped.

(1 1),skip,this
skip,this,too
1,2,3
4,5,6
7,8,9
10,11,12
(1 2),skip,this
skip,this,too
...

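The two lines to skip repeat after every 4 data rows. I tried the approach from the post "Pandas: ignore all lines following a specific string when reading a file into a DataFrame", but the lines were not skipped and the DataFrame turned into a MultiIndex.

I also tried looping with startswith() and appending the lines to a list, but then all the data ended up in a single column.

I am trying to get this output: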

1,2,3
4,5,6
7,8,9
10,11,12

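There are multiple files, each with more than 7 million lines. I am looking for a fast, efficient way to do this.

I tried building a list that skips rows 0 and 1, then rows 6 and 7, and so on. Is it possible to do it that way?

3 answers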

艾飞宇

2023-03-14

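One approach is to generate the list of row numbers to skip. That needs the number of lines in the file, which you can get as described in "Count how many lines are in a CSV Python?".

Then do the following: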
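A minimal sketch of the counting step and of building the skip list from it. The helper names `count_lines` and `rows_to_skip` are hypothetical, and the block size of 6 (2 lines to skip plus 4 data rows) is assumed from the sample in the question.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count lines by reading the file in binary chunks (hypothetical helper)."""
    count = 0
    last = b"\n"  # treat an empty file as having zero lines
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            count += chunk.count(b"\n")
            last = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last != b"\n":
        count += 1
    return count


def rows_to_skip(n_lines, block=6, skip=2):
    """Return the indices of the first `skip` lines in every `block` lines."""
    return [i for i in range(n_lines) if i % block < skip]
```

The resulting list can be passed as `skiprows` to `pd.read_csv`, for example `rows_to_skip(count_lines("file.csv"))`, where `"file.csv"` is a placeholder path.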

In [16]:
import io
import pandas as pd
t="""(1 1),skip,this
skip,this,too
1,2,3
4,5,6
7,8,9
10,11,12
(1 2),skip,this
skip,this,too"""
# generate the initial list of row numbers; 10 is used here,
# but you can get the real number of rows using another method
a = list(range(10))
# generate the pairs of rows to skip: rows 0 and 1 of every block of 6
rows = sorted(a[::6] + a[1::6])
# now read it in
pd.read_csv(io.StringIO(t), skiprows=rows, header=None)

Out[16]:
    0   1   2
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
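As a possible variation on the list above, `skiprows` in `pd.read_csv` also accepts a callable that is evaluated against each row index, which avoids counting the lines first. The sketch below assumes the same repeating block of 6 lines (2 to skip, 4 to keep) as the sample; the names `BLOCK` and `SKIP` are only illustrative, and I have not benchmarked this against the precomputed list on multi-million-line files.

```python
import io

import pandas as pd

sample = """(1 1),skip,this
skip,this,too
1,2,3
4,5,6
7,8,9
10,11,12
(1 2),skip,this
skip,this,too
"""

BLOCK = 6  # assumed: 2 lines to skip followed by 4 data rows
SKIP = 2

# The callable returns True for every row index that should be skipped.
df = pd.read_csv(
    io.StringIO(sample),
    skiprows=lambda i: i % BLOCK < SKIP,
    header=None,
)
print(df)
```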

卢涵畅

2023-03-14

My suggestion is to scrub the file beforehand:

with open("file.csv") as rp, open("outfile.csv", 'w') as wp:
    for line in rp:
        if 'skip' not in line:
            wp.write(line)
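Note that the substring test above drops any line containing the word skip, which matches the sample but may not match real header lines. A position-based variant is sketched below, assuming the fixed pattern from the question (2 lines to drop, then 4 data rows); `drop_block_headers` is a hypothetical helper name, and it works on any pair of file-like objects.

```python
import io


def drop_block_headers(src, dst, block=6, skip=2):
    """Copy lines from src to dst, dropping the first `skip` lines of every `block`."""
    for i, line in enumerate(src):
        if i % block >= skip:
            dst.write(line)


# In-memory demonstration using the sample from the question.
sample = io.StringIO(
    "(1 1),skip,this\n"
    "skip,this,too\n"
    "1,2,3\n"
    "4,5,6\n"
    "7,8,9\n"
    "10,11,12\n"
    "(1 2),skip,this\n"
    "skip,this,too\n"
)
out = io.StringIO()
drop_block_headers(sample, out)
print(out.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by the handles from `open("file.csv")` and `open("outfile.csv", "w")` as in the answer above.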

梁研

2023-03-14

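Assuming you want the four-row sections that follow the two rows to skip, just skip two rows and then take a slice of four rows from the csv reader object: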

from itertools import islice, chain
import pandas as pd
import csv


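def parts(r):
    # skip the first two rows; n is the second of them
    _, n = next(r), next(r)
    while n:
        # lazily yield the next four data rows
        yield islice(r, 4)
        # skip the next two rows; n becomes "" once the reader is exhausted
        _, n = next(r, ""), next(r, "")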


with open("test.txt") as f:
    r = csv.reader(f)
    print(pd.DataFrame(list(chain.from_iterable(parts(r)))))

Output:

    0   1   2
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12

Or pass the chain object to pd.DataFrame.from_records:

with open("test.txt") as f:
    r = csv.reader(f)
    print(pd.DataFrame.from_records(chain.from_iterable(parts(r))))

    0   1   2
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12

Or, for a more general approach, use the consume recipe function to skip rows:

from itertools import islice, chain
from collections import deque
import pandas as pd
import csv

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is none, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)


def parts(r, sec_len, skip):
    # skip the leading rows, then yield sec_len rows at a time
    consume(r, skip)
    for sli in iter(lambda: list(islice(r, sec_len)), []):
        yield sli
        consume(r, skip)


with open("test.txt") as f:
    r = csv.reader(f)
    print(pd.DataFrame.from_records(chain.from_iterable(parts(r, 4, 2))))

A last option is to write to a StringIO object and pass that:

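from io import StringIO
from itertools import islice
import pandas as pd

# reuses consume() from the previous snippet
def parts(r, sec_len, skip):
    consume(r, skip)
    for sli in iter(lambda: list(islice(r, sec_len)), []):
        # join the raw lines of one section into a single string
        yield "".join(sli)
        consume(r, skip)


with open("test.txt") as f:
    so = StringIO()
    so.writelines(parts(f, 4, 2))
    so.seek(0)
    print(pd.read_csv(so, header=None))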
