为什么在C ++中拆分字符串要比Python慢？

邹星火

2023-03-14

问题内容：

我正在尝试将一些代码从Python转换为C ，以期提高速度并提高生锈的C

技能。当一个天真的实现从标准输入读取线是在Python比C快得多（见昨天我惊呆了这个）。今天，我终于弄清楚了如何使用合并定界符（与python的split（）相似的语义）在C 中拆分字符串，并且现在遇到了deja
vu！我的C ++代码需要花费更长的时间才能完成工作（尽管昨天的课程并没有增加一个数量级）。

Python代码：

#!/usr/bin/env python
from __future__ import print_function                                            
import time
import sys

count = 0
start_time = time.time()
dummy = None

for line in sys.stdin:
    dummy = line.split()
    count += 1

delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
    lps = int(count/delta_sec)
    print("  Crunch Speed: {0}".format(lps))
else:
    print('')

C ++代码：

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

void split1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find first non-delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next non-delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
}

void split2(vector<string> &tokens, const string &str, char delim=' ') {
    stringstream ss(str); //convert string to stream
    string item;
    while(getline(ss, item, delim)) {
        tokens.push_back(item); //add token to vector
    }
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        split2(spline, input_line);

        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp

请注意，我尝试了两种不同的拆分实现。一个（split1）使用字符串方法搜索令牌，并且能够合并多个令牌以及处理多个令牌（来自此处）。第二个（split2）使用getline作为流读取字符串，不合并定界符，并且仅支持单个分隔符（该字符由多个StackOverflow用户发布，用于回答字符串拆分问题）。

我以不同的顺序多次运行。我的测试机器是Macbook
Pro（2011，8GB，四核），并不是很重要。我正在使用20M行文本文件进行测试，该文件具有三个以空格分隔的列，每个列看起来都与此类似：“ foo.bar
127.0.0.1 home.foo.bar”

结果：

$ /usr/bin/time cat test_lines_double | ./split.py
       15.61 real         0.01 user         0.38 sys
Python: Saw 20000000 lines in 15 seconds.   Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
       23.50 real         0.01 user         0.46 sys
C++   : Saw 20000000 lines in 23 seconds.  Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
       44.69 real         0.02 user         0.62 sys
C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444

我究竟做错了什么？有没有更好的方法可以在C
++中进行字符串拆分，该方法不依赖于外部库（即不增强），支持合并定界符序列（如python的split），是线程安全的（因此没有strtok），并且其性能至少是与python相提并论？

编辑1 /部分解决方案？：

我尝试通过让python重置虚拟列表并每次都将其追加到C 上来使其更公平地进行比较。这仍然不完全是C
代码正在做的事情，但是还差一点。基本上，循环现在是：

for line in sys.stdin:
    dummy = []
    dummy += line.split()
    count += 1

python的性能现在与split1 C ++实现大致相同。

/usr/bin/time cat test_lines_double | ./split5.py
       22.61 real         0.01 user         0.40 sys
Python: Saw 20000000 lines in 22 seconds.   Crunch Speed: 909090

我仍然感到惊讶的是，即使Python已针对字符串处理进行了如此优化（如Matt Joiner所建议的），这些C 实现也不会更快。如果有人想知道如何使用C
以更好的方式执行此操作，请共享您的代码。（我认为我的下一步将尝试在纯C中实现此功能，尽管我不会在程序员的工作效率上进行折衷以重新使用C来实现我的整个项目，所以这只是一个字符串拆分速度的实验。）

感谢大家的帮助。

最终编辑/解决方案：

请参阅Alf接受的答案。由于python严格按引用处理字符串，并且经常复制STL字符串，因此使用香草python实现会提高性能。为了进行比较，我通过Alf的代码编译并运行了数据，这是与其他所有运行在同一台机器上的性能，基本上与朴素的python实现相同（尽管比重置/追加列表的python实现更快，因为如以上编辑所示）：

$ /usr/bin/time cat test_lines_double | ./split6
       15.09 real         0.01 user         0.45 sys
C++   : Saw 20000000 lines in 15 seconds.  Crunch speed: 1333333

我唯一剩下的小麻烦就是在这种情况下要使C ++执行所需的代码量。

从本期和昨天的stdin行阅读版（如上链接）中可以得出的教训之一是，应该始终进行基准测试，而不是对语言的相对“默认”性能进行幼稚的假设。我很感谢教育。

再次感谢大家的建议！

问题答案：

可以猜测，Python字符串是引用计数的不可变字符串，因此在Python代码中不会复制任何字符串，而C
++std::string是可变值类型，并且被复制的机会最小。

如果目标是快速拆分，则可以使用恒定时间的子字符串操作，这意味着仅引用原始字符串的部分，如Python（以及Java和C＃…）一样。

C std::string类具有一个赎回功能：它是 standard
，因此它可以用于在效率不是主要考虑因素的地方安全，方便地传递字符串。但是足够的聊天。代码-
在我的机器上，这当然比Python快，因为Python的字符串处理是在C中实现的，而C是C 的子集（他）：

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

class StringRef
{
private:
    char const*     begin_;
    int             size_;

public:
    int size() const { return size_; }
    char const* begin() const { return begin_; }
    char const* end() const { return begin_ + size_; }

    StringRef( char const* const begin, int const size )
        : begin_( begin )
        , size_( size )
    {}
};

vector<StringRef> split3( string const& str, char delimiter = ' ' )
{
    vector<StringRef>   result;

    enum State { inSpace, inToken };

    State state = inSpace;
    char const*     pTokenBegin = 0;    // Init to satisfy compiler.
    for( auto it = str.begin(); it != str.end(); ++it )
    {
        State const newState = (*it == delimiter? inSpace : inToken);
        if( newState != state )
        {
            switch( newState )
            {
            case inSpace:
                result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
                break;
            case inToken:
                pTokenBegin = &*it;
            }
        }
        state = newState;
    }
    if( state == inToken )
    {
        result.push_back( StringRef( pTokenBegin, &*str.end() - pTokenBegin ) );
    }
    return result;
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        //spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        //split2(spline, input_line);

        vector<StringRef> const v = split3( input_line );
        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x

免责声明：我希望没有任何错误。我没有测试功能，只是检查了速度。但是我认为，即使有一个或两个错误，更正也不会明显影响速度。

为什么在C ++中拆分字符串要比Python慢？

相关阅读

相关文章

相关问答

相关工具

相关文档

为什么在C ++中拆分字符串要比Python慢​​？

相关阅读

相关文章

相关问答

相关工具

相关文档

为什么在C ++中拆分字符串要比Python慢？