Python Faster Than C++, An Edge Case

This article was first published on Medium.

I have read multiple articles on various forums about how Python is slower than C++ (for example, this article). This is true for heavy mathematical computation, but I found an interesting edge case where Python is faster than C++.

The example comes from my work at an automotive company rather than from the well-known algorithms and programs found online (for example, the benchmarks website).

Problem Statement

We receive vehicle data captured by specialized equipment in a very peculiar format.

The file can be read as plain text, but it is divided into several sections, so it cannot be opened directly in Excel or loaded as a dataframe.

A snippet of the file structure is below:

File created on February 23, 2023

[header]
time
var1
var2
...
var40

[channel units]
sec
var1_unit
var2_unit
...
var40_unit

[comments]
comment_1
comment_2
...
comment_n

[module Information]
module_information_1
module_information_2
...
module_information_n

[column names]
time column_name_1 column_name_2 ... column_name_40

[data]
013 223523.290 +2064.03711786 +4921.65282052 000.011 172.20 +0167.65 -0000.03 -0000.00 +0000.00 005 008 +00317 +00001 000.078 0.000000 +2064.03709920 +4921.65284024 000.021 193.96 +0168.70 -0000.03 +9.701501E-01 +1.414533E+00 +3.097366E-02 +5.186038E-02 +4.678240E-01 +1.571705E+00 +9.034662E-01 +4.701178E-04 +1.864912E-03 +6.303535E-04 +2.190931E-04 +3.000000E+00 -3.201444E-03 +5.865387E-03 -3.101673E-02 +4.425000E+01 -5.101483E-03 +5.463656E-03 +9.994213E-01 
013 223523.300 +2064.03711758 +4921.65281963 000.011 172.70 +0167.65 -0000.03 -0000.00 +0000.00 005 008 +00317 +00001 000.074 0.000000 +2064.03709891 +4921.65283936 000.039 140.61 +0168.70 +0000.01 +9.705352E-01 +1.415086E+00 +1.062475E-02 +1.776519E-02 +4.676667E-01 +1.571252E+00 +9.031764E-01 +5.826579E-04 +2.300232E-03 +7.659769E-04 +2.190884E-04 +3.000000E+00 +4.296572E-02 +5.492371E-03 -3.154023E-02 +4.425000E+01 -2.202520E-01 +2.753316E-03 +9.991882E-01 
... (>1 million rows)
013 223523.290 +2064.03711786 +4921.65282052 000.011 172.20 +0167.65 -0000.03 -0000.00 +0000.00 005 008 +00317 +00001 000.078 0.000000 +2064.03709920 +4921.65284024 000.021 193.96 +0168.70 -0000.03 +9.701501E-01 +1.414533E+00 +3.097366E-02 +5.186038E-02 +4.678240E-01 +1.571705E+00 +9.034662E-01 +4.701178E-04 +1.864912E-03 +6.303535E-04 +2.190931E-04 +3.000000E+00 -3.201444E-03 +5.865387E-03 -3.101673E-02 +4.425000E+01 -5.101483E-03 +5.463656E-03 +9.994213E-01 
013 223523.300 +2064.03711758 +4921.65281963 000.011 172.70 +0167.65 -0000.03 -0000.00 +0000.00 005 008 +00317 +00001 000.074 0.000000 +2064.03709891 +4921.65283936 000.039 140.61 +0168.70 +0000.01 +9.705352E-01 +1.415086E+00 +1.062475E-02 +1.776519E-02 +4.676667E-01 +1.571252E+00 +9.031764E-01 +5.826579E-04 +2.300232E-03 +7.659769E-04 +2.190884E-04 +3.000000E+00 +4.296572E-02 +5.492371E-03 -3.154023E-02 +4.425000E+01 -2.202520E-01 +2.753316E-03 +9.991882E-01 

To start any analysis of this data, it is essential to convert this file into a traditional CSV file.

The number of [header] entries matches the columns in the [data] section; therefore, the first row of the CSV file will be the [header] entries.

The data is space delimited; we want it comma delimited. Moreover, every number carries an explicit +/- sign or leading zeros; we want plain floats in the CSV file, with a sign only for negative numbers. For example, 013 becomes 13.0, 000.011 becomes 0.011, -0000.03 becomes -0.03 and, finally, -0000.00 becomes 0.0.
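
The conversion rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the parser used later in this article, and the helper names normalize_token and normalize_row are hypothetical. Python's float() already discards the explicit plus sign and the leading zeros, but it preserves the sign of a negative zero, so the sketch adds 0.0 to turn -0.0 into 0.0.

```python
def normalize_token(token: str) -> float:
    """Convert one space-delimited field to a float (hypothetical helper)."""
    # float() accepts an explicit sign, leading zeros, and exponent notation.
    value = float(token)
    # float("-0000.00") is -0.0; adding 0.0 yields a positive zero.
    return value + 0.0


def normalize_row(line: str) -> list[float]:
    """Split one [data] row on whitespace and convert every field."""
    # str.split() without arguments also drops the trailing space and newline.
    return [normalize_token(token) for token in line.split()]


row = normalize_row("013 000.011 -0000.03 -0000.00 +9.701501E-01 \r\n")
print(",".join(str(value) for value in row))
```

Without the added 0.0, the fourth field would be written as -0.0, which is what a plain float() conversion produces.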

C++ Code

Using my knowledge of C++ to the best of my abilities, I wrote the following code to parse the data into CSV. The code is straightforward; it does the following things in order (as seen from the main function):

  • Obtains some key elements from the text file path
  • Reads the text file into its various contents and saves them in a struct
  • Chooses the header and data fields of the struct to convert into std::vector
  • Writes the data to CSV

The code was compiled using g++ parse_txt.cpp -std=c++17 -O3 -Wall -o parse_txt. The GCC compiler version used was 11.2.

#include <algorithm>
#include <chrono>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using std::filesystem::path;

struct TxtFileContents {
    std::string header;
    std::string channel_units;
    std::string comments;
    std::string module_information;
    std::string column_names;
    std::string data;
};

/**
 * Function gets a particular section from the text file based on the
 * start and end section identifiers
 */
std::stringstream get_particular_section_from_file(
    std::string &section_strt_identifier, std::string &section_end_identifier,
    std::fstream &text_file_contents) {
    std::string str;
    std::stringstream return_str("", std::ios::app | std::ios::out);
    std::string sample_str = "";
    bool start = false;
    while (getline(text_file_contents, str)) {
        if (str == section_strt_identifier) {
            start = true;
            continue;
        }
        if (str == section_end_identifier && section_end_identifier != "") {
            break;
        }

        if (start) {
            return_str << str << "\n";
        }
    }

    return return_str;
}

/**
 * Read text file to obtain the various sections from the file so that
 * they can be used to create a CSV file for the data
 */
TxtFileContents read_text_file(std::string &filename) {
    std::vector<std::pair<std::string, std::string>> milestones_vec{
        std::make_pair("[header]", "[channel units]"),
        std::make_pair("[channel units]", "[comments]"),
        std::make_pair("[comments]", "[module Information]"),
        std::make_pair("[module Information]", "[column names]"),
        std::make_pair("[column names]", "[data]"),
        std::make_pair("[data]", ""),
    };

    std::fstream text_file_contents{filename};
    unsigned int i = 0;
    TxtFileContents output;
    for (auto a_pair : milestones_vec) {
        auto file_section = get_particular_section_from_file(
            a_pair.first, a_pair.second, text_file_contents);
        text_file_contents.seekg(0, text_file_contents.beg);

        if (i == 0) {
            output.header = file_section.str();
        } else if (i == 1) {
            output.channel_units = file_section.str();
        } else if (i == 2) {
            output.comments = file_section.str();
        } else if (i == 3) {
            output.module_information = file_section.str();
        } else if (i == 4) {
            output.column_names = file_section.str();
        } else if (i == 5) {
            output.data = file_section.str();
        }
        ++i;
    }

    text_file_contents.close();
    return output;
}

/**
 * Function separates the string to a vector of multiple strings based on the
 * the delimiter that is provided
 */
std::vector<std::string> separate_string_to_vec(std::string &str_to_convert,
                                                char delimiter) {
    std::vector<std::string> output;
    std::istringstream str_stream{str_to_convert};

    std::string str;
    while (std::getline(str_stream, str, delimiter)) {
        output.push_back(str);
    }

    return output;
}

/**
 * Function separates the string of data into vector of vectors of doubles
 * so that the precision is maintained
 */
std::vector<std::vector<double>> separate_data_to_vec(std::string &data_str) {
    std::vector<std::vector<double>> output;
    std::istringstream str_stream{data_str};

    std::string outer_str;
    while (std::getline(str_stream, outer_str)) {
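        // Remove the trailing \r, if present, from each line. The check also
        // avoids calling pop_back() on an empty string, which is undefined
        if (!outer_str.empty() && outer_str.back() == '\r') {
            outer_str.pop_back();
        }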

        std::istringstream inner_str_stream{outer_str};
        std::string inner_str;
        std::vector<double> inner_output;

        while (std::getline(inner_str_stream, inner_str, ' ')) {
            if (inner_str != "") {
                inner_output.push_back(std::stod(inner_str));
            }
        }

        output.push_back(inner_output);
    }

    return output;
}

/**
 * Function writes data to CSV with the headers and the data
 */
void write_data_to_csv(const std::vector<std::string>& headers,
                       const std::vector<std::vector<double>>& data,
                       const std::string& pathname_filename_no_ext) {
    std::string output_string = "";
    for (size_t i = 0; i < headers.size(); i++) {
        output_string += headers[i];

        if (i < headers.size() - 1) {
            output_string += ",";
        } else {
            output_string += "\n";
        }
    }

    for (const auto& row : data) {
        for (size_t j = 0; j < row.size(); j++) {
            // Note: std::to_string formats doubles like "%f", i.e. with
            // six digits after the decimal point
            std::string num_str = std::to_string(row[j]);
            output_string += num_str;

            if (j < row.size() - 1) {
                output_string += ",";
            } else {
                output_string += "\n";
            }
        }
    }

    std::string path_to_write = pathname_filename_no_ext + ".csv";
    std::ofstream file(path_to_write);
    file << output_string;
    file.close();
}

/**
 * Main function of the parser executable which will parse special text files into
 * CSV files and provide additional modifications to it.
 */
int main(int argc, char *argv[]) {
    using std::cout;

    if (argc > 2) {
        cout << "Too many arguments. Only one text file expected." << '\n';
        return 1;
    } else if (argc == 1) {
        cout << "Too few arguments. At least one text file expected." << '\n';
        return 1;
    }

    // Define pathname for the text file
    path filename = path(argv[1]);
    std::string filename_str = filename.string();
    // Build the output path with operator/ so that an input without a
    // directory part does not produce a path rooted at "/"
    std::string pathname_filename_noext_str =
        (filename.parent_path() / filename.stem()).string();

    // Start timer
    auto start = std::chrono::high_resolution_clock::now();

    // Get the text contents of the file
    TxtFileContents text_file_contents = read_text_file(filename_str);
    
    // Get the headers of the file which also have the units. Remove the last
    // because that is just an empty line
    auto headers = separate_string_to_vec(text_file_contents.header, '\r');
    headers.pop_back();

    // Obtain data into vector of vectors
    auto data = separate_data_to_vec(text_file_contents.data);

    // Write data to csv
    write_data_to_csv(headers, data, pathname_filename_noext_str);

    // End timer and print execution time
    auto end = std::chrono::high_resolution_clock::now();
    auto duration =
        std::chrono::duration_cast<std::chrono::seconds>(end - start);

    cout << "Parsed text file successfully. Took " << duration.count()
         << " seconds.\n";

    return 0;
}

Python Code

Similarly, Python code was written using ‘pythonic’ ways while still being as similar as possible to the C++ implementation above. No external libraries were used to be consistent with the C++ implementation, which uses only the standard library.

The Python code was run with Python 3.10.4 using the command python parse_txt.py.

import io
import csv
import time
from pathlib import Path
import itertools
from dataclasses import dataclass, field

@dataclass
class TxtFileContents:
    header: str = field(default="")
    channel_units: str = field(default="")
    comments: str = field(default="")
    module_information: str = field(default="")
    column_names: str = field(default="")
    data: str = field(default="")


def get_particular_section_from_file(
    section_strt_identifer: str,
    section_end_identifier: str, 
    file_contents: io.StringIO
) -> str:
    """
    Function reads the file contents and parses only the relevant sections from
    the file

    Parameters
    ----------
    section_strt_identifier: The section from which the file will be obtained
    section_end_identifier: The section to which the file will be obtained
    file_contents: The filecontents of the file

    Returns
    --------
    The return string with the particular section
    """
    start: bool = False
    return_str = ""
    while True:
        line = file_contents.readline()

        if not line:
            break

        if line == section_strt_identifer and section_end_identifier == "":
            return_str = file_contents.read(-1)

        if line == section_strt_identifer:
            start = True
            continue

        if line == section_end_identifier and section_end_identifier != "":
            break

        if start:
            return_str += line
            # return_str += "\n"

    return return_str


def read_text_file(filename: Path) -> TxtFileContents:
    """
    Function reads text file and parses various sections from it
    
    Parameters
    -----------
    filename: The filename Path of the text file
    """
    milestones = [
        ("[header]\n", "[channel units]\n"),
        ("[channel units]\n", "[comments]\n"),
        ("[comments]\n", "[module Information]\n"),
        ("[module Information]\n", "[column names]\n"),
        ("[column names]\n", "[data]\n"),
        ("[data]\n", ""),
    ]

    with open(filename, 'r') as file:
        file_contents_from_file = file.read()

    text_file_contents = TxtFileContents()
    for i, milestone in enumerate(milestones):
        file_contents = io.StringIO(file_contents_from_file)
        section = get_particular_section_from_file(
            milestone[0],
            milestone[1],
            file_contents
        )
        if i == 0:
            text_file_contents.header = section
        elif i == 1:
            text_file_contents.channel_units = section
        elif i == 2:
            text_file_contents.comments = section
        elif i == 3:
            text_file_contents.module_information = section
        elif i == 4:
            text_file_contents.column_names = section
        elif i == 5:
            text_file_contents.data = section

    return text_file_contents


def separate_string_to_vec(str_to_convert: str) -> list[str]:
    """
    Function separates string to a list

    Parameters
    ----------
    str_to_convert: The str to separate

    Returns
    --------
    List of individual items
    """
    string_io = io.StringIO(str_to_convert)
    output_vec = []
    while True:
        line = string_io.readline()

        if not line:
            break

        line = line[:-1]
        output_vec.append(line)
    
    return output_vec


def separate_data_to_vec(str_to_convert: str):
    """
    Function separates string to a vector of vector of doubles

    Parameters
    ----------
    str_to_convert: The str to separate

    Returns
    --------
    The data as a vector of vectors
    """
    string_io = io.StringIO(str_to_convert)
    output_vec = []
    while True:
        line = string_io.readline()

        if not line:
            break

        split_line = line.split(" ")
        split_line.pop()
        data_nums = list(map(lambda x: float(x), split_line))
        output_vec.append(data_nums)

    return output_vec


def write_to_csv(headers: list[str], data: list[list[float]], pathname_filename_no_ext: str):
    """
    Function writes the data to a CSV file using the builtin csv writer

    Parameters
    -----------
    headers: The headers of the csv file
    data: The data which will get added to the csv file
    """
    with open(pathname_filename_no_ext + '.csv', 'w', newline='') as file:
        data_writer = csv.writer(file)
        data_writer.writerow(headers)
        data_writer.writerows(data)


def main():
    """
    Function is the main driver for parsing text files
    """
    filename = Path("./path/to/txt/file")
    filename_no_ext = filename.stem
    pathname = filename.parent
    pathname_filename_no_ext = str(pathname) + '/' + str(filename_no_ext)

    # Start timer
    time_s = time.perf_counter()

    # Get the text contents of the file
    text_file_contents = read_text_file(filename)
    
    # Get the headers which will have the channel units
    headers = separate_string_to_vec(text_file_contents.header)

    # Get data into vector of vectors
    data = separate_data_to_vec(text_file_contents.data)

    # Write to CSV
    write_to_csv(headers, data, pathname_filename_no_ext)

    time_e = time.perf_counter()
    print(f"Parsed text file successfully. Took {time_e - time_s} seconds")


if __name__ == '__main__':
    main()
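
The read_text_file function above rescans the file contents once per section. As a hedged alternative, the sections could be collected in a single pass by treating every known section marker as a switch. This is only a sketch: split_sections and SECTION_NAMES are hypothetical names, the approach was not part of the measurements in this article, and it assumes the marker lines appear exactly as in the file snippet shown earlier. Unlike the code above, it also drops blank lines.

```python
import io

# Section markers as they appear in the file snippet.
SECTION_NAMES = (
    "[header]",
    "[channel units]",
    "[comments]",
    "[module Information]",
    "[column names]",
    "[data]",
)


def split_sections(lines) -> dict[str, list[str]]:
    """Group lines by the most recent section marker, in one pass."""
    sections: dict[str, list[str]] = {name: [] for name in SECTION_NAMES}
    current = None
    for raw_line in lines:
        # Strip the line ending only; trailing spaces in data rows are kept.
        line = raw_line.rstrip("\r\n")
        if line in sections:
            current = line
        elif current is not None and line:
            sections[current].append(line)
    return sections


sample = io.StringIO(
    "File created on February 23, 2023\n"
    "\n"
    "[header]\n"
    "time\n"
    "var1\n"
    "\n"
    "[channel units]\n"
    "sec\n"
    "var1_unit\n"
    "\n"
    "[comments]\n"
    "[module Information]\n"
    "[column names]\n"
    "time column_name_1\n"
    "\n"
    "[data]\n"
    "013 223523.290 \n"
    "013 223523.300 \n"
)
sections = split_sections(sample)
print(sections["[header]"])
print(len(sections["[data]"]))
```

Because the function only iterates over lines, an open file object can be passed in place of the io.StringIO sample.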

Performance Differences — Windows 10

The C++ code and the Python code were run on the same file. The operating system used was Windows 10. A breakdown of the time spent in each part of the code is also provided.

C++ average execution time — Windows = 44 seconds

Python average execution time — Windows = 33 seconds

Part of Code                    | C++ Time [sec] | Python Time [sec]
--------------------------------|----------------|------------------
Parsing into struct             | 0.0            | 5.5
Getting headers as vector       | 0.0            | 0.0
Getting data as vector          | 23.0           | 10.7
Writing headers and data as CSV | 21.0           | 16.5
Total                           | 44.0           | 32.7
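
A per-stage breakdown like this can be collected by wrapping each stage in a timer. The sketch below shows one way to do it in Python with time.perf_counter. The timed helper, the stage_times dictionary and the stand-in workloads are illustrative assumptions, not the instrumentation actually used for the numbers in this article.

```python
import time
from contextlib import contextmanager

# Maps a stage label to the seconds spent in it.
stage_times: dict[str, float] = {}


@contextmanager
def timed(label: str):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_times[label] = time.perf_counter() - start


with timed("Getting data as vector"):
    # Stand-in workload; a real run would call the parsing step here.
    data = [[float(tok) for tok in "013 223523.290 +2064.03711786".split()]
            for _ in range(1000)]

with timed("Writing headers and data as CSV"):
    # Stand-in workload; a real run would write the CSV file here.
    text = "\n".join(",".join(map(str, row)) for row in data)

for label, seconds in stage_times.items():
    print(f"{label}: {seconds:.3f} s")
```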

Python runs about 1.3 times faster than C++. Even if we ignore the IO-dependent “writing to CSV” portion, converting the strings to doubles and placing them in a std::vector is the most noteworthy bottleneck: 23.0 seconds in C++ against 10.7 seconds in Python. This was not expected; perhaps it is Windows 10 that is causing the slowness.

Performance Differences — WSL

Therefore, I ran the same code on Debian under the Windows Subsystem for Linux (WSL). That is the closest I can get to a Linux operating system because we, like employees at most companies, are issued Windows laptops.

C++ average execution time — WSL = 13 seconds

Python average execution time — WSL = 31 seconds

Part of Code                    | C++ Time [sec] | Python Time [sec]
--------------------------------|----------------|------------------
Parsing into struct             | 1.0            | 9.0
Getting headers as vector       | 0.0            | 0.0
Getting data as vector          | 4.0            | 9.7
Writing headers and data as CSV | 8.0            | 11.9
Total                           | 13.0           | 30.6

In WSL, Python runs about 2.3 times slower than C++. The trend now matches conventional wisdom. The Python times stay approximately the same as on Windows, but the C++ execution times drop drastically.

Rust-nihilation:

There is a proverbial new kid on the block: the Rust programming language. I will not show the code for this one, to avoid belaboring the point. However, I can tell you that the Rust code ran in only 3 seconds. Yes, all four steps were completed in 3 seconds, irrespective of the operating system, without using any external packages (or crates) and with optimizations enabled, i.e., cargo run --release.

Rust is indeed a cross-platform programming language with a fantastic package manager and an excellent standard library. The code was as simple as the Python variant shown above and provided the same flexibility we get from interpreted languages.

Discussion / Conclusion

If we disregard the slowness of C++ on Windows for now, it is interesting to see that Rust beats C++ in the Linux environment by a wide margin. It turns out that C++’s std::stod and getline are very slow, the former more than the latter. If I replace std::stod with std::from_chars, I get a twofold improvement in C++ performance. Note that std::from_chars does not accept a leading plus sign, so the + prefixes in this data have to be skipped before the call.

Another source of slowness was the copying of std::string and std::vector objects in my implementation. The unnecessary copies can be eliminated by using std::move. In Rust, by contrast, the strict ownership rules mean that move semantics are handled automatically, without the programmer’s intervention. These suggestions come from comments posted on a question I recently asked on Stack Overflow. Hence, Rust is my go-to when I need that performance boost on new projects.

To conclude, 10–12 seconds is not much for the case I presented. But then I ask, what is an acceptable threshold? And why is that the acceptable threshold?