std::string is not a Container for Raw Data

Sometimes we need unformatted data, simple byte sequences. At first glance, std::string might be a fitting data structure for that, but it is not.

Think about data we get from a network, a CAN bus, or another process: serialized binary data that has to be interpreted before it can be used in our business logic. The natural way to manage this kind of data is a sequence container like std::vector or std::array of std::byte or, lacking C++17 support, unsigned char. Sometimes we also see uint8_t, which on many platforms is a typedef for unsigned char.

However, there is another contiguous container for 8-bit values that looks tempting as a means to transport byte sequences: std::string. I am not sure why anyone would do this, apart from std::string being slightly less to type than std::vector<unsigned char>, which is to say I cannot see a good reason at all. On the contrary, it is a bad idea for several reasons.

‘\0’ delimiters

Many string operations rely on zero-terminated character sequences, meaning there is exactly one null character and it sits at the end. Plain byte sequences, on the other hand, can contain any number of null bytes anywhere. While std::string can store sequences with embedded null characters, we have to be very careful not to use functions that take a plain const char* without a length, because those stop at the first null character and silently truncate the data.
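A minimal sketch of that truncation, assuming a made-up four-byte payload. The helper names make_payload, size_after_c_string_round_trip and check_truncation are hypothetical and only serve the illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical payload: four raw bytes with a null byte at index 2.
inline std::string make_payload() {
    // The (pointer, length) constructor keeps embedded null bytes.
    return std::string("\x01\x02\x00\x03", 4);
}

// Size of the data after a round trip through a bare const char*.
inline std::size_t size_after_c_string_round_trip(const std::string& data) {
    const std::string copy(data.c_str());  // copying stops at the first null byte
    return copy.size();
}

inline void check_truncation() {
    const std::string payload = make_payload();
    assert(payload.size() == 4);                // std::string knows the real length
    assert(std::strlen(payload.c_str()) == 2);  // a C-style API only sees two bytes
    assert(size_after_c_string_round_trip(payload) == 2);
}
```

Nothing in this code produces a warning. The last two bytes simply disappear as soon as the data passes through an interface that expects text.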

Semantics

The major reason not to use std::string is semantics: When we see that type in our code, we naturally expect a series of readable characters. We expect some text. When it is misused as a series of raw bytes, it is confusing to maintainers of our codebase. It gets even worse if we expose the use of std::string as a raw data container via an API that has to be used by someone else.

Especially in locations where we convert text to serialized raw data or vice versa, it will be very confusing to determine which std::string is text and which is raw data.

Type safety

Apart from confusing the developer, having the same type for two unrelated uses is error-prone, as it bypasses the safety mechanisms that the strong typing of C++ gives us. Imagine, for example, a function that takes some text and some serialized raw data: both parameters would be std::string, and the arguments could easily switch places by accident.
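To make the mix-up concrete, here is a small sketch. The functions label_of_loose and label_of_strict and the alias Bytes are invented for this example and stand in for any API that takes a piece of text plus a raw payload:

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical API where text and raw data share one type.
inline std::string label_of_loose(const std::string& label, const std::string& payload) {
    (void)payload;  // the payload is irrelevant for this illustration
    return label;
}

// The same hypothetical API with a distinct type for the raw data.
using Bytes = std::vector<unsigned char>;

inline std::string label_of_strict(const std::string& label, const Bytes& payload) {
    (void)payload;
    return label;
}

inline void check_mix_up() {
    const std::string text = "sensor-7";
    const std::string raw("\x10\x00\x20", 3);
    // Arguments swapped by accident: this compiles, and raw bytes end up as the label.
    assert(label_of_loose(raw, text) == raw);

    const Bytes raw_bytes{0x10, 0x00, 0x20};
    assert(label_of_strict(text, raw_bytes) == text);
}

// Swapping the arguments of the strict version is rejected at compile time,
// because neither type converts implicitly to the other.
static_assert(!std::is_invocable_v<decltype(&label_of_strict), Bytes, std::string>);
```

With two std::string parameters only a careful review or a test catches the swap. With distinct types the compiler does.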

Conclusion

Instead of std::string, use std::vector<std::byte> or std::vector<unsigned char>. While this already nicely says “sequence of bytes”, consider using a typedef. For even stronger typing, use a wrapper structure with a meaningful name.
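A rough sketch of both options, assuming C++17 for std::byte. The names ByteBuffer, SerializedMessage, make_message and count_null_bytes are placeholders, not a recommendation for actual naming:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Option 1: an alias. It documents intent but does not create a new type.
using ByteBuffer = std::vector<std::byte>;
static_assert(std::is_same_v<ByteBuffer, std::vector<std::byte>>);

// Option 2: a wrapper struct. This is a distinct type the compiler can check.
struct SerializedMessage {
    std::vector<std::byte> bytes;
};
static_assert(!std::is_same_v<SerializedMessage, ByteBuffer>);

// Hypothetical three-byte message containing one null byte.
inline SerializedMessage make_message() {
    return SerializedMessage{{std::byte{0x01}, std::byte{0x00}, std::byte{0xFF}}};
}

// Null bytes are ordinary values here and need no special treatment.
inline std::size_t count_null_bytes(const SerializedMessage& message) {
    std::size_t count = 0;
    for (const std::byte b : message.bytes) {
        if (b == std::byte{0}) {
            ++count;
        }
    }
    return count;
}

inline void check_message() {
    const SerializedMessage message = make_message();
    assert(message.bytes.size() == 3);
    assert(count_null_bytes(message) == 1);
}
```

The alias is the cheaper step and already improves readability. The wrapper costs a few more lines, but a function that takes SerializedMessage can no longer be handed an arbitrary byte vector or a string by accident.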

8 Comments


  1. It partially fixes the issue of a very lacking standard library. With basic_string you get a lot of features that are missing or delayed for vectors.

    +=, << and >> operators
    Potentially useful members like substr, find, replace, starts_with.
    string_view
    short “array” optimization.

    Obviously you wouldn’t go and replace every vector with basic_string. A lot of mentioned features you probably wouldn’t use in production.

    But the point is – it would make sense for basic_string to be just a vector with \0 at the end, but instead it was given a lot more fancy features that “could” make sense for vector.

    And also vector…

  2. Maybe the std library should provide a bytearray class. Similar to QByteArray.

    1. I’m not too familiar with Qt yet and would have to look at how QByteArray differs from a std::vector<std::byte>.

  3. Google’s Protocol Buffers uses exactly that and I’ve never liked it for all the confusion it causes.

    1. The subtitle is a good hint that that lightning talk is meant to be humorous and not as advice to actually do such things 😉

      1. I mean, it fixes the issue that is vector…
