std::string is not a Container for Raw Data

Arne Mertz, November 28, 2018

Sometimes we need unformatted data, simple byte sequences. At first glance, std::string might be a fitting data structure for that, but it is not.

Think about data we get from a network, a CAN bus, or another process: serialized binary data that has to be interpreted before it can be used in our business logic. The natural way to manage this kind of data is a sequence container like std::vector or std::array of std::byte or, lacking C++17 support, unsigned char. Sometimes we also see uint8_t, which on many platforms is a typedef for unsigned char.
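To make this concrete, here is a minimal sketch of interpreting such a buffer, assuming C++17. The names read_u16_be and example_payload are hypothetical and exist only for this illustration, and the payload bytes are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: reads a big-endian 16-bit value starting at `offset`.
std::uint16_t read_u16_be(const std::vector<std::byte>& buffer, std::size_t offset) {
    assert(offset + 1 < buffer.size());  // both bytes must be inside the buffer
    const auto high = std::to_integer<std::uint16_t>(buffer[offset]);
    const auto low  = std::to_integer<std::uint16_t>(buffer[offset + 1]);
    return static_cast<std::uint16_t>((high << 8) | low);
}

// Made-up payload, standing in for bytes received from a socket or a CAN frame.
std::vector<std::byte> example_payload() {
    return {std::byte{0x12}, std::byte{0x34}, std::byte{0x00}, std::byte{0xFF}};
}
```

Because std::byte has no arithmetic operators, every interpretation step has to go through an explicit conversion such as std::to_integer, which makes the boundary between raw data and interpreted values visible in the code.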

However, there is another contiguous container for 8-bit values that is tempting to use for transporting byte sequences: std::string. I am not sure about the reasons to do this apart from std::string being slightly less to type than std::vector<unsigned char>, meaning that I cannot see any good reason at all. On the contrary, it is a bad idea for several reasons.

‘\0’ delimiters

Many string operations rely on zero-terminated character sequences. That means that there is exactly one null character, and it is at the end. Plain byte sequences, on the other hand, can contain an arbitrary number of null bytes anywhere. While std::string can store sequences with null characters, we have to be very careful not to use functions that take a const char* without a length, because those would truncate at the first null character.
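The truncation is easy to reproduce. The following sketch uses made-up bytes and hypothetical helper names (make_raw, through_c_string, c_length) to show how a round trip through const char* silently loses data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Five raw bytes with null bytes in the middle.
std::string make_raw() {
    const char bytes[] = {'\x01', '\0', '\x02', '\0', '\x03'};
    return std::string(bytes, sizeof bytes);  // length-aware constructor keeps all 5 bytes
}

// Round trip through a const char* interface, as many C-style APIs do.
std::string through_c_string(const std::string& s) {
    return std::string(s.c_str());  // this constructor stops at the first '\0'
}

// Length as seen by any function that relies on the terminator.
std::size_t c_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

Here make_raw() has a size of 5, while through_c_string(make_raw()) has a size of 1: everything after the first null byte is gone, and nothing in the type system warned us.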

Semantics

The major reason not to use std::string is semantics: When we see that type in our code, we naturally expect a series of readable characters. We expect some text. When it is misused as a series of raw bytes, it is confusing to maintainers of our codebase. It gets even worse if we expose the use of std::string as a raw data container via an API that has to be used by someone else.

Especially in locations where we convert text to serialized raw data or vice versa, it will be very confusing to determine which std::string is text and which is raw data.

Type safety

Apart from confusing the developer, having the same type for two nontrivial uses can be error-prone, as it neglects the safety mechanisms the strong typing of C++ gives us. Imagine, for example, a function that takes some text and some serialized raw data: both parameters would be std::string, and the arguments could easily switch places by accident.
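A small sketch of that situation, assuming C++17. The names Bytes, Message, make_message, and the commented-out send are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <vector>

// Error-prone variant: both parameters share a type, so swapped arguments still compile.
// void send(const std::string& label, const std::string& payload);

using Bytes = std::vector<unsigned char>;

struct Message {
    std::string label;  // human-readable text
    Bytes payload;      // serialized raw data
};

// Safer variant: text and raw data have different types.
Message make_message(const std::string& label, const Bytes& payload) {
    return Message{label, payload};
}

// Swapping the arguments is now rejected at compile time.
static_assert(!std::is_invocable_v<decltype(&make_message), Bytes, std::string>);

void message_usage_example() {
    const Message m = make_message("sensor-7", Bytes{0x00, 0xFF, 0x10});
    assert(m.payload.size() == 3);
}
```

The static_assert documents the point: there is no implicit conversion between std::string and the byte vector in either direction, so a mix-up becomes a compiler error instead of a runtime surprise.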

Conclusion

Instead of std::string, use std::vector<std::byte> or std::vector<unsigned char>. While this already nicely says “sequence of bytes”, consider using a typedef. For even stronger typing, use a wrapper structure with a meaningful name.
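Both suggestions can be sketched in a few lines, assuming C++17. The names ByteBuffer, SerializedFrame, and frame_size are hypothetical; pick names that match your domain.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Option 1: an alias documents intent, but it is still the very same type as the vector.
using ByteBuffer = std::vector<std::byte>;
static_assert(std::is_same_v<ByteBuffer, std::vector<std::byte>>);

// Option 2: a wrapper is a distinct type with a meaningful name.
struct SerializedFrame {
    explicit SerializedFrame(ByteBuffer b) : bytes(std::move(b)) {}
    ByteBuffer bytes;
};

// The explicit constructor prevents a plain buffer from silently becoming a frame.
static_assert(!std::is_convertible_v<ByteBuffer, SerializedFrame>);

std::size_t frame_size(const SerializedFrame& frame) {
    return frame.bytes.size();
}

void frame_usage_example() {
    const SerializedFrame frame{ByteBuffer{std::byte{0x01}, std::byte{0x00}}};
    assert(frame_size(frame) == 2);
}
```

The alias costs nothing and improves readability. The wrapper costs a few lines more, but functions taking a SerializedFrame can no longer be called with an arbitrary byte vector by accident.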

13 Comments

Erich Frobisch
4 years ago

I see an advantage of std::string over vector when direct access to the buffer is needed:

std::string provides a data() member function that may also be called on empty strings. &v[0] cannot be used on an empty vector v; you need to check for that first.

As of C++17, string::data() also explicitly allows you to modify the buffer.

Jeff Bronte
5 years ago

Agreed, std::string is NOT for binary data. However, there is an unfortunate exception when your C++ binds to another language, like Python. Binders like Swig and Boost and Tornado (sometimes) only recognize std::string, which is thus the only way (at least in Python) to pass binary data to and from C++. Ugly; perhaps there is a workaround or a better binding?

viral
6 years ago

I have used std::basic_string<bool> in the past (aliased to something like BoolStr). It’s most likely a bad idea.
Everyone likes a bad idea when it works.

Martin
6 years ago

Tell that to the guys at Google’s protocol buffers 😉

Grzegorz
7 years ago

It partially fixes the issue of a very lacking standard library. With basic_string you get a lot of features that are missing or delayed for vectors.

+=, << and >> operators
Potentially useful members like substr, find, replace, startswith.
string_view
short “array” optimization.

Obviously you wouldn’t go and replace every vector with basic_string. A lot of mentioned features you probably wouldn’t use in production.

But the point is: it would make sense for basic_string to be just a vector with a ‘\0’ at the end, but instead it was given a lot more fancy features that “could” make sense for vector.

And also vector…

Paffy
7 years ago

Maybe the std library should provide a bytearray class. Similar to QByteArray.

1. Arne Mertz
  7 years ago
  
  I’m not too familiar with Qt yet and would have to look at how QByteArray differs from a std::vector<std::byte>.
  
CB
7 years ago

Google’s Protocol Buffers uses exactly that and I’ve never liked it for all the confusion it causes.

Sven
7 years ago

Different opinion: https://youtu.be/SDJImePyftY

1. Arne Mertz
  7 years ago
  
  The subtitle is a good hint that that lightning talk is meant to be humorous and not as advice to actually do such things 😉
  
  1. Connor Hormna
    7 years ago
    
    I mean, it fixes the issue that is vector…
    
    1. Arne Mertz
      7 years ago
      
      What issue is that?
      
      1. alfC
        6 years ago
        
        Connor probably meant “the issue that is vector of bool”.
        
        A fair debate would be between vector of byte and basic_string of byte.