How not to read a string from a UTF-8 stream

09/06/2021

Gérald Barré

.NET

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

String comparisons are harder than it seems
.NET Regex: \d is different from [0-9]
How to correctly count the number of characters of a string
Correctly converting a character to lower/upper case
How not to read a string from a UTF-8 stream (this post)
Regex with IgnoreCase option may match more characters than expected
How to remove diacritics from a string in .NET

This post is the result of a code review. The code below is a simplified version of the original, making the bug easier to spot.

The goal is to read a UTF-8 encoded string from a stream. In the actual context, the stream is a named pipe, and there are additional operations performed on the stream.

string ReadString(Stream stream)
{
    var sb = new StringBuilder();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        var s = Encoding.UTF8.GetString(buffer, 0, readCount);
        sb.Append(s);
    }

    return sb.ToString();
}

The problem is that the returned string may differ from the original encoded string. For instance, a smiley emoji can be decoded as 4 replacement characters:

Encoded string: 😊
Decoded string: ????

UTF-8 uses 1 to 4 bytes to encode a Unicode code point, but Stream.Read can return anywhere from 1 to buffer.Length bytes per call (and 0 at the end of the stream). As a result, a chunk may end in the middle of a multi-byte UTF-8 sequence. Encoding.UTF8.GetString is stateless: it cannot wait for the missing bytes, so it decodes both the partial sequence at the end of one chunk and the leftover bytes at the start of the next as replacement characters (U+FFFD). The following code demonstrates this behavior:

var bytes = Encoding.UTF8.GetBytes("😊");
// bytes = new byte[4] { 240, 159, 152, 138 }

var sb = new StringBuilder();
// Simulate reading the stream byte by byte
for (var i = 0; i < bytes.Length; i++)
{
    sb.Append(Encoding.UTF8.GetString(bytes, i, 1));
}

Console.WriteLine(sb.ToString());
// "????" instead of "😊"

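// Re-encoding the result shows U+FFFD (239, 191, 189) repeated four times
var reencoded = Encoding.UTF8.GetBytes(sb.ToString());
// reencoded = new byte[12] { 239, 191, 189, 239, 191, 189, 239, 191, 189, 239, 191, 189 }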

#How to fix the code?

There are multiple ways to fix this. One approach is to buffer all the data first, then decode it in one pass. This is simple, but the whole payload is held in memory as bytes before the string is built:

string ReadString(Stream stream)
{
    using var ms = new MemoryStream();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        ms.Write(buffer, 0, readCount);
    }

    return Encoding.UTF8.GetString(ms.ToArray());
}

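Alternatively, you can wrap the stream in a StreamReader with the correct encoding. It keeps decoder state between reads, so sequences split across reads are decoded correctly. Note that disposing the StreamReader also closes the underlying stream unless it is created with leaveOpen: true: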

string ReadString(Stream stream)
{
    using var sr = new StreamReader(stream, Encoding.UTF8);
    return sr.ReadToEnd();
}

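You can also use the System.Text.Decoder class, obtained from Encoding.UTF8.GetDecoder(). Unlike Encoding.UTF8.GetString, a decoder keeps incomplete trailing bytes between calls, so it decodes characters correctly across buffer boundaries. If performance is a concern, consider PipeReader or Rune, which let you work on the received bytes directly and can reduce intermediate allocations.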
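The Decoder approach can be sketched as follows. This is a minimal illustration, not the code from the original review: the class name Utf8StreamDecoding and the bufferSize parameter are assumptions added here so that the chunk size can be made small enough to force split sequences.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8StreamDecoding
{
    // Hypothetical helper: decodes a UTF-8 stream chunk by chunk.
    // bufferSize is an illustrative parameter, not part of the original code.
    public static string ReadString(Stream stream, int bufferSize = 4096)
    {
        var sb = new StringBuilder();
        // The decoder is stateful: it remembers an incomplete trailing sequence
        var decoder = Encoding.UTF8.GetDecoder();
        var bytes = new byte[bufferSize];
        // Worst-case number of chars produced when decoding one full byte buffer
        var chars = new char[Encoding.UTF8.GetMaxCharCount(bytes.Length)];

        int readCount;
        while ((readCount = stream.Read(bytes, 0, bytes.Length)) > 0)
        {
            // flush: false keeps any partial sequence for the next call
            int charCount = decoder.GetChars(bytes, 0, readCount, chars, 0, flush: false);
            sb.Append(chars, 0, charCount);
        }

        // End of stream: flush so a dangling partial sequence is emitted as U+FFFD
        int remaining = decoder.GetChars(bytes, 0, 0, chars, 0, flush: true);
        sb.Append(chars, 0, remaining);

        return sb.ToString();
    }
}
```

Calling this helper with bufferSize: 1 on a MemoryStream that contains the UTF-8 bytes of "😊" makes every read split the 4-byte sequence, which is the scenario that breaks the original code. Because the decoder carries the pending bytes from one call to the next, the emoji should come back intact.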
