This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!
This post is the result of a code review. The code below is a simplified version of the original, making the bug easier to spot.
The goal is to read a UTF-8 encoded string from a stream. In the actual context, the stream is a named pipe, and there are additional operations performed on the stream.
C#
string ReadString(Stream stream)
{
    var sb = new StringBuilder();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        var s = Encoding.UTF8.GetString(buffer, 0, readCount);
        sb.Append(s);
    }
    return sb.ToString();
}
The problem is that the returned string may differ from the original encoded string. For instance, a smiley emoji can be decoded as 4 replacement characters:
Encoded string: 😊
Decoded string: ????
UTF-8 uses 1 to 4 bytes to represent a Unicode character (more info about string encoding), but Stream.Read can return anywhere from 1 to messageBuffer.Length bytes. As a result, the buffer may end with an incomplete UTF-8 character sequence. When that happens, Encoding.UTF8.GetString cannot decode the partial sequence and returns replacement characters () because the missing bytes are unknown. The following code demonstrates this behavior:
C#
var bytes = Encoding.UTF8.GetBytes("😊");
// bytes = new byte[4] { 240, 159, 152, 138 }
var sb = new StringBuilder();
// Simulate reading the stream byte by byte
for (var i = 0; i < bytes.Length; i++)
{
sb.Append(Encoding.UTF8.GetString(bytes, i, 1));
}
Console.WriteLine(sb.ToString());
// "????" instead of "😊"
Encoding.UTF8.GetBytes(sb.ToString());
// new byte[12] { 239, 191, 189, 239, 191, 189, 239, 191, 189, 239, 191, 189 }
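To make the "1 to 4 bytes" range concrete, here is a small sketch that measures the UTF-8 length of a few sample characters. The sample strings and the variable names (`samples`, `byteCounts`) are only illustrative choices; the characters are written as escapes so the result does not depend on the source file encoding.

```csharp
using System;
using System.Text;

// One sample per UTF-8 length: "a", "é" (U+00E9), "€" (U+20AC), "😊" (U+1F60A)
var samples = new[] { "a", "\u00E9", "\u20AC", "\U0001F60A" };

// Number of bytes each sample needs once encoded as UTF-8
var byteCounts = Array.ConvertAll(samples, s => Encoding.UTF8.GetByteCount(s));

// Prints 1, 2, 3, 4
Console.WriteLine(string.Join(", ", byteCounts));
```

Any character that needs more than one byte can be split across two reads, so the problem is not limited to emoji.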
#How to fix the code?
There are multiple ways to fix this. One approach is to buffer all the data first, then decode it in one pass:
C#
string ReadString(Stream stream)
{
    using var ms = new MemoryStream();
    var buffer = new byte[4096];
    int readCount;
    while ((readCount = stream.Read(buffer)) > 0)
    {
        ms.Write(buffer, 0, readCount);
    }
    return Encoding.UTF8.GetString(ms.ToArray());
}
Alternatively, you can wrap the stream in a StreamReader with the correct encoding:
C#
string ReadString(Stream stream)
{
    using var sr = new StreamReader(stream, Encoding.UTF8);
    return sr.ReadToEnd();
}
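One thing to keep in mind with this version: disposing a StreamReader also disposes the underlying stream by default. If other operations still need the stream afterwards, as may be the case with a named pipe, a possible variation is to pass `leaveOpen: true`. The sketch below assumes that requirement; `ReadStringLeaveOpen` and `pipeLike` are hypothetical names, and a MemoryStream stands in for the pipe.

```csharp
using System;
using System.IO;
using System.Text;

// Same idea as above, but the caller keeps ownership of the stream
static string ReadStringLeaveOpen(Stream stream)
{
    using var sr = new StreamReader(
        stream,
        Encoding.UTF8,
        detectEncodingFromByteOrderMarks: true,
        bufferSize: 4096,
        leaveOpen: true); // do not dispose 'stream' when the reader is disposed
    return sr.ReadToEnd();
}

// A MemoryStream stands in for the real stream in this example
var pipeLike = new MemoryStream(Encoding.UTF8.GetBytes("h\u00E9llo"));
var text = ReadStringLeaveOpen(pipeLike);
Console.WriteLine(text == "h\u00E9llo"); // True
Console.WriteLine(pipeLike.CanRead);     // True: the stream was not disposed
```

Note that ReadToEnd still reads until the end of the stream, so this only helps when the stream is used again after all the data has been consumed.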
You can also use the System.Text.Decoder class to correctly decode characters across buffer boundaries. If performance is a concern, consider using PipeReader or Rune for a more memory-efficient approach.
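As an illustration of the Decoder approach, here is a minimal sketch of the original loop rewritten around `Encoding.UTF8.GetDecoder()`. Unlike `GetString`, a Decoder instance remembers incomplete byte sequences between calls. The `bufferSize` parameter is an addition for this example only, so that a one-byte buffer can reproduce the worst case from the demonstration above.

```csharp
using System;
using System.IO;
using System.Text;

// Decodes chunk by chunk; the decoder carries partial sequences over to the next chunk
static string ReadString(Stream stream, int bufferSize = 4096)
{
    var sb = new StringBuilder();
    var decoder = Encoding.UTF8.GetDecoder();
    var buffer = new byte[bufferSize];
    // Worst-case number of chars one chunk can produce
    var chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize)];
    int readCount;
    while ((readCount = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // flush: false keeps a trailing incomplete sequence for the next call
        int charCount = decoder.GetChars(buffer, 0, readCount, chars, 0, flush: false);
        sb.Append(chars, 0, charCount);
    }
    // End of stream: flush so a truncated trailing sequence is not silently dropped
    int remaining = decoder.GetChars(buffer, 0, 0, chars, 0, flush: true);
    sb.Append(chars, 0, remaining);
    return sb.ToString();
}

// Force one-byte reads to split the 4-byte emoji across four chunks
using var input = new MemoryStream(Encoding.UTF8.GetBytes("\U0001F60A"));
var decoded = ReadString(input, bufferSize: 1);
Console.WriteLine(decoded == "\U0001F60A"); // True
```

Compared with buffering everything in a MemoryStream, this keeps only one chunk of bytes in memory at a time, at the cost of slightly more code.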