This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!
Strings are complicated. A common mistake is using char.IsUpper or char.ToUpper incorrectly, such as when converting the first character of a string to uppercase. The naive approach, which is often wrong, looks like this:
C#
static string FirstCharacterToUpperCaseBad(string str)
{
if(string.IsNullOrEmpty(str) || char.IsUpper(str[0]))
return str;
return char.ToUpperInvariant(str[0]) + str[1..];
}
This method works for many strings. For example, "abc" correctly becomes "Abc". However, the Latin alphabet is not the only writing system. Consider the Osage alphabet. The character 𐓸 should become 𐓐 when converted to uppercase. However, FirstCharacterToUpperCaseBad("𐓸") returns the same string.
In .NET, a string is a sequential read-only collection of char objects. A char represents a UTF-16 code unit. UTF-16 is a character encoding that maps Unicode code points to sequences of 16-bit code units. It is a variable-length encoding, where code points are encoded using one or two 16-bit code units.
The string "𐓸" consists of two char values because it requires two UTF-16 code units. As a result, "𐓸".Length returns 2. The following screenshot shows how a and 𐓸 are encoded in UTF-16:
source: https://tools.meziantou.net/string-info
Accessing "𐓸"[0] retrieves only the first UTF-16 code unit, which is just half the character. Without both units, it is impossible to determine the character's case or convert it correctly. As a result, char.ToUpperInvariant("𐓸"[0]) returns the character unchanged.
The correct approach is to check whether the first character is part of a surrogate pair (two char values) and use both for the conversion. Rather than handling this manually with char.IsSurrogate, you can use the Rune type, which abstracts the complexity:
C#
static string FirstCharacterToUpperCase(string str)
{
if(string.IsNullOrEmpty(str))
return str;
// Get the first Rune of the string
var result = Rune.DecodeFromUtf16(str, out var rune, out var charsConsumed);
// Check if the rune is uppercase
if (result != OperationStatus.Done || Rune.IsUpper(rune))
return str;
// Convert the first rune to uppercase and concatenate it to the rest of the string
return Rune.ToUpperInvariant(rune) + str[charsConsumed..];
}
You can now test this method with various strings:
C#
FirstCharacterToUpperCase("abc def"); // Abd def (Latin)
FirstCharacterToUpperCase("𐓷𐓘𐓻𐓘𐓻𐓟 𐒻𐓟"); // 𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟 (Osage)
FirstCharacterToUpperCase("𐐿𐐱𐐻"); // 𐐗𐐱𐐻 (Deseret)
// etc. (U+10C80, U+118A0, U+16E40)
In general, when working with arbitrary text, consider using Rune instead of char.
Do you have a question or a suggestion about this post? Contact me!