Correctly converting a character to lower/upper case

 
 
  • Gérald Barré

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

Strings are complicated. A common mistake is using char.IsUpper or char.ToUpper incorrectly, such as when converting the first character of a string to uppercase. The naive approach, which is often wrong, looks like this:

C#
static string FirstCharacterToUpperCaseBad(string str)
{
    if(string.IsNullOrEmpty(str) || char.IsUpper(str[0]))
        return str;
    return char.ToUpperInvariant(str[0]) + str[1..];
}

This method works for many strings. For example, "abc" correctly becomes "Abc". However, the Latin alphabet is not the only writing system. Consider the Osage alphabet. The character 𐓸 should become 𐓐 when converted to uppercase. However, FirstCharacterToUpperCaseBad("𐓸") returns the same string.

In .NET, a string is a sequential read-only collection of char objects. A char represents a UTF-16 code unit. UTF-16 is a character encoding that maps Unicode code points to sequences of 16-bit code units. It is a variable-length encoding, where code points are encoded using one or two 16-bit code units.

The string "𐓸" consists of two char values because it requires two UTF-16 code units. As a result, "𐓸".Length returns 2. The following screenshot shows how a and 𐓸 are encoded in UTF-16:

source: https://tools.meziantou.net/string-info

Accessing "𐓸"[0] retrieves only the first UTF-16 code unit, which is just half the character. Without both units, it is impossible to determine the character's case or convert it correctly. As a result, char.ToUpperInvariant("𐓸"[0]) returns the character unchanged.

The correct approach is to check whether the first character is part of a surrogate pair (two char values) and use both for the conversion. Rather than handling this manually with char.IsSurrogate, you can use the Rune type, which abstracts the complexity:

C#
static string FirstCharacterToUpperCase(string str)
{
    if(string.IsNullOrEmpty(str))
        return str;

    // Get the first Rune of the string
    var result = Rune.DecodeFromUtf16(str, out var rune, out var charsConsumed);

    // Check if the rune is uppercase
    if (result != OperationStatus.Done || Rune.IsUpper(rune))
        return str;

    // Convert the first rune to uppercase and concatenate it to the rest of the string
    return Rune.ToUpperInvariant(rune) + str[charsConsumed..];
}

You can now test this method with various strings:

C#
FirstCharacterToUpperCase("abc def");   // Abd def   (Latin)
FirstCharacterToUpperCase("𐓷𐓘𐓻𐓘𐓻𐓟 𐒻𐓟"); // 𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟 (Osage)
FirstCharacterToUpperCase("𐐿𐐱𐐻");       // 𐐗𐐱𐐻       (Deseret)
// etc. (U+10C80, U+118A0, U+16E40)

In general, when working with arbitrary text, consider using Rune instead of char.

Do you have a question or a suggestion about this post? Contact me!

Follow me:
Enjoy this blog?