
©Copyright 1996 Rogue Wave Software

Multibyte Strings

Class RWCString provides limited support for multibyte strings, sometimes used in representing various alphabets (see Chapter 16: Localizing Alphabets...). Because a multibyte character can consist of two or more bytes, the length of a string in bytes may be greater than or equal to the number of actual characters in the string.

If the RWCString contains multibyte characters, you should use member function mbLength() to return the number of characters. On the other hand, if you know that the RWCString does not contain any multibyte characters, then the results of length() and mbLength() will be the same, and you may want to use length() because it is much faster. Here's an example using a multibyte string stored in an RWCString named Sun:

RWCString Sun("\306\374\315\313\306\374");
cout << Sun.length();                               // Prints "6"
cout << Sun.mbLength();                             // Prints "3"

The string in Sun is the name of the day Sunday in Kanji, using the EUC (Extended UNIX Code) multibyte code set. With EUC, a single character may be 1 to 4 bytes long. In this example, the string Sun consists of 6 bytes, but only 3 characters.
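To make the byte-versus-character distinction concrete, here is a minimal sketch in standard C++ that counts characters by inspecting lead bytes. It does not use RWCString and is not how mbLength() is implemented. The helper names eucJpCharCount() and demoSunday() are hypothetical, and the byte rules assume well-formed EUC-JP in particular (one to three bytes per character), which is narrower than the general EUC case described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of RWCString: counts the characters in a
// string that is ASSUMED to hold well-formed EUC-JP text.
std::size_t eucJpCharCount(const std::string& bytes) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead == 0x8F) {
            i += 3;   // SS3 prefix: three-byte character
        } else if (lead == 0x8E || (lead >= 0xA1 && lead <= 0xFE)) {
            i += 2;   // SS2 prefix or a two-byte Kanji/kana character
        } else {
            i += 1;   // ASCII or other single-byte value
        }
        ++count;
    }
    return count;
}

// Usage sketch with the same six bytes as the Sun example.
void demoSunday() {
    std::string sun("\306\374\315\313\306\374");
    assert(sun.size() == 6);            // length in bytes
    assert(eucJpCharCount(sun) == 3);   // length in characters
}
```

A scan like this has to visit every byte, which suggests why a character count costs more than a byte count that is simply stored with the string.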

In general, the second or later byte of a multibyte character may be null. This means the length in bytes of a character string may or may not match the length given by strlen(). Internally, RWCString makes no assumptions[3] about embedded nulls, and hence can be used safely with character sets that use null bytes. You should also keep in mind that while RWCString::data() always returns a null-terminated string, there may be earlier nulls in the string. All of these effects are summarized in the following program:

#include <rw/cstring.h>
#include <rw/rstream.h>
#include <string.h>
int main() {
  RWCString a("abc");                                          // 1
  RWCString b("abc\0def");                                     // 2
  RWCString c("abc\0def", 7);                                  // 3

  cout << a.length();                                 // Prints "3"
  cout << strlen(a.data());                           // Prints "3"

  cout << b.length();                                 // Prints "3"
  cout << strlen(b.data());                           // Prints "3"

  cout << c.length();                                 // Prints "7"
  cout << strlen(c.data());                           // Prints "3"
  return 0;
}

You will notice that two different constructors are used above. The constructor in the lines marked 1 and 2 takes a single argument of type const char*, a null-terminated string. Because it takes a single argument, it may be used in type conversion (ARM 12.3.1). The length of the result is determined the usual way, by the number of bytes before the null. The constructor in the line marked 3 takes a const char* and a run length. The constructor will copy this many bytes, including any embedded nulls.

The length of an RWCString in bytes is always given by RWCString::length(). Because the string may include embedded nulls, this length may not match the results given by strlen().
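The same distinction can be tried without the library. The sketch below uses std::string purely as a stand-in for RWCString, on the assumption that its two analogous constructors (pointer only, and pointer plus count) behave the same way with respect to embedded nulls. The Lengths struct and the measure() and demoEmbeddedNull() functions are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper type: the stored length in bytes next to the
// length that strlen() reports for the same buffer.
struct Lengths {
    std::size_t stored;
    std::size_t cstyle;
};

Lengths measure(const std::string& s) {
    Lengths r;
    r.stored = s.size();                 // counts embedded nulls
    r.cstyle = std::strlen(s.c_str());   // stops at the first null
    return r;
}

void demoEmbeddedNull() {
    std::string b("abc\0def");      // pointer only: copying stops at the null
    std::string c("abc\0def", 7);   // pointer and count: all 7 bytes copied

    assert(measure(b).stored == 3);
    assert(measure(b).cstyle == 3);
    assert(measure(c).stored == 7);
    assert(measure(c).cstyle == 3);
}
```

Code that passes such a string to a C function taking only a char* will see just the bytes before the first null, so the stored length needs to be passed alongside the pointer whenever embedded nulls are possible.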

Remember that indexing and other operators (basically, all functions using an argument of type size_t) work in bytes. Hence, these operators will not work for RWCStrings containing multibyte strings.
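One way to picture the problem is to translate a character index into a byte offset by hand before using any byte-based operation. The following is only a sketch under stated assumptions: it uses std::string rather than RWCString, it assumes well-formed EUC-JP text, and the helpers eucJpCharBytes(), byteOffsetOfChar(), and demoByteIndexing() are hypothetical, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte length of an EUC-JP character, judged from
// its lead byte. ASSUMES well-formed EUC-JP input.
std::size_t eucJpCharBytes(unsigned char lead) {
    if (lead == 0x8F) return 3;                                   // SS3 prefix
    if (lead == 0x8E || (lead >= 0xA1 && lead <= 0xFE)) return 2; // two-byte
    return 1;                                                     // single byte
}

// Hypothetical helper: byte offset at which the character with 0-based
// index charIndex starts, or s.size() if the string has fewer characters.
std::size_t byteOffsetOfChar(const std::string& s, std::size_t charIndex) {
    std::size_t offset = 0;
    while (charIndex > 0 && offset < s.size()) {
        offset += eucJpCharBytes(static_cast<unsigned char>(s[offset]));
        --charIndex;
    }
    return offset < s.size() ? offset : s.size();
}

void demoByteIndexing() {
    std::string sun("\306\374\315\313\306\374");

    // Byte index 1 lands in the middle of the first character.
    assert(sun.substr(1, 1) == "\374");

    // Character index 1 starts at byte 2 and spans two bytes.
    std::size_t begin = byteOffsetOfChar(sun, 1);
    std::size_t end = byteOffsetOfChar(sun, 2);
    assert(sun.substr(begin, end - begin) == "\315\313");
}
```

The offsets produced this way are still byte offsets, so they can be handed to byte-based functions without cutting a character in half.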

