Conversions in Tango

Text conversions are sometimes a necessity in statically typed languages like D. Tango provides a number of simple ways to convert text, and contains a comprehensive formatting framework inspired by the one in the .NET framework. This framework is also the basis for Tango's locale support, covered in the next chapter of the reference.

The conversion modules in Tango (not including those needed for locale support) reside within tango.text.convert.

This chapter will start with an introduction to the two main parts of tango.text.convert: conversion to and from basic types, and formatting of text. After this, a more detailed presentation of each module will follow.

Conversion between text and numeric types

Most converters handle translation from text to numeric types and back again. Each has two common functions, which are the main workhorses of the converters and give the user full control: parse (to convert from a text representation to some type), and format (to convert from some type to text).

format:

All format functions require an output buffer as the first argument. This avoids heap activity, and indicates to the converter which text encoding you want the output to be in: char[], wchar[], or dchar[]. The input to be converted is always the second argument, and the resultant text output is a slice of the provided output buffer. The user should utilize this return value instead of the original buffer, although the memory is the same (the return is a slice of the correct size, rather than the entire output buffer).

Depending on the situation, there are two common ways to provide this buffer. If the result can live on the stack, one can write

char[10] output; // converter uses .length as max size
auto result = format(output, 15); // formatting the number 15

If the result needs to be placed on the heap, for instance to ship off to some other function, the simplest solution is

auto result = format(new char[10], 15);

Although both are possible, the last form will be used in later examples as it should be somewhat more self-documenting and shorter.

Each format function is templated to handle output in any of the native D character arrays, char[], wchar[], and dchar[], and has the general signature of

T[] format(T[] dst, V x)

where T is either of char, wchar or dchar, and V is the basic type converted from. The signature above is only for illustrative purposes; several templates have further options.

The different versions of format may take additional parameters that control how the given type is formatted. These will be noted in the appropriate places of the reference.

parse:

The parse function takes the text to be converted as its first argument, and returns the converted value. The character type of that text (char, wchar or dchar) tells the converter which encoding the input is in.

Akin to format, parse is templated to handle input in native D text arrays, char[], wchar[], and dchar[], and has a general signature of

V parse(T[])

where T is of type char, wchar or dchar, and V is the type which the parsed string is converted to.

Like format, the different versions of parse may take additional parameters that control how the input is parsed. These will be noted in the appropriate places of this reference.

In addition to these, converters may provide a handful of convenience wrappers, such as the following:

toString, toString16, toString32:

These functions take the input value and produce a string using the character type specified by the function name. These functions do not need a pre-allocated array where the result should be placed.

The general signature is similar to that of format, but without the result parameter.

T[] toStringXX(V x)

toType:

Given string input, most modules have a function (e.g. toDouble in tango.text.convert.Float) that produces a value of the type specified by the name of the function. The input string needs to be parseable in its entirety to succeed.

These functions have a general signature similar to parse

V toType(T[])

Avoiding symbol collision

The conversion modules consist mostly of freestanding functions, and many have common names, like parse. To avoid symbol collisions when compiling, and to avoid surprises when importing other libraries, it is recommended practice to use renaming imports to create a namespace for the imported functions. An example using the Integer module could be

    import Integer = tango.text.convert.Integer;
    auto i = Integer.parse ("32767");

Integer

The tango.text.convert.Integer module has several functions that help with converting strings to integer values, and the other way around. These vary from the quick and dirty to more comprehensive and safe methods. The following example shows the default use of parse and format.

auto value = Integer.parse ("0xff005500");
auto text  = Integer.format (new char[32], 12345L);

parse can take two additional parameters: a uint providing the radix of the number to be parsed, and ate, a pointer to a uint that will hold the length of the string parsed to provide the result. Note that if the radix is specified through the string itself, the provided radix will be ignored.

The supported radix variations are

Specifier Radix Example
B or b 2 (binary) 0b10101
O or o 8 (octal) 0o654
X or x 16 (hexadecimal) 0XDEADBEEF
Nothing (default) 10 (decimal) 150
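
As a minimal sketch of how the radix and ate parameters described above could be used together (assuming the renaming import recommended earlier, and the parameter order given in this section), consider:

```d
import Integer = tango.text.convert.Integer;

void main ()
{
    uint ate;

    // a specifier in the text selects the radix, whatever radix is passed in
    assert (Integer.parse ("0x1F", 10) == 31);
    assert (Integer.parse ("0b101", 10) == 5);

    // without a specifier, the supplied radix applies
    assert (Integer.parse ("1F", 16) == 31);

    // ate reports how much of the input formed the number
    auto value = Integer.parse ("150 apples", 10, &ate);
    assert (value == 150);
    assert (ate == 3);
}
```

Because parse stops at the first character that cannot belong to the number, comparing ate with the input length is a simple way to detect trailing text.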

format also has two additional parameters, namely a flag on how to format, and a flag with additional modifiers. The formatting type flags are

Format flag Description
Format.Unsigned Format as unsigned decimal
Format.Signed (default) Format as signed decimal
Format.Octal Format as an octal number
Format.Hex Format as a lower case hexadecimal number
Format.HexUpper Format as an upper case hexadecimal number
Format.Binary Format as a binary number

The additional style modifiers are

Modifier Description
Flags.None (default) No modifiers are applied
Flags.Prefix Prefix the conversion with a radix specifier
Flags.Plus Prefix positive numbers with a '+'
Flags.Space Prefix positive numbers with a space
Flags.Zero Pad with zeros on the left

A usage example that outputs a prefixed upper case hexadecimal number could be

auto text  = Integer.format (new char[32], 12345L, Integer.Format.HexUpper, Integer.Flags.Prefix);

toString, toString16 and toString32 can take the same flags and modifiers as format, but as mentioned earlier, they do not need the pre-allocated output array.

toInt and toLong provide convenience wrappers for converting from a string to an integer value, both requiring that the input string is parseable in its entirety. Both functions can take a radix as an argument. The default is 10, and a radix in the string will override any provided radix. An exception is raised when the input is not fully parsable.

Additional functions include convert, which does not look for a radix in the input string; trim, which tries to extract an optional sign and radix from the string, in addition to trimming away spacing; and atoi and itoa for quick and dirty conversion from a string to a uint and from a uint to a string. The latter two should be used only when the input is already known to be valid.
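
Since toInt raises an exception on input that is not fully parsable, callers handling untrusted text may want to wrap it. The following is only a sketch: toIntOr is a hypothetical helper, not part of Tango, and it catches the general Exception class because the concrete exception type is not stated here.

```d
import Integer = tango.text.convert.Integer;

// Returns fallback when text is not a complete integer literal.
// toIntOr is a hypothetical helper for this example, not a Tango function.
int toIntOr (char[] text, int fallback)
{
    try
    {
        return Integer.toInt (text);
    }
    catch (Exception e)   // the concrete exception class is deliberately not assumed
    {
        return fallback;
    }
}

void main ()
{
    assert (toIntOr ("42", -1) == 42);
    assert (toIntOr ("42 apples", -1) == -1);   // trailing text: not fully parsable
}
```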

Float

The tango.text.convert.Float module provides the user with quick and easy conversions of floating point numbers (float / double / real) to strings and from strings back to a floating point number. The following example shows the default use of parse and format.

auto value = Float.parse ("3.145");
auto text  = Float.format (new char[64], 3.145);

parse has only one additional parameter, ate, a pointer to a uint that will hold the length of the string parsed to provide the result.

format, however, has two additional parameters: one for the number of decimals to use (the default is 6), and a bool saying whether the number should be formatted using scientific notation (the default is false).

toString, toString16 and toString32 can take the same parameters as format but don't require the pre-allocated array for the result. They return an array of the type specified in the function name.

toDouble presents the simplest form of converting from a string to a floating point number and returns a double value. An exception is raised where the provided input is not fully parsed.

Layout

The Layout class template has a convert() method which can be used to format text dynamically, and is utilized by several other modules such as Stdout and Sprint.

In general, Layout is used through either the sprint() call or the convert() call. opCall is aliased to convert.

The difference is that convert allocates the result on the heap, while sprint writes the result into a buffer supplied by the user.

The rough equivalent of sprintf is the static Format.sprint function, which takes a format string and some arguments, and generates an output string. (This is a nice improvement over sprintf since there’s no chance that you will overflow the output buffer.)

Layout's format string

The syntax used closely follows the C# syntax.

Example:

Layout!(char) Layouter = new Layout!(char)();
int nError = 12;
char[] res = Layouter.convert ("Error {} occurred.", nError); // "Error 12 occurred."

Note that Layout leverages opCall to make the .convert syntax redundant, e.g. these two calls are equivalent:

char[] res1 = Layouter ("Error {} occurred.", nError);         // "Error 12 occurred."
char[] res2 = Layouter.convert ("Error {} occurred.", nError); // "Error 12 occurred."

To perform the conversion and append a newline, the shorthand .formatln() can be used.

Utilizing D metadata, the Formatter doesn’t need the format string to say what type of data you’re formatting - just where you want it. (A common sprintf bug is supplying the wrong data type - there’s no protection from using %s instead of %d and having your program crash when sprintf is called).

The {} in the text above is replaced with the value of nError, but what if you want to specify the number of digits to use? Or the base (hexadecimal etc)?

The text inside the curly braces has this format: '{' [ArgIndex][,Alignment][:Formatspecifier] '}'

The ArgIndex gives the index of the argument that should be used. If no ArgIndex is given, the next argument is consumed. Most of the time you will not need it, and simply put "{}" into the format string. The ArgIndex is especially useful if you want to handle different translations, where the format string is read from a file. It can then be useful to specify the exact argument, because the position in the text can change depending on the language of the translated text. It is OK to use an argument more than once.
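
To illustrate the translation scenario, here is a small sketch in which the call site stays the same while the format string decides the argument order. The greet helper and both format strings are hypothetical; in practice the strings would come from a translation file.

```d
import tango.text.convert.Layout;

// Formats the same two arguments with whichever pattern the translation supplies.
// greet is a hypothetical helper for this example, not a Tango function.
char[] greet (char[] pattern, char[] name, int count)
{
    auto layout = new Layout!(char);
    return layout (pattern, name, count);
}

void main ()
{
    // arguments referenced in the order they are passed
    assert (greet ("{0} has {1} new messages", "Anna", 3) == "Anna has 3 new messages");

    // a translation may need the count first; the call itself is unchanged
    assert (greet ("{1} new messages for {0}", "Anna", 3) == "3 new messages for Anna");

    // an argument may be referenced more than once
    assert (greet ("{0}, {0}! ({1})", "Anna", 3) == "Anna, Anna! (3)");
}
```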

Alignment

If the alignment is positive, the text is right-aligned in a field of the given number of spaces; if it’s negative, it’s left-aligned. The format text has a different meaning for each type that is passed in the argument.

Layouter ("->{,10}<-",  "Hello"); // "->     Hello<-"
Layouter ("->{,-10}<-", "Hello"); // "->Hello     <-"

The alignment does not restrict the argument size. It only fills with spaces if necessary.

Layouter ("->{,5}<-", "HelloHello"); // "->HelloHello<-"

Number formats

Format Description
d decimal format, this is the default
f floating point format, for any floating point type
x hexadecimal format
X uppercase hexadecimal format
o octal format
b binary format

After the format specifier, a positive number gives the minimum number of digits. For the floating point format it gives the number of decimals instead.

Layouter ("{}", 123 );        // "123"       default decimal
Layouter ("{:d}", 123 );      // "123"       decimal
Layouter ("{:d4}", 123 );     // "0123"      decimal width 4
Layouter ("{:f}", 123.456 );  // "123.46"    floating point, note the rounding
Layouter ("{:f4}", 123.456 ); // "123.4560"  floating point, 4 decimals
Layouter ("0x{:x}", 123 );    // "0x7b"      hex
Layouter ("0x{:x4}", 123 );   // "0x007b"    hex width 4
Layouter ("0x{:X}", 123 );    // "0x7B"      hex upper
Layouter ("0o{:o}", 123 );    // "0o173"     octal
Layouter ("0b{:b}", 123 );    // "0b1111011" binary

Array types

Layout can format arrays and associative arrays:

int[]     a = [ 123, 456 ];             // dynamic array
ushort[3] b = [ cast(ushort) 1, 2, 3 ]; // static array

Layouter ("{}", a );                    // "[ 123, 456 ]"
Layouter ("{}", b );                    // "[ 1, 2, 3 ]"

char[][ double ] f;
f[ 1.0 ] = "one".dup;
f[ 3.14 ] = "PI".dup;
Layouter( "{}", f );                    // "{ 1.00=>one, 3.14=>PI }"

Escaping the curly brace

To escape the left curly brace, write it twice. The right curly brace does not need any escaping.

Layouter ("{{" ); // "{"
Layouter ("}}" ); // "}}"

Using sprint or a custom sink

The advantage of the sprint member function over a plain Layout call is that no heap activity is involved. A preallocated buffer is passed to Layout.sprint, which fills it with the result.

Caution: do not use a global buffer in a multithreaded program, as this will cause data corruption.

Best practice is to use a stack-based buffer if the amount of data needed is not too big.

char[250] buf = void; // only get a buffer, no initialization needed.
char[] res = Layouter.sprint ( buf, "My formatted number {}", 123 );

You can also supply your own callback delegate to consume the Layout output.

uint sink( char[] text ){
    // do something with text portion
    return text.length; // number of consumed chars
}
uint count = Layouter ( &sink, "My formatted number {}", 123 ); // with a sink, the call returns a uint rather than the text
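
A slightly fuller sketch of the sink approach follows, assuming the sink has the uint delegate(char[]) shape used above. The collect helper is hypothetical; it simply gathers the fragments into one string so the result can be inspected.

```d
import tango.text.convert.Layout;

// Collects everything the layout emits into one heap-allocated string.
// collect is a hypothetical helper for this example, not a Tango function.
char[] collect (int number)
{
    auto layout = new Layout!(char);
    char[] collected;

    // The layout may call the sink several times, once per fragment,
    // so the fragments are appended rather than assigned.
    uint sink (char[] text)
    {
        collected ~= text;
        return cast(uint) text.length;   // report every char as consumed
    }

    layout (&sink, "My formatted number {}", number);
    return collected;
}

void main ()
{
    assert (collect (123) == "My formatted number 123");
}
```

A real sink would more likely write each fragment to a file or socket, which is where avoiding the intermediate string pays off.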

Sprint

Sprint is a Format frontend which pre-allocates an output buffer, and is therefore somewhat more efficient. Sprint accepts the standard C# style formatting, and the result is restricted to the buffer size. In the same vein as Format is a replacement for the printf family of functions, Sprint is a replacement for the vsprintf family. Sprint resides in the tango.text.convert.Sprint module.

Sprint can be handy when you wish to format text for a Logger, or similar. It avoids heap activity during the conversion by providing a fixed size buffer for output. Please note that the class itself is stateful, and therefore a single instance is not shareable across multiple threads.

// create a Sprint instance
auto sprint = new Sprint!(char);

// write formatted text to the console
Cout (sprint("{0} is {1} times {2}", "Julio", 32, 5.68732));

As can be seen above, Sprint is templated and either of the character types char, wchar or dchar can be used with it.

Utf

The tango.text.convert.Utf module contains all the functions needed to convert (or transcode) between the various Unicode encodings, including the ASCII subset.

This module has only six functions, two each of toString, toString16 and toString32, returning char[], wchar[] and dchar[] respectively. The two versions of each take either of the two other string types and convert it to the resulting type.

These functions are highly tuned for x86 CPUs, but should also provide the user with very high performance in the more general case.

All of these functions are straightforward to use in the common case.

auto utf8 = toString("test");
auto utf16 = toString16("test");
auto utf32 = toString32("test");

These transcoders support both streaming conversion and atomic conversion, where the former requires continuation and the latter does not. Streaming conversion is assisted via an optional argument to return how much of the input was consumed so far, thus indicating how much should be retained (and prefixed) to further input. In such scenarios, unconsumed content typically represents partial encodings such as an incomplete utf8 sequence.

The transcoders will operate most efficiently when you provide them with a workspace - this avoids heap activity. A typical usage scenario is thus:

void doSomething(char[] utf8)
{
    wchar[256] tmp = void;
    auto utf16 = toString16(utf8, tmp);

    // do something with utf16 result
    ...
}

Note that we initialize the tmp buffer with void, ensuring we don't spend time clearing the buffer before it is used. This technique can be applied in situations where the stack content will be short-lived.

The full signature for these functions is

T[] toStringXX (U[] input, T[] output=null, uint* ate=null)

The output parameter is for the cases where the user wants to provide a pre-allocated space for the result, and the ate parameter to provide the caller with how much of the provided input was processed.
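
The streaming case can be sketched as follows, assuming the transcoder behaves as described above and leaves an incomplete trailing sequence unconsumed when ate is supplied. The input is cut in the middle of a two-byte UTF-8 sequence on purpose; the variable names are illustrative only.

```d
import Utf = tango.text.convert.Utf;

void main ()
{
    // the final character (U+00E9) occupies two bytes in UTF-8
    char[] whole = "caf\u00e9";

    // simulate a read that stopped in the middle of that sequence
    char[] chunk = whole[0 .. $ - 1];

    wchar[16] tmp = void;
    uint ate;
    auto utf16 = Utf.toString16 (chunk, tmp, &ate);

    // everything before the incomplete sequence was converted ...
    assert (utf16 == "caf"w);

    // ... and ate marks where the unconsumed remainder begins
    auto carry = chunk[ate .. $];
    assert (carry.length == 1);
}
```

The carry slice is what a streaming reader would prefix to the next chunk of input before calling the transcoder again.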

UnicodeBom

UnicodeBom from the tango.text.convert.UnicodeBom module is useful for encoding content to and from Unicode encodings with a BOM. You would normally use it indirectly via tango.io.UnicodeFile, but it can be handy on its own as well.

UnicodeBom can autodetect the encoding of a file by inspecting the first few bytes for a specific pattern. Alternatively, you can be explicit about the encoding using provided constants e.g. Encoding.UTF_16BE, Encoding.UTF_32LE, etc.

Other functionality includes transcoding text back and forth from a templated target type (char[], wchar[], dchar[]). To illustrate, we read and convert a file with a BOM to char[] and display it on the console:

  auto file = new File("myfile.txt");
  auto bom = new UnicodeBom!(char)(Encoding.Unknown);
  
  Stdout (bom.decode(file.read));

getEncoding and getSignature let the user get information about the BOM, and setup lets the user set (or reset) the encoding to be used. The first returns the encoding it is set up with, either the initial, or one detected through decode. The other returns the signature of this BOM.

Encoding Signature Description
Encoding.Unknown N/A Fully unknown encoding
Encoding.UTF_8 N/A Expecting a UTF-8 encoding
Encoding.UTF_8N x"efbbbf" Explicitly set UTF-8N encoding
Encoding.UTF_16 N/A Expecting a UTF-16 encoding
Encoding.UTF_16BE x"feff" Explicitly set UTF16BE (Big Endian) encoding
Encoding.UTF_16LE x"fffe" Explicitly set UTF16LE (Little Endian) encoding
Encoding.UTF_32 N/A Expecting a UTF-32 encoding
Encoding.UTF_32BE x"0000feff" Explicitly set UTF32BE (Big Endian) encoding
Encoding.UTF_32LE x"fffe0000" Explicitly set UTF32LE (Little Endian) encoding

encode and decode will convert between the different encodings, depending on how the instance is set up. The first one will encode content provided in the type UnicodeBom was instantiated with into a void[] with the encoding specified. The latter will decode input data in the specified encoding (or the detected encoding if possible) into an array of the type that UnicodeBom was instantiated with.

TimeStamp

HTTP time stamps can be converted from strings to ulong and from ulong to strings using the tango.text.convert.TimeStamp module. This converter will take any of the RFC 1123, RFC 850, asctime, DOS time or ISO-8601 formats as input strings, and return seconds since January 1st 1970, or conversely convert seconds since that day to an RFC 1123 compliant date string.

The following example shows the default use of parse and format.

auto date = "Sun, 06 Nov 1994 08:49:37 GMT"; // RFC 1123 compliant date string

auto msSinceEpoch = TimeStamp.parse (date);
auto text = TimeStamp.format (new char[64], msSinceEpoch);

parse has only one additional parameter, ate, a pointer to a uint that will hold the length of the string parsed to provide the result. This function tries each of the RFC 1123, RFC 850 and asctime formats, and returns the value InvalidEpoch if none of them could be parsed. toTime is a simpler wrapper that will raise an exception if the input string fails to be parsed in its entirety.

format has no additional parameters beyond those common to all converter modules, and takes the input time as a ulong value. toString, toString16 and toString32 wrap this function so that the user does not need to provide a pre-allocated buffer for the result.

To convert from the DOS time and ISO-8601 formats, use dostime or iso8601 respectively. These are accompanied by rfc1123, rfc850 and asctime, which are used by parse. All of these functions return the number of elements in the input string consumed by the parsing. The resulting value is provided through an inout parameter. The general signature for all these functions can be seen as

int fromTimeFormat(T[] src, inout ulong value) // fromTimeFormat can be any of
                                               // the 5 functions mentioned above

The input source can be of any of the types char[], wchar[] or dchar[].

Generic Conversions

Tango also supports generic value conversions through the to function located in the tango.util.Convert module. This function allows you to perform value-preserving conversions between arbitrary types. For example:

  real pi = to!(real)("3.14159");

The above example will convert the string value 3.14159 into the closest real equivalent. There are two cases where conversion can fail: firstly, if to cannot find a suitable conversion between the given types, it will issue a compile-time error informing you of this. For example, you cannot convert from an imaginary type to an integer type, since they are from almost entirely disjoint domains.

Secondly, if the to function finds it cannot represent the given value in the destination type at runtime, it will throw a ConversionException exception. For example, attempting to convert the string values "256", "-1" or "que?" into a ubyte will throw this exception.
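
The runtime failure case can be handled with an ordinary try/catch. The sketch below uses a hypothetical fitsInUbyte helper and assumes ConversionException is made available by importing tango.util.Convert.

```d
import tango.util.Convert;

// Reports whether text holds a value that a ubyte can represent.
// fitsInUbyte is a hypothetical helper for this example, not a Tango function.
bool fitsInUbyte (char[] text)
{
    try
    {
        to!(ubyte)(text);
        return true;
    }
    catch (ConversionException e)
    {
        return false;
    }
}

void main ()
{
    assert ( fitsInUbyte ("255"));
    assert (!fitsInUbyte ("256"));    // too large
    assert (!fitsInUbyte ("-1"));     // negative
    assert (!fitsInUbyte ("que?"));   // not a number at all
}
```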

Below is a table of the built-in supported conversions.

Destination Type Source Types
bool Integer types (0/!0) and string types ("true"/"false")
Integer types bool, real types and string types
Real types Integer types, string types
Imaginary types Complex types
Complex types Integer types, real types, imaginary types
String types bool, integer types, real types

You can also convert between different types of arrays, as well as between different types of associative arrays, both of which are converted one element at a time. In order to convert from an array T[] to S[], you need to be able to convert Ts to Ses. Likewise, to convert between associative array types Va[Ka] and Vb[Kb], you need to be able to convert Vas to Vbs, and Kas to Kbs.

Finally, to can be extended to support user-defined structures and classes. This is done by defining appropriate static and instance member functions on your type. The below table summarises the rules involved. Note that in addition to camel-case TypeNames, c-case type_names can also be used in method names.

Destination Source Calls tried
dst_type dst src_type src src.toDstType(), src.to!(dst_type)(), dst_type.fromSrcType(src), dst_type.from!(src_type)(src)
dst_type[] src_type src src.toDstTypeArray()
dst_type src_type[] src dst_type.fromDstTypeArray(src)
dst_type[key_type] src_type src src.toKeyTypeToDstTypeMap()
dst_type src_type[key_type] src dst_type.fromKeyTypeToSrcTypeMap(src)
char[], wchar[], dchar[] src_type src src.toString(), src.toString16(), src.toString32()
dst_type char[], wchar[], dchar[] src dst_type.fromUtf8(src), dst_type.fromUtf16(src), dst_type.fromUtf32(src)

Note that in the case of string conversions, if the appropriate UTF encoding is not directly supported, Tango will convert to whichever encoding is available, and then transcode the result of that into the expected type. Note also that if specifically named methods are not found, to will always fall back on the general src.to!(dst_type)() and dst_type.from!(src_type)(src) calls before giving up.

Here is an example of a simple user-defined structure that implements several conversions.

  struct Foo
  {
    real value;
  
    int toInt()
    {
        return to!(int)(value);
    }
  
    static Foo fromReal(real v)
    {
        return Foo(v);
    }
  
    dchar[] toString32()
    {
        return "Foo!"d;
    }

    T to(T)()
    {
        static if( is( T == ireal ) )
            return value*-1.0i;
        
        else
            static assert(false);
    }

    static Foo from(T)(T src)
    {
        static if( is( T == ireal ) )
            return Foo(src*-1.0i);

        else
            static assert(false);
    }
  }

The following conversions are thus valid:

  to!(int)(Foo(42));            // == 42
  to!(Foo)(3.14159);            // == Foo(3.14159)
  to!(char[])(Foo(123.456));    // == "Foo!"c (via toString32, then transcoded)
  to!(ireal)(Foo(456.789));     // == -456.789i
  to!(Foo)(-789.123i);          // == Foo(-789.123), since -789.123i * -1.0i == -789.123

Attachments

  • conversions.d (3.3 kB) -The Code Listing for this chapter, added by jpelcis on 12/09/06 19:22:16.