
Getting a dchar from a Buffer containing UTF-8

Moderators: larsivi kris

Posted: 02/02/07 20:25:25

A way of getting a dchar from a UTF-8 Buffer like Cin.buffer would be handy. This would mean that characters are read until a complete dchar can be constructed. I figured the following might work:

auto getter = new Reader(Cin.buffer);
dchar c;
getter(c);

But typing a single 'a' at stdin, for instance, causes it to wait for more characters; after another few 'a's, what we get is "Unicode.toUtf8 : invalid dchar". Reading a UTF-32-encoded file would presumably work, but ASCII input, at least, fails.

Is there a built-in way of doing this, or do I need to read char by char and decode manually, as with Phobos? As an example, this is how I'd do it that way with my current, meager knowledge of Tango:

import tango.core.Exception;
import tango.io.Stdout;
import tango.io.Console : Cin;
import tango.io.protocol.Reader;
import tango.text.convert.Utf : toUtf32;

void main() {
	auto getter = new Reader(Cin.buffer);

	dchar ch;

	for (;;) {
		try ch = readchar(getter);
		catch (IOException) { break; } // any smarter way of catching EOF?
		if (ch == 26) break; // any smarter way of catching DOS/Windows Ctrl-Z on the console?
		Stdout(ch);
	}
	Stdout();
}

dchar readchar(Reader s) {
	const ubyte[256] UTF8_BYTES_NEEDED =
	[
		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
		1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
		0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
		0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
		0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
		0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
		0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
		2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
		3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
		4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0
	];

	char c;
	s(c);

	// ASCII
	if (c <= 127)
		return c;
	else {
		// UTF-8
		char[4] str;
		str[0] = c;
		ubyte n = UTF8_BYTES_NEEDED[c];

		if (!n)
			throw new Exception("Invalid UTF-8 in input.");

		for (size_t i = 1; i < n; ++i)
			s(str[i]);

		return toUtf32(str[0..n])[0];
	}
}

UTF8_BYTES_NEEDED is an RFC 3629 compliant version of Phobos's std.utf.UTF8stride.
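Since the table is just the RFC 3629 lead-byte ranges written out longhand, here is a minimal sketch (plain D, no library assumptions; the module-level name is hypothetical) that builds the same 256 entries programmatically:

private ubyte[256] utf8BytesNeeded; // same contents as the table above

static this() {
	for (int i = 0x00; i <= 0x7f; ++i) utf8BytesNeeded[i] = 1; // ASCII
	for (int i = 0xc2; i <= 0xdf; ++i) utf8BytesNeeded[i] = 2; // 0xc0/0xc1 would be overlong, stay 0
	for (int i = 0xe0; i <= 0xef; ++i) utf8BytesNeeded[i] = 3;
	for (int i = 0xf0; i <= 0xf4; ++i) utf8BytesNeeded[i] = 4; // 0xf5..0xff would exceed U+10FFFF, stay 0
	// everything else - continuation bytes 0x80..0xbf and invalid lead bytes - stays 0
}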


Posted: 02/03/07 00:53:24 -- Modified: 02/03/07 07:14:47 by kris

If you had all your data in one chunk, life would be simple:

import Utf = tango.text.convert.Utf;

auto myBigChars = Utf.toUtf32 (myUtf8Text);

But it looks like you want to stream them in, à la chunking? If so, I suspect doing something like this might be more effective:

import tango.io.Console;
import tango.io.model.IBuffer;

import Utf = tango.text.convert.Utf;

// grab the console input buffer, and convert available data to a dchar[] 
auto results = convert (Cin.buffer);

dchar[] convert (IBuffer buffer)
{
  // wait for available content
  buffer.wait;

  // convert what's available, leaving partial encodings untouched
  uint ate;
  auto myBigChars = Utf.toUtf32 (cast(char[]) buffer.slice, null, &ate);

  // gobble up what we converted (throw it away)
  buffer.slice (ate);

  // make room for more input (moves partial-encoding to buffer head)
  buffer.compress;
  
  return myBigChars;
}

The above is converting big chunks of (partial) input into a dchar[] and passing it back to the caller.

If you were doing something simpler, I'd suggest using buffer.slice(1) to pull back 1 byte at a time. There's also a buffer.skip() to jump back and/or forth in the buffer, which can be used as a simple unget.
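For instance, a rough unget sketch along those lines - assuming, as described above, that slice(1) consumes and returns one byte and that skip() takes a signed byte offset:

// read one byte, inspect it, then put it back
auto one = cast(char[]) buffer.slice (1);
char c = one[0];
buffer.skip (-1); // step the read position back: a one-byte unget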

Another approach would be to grab the Console conduit directly:

char[128] text;

auto bytesRead = Cin.conduit.read (text);

And work on the text[] yourself, perhaps using toUtf32() (see the sketch after the next snippet). A similar approach would be to map a new buffer onto local memory:

char[128] text;

auto buffer = new Buffer (text, 0);
buffer.setConduit(Cin.conduit);
buffer.wait;
auto available = buffer.readable;
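To flesh out the raw conduit approach above, here's a sketch of handling a partial trailing sequence yourself, assuming the three-argument Utf.toUtf32 used earlier (input, optional output, pointer receiving the bytes consumed):

char[128] text;
auto bytesRead = Cin.conduit.read (text);

uint ate;
auto chars = Utf.toUtf32 (text[0 .. bytesRead], null, &ate);

// text[ate .. bytesRead] is an incomplete trailing encoding; copy it to the
// front of text before the next read, much as buffer.compress does above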

Going back to Reader for a moment: Protocols, of which Reader/Writer are a part, are not equivalent to the Phobos stream classes. Instead, they are serialization converters, and deal in discrete elements rather than streams. If an element is not there, it's an exceptional condition rather than an Eof condition. Because of this, Reader is perhaps not the right solution for what you want? Buffer is a step below and deals with untyped streaming data. Conduit is the lowest layer, for chatting directly with the (untyped) IO.
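In code, the three layers stack like so, with the names used throughout this thread:

auto conduit = Cin.conduit;         // lowest layer: raw, untyped device IO
auto buffer  = Cin.buffer;          // middle layer: buffered, untyped streaming
auto reader  = new Reader (buffer); // top layer: typed, discrete-element serialization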

Hope this helps?

Posted: 02/03/07 03:08:16 -- Modified: 02/03/07 03:14:29 by kris

Something like this might be useful to add somewhere:

import tango.core.Exception;

import tango.io.model.IBuffer;

import Utf = tango.text.convert.Utf;

class CharStream (T)
{
        private dchar[]         tmp,
                                slice;
        private dchar           pushed;
        private IBuffer         buffer;
        private uint            index;

        this (IBuffer buffer, uint size=128)
        {
                this.buffer = buffer;
                tmp = new dchar[size];
        }

        // throws an IOException when there's nothing left!
        dchar get ()
        {
                if (pushed != pushed.init)
                   {
                   auto ret = pushed;
                   pushed = pushed.init;
                   return ret;
                   }

                if (index < slice.length)
                    return slice [index++];

                slice = convert (buffer);
                index = 0;
                return get;
        }

        // push one dchar only
        void unget (dchar c)
        {
                assert (pushed is pushed.init && c != pushed.init);
                pushed = c;
        }

        private dchar[] convert (IBuffer buffer)
        {
                // wait for available content
                buffer.slice (1, false);

                // convert what's available, leaving partial encodings untouched
                uint ate;
                auto myBigChars = Utf.toUtf32 (cast(T[]) buffer.slice, tmp, &ate);

                // gobble up what we converted (throw it away)
                buffer.slice (ate);

                // make room for more input (moves partial-encoding to buffer head)
                buffer.compress;
  
                return myBigChars;
        }
}
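Presumably it would be used along these lines (untested, like the class itself):

auto chars = new CharStream!(char) (Cin.buffer);

dchar c = chars.get;     // pull one decoded character
chars.unget (c);         // push it back ...
assert (chars.get == c); // ... and read it again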

Posted: 02/03/07 03:27:52 -- Modified: 02/03/07 19:13:51 by kris -- Modified 4 Times

It would also be feasible to attach a filter to Cin.conduit, where the filter would do the conversion before content reached any of the upper layers. With that in place, you'd be guaranteed that everything returned from Cin.conduit.read() would be a dchar, and that everything retrieved from a related buffer would be a dchar.

Filters are pretty handy where you know what the content is going to be beforehand. Compression is a common example.

That aside, here's yet another approach:

dchar get (IBuffer buffer)
{
        char[] segment;

        uint scan (void[] data)
        {
                auto i = cast(uint) data[0];

                if (i & 0x80)
                   {
                   // isolate the number of bytes involved (may be incorrect)
                   i = ((i & 0xf0) - 0xb0) >> 4;
                   if (data.length < i)
                       return IConduit.Eof;
                   }
                else
                   i = 1;
                
                segment = cast(char[]) data [0..i];
                return i;
        }


        if (! buffer.next (&scan))
              return dchar.init;

        dchar[1] tmp;
        Utf.toUtf32 (segment, tmp);
        return tmp[0];
}

In this one, we're using the buffer tokenizing facilities [buffer.next] to chunk the input into something more appropriately sized, which we then convert using Utf.toUtf32(). The value dchar.init is returned when there's no more input available, and an exception is thrown where a decode error occurs. Naturally, this is untested; but it looks fairly ok :)

To squeeze a little more speed out of it, the decoding could be done inline (there's only one dchar involved) instead of using toUtf32() - but the idea remains the same regardless. All the same, this is actually a really efficient mechanism, since the data is buffered from a device in big lumps and in turn sliced into smaller ones. This is how the StreamIterator templates operate, located in the tango.text.stream package.
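For the record, that inline decode for a single dchar might look something like the following - a sketch that trusts its input and skips all the validity checks discussed further down:

dchar decodeOne (char[] s)
{
        dchar c = s[0];
        if (c & 0x80)
           {
           // sequence length (2, 3 or 4 bytes), judged from the lead byte
           uint n = (c < 0xe0) ? 2 : (c < 0xf0) ? 3 : 4;

           // keep only the payload bits of the lead byte
           c &= (1 << (7 - n)) - 1;

           // fold in six payload bits from each continuation byte
           for (uint i = 1; i < n; ++i)
               c = (c << 6) | (s[i] & 0x3f);
           }
        return c;
}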

One of the nice things about this approach is that it can be interleaved with other buffer clients, and all will stay in synch correctly. In other words, as long as all buffer clients eat only what they need they can be combined with others.

Anyway ... I hope this has been a worthwhile exercise and that you get what you need out of it :)

Posted: 02/03/07 10:27:26

A standard CharStream, or something like it, would be lovely. All it needs is an arbitrary-sized buffer for ungetting (I don't need this myself, but it would complete the functionality). And some testing: I presume the template is there so it could work with all of [dw]char, but it fails with wchar, at least - I get garbage output when trying to read a UTF-16 file.

Posted: 02/03/07 19:09:15

Deewiant wrote:

I presume the template is there so it could work with all of [dw]char, but it fails with wchar, at least - I get garbage output when trying to read a UTF-16 file.

Yuck ... probably endian issues :)

When dealing with files specifically, I'd recommend trying the UnicodeFile module. It looks just like File, but with embedded Unicode conversion. The template argument tells it what format you want to work with, and the rest is handled under the covers:

// access a unicode file 
auto file = new UnicodeFile!(char) ("myfile", Encoding.Unknown);

// display on console
Cout (file.read).newline;

Posted: 02/03/07 19:11:58

Deewiant wrote:

A standard CharStream, or something like it, would be lovely. All it needs is an arbitrary-sized buffer for ungetting

We'll look into some kind of StreamIterator approach then (using buffer.next), with a pushback buffer. I'd be interested to hear the results if you happen to fuss around with that buffer.next() example ...

Posted: 02/09/07 08:22:21 -- Modified: 02/09/07 08:24:41 by Deewiant

Changing the nested scan() function to look like the following made it work:

uint scan (void[] data) {
        if (!data.length)           // added this
                return IConduit.Eof;

        // changed this: casting void to uint doesn't work, and casting void[] to uint[] causes array cast misalignment
        // also, the bit math didn't match RFC 3629 in all cases, so just use the lookup table
        auto i = UTF8_BYTES_NEEDED[(cast(ubyte[])data)[0]];

        // added !i: fail on invalid UTF-8 byte
        if (!i || data.length < i)
                return IConduit.Eof;

        segment = cast(char[]) data [0..i];
        return i;
}

private const ubyte[256] UTF8_BYTES_NEEDED = [
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
	2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
	3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
	4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0
];

However, I always got dchar.init from get() when inputting non-ASCII characters. I checked it out, and it seems Utf.toUtf32 was doing its job incorrectly. I don't know why (might have something to do with using the two-argument form instead of just returning Utf.toUtf32(segment)[0]), but my own function does work:

// Cheers to Kris's tango.text.convert.Utf for the basic bit math
// Cheers to the unit tests for the additional if statements
private dchar toUtf32(char[] utf8)
in {
	assert (UTF8_BYTES_NEEDED[utf8[0]] != 0);
} body {
	dchar c = cast(dchar)utf8[0];

	if (c & 0x80)
		if (c < 0xe0) {
			if (invalidContinuationByte(utf8[1]))
				return dchar.init;

			c &= 0x1f;
			c = (c << 6) | (utf8[1] & 0x3f);

		} else if (c < 0xf0) {
			if (
				invalidContinuationByte(utf8[1]) || invalidContinuationByte(utf8[2]) ||
				(c == 0xe0 && (utf8[1] | 0x9f) == 0x9f) ||
				(c == 0xed && utf8[1] > 0x9f) ||
				(c == 0xef && utf8[1] == 0xbf && utf8[2] > 0xbd)
			)
				return dchar.init;

			c &= 0x0f;
			c = (c << 6) | (utf8[1] & 0x3f);
			c = (c << 6) | (utf8[2] & 0x3f);

		} else {
			if (
				invalidContinuationByte(utf8[1]) || invalidContinuationByte(utf8[2]) || invalidContinuationByte(utf8[3]) ||
				(c == 0xf0 && (utf8[1] | 0x8f) == 0x8f)
			)
				return dchar.init;

			c &= 0x07;
			c = (c << 6) | (utf8[1] & 0x3f);
			c = (c << 6) | (utf8[2] & 0x3f);
			c = (c << 6) | (utf8[3] & 0x3f);
		}

	return c;
}

private bool invalidContinuationByte(char c) { return !(c & 0x80); }

unittest {
	dchar toUtf32(char[] utf8) {
		// this check would be done by the getchar function
		if (UTF8_BYTES_NEEDED[utf8[0]] == 0)
			return dchar.init;

		// eat more bytes, it's what the getchar function would do
		// this is an invalid case, but we need to test the toUtf32 function's handling of it
		while (UTF8_BYTES_NEEDED[utf8[0]] > utf8.length)
			utf8 ~= 'a';

		return .toUtf32(utf8);
	}

	assert (toUtf32(x"42")                == 'B');
	assert (toUtf32(x"7f")                == 0x7f);
	assert (toUtf32(x"c3 84")             == 0xc4);
	assert (toUtf32(x"d7 90")             == 0x05d0);
	assert (toUtf32(x"df ba")             == 0x07fA);
	assert (toUtf32(x"e0 b4 80")          == 0x0d00);
	assert (toUtf32(x"e2 98 a4")          == 0x2624);
	assert (toUtf32(x"e4 89 82")          == 0x4242);
	assert (toUtf32(x"ef bf a0")          == 0xffe0);

	// Cheers to Markus Kuhn's UTF-8 decoder capability and stress test
	// http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
	for (char c = 0x80; c <= 0xbf; ++c)
		assert (toUtf32([c]) == dchar.init);

	for (char c = 0xc0; c <= 0xfd; ++c)
		assert (toUtf32([c]) == dchar.init);          // Utf.toUtf32 fails

	assert (toUtf32(x"c0 22")             == dchar.init);
	assert (toUtf32(x"e0 80 22")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"f0 80 80 22")       == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"f8 80 80 80 22")    == dchar.init);
	assert (toUtf32(x"fc 80 80 80 80 22") == dchar.init);
	assert (toUtf32(x"df 22")             == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ef bf 22")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"f7 bf bf 22")       == dchar.init);
	assert (toUtf32(x"fb bf bf bf 22")    == dchar.init);
	assert (toUtf32(x"fd bf bf bf bf 22") == dchar.init);

	assert (toUtf32(x"fe")                == dchar.init);
	assert (toUtf32(x"ff")                == dchar.init);
	assert (toUtf32(x"fe fe ff ff")       == dchar.init);

	assert (toUtf32(x"c0 af")             == dchar.init);
	assert (toUtf32(x"e0 80 af")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"f0 80 80 af")       == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"f8 80 80 80 af")    == dchar.init);
	assert (toUtf32(x"fc 80 80 80 80 af") == dchar.init);

	assert (toUtf32(x"c1 bf")             == dchar.init);
	assert (toUtf32(x"e0 9f bf")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"f0 8f bf bf")       == dchar.init);
	assert (toUtf32(x"f8 87 bf bf bf")    == dchar.init);
	assert (toUtf32(x"fc 83 bf bf bf bf") == dchar.init);

	assert (toUtf32(x"c0 80")             == dchar.init);
	assert (toUtf32(x"e0 80 80")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"f0 80 80 80")       == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"f8 80 80 80 80")    == dchar.init);
	assert (toUtf32(x"fc 80 80 80 80 80") == dchar.init);

	assert (toUtf32(x"ed a0 80")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed ad bf")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed ae 80")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed af bf")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed b0 80")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed be 80")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed bf bf")          == dchar.init); // Utf.toUtf32 fails

	assert (toUtf32(x"ed a0 80 ed b0 80") == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed a0 80 ed bf bf") == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed ad bf ed b0 80") == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed ad bf ed bf bf") == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed ae 80 ed b0 80") == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed ae 80 ed bf bf") == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed af bf ed b0 80") == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ed af bf ed bf bf") == dchar.init); // Utf.toUtf32 fails

	assert (toUtf32(x"ef bf be")          == dchar.init); // Utf.toUtf32 fails
	assert (toUtf32(x"ef bf bf")          == dchar.init);
}

My testing was just this:

void main() {
	for (;;) {
		auto d = get(Cin.buffer);
		
		if (
			d == 0x1a || // DOS EOF, Ctrl-Z
			d == dchar.init // file EOF or invalid or incomplete UTF-8 sequence
		)
			break;
		
		Stdout.formatln("Got {0:X2}", cast(uint)d);
	}
}

// replaced the end of the get() function with the following

debug {
        char[] foo = "Raw data got: [";
        foreach (b; segment)
                foo ~= Formatter("{0:X} ", cast(ubyte)b);
        foo[$-1] = ']';
        Stdout(foo).newline;

        return toUtf32(segment);
} else {
        dchar[1] tmp;
        Utf.toUtf32(segment, tmp);
        return tmp[0];
}

It might be worth changing the second return in get().scan() to throw something, so that one can differentiate between EOF and invalid UTF-8 without examining the buffer manually.
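Something like this, inside scan() - assuming the IOException from tango.core.Exception takes a message string, as the other Tango exceptions in this thread appear to:

if (!i) // invalid lead byte: complain instead of masquerading as EOF
        throw new IOException ("invalid UTF-8 in input");

if (data.length < i) // genuinely out of data
        return IConduit.Eof;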

Posted: 02/10/07 16:35:50

Hey, this is great!

Thanks for all the effort you're putting in :)

Posted: 02/10/07 19:48:19

Not a problem. It's functionality I need and relatively easy to implement, so I might as well: it gets me what I want most quickly. <g> Even better if it's in the standard library (or "the better standard" library, whichever Tango is to be).

BTW, regarding UTF-8, tango.text.convert.Utf is too lax in what it allows: I suggest having a look at Markus Kuhn's UTF-8 decoder capability and stress test as a starting point. When everything therein works, make the few changes needed to support RFC 3629 - disallow bytes 0xF5 to 0xFD (0xFE and 0xFF should already be disallowed) since they are lead bytes of overly large codepoints.

I haven't looked at UTF-16 (IMHO a horrid encoding which should die a swift death) too much, so I don't know if there are similar concerns.

Another thought about the Utf module: some functions capable of Unicode normalization would be really handy. In the same vein, Unicode equivalence: the codepoint sequence \u0041\u0308 ('LATIN CAPITAL LETTER A' followed by 'COMBINING DIAERESIS', UTF-8 0x41 0xcc 0x88) is equivalent to \u00c4 ('LATIN CAPITAL LETTER A WITH DIAERESIS', UTF-8 0xc3 0x84).
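To make that concrete as D literals:

// the two equivalent forms from the paragraph above
dchar[] decomposed  = "\u0041\u0308"d; // 'A' followed by COMBINING DIAERESIS
dchar[] precomposed = "\u00c4"d;       // 'LATIN CAPITAL LETTER A WITH DIAERESIS'

// they render identically, yet compare unequal without normalization
assert (decomposed != precomposed);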

It'd probably also be worth documenting somewhere what version of Unicode is supported by such (and any other Unicode-aware) functions.

Posted: 02/11/07 18:28:07

Since I'm so anal about standards conformance, I went and took a look at the official converters of Unicode, Inc. and the conformance chapter in The Unicode Standard, Version 4.0.0 and made the necessary changes to my toUtf32 function. While I can't guarantee Unicode 5.0.0 support since I don't have a copy of the (non-free) standard, AFAIK 5.0.0 didn't change anything regarding the various encoding forms. And even if it did, I would expect that to be reflected in Unicode, Inc.'s ConvertUTF.c.

While I was at it, I added some comments to the function and the unit tests. Here's the new code for anyone interested (and for Tango, if you're interested):

// Cheers to Kris for the basic bit math
// Cheers to The Unicode Standard, Version 4.0.0, Chapter 3.9, Table 3-6 for the validity checks
// (Available at http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf)
private dchar toUtf32(char[] utf8)
in {
	assert (UTF8_BYTES_NEEDED[utf8[0]] != 0);
} body {
	dchar c = cast(dchar)utf8[0];

	if (c & 0x80) {
		// not ASCII
		if (c < 0xe0) {
			// two-byte sequence

			if (invalidContinuationByte(utf8[1]))
				return dchar.init;

			c &= 0x1f;
			c = (c << 6) | (utf8[1] & 0x3f);

		} else if (c < 0xf0) {
			// three-byte sequence

			if (
				invalidContinuationByte(utf8[1]) || invalidContinuationByte(utf8[2]) ||
				(c == 0xe0 && utf8[1] < 0xa0) ||
				(c == 0xed && utf8[1] > 0x9f)
			)
				return dchar.init;

			c &= 0x0f;
			c = (c << 6) | (utf8[1] & 0x3f);
			c = (c << 6) | (utf8[2] & 0x3f);

		} else {
			// four-byte sequence

			if (
				invalidContinuationByte(utf8[1]) || invalidContinuationByte(utf8[2]) || invalidContinuationByte(utf8[3]) ||
				(c == 0xf0 && utf8[1] < 0x90) ||
				(c == 0xf4 && utf8[1] > 0x8f)
			)
				return dchar.init;

			c &= 0x07;
			c = (c << 6) | (utf8[1] & 0x3f);
			c = (c << 6) | (utf8[2] & 0x3f);
			c = (c << 6) | (utf8[3] & 0x3f);
		}
	}

	return c;
}

// bytes beyond the first must be in the range [0x80, 0xbf]
private bool invalidContinuationByte(char c) { return (c & 0xc0) != 0x80; }

unittest {
	dchar toUtf32(char[] utf8) {
		// this check would be done by the getchar function
		if (UTF8_BYTES_NEEDED[utf8[0]] == 0)
			return dchar.init;

		// eat more bytes, it's what the getchar function would do
		// this is an invalid case, but we need to test the toUtf32 function's handling of it
		while (UTF8_BYTES_NEEDED[utf8[0]] > utf8.length)
			utf8 ~= 'a';

		return .toUtf32(utf8);
	}

	// just some tests
	assert (toUtf32(x"42")          == 'B');
	assert (toUtf32(x"7f")          == 0x7f);
	assert (toUtf32(x"c3 84")       == 0xc4);
	assert (toUtf32(x"d7 90")       == 0x05d0);
	assert (toUtf32(x"df ba")       == 0x07fA);
	assert (toUtf32(x"e0 b4 80")    == 0x0d00);
	assert (toUtf32(x"e2 98 a4")    == 0x2624);
	assert (toUtf32(x"e4 89 82")    == 0x4242);
	assert (toUtf32(x"ef bf a0")    == 0xffe0);
	assert (toUtf32(x"a9")          == dchar.init);

	// examples from The Unicode Standard, Version 4.0.0, chapter 3.9
	assert (toUtf32(x"4d")          == 0x4d);
	assert (toUtf32(x"d0 b0")       == 0x430);
	assert (toUtf32(x"e4 ba 8c")    == 0x4e8c);
	assert (toUtf32(x"f0 90 8c 82") == 0x10302);
	assert (toUtf32(x"c0 af")       == dchar.init);
	assert (toUtf32(x"e0 9f 80")    == dchar.init);
	assert (toUtf32(x"f4 80 83 92") == 0x1000d2);

	// chapter 3.10
	assert (toUtf32(x"ef bb bf")    == 0xfeff); // Byte Order Mark

	// Cheers to Markus Kuhn's UTF-8 decoder capability and stress test
	// http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

	// first possible sequence of some length
	assert (toUtf32(x"00")          == 0);
	assert (toUtf32(x"c2 80")       == 0x80);
	assert (toUtf32(x"e0 a0 80")    == 0x800);
	assert (toUtf32(x"f0 90 80 80") == 0x10000);

	// last possible sequence of some length
	assert (toUtf32(x"7f")          == 0x7f);
	assert (toUtf32(x"df bf")       == 0x7ff);
	assert (toUtf32(x"ef bf bf")    == 0xffff);

	// stress test is old: U+1fffff is no longer valid, test U+10ffff instead
	assert (toUtf32(x"f4 8f bf bf") == 0x10ffff);

	// continuation bytes
	for (char c = 0x80; c <= 0xbf; ++c)
		assert (toUtf32([c]) == dchar.init);

	// note that in all of the below, anything to do with 5 or 6-byte sequences
	// should fail anyway, since the standard disallows them since version 3.0.0

	// lonely first bytes of greater than 1-byte sequences
	for (char c = 0xc0; c <= 0xfd; ++c)
		assert (toUtf32([c]) == dchar.init);

	// lowest sequences with last continuation byte missing
	assert (toUtf32(x"c0 22")             == dchar.init);
	assert (toUtf32(x"e0 80 22")          == dchar.init);
	assert (toUtf32(x"f0 80 80 22")       == dchar.init);
	assert (toUtf32(x"f8 80 80 80 22")    == dchar.init);
	assert (toUtf32(x"fc 80 80 80 80 22") == dchar.init);

	// highest sequences with last continuation byte missing
	assert (toUtf32(x"df 22")             == dchar.init);
	assert (toUtf32(x"ef bf 22")          == dchar.init);
	assert (toUtf32(x"f7 bf bf 22")       == dchar.init);
	assert (toUtf32(x"fb bf bf bf 22")    == dchar.init);
	assert (toUtf32(x"fd bf bf bf bf 22") == dchar.init);

	// 0xfe and 0xff cannot appear in UTF-8
	assert (toUtf32(x"fe")                == dchar.init);
	assert (toUtf32(x"ff")                == dchar.init);
	assert (toUtf32(x"fe fe ff ff")       == dchar.init);

	// overlong representations of 0x2f
	assert (toUtf32(x"c0 af")             == dchar.init);
	assert (toUtf32(x"e0 80 af")          == dchar.init);
	assert (toUtf32(x"f0 80 80 af")       == dchar.init);
	assert (toUtf32(x"f8 80 80 80 af")    == dchar.init);
	assert (toUtf32(x"fc 80 80 80 80 af") == dchar.init);

	// highest overlong representations
	assert (toUtf32(x"c1 bf")             == dchar.init);
	assert (toUtf32(x"e0 9f bf")          == dchar.init);
	assert (toUtf32(x"f0 8f bf bf")       == dchar.init);
	assert (toUtf32(x"f8 87 bf bf bf")    == dchar.init);
	assert (toUtf32(x"fc 83 bf bf bf bf") == dchar.init);

	// overlong representations of 0x00
	assert (toUtf32(x"c0 80")             == dchar.init);
	assert (toUtf32(x"e0 80 80")          == dchar.init);
	assert (toUtf32(x"f0 80 80 80")       == dchar.init);
	assert (toUtf32(x"f8 80 80 80 80")    == dchar.init);
	assert (toUtf32(x"fc 80 80 80 80 80") == dchar.init);

	// single UTF-16 surrogates
	assert (toUtf32(x"ed a0 80")          == dchar.init);
	assert (toUtf32(x"ed ad bf")          == dchar.init);
	assert (toUtf32(x"ed ae 80")          == dchar.init);
	assert (toUtf32(x"ed af bf")          == dchar.init);
	assert (toUtf32(x"ed b0 80")          == dchar.init);
	assert (toUtf32(x"ed be 80")          == dchar.init);
	assert (toUtf32(x"ed bf bf")          == dchar.init);

	// paired UTF-16 surrogates
	assert (toUtf32(x"ed a0 80 ed b0 80") == dchar.init);
	assert (toUtf32(x"ed a0 80 ed bf bf") == dchar.init);
	assert (toUtf32(x"ed ad bf ed b0 80") == dchar.init);
	assert (toUtf32(x"ed ad bf ed bf bf") == dchar.init);
	assert (toUtf32(x"ed ae 80 ed b0 80") == dchar.init);
	assert (toUtf32(x"ed ae 80 ed bf bf") == dchar.init);
	assert (toUtf32(x"ed af bf ed b0 80") == dchar.init);
	assert (toUtf32(x"ed af bf ed bf bf") == dchar.init);
}

Posted: 02/12/07 11:27:46

Baaaa!!

I just realised that I can't use dchar.init to indicate an error, since it's 0xffff and not 0xffffffff like I thought it was. I can't use dchar.max either, since that's 0x10ffff (the max defined Unicode code point) and not 0xffffffff (the max the type can hold). dchar.min is, of course, 0, so that doesn't help either.
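That is:

assert (dchar.init == 0xffff);   // a legal, though permanently unassigned, code point
assert (dchar.max  == 0x10ffff); // the highest Unicode code point, not uint.max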

The problem with dchar.init is that it's a legal UTF-32 code point, it's just guaranteed to always be unassigned to a character and thus should never be used in interchange. However, I expect somebody might use it as a control character in interprocess (or intraprocess) communication, and it's not correct to return the same value for the valid sequence UTF-8 0xef 0xbb 0xbf as well as every invalid UTF-8 sequence.

The only smart error value I can think of is cast(dchar)-1, which should probably be made a const called ERROR or something, or else an exception should be thrown.
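The const version might look like:

// hypothetical name, as suggested above
const dchar ERROR = cast(dchar) -1; // 0xffffffff: not a valid code point, so unambiguous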

For the CharStream version, this would change "private dchar pushed;" into "private dchar pushed = cast(dchar)-1;" (and thus get and unget would have to change to match; I'd change unget's assertion into the form "assert (pushed == cast(dchar)-1 && c <= dchar.max);"). Alternatively, keep a boolean value of whether something has been ungot or not. Ideally, use the arbitrary-sized unget buffer and just check the length.

For the buffer.next() version the "if (!buffer.next(&scan))" case would have to return a smarter error value, or throw a specific kind of exception. Since Utf.toUtf32 throws an error if it fails, I presume that's what my one-dchar function should do, too.

I might (re)write all the toUtf* functions some time this week to make sure they're 100% conforming. I'll leave it to you to optimize them, if you want them in Tango. <g>

Posted: 02/12/07 19:42:32

Hey, this is all great stuff :)

Posted: 02/17/07 14:54:55

Sorry, I got lazy. I did test extensively, but I don't think I'll be writing the converters any time soon, since currently I only need my toUtf32 function. Filed a ticket instead: http://www.dsource.org/projects/tango/ticket/282