Getting a dchar from a Buffer containing UTF-8
Posted: 02/02/07 20:25:25

A way of getting a dchar from a UTF-8 Buffer like Cin.buffer would be handy. This would mean that characters are read until a complete dchar can be constructed. I figured the following might work:
    auto getter = new Reader(Cin.buffer);
    dchar c;
    getter(c);

But giving stdin a single 'a', for instance, causes it to wait for more characters. After a few more 'a's, what we get is "Unicode.toUtf8 : invalid dchar". Reading a UTF-32-encoded file would presumably work, but ASCII, at least, fails.
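One plausible explanation for both symptoms, sketched under the assumption that Reader fills a dchar by copying dchar.sizeof (four) raw bytes from the buffer instead of decoding UTF-8: a single 'a' is only one of the four bytes it wants, and four 'a's reinterpreted as a 32-bit value land far above U+10FFFF. The helper names below are hypothetical and the code uses only the core language and std.stdio from a current D compiler, not Tango.

```d
import std.stdio : writefln;

// Hypothetical helper: reinterpret four raw bytes as one 32-bit value.
// Little-endian order is assumed; for "aaaa" the byte order makes no difference.
uint packLittleEndian(const(ubyte)[] bytes)
{
    assert(bytes.length == 4);
    return bytes[0] | (bytes[1] << 8) | (bytes[2] << 16) | (cast(uint) bytes[3] << 24);
}

// A value is a Unicode scalar value if it is at most U+10FFFF and not a surrogate.
bool isValidCodePoint(uint value)
{
    return value <= 0x10FFFF && !(value >= 0xD800 && value <= 0xDFFF);
}

void main()
{
    auto raw = cast(const(ubyte)[]) "aaaa";
    uint value = packLittleEndian(raw);
    writefln("0x%08X valid: %s", value, isValidCodePoint(value));
}
```

If that assumption holds, the Reader is behaving as a binary reader, and UTF-8 input has to be decoded one byte at a time.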
Is there a built-in way of doing this, or do I need to read char by char and decode manually, like with Phobos? As an example, this is how I'd do it that way with my current, meager knowledge of Tango:
    import tango.core.Exception;
    import tango.io.Stdout;
    import tango.io.Console : Cin;
    import tango.io.protocol.Reader;
    import tango.text.convert.Utf : toUtf32;

    void main() {
        auto getter = new Reader(Cin.buffer);
        dchar ch;

        for (;;) {
            try
                ch = readchar(getter);
            catch (IOException) {
                break;
            } // any smarter way of catching EOF?

            if (ch == 26)
                break; // any smarter way of catching DOS/Windows Ctrl-Z on the console?

            Stdout(ch);
        }
        Stdout();
    }

    dchar readchar(Reader s) {
        // Sequence length indexed by lead byte; 0 marks a byte that cannot start a sequence.
        const ubyte[256] UTF8_BYTES_NEEDED = [
            // 0x00..0x7F: ASCII
            1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
            1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
            1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
            1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
            1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
            1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
            1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
            1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
            // 0x80..0xBF: continuation bytes
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
            // 0xC0..0xDF: two-byte leads (0xC0 and 0xC1 are overlong, hence invalid)
            0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
            2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
            // 0xE0..0xEF: three-byte leads
            3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
            // 0xF0..0xF4: four-byte leads; 0xF5..0xFF are invalid
            4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0
        ];

        char c;
        s(c);

        // ASCII
        if (c <= 127)
            return c;
        else {
            // UTF-8
            char[4] str;
            str[0] = c;

            ubyte n = UTF8_BYTES_NEEDED[c];
            if (!n)
                throw new Exception("Invalid UTF-8 in input.");

            for (size_t i = 1; i < n; ++i)
                s(str[i]);

            return toUtf32(str[0..n])[0];
        }
    }

UTF8_BYTES_NEEDED is an RFC 3629 compliant version of Phobos's std.utf.UTF8stride.
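For comparison, the same RFC 3629 lead-byte classification can be written as range checks instead of a 256-entry table. This is only a sketch: utf8BytesNeeded and decodeFirst are hypothetical names, the code relies on the core language of a current D compiler and not on Tango or Phobos, and it works on a byte slice in memory, not on a Reader. It rejects invalid lead bytes and malformed continuation bytes, but does not check for overlong three- and four-byte forms or for surrogates.

```d
// Hypothetical helper: length of the UTF-8 sequence started by `lead`,
// or 0 if `lead` cannot start a sequence under RFC 3629.
ubyte utf8BytesNeeded(ubyte lead)
{
    if (lead < 0x80) return 1; // ASCII
    if (lead < 0xC2) return 0; // continuation bytes, plus overlong leads 0xC0 and 0xC1
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead < 0xF5) return 4;
    return 0;                  // 0xF5..0xFF would encode values beyond U+10FFFF
}

// Hypothetical helper: decode the first code point stored in `bytes`.
dchar decodeFirst(const(ubyte)[] bytes)
{
    if (bytes.length == 0)
        throw new Exception("Empty input.");

    immutable n = utf8BytesNeeded(bytes[0]);
    if (n == 0 || bytes.length < n)
        throw new Exception("Invalid or truncated UTF-8 in input.");
    if (n == 1)
        return cast(dchar) bytes[0];

    // Keep the payload bits of the lead byte, then append 6 bits per continuation byte.
    uint cp = bytes[0] & (0xFF >> (n + 1));
    foreach (b; bytes[1 .. n])
    {
        if ((b & 0xC0) != 0x80)
            throw new Exception("Invalid UTF-8 continuation byte.");
        cp = (cp << 6) | (b & 0x3F);
    }
    return cast(dchar) cp;
}

void main()
{
    ubyte[] euro = [0xE2, 0x82, 0xAC]; // UTF-8 encoding of U+20AC
    assert(decodeFirst(euro) == 0x20AC);
}
```

The table lookup is a single index operation per character, while the range checks trade that for a handful of comparisons and no 256-byte constant; either way the classification of each lead byte is the same.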