
Utf String Iterator


Posted: 07/06/08 08:18:30 Modified: 07/06/08 08:22:08

I see many code samples that use the following iteration over a string:

for (int i = 0; i < str.length; ++i) {
   Stdout(str[i]); // just an example
}

This is wrong, because char[] holds UTF-8 code units and a single character may occupy several of them, so str[i] can be just a fragment of a character.
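To make the problem concrete, here is a minimal sketch. It assumes a D2 compiler (so string rather than char[]), and countCharacters is a name I made up for illustration:

```d
// Counts whole characters (code points) rather than UTF-8 code units.
// Hypothetical helper, not part of Tango or Phobos.
size_t countCharacters(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)   // foreach with dchar decodes one character per step
        ++n;
    return n;
}

unittest
{
    string s = "a\u00F1b";            // U+00F1 takes two UTF-8 code units
    assert(s.length == 4);            // .length counts code units
    assert(countCharacters(s) == 3);  // but there are only three characters
}
```

An index-based loop over this string visits four positions, and two of them are halves of the same character.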

foreach (dchar c; str) {} works great, but

a) it doesn't allow iteration over two strings in parallel.

b) it doesn't let you keep track of the current iterator position. That is, you can't tell

  • the position (offset) of the current character c in the source string
  • how many code units it occupies in the source UTF string (1, 2, 3 or 4?)
  • how many characters are left in the string.

c) once you break out of the loop, you can't resume where you stopped.

As an example, I will use a simple test case: given two strings, cut off the first N characters that match in both strings while preserving UTF correctness, e.g. "Hello, there" and "Hello, World!" -> "there" and "World!".

Here is my solution:

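// Note: decode(s, idx) is assumed to decode the character that starts at
// s[idx] and to advance idx past it (the way std.utf.decode behaves).
struct UtfStringIterator(CharType = char)
{
    public CharType[] str;      // the string
    public size_t offset;       // offset of the current character
    public size_t nextOffset;   // offset of the next character
    public dchar value;         // current character;
                                // its length is nextOffset - offset

    static UtfStringIterator opCall(CharType[] str) {
        UtfStringIterator it;
        it.str = str;
        it.offset = 0;
        it.nextOffset = 0;
        // Decode the first character only if there is one; an empty
        // string gives an iterator that is invalid from the start.
        if (str.length > 0) {
            it.value = decode(str, it.nextOffset);
        }

        return it;
    }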

    bool isValid() {
        return this.offset < str.length;
    }

    void moveNext() {
        offset = nextOffset;
        if (isValid()) {
            value = decode(str, nextOffset);
        }
    }

    int opApply(int delegate(ref dchar d) dg) {
        while (isValid) {
            int result = dg(value);
            if (result != 0) {
                return result;
            }
            moveNext();
        }
        return 0;
    }
}

unittest {
    // A template with only default arguments still has to be
    // instantiated explicitly with !().
    auto iter = UtfStringIterator!()("Hello, World!");
    while (iter.isValid) {
        Stdout(iter.value);
        iter.moveNext();
    }

    auto iter2 = UtfStringIterator!(wchar)("Hello, World!"w);
    foreach (dchar c; iter2) {
        Stdout(c);
    }
}
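For completeness, here is a rough sketch of the test case itself, written without the iterator struct so it can be compiled on its own. It assumes a D2 compiler and Phobos' std.utf.decode (not a Tango function), and stripCommonPrefix is a hypothetical name. The two offsets play the role of the offset/nextOffset pair above:

```d
import std.utf : decode;   // Phobos (D2): decodes at an index and advances it

// Removes the longest common prefix of a and b, comparing whole
// characters so that no multibyte sequence is ever cut in half.
// Hypothetical helper for illustration only.
void stripCommonPrefix(ref string a, ref string b)
{
    size_t offA = 0, offB = 0;           // start of the current character
    while (offA < a.length && offB < b.length)
    {
        size_t nextA = offA, nextB = offB;
        // Decode one character from each string in parallel.
        if (decode(a, nextA) != decode(b, nextB))
            break;                       // first mismatch: stop before it
        offA = nextA;
        offB = nextB;
    }
    a = a[offA .. $];
    b = b[offB .. $];
}

unittest
{
    string a = "Hello, there", b = "Hello, World!";
    stripCommonPrefix(a, b);
    assert(a == "there" && b == "World!");
}
```

Comparing byte by byte would go wrong for a pair like "\u00E9" and "\u00E8": both start with the same lead byte, so a byte-wise cut would leave two invalid strings, while the character-wise version leaves both untouched.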

I would be happy to see something like this in Tango.


Posted: 07/06/08 12:14:08

If you look at tango.text.stream.* and make it fit in there, and then post it to a wishlist ticket, we can discuss it :)

At first glance it looks like nice work, though :)