Textual Encoding
kris



Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Mon Dec 19, 2005 1:51 pm

teqdruid wrote:
Ya know what's really annoying? The shortcut for cutting text in Emacs (my favored editor) is ctrl-w. Guess what that same shortcut does in Firefox? If you guessed that it closes the current tab, you're right.

I'm all too familiar with that annoyance :'(

teqdruid wrote:
I had a rather long argument typed up that essentially made the case that in many circumstances Utf conversions can be avoided altogether (or reduced to one) with the UniversalString design, whereas the same application took 19 conversions with the current method: using wchar[] for everything in the XML parser.

I'll re-write it later, but the trick is a class called ConstantString which would be a substitute for hard-coded strings. It holds all three encodings of the same string, transcoded at compile time.

I've thought about that aspect too. One might have a struct with char[], wchar[] and dchar[] attributes, initialized as Triplet foo = {"foo", "foo", "foo"};

Yet, the use for such things is surely somewhat application-specific? To continue playing devil's advocate, I'd much prefer to have a known encoding and then utilize a hashMap or string-switch for pattern matching. In other words, I wonder whether the use-cases where the above Triplet and/or UniversalString would prove to be an attractive design option are quite limited?


Last edited by kris on Mon Dec 19, 2005 2:10 pm; edited 1 time in total
teqdruid



Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Mon Dec 19, 2005 2:08 pm

kris wrote:
Yet, the use for such things is surely somewhat application-specific? To continue playing devil's advocate, I'd much prefer to have a known encoding and then utilize a hashMap or string-switch for pattern matching. In other words, I wonder whether the use-cases where the above Triplet and/or UniversalString would prove to be an attractive design option are quite limited?


You're assuming that decision structures like switch-case or a HashMap are totally incompatible with a UniversalString. I'm not certain that they are. Obviously your typical HashMap wouldn't work right off the bat... Let me ponder a bit.

~John
pragma



Joined: 28 May 2004
Posts: 607
Location: Washington, DC

Posted: Mon Dec 19, 2005 2:53 pm

(following Kris' invitation)

I've mulled it over. I like Kris' class hierarchy for the String types, but I'm not 100% sold on the necessity for so many levels in the tree. What is the use case for UtfString or Slice... or is this just being prepared for the unforeseen?

As for the Universal type, I think it would be happiest if it could play fair with such a tree. The problem is that you would have to adapt it to each class in the tree, or be satisfied with just a single UniversalString class that extends MutableString.

To work around that, you'd have to go to a common abstract base or interface and implement both standard and universal types from that. Templating also becomes a bit more transparent as a consequence, but you do take on some bloat in the base class (format routines for each char type), and you lose Kris' elegant hierarchy -- most everything would code against that one single base string interface.

Code:
interface IString {}                              // or use 'abstract class' if you prefer ;)
interface IMutableString : IString {}
class String(T) : IString {}                      // one instance per char type
class MutableString(T) : IMutableString {}
class UniversalString : IString {}                // encoding hidden, so no type parameter
class MutableUniversalString : IMutableString {}


The two types of strings are likely to have radically different internals, so there's no advantage for one to inherit from the other. You could try adapting this to Kris' proposed hierarchy above, but you're likely to run into covariance problems with some strategies.

I'll also add that this strategy opens the door for other types. Say, if you wanted a Mango conduit to masquerade as a string, or wrap parts of ICU, you could easily do that with this tree.
_________________
-- !Eric.t.Anderton at gmail
pragma



Joined: 28 May 2004
Posts: 607
Location: Washington, DC

Posted: Mon Dec 19, 2005 3:11 pm

teqdruid wrote:
kris wrote:
Yet, the use for such things is surely somewhat application-specific? To continue playing devil's advocate, I'd much prefer to have a known encoding and then utilize a hashMap or string-switch for pattern matching. In other words, I wonder whether the use-cases where the above Triplet and/or UniversalString would prove to be an attractive design option are quite limited?


You're assuming that decision structures like switch-case or a HashMap are totally incompatible with a UniversalString. I'm not certain that they are. Obviously your typical HashMap wouldn't work right off the bat... Let me ponder a bit.

~John


Shooting from the hip: Given a hashmap with a string hash function:
Code:
char[] a = "Hello world";
dchar[] b = "Hello world";

assert(hash(a) == hash(b));

does the assert fail or not, and is that desirable behavior? As I know nothing about transcoding, I can't make any assumptions. This looks pretty bad. :(
_________________
-- !Eric.t.Anderton at gmail
teqdruid



Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Mon Dec 19, 2005 3:25 pm

pragma wrote:
As for the Universal type, I think it would be happiest if it could play fair with such a tree. The problem is that you would have to adapt it to each class in the tree, or be satisfied with just a single UniversalString class that extends MutableString.


What I'm currently working on fits it pretty well:
Code:
abstract class String {}
abstract class MutableString : String {}

abstract class Utf8String : String {}
abstract class MutableUtf8String : MutableString {}

abstract class Utf16String : String {}
abstract class MutableUtf16String : MutableString {}

abstract class Utf32String : String {}
abstract class MutableUtf32String : MutableString {}


and Kris' two templated classes have been converted to mixin templates, so the UtfXString and MutableUtfXString classes are implemented with two mixins. It's working pretty well so far; I just haven't had as much time to work on it as I'd like. Expect something working to be checked in tonight.
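
For readers unfamiliar with the approach, here is a minimal sketch of how two mixins can supply a shared implementation to classes that cannot inherit from each other. It is written against present-day D rather than the 2005 compiler, and the names Utf8Storage, Utf8Mutation, length and append are assumptions made for illustration; they are not taken from the Mango sources.

```d
// Hypothetical base classes, echoing the hierarchy listed above.
abstract class String
{
    abstract size_t length() const;
}

abstract class MutableString : String
{
    abstract void append(const(char)[] s);
}

// Storage plus the read-only operations (names are illustrative).
mixin template Utf8Storage()
{
    private char[] data;
    override size_t length() const { return data.length; }
}

// Mutating operations; 'data' is looked up where the mixin is instantiated.
mixin template Utf8Mutation()
{
    override void append(const(char)[] s) { data ~= s; }
}

final class Utf8String : String
{
    mixin Utf8Storage;
    this(const(char)[] s) { data = s.dup; }
}

final class MutableUtf8String : MutableString
{
    mixin Utf8Storage;
    mixin Utf8Mutation;
    this(const(char)[] s) { data = s.dup; }
}
```

In this layout MutableUtf8String converts implicitly to String (through MutableString) but not to Utf8String, since the two only share mixed-in code rather than a common Utf8 base.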

~John
teqdruid



Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Mon Dec 19, 2005 3:36 pm

pragma wrote:

Shooting from the hip: Given a hashmap with a string hash function:
Code:
char[] a = "Hello world";
dchar[] b = "Hello world";

assert(hash(a) == hash(b));

does the assert fail or not, and is that desirable behavior? As I know nothing about transcoding, I can't make any assumptions. This looks pretty bad. :(


For the UniversalString hash function to work as expected in a HashMap, we need that assert to pass. I'm pretty sure the current hash function won't pass that.... I wonder if there's a Unicode hash algorithm that would pass that assertion?
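
One possible answer, sketched under assumptions: hash the decoded code points rather than the raw code units, so the result does not depend on how the text is stored. The function name codePointHash and the choice of 32-bit FNV-1a are illustrative only; this is present-day D, not the hash Mango used at the time.

```d
// Encoding-independent hash: FNV-1a fed with code points instead of raw bytes.
// 'foreach (dchar c; s)' decodes UTF-8 and UTF-16 on the fly, so no
// transcoded copy of the string is built.
uint codePointHash(S)(S s)
{
    uint h = 2166136261u;                 // FNV-1a 32-bit offset basis
    foreach (dchar c; s)
    {
        uint v = c;
        // Feed the four bytes of the code point, lowest byte first.
        for (int shift = 0; shift < 32; shift += 8)
        {
            h ^= (v >> shift) & 0xFF;
            h *= 16777619u;               // FNV-1a 32-bit prime
        }
    }
    return h;
}
```

With a function like this, the assert in the quoted snippet would pass for valid input, at the price of decoding the string every time it is hashed.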

~John
sean



Joined: 24 Jun 2004
Posts: 609
Location: Bay Area, CA

Posted: Mon Dec 19, 2005 4:09 pm

pragma wrote:
Shooting from the hip: Given a hashmap with a string hash function:
Code:
char[] a = "Hello world";
dchar[] b = "Hello world";

assert(hash(a) == hash(b));

does the assert fail or not, and is that desirable behavior? As I know nothing about transcoding, I can't make any assumptions. This looks pretty bad. :(

The thing that confused me about all this encoding business is that, from what little I understand, multiple UTF-8 sequences might map to the same UTF-32 sequence. This is what initially inspired me to do all the coding for readf using dchars, as it didn't seem feasible to expect accurate pattern matching with UTF-8 strings in all cases. Can someone with a bit more Unicode experience verify this? And assuming I'm correct, doesn't this imply that the hash function might necessarily require a UTF conversion? Or might distinguishing between "equivalent" sequences actually be relevant for some applications?
kris



Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Mon Dec 19, 2005 4:46 pm

sean wrote:
pragma wrote:
Shooting from the hip: Given a hashmap with a string hash function:
Code:
char[] a = "Hello world";
dchar[] b = "Hello world";

assert(hash(a) == hash(b));

does the assert fail or not, and is that desirable behavior? As I know nothing about transcoding, I can't make any assumptions. This looks pretty bad. :(

The thing that confused me about all this encoding business is that, from what little I understand, multiple UTF-8 sequences might map to the same UTF-32 sequence. This is what initially inspired me to do all the coding for readf using dchars, as it didn't seem feasible to expect accurate pattern matching with UTF-8 strings in all cases. Can someone with a bit more Unicode experience verify this? And assuming I'm correct, doesn't this imply that the hash function might necessarily require a UTF conversion? Or might distinguishing between "equivalent" sequences actually be relevant for some applications?

Take a look here, Sean: http://icu.sourceforge.net/docs/papers/forms_of_unicode/#t1

While not an ideal description, the relevant info is in there ~ instances of utf encodings are always matchable and, more to your point, shortest form is mandated but not always enforced. The shortest form rule says each code point must be encoded in the fewest bytes possible, so a longer ("overlong") encoding of the same code point is illegal. Relatedly, utf8 now tops out at 4 bytes rather than 6: the 5 & 6 byte variations are effectively outlawed, and Utf8 decoders are 'encouraged' to enforce both rules.

The upshot, as I understand it, is that only an invalid utf8 string could exhibit the multiple-representation problem you note.
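
To make the "shortest form" rule concrete, here is a small sketch in present-day D. It assumes nothing from ICU or Mango; the helper names shortestLength and isOverlongTwoByte are made up for illustration, and only the two-byte case is checked.

```d
// Minimum number of UTF-8 bytes needed for a code point (0 if out of range).
size_t shortestLength(dchar c)
{
    if (c < 0x80)      return 1;
    if (c < 0x800)     return 2;
    if (c < 0x10000)   return 3;
    if (c <= 0x10FFFF) return 4;   // Unicode stops here, so 5 and 6 byte forms are never needed
    return 0;
}

// True when a two-byte sequence (110xxxxx 10yyyyyy) carries a value that
// would have fit in a single byte, i.e. it is not in shortest form.
bool isOverlongTwoByte(ubyte b0, ubyte b1)
{
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return false;              // not a well-formed two-byte pattern at all
    auto c = cast(dchar)(((b0 & 0x1F) << 6) | (b1 & 0x3F));
    return shortestLength(c) < 2;
}
```

For example, the bytes 0xC1 0x81 decode to U+0041 ('A'), which already fits in one byte, so a strict decoder rejects them. That is the "multiple representation" case, and it can only arise in invalid input.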
kris



Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Mon Dec 19, 2005 5:00 pm

pragma wrote:
but I'm not 100% sold on the necessity for so many levels in the tree. What is the use case for UtfString or Slice...

Oh, we're just playing with variations on a theme. UtfString exists there only because it's untyped, and thus can be viewed as a generic string. It effectively represents a poor man's UniversalString. Once you have a UtfString, you can use it to construct any of the higher-level abstractions, or just manipulate the transcoded content as is.

I think one of the more interesting aspects discussed here is whether or not a true UniversalString is actually a valuable tool? While it seems clear (at least to me) that the templated classes can have value, I'm becoming somewhat less convinced regarding a fully featured UniversalString ~ at least in the way we're currently describing its role :(
sean



Joined: 24 Jun 2004
Posts: 609
Location: Bay Area, CA

Posted: Mon Dec 19, 2005 5:11 pm

kris wrote:
The upshot, as I understand it, is that only an invalid utf8 string could exhibit the multiple-representation problem you note.

So as long as the user isn't feeding in bad data, all will work as expected. Seems reasonable :) I'll admit that the alternative had me wondering just how to do anything reliably with Unicode.
sean



Joined: 24 Jun 2004
Posts: 609
Location: Bay Area, CA

Posted: Mon Dec 19, 2005 5:53 pm

kris wrote:
I think one of the more interesting aspects discussed here is whether or not a true UniversalString is actually a valuable tool? While it seems clear (at least to me) that the templated classes can have value, I'm becoming somewhat less convinced regarding a fully featured UniversalString ~ at least in the way we're currently describing its role :(

The template classes provide a fairly useful set of basic string operations that behave in a straightforward manner. My concern with a non type-specific string class has always been that there is no way to avoid "hidden" costs, and that my requirements might vary from application to application (i.e. I may be willing to sacrifice size for speed in one instance but not in another). In most cases, I think it would be far more reasonable to settle on a specific internal encoding and to handle conversions once during IO. This is what I've always done for XML parsing via SAX--convert everything to UTF-8 right off the bat and go from there.
teqdruid



Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Mon Dec 19, 2005 6:25 pm

So I've just committed a new copy of the UniversalString.d. This one has MutableString, and it seems to be working pretty well. There's also some new stuff in the test file:
http://trac.dsource.org/projects/mango/browser/trunk/mango/test/universalstring.d?rev=680

You'll also notice that all of the methods in Kris' templated classes are available for the specific types: that is, if you want to use MutableUtf8String, then you can do everything that Kris' MutableString!(char) can do, and it'll implicitly convert to a generic String, which can do a helluva lot (see the link above). The only hierarchy problem with it is that MutableUtf8String doesn't inherit from Utf8String. This is due to D's lack of multiple inheritance. The solution to the problem is to make String and MutableString interfaces, but then MutableString can't be implicitly cast to String, since interfaces are not covariant with each other (because DMD's interface support sucks), and that limitation makes interfaces not an option.

I don't get how one can look at the test file and *not* see how useful the universal string stuff is. At the very least (aside from the slight inheritance problem noted above) I don't see any way that it would be a bad thing, since you can still work with specific encodings if you want.

Oh yeah, and the hash method doesn't work right, since two Strings of different encodings (but the same string) will compare equal via .opEquals(), but won't necessarily have the same hash. Other than always converting to a specific encoding, then running the hash, I'm not certain how to solve this one.

Thoughts?
~John
teqdruid



Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Mon Dec 19, 2005 6:54 pm

kris wrote:
Just to play devil's advocate for a moment, I'm seeing little real-world use for a true UniversalString, above and beyond what the toUtfXX() transcoding methods can offer. For example; if I were writing an XML parser, I'd choose a system-wide encoding and then make everything consistent by transcoding as necessary at the I/O boundary ~ UniversalString would have no place there, and life would be simple for everyone given a standard String type. On the other hand, if UniversalString were in vogue and I were a SAX client, or a DOM client, I'd convert the incoming strings to a known state before processing. That is, I would not be calling equals() or startsWith() to see which XML element or attribute tag I'd just been handed ~ I'd convert it first so I could look it up in a hashMap, or in a string-switch; that's notably more efficient, and the multi-type actually gets in the way. I'm usually a proponent of "pay as you go" designs, but I'm not seeing significant value of UniversalString as currently envisaged.


OK, here's the second go at the universal string defense... This time I'm writing it in emacs, then pasting it into firefox.

Currently, my DOM parser uses wchar[] for everything. Why? 'Cause it uses UText, which uses wchar. When my application receives an incoming XML-RPC document, it feeds it to the parser. The parser first converts the document to Utf16. Next, it parses the document and constructs the DOM. The DOM contains many UTexts (wchar[]'s for our purposes). My application, however, uses Utf8 for all communication, so while it's going through the DOM it ends up converting about half of the UTexts to Utf8. Each XML-RPC DOM has about a dozen UTexts, so we're already at about 7 Utf conversions. It gets worse from there. Once my application does its processing, it uses an XML encoder to respond. Since the encoder uses wchar[]'s as well, about a dozen conversions have to be done, one for each of my application's char[]s, and then the result gets converted back to char[] for final output. So, the "pick an encoding and go with it" style uses about 20 Utf conversions. This is, of course, a worst case scenario, but it's how it's happening now, and it will happen on occasion whenever an encoding has to be arbitrarily chosen at compile time.

Now, assume we have a class ConstantString which contains all three encodings of a particular string, all transcoded at compile time. Obviously, this class is able to compare itself to any other String very efficiently, since it never needs to make a Utf conversion.
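
A rough sketch of what such a ConstantString could look like in present-day D follows. Everything here is an assumption for illustration: it is a struct rather than a class, the matches overloads are invented names, and the three encodings come from three suffixed literals (which the compiler encodes at compile time) rather than from a real compile-time transcoder.

```d
// Hypothetical ConstantString: one field per encoding of the same text.
struct ConstantString
{
    string  utf8;
    wstring utf16;
    dstring utf32;

    // Each overload compares against the field with the same code-unit
    // type, so no transcoding takes place at run time.
    bool matches(const(char)[] s)  const { return s == utf8; }
    bool matches(const(wchar)[] s) const { return s == utf16; }
    bool matches(const(dchar)[] s) const { return s == utf32; }
}

// "methodCall" is only an example tag name; the w and d suffixes make the
// compiler emit the UTF-16 and UTF-32 forms of the literal.
immutable methodCall = ConstantString("methodCall", "methodCall"w, "methodCall"d);
```

A parser could then test an incoming tag with methodCall.matches(tag) whatever the tag's encoding, which is the zero-conversion comparison described above. The cost is storing every constant three times.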

If my parser used a universal string along with these ConstantStrings (to replace all of the hardcoded strings currently stored as wchar[]s) it would operate much more efficiently. First, my application receives the XML-RPC document (in *any* encoding) and makes a String. Next, the parser parses it using the ConstantStrings, still with no conversions. Once that's done, it feeds back a DOM with Strings. These strings will still be in the original encoding. My application can continue to use ConstantStrings to look at the strings. Only if my application requires that the output be in a specific encoding does a transcode need to happen. In this case, my application's output happens in the same encoding as the input (Utf8), so at no point are *any* transcodings being done. If, however, for some reason the request comes in as Utf16, but I still want to output Utf8, it's the same deal, but a conversion happens at the very end to make the output Utf8. Overall the score is about 20 to zero or one, where a lower score is better.

Are there ways for my application to operate more efficiently without a universal string? Certainly. If I were to change my parser to a templated class and use T[] instead of UText, then the client could choose the encoding, but only at compile time. I'm lucky in that my application knows it's gonna be receiving and sending in Utf8 so I can just choose that. Not all applications will be so lucky, and I'm pretty certain that the universal string will have fewer conversions on average for those unknown cases. Also, quite frankly, the universal string is a more dynamic, and cleaner solution. It makes it simpler to use different encodings: the programmer doesn't even have to worry about the encoding, unless they really want to. I think D itself is a very good example that things frequently work better and faster when the programmer doesn't micromanage (think: GC vs malloc/free) and this is just another level of logic that can and should be abstracted away from the programmer as much as possible.

I think with the recent commits I've made, I'm getting closer to showing that a universal string is technically viable. The only (big) problem left is that of indexes, but I'm pretty sure that this can be solved with some sort of pointer object to abstract away integer indexes. Assuming that universal string is technically viable, have I made a convincing argument that it would be more useful and efficient?
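
As a sketch of the "pointer object" idea, assuming present-day D and Phobos: a small cursor that steps by code point and keeps the code-unit offset to itself. The Cursor name and its members are hypothetical; std.utf.decode is the standard library call that reads one code point and advances the index.

```d
import std.utf : decode;

// Hypothetical position object: callers step by code point and never deal
// with the underlying code-unit offset, which differs between encodings.
struct Cursor(Char)
{
    const(Char)[] text;
    size_t offset;          // index in code units, an implementation detail

    bool atEnd() const { return offset >= text.length; }

    // Returns the code point at the cursor and moves past it.
    dchar next() { return decode(text, offset); }
}
```

The same calling code then walks a char[], wchar[] or dchar[] string identically; only the hidden offset advances by a different amount per step.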

Thoughts?
~John
kris



Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Mon Dec 19, 2005 7:55 pm

teqdruid wrote:
I think with the recent commits I've made, I'm getting closer to showing that a universal string is technically viable. The only (big) problem left is that of indexes, but I'm pretty sure that this can be solved with some sort of pointer object to abstract away integer indexes. Assuming that universal string is technically viable, have I made a convincing argument that it would be more useful and efficient?

I don't think there's ever been a question as to whether it's technically possible or not. For developers, the question is: why would I use something like that when there are potentially better options open to me? The use of "better" is in the eye of the beholder, but it's a perfectly valid perspective.

For the sake of argument, one could envisage a scenario where transcoding is performed at the I/O barrier only. Assuming the internal and external encodings match, there would be zero transcoding taking place. At very worst there would be two transcodings (input and output). One could argue that the end-result there is cleaner and more efficient. No? For me, UniversalString has to offer something really compelling before I'd use it ~ partly because it hides the transcoding barrier for one or two simple things :)
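
The boundary-only scenario might be sketched like this in present-day D, assuming UTF-8 as the internal encoding. The function names toInternal and toWireUtf16 are invented for the example; toUTF8 and toUTF16 are the Phobos std.utf functions.

```d
import std.utf : toUTF8, toUTF16;

// Input boundary: whatever arrives (UTF-8, UTF-16 or UTF-32) becomes UTF-8 once.
string toInternal(S)(S incoming)
{
    return toUTF8(incoming);
}

// Output boundary: transcode once more, and only if the peer wants UTF-16.
wstring toWireUtf16(string internal)
{
    return toUTF16(internal);
}
```

Everything between those two calls works on plain UTF-8, so the worst case is the two conversions mentioned above.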

There are often very good reasons for maintaining explicit control over the transcoding aspect. In those worlds, why should I use UniversalString? What benefit does it bring to the table that can't be achieved through other means? And, at what cost?

That's not to say what you've done with UniversalString ain't cool ~ it is ... as before, I'm having a hard time identifying any use-case where I'd actually use it :(
pragma



Joined: 28 May 2004
Posts: 607
Location: Washington, DC

Posted: Mon Dec 19, 2005 8:23 pm

I think John's argument is a good one. I'm leaning a bit closer to sticking with keeping universals to a special-use niche. But I'm still thinking about it. :? Here are my thoughts so far:

I think that a universal type does the job in a pinch, and is a great tool when you are between two technologies, which you cannot change, that don't use the same encoding. This is a *very* likely scenario should D gain in popularity, so there's no reason why you wouldn't have this in your toolbox. It's like working on your car only to find that bolt you've been merrily rounding off is metric, and that 1/2" socket doesn't quite cut it; so you instinctively reach for the adjustable wrench instead (and pray to god the bolt isn't torqued). ;)

(Or like the plumbing in my parents' house: it's old enough that the threads are all on the "wrong end" of the works, so you can't use any of the typical stuff at Home Depot. You have to get special parts and adapters to hook up that new garbage disposal, but you still want to keep your sink...)

In the case of Mango, XML and other related technologies, we are in a position to *dictate* a "Lingua Franca" of sorts and say "use this encoding for optimal performance" and leave it to the library user to transcode at the I/O boundary (or just use universal string where appropriate). I think Mango's filters are especially well suited to the task of transcoding in situations like this, especially when you can't get the whole stream at once, as over a socket.

So to sum up: UniversalString is an awesome general purpose tool that solves all kinds of edge cases and odd problems, but the jury is still out as to whether or not it's the right fit for the XML family of libs. John makes a very compelling argument, and I can see where he's coming from.
_________________
-- !Eric.t.Anderton at gmail