dsource.org - forums

Textual Encoding

kris

Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Sat Dec 31, 2005 12:53 pm Post subject:

teqdruid wrote:

I've actually been thinking about allocating a new String every time, but allocating it on the stack using a manual memory allocator. This way, if the client copies the reference, it won't be a valid reference, so (hopefully) their code will outright crash when it's run, instead of every other string being the same. I'm not sure, however, which behavior is preferable, and if the crashing behavior is preferable, I'm not certain that it's worth it, since class stack allocation is still slower than using mutable strings, and is most definitely a hack.

I'm doing my best to write a parser that doesn't do any heap allocations. It's not too hard, actually, what with D's array slicing, to not directly do any heap allocations. It's ensuring that none of the code I'm calling is doing any that's tricky/impossible. I'm planning on letting mango.io handle all of the transcodings that I need, but I'm not certain if it'll be making heap allocations. What's the correct way to do this to eliminate/minimize allocations? I was going to use the stuff in mango.io.BufferCodec and attach it to a Reader, since this seems the straightforward way to do it.

I'd highly recommend that you do not allocate off the stack in that manner :)

As far as performance goes, consider this: Readers and Writers are for converting between an in-memory model and a mixed data-type streaming model (think int, byte, short[], bool, real, utf etc). If you already know that all your data is text, then the most efficient thing to do is sidestep Reader & Writer. They are somewhat superfluous in such a scenario, yes? At the very least, they add a thin layer that you won't really take much advantage of.

Thus, assuming all the data is text, one might simply grab the raw content from the Conduit and pass that around. This would be the most direct route. If you did that, then you could use convert.Unicode directly ~ to convert the raw data into your internal representation (Utf16?) as necessary. The convert.UnicodeBom may be your friend here also, although the encoding should be specified in the XML preamble. Regardless, you'd also have to be cognizant of encoding tails (e.g. utf8 fragments) and adjust your buffering accordingly ~ the notion there is that you should not pass utf fragments around internally, but instead move any fragment to the beginning of the subsequent raw-data read. The convert.Unicode functions will tell you about fragments.
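To make the fragment handling concrete, here is a minimal sketch in present-day D. It is not Mango code and does not use convert.Unicode; the function name incompleteTail is hypothetical, and the sketch assumes the raw data is UTF-8. It only shows how one might detect a truncated sequence at the end of a read so that those bytes can be moved to the front of the next read.

```d
/// Returns how many trailing bytes of `data` belong to an incomplete
/// UTF-8 sequence (0 if the buffer ends on a character boundary).
/// Hypothetical helper for illustration; Mango's convert.Unicode
/// reports fragments through its own interface.
size_t incompleteTail(const(ubyte)[] data)
{
    size_t i = data.length;
    size_t back = 0;                       // bytes examined from the end

    // A UTF-8 sequence is at most 4 bytes, so look back at most 4.
    while (i > 0 && back < 4)
    {
        --i;
        ++back;
        ubyte b = data[i];

        if ((b & 0xC0) == 0x80)
            continue;                      // continuation byte, keep looking

        size_t need;                       // length announced by the lead byte
        if (b < 0x80)                need = 1;
        else if ((b & 0xE0) == 0xC0) need = 2;
        else if ((b & 0xF0) == 0xE0) need = 3;
        else if ((b & 0xF8) == 0xF0) need = 4;
        else return 0;                     // invalid lead; leave it to the decoder

        return back < need ? back : 0;
    }
    return 0;
}

unittest
{
    // U+00E9 is C3 A9 in UTF-8; the first read ends between its two bytes.
    ubyte[] first  = [0x61, 0xC3];
    ubyte[] second = [0xA9, 0x62];

    size_t tail = incompleteTail(first);
    assert(tail == 1);

    // Pass only complete text onward, then put the fragment in front of
    // the next read. The concatenation allocates and is here for brevity;
    // a real buffer would copy the fragment to its own start instead.
    auto complete = first[0 .. $ - tail];
    auto carried  = first[$ - tail .. $] ~ second;
    assert(complete.length == 1 && complete[0] == 0x61);
    assert(incompleteTail(carried) == 0);
}
```

The check inspects at most four bytes per read, so its cost does not depend on the block size.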

To make the unicode conversion as efficient as possible, maintain a destination/target buffer that's big enough to house the resultant conversion. The documentation in convert.Unicode tells you how to do this, and about various other optimization strategies.
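As a rough sketch of the reusable-target-buffer idea, the following uses std.utf from the current D standard library rather than Mango's convert.Unicode, whose actual signatures are not reproduced here. The function name toUtf16 is hypothetical. It assumes the input holds only complete UTF-8 sequences, and it relies on the fact that UTF-8 input of n bytes never needs more than n UTF-16 code units.

```d
import std.utf : decode, encode;

/// Transcodes UTF-8 `src` into the caller-supplied `dst` and returns the
/// filled part of `dst` as a slice. No new array is created for the result.
/// Hypothetical helper; assumes `src` contains only complete sequences.
wchar[] toUtf16(const(char)[] src, wchar[] dst)
{
    // n bytes of UTF-8 yield at most n UTF-16 code units.
    assert(dst.length >= src.length, "destination buffer too small");

    size_t i = 0;                          // read position in src
    size_t n = 0;                          // write position in dst
    while (i < src.length)
    {
        dchar c = decode(src, i);          // advances i past one character
        wchar[2] tmp;
        size_t len = encode(tmp, c);       // 1 unit, or 2 for a surrogate pair
        dst[n .. n + len] = tmp[0 .. len];
        n += len;
    }
    return dst[0 .. n];
}

unittest
{
    wchar[64] target;                      // allocated once, reused per block
    auto a = toUtf16("h\u00E9llo", target[]);
    assert(a == "h\u00E9llo"w);
    assert(a.ptr is target.ptr);           // result is a slice of the target

    auto b = toUtf16("abc", target[]);     // same buffer, next block
    assert(b == "abc"w);
}
```

Each result is a slice of the target buffer and is overwritten by the next conversion, so a caller that needs to keep one must copy it.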

An io.Reader can do all this for you, and handles array input without allocation via the ArrayAllocator module. But the most effective way, for your scenario (big blocks of text), is to be aggressive about it >:)

In short, to make it go as fast as possible, one should consider sidestepping the sugar-coated wrappers and get right down to the heart of the matter :)

I can write you some example code, if you need that.

Last edited by kris on Sat Dec 31, 2005 3:10 pm; edited 1 time in total

kris

Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Sat Dec 31, 2005 1:08 pm Post subject:

teqdruid wrote:

I'm talking about a worst case, and most applications are not going to be a worst case. I keep using this worst case, however, because I think that it's indicative of an inherent problem with having multiple encoding types strongly typed in the library/language: that anybody can arbitrarily choose any one of the three (there are good reasons to use any of the three) and when I try to interface my code with the other guy's code, we'll quite possibly end up with transcodings at the code border. Will this happen in all code bases? No, not at all. However, I think you underestimate the extent to which this could be a problem. In any application that uses Mango's strings AND another library, there's only a 1 in 3 chance that both are using the same encoding type (obviously), and it, of course, only gets worse with the number of libraries that you involve. Giving the programmer the freedom to choose the encoding that Mango (and any libraries that use Mango) use greatly reduces this problem.

This latter part is why (a) Mango text handling is templated, and (b) all string instances can be converted to another representation (via UniString). We're just going around in circles here :)

The crux of your post is more important though. The best way I can think of to answer it is like this: in theory, writing libraries is easy. You just get the algorithms together and toss them into a bag. Right? In practice, the hard part is knowing where to draw a balance between generality, ease of use, performance, code-size, and cohesiveness. They are all powerful mistresses/masters, and it's usually very hard to please all simultaneously. I don't pretend to know the best way to do that, but I'm trying the best I can. The UniversalString is one particular beast that apparently likes to fight with all of these at the same time. That's why I'm content to let it sit a while, and become accustomed to its surroundings.

Analogies usually sound like shite, and this one is no different in that respect. Yet, the truth is in there somewhere :)

teqdruid

Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Sat Jan 07, 2006 5:17 pm Post subject:

teqdruid wrote:

You mentioned earlier in this discussion that there's some merit to having a unified string that just chooses an encoding and sticks with it. I agree. What do you think about having this in Mango, wherein the encoding is chosen at compile time? That is, insert something like the following at the bottom of your String.d instead of what's currently there:

Code:

version (Utf8) {
    public alias char character;
} else version (Utf16) {
    public alias wchar character;
} else version (Utf32) {
    public alias dchar character;
} else {
    static assert(false); // You must choose an encoding!
}

typedef StringTemplate!(character) String;
typedef MutableStringTemplate!(character) MutableString;

This way, I can use the symbol "String" all over the place, and my clients can still choose what encoding it's in. It's probably also acceptable to let it default to Utf8. Since everyone's using build now, it should work right. Anyone still using the pre-compiled library probably doesn't care about customization as much, so let Utf8 be prescribed for them.

Kris,
As near as I can tell, you haven't commented on the above proposal. So, what do you think? Also, have you given any more thought to module interdependencies? (for the String extends IWritable issue)

I was unexpectedly engaged in a new (professional) project so I've been a bit occupied as of late, but am planning on finishing the SAX parser soon. I need some changes to happen with the string stuff first, however. Without some of these changes, I'll end up needing to put a few hacks into the parser, and I'd rather not have to do that.

~John

kris

Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Sat Jan 07, 2006 6:52 pm Post subject:

I can't alias the char type, since that would prevent instantiation of, say, a combination of char and dchar within an application. Plus it would wreak havoc with code from multiple parties who compete for the alias. Not something I wish to do with a library. The dependency thing is being resolved. The mango.text package will have a limited dependency on mango.io, permitting String to import mango.io.model.IWritable.
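To illustrate the first objection, here is a small sketch in present-day D. The struct below is a hypothetical stand-in and not Mango's actual StringTemplate. It shows two instantiations with different character types living side by side in one program, which a single library-wide "character" alias selected by a version flag would not allow.

```d
/// Hypothetical stand-in for a templated string type; not Mango's class.
struct StringTemplate(T)
{
    const(T)[] content;

    /// Length in code units of the chosen encoding.
    size_t length() const { return content.length; }
}

// Both instantiations exist in the same application. With one global
// alias chosen at compile time, only one of them would be available.
alias Utf8String  = StringTemplate!char;
alias Utf32String = StringTemplate!dchar;

unittest
{
    auto narrow = Utf8String("h\u00E9llo");
    auto wide   = Utf32String("h\u00E9llo"d);

    assert(narrow.length == 6);   // U+00E9 takes two UTF-8 code units
    assert(wide.length == 5);     // one UTF-32 code unit per character
}
```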

teqdruid

Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Sat Jan 07, 2006 9:30 pm Post subject:

Code:

Index: io/model/IWriter.d
===================================================================
--- io/model/IWriter.d (revision 736)
+++ io/model/IWriter.d (working copy)
@@ -150,7 +150,16 @@
***********************************************************************/

abstract void setEncoder (AbstractEncoder);
+
+ /***********************************************************************
+
+ Gives the output encoding of an attached AbstractEncoder.
+
+ Returns Type.Raw if no encoder is attached.

+ ***********************************************************************/
+ abstract uint getEncoderType ();
+
/***********************************************************************

Output a newline. Do this indirectly so that it can be
Index: io/Writer.d
===================================================================
--- io/Writer.d (revision 736)
+++ io/Writer.d (working copy)
@@ -136,6 +136,7 @@
protected IBuffer buffer;
private bool prefixArray;
private IBuffer.Converter textEncoder;
+ private uint textEncoderType = Type.Raw;

/***********************************************************************

@@ -209,8 +210,22 @@
{
e.bind (buffer);
this.textEncoder = &e.encoder;
+ this.textEncoderType = e.type();
}
+
+ /***********************************************************************
+
+ Gives the output encoding of an attached AbstractEncoder.
+
+ Returns Type.Raw if no encoder is attached.

+ ***********************************************************************/
+
+ uint getEncoderType ()
+ {
+ return textEncoderType;
+ }
+
/***********************************************************************

Is this Writer text oriented?

This was also part of the last code I posted, but is unrelated to the String class. It would be useful for me to be able to get the output encoding of a Writer at run time. If this is a decent way to do it, can I commit? If not, how should it be done? If it shouldn't be done, why not?

~John

kris

Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Mon Jan 09, 2006 9:23 pm Post subject:

I've been tempted to expose a getEncoder() method instead, and a corresponding getDecoder() in the Reader. Just seems more useful for some reason? Maybe not.

teqdruid

Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Mon Jan 09, 2006 10:48 pm Post subject:

kris wrote:

I've been tempted to expose a getEncoder() method instead, and a corresponding getDecoder() in the Reader. Just seems more useful for some reason? Maybe not.

That sounds just as reasonable, and my gut tells me it's more flexible. I'd say go for it.

~John
