Textual Encoding
teqdruid



Joined: 11 May 2004
Posts: 390
Location: UMD

Posted: Wed Dec 14, 2005 2:15 am    Post subject: Textual Encoding

With the flurry of recent changes, I'm a little lost these days when it comes to text encoding and decoding in Mango.

Essentially, what is the preferred way to read text from a stream, both when one wants to specify the encoding and when one wants it autodetected? Also, what is the preferred way to output text to a stream when the output encoding is selected at runtime?

With both mango.icu and mango.convert available, I'm unsure how this should be done. Also, how does either of these packages fit into the recent changes to mango.io?

If you haven't noticed, I've picked up work again on the mango.xml package, so I want to make sure I'm doing the text encoding/decoding in a manner consistent with the rest of Mango.

Thanks,
John

kris

Joined: 27 Mar 2004
Posts: 1494
Location: South Pacific

Posted: Wed Dec 14, 2005 1:11 pm

Let's make a clarification first, between streaming and non-streaming I/O ~ the former allows for incremental reading and writing, whilst the latter reads and writes the entire content in one operation.

Non-streaming I/O provides for a really simple usage model, and is exposed in Mango via the File and UnicodeFile modules. These allow one to read and write the entire file in one go, with the latter providing for encoding detection and conversion. Said conversion is for utf8/16/32 only.

Streaming I/O is where Conduit, Buffer, Reader and Writer come into play. There are multiple strategies supported here, but the most common uses the Reader and/or Writer to perform an appropriate conversion. That is, you can hand any of char[], wchar[] or dchar[] to a Writer and it will convert it as necessary whilst writing. The Writer knows what encoding to target through an attached encoder (which dictates the target encoding). The same is true of a Reader, which may have a decoder attached. These encoders and decoders may originate from mango.io or from ICU, where the latter has a far wider selection (mango.io and mango.convert support utf8, utf16 and utf32 only).
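
To make the "hand any of char[], wchar[] or dchar[] to a Writer" idea concrete, here is a minimal sketch in present-day D using Phobos' std.utf. It is not Mango's Writer or encoder API: TranscodingWriter, Target and the in-memory sink are hypothetical names invented for the illustration, and the sink merely stands in for a Buffer/Conduit.

```d
import std.stdio : writeln;
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical target encodings (utf8/16/32 only, as in mango.io).
enum Target { utf8, utf16, utf32 }

// Hypothetical writer: accepts any of the three D string types and
// appends the text to a byte sink in the configured target encoding.
struct TranscodingWriter
{
    Target target;
    ubyte[] sink; // stands in for a Buffer/Conduit

    void put(S)(S text)
    {
        final switch (target)
        {
            case Target.utf8:
                sink ~= cast(const(ubyte)[]) toUTF8(text);
                break;
            case Target.utf16:
                sink ~= cast(const(ubyte)[]) toUTF16(text);
                break;
            case Target.utf32:
                sink ~= cast(const(ubyte)[]) toUTF32(text);
                break;
        }
    }
}

void main()
{
    auto w = TranscodingWriter(Target.utf16);
    w.put("hi"c); // char[] input
    w.put("hi"w); // wchar[] input
    w.put("hi"d); // dchar[] input
    // 12: three inputs, two UTF-16 code units each, two bytes per unit
    writeln(w.sink.length);
}
```

The caller never names the source encoding; only the writer's configured target matters, which is the property described above. The bytes land in native byte order in this sketch.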

You might think of Reader/Writer performing conversion "piecemeal", though the piece in question might actually be very large. You can also bypass the Reader/Writer completely and perform your own conversions. Or, where appropriate, one could gather up all the content and convert it in one go ~ just as the non-streaming variety does.

Auto-detection is currently employed by UnicodeFile only. You can certainly use the UnicodeBom module to detect an encoding at the start of a stream, and subsequently attach the appropriate encoder/decoder to a Reader/Writer pair. Or, if you need to detect some of the more esoteric encodings, use the ICU detector instead.

So, to answer your questions ~ when streaming, the preferred method is usually to attach an encoder/decoder to a Writer/Reader pair. Currently, there is no auto-detection configured for streaming, but it's easy to do with UnicodeBom (grab the first few bytes from the stream, ask UnicodeBom or ICU to identify the encoding from them, and then attach the appropriate encoder/decoder).
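
The "grab the first few bytes" step can be sketched roughly as follows. This is hand-rolled byte-order-mark sniffing in present-day D, not the UnicodeBom module or the ICU detector; Encoding, Detected, hasPrefix and sniffBom are hypothetical names used only for this illustration.

```d
import std.stdio : writeln;

enum Encoding { unknown, utf8, utf16le, utf16be, utf32le, utf32be }

// Detected encoding plus the number of BOM bytes to skip before decoding.
struct Detected
{
    Encoding encoding;
    size_t bomLength;
}

bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

// Inspect the first bytes of a stream for a Unicode byte-order mark.
// The UTF-32 marks are tested before the UTF-16 ones because the
// UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
Detected sniffBom(const(ubyte)[] head)
{
    static immutable ubyte[] bom32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] bom32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] bom8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] bom16le = [0xFF, 0xFE];
    static immutable ubyte[] bom16be = [0xFE, 0xFF];

    if (hasPrefix(head, bom32le)) return Detected(Encoding.utf32le, 4);
    if (hasPrefix(head, bom32be)) return Detected(Encoding.utf32be, 4);
    if (hasPrefix(head, bom8))    return Detected(Encoding.utf8, 3);
    if (hasPrefix(head, bom16le)) return Detected(Encoding.utf16le, 2);
    if (hasPrefix(head, bom16be)) return Detected(Encoding.utf16be, 2);
    return Detected(Encoding.unknown, 0); // no BOM: caller must pick a default
}

void main()
{
    ubyte[] head = [0xFF, 0xFE, 0x68, 0x00]; // UTF-16 LE BOM followed by 'h'
    auto d = sniffBom(head);
    writeln(d.encoding, " ", d.bomLength); // utf16le 2
}
```

Once the encoding is known, the matching decoder would be attached to the Reader and the first bomLength bytes skipped. A stream without a BOM yields unknown, which is where a heavier detector such as ICU's would come in.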

If you're not streaming, and dealing with unicode files in utf8/16/32, the trivial approach is to use UnicodeFile.

For your XML project, you're likely dealing with streams over sockets and so on. The simplest way to handle that is to use Reader/Writer with the appropriate encoder/decoder pair, as you've done previously. Squeezing the last drop of performance out of the transfer is usually application-specific. For example, an XML parser might transcode en masse. This can be done using a Reader/Writer, but can also be done by working directly with the Conduit and explicitly invoking a transcoder upon the raw content. To do this for utf8/16/32 encodings, convert.Unicode is quite adept and convenient. For other encodings ICU is the way to go.

Does that help?

kris

Posted: Wed Dec 14, 2005 1:19 pm

Question for you John:

I imagine that most XML documents are thrown around the web as Utf8 encodings? Is that the case, or do some folks encode them as Utf7/16/32 or some other weirdness?

teqdruid

Posted: Wed Dec 14, 2005 3:07 pm

kris wrote:
Question for you John:

I imagine that most XML documents are thrown around the web as Utf8 encodings? Is that the case, or do some folks encode them as Utf7/16/32 or some other weirdness?


I dunno. I would also assume that most XML documents are in Utf8. I use Utf8 exclusively in my stuff, and I know I'm taking a speed hit converting from Utf8 to Utf16 (for the wchar[] in UText) and then back (for some data). I'd like the ability to switch, but I'm using UTexts and UStrings everywhere and they use wchar[] exclusively... I'm actually a bit disappointed with these two classes, both in performance and usage. I don't have performance numbers (it just feels slow), but they don't integrate into the rest of Mango very well, so I'm constantly new'ing them and using .toUtf8 to feed the text to other parts of Mango, and those conversions have got to be hurting my speed.

Would you be open to some (non-API) modifications? I don't remember your position on a String class, but I find them extremely important.

~John

kris

Posted: Wed Dec 14, 2005 5:36 pm

teqdruid wrote:
kris wrote:
Question for you John:

I imagine that most XML documents are thrown around the web as Utf8 encodings? Is that the case, or do some folks encode them as Utf7/16/32 or some other weirdness?


I dunno. I would also assume that most XML documents are in Utf8. I use Utf8 exclusively in my stuff, and I know I'm taking a speed hit converting from Utf8 to Utf16 (for the wchar[] in UText) and then back (for some data). I'd like the ability to switch, but I'm using UTexts and UStrings everywhere and they use wchar[] exclusively... I'm actually a bit disappointed with these two classes, both in performance and usage. I don't have performance numbers (it just feels slow), but they don't integrate into the rest of Mango very well, so I'm constantly new'ing them and using .toUtf8 to feed the text to other parts of Mango, and those conversions have got to be hurting my speed.

Would you be open to some (non-API) modifications? I don't remember your position on a String class, but I find them extremely important.

~John

The ICU String/Text class is there primarily for compatibility with ICU. If you're doing a lot of work with ICU then you need a String class.

That aside, having to convert back and forth between utf8 and utf16 will always be a pain; and it will always slow things down. There's probably a good reason why you're doing these conversions, or is it just to get the String class?

If the latter, then you may be interested in a new String class being built for the mango.text package? It will be compatible with the ICU classes, and more flexible (and templated). Having worked with UString/UText, I'd really like your input on what's OK and what's not. The new class would be expected to drop into existing code with just a change to the class name.

I'd really like to see the SAX engine be notably efficient, and will be more than happy to help get it there. Perhaps it would be a worthwhile exercise to note all the things you'd like to see operate differently? That would be useful. And perhaps some strategic goals? For example, would you be interested in maintaining Utf8 all the way through the SAX chain? Or does it truly have to get transcoded at some point?

- Kris

P.S. modifications are always welcome!

kris

Posted: Wed Dec 14, 2005 6:40 pm

I should have noted this before: ICU is what I'd consider a heavyweight library ~ it's got some truly serious unicode capabilities. The unicode support in mango.io and mango.convert is very lightweight by comparison.

You get to choose what suits best :)

teqdruid

Posted: Wed Dec 14, 2005 6:41 pm

kris wrote:

If the latter, then you may be interested in a new String class being built for the mango.text package? It will be compatible with the ICU classes, and more flexible (and templated). Having worked with UString/UText, I'd really like your input on what's OK and what's not. The new class would be expected to drop into existing code with just a change to the class name.

I'd really like to see the SAX engine be notably efficient, and will be more than happy to help get it there. Perhaps it would be a worthwhile exercise to note all the things you'd like to see operate differently? That would be useful. And perhaps some strategic goals? For example, would you be interested in maintaining Utf8 all the way through the SAX chain? Or does it truly have to get transcoded at some point?

- Kris

P.S. modifications are always welcome!


I use UString and UText purely to have a String class. A few classes in mango.text would also be acceptable. Ideally, I want a string class (an immutable one) that doesn't expose its internal representation and has the ability to quickly and efficiently "talk" to other strings. That is, I don't want the SAX chain to work in all Utf8, Utf16, or any particular encoding. I want it to operate entirely in this String class, and I'll let the String class determine which format to store everything in. In the case of SAX, this isn't too hard. The encoding should be whatever the original encoding is, unless the user specifies a different encoding to transcode everything to, but this should probably be done via some sort of filter. Whenever the SAX library calls Handler methods it should pass Strings into them, and then the programmer can get the text out of the String in any encoding they wish, or just work with it as a String.

I also want it to integrate better with Mango. For instance, it should implement IReadable and IWritable, so that I can do stuff like the following:
Code:
String message = new String("Hello World"c);
Stdout(message)(CR);


I'd also like to be able to use Stdin to read directly into a String.

The idea is that I want to never have to see a char[], wchar[], or dchar[] and thus never have to worry about encoding except for efficiency (if I know I have to write Utf8 out at the end, better to decode an entire document to Utf8 at the beginning instead of doing it piecewise at the end.)

Once a good immutable String class is ready, I also want a mutable one similar to Java's StringBuffer, but I also want to go further. Currently I have a class in my parser called ReaderString. Essentially, it acts like a String class, but whenever data is requested past the end it reads more in from the buffer (as much as possible) and decodes it in larger chunks. It works great. I'd want something like this as well... Two actually: one that works with generic streams, and one which would mmap itself to a text file, so we could treat the entire file as one big string... I think this last idea has a lot of potential.

In summary, I don't see a reason why the average programmer should ever have to worry about text encoding (except, perhaps, to give Mango some hints for efficiency reasons) when it doesn't seem too hard to (somewhat) efficiently abstract it away. If such a system is built, I would certainly (besides helping on it) use it in my SAX implementation. I would also like to ensure that my SAX implementation can easily beat out a Java one. I doubt it's too hard :)

Does this all sound reasonable and technically feasible, or am I living in a dream world?


I've also got a few ideas concerning String creation, but if the above is still half-baked, then these thoughts haven't even made it to the oven yet.

~John

kris

Posted: Wed Dec 14, 2005 7:31 pm

Sounds like we're talking the same language :)

IReadable and IWritable are indeed supported by UString, and will be by the mango.text version too. Mango.text.String will be as immutable as it can be (given the lack of read-only arrays), with a mutable subclass. It will support extraction of the content in whatever flavor is desired, probably with caching enabled on the decode. It may have a setRefill() method, similar to text.Token, to be bridged onto a Buffer/Conduit.

Talking of memory mapping, Mango.io has always had support for this via the MappedFile and MappedBuffer classes. Though it may be tricky to use them with a String, given the transcoding aspects ~ perhaps the entire file would be transcoded on the fly, and the resultant temp-file be mapped instead?

What are you thinking about re String creation? I'm not sure what to do about content aliasing yet (a la UString/UText), but since you're creating so many String instances it would be worthwhile implementing a FreeList in the String class (like SocketConduit does).
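
For readers unfamiliar with the free-list idea mentioned here, a minimal sketch in present-day D follows. It makes no claim about how SocketConduit or mango.text.String implement it; PooledString, allocate and release are hypothetical names for illustration only.

```d
import std.stdio : writeln;

// Hypothetical string wrapper whose instances are recycled through a
// free list instead of being left to the garbage collector.
final class PooledString
{
    private static PooledString freeList; // head of the list (thread-local in D2)
    private PooledString next;            // link used only while on the free list
    string text;

    // Reuse a released instance when one is available, else allocate.
    static PooledString allocate(string value)
    {
        PooledString s;
        if (freeList !is null)
        {
            s = freeList;
            freeList = s.next;
            s.next = null;
        }
        else
        {
            s = new PooledString;
        }
        s.text = value;
        return s;
    }

    // Hand an instance back; the caller must not use it afterwards.
    static void release(PooledString s)
    {
        s.text = null;
        s.next = freeList;
        freeList = s;
    }
}

void main()
{
    auto a = PooledString.allocate("first");
    PooledString.release(a);
    auto b = PooledString.allocate("second");
    writeln(a is b);  // true: the released instance was reused
    writeln(b.text);  // second
}
```

The trade-off is that recycling only works when the owner knows no other reference to the instance survives, which sits awkwardly with handing immutable Strings to user code.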

kris

Posted: Thu Dec 15, 2005 12:28 am

Checked-in the first pass at a String class. Be interested to hear feedback ... http://trac.dsource.org/projects/mango/browser/trunk/mango/text/String.d

teqdruid

Posted: Thu Dec 15, 2005 10:48 am

kris wrote:
Checked-in the first pass at a String class. Be interested to hear feedback ... http://trac.dsource.org/projects/mango/browser/trunk/mango/text/String.d



Ummmm.... Given this code, I don't think we're on the same wavelength. Lemme put some code together for what I was thinking.

~John

kris

Posted: Thu Dec 15, 2005 12:13 pm

teqdruid wrote:
Ummmm.... Given this code, I don't think we're on the same wavelength. Lemme put some code together for what I was thinking.

OK

teqdruid

Posted: Thu Dec 15, 2005 1:25 pm

teqdruid wrote:
kris wrote:
Checked-in the first pass at a String class. Be interested to hear feedback ... http://trac.dsource.org/projects/mango/browser/trunk/mango/text/String.d



Ummmm.... Given this code, I don't think we're on the same wavelength. Lemme put some code together for what I was thinking.

~John


So I've been trying to code up something for the past hour, and haven't gotten very far- I can't think of a clean way to do everything. So let me lay out what I'm trying to do:

First, here's my desired hierarchy:
Code:
class String {} //Immutable string
class MutableString: String {}


Notice there are no templates there. One of the big problems I have with your first pass at it is the use of templates: this essentially makes three different hierarchies, one for each character type. I want this to be 100% internal to the class, so that a string that internally stores wchars will work with one that uses dchars.

For now, I'm not worrying about MutableString.

When a String is created it gets fed a char[] or wchar[] or dchar[] or some other sort of character data. This character data is stored in the string in the original encoding. When a method like opSlice gets called, it slices the character data, creates a new String and returns that.

Next, say I have two strings:
Code:
String a = ...;
String b = ...;

Let's also assume that string a's internal data representation is Utf8 and string b's internal representation is Utf16. When doing a comparison, one of them obviously has to transcode its internal string, then do the comparison. If the string being transcoded isn't too long, I'd also like the String to cache the transcoded version. This way, when I run:
Code:
if (a == b) { /* do stuff */ }
if (b == a) { /* do other stuff */ }

Transcoding only takes place during the first call to opEquals.
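
The caching behaviour described above could look roughly like this in present-day D. It is a sketch under assumptions, not the proposed mango.text design: Str, Kind, utf8(), equals() and transcodeCount are hypothetical names, UTF-8 is arbitrarily chosen as the common comparison form, and a plain equals() method is used in place of opEquals to keep the example short.

```d
import std.stdio : writeln;
import std.utf : toUTF8;

// Hypothetical immutable string that keeps its original encoding and
// lazily caches a UTF-8 view the first time one is needed.
final class Str
{
    private enum Kind { utf8, utf16, utf32 }
    private Kind kind;
    private wstring s16;
    private dstring s32;
    private string cached;   // UTF-8 form, filled on demand
    private bool haveCache;
    int transcodeCount;      // instrumentation for this example only

    this(string s)  { kind = Kind.utf8;  cached = s; haveCache = true; }
    this(wstring s) { kind = Kind.utf16; s16 = s; }
    this(dstring s) { kind = Kind.utf32; s32 = s; }

    // Return the UTF-8 view, transcoding at most once per instance.
    string utf8()
    {
        if (!haveCache)
        {
            cached = (kind == Kind.utf16) ? toUTF8(s16) : toUTF8(s32);
            haveCache = true;
            ++transcodeCount;
        }
        return cached;
    }

    // Compare through the cached common form, whatever each side stores.
    bool equals(Str other)
    {
        return utf8() == other.utf8();
    }
}

void main()
{
    auto a = new Str("h\u00E9llo"c); // stored as UTF-8
    auto b = new Str("h\u00E9llo"w); // stored as UTF-16
    writeln(a.equals(b));            // true
    writeln(b.equals(a));            // true
    writeln(b.transcodeCount);       // 1: the second comparison hit the cache
}
```

A real implementation would presumably skip the cache for long strings, as suggested above, and would pick the comparison form per pair rather than always using UTF-8.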

This seems pretty straightforward in pseudocode, but I'm having a lot of trouble figuring out how to internally represent the character arrays so that each method doesn't need an n-way switch, or even worse, so that methods comparing two strings don't need a separate case for every pair of encodings (on the order of n^2). I was hoping to find an efficient way to do it, but if it can't be done in a clean and efficient manner, then it's probably faster to transcode each time.
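
One way to sidestep the per-pair cases, sketched here in present-day D with Phobos rather than anything in Mango, is to let a template generate each pairing and compare code point by code point. sameText is a hypothetical helper name; std.utf.byDchar and std.algorithm.comparison.equal are the library calls doing the work.

```d
import std.algorithm.comparison : equal;
import std.stdio : writeln;
import std.utf : byDchar;

// Compare two strings of possibly different UTF encodings by decoding
// both to code points on the fly. The compiler instantiates one version
// per pair of argument types, so no hand-written switch is needed.
bool sameText(A, B)(A a, B b)
{
    return equal(a.byDchar, b.byDchar);
}

void main()
{
    writeln(sameText("na\u00EFve"c, "na\u00EFve"w)); // true
    writeln(sameText("abc"d, "abd"c));               // false
}
```

This compares code points only; it does no Unicode normalization, and it still needs the static types at the call site, so a non-templated String class would have to dispatch to it once internally.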

If my requirement for efficient caching is taken away, then my "representation hiding" requirement can probably be fulfilled by changing your current String module as follows:
-Rename AbstractString to String.
-Merge StringTemplate and MutableStringTemplate
-Move as many method signatures as possible into AbstractString (aka String)
-Make StringTemplate!(char) able to compare itself to StringTemplate!(wchar), etc., for all permutations

The only hang up now is string creation. A few static methods in String as follows would do the trick, methinks:
Code:
class String {
    // Factory methods hide which templated subclass backs the String
    static String create(char[] str) {
        return new MutableString!(char)(str);
    }
    static String create(wchar[] str) {
        return new MutableString!(wchar)(str);
    }
}


Hmmmm.... to combine my thoughts on string creation and caching, perhaps a template could be created to create CacheStrings at runtime- that is, immutable strings that are compiled with all three encodings already in them, for strings that one hardcodes... But I think I'm getting ahead of myself at this point.

Do you have any thoughts re all the strings implementing some sort of caching? Unless it can be done well, it's probably not worth doing. If it's not worth doing, I'll hack at your String module to make the above modifications and send it to you.

~John

(Sorry for the long post~ it's something of a brain dump, if any of it doesn't make sense to you, just ask and I'll try to translate the Johnspeak)

Derek Parnell

Joined: 22 Apr 2004
Posts: 408
Location: Melbourne, Australia

Posted: Thu Dec 15, 2005 1:52 pm    Post subject: Interested

BTW, I'm following this conversation with some interest.

I had a mickey-mouse String class written to explore possibilities some time back.

So far I like the way that John's thinking as it is practically how I wanted things to go as well.
_________________
--
Derek
skype name: derek.j.parnell

teqdruid

Posted: Thu Dec 15, 2005 1:57 pm    Post subject: Re: Interested

Derek Parnell wrote:
BTW, I'm following this conversation with some interest.

I had a mickey-mouse String class written to explore possibilities some time back.

So far I like the way that John's thinking as it is practically how I wanted things to go as well.


So what are your thoughts on Strings caching different encodings?

Derek Parnell

Posted: Thu Dec 15, 2005 2:07 pm    Post subject: Re: Interested

teqdruid wrote:
Derek Parnell wrote:
BTW, I'm following this conversation with some interest.

I had a mickey-mouse String class written to explore possibilities some time back.

So far I like the way that John's thinking as it is practically how I wanted things to go as well.


So what are your thoughts on Strings caching different encodings?


To be honest, I wouldn't bother for now. It sounds a bit like premature optimisation. Instead, I'd get the String class API working and 'complete', and then if actual performance problems were being reported I'd look into some caching mechanism. To do it now might be wasted effort.
_________________
--
Derek
skype name: derek.j.parnell

All times are GMT - 6 Hours
Page 1 of 7