
TextStream

 
uwe



Joined: 05 Apr 2005
Posts: 34
Location: Stuttgart, Germany

Posted: Fri Jul 22, 2005 2:52 am    Post subject: TextStream

There has been some discussion in the newsgroup between Stewart and me about a possible implementation of TextStream (a counterpart to Qt's QTextStream). I am posting excerpts here.

Stewart:
I've now got it installed and running. Just a few issues drop it short of perfect. For example, it's a step behind std.stream in that it only detects EOF after trying to read past the end.

Uwe:
Hmm, I simply used feof() to determine the end of the File. If it does not support detecting EOF before actually hitting the end, it would have to be replaced by another function, which in turn would lead to the replacement of all the file-engine functions (fopen, fread, fwrite, etc.). I am not sure the improvement is worth that. As stated in the File documentation, I intentionally used the C streaming API because it is portable (look at the amount of code required to get stat() working on three platforms to get an idea of this advantage), and it already includes buffering and all that jazz which is needed for higher-level devices like DataStream and TextStream (which often read only a few bytes at a time). The alternative is to use the open, read, write family of functions and implement buffering ourselves. Another 1000 lines of code (minimum).

Stewart:
My thought is that an I/O library should be able to detect EOF when it gets there, and that one should also be able to rely on exceptions to catch a premature end of file.

Uwe:
Well, the IODevices are not meant to be used directly. I know, in my helper program imupdate and others I use them directly, but that is only because there is no TextStream yet. Look at DataStream and how it handles premature EOF. This is the interface the user will see: simple shifting of values into / out of the stream, and if something fails an exception is thrown. TextStream should be implemented in a similar manner. IODevice/File should only be used directly if the user really wants to; normally he will create a stream for every kind of IODevice (perhaps there will be a socket implementation in the future, or some kind of memory-based file).

Stewart:
But I believe a TextStream is a feasible addition to this library, with the help of a few more IODevice members to help with UTF detection and the like.

Uwe:
UTF detection is also something that should be taken care of in a higher-level protocol, i.e. TextStream. IODevices encapsulate reading/writing raw data to/from an abstract kind of device, like a file, socket, memory. Interpretation of this data is done by DataStream (look into its implementation, it takes care of the number of bytes for every data type, the storing format used for arrays and the like, and the byte order of the machine) or TextStream.

By the way, I am currently writing Unicode character properties and message formatting. The character properties are already finished. I am not sure if they are needed for the TextStream implementation (isSpace() perhaps?). Anyway, I have uploaded the current Indigo 0.94 to my homepage, so you can look at the changes.


Last edited by uwe on Fri Jul 22, 2005 2:59 am; edited 1 time in total
uwe

Posted: Fri Jul 22, 2005 2:57 am

Stewart:
Rewriting one function doesn't have to mean changing which API the whole library wraps. Especially when this single rewrite doesn't rely on access to another API at all.

The solution for files is very simple: just test pos() == size().

For sequential streams such as stdin, process I/O streams and (?) sockets, it isn't quite so simple, but it isn't complicated either. To test for EOF, read in the next byte. If it returns EOF, return true. Otherwise, add it to the unread buffer or something. (Is the unread buffer supposed to be internal? Or are the wrappers in DataStream and/or TextStream waiting to be written?)

But testing for EOF should certainly be part of the API that one is meant to use. I refer you back to the concept of expected versus unexpected EOF.
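The two cases described above could be combined in a single atEnd() member. The following is only an illustrative sketch in D: SketchDevice, readByte() and m_unread are invented names, not the actual Indigo API, and the backing array stands in for a real file or socket.

```d
// Sketch only: EOF detection for random-access vs. sequential devices.
class SketchDevice
{
    ubyte[] data;          // backing store, stands in for a file or socket
    size_t  position;
    bool    sequential;    // sockets, stdin, ... cannot seek
    ubyte[] m_unread;      // bytes peeked ahead but not yet consumed

    // Returns -1 at end of input, like fgetc().
    int readByte()
    {
        if (m_unread.length) {
            int c = m_unread[0];
            m_unread = m_unread[1 .. $];
            return c;
        }
        if (position >= data.length)
            return -1;
        return data[position++];
    }

    bool atEnd()
    {
        if (!sequential)
            return position == data.length;   // the pos() == size() test
        if (m_unread.length)
            return false;                     // a peeked byte is still pending
        int c = readByte();                   // sequential: peek one byte ahead
        if (c == -1)
            return true;
        m_unread ~= cast(ubyte) c;            // remember it for the next read
        return false;
    }
}
```

The peeked byte goes into the unread buffer, so a subsequent read sees it again and atEnd() stays side-effect free from the caller's point of view.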

It didn't throw an exception when I tried it. At least, an infinite loop of DataStream >> char is indeed infinite. Has this changed?

Uwe:
Huh? I could not believe that, but I tried it myself and it is indeed infinite. It is funny that the message catalogues are working perfectly, despite their use of DataStream. Always interesting how bugs can hide. Well, this is not intended behaviour, and I'll fix it as fast as possible.

Stewart:
The way encoding detection would work depends on the kind of IODevice. So you're saying TextStream should enumerate the possibilities and take appropriate action?

(For that matter, what is OS API support like for detecting what encoding the console is using, for stdin/out/err stuff?)

Uwe:
Hm, perhaps you're right. I thought the TextStream could read the first bytes and determine whether they are a byte order mark. But that only makes sense if it is plugged directly into a File. Thus the encoding information should be moved into the IODevice, yes. I'll take a look at that.
uwe

Posted: Fri Jul 22, 2005 3:01 am

Well, in Qt the TextStream assumes that the device is in the local 8-bit encoding, but autodetects UTF encodings if the first thing it reads is a BOM. There are functions to turn autodetection off, and there are functions to set the encoding. Problems are:

    (1) As you pointed out, this is not always the best solution. But it does its job, as terminals and the like are local 8-bit encoded, and sockets should be set by the user anyway.

    (2) We have no codecs currently. Just some (pretty fast) functions in indigo.i18n.conversion for the different UTF flavours. I wanted to add a toAscii function there. But codecs for more local 8-bit encodings? Some Chinese multibyte encodings? That is a little too ambitious, I think. Anyway, the question remains whether we provide the "Codec" base class, and some derived classes for the UTFs, ASCII and ISO-8859-1, to be prepared for later expansion.
uwe

Posted: Fri Jul 22, 2005 3:05 am

Stewart:
I see. But what about heuristic detection? (http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/7102)

Uwe:
Nice thing. We'll add that later if we feel like it. Maybe as a member function of the codecs, for example like this:

Code:
int Codec.suitability(void[] someData);


The higher the return value, the more appropriate the codec is for the contents of someData. Then we register all existing codecs in a list, and test them. But only if the user requests that (IODevice.determineEncoding or similar).
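The registry idea could look roughly like this in D. Everything here is hypothetical (Codec, AsciiCodec, registry, determineEncoding are made-up names); the ASCII scorer is just the simplest possible example of a suitability() implementation.

```d
// Sketch: every codec reports a suitability score for some data,
// and determineEncoding() picks the codec with the highest one.
abstract class Codec
{
    abstract char[] name();
    abstract int suitability(void[] someData);
}

class AsciiCodec : Codec
{
    override char[] name() { return "ASCII".dup; }

    override int suitability(void[] someData)
    {
        auto bytes = cast(ubyte[]) someData;
        foreach (b; bytes)
            if (b >= 0x80)
                return 0;      // not pure 7-bit ASCII
        return 1;
    }
}

Codec[] registry;              // all known codecs, registered at startup

Codec determineEncoding(void[] someData)
{
    Codec best = null;
    int bestScore = 0;
    foreach (c; registry) {
        int s = c.suitability(someData);
        if (s > bestScore) { bestScore = s; best = c; }
    }
    return best;               // null if no codec claims the data
}
```

A real heuristic codec would return graded scores rather than 0/1, so that, say, a UTF-8 codec can outbid Latin-1 on well-formed multibyte sequences.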

And you are right, I think the best solution is to code this into the IODevice, i.e. the different devices have functions for reporting their encoding, and TextStream only calls them after creation. The IODevice base class implements common functionality like this heuristic or BOM detection.

Stewart:
But how can we determine the local 8-bit encoding, in the cases where it isn't a constant of the platform?

Uwe:
Huh, I'm not even sure how to detect the local 8-bit encoding on the different platforms. On Linux it is a suffix of the LANG environment variable, and Windows will have some function, I guess. Mac OS too. And if there's nothing, we fall back to ASCII :)
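The LANG suffix parsing mentioned above is simple enough to sketch. This is purely illustrative (the function name is invented); a real implementation would also consult LC_ALL / LC_CTYPE and nl_langinfo(CODESET) on POSIX systems.

```d
// Rough sketch: extract the codeset suffix from a LANG value,
// e.g. "de_DE.UTF-8" -> "UTF-8".
char[] localEncodingFromLang(char[] lang)
{
    foreach (i, c; lang)
        if (c == '.')
        {
            auto enc = lang[i + 1 .. $];
            // strip a possible "@modifier", as in "en_US.ISO8859-1@euro"
            foreach (j, c2; enc)
                if (c2 == '@')
                    return enc[0 .. j];
            return enc;
        }
    return "ASCII".dup;   // the promised fallback
}
```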

Stewart:
I think that (the Codecs support) would be a good idea.

Uwe:
All of them should be easy to write, because the functions already exist or are trivial (ASCII and ISO-8859-1). Well, the class should be "TextCodec", and it should work roughly like http://doc.trolltech.com/4.0/qtextcodec.html
uwe

Posted: Fri Jul 22, 2005 8:08 am

Stewart:
Not sure. Now that I come to think about it, the only thing that's really dependent on the kind of IODevice is whether it makes sense to apply heuristics. And thinking about it now, if we're going to rely on the class user to request heuristic detection anyway, I guess we can go with the first quoted paragraph, and we don't really need to add this stuff to IODevice.

Here's how I'm thinking of implementing it now.

The only thing that needs to be added to IODevice is a read-only codec property. This would return null by default, and implement the platform-dependent logic to detect the local codec for stdin/out/err. TextStream would retrieve this from the IODevice on construction, or on setting the device if no codec is already set.

When the time comes to read some data from the TextStream, if the codec is null then it'll do the BOM detection, and if no BOM is present, fall back to UTF-8. We could also have a detectBOM method, which will look for a BOM at the current point and set the codec as appropriate if one is present, otherwise leave the codec unchanged. This'll enable stuff like

Code:
stream.codec = ISO_8859_1;  /* or whatever naming convention we decide on/copy */
stream.detectBOM();


meaning "if there's a BOM, honour it, otherwise treat it as ISO-8859-1". Except that QTextStream has an autoDetectUnicode property - where would this fit into the equation?
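The BOM check itself is a few byte comparisons. A sketch (detectBom and the Bom names are placeholders, and the UTF-32 BOMs are deliberately ignored here for brevity, even though FF FE 00 00 would then be misread as UTF-16LE):

```d
// Sketch: look at the first bytes and report a BOM only if one is
// actually present; Bom.None means "leave the current codec unchanged".
enum Bom { None, Utf8, Utf16LE, Utf16BE }

Bom detectBom(ubyte[] head)
{
    if (head.length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        return Bom.Utf8;
    if (head.length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        return Bom.Utf16LE;
    if (head.length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        return Bom.Utf16BE;
    return Bom.None;
}
```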

I'll try and get somewhere with coding up TextStream over the weekend. Probably supporting only UTF-8 at first, and then improving it to support codecs once these are implemented.
uwe

Posted: Fri Jul 22, 2005 8:26 am

OK, this sounds quite good. My thoughts on it:

IODevice should not contain a codec itself. As you said, IODevice returns a (newly created) codec that it thinks fits the contents:

Code:
TextCodec IODevice::deviceCodec()
{
  return null;
}

TextCodec File::deviceCodec()
{
  // Do some magic OS-inquiry...
  return new SomeFancyCodec;
}


For the moment, just add the IODevice function. This is simple, and everything TextStream needs to implement its features.

I would suggest that TextStream let its codec be null on construction, and nullify it when setDevice() is called. When its internal readSomething() function is called and the codec is null, it does a BOM detection and takes the appropriate UTF codec; if that fails, it takes the deviceCodec; and if that is null, it falls back to the platform's default local 8-bit encoding. autoDetectUnicode is normally true; if the user sets it to false, the stream will not do the BOM detection explained above, but will instantly take the deviceCodec / default local 8-bit encoding.
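That decision chain, written out as a sketch. TextCodec is a stub here, and the three delegates stand in for the BOM probe, IODevice.deviceCodec() and the local 8-bit lookup; all of these names are hypothetical.

```d
// Sketch: the codec-resolution order described above.
class TextCodec { char[] name; this(char[] n) { name = n; } }

TextCodec resolveCodec(TextCodec current, bool autoDetectUnicode,
                       TextCodec delegate() bomProbe,
                       TextCodec delegate() deviceCodec,
                       TextCodec delegate() localeCodec)
{
    if (current !is null)
        return current;                  // user has set a codec explicitly
    if (autoDetectUnicode) {
        TextCodec c = bomProbe();        // 1. look for a BOM
        if (c !is null)
            return c;
    }
    TextCodec d = deviceCodec();         // 2. ask the device
    if (d !is null)
        return d;
    return localeCodec();                // 3. default local 8-bit
}
```

The single null comparison at the top is the only cost on the hot path once a codec has been resolved.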

Well, that's what I think. It would have the advantage that the user normally does not have to care about encodings at all: just create the TextStream and read. And the implementation of readSomething() (i.e. the internal reading function of TextStream, whatever it ends up being called) would only be slowed down by one comparison of m_codec against null, not more.

But this is not necessarily the best variant...

Anyway, if you code something up over the weekend, I would suggest that you first write the abstract TextCodec and at least one derived codec, and then start with TextStream, so that its implementation is consistent. You have plenty of time, as I am going on vacation for a week tomorrow.

Ciao & thanks
uwe
smjg



Joined: 29 Sep 2004
Posts: 41

Posted: Mon Jul 25, 2005 10:22 am

uwe wrote:
Ok, this sounds quite well. My thoughts on it:

IODevice should not contain a codec itself. As you said, IODevice returns a (newly created) codec that it thinks fits the contents:

Code:
TextCodec IODevice::deviceCodec()
{
  return null;
}

TextCodec File::deviceCodec()
{
  // Do some magic OS-inquiry...
  return new SomeFancyCodec;
}



Is having it in File for the case where the OS API has a built-in means of detecting the encoding of a file, which we would deem superior to ours?

uwe wrote:
Anyways, if you code something up over the weekend, i would suggest that you first write the abstract TextCodec, and at least 1 derived codec, and then start with TextStream, so that its implementation is consistent. You have plenty of time, as i am going into vacations for 1 week tomorrow.

Ciao & thanks
uwe


I've started work on it. I've stumbled upon quite a few issues in the process:

1. I've found myself tweaking the API in places from the Qt spec, partly so that it fits in with UTF-8 being the standard for D and D's built-in arrays, and partly as ConverterState seems of little use - it seems to be a workaround for Qt's unwillingness to use exceptions. At the moment, I've made convertFromUnicode and convertToUnicode work with void[] and dchar[], and let fromUnicode and toUnicode handle the conversion to/from char[]. This significantly simplifies the process of picking off the requested number of characters.

2. I can't make up my mind whether codecForName/codecForMib should return null or throw an exception if no such codec is registered.

3. I don't see any documentation of how name/MIB clashes between codecs should be handled.

4. Looking at mibEnum, the documentation of the function gives

"It is important that each QTextCodec subclass returns the correct unique value for this function."

And in the "Creating Your Own Codec Class" section,

"Return the MIB enum for the encoding if it is listed in the IANA character-sets encoding file."

Well, what if it isn't listed there?

5. The codecForLocale documentation makes no comment on whether it's supposed to be the GUI or console encoding. However, I'm guessing it would make sense to make this the GUI encoding, and leave the deviceCodec of the stdin/out/err wrappers to determine the console encoding.

I'll also be away next week, but I'll try and get some more done before I depart.

Stewart.
smjg

Posted: Thu Jul 28, 2005 5:27 am

We ought to come to an agreement on a few things:
  • What type should be used to represent boolean values? bit, bool, int or something else?
  • It seems to be a common practice to use int for nearly all integer parameters/returns, regardless of whether they can actually be negative. Should we copy this practice? Or make use of the unsigned types where they fit?
See you in a week and a bit....

Stewart.
uwe

Posted: Sat Jul 30, 2005 8:45 am

OK, I'll try to answer the simple questions first:

  • Boolean values are represented by int. This is already a standard throughout the library. In the function documentation there should be some text like "... returns true if blablubb, and false otherwise" to clearly indicate this fact. And the function name should reflect that it returns a boolean value (like isEmpty in contrast to empty, which could also be understood as "empty it").

  • I think we should use unsigned values where they fit. As the D specs suggest, I use size_t for all kinds of positive counting, and ptrdiff_t for differences between pointers/indices. They will also expand to 64 bits automatically, and are faster on those machines.


I have looked at your code, and there are really a lot of difficult issues we have to work out. As you will be away for a few days, I'll look into it, adjust some things and write some comments. I'll also work on a function that determines the local encoding. I hope this will not clobber any TextStream code you've already written. Just some quick thoughts that come to mind:

  • We don't need the codec for C strings (as D sources are all Unicode, many many thanks to Walter).

  • As you said, UTF-8 is the default in D. As there is no QString-like class currently, TextCodec should "spit out" char[] arrays and consume void[] arrays. Should we get rid of the two-stage conversion, since it is slow?

  • tr() is the Qt function that takes care of message translation. This is absolutely no issue for us, as D sources are Unicode. The message translation is already finished, and it is fully Unicode based.

  • I have second thoughts about IODevice::deviceCodec(), and I am not sure it was such a good idea of mine. You are right, why should the OS know the encoding of a file? And what happens if we have a dumb OS and need to find out ourselves? I'll try to write the function that detects the local encoding. Perhaps this mess will clean itself up then...
uwe

Posted: Tue Aug 02, 2005 6:46 am    Post subject: Difficult...

Hmm. This codec stuff is difficult, to say the least. First, some observations I made:

  • I think we don't need the len/count parameters of the conversion functions, as we pass dynamic arrays instead of pointers. They already contain a length. The user can specify the length with slicing.

  • Please use the indigo.i18n.conversion functions for conversion between the different UTF dialects. They are much faster than the Phobos functions (factor 5-20).

  • I have noticed that you give the parameter types of functions in the doc comments. This is a nice idea. For consistency with the rest of the lib I have deleted them for now, but perhaps I will add them to all functions in the library at some point.


OK. What I did not like in your code are the many allocations and conversions that could be unnecessary. Imagine a file that is encoded in UTF-16, and the user wants it as UTF-8. It is then decoded chunk by chunk into UTF-32, just to be encoded into UTF-8 afterwards. With memory allocations all along the way, of course. Imagine the mass of memory needed for a large file. This will significantly slow everything down.

A thing I cannot decide on is which output encoding should be used: UTF-8/16/32? The problem is: UTF-8 is the right one for console output. UTF-16 is correct for Win32 API functions, and it will be the format used by String (after the completion of the localization module we will not need ICU any more, and this is the point where I will begin to use String heavily in all applications, and of course in Indigo itself). UTF-32 is the simplest for all the codecs to implement, as most of them only need a table TheirEncoding <=> UTF-32 character.

What I want to achieve is the following:

  • A simple interface for encoding/decoding single strings (with no incomplete sequences left over at the end) which are in UTF-8/16. It need not be ultimately fast, and may contain an unnecessary allocation. In Qt, this is provided by the public TextCodec functions.

  • An interface for encoding/decoding lots of data chunk after chunk, with possibly incomplete sequences at the end of chunks, from/to UTF-8/16. This interface must be optimized for speed as far as possible, and it must be usable without *any* unnecessary allocation/conversion. This also means that UTF-8/16 encoded files are simply passed through, and no conversion takes place. In Qt, this is provided by the TextEncoder/Decoder classes.
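The tricky part of the chunked interface is a multi-byte sequence split across two chunks. A toy sketch (ChunkDecoder and its members are invented names): the "codec" here decodes little-endian byte pairs, and an odd byte at the end of a chunk is carried over as the remainder, which is exactly the bookkeeping a TextEncoder/Decoder would have to do.

```d
// Sketch: chunk-by-chunk decoding with a carried-over remainder.
struct ChunkDecoder
{
    ubyte[] remainder;   // unfinished sequence from the previous chunk

    ushort[] decode(ubyte[] chunk)
    {
        ubyte[] data = remainder ~ chunk;       // prepend leftover bytes
        size_t usable = data.length - (data.length % 2);
        ushort[] outbuf;
        for (size_t i = 0; i < usable; i += 2)
            outbuf ~= cast(ushort)(data[i] | (data[i + 1] << 8));
        remainder = data[usable .. $];          // keep the open sequence
        return outbuf;
    }
}
```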


Please contact me as soon as you return, before you continue working on TextCodec. I will try to work out an API in the meantime. I guess you agree that our TextCodec API will be quite different from the Qt one... I drool over comments!

Ciao
uwe
uwe

Posted: Tue Aug 02, 2005 7:20 am    Post subject: Main encoding used by TextCodec

After some thinking, I concluded it would be best if TextCodec used UTF-16 as its internal encoding:

Code:
abstract wchar[] convertToUtf16(inout void[] input, wchar[] buffer);


This function would replace the old convertToUnicode(). Its output can be used by String directly, it is suitable for Win32 API functions, and if we ever write up the GUI parts of Qt, this is the format that's needed. The buffer parameter is used by the function for storing the UTF-16 result, if it is large enough.

If we decide to support UTF-8 as well (which would be sensible for all users who output directly to the console, or need it for their own code, where I guess the majority of coders use char[]), we could add this protected function:

Code:
char[] convertToUtf8(inout void[] input, char[] buffer, wchar[] utf16Buffer)
{
  return indigo.i18n.conversion.toUtf8(convertToUtf16(input, utf16Buffer), buffer);
}


This function will do the job, using both of the buffers if possible and needed. Derived classes can override it; for example, the UTF-8 TextCodec will override it like this:

Code:
char[] convertToUtf8(inout void[] input, char[] buffer, wchar[] utf16Buffer)
{
  // The input already is UTF-8; just reinterpret it, no conversion needed.
  char[] result = cast(char[]) input;
  input = null;
  return result;
}


But there are other codecs that could provide a better implementation as well, for example the ASCII codec.

Now the public interface of TextCodec simply uses these functions, and checks that there is no remainder:

Code:
final wchar[] toUtf16(void[] input)
{
  wchar[] result = convertToUtf16(input, null);
  if (input.length != 0)
    throw new CodecException(i18n("Illegal remainder!"));

  return result;
}


The TextEncoder/Decoder classes maintain buffers which are passed to the convert functions of TextCodec, to avoid allocations. As for ConverterState: you are right, we should get rid of that baggage. The user should use TextEncoder/Decoder if he wants to do this kind of stuff, and the public TextCodec functions simply throw exceptions if there are leftovers.

On a side note: I am not sure how the remainder should be returned from the protected functions. But this is just a question of definition. The inout parameter is a good idea, but if you think that an "out size_t remainder" which denotes the number of bytes left at the end of the input is better, just go with that.

Ciao
uwe
smjg

Posted: Sat Aug 06, 2005 12:55 pm

uwe wrote:
  • I have rethought about IODevice::deviceCodec(), and i am not sure if this was such a good idea of me. You are right, why should the OS know the encoding of a file? And what happens if we have a dumb OS and need to find out ourselves? I'll try to write the function that detects the local encoding. Perhaps this mess will clean up itself then...

I certainly consider it a good idea, for the reason I thought you had in mind in the first place: to detect the console encoding and hence make the stdin/out/err wrappers we will write usable straight out of the box. Of course, most IODevice types other than stdin/out/err would return null here, denoting that TextStream will detect the encoding or use a default, depending on how it's configured.

uwe wrote:
Hmm. This codec stuff is difficult, to say the least. At first some observations i made:
  • I think we don't need the len/count parameters of the conversion functions, as we pass dynamic arrays instead of pointers. They already contain a length. The user can specify the length with slicing.

The reason I kept the count parameter in a few functions is to facilitate encoding/decoding of a specific number of characters. Given a UTF-8 string, we can't extract the first five characters by array slicing alone, since we don't know in advance how many bytes these five characters occupy.
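This point is easy to demonstrate: finding where the first n characters of a UTF-8 string end requires walking the start bytes. A sketch (the function name is invented, and validation of the input is omitted):

```d
// Sketch: how many bytes do the first n UTF-8 characters occupy?
size_t utf8ByteCount(char[] s, size_t n)
{
    size_t i = 0;
    while (n > 0 && i < s.length)
    {
        ubyte b = cast(ubyte) s[i];
        if      (b < 0x80)           i += 1;   // ASCII
        else if ((b & 0xE0) == 0xC0) i += 2;   // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) i += 3;   // 3-byte sequence
        else                         i += 4;   // 4-byte sequence
        --n;
    }
    return i;
}
```

Only with such a byte count in hand can a slice like s[0 .. utf8ByteCount(s, 5)] pick off the first five characters.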

uwe wrote:
  • Please use the indigo.i18n.conversion functions for conversion between the different UTF dialects. They are much faster than the Phobos functions (factor 5-20).
  • I have noticed that you give the parameter types of functions in the doc comments. This is a nice idea. For consistency with the rest of the lib i have deleted them for now, but perhaps i will add them to all functions in the lib in the next time.

I personally don't really like the notation of functionName() on everything - it makes it look as if the function really has no parameters.

uwe wrote:
A thing i cannot decide on is which output encoding should be used: UTF-8/16/32? The problem is: UTF-8 is the right one for console output.

Try telling that to someone who isn't using the same OS as you.

uwe wrote:
UTF-16 is correct for Win32 API functions, and it will be the format used by String (after the completion of the localization module we will not need ICU any more, and this is the point where i will begin to use String heavily in all applications, and of course in Indigo itself). UTF-32 is the simplest for all the codecs to implement, as most of them only need a table TheirEncoding <=> UTF-32 character.

You have a good question there. I've been using UTF-8 as the output encoding, since this is what Phobos uses. But when you talk of using a String class, do you mean:
  • forcing the users of Indigo classes to contend with your string class (not a good idea IMO)?
  • giving the user the choice of using your strings or D's own?
  • using it only internally?

Stewart.
smjg

Posted: Sat Aug 06, 2005 1:37 pm    Post subject: Remainder

uwe wrote:
On a sidenote: i am not sure about how the remainder should be returned in the protected functions. But this is just a question of definition. The inout parameter is a good idea, but if you think that an "out size_t remainder" which denotes the number of bytes that are left at the end of the input is better, just go with that.

Ciao
uwe

My plan was that the to/fromUnicode(input, number, remainder) versions would be used if one wants to pick off a specific number of characters from a stream. In principle, remainder is out, but I declared it as inout to support the idiom
Code:
decodedText = toUnicode(buf, 42, buf);

to pick off a number of characters and at the same time remove them from the input buffer.

Stewart.
uwe

Posted: Sun Aug 07, 2005 7:26 am

You are right, deviceCodec() is a good idea for detecting the console encoding. It is different from the standard local encoding on Windows, by the way. We'll add that.

I cannot think of a good example where we need to specify how many characters to encode/decode. And if the user does, he should find out where those characters end with some other function (there are some in std.utf, and perhaps Indigo String will get one, too?).

I will add the parameters to the function names in the docs. This really is a good idea.

For the encodings: look at the code in stewart.tar.gz; I think this is a good solution. It does not force the user to use String, but it makes using String as fast as using wchar[], and it provides an as-fast-as-possible UTF-16 output and an almost-as-fast UTF-8/UTF-32 output for those who use UTF-8 in their code (the majority?). I hope that makes everybody happy. :)

With "using String heavily" I meant for my personal use, and internally in Indigo. I will not enforce String use, as it is much simpler to receive a wchar[], and converting a String to a wchar[] is a one-liner.

To your email question about TextEncoder/Decoder: they are designed to convert a lot of text efficiently (without unnecessary allocations), and thus should be used by a class like TextStream. Encoder/Decoder will also deal with the remainders in a transparent manner, so TextStream does not have to worry about them. Anyway, they need to be written only once, not for every encoding, and I don't think they will be more than 800 lines of code + docs.

Ciao
uwe