
Faster to read 1.1 mb and zlib.decompress or read 4.1 mb?

 
Lynn



Joined: 27 Aug 2004
Posts: 89

Posted: Sun Aug 29, 2004 5:08 am    Post subject: Faster to read 1.1 mb and zlib.decompress or read 4.1 mb?

<alert comment="newbie asking question that may be off-topic">

It appears there are a number of people with significant zlib experience who hang out around here. I hope it's ok to ask this:

I have several freeware apps that currently read in a moderately large plain text file (1 mb to 4 mb) before they get started. These flat file "databases" compress very well and have a relatively small "vocabulary". Based on using wc2.d (word count), there are only about 30,000 unique words, and many of these are variations of a root with various prefixes and suffixes.

A 'niche' for this freeware is obsolete computers (486 + Win95 + slow hard drive + 8 mb memory), so performance and a "small footprint" are issues. I want the app to load/launch very fast, and reading the 4.1 mb file is a major bottleneck. Another niche is PDAs.

My questions: Am I likely to see speed gains from reading in the 1.1 mb file and then using std.zlib.decompress? The 1.1 mb file definitely takes less time to read than 4.1 mb, but am I likely to gain or lose because of the decompress time? (I have done some preliminary experiments with the C version of zlib and was disappointed. I asked a similar question in the 7-zip forum but didn't find out much.)

Will decompress time be the same regardless of whether I use compression factor 1, 5, or 9? My impression is that it is slower to use 9 than 1 for compression, but decompression is the same. I will only be compressing once, but decompressing at every app launch.
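To make the question concrete, this is roughly the pattern I have in mind, sketched against the plain C zlib API (which is what I understand std.zlib wraps). The file name, the assumption that the file was written with zlib's compress2(), and the hard-coded output size are all placeholders:
Code:

#include <zlib.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    // Slurp the compressed file (assumed to be a raw zlib stream made
    // with compress()/compress2(), not a .gz file).
    std::ifstream in( "data.z", std::ifstream::binary );
    std::vector<char> packed( (std::istreambuf_iterator<char>( in )),
                              std::istreambuf_iterator<char>() );

    // The uncompressed size must be known (or stored alongside the data).
    uLongf rawLen = 4200000;
    std::vector<Bytef> text( rawLen );

    int rc = uncompress( &text[0], &rawLen,
                         reinterpret_cast<const Bytef*>( &packed[0] ),
                         static_cast<uLong>( packed.size() ) );
    if( rc != Z_OK )
    {
        std::cerr << "uncompress failed: " << rc << '\n';
        return 1;
    }
    std::cout << "inflated " << packed.size() << " -> " << rawLen << " bytes\n";

    // Decompression speed is essentially independent of the level used to
    // compress, so compressing once at level 9 costs nothing at launch time.
    return 0;
}
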

Are there "tuning factors" to optimize zlib.decompress? The text is quite regular, with 5 files of 4 mb each (20 mb total) only having about 30,000 different words (plus punctuation and white space). My understanding is that it might be somewhat advantageous to supply a "dictionary" of words I know happen a LOT in the files (based on wc2.d). How about rebuilding std.zlib with tweaks that optimize for regular text?

Is it possible for 5 files of 4 mb each to share the same "dictionary", so that the overhead of the lookups is shared by all 5 files? My impression is that the LZW logic might have a code 42 that stands for "in", code 14 that stands for "the", code 1019 that stands for "beginning", etc. More common words are represented with fewer bits.
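To make the dictionary idea concrete, this is the shape I get from the zlib manual: deflateSetDictionary() and inflateSetDictionary() prime both ends with the same byte string, and the identical dictionary bytes could be reused for all 5 files. (I also gather zlib's DEFLATE is LZ77 plus Huffman coding rather than LZW, but the effect I'm after is the same.) The dictionary contents and helper names below are made up for illustration:
Code:

#include <zlib.h>
#include <cstddef>
#include <vector>

// Example dictionary only: in practice it would be built from the most
// frequent words reported by wc2.d, with the most common words placed
// last (zlib gives the most weight to the end of the dictionary).
static const char dict[] = "beginning heaven earth unto shall the and of in";

std::vector<Bytef> compressWithDict( const std::vector<Bytef>& src )
{
    z_stream zs = {};
    deflateInit( &zs, Z_BEST_COMPRESSION );
    deflateSetDictionary( &zs, reinterpret_cast<const Bytef*>( dict ),
                          sizeof( dict ) - 1 );

    std::vector<Bytef> out( deflateBound( &zs, src.size() ) );
    zs.next_in   = const_cast<Bytef*>( &src[0] );
    zs.avail_in  = static_cast<uInt>( src.size() );
    zs.next_out  = &out[0];
    zs.avail_out = static_cast<uInt>( out.size() );
    deflate( &zs, Z_FINISH );
    out.resize( zs.total_out );
    deflateEnd( &zs );
    return out;
}

std::vector<Bytef> decompressWithDict( const std::vector<Bytef>& src, std::size_t rawLen )
{
    z_stream zs = {};
    inflateInit( &zs );

    std::vector<Bytef> out( rawLen );
    zs.next_in   = const_cast<Bytef*>( &src[0] );
    zs.avail_in  = static_cast<uInt>( src.size() );
    zs.next_out  = &out[0];
    zs.avail_out = static_cast<uInt>( out.size() );

    int rc = inflate( &zs, Z_FINISH );
    if( rc == Z_NEED_DICT )   // inflate asks for the dictionary used at compress time
    {
        inflateSetDictionary( &zs, reinterpret_cast<const Bytef*>( dict ),
                              sizeof( dict ) - 1 );
        rc = inflate( &zs, Z_FINISH );
    }
    out.resize( zs.total_out );
    inflateEnd( &zs );
    return out;
}
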

Thanks for your guidance. Note that I'm aware of other approaches to make load/launch faster, such as "lazy read" and "partial reads".

</alert>
jcc7



Joined: 22 Feb 2004
Posts: 657
Location: Muskogee, OK, USA

Posted: Sun Aug 29, 2004 11:43 am

I don't know anything about the mechanics of zlib, but I think I might know a way to make this process faster. If the text could be divided into sections and chapters, then the program might call up the needed text on demand. Whether it was stored as folders and files on the disk or within the zip, I'd expect it to run faster and consume less RAM.
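Something along these lines is what I'm picturing (just a sketch; the index format and file names are invented): keep a tiny index of byte offsets and lengths, then seek to and read only the chapter that's asked for:
Code:

#include <cstddef>
#include <fstream>
#include <map>
#include <string>
#include <utility>

// Each line of index.txt looks like: "Gen.1 0 4321"  (chapter, offset, length).
typedef std::map<std::string, std::pair<std::streamoff, std::size_t> > ChapterIndex;

ChapterIndex loadIndex()
{
    ChapterIndex idx;
    std::ifstream in( "index.txt" );
    std::string id;
    std::streamoff off;
    std::size_t len;
    while( in >> id >> off >> len )
        idx[id] = std::make_pair( off, len );
    return idx;
}

std::string loadChapter( const ChapterIndex& idx, const std::string& id )
{
    std::ifstream text( "bible.txt", std::ifstream::binary );
    ChapterIndex::const_iterator it = idx.find( id );
    if( it == idx.end() )
        return std::string();

    std::string chapter( it->second.second, '\0' );
    text.seekg( it->second.first );
    text.read( &chapter[0], chapter.size() );
    return chapter;   // only this chapter ever touches RAM
}
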

But that might throw off the rest of your current design.
Lynn



Joined: 27 Aug 2004
Posts: 89

Posted: Mon Aug 30, 2004 5:44 am

Just curious ... what is meant by "Serenity is still flying" ?

I'll probably implement your suggestion in some form or another. It would help with getting the app loaded/launched, but not necessarily searching.

I'm evaluating the feasibility of a "smart Bible module" with a small "footprint", fast searching, and enough "opaqueness" to satisfy Bible publishers to allow permission to use their texts (MKJV, KJ21, MSG, AMP, ESV, etc.)

It seems D might be a good language for this project.

Probably reinventing the wheel? The SWORD Project has a separate file with verse-by-verse offsets into the main data file. However, most texts have embedded tags, so searching can be quite slooooooow. LcdBible is a project that is mostly compatible with the sword-api and their modules. BerBible shares the same code base, but doesn't use the sword-api, isn't GPL, and has its own "verse per line" vpl files for AKJV, ASV, BBE, KJV, and WEB.

The other issue is holding down memory while keeping the entire OT+NT in a memory buffer to speed up searches. That can give speed improvements of 5x to 50x. I've seen articles about compression techniques that allow searching the compressed text directly, but they were over my head.
jcc7



Joined: 22 Feb 2004
Posts: 657
Location: Muskogee, OK, USA

Posted: Mon Aug 30, 2004 11:20 am

Lynn wrote:
Just curious ... what is meant by "Serenity is still flying" ?
A movie called Serenity is going to be released next year. It's based on a cool TV show called Firefly. My sig has a hyperlink attached for more info.

Lynn wrote:
I'll probably implement your suggestion in some form or another. It would help with getting the app loaded/launched, but not necessarily searching.
Hopefully, it wouldn't be noticeably slower, but you're right that it shouldn't be any faster for searching.

Lynn wrote:
The other issue is holding down the memory to have the entire OT+NT in a memory buffer to speed up searches. That can give speed improvements of 5x to 50x. I've seen articles about compression techniques that allow searching in the uncompressed text, but they were over my head.
It seems to me that the availability of RAM is really going to vary from machine to machine. I wonder if you could have a "hold text in RAM" option in the program, where people with a GB of RAM could benefit from keeping it in RAM and others with no spare memory could load it from the disk each time. (I'm just thinking out loud...)
sean



Joined: 24 Jun 2004
Posts: 609
Location: Bay Area, CA

Posted: Mon Aug 30, 2004 2:59 pm

Firefly was the best darn show to air last year, but because I liked it the show was doomed to fail. Billed as a space western, the show was actually more of a situation drama. IIRC it was moved from WB to Fox after a few months and then cancelled halfway through the season (to be replaced by some terrible reality show), having never aired the final 2 or 3 episodes. Thankfully, the entire series has been released on DVD.

As for zlib... I've only used it out of necessity so I'd never done any performance comparisons, but I decided to throw one together. This test was done using a custom gzip stream filter I wrote in C++ that uses inflate/deflate. Here's the test code:
Code:

#include "gz_stream.hpp"   // custom gzip stream filter (wraps inflate/deflate)
#include <fstream>
#include <iostream>
#include <string>
#include <ctime>

void perfgz()
{
    std::string     buf;
    std::ifstream   ifile( "data.gz", std::ifstream::binary );
    gz_istream      gzin( ifile );
    clock_t         start,
                    stop;
    char            ch;

    buf.reserve( 6000000 );
    start = clock();
    for( ; gzin.get( ch ); )
    {
        buf.push_back( ch );
    }
    stop = clock();
    std::cout << "*** gz file performance ***\n"
              << "size: " << buf.size() << '\n'
              << "start: " << start << '\n'
              << "stop: " << stop << '\n'
              << "elapsed: " << stop - start << '\n';
}

void perftx()
{
    std::string     buf;
    std::ifstream   ifile( "data.txt", std::ifstream::binary );
    clock_t         start,
                    stop;
    char            ch;

    buf.reserve( 6000000 );
    start = clock();
    for( ; ifile.get( ch ); )
    {
        buf.push_back( ch );
    }
    stop = clock();
    std::cout << "*** text file performance ***\n"
              << "size: " << buf.size() << '\n'
              << "start: " << start << '\n'
              << "stop: " << stop << '\n'
              << "elapsed: " << stop - start << '\n';
}


int main()
{
    perfgz(); perftx();
    perfgz(); perftx();
    perfgz(); perftx();
    return 0;
}

I built a text file of size 5,990,488 bytes, which compressed to 1,068,414 bytes. I decided to run the app twice and here is the output:

Code:

*** gz file performance ***
size: 5990488
start: 0
stop: 1642
elapsed: 1642
*** text file performance ***
size: 5990488
start: 1652
stop: 3244
elapsed: 1592
*** gz file performance ***
size: 5990488
start: 3244
stop: 4897
elapsed: 1653
*** text file performance ***
size: 5990488
start: 4897
stop: 6489
elapsed: 1592
*** gz file performance ***
size: 5990488
start: 6499
stop: 8131
elapsed: 1632
*** text file performance ***
size: 5990488
start: 8141
stop: 9734
elapsed: 1593
*** gz file performance ***
size: 5990488
start: 0
stop: 1652
elapsed: 1652
*** text file performance ***
size: 5990488
start: 1652
stop: 3254
elapsed: 1602
*** gz file performance ***
size: 5990488
start: 3254
stop: 4897
elapsed: 1643
*** text file performance ***
size: 5990488
start: 4897
stop: 6499
elapsed: 1602
*** gz file performance ***
size: 5990488
start: 6509
stop: 8151
elapsed: 1642
*** text file performance ***
size: 5990488
start: 8161
stop: 9754
elapsed: 1593

Average gzip performance: 1644
Average text performance: 1596

So on my laptop (4500rpm HD, 512MB RAM, 1.4GHz CPU, Win2k) performance is nearly equivalent. Also, I expect you could get a better compression ratio if you created your own dictionary, though I've never actually done that myself. I also have no idea if using a custom dictionary would impact performance. Finally, an optimized C approach would perform better as it wouldn't involve the buffer chaining my filter method requires. Using the D routines should yield worse performance though, because they allocate a new buffer for every call to uncompress. I may try some comparisons with the D class as well just to see what the numbers look like.
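If the per-call allocation does turn out to matter, the usual fix is to drive inflate() yourself with one reusable stream and a pre-reserved output string instead of letting a wrapper allocate on every call. A rough sketch (chunk sizes are arbitrary and error handling is minimal):
Code:

#include <zlib.h>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Stream a zlib-compressed file through a single z_stream, appending into
// one pre-reserved string so nothing is reallocated per call.
std::string inflateFile( const char* path, std::size_t expectedSize )
{
    std::ifstream in( path, std::ifstream::binary );
    std::vector<char>  inbuf( 64 * 1024 );
    std::vector<Bytef> outbuf( 256 * 1024 );

    std::string text;
    text.reserve( expectedSize );

    z_stream zs = {};
    inflateInit( &zs );                 // use inflateInit2() instead for .gz input

    int rc = Z_OK;
    while( rc == Z_OK && in )
    {
        in.read( &inbuf[0], inbuf.size() );
        zs.next_in  = reinterpret_cast<Bytef*>( &inbuf[0] );
        zs.avail_in = static_cast<uInt>( in.gcount() );

        do
        {
            zs.next_out  = &outbuf[0];
            zs.avail_out = static_cast<uInt>( outbuf.size() );
            rc = inflate( &zs, Z_NO_FLUSH );
            if( rc != Z_OK && rc != Z_STREAM_END )
                break;                  // real code would report zs.msg
            text.append( reinterpret_cast<char*>( &outbuf[0] ),
                         outbuf.size() - zs.avail_out );
        } while( zs.avail_out == 0 );
    }
    inflateEnd( &zs );
    return text;
}

Reserving expectedSize up front is what keeps the output side to a single allocation.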