FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

std.string

 
Post new topic   Reply to topic     Forum Index -> Ares
View previous topic :: View next topic  
Author Message
lightoze



Joined: 12 Feb 2006
Posts: 35

PostPosted: Wed Mar 08, 2006 9:32 am    Post subject: std.string Reply with quote

1) This simple unittest hangs up:
Code:
assert( rfind!(char)( "more fffruits", "fff" ) == 5 );

2) I am rewriting "find" and "rfind" functions now using much more efficient Knuth-Morris-Pratt algorithm. Also I will add useful "findAll" function and send new file this night.
3) I think current "split" function does not act as it is expected. May be it must act as PHP "explode" function - so you can give delimeter which would be used?
Back to top
View user's profile Send private message
brad
Site Admin


Joined: 22 Feb 2004
Posts: 490
Location: Atlanta, GA USA

PostPosted: Wed Mar 08, 2006 11:13 am    Post subject: Re: std.string Reply with quote

lightoze wrote:
2) I am rewriting "find" and "rfind" functions now using much more efficient Knuth-Morris-Pratt algorithm.

Is this a better / faster algorithm than Boyer-Moore ??

http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm

BA
Back to top
View user's profile Send private message
sean



Joined: 24 Jun 2004
Posts: 609
Location: Bay Area, CA

PostPosted: Wed Mar 08, 2006 12:02 pm    Post subject: Reply with quote

The algorithm I used is similar to the Knuth one except I begin comparing with pattern[0] and simply reduce the search string by pattern.length instead of matching against pattern[$-1] as I believe the Knuth version does. But it's obviously still broken. I'll fix it today and if you want to send me your rewrite then please do so.
Back to top
View user's profile Send private message
lightoze



Joined: 12 Feb 2006
Posts: 35

PostPosted: Wed Mar 08, 2006 12:08 pm    Post subject: Reply with quote

On quite small strings KMP is faster a bit, on large strings - otherwise. This algorithms has similar asymptotics.
Also I have recently noticed, that pos argument is not useful, because using slicing is more clean.
Back to top
View user's profile Send private message
sean



Joined: 24 Jun 2004
Posts: 609
Location: Bay Area, CA

PostPosted: Wed Mar 08, 2006 12:46 pm    Post subject: Reply with quote

Good point regarding pos. Perhaps it should be removed.

As for split--I'm planning to include split and join from Phobos, and that's the only function I'd gotten to when 149 was released. So expect functions to split on whitespace, a single char pivot, and possibly a substring. Join will be much the same.
Back to top
View user's profile Send private message
sean



Joined: 24 Jun 2004
Posts: 609
Location: Bay Area, CA

PostPosted: Wed Mar 08, 2006 5:03 pm    Post subject: Reply with quote

I've checked in some updates to std.string, with more forthcoming. The code is generally just tightened up and bugs have been fixed. I've held off on switching to the Knuth algorithm for now as the memory allocation for prefixes can be problematic for very large strings, and I want these algorithms to be usable for all cases. I think it may be reasonable is to choose an appropriate implementation based on pattern length, so average-sized strings would use the Knuth algorithm and large strings would use the current algorithm. An alternative would be to provide both functions with different names so a user could choose the appropriate one for the situation.

[edit]

Ah, it was the Boyer-Moore algorithm I was recalling that simply indexes chars in the pattern. I suspect that it's faster than the KMP algorithm when there are few partial matches, and that the KMP algorithm is faster when there are many partial matches.

Am I placing too much importance on memory use? In practice I suspect it's likely that the pattern string will always be significantly smaller than the search string, whether each are measured in bytes or in gigabytes, and so the BM and KMP algorithms will probably always perform better than the naieve version I've implemented. Still, I don't want the algorithms to simply be unusable for specialized applications.
Back to top
View user's profile Send private message
lightoze



Joined: 12 Feb 2006
Posts: 35

PostPosted: Wed Mar 08, 2006 6:25 pm    Post subject: Reply with quote

Ok, suppose that pattern will be 100MB size, so native search algorythm will take str.length * 100'000'000 time. This is MUCH MORE longer if you would use KMP or BM and on-disc swap. Any protests?
P.S. In this case, KMP will be more efficient because of less memory usage also.
P.P.S. Why does wiki downloads page have url ".../ares/wiki/Downlaods"?
Back to top
View user's profile Send private message
sean



Joined: 24 Jun 2004
Posts: 609
Location: Bay Area, CA

PostPosted: Wed Mar 08, 2006 6:46 pm    Post subject: Reply with quote

lightoze wrote:
Ok, suppose that pattern will be 100MB size, so native search algorythm will take str.length * 100'000'000 time. This is MUCH MORE longer if you would use KMP or BM and on-disc swap. Any protests?

I think you're right for average case, though such large allocations can take a while. I was also concerned about memory overhead, but it's probably not worth the speed cost. I'll switch to using the KMP algorithm and leave the current code around as a curiosity.
Quote:
P.S. In this case, KMP will be more efficient because of less memory usage also.

Don't both KMP and BM allocate a buffer equal to the pattern size? How is KMP more efficient?
Quote:
P.P.S. Why does wiki downloads page have url ".../ares/wiki/Downlaods"?

I'll change it to be all lowercase.
Back to top
View user's profile Send private message
lightoze



Joined: 12 Feb 2006
Posts: 35

PostPosted: Wed Mar 08, 2006 6:58 pm    Post subject: Reply with quote

sean wrote:
Don't both KMP and BM allocate a buffer equal to the pattern size? How is KMP more efficient?

BM uses two additionals arrays, instead of one in KMP.
sean wrote:
I'll change it to be all lowercase.

I meant "why downlAods?".
Back to top
View user's profile Send private message
sean



Joined: 24 Jun 2004
Posts: 609
Location: Bay Area, CA

PostPosted: Wed Mar 08, 2006 7:01 pm    Post subject: Reply with quote

Because I didn't see the typo Embarassed It's now fixed.
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic     Forum Index -> Ares All times are GMT - 6 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group