Download Reference Manual
The Developer's Library for D
About Wiki Forums Source Search Contact

Ticket #1445 (new enhancement)

Opened 15 years ago

Last modified 14 years ago

Proposal: Mutiple regex engines

Reported by: pcwalton Assigned to: jascha
Priority: normal Milestone: 2.0
Component: Tango Version: trunk
Keywords: Cc:

Description

After researching more about regex engines, I believe that the Haskell people have it right. There are many regex engines, and one algorithm cannot possibly fit all needs. Currently D uses the Tagged NFA approach, which has difficulty with backreferences and greedy/non-greedy quantifiers.

Here is what Haskell does: http://www.haskell.org/haskellwiki/Regular_expressions

I would be interested in writing such an interface for inclusion in Tango if the Tango community agrees.

Change History

(follow-up: ↓ 6 ) 01/20/09 06:55:43 changed by kris

I think it's worth investigating further, perhaps with pcre as an alternate backend candidate?

(follow-up: ↓ 4 ) 01/20/09 07:53:28 changed by Don Clugston

Can't you select the engine based on the contents of the regexp? You obviously need to know which regexp definition you are using (Perl, Posix, or maybe even Perl6), since that affects the meaning. But I don't think the choice of engine should be visible in the user code. In cases where the regex string is available at compile time (which is potentially 95% of the time), you could avoid linking in unused engines.

01/20/09 08:03:46 changed by larsivi

I think we only should (if its a possible solution) have one syntax/definition, but based on the features used in the string select a suitable engine - in practice I think more than 2 is likely to be overkill.

(in reply to: ↑ 2 ) 01/20/09 08:32:38 changed by pcwalton

Replying to Don Clugston:

Can't you select the engine based on the contents of the regexp? You obviously need to know which regexp definition you are using (Perl, Posix, or maybe even Perl6), since that affects the meaning. But I don't think the choice of engine should be visible in the user code. In cases where the regex string is available at compile time (which is potentially 95% of the time), you could avoid linking in unused engines.

That's possible and is an interesting idea. I've already started a PCRE-based regex implementation using the JavaScript? interface, which of course could be changed.

The semantics of a regex do change based on the regex library in use... for example, POSIX is longest-match, while Perl is leftmost-match. TRE is the most popular library that implements Tagged NFA, but it has several 2+-year-old issues found by the Haskell guys, which shows that Tagged NFA is hard to get right in practice. PCRE is slower, but has the advantage of being well-tested and provides lots of features. PCRE also implements a DFA algorithm using the little-known pcre_dfa_exec() function.

Some Haskell hackers have started work on a bounded backtracking algorithm, which looks really interesting and could provide a good compromise between the two once it stabilizes.

01/20/09 08:46:13 changed by larsivi

  • owner changed from kris to jascha.
  • milestone changed from 0.99.8 to 0.99.9.

(in reply to: ↑ 1 ) 03/27/09 11:42:09 changed by yidabu

Replying to kris:

I think it's worth investigating further, perhaps with pcre as an alternate backend candidate?

Good idea. curent tagno Regex is very slowly, and has many bugs.

Write a new and good Regex will take too much time.

(follow-up: ↓ 8 ) 03/27/09 18:55:48 changed by larsivi

slowly?

Slow is the last thing the regex is afaik, but it has been developed a bit slow, and compiling the regex is slow in some cases.

Also PCRE is an unlikely candidate for integration in Tango due to its license (BSD).

(in reply to: ↑ 7 ) 03/27/09 22:17:51 changed by yidabu

Replying to larsivi:

slowly? Slow is the last thing the regex is afaik, but it has been developed a bit slow, and compiling the regex is slow in some cases.

I tested a match, tango(0.99.8) Regex is very slowly.

03/27/09 23:21:54 changed by kris

there are a small number of backtracking-style regex patterns that are sub-optimal in the Tango regex (it was not designed for those). However, for everything else it has been shown to be exceptionally fast at executing searches.

You may be using one of the backtracking patterns, so perhaps you'd prefer to use pcre instead? There's nothing in Tango that would hinder you in doing so.

03/28/09 00:15:00 changed by yidabu

e.g.

auto s = `"07MUF7Sexzq1AM:","http://www.domain.com","108","122","abc","","","335 x 377 - 85k"`;;
auto p = `"(\w{14}:)","([^"]+)","(\d+)","(\d+)",(?"[^"]*",){3}"(\d+) +x +(\d+) +- +\w+"`;
auto regex = new Regex(p);
Stdout(regex.test(s)).newline;

It takce 17 seconds on my PC.

I use PCRE now.

01/30/10 14:46:22 changed by larsivi

  • milestone changed from 0.99.9 to 2.0.