Register forum user name Search FAQ

Gammon Forum

Notice: Any messages purporting to come from this site telling you that your password has expired, or that you need to verify your details, confirm your email, resolve issues, making threats, or asking for money, are spam. We do not email users with any such messages. If you have lost your password you can obtain a new one by using the password reset link.

Due to spam on this forum, all posts now need moderator approval.

 Entire forum ➜ MUSHclient ➜ General ➜ RegExp VM

RegExp VM

It is now over 60 days since the last post. This thread is closed.     Refresh page


Pages: 1  2 3  

Posted by David Haley   USA  (3,881 posts)  Bio
Date Reply #15 on Mon 26 Apr 2004 06:57 AM (UTC)
Message
With all due respect, Shadowfyr, I think that you're trying to solve a problem that isn't really solvable.

Quote:
PCRE simply isn't able to deal with text on the fly, which is why you are messing with a look-back in the first place.


Let's think about this for a bit.

Regular expressions are implemented as deterministic finite-state automata. Basically, these machines take an input string, and if they are in an 'accept' state at the end of input, then the string is "valid", otherwise the string is "invalid".

A theoretical note:
Any regex can be converted to a DFA, and any DFA can be converted to a regex.


So, with this in mind, let's think about parsing regular expressions "on the fly". This is definitely possible, in a sense, if you run all of your machines in parallel, stopping each one as soon as you hit an invalid input. You will end up, in the end, with one or more success machines, or none at all.

The problem is that a regular expression matching DFA - if it is actually correct - *cannot* know if something matches until all of the input is consumed, except in very special circumstances.

The problem is that we're not just checking for belonging to the language defined by the regular expression. We are also checking for *how far* we can match until the string stops belonging to the language.

So, how could one possibly match "on the fly"? You would take input as you receive it, and then feed it to your state machine; in the meantime, what do you do with the state machine? What do you do with the MUD output? Do you leave the machines hanging as you receive and print output, and maybe later act on the input - and perhaps go change the output window contents?


I think the major problem that you are grappling with is that yes, the problem *should* be simple, but it just isn't. If the PCRE folks haven't done this, don't you think there is a reason? The solutions that Nick proposes add a degree of complexity (which I consider to be fairly minimal) but provide correctness. The solutions you provide seem to be "practical" in some cases (but limiting in others), but also just seem to be incorrect. Yes, they would work in 95% of the cases. But in those extra 5%, you'll have to apply some kind of fudging that just really isn't a good idea.

As far as I can tell, Nick's solution can do everything yours can. It involves a little extra work in some cases, but provides functionality that your method does not. Personally, I don't think that the list matching you dismiss is that useless. In any case, I think that the added complexity (which again I don't think is that major) is worth the generality and especially correctness. To be frank it seems that your main problem is just that it's not what you had in mind for certain specific examples. :)

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Shadowfyr   USA  (1,791 posts)  Bio
Date Reply #16 on Mon 26 Apr 2004 06:07 PM (UTC)

Amended on Mon 26 Apr 2004 06:12 PM (UTC) by Shadowfyr

Message
It wasn't what I had in mind for most examples. You are correct that his does automatically handle the case of 'did all of this match'. With mine you would need to keep track of the data yourself, until it either finally succeded or failed. In some respects I think this tends to be more flexible. I didn't really consider the issue with needing to know 'all' of the lines before doing something. Unfortunately, the very things I was looking for a way to fix, can't be fixed by either method, i.e. coloring lines or omitting from output. For those you end up being forced to fall back on old methods. Neither concept really solves this or necessarilly can.

The only real advantage is that mine didn't need the extra 'look for x number of lines', though it would have required doing what Nick is doing to work correctly (as you point out), i.e. storing each line matched, until a final match happens, then doing the stuff in 'send' or a script. Its not a big change to the design and would, with that change, work exactly like his, but without the risk of catastrophic malfunctions in those cases where the user does something dumb, like sending two 'inventory' commands in a row and using an insane value like 50 for the number or possible lines to match. It isn't an impossibility that such would happen or that the inventory would change between them, think, checking it, then having some clown into the room and hand you something... Ooops! Its this possibility for error that I don't really like about it, even if I was admittedly slightly off track on my own design.

Oh, and mine wouldn't need to be turned off when you 'think' you don't need them.
Top

Posted by David Haley   USA  (3,881 posts)  Bio
Date Reply #17 on Mon 26 Apr 2004 07:18 PM (UTC)
Message
Well, I'm not sure the problem you bring up is actually a problem. It depends on whether the regular expression is greedy or not. You would have to make it non-greedy in this case. And if the user is writing bad triggers, well, that's the user's fault. :-)

I think that perhaps there could be a more official kind of support for stacks of single-line triggers. That seems to be your method of choice; having triggers that turn each other on in some way where the user keeps track of data. Perhaps you should formalize exactly and precisely what you would want to see such a system do, instead of lots of ideas... The thing is that Nick's system is an additional feature, the subsequent triggers are more of a way of using the current system, if you see what I mean. What more exactly do you want added to that?

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Nick Gammon   Australia  (23,158 posts)  Bio   Forum Administrator
Date Reply #18 on Mon 26 Apr 2004 09:43 PM (UTC)
Message
Quote:

... the risk of catastrophic malfunctions in those cases where the user does something dumb, like sending two 'inventory' commands in a row and using an insane value like 50 for the number or possible lines to match.


Again, you anchor the trigger with \z as the last character so it always matches the most recent inventory. However I agree that you may need to use ungreedy wildcards to ensure you get the smaller rather than the larger inventory. However this is always an issue with regexps.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

Posted by Shadowfyr   USA  (1,791 posts)  Bio
Date Reply #19 on Mon 26 Apr 2004 11:16 PM (UTC)
Message
Yeah Nick. Already forgot that \z bit. lol

And Ksilyan, like I said, I failed to consider the way multilines need to match. If I had, I would have suggested returning the entire thing in a wildcard, just like Nick is doing. In that case, the *only* difference between my state machine and Nick's system is that mine wouldn't need to be told how many lines are needed before hand. The bahaviour would be basically identical in all respects. I am not sure exactly what you think I need to 'formalize' anything. I wasn't thinking carefully enough when suggesting that such a state machine based system should return each line and group of wildcards immediately. But now that I am thinking in those terms, it could have done both with the addition of one simple switch, which could have told it to treat each match as a seperate event when you needed that. As a rule, it isn't, as you point out, how multiline triggers normally work, so making the default behaviour return the result 'in total' would make more sense.

Given this, the real difference between the approaches is mine providing one option that Nick's can't, and the fact that his executes 'after' the lines are all available, while mine would process each individual line one after the other and return the result as soon as all criteria have been met. Mine would have a slight added overhead for all instances, the overhead for his grows for each line it needs to test against. But *both* could work exactly the same way with respect to 'how' and 'when' they return results. I just goofed up when initially thinking about them and what 'multiline' really means.
Top

Posted by David Haley   USA  (3,881 posts)  Bio
Date Reply #20 on Tue 27 Apr 2004 12:31 AM (UTC)
Message
Quote:

In that case, the *only* difference between my state machine and Nick's system is that mine wouldn't need to be told how many lines are needed before hand.


How do you know when to stop matching?

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Shadowfyr   USA  (1,791 posts)  Bio
Date Reply #21 on Tue 27 Apr 2004 02:56 AM (UTC)

Amended on Tue 27 Apr 2004 02:58 AM (UTC) by Shadowfyr

Message
Umm.. When the termination string you put into the trigger tells it to or it fails to match something? lol Seriously, if someone makes one that is so sloppy it matches forever, why is it the fault of the algorythm or client that they messed up? Even Nick appears to be using a bailout (according to one post I read today) of about 100 lines, so the buffer can't be set to like 4000 lines. If the trigger is designed right, it should have an obvious point to stop, if not, telling to to look for 20 lines is convenient to have, maybe, but all your doing is specifically telling it to bail out of the matching 'before' whatever internal limit the client itself uses. But having it optional, when you need it is significantly different than using the wrong number and having to buffer 100 lines, when you actually only match 10, or worse, not matching 11, because you told it the bailout should be 10.
Top

Posted by David Haley   USA  (3,881 posts)  Bio
Date Reply #22 on Tue 27 Apr 2004 05:37 AM (UTC)
Message
Consider this:


Inventory:
   a sword
   a spear

< hp ma mv > inventory
Inventory:
   a sword
   a spear



Now consider the regular expression:
^Inventory:$(.*)^$

Or something to that effect. How do you know when to stop matching? Do you get the first bit? Or the whole thing?

In this case it's clear, the answer is that you only want the first. But that means your algorithm is non-greedy. What if you want it to be greedy?

Perl has a non-greedy specifier, ?, which could work. But Nick already mentioned that as the solution.

How would you propose, Shadowfyr, to deal with greedy vs. non-greedy regular expressions?

I mean no offense but it seems that you are operating under the "it's obvious what I meant" principle. As you are well aware, you can't design a system based on what the user meant; you can *only* design systems that do *exactly* what the user specifies. Otherwise you are sacrificing correctness.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Nick Gammon   Australia  (23,158 posts)  Bio   Forum Administrator
Date Reply #23 on Tue 27 Apr 2004 06:23 AM (UTC)
Message
This is pretty-much my point. That particular hypothetical pattern is matching:

Inventory:
(anything at all)
(a blank line)

How does the state engine know to move from (anything at all) to (a blank line) given that (a blank line) falls into the category of (anything at all)?

This is why the regexp executer really needs to take the text as a whole block, you can't really break it into smaller pieces and have an external decision about when to move from one piece to another.

Even having a simpler expression:

Inventory:
(anything)
You have x items.

... still has that problem. The phrase "You have x items." still matches the middle expression of (anything).


- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

Posted by Shadowfyr   USA  (1,791 posts)  Bio
Date Reply #24 on Tue 27 Apr 2004 11:43 PM (UTC)
Message
I have conceeded tha tmy idea does not work with greedy expressions. I did suggest that in those cases it is just as easy to allow the user to specify an 'optional' limit on the number of lines. Or did you miss reading this? You seem to still be operating on the assumption of the original idea, without taking into acount the changes I proposed to make it work more like Nick's. The point here is 'optional' *with* 'greedy triggers', not 'always required'. Nick's always requires it, even when the regular expression itself has no requirement for this. And as I said, it is a difference between a *slight* increased overhead for all regexp and a *major* increase in overhead the moment you start looking for large numbers of lines, since Nick's has to retest the entire buffer of lines it builds, every single time. Mine only tests one line at a time, but *could* be made to do:

1. Non-greedy - Store all matched lines until all are found or it fails.
2. Greedy with a user defined limit - Store all matched lines, but fail if the limit is exceeded.
3. Act like each match is a seperate event and return each line + wildcards as they are matched, instead of storing the entire result.

Nick's does:

1. Assume all triggers are greedy, require a limit to be set and store everything until all matches succeed or one fails. It also has a built in limit, (I assume from another post I saw), so even if the user specifies 200 lines, it cannot match, since the internal limit of the buffer is 100. Unless I misread the meaning of that post where you asked how many lines users are likely to want to test Nick?

Mine is more flexible because it doesn't need the buffer, but can do exactly the same behaviour, when needed.

However, the point is mute anyway, since I don't have the skill with parsers to build one that can correctly delegate each section of the expression to the state machine instructions. Thus, unless someone else built and proved the concept, it isn't going to happen.
Top

Posted by David Haley   USA  (3,881 posts)  Bio
Date Reply #25 on Wed 28 Apr 2004 (UTC)
Message
My point is that I don't believe such a parser is even possible to build, to be precise and correct at the same time.

Quote:
1. Non-greedy - Store all matched lines until all are found or it fails.

If you're storing all matched lines, you're not really being non-greedy. A non-greedy expression would match the *least* amount of lines possible.
Quote:
2. Greedy with a user defined limit - Store all matched lines, but fail if the limit is exceeded.

This is greedy, yes. Although there is the problem of matching lines that may not have the terminating condition - imagine you have a list that is 105 lines long, but your limit is 100. When you reach 100, what do you do? How do you *know* that the text beyond those 100 lines still matches the regular expression? Perhaps everything you have "matched" up to now simply does not correspond to the regular expression.
Quote:
3. Act like each match is a seperate event and return each line + wildcards as they are matched, instead of storing the entire result.

One cannot determine wildcard matches until one has the whole match, because the wildcards are built during the parsing of the whole text.
How does one "return" lines and wildcards as they are matched? Again, you have the problem of terminating conditions.

Basically, like I said I have the feeling that you're operating under what is convenient to use, but not what is correct in all cases. I'm truly not convinced that it's possible to write a parser as you are suggesting, that can somehow "guess" if input beyond what it received is valid or not.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Shadowfyr   USA  (1,791 posts)  Bio
Date Reply #26 on Wed 28 Apr 2004 01:19 AM (UTC)

Amended on Wed 28 Apr 2004 01:20 AM (UTC) by Shadowfyr

Message
Quote:
1. Non-greedy - Store all matched lines until all are found or it fails.

If you're storing all matched lines, you're not really being non-greedy. A non-greedy expression would match the *least* amount of lines possible.


Huh?? If I have it match three specific lines and *only* those lines, then obviously it is going to return "all matched lines". Your arguing symantics here. It does match the least amount possible, but it returns all that it matched, not all that it *could* match.

Quote:
2. Greedy with a user defined limit - Store all matched lines, but fail if the limit is exceeded.

This is greedy, yes. Although there is the problem of matching lines that may not have the terminating condition - imagine you have a list that is 105 lines long, but your limit is 100. When you reach 100, what do you do? How do you *know* that the text beyond those 100 lines still matches the regular expression? Perhaps everything you have "matched" up to now simply does not correspond to the regular expression.


Please, enlighten me how providing a user defined limit or having it automatically fail when it hits 200 lines (check Nick's release notes, this is what his does) differs from Nick's?

If it is 105 lines long and you tell Nick's version to look for 100 lines, it still isn't going to find 105 lines. Nick's answer *and* mine is that the limit means that it is *a limit*. If by that point it has captured all the lines it was asked to, in cases like "(?s)^You have\:$(.*)*" then it is considered a match, it can't look for 105 lines, because you told it to "only" find 100. If the lines is "(?s)^You have\:$(.*)*^$", then the rules change. In this case it would fail if you have 105 lines and return nothing, again because you told it to stop at 100 lines. This is no different if you use mine or Nick's, but mine only *requires* it in the first case, since it can correctly match the second case, even if you have 200, 500, or 1,000,000 lines. It already knows 'when' to stop in such a case.

Quote:
3. Act like each match is a seperate event and return each line + wildcards as they are matched, instead of storing the entire result.

One cannot determine wildcard matches until one has the whole match, because the wildcards are built during the parsing of the whole text.
How does one "return" lines and wildcards as they are matched? Again, you have the problem of terminating conditions.


This is a special case, it is not intended to act like a normal multiline trigger, but to provide a single trigger to solve a problem that currently requires several triggers working together to work. Again, this is *not* the normal behaviour it would have, but an option that could be turned on in cases where such behaviour is wanted. Yes, it doesn't technically work the way multiline triggers are supposed to, but it isn't supposed to in this case and it would be off by default.
Top

Posted by David Haley   USA  (3,881 posts)  Bio
Date Reply #27 on Wed 28 Apr 2004 02:33 AM (UTC)
Message
Quote:
Huh?? If I have it match three specific lines and *only* those lines, then obviously it is going to return "all matched lines". Your arguing symantics here. It does match the least amount possible, but it returns all that it matched, not all that it *could* match.


I'm not sure I see what you're talking about. The problem of greedy vs. non-greedy comes when you have wildcards in your regular expression. So it wouldn't be matching three specific lines; it's matching a block of text.

I was just commenting on the fact that your statement wasn't really precise in an algorithmic sense.

Quote:
Please, enlighten me how providing a user defined limit or having it automatically fail when it hits 200 lines (check Nick's release notes, this is what his does) differs from Nick's?


I misread your original statement, I did not see the part about failing upon hitting the limit.

Quote:
This is a special case, it is not intended to act like a normal multiline trigger, but to provide a single trigger to solve a problem that currently requires several triggers working together to work. Again, this is *not* the normal behaviour it would have, but an option that could be turned on in cases where such behaviour is wanted. Yes, it doesn't technically work the way multiline triggers are supposed to, but it isn't supposed to in this case and it would be off by default.


It seems to me as if you're breaking the elegance of the system for the sake of convenience. It just reeks of "hack" to me. :)

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
Top

Posted by Nick Gammon   Australia  (23,158 posts)  Bio   Forum Administrator
Date Reply #28 on Wed 28 Apr 2004 03:52 AM (UTC)
Message
Quote:

It also has a built in limit, (I assume from another post I saw), so even if the user specifies 200 lines, it cannot match, since the internal limit of the buffer is 100.


There is a built-in limit, yes, basically to limit the number of lines that are stored "just in case" you may want to do a multi-line trigger. After all, each line takes memory.

However you will not be allowed to ask to match more than the number of lines in this limit. It has been increased from 100 to 200 in case some people have big inventories, or "who" lists or whatever.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
Top

Posted by Shadowfyr   USA  (1,791 posts)  Bio
Date Reply #29 on Wed 28 Apr 2004 04:47 AM (UTC)
Message
Quote:
It seems to me as if you're breaking the elegance of the system for the sake of convenience. It just reeks of "hack" to me. :)


Ah, true, but not more inelegant or a hack than dumping a copy of every line into a bucket and telling your assistant to dig through it after every line you drop in to see if now contains everything you wanted to find. ;) lol
Top

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.


78,781 views.

This is page 2, subject is 3 pages long:  [Previous page]  1  2 3  [Next page]

It is now over 60 days since the last post. This thread is closed.     Refresh page

Go to topic:           Search the forum


[Go to top] top

Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.