Inconsistent/Incorrect Brace Matching

Post by **pwdiener** » Thu Jul 26, 2012 5:08 pm

I've been developing a somewhat complicated regex for a macro I'm trying to put together, and appear to have hit some threshold where brace matching stops working predictably. The regex is:

^......[^\*][ ]*((([0-9][0-9]|FD|SD|RD|CD)[ ]+)?(([a-zA-Z0-9_-]+)([ ]+SECTION([ ]+([0-9]*))?)?[.]?))?[ ]*COPY[ ]+(((DDS?R?)-((ALL-FORMATS?)|([A-Za-z0-9_]+))(-I-O|-I|-O)?(-INDIC(ATORS?)?)?)|([a-zA-Z0-9_-]+))([ ]+(IN|OF)?[ ]*(([a-zA-Z0-9]+-)?([a-zA-Z0-9]+))?)?

There are a few things I've noticed here that aren't quite as I expected:

1) At position 8, nothing is highlighted. This happens a couple other places too, the next ones being 17 and 49. It appears that, at least in some cases, the look forward isn't detecting a character that needs to be matched.
2) At position 66, the brace matcher chooses to show the match for the ending brace in position 65 rather than the opening brace at 66. I'm not sure there is a "correct" way to do this, but I would probably have expected the opposite.
3) At position 78 and 83, the same issue as 1) appears.
4) At position 90, a correct ending match is shown. However, at 91, 93, 99, and 100, the ending match shows the same starting match character (position 83), which is wrong. This is probably the most serious issue, since it shows a match that is simply incorrect rather than just not quite as expected.
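For cross-checking which positions ought to pair up, here is a minimal sketch in Python of a stack-based matcher. It is not how Zeus does its brace matching; `match_parens` is a made-up name, indices are 0-based, backslash-escaped characters are skipped, and anything inside `[...]` is treated as literal (the edge case of a `]` placed first in a class is ignored).

```python
def match_parens(pattern):
    """Map each unescaped '(' or ')' index (0-based) to its partner's index."""
    pairs = {}
    stack = []          # indices of '(' still waiting for a ')'
    in_class = False    # True while inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2      # skip the backslash and the character it escapes
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            stack.append(i)
        elif ch == ")" and stack:
            opener = stack.pop()
            pairs[opener] = i
            pairs[i] = opener
        i += 1
    return pairs
```

Add 1 to each index to compare against the 1-based cursor positions quoted above.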

Not sure what's going on here - whether the nesting levels of parens have anything to do with this or not. I had put together a slightly less complicated version of the regex previously, and noticed no issues. That version of the regex is:

^......[^\*][ ]*(([0-9][0-9]|FD|SD|RD|CD)[ ]+([a-zA-Z0-9_-]+[.]?))?[ ]*COPY[ ]+(((DDS?R?)-((ALL-FORMATS?)|([A-Za-z0-9_]+))(-I-O|-I|-O)?(-INDIC(ATORS?)?)?)|([a-zA-Z0-9_-]+))([ ]+(IN|OF)?[ ]*(([a-zA-Z0-9]+-)?([a-zA-Z0-9]+))?)?

With this one, position 8 still doesn't display a match, but position 17 does, where the only apparent difference is 2 opening parens instead of 3. I did not see any of the other issues with this version - everything seemed to be working as expected.

I also don't know if this is new behavior. I'm running 3.97n-beta9. BTW, both are accepted as legal regex by search/replace and appear to work as expected.

Post by **pwdiener** » Thu Jul 26, 2012 5:25 pm

As an aside, it would also be nice to have syntax that supported more than 9 captures in a replace string. I'm not sure what it should look like, though - sometimes I might want \11 to be the first capture followed by a 1, other times the 11th capture is what I want. I guess that's why there isn't such a syntax already.

Bill Diener

Post by **jussij** » Thu Jul 26, 2012 11:55 pm

I will take a look at the regexp and see why the brace matching is not working.

But one thing to remember is that by design the brace matching ignores braces inside strings or comments.

> As an aside, it would also be nice to have syntax that supported more than 9 captures in a replace string.

Zeus already has 19 captures.

For example:

Search: ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+) ([0-9]+)

Replace: \0\1\2\3\4\5\6\7\8\9\10\11\12\13\14\15\16\17\18\19

Text: 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22

Also, if you turn on the debug option found in the Options, Editor Options menu and then do a regexp search and replace, you will find that some regexp debug information is produced in the Macros, Macro/Debug Output window.

> sometimes I might want \11 to be the first capture followed by a 1, other times the 11th capture

That is a good point.

At present it is always the 11th capture, which is not quite right.

Cheers Jussi

Post by **pwdiener** » Fri Jul 27, 2012 3:18 pm

Yes, I remember that from some previous discussions about brace matching. There are no quoted strings in the pattern above.

Another thing I noticed is that the matching sometimes appeared to be looking at the previous character and looking for its match. I would expect it to look at the character actually under the cursor, then look either forward or backward for a matching brace.

The regex above has 23 captures, although many of them are larger "outer" strings where there are also smaller "inner" strings captured. For my actual use of this regex, I actually only care about 3 of the captures - the rest of the regex is there to recognize parts of the statement that might be there. Unfortunately, they are 18, 22, and 23.
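As a rough illustration of those three captures, the same pattern can be fed to Python's `re` module. This is only a sketch: Python's engine is not the Zeus engine, and the helper name `copy_targets` and the sample source line are made up for the example. Python reports 23 groups for this pattern and returns `None` for a capture that did not take part in the match.

```python
import re

# The first regex from this thread, unchanged.
PATTERN = (
    r"^......[^\*][ ]*((([0-9][0-9]|FD|SD|RD|CD)[ ]+)?(([a-zA-Z0-9_-]+)"
    r"([ ]+SECTION([ ]+([0-9]*))?)?[.]?))?[ ]*COPY[ ]+(((DDS?R?)-"
    r"((ALL-FORMATS?)|([A-Za-z0-9_]+))(-I-O|-I|-O)?(-INDIC(ATORS?)?)?)|"
    r"([a-zA-Z0-9_-]+))([ ]+(IN|OF)?[ ]*(([a-zA-Z0-9]+-)?([a-zA-Z0-9]+))?)?"
)

copy_re = re.compile(PATTERN)


def copy_targets(line):
    """Return captures 18, 22 and 23, or None if the line is not a COPY."""
    m = copy_re.match(line)
    if m is None:
        return None
    # A capture that did not participate in the match comes back as None.
    return m.group(18), m.group(22), m.group(23)
```

For the hypothetical line `000100     COPY CUSTREC OF MYLIB` this gives `("CUSTREC", None, "MYLIB")`, so a missing capture 22 is distinguishable from an empty string.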

I don't quite know what to suggest about designating captures in a replacement string - I've seen a couple ways of dealing with that issue in macro processors over the years, but they all seem to be contrary to the regex "philosophy" and practice.

Bill

Post by **jussij** » Fri Jul 27, 2012 4:23 pm

> I don't quite know what to suggest about designating captures in a replacement string

I am planning to do the following.

A replacement string like \11 will be replaced with the 11th subexpression if it exists. If it does not exist, it will be treated as \1 followed by a literal 1 instead.
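The rule just described could be sketched in Python roughly as follows. This is not the Zeus implementation; `expand` is a hypothetical helper, "exists" is assumed to mean "the pattern defines that many captures", and `\\` is assumed to produce a literal backslash.

```python
DIGITS = "0123456789"


def expand(template, groups):
    """Expand \\N references; groups[0] is the whole match, groups[n] capture n."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        nxt = template[i + 1] if i + 1 < len(template) else ""
        if ch == "\\" and nxt == "\\":
            out.append("\\")            # escaped backslash stays literal
            i += 2
        elif ch == "\\" and nxt != "" and nxt in DIGITS:
            two = template[i + 1:i + 3]
            if len(two) == 2 and two[1] in DIGITS and int(two) < len(groups):
                out.append(groups[int(two)] or "")   # two-digit capture exists
                i += 3
            elif int(nxt) < len(groups):
                out.append(groups[int(nxt)] or "")   # fall back to one digit
                i += 2
            else:
                out.append(ch)          # no such capture: keep the text as is
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

With only two captures, `\11` expands to capture 1 followed by a `1`; with eleven or more captures it expands to capture 11.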

Cheers Jussi

Post by **pwdiener** » Fri Jul 27, 2012 4:28 pm

I was playing around a bit with the extended captures (10-19) and they seem to work pretty well. There is a bit of an inconsistency though - if the 10th capture is null, the replacement value is \:, instead of \10 that might be expected. Not a big deal, just a momentary bit of confusion.

Interestingly, when I tried \20 in the replacement string, it behaved as if it was \2 with a 0 following.

Might this "internal" representation of captures beyond 9 be made "official" as a means of specifying them in the replacement string? For example, \A = 10th capture (it appears to be the 17th now), \B = 11th, etc. A,B,C would be easier to remember than using the punctuation marks, I think, and could extend the number of captures all the way to 35 without any confusion about interpretation.
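The observations above (`\:` showing up for the 10th capture, and `\A` acting as the 17th) fit a pattern where the marker character is simply the character code of '0' plus the capture number. That is an inference from the visible behaviour, not a confirmed description of Zeus internals; `internal_marker` is a made-up name for this Python sketch.

```python
def internal_marker(capture_number):
    """Guess at the marker character: the code of '0' plus the capture number."""
    return chr(ord("0") + capture_number)


def capture_from_marker(marker):
    """Inverse of the guess above."""
    return ord(marker) - ord("0")
```

Under that assumption capture 10 maps to `:`, capture 11 to `;`, and `A` maps back to capture 17.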

Bill

Post by **jussij** » Fri Jul 27, 2012 5:00 pm

> I'm trying to put together, and appear to have hit some threshold where brace matching stops working predictably.

Unfortunately I am not seeing any of these issues.

What I did was change the coloring of the brace highlighting so that the braces stand out clearly (i.e. I changed the background and foreground colors).

I then pasted your text into a file with a txt extension:

^......[^\*][ ]*((([0-9][0-9]|FD|SD|RD|CD)[ ]+)?(([a-zA-Z0-9_-]+)([ ]+SECTION([ ]+([0-9]*))?)?[.]?))?[ ]*COPY[ ]+(((DDS?R?)-((ALL-FORMATS?)|([A-Za-z0-9_]+))(-I-O|-I|-O)?(-INDIC(ATORS?)?)?)|([a-zA-Z0-9_-]+))([ ]+(IN|OF)?[ ]*(([a-zA-Z0-9]+-)?([a-zA-Z0-9]+))?)?

When I move the cursor left to right over the positions that you indicate are wrong, they appear correct to me.

> if the 10th capture is null, the replacement value is \:, instead of \10 that might be expected.

That is a bug that will be fixed in the next beta.

> Interestingly, when I tried \20 in the replacement string, it behaved as if it was \2 with a 0 following.

That looks OK, since \19 is the last capture (\0 through \19), so \20 is \2 followed by a 0.

> For example, \A = 10th capture

I thought of using this pattern, but I came to the conclusion that the \A pattern would cause more false positives than the \11 pattern.

For example, this pattern immediately creates an inconsistency between \a and \A, etc.

Also, since the \ character is used in DOS file names, things like \a might turn up quite a bit, causing confusion.

> (it appears to be the 17th now)

That might be a bug.

Cheers Jussi

Post by **pwdiener** » Fri Jul 27, 2012 5:22 pm

Yes, you are right about the \A causing some potential confusion (e.g. in file names), both false positives and case issues. You could say the same about \1, but that is well-established convention so not really an issue.

Since relatively few regex have so many captures, you could probably argue that \A would almost always be replaced by itself, since the 10th capture would almost always be null.

If you had a replacement string where you needed to put \2 in the actual resulting string, but there may or may not be a 2nd capture, is there a way to specify that? \\2 for example, meaning an "escaped" \2 which is never replaced by the 2nd capture? Is "escaping" ever used in a replacement string?

I can live with whatever Zeus does in this area as long as I can understand what it's doing. The issue that comes to mind with that interpretation of \11 is some confusion when the 11th capture is null. Now, you can tell that easily, because you see \; in the result. With this new interpretation, that might be impossible.

I put together the regex I'm using as the example in this thread as part of an attempt to implement something like FileDisplayInline for COBOL COPY statements, specifically on an AS/400. Capture 22/23 will be null most of the time, but if present they indicate that only a specific directory should be searched for the COPY file, not a search list. So it's important that I be able to tell that they are null or not. I realize that Zeus currently doesn't support capture 22/23 anyway, but I can "shift that down" to capture 14/15 by doing some other manipulation of the source line with Lua code before running the search/capture/replace. But I'm still left with the issue of \14 maybe showing up as whatever the 1st capture is followed by a 4 instead of just \14, indicating that there was no 14th capture.

Post by **pwdiener** » Fri Jul 27, 2012 5:30 pm

Jussi,

My regex example was in a file with a .CBL extension (even though it's obviously not COBOL code). When I put it in a file with a .TXT extension, it also appeared to work just fine. The coloring you mentioned is exactly what I do to make it very clear where the matching braces are.

Why the apparent difference between .TXT and .CBL handling? I thought the brace matching was hard-coded in...I can't see where it's configured by document type.

Bill

Post by **jussij** » Sat Jul 28, 2012 2:49 am

> Since relatively few regex have so many captures, you could probably argue that \A would almost always be replaced by itself, since the 10th capture would almost always be null.

That is a good point.

> If you had a replacement string where you needed to put \2 in the actual resulting string, but there may or may not be a 2nd capture, is there a way to specify that? \\2 for example, meaning an "escaped" \2 which is never replaced by the 2nd capture? Is "escaping" ever used in a replacement string?

Yes, this is how it currently works.

> Now, you can tell that easily, because you see \;

That is a bug. It should have been \11 and not the \; value.

> With this new interpretation, that might be impossible.

See the mention of debug output from earlier.

> So it's important that I be able to tell that they are null or not.

I will give this some more thought.

> Why the apparent difference between .TXT and .CBL handling?

The difference is everything that is in the document type.

I suspect the difference will be something in the Templates and/or General section.

This is what I would do.

1) Start a second Zeus and create a new test document type.

2) Make sure the brace matching continues to work.

3) Starting with the Templates, move the settings over from the .CBL document type one at a time, and at each step make sure the brace matching continues to work.

> I thought the brace matching was hard-coded in...

It does use details from the document type.

Cheers Jussi

Post by **jussij** » Sun Jul 29, 2012 1:13 pm

The latest Zeus beta now does 0-9 and A-Z sub expressions: http://www.zeusedit.com/z300/zeus-beta-patch.zip

Cheers Jussi

Post by **pwdiener** » Mon Jul 30, 2012 3:40 pm

Cool! I'll give that a shot and let you know how it works for me.

I'll also try to track down what's causing the difference in brace matching between CBL and TXT files.

Bill

Post by **jussij** » Tue Jul 31, 2012 7:30 am

> I'll also try to track down what's causing the difference in brace matching between CBL and TXT files.

I had a quick look at this and the issue is the 7:* line comment setting in the keywords section.

This is a bug, since Zeus is seeing the * as a comment even if it is not in the 7th column.
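To make the distinction concrete, here is a small Python sketch of a column-anchored comment test next to a naive one. Neither function is Zeus code and the actual faulty logic is not shown in this thread; both names are hypothetical and the naive check only illustrates the kind of behaviour described.

```python
def is_fixed_column_comment(line, column=7, marker="*"):
    """True only when `marker` sits exactly in the given 1-based column."""
    return len(line) >= column and line[column - 1] == marker


def naive_comment_check(line, marker="*"):
    """Hypothetical faulty check: flags the marker anywhere on the line."""
    return marker in line
```

The start of the regex from this thread, `^......[^\*][ ]*((`, has a `.` in column 7, so the column-anchored test does not treat it as a comment, while the naive check does because a `*` appears later on the line.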

Cheers Jussi

Post by **jussij** » Wed Aug 01, 2012 6:22 am

This issue should be fixed in the latest Zeus patch found here: http://www.zeusedit.com/z300/zeus-beta-patch.zip

This version also adds the following new scripting functions.

string error_message()

Return the current compiler error message string for the currently active document.

int get_window_type()

Get the type identifier for the currently active window. Possible type values are:

Document   =  0
Compiler   =  1
Debug      =  2
Debugger   =  3
Difference =  4
Function   =  5
Project    =  6
TagBuild   =  7
TagFile    =  8
TagSearch  =  9
Tool       = 10
Macro      = 11
Printer    = 12
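In a script, the values above could be given names rather than compared as bare integers. The following is a sketch in Python; `WindowType` is a hypothetical helper, not something Zeus provides, and the block does not call get_window_type() itself so that it runs outside the editor.

```python
from enum import IntEnum


class WindowType(IntEnum):
    """Names for the window type identifiers listed above."""
    Document = 0
    Compiler = 1
    Debug = 2
    Debugger = 3
    Difference = 4
    Function = 5
    Project = 6
    TagBuild = 7
    TagFile = 8
    TagSearch = 9
    Tool = 10
    Macro = 11
    Printer = 12
```

An integer result can then be converted with `WindowType(value)` and compared against, say, `WindowType.Document`.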

Cheers Jussi

Post by **pwdiener** » Wed Aug 01, 2012 11:39 pm

Excellent! I'll give these a try too.

The \A-\Z replacement strings work just fine for me. Thanks, Jussi!