06-13-12 - MSVC RegExp Find-Replace

This took me a while to figure out so I thought I'd write it down. You can use the MSVC regexp find-rep to match function args and pass them through. This is in VC2005 so YMMV with other versions.

In particular I wanted to change :




for any "blah". The way to do this is :




What it means :

\(  is escaped ( character
:b* matches any amount of white space
{}  wrap an expression which we can later refer to with \1,\2,etc.
.@  . means match any character, and @ means match as few as possible (don't use .* , which matches as many as possible)
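
For anyone more used to Perl-style regexps, here is a rough sketch of the same pieces in Python's re module. The names OldCall and NewCall and the extra 0 argument are made up for illustration (they are not the find-rep from this post), and the mapping from the VC2005 syntax is an approximation: :b* becomes [ \t]*, {} becomes (), and .@ becomes .*?

```python
import re

# Hypothetical names: OldCall / NewCall are for illustration only.
# Going by the rules listed above, the VC2005 strings would be roughly:
#   find    : OldCall\(:b*{.@}:b*\)
#   replace : NewCall( \1, 0 )
pattern = re.compile(r"OldCall\([ \t]*(.*?)[ \t]*\)")

def rewrite(line):
    # \1 pastes the captured argument into the output.
    # Note the parens in the replace string are not escaped.
    return pattern.sub(r"NewCall( \1, 0 )", line)

print(rewrite("x = OldCall( array[3] );"))
```

The lazy group stops at the first closing paren, so this only behaves for arguments that do not contain parens themselves.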

A few things that tripped me up :

1. The replace string is *not* a regexp ; in particular notice there's no \ for the parens; I had \ on the parens in the output and the damn thing just silently refused to do the replace. So that's another hint - if you click "Find" and it works, and you click "Replace" and it just silently does nothing, that might mean that it doesn't like your output string.

2. There's a ":i" special tag that matches a C identifier. (:i is equal to ([a-zA-Z_$][a-zA-Z0-9_$]*) ) You might think that :i is a nice way to match a function argument, but it's not. It only works if the function argument is a simple identifier; it won't match "array[3]" or "obj.member" or anything like that. It would have been nice if they provided a :I or something that matched a complex identifier.
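
To see the :i problem concretely, here is a small Python sketch that spells out the identifier pattern quoted above and compares it against a lazy "match anything" group. The call name Func is hypothetical, and Python's re syntax stands in for the VC2005 one.

```python
import re

# Python spelling of the identifier pattern quoted above for :i
IDENT = r"[a-zA-Z_$][a-zA-Z0-9_$]*"

# Hypothetical call name "Func", used only for illustration.
ident_arg = re.compile(r"Func\(\s*(" + IDENT + r")\s*\)")
any_arg = re.compile(r"Func\(\s*(.*?)\s*\)")

def captured(regex, text):
    # Return the captured argument, or None if the pattern does not match.
    m = regex.search(text)
    return m.group(1) if m else None

for text in ["Func( count )", "Func( array[3] )", "Func( obj.member )"]:
    print(text, "->", captured(ident_arg, text), "|", captured(any_arg, text))
```

The identifier version only captures the plain "count" argument; the lazy group captures all three.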

In cblib/chsh, I could have just done




which, while not remotely as powerful as a full regexp match, I find much more intuitive and easy to use, and it works for 99% of the find-reps that I want.

(in cblib a * in the search string always matches the minimum number of characters, and a * in the replace string means put the chars matched in the search string at the same slot)
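
The wildcard behavior described above is easy to sketch. This is not the cblib code, just a minimal Python approximation of the two rules (a * in the search string matches as little as possible, and the Nth * in the replace string gets whatever the Nth * in the search string matched). The function names and the OldCall / NewCall example are hypothetical, and it assumes the replace string has no more *'s than the search string.

```python
import re

def wild_to_regex(search):
    # Each * becomes a lazy capture group; everything else is literal text.
    parts = search.split("*")
    return re.compile("(.*?)".join(re.escape(p) for p in parts))

def wild_replace(search, replace, text):
    regex = wild_to_regex(search)
    pieces = replace.split("*")

    def expand(match):
        # The Nth * in the replace string is filled with the text
        # matched by the Nth * in the search string.
        out = pieces[0]
        for slot, piece in enumerate(pieces[1:], start=1):
            out += match.group(slot) + piece
        return out

    return regex.sub(expand, text)

print(wild_replace("OldCall(*)", "NewCall(*,0)", "y = OldCall(a[1]) + OldCall(b);"))
```

Because the groups are lazy, a * at the very end of the search string matches nothing; a real tool would want to decide what that case should mean.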

MSVC supports a similar kind of simple wild match for searching, but it doesn't seem to support replacing in the simple wild mode, which is too bad.

I'm doing a ton of Find-Replacing trying to clean up the Oodle public API, and it has made it clear to me how fucking awful the find-replace in most of our code editors is.

I wrote before about how "keep case" is an obvious feature that you should have for code find-replace. But there's so much more that you should expect from your find-rep. For example :

1. I frequently want to do things like rename "VFS_" to "OodleVFS_" , but only if it occurs at the beginning of a word (and of course with keep-case as well). So "only at head of word" or "only at tail of word" would be nice.

2. All modern code editors have syntax parsing so they know if words are types, variable names, comments, etc. I should be able to say do this find-replace but only apply it to function names.

An extremely simple "duh" check-box on any find-replace should be "search code" and "search comments". A lot of the time I want to do a find-rep only on code and not comments.

An even more sophisticated type-aware find-rep would let you do things like :

enum MyEnum
I want to find-rep "red" and make it "Oodle_MyEnum_Red" , but only where the word "red" is being used in a variable of type MyEnum.

That sounds like rather a lot to expect of your find-rep but by god no it is not. The computer knows how to do it; if it can compile the code it can do that find-rep very easily. What's outrageous is that a human being has to do it.

3. A very common annoyance for me is accidental repeated find-reps. That is, I'll do something like find-rep "eLZH_" to "OodleLZH_" , but if I accidentally do it twice I get "OodlOodleLZH_" which is something I didn't expect. Almost always when doing these kinds of big find-reps, once I fix a word it's done, so these problems could be avoided by having an option to exclude any match which has already been modified in the current find-rep session.

4. Obviously it should have a check box for "ignore whitespace that doesn't affect C". I shouldn't have to use regexp to mark up every spot where there could be benign whitespace in an expression. eg. if I search for "dumb(world)" and ignore C whitespace it should find "dumb ( world )" but not "du mb(world)".
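
None of these are hard to approximate with plain text matching. Here is a rough Python sketch of items 1, 3 and 4 above; all the function names are made up, and each one is a simplification. The repeat guard is stateless (it just checks whether the match already sits inside a copy of the replacement) instead of tracking a find-rep session, and the whitespace matcher splits the query into single-character tokens, so it would wrongly allow whitespace inside multi-character operators like "->". It also assumes a non-empty query.

```python
import re

def rename_at_word_head(text, old, new):
    # \b before `old` means the match must start a word,
    # so "MyVFS_Open" is left alone when renaming "VFS_".
    return re.sub(r"\b" + re.escape(old), lambda m: new, text)

def replace_no_repeat(text, old, new):
    # If `old` occurs inside `new`, a second pass would hit the output
    # of the first pass. Skip any match already embedded in `new`.
    offset = new.find(old)

    def fix(match):
        if offset >= 0:
            start = match.start() - offset
            if start >= 0 and text.startswith(new, start):
                return match.group(0)  # already converted, leave it
        return new

    return re.sub(re.escape(old), fix, text)

def c_whitespace_regex(query):
    # Split into word tokens and single punctuation characters, then
    # allow whitespace between tokens but never inside a word.
    tokens = re.findall(r"\w+|\S", query)
    is_word = lambda t: re.fullmatch(r"\w+", t) is not None
    parts = [re.escape(tokens[0])]
    for prev, tok in zip(tokens, tokens[1:]):
        # Two adjacent words need at least one space to stay separate.
        parts.append(r"\s+" if is_word(prev) and is_word(tok) else r"\s*")
        parts.append(re.escape(tok))
    return re.compile("".join(parts))

print(rename_at_word_head("VFS_Open(); MyVFS_Open();", "VFS_", "OodleVFS_"))
once = replace_no_repeat("eLZH_Foo", "eLZH_", "OodleLZH_")
print(once, replace_no_repeat(once, "eLZH_", "OodleLZH_"))
print(bool(c_whitespace_regex("dumb(world)").search("x = dumb ( world );")))
```

The type-aware cases in item 2 are a different matter; those need the compiler's view of the code, not text matching.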

etc. I'm sure if we could wipe out our preconceptions about how fucking lame the find-rep is, lots of ideas would come to mind about what it obviously should be able to do.

I see there are a bunch of commercial "Refactoring" (aka cleaning up code) tools that might do these types of things for you. In my experience those tools tend to be ungodly slow and flaky; part of the problem is they try to incrementally maintain a browse info database, and they always fuck it up. The compiler is plenty fast and I know it gets it right.


Anonymous said...

Doesn't work if "blah" contains commas, e.g. if blah is another function call with multiple parameters.

(and in fact provably unparseable with regular expressions)

Doesn't matter much if you avoid that sort of style, though, so you may have never run into it. But I think it's pretty perilous in general.

Anonymous said...

And yeah, you want a find/replace engine embedded in a programming IDE to be able to understand the syntax of the language, and e.g. (optionally) recognize token boundaries.

Knowing about types is a lot harder, unfortunately:

- it requires a full parse (and a full parse is a super PITA in C++ compared to C),

- to do it right you need a more sophisticated c-preprocessor than the one in the compiler (you need one that keeps a tighter correlation between input tokens and output tokens, since the types come from output tokens but the strings you're replacing are input tokens)

- #ifdefs fuck you regardless of whether you do preprocessor macro expansion or not

johnb said...

Most regex engines provide facilities for word-boundary matching (or more general boundary conditions).

MSDN says that it wants '<' to match the beginning of a word, '>' to match the end of a word.


Anonymous said...

For renaming enum values, I've found Visual Assist to do a fair job, and it will even (try to) find references to them in comments. It's not impossible to confuse it, but it presents a nice summary of what it's going to do before it does it, so you can usually catch any problems before they occur. Serious hidden mistakes seem to be no more likely than if you did the change yourself.

cbloom said...

Another one :

I should be able to do "insert default args" using the compiler to do text-pasting.

If you ever need to change or remove a default arg, you should be able to automatically paste the previous value into all places in the code that were using it.
