perl script question.

Matthew Seaman m.seaman at infracaninophile.co.uk
Sun Jan 11 03:52:46 PST 2004


On Sat, Jan 10, 2004 at 05:34:34PM -0800, Gary Kline wrote:
> On Sat, Jan 10, 2004 at 11:02:18PM +0000, Matthew Seaman wrote:

> >     perl -pi.bak -e 's/\s*\w+_\w+\.?//g;' filename

> 	The lines do indeed wrap so this does the job on a test file.
> 	I do have the re-exp book but this one is far over my head.
> 	What do the "\s*" mean, and also the "\.?/" ?

OK.  Time to dissect a regular expression.  Let's just isolate the RE
bits from the surrounding stuff:

    \s*\w+_\w+\.?

There are 5 parts to this:

   1 \s*
   2    \w+
   3       _
   4        \w+
   5           \.?

1) \s* -- '\s' is a metacharacter for matching whitespace: it's equivalent
   to saying [ \t\n\r\f].  The '*' operator says "any number of these,
   including zero".

2) \w+ -- '\w' is a metacharacter for matching 'word' characters.
   What it means is locale dependent, but if you're using the ASCII
   locale it corresponds to [a-zA-Z_0-9].  The '+' operator means "one
   or more of these".  Note that while \w+ matches character sequences
   containing _, it will also match words that don't: hence

3) _ -- match a literal '_' character.  i.e. this forces the matched
   text to contain at least one underscore.

4) \w+ -- as (2) matches the rest of the stuff_separated_by_underscores
   after the underscore we've forced a match to[1].

5) \.? -- \. matches a literal '.'.  It has to be escaped (with a \)
   because a plain '.' on its own is the wildcard that matches any
   character except a newline.  The '?' operator means "optional", or
   more precisely, either zero or one of those.

Now, the whole command (where ${re} stands for the RE above):

   perl -pi.bak -e 's/${re}//g;' filename

scans through the file line_by_line, matching strings_connected_with
underscores on each line.  Björn Andersson noticed that you would need
the 'g' option to the s/// substitution command, which means "repeat
this substitution for every match on the line", as needed for the first
line_of_this_paragraph.

Then I realised that there were situations, like the last line of the
previous paragraph, where there wouldn't be any leading whitespace to
match, which is why the leading whitespace is optional ('\s*').

Of course, this all depends on the sequences of words_connected_with_
underscores not wrapping around onto more than one line, as in this
contrived example, where the word 'underscores' on the second line of
this paragraph wouldn't be deleted.  There are several other edge
cases like that, if word-wrap is permitted.  It was never specified
whether that was the case, and I've assumed not, because coping with
that sort of thing is a bit trickier.

	Cheers,

	Matthew

[1] In fact, because \w+ is greedy, chunk (2) grabs as much as it can,
so the literal underscore (3) will actually match at the last
underscore that still leaves at least one character for chunk (4).
Unless the text ends in an underscore, the stuff matched by chunk (4)
won't contain any underscores.

-- 
Dr Matthew J Seaman MA, D.Phil.                       26 The Paddocks
                                                      Savill Way
PGP: http://www.infracaninophile.co.uk/pgpkey         Marlow
Tel: +44 1628 476614                                  Bucks., SL7 1TH UK