Data: Strings
The answer to this question is usually a regular expression, perhaps with auxiliary logic.
See the more specific questions (numbers, mail addresses, etc.) for details.
It depends just what you mean by ``escape''. URL escapes are dealt with in perlfaq9. Shell escapes with the
backslash (\) character are removed with
s/\\(.)/$1/g;
This won't expand "\n" or "\t" or any other
special escapes.
To turn "abbcccd" into "abccd":
s/(.)\1/$1/g; # add /s to include newlines
Here's a solution that turns "abbcccd" to "abcd":
y///cs; # y == tr, but shorter :-)
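A quick way to see the difference between the two one-liners is to run both on the same input. This is only a sketch; the variable names $pairs and $runs are illustrative.

```perl
use strict;
use warnings;

# Collapse each adjacent pair of identical characters into one character.
(my $pairs = "abbcccd") =~ s/(.)\1/$1/g;

# Squeeze every run of identical characters down to a single character.
(my $runs = "abbcccd") =~ tr///cs;

print "$pairs $runs\n";   # prints "abccd abcd"
```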
This is documented in perlref.
In general, this is fraught with quoting and readability problems, but it is possible. To
interpolate a subroutine call (in list context) into a string:
print "My sub returned @{[mysub(1,2,3)]} that time.\n";
If you prefer scalar context, similar chicanery is also useful for arbitrary expressions:
print "That yields ${\($n + 5)} widgets\n";
Version 5.004 of Perl had a bug that gave list context to the expression in ${...},
but this is fixed in version 5.005.
See also ``How can I expand variables in text strings?'' in this section of the FAQ.
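The two interpolation tricks can be tried side by side. The following is a minimal sketch: mysub here is a hypothetical sub written for the demonstration, and the explicit scalar() is one way to be sure of the context regardless of Perl version.

```perl
use strict;
use warnings;

# Hypothetical sub: returns its arguments in list context,
# and the number of arguments in scalar context.
sub mysub { return wantarray ? @_ : scalar @_ }

my $n = 3;

# List context: the anonymous array is dereferenced and joined with spaces.
my $list_ctx = "My sub returned @{[ mysub(1, 2, 3) ]} that time.";

# Scalar context, forced explicitly with scalar().
my $scalar_ctx = "My sub returned ${\ scalar(mysub(1, 2, 3))} that time.";

# An arbitrary expression.
my $expr = "That yields ${\($n + 5)} widgets";

print "$list_ctx\n$scalar_ctx\n$expr\n";
```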
This isn't something that can be done in one regular expression, no matter how complicated.
To find something between two single characters, a pattern like /x([^x]*)x/ will
get the intervening bits in $1. For multi-character delimiters, something more like /alpha(.*?)omega/
is needed. But neither of these deals with nested patterns, nor can they. For that you'll
have to write a parser.
If you are serious about writing a parser, there are a number of modules or oddities that
will make your life a lot easier. There are the CPAN modules Parse::RecDescent, Parse::Yapp, and
Text::Balanced; and the byacc program. Starting with Perl 5.8, Text::Balanced is part of the
standard distribution.
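For simple nested delimiters, Text::Balanced may already be enough. The sketch below uses its extract_bracketed function on a made-up input string; the input and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical input: the bracketed part must come first
# (leading whitespace is skipped by default).
my $text = "(outer (inner) more) trailing text";

# In list context, extract_bracketed returns the balanced substring
# followed by the remainder of the input.
my ($extracted, $remainder) = extract_bracketed($text, "()");

print "extracted: $extracted\n";
print "remainder: $remainder\n";
```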
One simple destructive, inside-out approach that you might try is to pull out the smallest
nesting parts one at a time:
while (s/BEGIN((?:(?!BEGIN)(?!END).)*)END//gs) {
    # do something with $1
}
A more complicated and sneaky approach is to make Perl's regular expression engine do it for
you. This is courtesy of Dean Inada, and rather has the nature of an Obfuscated Perl Contest entry,
but it really does work:
# $_ contains the string to parse
# BEGIN and END are the opening and closing markers for the
# nested text.
@( = ('(','');
@) = (')','');
($re=$_)=~s/((BEGIN)|(END)|.)/$)[!$3]\Q$1\E$([!$2]/gs;
@$ = (eval{/$re/},$@!~/unmatched/i);
print join("\n",@$[0..$#$]) if( $$[-1] );
Use reverse() in scalar context, as documented in perlfunc/reverse.
$reversed = reverse $string;
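The context matters because reverse() in list context reverses the order of a list, not the characters of a string. A small sketch of the difference, with illustrative variable names:

```perl
use strict;
use warnings;

my $string = "hello";

# Scalar assignment gives scalar context: the characters are reversed.
my $reversed = reverse $string;

# List context: a one-element list is "reversed", so nothing changes.
my @list = reverse $string;

# Force scalar context where list context would otherwise apply,
# for example in the argument list of print.
my $forced = scalar reverse($string);

print "$reversed $list[0] $forced\n";   # prints "olleh hello olleh"
```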
You can do it yourself:
1 while $string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
Or you can just use the Text::Tabs module (part of the standard Perl distribution).
use Text::Tabs;
@expanded_lines = expand(@lines_with_tabs);
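To check that the hand-rolled substitution and Text::Tabs agree, both can be run on the same line. This is a sketch with a made-up input; it assumes the default tab stop of 8 columns.

```perl
use strict;
use warnings;
use Text::Tabs;   # exports expand(); $tabstop defaults to 8

my $line = "ab\tcd\t\tef";

# Manual expansion, as in the one-liner above.
my $manual = $line;
1 while $manual =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;

# Module-based expansion of the same line (list assignment,
# because expand() returns a list).
my ($expanded) = expand($line);

print "same result\n" if $manual eq $expanded;
```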
Use Text::Wrap (part of the standard Perl distribution):
use Text::Wrap;
print wrap("\t", ' ', @paragraphs);
The paragraphs you give to Text::Wrap should not contain embedded newlines. Text::Wrap
doesn't justify the lines (flush-right).
Or use the CPAN module Text::Autoformat. Formatting files can be easily done by making a
shell alias, like so:
alias fmt="perl -i -MText::Autoformat -n0777 \
-e 'print autoformat $_, {all=>1}' $*"
See the documentation for Text::Autoformat to appreciate its many capabilities.
There are many ways. If you just want to grab a copy, use substr():
$first_byte = substr($a, 0, 1);
If you want to modify part of a string, the simplest way is often to use substr() as an
lvalue:
substr($a, 0, 3) = "Tom";
Although those with a pattern matching kind of thought process will likely prefer
$a =~ s/^.../Tom/;
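A third option is the four-argument form of substr(), which replaces the selected part and returns what used to be there. A minimal sketch, with $name and $old as illustrative variables:

```perl
use strict;
use warnings;

my $name = "Bob Smith";

# Four-argument substr: replace the first three characters in place
# and get the replaced text back as the return value.
my $old = substr($name, 0, 3, "Tom");

print "$old -> $name\n";   # prints "Bob -> Tom Smith"
```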
You have to keep track of N yourself. For example, let's say you want to change the fifth
occurrence of "whoever" or "whomever" into "whosoever"
or "whomsoever", case insensitively. These all assume that $_ contains
the string to be altered.
$count = 0;
s{((whom?)ever)}{
    ++$count == 5       # is it the 5th?
        ? "${2}soever"  # yes, swap
        : $1            # renege and leave it there
}ige;
In the more general case, you can use the /g modifier in a while
loop, keeping count of matches.
$WANT = 3;
$count = 0;
$_ = "One fish two fish red fish blue fish";
while (/(\w+)\s+fish\b/gi) {
    if (++$count == $WANT) {
        print "The third fish is a $1 one.\n";
    }
}
That prints out: "The third fish is a red one." You can also use a
repetition count and repeated pattern like this:
/(?:\w+\s+fish\s+){2}(\w+)\s+fish/i;
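The counting loop generalizes into a small helper. The sub below, nth_match, is a hypothetical name introduced for this sketch; it returns the text of the Nth match of a pattern, or undef when there are fewer than N matches.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the $n-th match of $re in $str,
# or undef if the pattern matches fewer than $n times.
sub nth_match {
    my ($str, $re, $n) = @_;
    my $count = 0;
    while ($str =~ /$re/g) {
        # @- and @+ hold the start and end offsets of the last match.
        return substr($str, $-[0], $+[0] - $-[0]) if ++$count == $n;
    }
    return;
}

my $fish  = "One fish two fish red fish blue fish";
my $third = nth_match($fish, qr/\w+\s+fish\b/i, 3);
print "$third\n";   # prints "red fish"
```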
There are a number of ways, with varying efficiency. If you want a count of a certain single
character (X) within a string, you can use the tr/// function like so:
$string = "ThisXlineXhasXsomeXx'sXinXit";
$count = ($string =~ tr/X//);
print "There are $count X characters in the string";
This is fine if you are just looking for a single character. However, if you are trying to
count multiple character substrings within a larger string, tr/// won't work. What
you can do is wrap a while() loop around a global pattern match. For example, let's count
negative integers:
$string = "-9 55 48 -2 23 -76 4 14 -44";
$count = 0;   # reset, or the total from the previous example carries over
while ($string =~ /-\d+/g) { $count++ }
print "There are $count negative numbers in the string";
Another version uses a global match in list context, then assigns the result to a scalar,
producing a count of the number of matches.
$count = () = $string =~ /-\d+/g;
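The three counting techniques can be compared on the same data. This is a sketch with lexical variables whose names are chosen for illustration; each count starts from a known state so nothing carries over between examples.

```perl
use strict;
use warnings;

my $string = "-9 55 48 -2 23 -76 4 14 -44";

# Loop form: count each successful global match.
my $loop_count = 0;
$loop_count++ while $string =~ /-\d+/g;

# List-assignment form: the empty list forces list context on the match,
# and the outer scalar assignment yields the number of matches.
my $list_count = () = $string =~ /-\d+/g;

# tr/// counts single characters and leaves the string unchanged.
my $minus_signs = ($string =~ tr/-//);

print "$loop_count $list_count $minus_signs\n";   # prints "4 4 4"
```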
To make the first letter of each word upper case:
$line =~ s/\b(\w)/\U$1/g;
This has the strange effect of turning "don't do it" into "Don'T
Do It". Sometimes you might want this. Other times you might need a more thorough
solution (Suggested by brian d foy):
$string =~ s/ (
    (^\w)    # at the beginning of the line
    |        # or
    (\s\w)   # preceded by whitespace
    )
    /\U$1/xg;
$string =~ s/([\w']+)/\u\L$1/g;
To make the whole line upper case:
$line = uc($line);
To force each word to be lower case, with the first letter upper case:
$line =~ s/(\w+)/\u\L$1/g;
You can (and probably should) enable locale awareness of those characters by placing a use
locale pragma in your program. See perllocale for endless details
on locales.
This is sometimes referred to as putting something into "title case", but that's
not quite accurate. Consider the proper capitalization of the movie Dr. Strangelove or: How I
Learned to Stop Worrying and Love the Bomb, for example.
Take the example case of trying to split a string that is comma-separated into its different
fields. (We'll pretend you said comma-separated, not comma-delimited, which is different and
almost never what you mean.) You can't use split(/,/) because you shouldn't split
if the comma is inside quotes. For example, take a data line like this:
SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"Error, Core Dumped"
Due to the restriction of the quotes, this is a fairly complex problem. Thankfully, we have
Jeffrey Friedl, author of a highly recommended book on regular expressions, to handle these for
us. He suggests (assuming your string is contained in $text):
@new = ();
push(@new, $+) while $text =~ m{
    "([^\"\\]*(?:\\.[^\"\\]*)*)",?  # groups the phrase inside the quotes
    | ([^,]+),?
    | ,
}gx;
push(@new, undef) if substr($text,-1,1) eq ',';
If you want to represent quotation marks inside a quotation-mark-delimited field, escape them
with backslashes (e.g., "like \"this\""). Unescaping them is a
task addressed earlier in this section.
Alternatively, the Text::ParseWords module (part of the standard Perl distribution) lets you
say:
use Text::ParseWords;
@new = quotewords(",", 0, $text);
There's also a Text::CSV (Comma-Separated Values) module on CPAN.
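As a check, here is quotewords() applied to the sample data line from above. This is a sketch only; note that Text::ParseWords also treats single quotes as quoting characters, which this sample does not exercise.

```perl
use strict;
use warnings;
use Text::ParseWords;   # exports quotewords()

my $text = q{SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"Error, Core Dumped"};

# A second argument of 0 means the quotes are stripped from the fields.
my @fields = quotewords(",", 0, $text);

printf "%d fields; the third is '%s'\n", scalar @fields, $fields[2];
```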
Although the simplest approach would seem to be
$string =~ s/^\s*(.*?)\s*$/$1/;
not only is this unnecessarily slow and destructive, it also fails with embedded newlines. It
is much faster to do this operation in two steps:
$string =~ s/^\s+//;
$string =~ s/\s+$//;
Or more nicely written as:
for ($string) {
    s/^\s+//;
    s/\s+$//;
}
This idiom takes advantage of the foreach loop's aliasing behavior to factor out
common code. You can do this on several strings at once, or arrays, or even the values of a hash
if you use a slice:
# trim whitespace in the scalar, the array,
# and all the values in the hash
foreach ($scalar, @array, @hash{keys %hash}) {
    s/^\s+//;
    s/\s+$//;
}
(This answer contributed by Uri Guttman, with kibitzing from Bart Lateur.)
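If you would rather not modify the originals, the same two substitutions can be wrapped in a function that works on copies. The sub name trim is an assumption for this sketch; it is not a Perl builtin in the versions this FAQ covers.

```perl
use strict;
use warnings;

# Hypothetical helper: returns trimmed copies and leaves its arguments alone.
sub trim {
    my @copies = @_;
    for (@copies) {      # $_ aliases each copy in turn
        s/^\s+//;
        s/\s+$//;
    }
    return wantarray ? @copies : $copies[0];
}

my $raw   = "  \t padded text \n";
my $clean = trim($raw);
my @clean = trim(" a ", "b  ", "  c");

print "[$clean] [@clean]\n";   # prints "[padded text] [a b c]"
```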
In the following examples, $pad_len is the length to which you wish to pad the
string, $text or $num contains the string to be padded, and $pad_char
contains the padding character. You can use a single character string constant instead of the $pad_char
variable if you know what it is in advance. And in the same way you can use an integer in place
of $pad_len if you know the pad length in advance.
The simplest method uses the sprintf function. It can pad on the left or right
with blanks and on the left with zeroes and it will not truncate the result. The pack
function can only pad strings on the right with blanks and it will truncate the result to a
maximum length of $pad_len.
# Left padding a string with blanks (no truncation):
$padded = sprintf("%${pad_len}s", $text);
# Right padding a string with blanks (no truncation):
$padded = sprintf("%-${pad_len}s", $text);
# Left padding a number with 0 (no truncation):
$padded = sprintf("%0${pad_len}d", $num);
# Right padding a string with blanks using pack (will truncate):
$padded = pack("A$pad_len",$text);
If you need to pad with a character other than blank or zero you can use one of the following
methods. They all generate a pad string with the x operator and combine that with $text.
These methods do not truncate $text.
Left and right padding with any character, creating a new string:
$padded = $pad_char x ( $pad_len - length( $text ) ) . $text;
$padded = $text . $pad_char x ( $pad_len - length( $text ) );
Left and right padding with any character, modifying $text directly:
substr( $text, 0, 0 ) = $pad_char x ( $pad_len - length( $text ) );
$text .= $pad_char x ( $pad_len - length( $text ) );
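Running the methods on concrete values makes the differences visible, in particular that pack truncates and sprintf does not. The values below are made up for the sketch.

```perl
use strict;
use warnings;

my ($text, $num, $pad_len, $pad_char) = ("abc", 42, 6, "*");

my $left   = sprintf("%${pad_len}s",  $text);        # blanks on the left
my $right  = sprintf("%-${pad_len}s", $text);        # blanks on the right
my $zeroes = sprintf("%0${pad_len}d", $num);         # zeroes on the left
my $packed = pack("A$pad_len", "abcdefghij");        # truncated to 6 characters
my $long   = sprintf("%${pad_len}s", "abcdefghij");  # not truncated

# Padding with an arbitrary character, built with the x operator.
my $stars_left  = $pad_char x ($pad_len - length($text)) . $text;
my $stars_right = $text . $pad_char x ($pad_len - length($text));

print "[$left] [$right] [$zeroes] [$packed] [$stars_left] [$stars_right]\n";
```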
Use substr() or unpack(), both documented in perlfunc. If you prefer thinking
in terms of columns instead of widths, you can use this kind of thing:
# determine the unpack format needed to split Linux ps output
# arguments are cut columns
my $fmt = cut2fmt(8, 14, 20, 26, 30, 34, 41, 47, 59, 63, 67, 72);
sub cut2fmt {
    my(@positions) = @_;
    my $template = '';
    my $lastpos = 1;
    for my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos = $place;
    }
    $template .= "A*";
    return $template;
}
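To see what the generated template looks like and how it feeds unpack(), here is a small sketch with two cut columns. The input line is fabricated for the example rather than real ps output, and cut2fmt is repeated so the sketch runs on its own.

```perl
use strict;
use warnings;

# Same logic as cut2fmt above, repeated so this sketch is self-contained.
sub cut2fmt {
    my (@positions) = @_;
    my ($template, $lastpos) = ('', 1);
    for my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos = $place;
    }
    return $template . "A*";
}

# Fields start at columns 1, 8 and 14.
my $fmt = cut2fmt(8, 14);

# Made-up fixed-width line: a 7-column field, a 6-column field, then the rest.
my $line = sprintf("%-7s%-6s%s", "alice", "4242", "perl -w");

# The "A" format strips trailing blanks from each field.
my ($user, $pid, $cmd) = unpack($fmt, $line);

print "$fmt => [$user] [$pid] [$cmd]\n";
```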
Use the standard Text::Soundex module distributed with Perl. Before you do so, you may want
to determine whether `soundex' is in fact what you think it is. Knuth's soundex algorithm
compresses words into a small space, and so it does not necessarily distinguish between two
words which you might want to appear separately. For example, the last names `Knuth' and `Kant'
are both mapped to the soundex code K530. If Text::Soundex does not do what you are looking for,
you might want to consider the String::Approx module available at CPAN.
Let's assume that you have a string like:
$text = 'this has a $foo in it and a $bar';
If those were both global variables, then this would suffice:
$text =~ s/\$(\w+)/${$1}/g; # no /e needed
But since they are probably lexicals, or at least, they could be, you'd have to do this:
$text =~ s/(\$\w+)/$1/eeg;
die if $@; # needed /ee, not /e
It's probably better in the general case to treat those variables as entries in some special
hash. For example:
%user_defs = (
    foo => 23,
    bar => 19,
);
$text =~ s/\$(\w+)/$user_defs{$1}/g;
See also ``How do I expand function calls in a string?'' in this section of the FAQ.
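One detail the hash approach leaves open is what to do with names that are not in the hash. The sketch below leaves unknown names untouched instead of replacing them with an empty string; the $baz placeholder and the $expanded variable are assumptions added for illustration.

```perl
use strict;
use warnings;

my %user_defs = (
    foo => 23,
    bar => 19,
);

my $text = 'this has a $foo in it and a $bar and a $baz';

# Replace known names with their values; put unknown ones back verbatim.
(my $expanded = $text) =~ s{\$(\w+)}{
    exists $user_defs{$1} ? $user_defs{$1} : "\$$1"
}ge;

print "$expanded\n";   # prints: this has a 23 in it and a 19 and a $baz
```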
The problem is that those double-quotes force stringification--coercing numbers and
references into strings--even when you don't want them to be strings. Think of it this way:
double-quote expansion is used to produce new strings. If you already have a string, why do you
need more?
If you get used to writing odd things like these:
print "$var"; # BAD
$new = "$old"; # BAD
somefunc("$var"); # BAD
You'll be in trouble. Those should (in 99.8% of the cases) be the simpler and more direct:
print $var;
$new = $old;
somefunc($var);
Otherwise, besides slowing you down, you're going to break code when the thing in the scalar
is actually neither a string nor a number, but a reference:
func(\@array);
sub func {
    my $aref = shift;
    my $oref = "$aref";  # WRONG
}
You can also get into subtle problems on those few operations in Perl that actually do care
about the difference between a string and a number, such as the magical ++
autoincrement operator or the syscall() function.
Stringification also destroys arrays.
@lines = `command`;
print "@lines"; # WRONG - extra blanks
print @lines; # right
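Both failure modes are easy to reproduce. The following sketch shows that a quoted reference is only a string that can no longer be dereferenced, and that a quoted array picks up a separator between elements. Variable names are illustrative.

```perl
use strict;
use warnings;

my @array = (1, 2, 3);
my $aref  = \@array;
my $str   = "$aref";    # something like "ARRAY(0x...)", now just a string

my $is_ref = ref($str) ? 1 : 0;                # 0: no longer a reference
my $died   = !eval { my @copy = @$str; 1 };    # true: strict refs refuses it

# Interpolating an array joins the elements with $" (a space by default).
my @lines  = ("a\n", "b\n");
my $quoted = "@lines";           # "a\n b\n", with an extra blank
my $plain  = join '', @lines;    # "a\nb\n", what print @lines would output

print "is_ref=$is_ref died=$died\n";
```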
Check for these three things:
1. There must be no space after the << part.
2. There (probably) should be a semicolon at the end.
3. You can't (easily) have any space in front of the tag.
If you want to indent the text in the here document, you can do this:
# all in one
($VAR = <<HERE_TARGET) =~ s/^\s+//gm;
    your text
    goes here
HERE_TARGET
But the HERE_TARGET must still be flush against the margin. If you want that indented also,
you'll have to quote in the indentation.
($quote = <<'    FINIS') =~ s/^\s+//gm;
        ...we will have peace, when you and all your works have
        perished--and the works of your dark master to whom you
        would deliver us. You are a liar, Saruman, and a corrupter
        of men's hearts. --Theoden in /usr/src/perl/taint.c
    FINIS
$quote =~ s/\s+--/\n--/;
A nice general-purpose fixer-upper function for indented here documents follows. It expects
to be called with a here document as its argument. It looks to see whether each line begins with
a common substring, and if so, strips that substring off. Otherwise, it takes the amount of
leading whitespace found on the first line and removes that much off each subsequent line.
sub fix {
    local $_ = shift;
    my ($white, $leader);  # common whitespace and common leading string
    if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\1\2?.*\n)+$/) {
        ($white, $leader) = ($2, quotemeta($1));
    } else {
        ($white, $leader) = (/^(\s+)/, '');
    }
    s/^\s*?$leader(?:$white)?//gm;
    return $_;
}
This works with leading special strings, dynamically determined:
$remember_the_main = fix<<'    MAIN_INTERPRETER_LOOP';
    @@@ int
    @@@ runops() {
    @@@     SAVEI32(runlevel);
    @@@     runlevel++;
    @@@     while ( op = (*op->op_ppaddr)() );
    @@@     TAINT_NOT;
    @@@     return 0;
    @@@ }
    MAIN_INTERPRETER_LOOP
Or with a fixed amount of leading whitespace, with remaining indentation correctly preserved:
$poem = fix<<EVER_ON_AND_ON;
   Now far ahead the Road has gone,
      And I must follow, if I can,
   Pursuing it with eager feet,
      Until it joins some larger way
   Where many paths and errands meet.
      And whither then? I cannot say.
            --Bilbo in /usr/src/perl/pp_ctl.c
EVER_ON_AND_ON