Data: Strings

How do I validate input?

The answer to this question is usually a regular expression, perhaps with auxiliary logic. See the more specific questions (numbers, mail addresses, etc.) for details.

How do I unescape a string?

It depends on just what you mean by ``escape''. URL escapes are dealt with in perlfaq9. Shell escapes with the backslash (\) character are removed with

 
    s/\\(.)/$1/g;  

This won't expand "\n" or "\t" or any other special escapes.
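
If the common C-style escapes should be expanded as well, one option is a small lookup table. The following is only a sketch: the table %escape_for and the sub name unescape are invented for illustration, and the set of escapes handled is an assumption you would adjust to your data.

```perl
use strict;
use warnings;

# Hypothetical lookup table: escape letter => the character it stands for.
my %escape_for = (
    n => "\n",
    t => "\t",
    r => "\r",
    0 => "\0",
);

# Expand the escapes listed in the table; for any other escaped
# character just drop the backslash, as s/\\(.)/$1/g would.
sub unescape {
    my ($string) = @_;
    $string =~ s/\\(.)/exists $escape_for{$1} ? $escape_for{$1} : $1/gse;
    return $string;
}

# Single quotes keep the backslash, so this passes backslash-t to unescape().
print unescape('name\tvalue'), "\n";
```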

How do I remove consecutive pairs of characters?

To turn "abbcccd" into "abccd":

 
    s/(.)\1/$1/g;	# add /s to include newlines  

Here's a solution that turns "abbcccd" to "abcd":

 
    y///cs;	# y == tr, but shorter :-)  

How do I expand function calls in a string?

This is documented in perlref. In general, this is fraught with quoting and readability problems, but it is possible. To interpolate a subroutine call (in list context) into a string:

 
    print "My sub returned @{[mysub(1,2,3)]} that time.\n";  

If you prefer scalar context, similar chicanery is also useful for arbitrary expressions:

 
    print "That yields ${\($n + 5)} widgets\n";  

Version 5.004 of Perl had a bug that gave list context to the expression in ${...}, but this is fixed in version 5.005.
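
To see which context each construct supplies, a small sketch like the following can help. The sub context() is hypothetical and exists only to report how it was called; the explicit scalar() is there because a reference taken directly to a function call does not by itself force scalar context on the call.

```perl
use strict;
use warnings;

# Hypothetical sub that reports the context it was called in.
sub context { return wantarray ? "list" : "scalar" }

# The anonymous array constructor supplies list context.
my $in_list = "called in @{[ context() ]} context";

# scalar() forces scalar context before the reference is taken.
my $in_scalar = "called in ${\ scalar context() } context";

print "$in_list\n$in_scalar\n";
```

When readability matters more than cleverness, plain concatenation or sprintf() does the same job without the punctuation.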

See also ``How can I expand variables in text strings?'' in this section of the FAQ.

How do I find matching/nesting anything?

This isn't something that can be done in one regular expression, no matter how complicated. To find something between two single characters, a pattern like /x([^x]*)x/ will get the intervening bits in $1. For multi-character delimiters, something more like /alpha(.*?)omega/ would be needed. But neither of these deals with nested patterns, nor can they. For that you'll have to write a parser.

If you are serious about writing a parser, there are a number of modules or oddities that will make your life a lot easier. There are the CPAN modules Parse::RecDescent, Parse::Yapp, and Text::Balanced; and the byacc program. As of perl 5.8, Text::Balanced is part of the standard distribution.
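
As a minimal sketch of what Text::Balanced offers, extract_bracketed() pulls a balanced, possibly nested, bracketed chunk off the front of a string. The sample text and variable names here are made up for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = "(outer (inner) text) and the rest";

# In list context extract_bracketed() returns the balanced substring
# (delimiters included) followed by the remainder of the input.
my ($extracted, $remainder) = extract_bracketed($text, "()");

print "extracted: $extracted\n";
print "remainder: $remainder\n";
```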

One simple destructive, inside-out approach that you might try is to pull out the smallest nesting parts one at a time:

 
    while (s/BEGIN((?:(?!BEGIN)(?!END).)*)END//gs) {
	# do something with $1
    }   

A more complicated and sneaky approach is to make Perl's regular expression engine do it for you. This is courtesy Dean Inada, and rather has the nature of an Obfuscated Perl Contest entry, but it really does work:

 
    # $_ contains the string to parse
    # BEGIN and END are the opening and closing markers for the
    # nested text.

    @( = ('(','');
    @) = (')','');
    ($re=$_)=~s/((BEGIN)|(END)|.)/$)[!$3]\Q$1\E$([!$2]/gs;
    @$ = (eval{/$re/},$@!~/unmatched/i);
    print join("\n",@$[0..$#$]) if( $$[-1] );  

How do I reverse a string?

Use reverse() in scalar context, as documented in perlfunc/reverse.

 
    $reversed = reverse $string;  
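
The context matters: in list context reverse() reverses the order of its arguments rather than the characters of a string. A short sketch, with invented variable names, showing the difference:

```perl
use strict;
use warnings;

my $string = "stressed";

my $reversed  = reverse $string;          # scalar assignment: "desserts"
my @unchanged = reverse $string;          # list context: a one-element list, reversed
my $forced    = scalar reverse $string;   # explicit scalar context: "desserts"

# print supplies list context, so scalar() is needed here.
print scalar reverse($string), "\n";
```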

How do I expand tabs in a string?

You can do it yourself:

 
    1 while $string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;  

Or you can just use the Text::Tabs module (part of the standard Perl distribution).

 
    use Text::Tabs;
    @expanded_lines = expand(@lines_with_tabs);  
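
Text::Tabs also exports a $tabstop variable for tab stops other than the default of 8. A small sketch with made-up input:

```perl
use strict;
use warnings;
use Text::Tabs;    # exports expand(), unexpand() and $tabstop

$tabstop = 4;      # the default is 8

my @expanded = expand("a\tb", "abc\td");

# With a tab stop of 4: "a" plus three spaces plus "b",
# and "abc" plus one space plus "d".
print "[$_]\n" for @expanded;
```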

How do I reformat a paragraph?

Use Text::Wrap (part of the standard Perl distribution):

 
    use Text::Wrap;
    print wrap("\t", '  ', @paragraphs);  

The paragraphs you give to Text::Wrap should not contain embedded newlines. Text::Wrap doesn't justify the lines (flush-right).
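
The line width is controlled through $Text::Wrap::columns. The following sketch uses an arbitrary width of 40 and made-up text; wrap() keeps each output line shorter than that value.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

$Text::Wrap::columns = 40;    # output lines stay under 40 characters

my $paragraph = "The quick brown fox jumps over the lazy dog. " x 4;

# First argument indents the first line, second the remaining lines.
my $wrapped = wrap("    ", "", $paragraph);
print $wrapped, "\n";
```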

Or use the CPAN module Text::Autoformat. Formatting files can be easily done by making a shell alias, like so:

 
    alias fmt="perl -i -MText::Autoformat -n0777 \
        -e 'print autoformat $_, {all=>1}' $*"  

See the documentation for Text::Autoformat to appreciate its many capabilities.

How can I access/change the first N letters of a string?

There are many ways. If you just want to grab a copy, use substr():

 
    $first_byte = substr($a, 0, 1);  

If you want to modify part of a string, the simplest way is often to use substr() as an lvalue:

 
    substr($a, 0, 3) = "Tom";  

Although those with a pattern matching kind of thought process will likely prefer

 
    $a =~ s/^.../Tom/;  
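
substr() also accepts a replacement string as a fourth argument, which replaces in place and returns the part that was there before. A brief sketch with invented data:

```perl
use strict;
use warnings;

my $name = "Bob Smith";

# Four-argument substr: replace the first three characters and
# get the old ones back.
my $old = substr($name, 0, 3, "Tom");    # $name is "Tom Smith", $old is "Bob"
my $after_first = $name;

# The replacement need not be the same length as the part replaced.
substr($name, 0, 3) = "Thomas";          # $name is "Thomas Smith"

print "$old -> $after_first -> $name\n";
```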

How do I change the Nth occurrence of something?

You have to keep track of N yourself. For example, let's say you want to change the fifth occurrence of "whoever" or "whomever" into "whosoever" or "whomsoever", case insensitively. These all assume that $_ contains the string to be altered.

 
    $count = 0;
    s{((whom?)ever)}{
	++$count == 5   	# is it the 5th?
	    ? "${2}soever"	# yes, swap
	    : $1		# renege and leave it there
    }ige;  

In the more general case, you can use the /g modifier in a while loop, keeping count of matches.

 
    $WANT = 3;
    $count = 0;
    $_ = "One fish two fish red fish blue fish";
    while (/(\w+)\s+fish\b/gi) {
        if (++$count == $WANT) {
            print "The third fish is a $1 one.\n";
        }
    }  

That prints out: "The third fish is a red one." You can also use a repetition count and repeated pattern like this:

 
    /(?:\w+\s+fish\s+){2}(\w+)\s+fish/i;  
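
The counting technique can be wrapped in a reusable routine. This is a sketch only; the name replace_nth and its argument order are assumptions made for the example, and the replacement is treated as a literal string.

```perl
use strict;
use warnings;

# Hypothetical helper: replace only the $n-th match of $pattern in a
# copy of $string and return the result.
sub replace_nth {
    my ($string, $pattern, $n, $replacement) = @_;
    my $count = 0;
    # The outer parentheses make $1 the whole match, which is put
    # back unchanged for every occurrence except the $n-th.
    $string =~ s{($pattern)}{ ++$count == $n ? $replacement : $1 }ge;
    return $string;
}

print replace_nth("One fish two fish red fish blue fish", qr/fish/, 3, "herring"), "\n";
```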

How can I count the number of occurrences of a substring within a string?

There are a number of ways, with varying efficiency. If you want a count of a certain single character (X) within a string, you can use the tr/// function like so:

 
    $string = "ThisXlineXhasXsomeXx'sXinXit";
    $count = ($string =~ tr/X//);
    print "There are $count X characters in the string";  

This is fine if you are just looking for a single character. However, if you are trying to count multiple character substrings within a larger string, tr/// won't work. What you can do is wrap a while() loop around a global pattern match. For example, let's count negative integers:

 
    $string = "-9 55 48 -2 23 -76 4 14 -44";
    while ($string =~ /-\d+/g) { $count++ }
    print "There are $count negative numbers in the string";  

Another version uses a global match in list context, then assigns the result to a scalar, producing a count of the number of matches.

 
    $count = () = $string =~ /-\d+/g;  
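
For a literal multi-character substring, where no regular expression is needed, index() in a loop is another possibility. The helper name count_substr is invented for this sketch, and it counts non-overlapping occurrences only.

```perl
use strict;
use warnings;

# Hypothetical helper: count non-overlapping occurrences of a literal substring.
sub count_substr {
    my ($string, $substr) = @_;
    return 0 unless length $substr;        # avoid an endless loop on ""
    my ($count, $pos) = (0, 0);
    while (($pos = index($string, $substr, $pos)) != -1) {
        $count++;
        $pos += length $substr;            # step past this occurrence
    }
    return $count;
}

my $string   = "-9 55 48 -2 23 -76 4 14 -44";
my $by_index = count_substr($string, "-");
my $by_regex = () = $string =~ /-\d+/g;

print "index: $by_index, regex: $by_regex\n";
```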

How do I capitalize all the words on one line?

To make the first letter of each word upper case:

 
        $line =~ s/\b(\w)/\U$1/g;  

This has the strange effect of turning "don't do it" into "Don'T Do It". Sometimes you might want this. Other times you might need a more thorough solution (Suggested by brian d foy):

 
    $string =~ s/ (
                 (^\w)    #at the beginning of the line
                   |      # or
                 (\s\w)   #preceded by whitespace
                   )
                /\U$1/xg;
    $string =~ s/([\w']+)/\u\L$1/g;  

To make the whole line upper case:

 
        $line = uc($line);  

To force each word to be lower case, with the first letter upper case:

 
        $line =~ s/(\w+)/\u\L$1/g;  

You can (and probably should) enable locale awareness of those characters by placing a use locale pragma in your program. See perllocale for endless details on locales.

This is sometimes referred to as putting something into "title case", but that's not quite accurate. Consider the proper capitalization of the movie Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb, for example.

How can I split a [character] delimited string except when inside [character]? (Comma-separated files)

Take the example case of trying to split a string that is comma-separated into its different fields. (We'll pretend you said comma-separated, not comma-delimited, which is different and almost never what you mean.) You can't use split(/,/) because you shouldn't split if the comma is inside quotes. For example, take a data line like this:

 
    SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"Error, Core Dumped"  

Due to the restriction of the quotes, this is a fairly complex problem. Thankfully, we have Jeffrey Friedl, author of a highly recommended book on regular expressions, to handle these for us. He suggests (assuming your string is contained in $text):

 
     @new = ();
     push(@new, $+) while $text =~ m{
         "([^\"\\]*(?:\\.[^\"\\]*)*)",?  # groups the phrase inside the quotes
       | ([^,]+),?
       | ,
     }gx;
     push(@new, undef) if substr($text,-1,1) eq ',';  

If you want to represent quotation marks inside a quotation-mark-delimited field, escape them with backslashes (e.g., "like \"this\""). Unescaping them is a task addressed earlier in this section.

Alternatively, the Text::ParseWords module (part of the standard Perl distribution) lets you say:

 
    use Text::ParseWords;
    @new = quotewords(",", 0, $text);  

There's also a Text::CSV (Comma-Separated Values) module on CPAN.
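
Applied to the sample data line from above, the Text::ParseWords approach can be tried out like this. The sketch only prints the fields; with this input, quotewords() should yield 11 of them, with the embedded commas preserved.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

my $text = 'SAR001,"","Cimetrix, Inc","Bob Smith","CAM",N,8,1,0,7,"Error, Core Dumped"';

# A false second argument strips the quotes from the returned fields.
my @new = quotewords(",", 0, $text);

printf "%d fields\n", scalar @new;
print "[$_]\n" for @new;
```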

How do I strip blank space from the beginning/end of a string?

Although the simplest approach would seem to be

 
    $string =~ s/^\s*(.*?)\s*$/$1/;  

not only is this unnecessarily slow and destructive, it also fails with embedded newlines. It is much faster to do this operation in two steps:

 
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;  

Or more nicely written as:

 
    for ($string) {
	s/^\s+//;
	s/\s+$//;
    }  

This idiom takes advantage of the foreach loop's aliasing behavior to factor out common code. You can do this on several strings at once, or arrays, or even the values of a hash if you use a slice:

 
    # trim whitespace in the scalar, the array, 
    # and all the values in the hash
    foreach ($scalar, @array, @hash{keys %hash}) {
        s/^\s+//;
        s/\s+$//;
    }  
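
When the original string must stay intact, the same idiom fits in a small function that works on a copy. The name trim is chosen for this sketch and is not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: return a trimmed copy, leaving the argument alone.
sub trim {
    my ($string) = @_;
    for ($string) {
        s/^\s+//;
        s/\s+$//;
    }
    return $string;
}

my $raw   = "  \t padded text \n";
my $clean = trim($raw);

print "[$clean]\n";
```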

How do I pad a string with blanks or pad a number with zeroes?

(This answer contributed by Uri Guttman, with kibitzing from Bart Lateur.)

In the following examples, $pad_len is the length to which you wish to pad the string, $text or $num contains the string to be padded, and $pad_char contains the padding character. You can use a single character string constant instead of the $pad_char variable if you know what it is in advance. And in the same way you can use an integer in place of $pad_len if you know the pad length in advance.

The simplest method uses the sprintf function. It can pad on the left or right with blanks and on the left with zeroes and it will not truncate the result. The pack function can only pad strings on the right with blanks and it will truncate the result to a maximum length of $pad_len.

 
    # Left padding a string with blanks (no truncation):
    $padded = sprintf("%${pad_len}s", $text);

    # Right padding a string with blanks (no truncation):
    $padded = sprintf("%-${pad_len}s", $text);

    # Left padding a number with 0 (no truncation): 
    $padded = sprintf("%0${pad_len}d", $num);

    # Right padding a string with blanks using pack (will truncate):
    $padded = pack("A$pad_len",$text);  

If you need to pad with a character other than blank or zero you can use one of the following methods. They all generate a pad string with the x operator and combine that with $text. These methods do not truncate $text.

Left and right padding with any character, creating a new string:

 
    $padded = $pad_char x ( $pad_len - length( $text ) ) . $text;
    $padded = $text . $pad_char x ( $pad_len - length( $text ) );  

Left and right padding with any character, modifying $text directly:

 
    substr( $text, 0, 0 ) = $pad_char x ( $pad_len - length( $text ) );
    $text .= $pad_char x ( $pad_len - length( $text ) );  
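
A worked example with concrete values may make the variants easier to compare. The values and variable names below are arbitrary; the last part clamps the repeat count at zero, on the assumption that you would rather not pass a negative count to the x operator when the text is already longer than $pad_len.

```perl
use strict;
use warnings;

my ($pad_len, $pad_char) = (8, '*');
my $text = "abc";
my $num  = 42;

my $left_blank  = sprintf("%${pad_len}s",  $text);                  # "     abc"
my $right_blank = sprintf("%-${pad_len}s", $text);                  # "abc     "
my $zero_padded = sprintf("%0${pad_len}d", $num);                   # "00000042"
my $left_star   = $pad_char x ($pad_len - length $text) . $text;    # "*****abc"

# Text longer than $pad_len: clamp the fill so the count is never negative.
my $long = "abcdefghijkl";
my $fill = $pad_len - length $long;
$fill = 0 if $fill < 0;
my $kept = $pad_char x $fill . $long;                               # unchanged

print "[$_]\n" for $left_blank, $right_blank, $zero_padded, $left_star, $kept;
```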

How do I extract selected columns from a string?

Use substr() or unpack(), both documented in perlfunc. If you prefer thinking in terms of columns instead of widths, you can use this kind of thing:

 
    # determine the unpack format needed to split Linux ps output
    # arguments are cut columns
    my $fmt = cut2fmt(8, 14, 20, 26, 30, 34, 41, 47, 59, 63, 67, 72);

    sub cut2fmt { 
	my(@positions) = @_;
	my $template  = '';
	my $lastpos   = 1;
	for my $place (@positions) {
	    $template .= "A" . ($place - $lastpos) . " "; 
	    $lastpos   = $place;
	}
	$template .= "A*";
	return $template;
    }  
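
To see what cut2fmt() produces, here is a short usage sketch with two made-up column positions and an invented input string. The function is repeated so that the example runs on its own.

```perl
use strict;
use warnings;

sub cut2fmt {
    my (@positions) = @_;
    my $template = '';
    my $lastpos  = 1;
    for my $place (@positions) {
        # each field runs from the previous cut column up to this one
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos   = $place;
    }
    $template .= "A*";    # the last field takes whatever is left
    return $template;
}

my $fmt    = cut2fmt(5, 10);                    # "A4 A5 A*"
my @fields = unpack($fmt, "abcdefghijklmno");   # "abcd", "efghi", "jklmno"

print "$fmt\n";
print "[$_]\n" for @fields;
```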

How do I find the soundex value of a string?

Use the standard Text::Soundex module distributed with Perl. Before you do so, you may want to determine whether `soundex' is in fact what you think it is. Knuth's soundex algorithm compresses words into a small space, and so it does not necessarily distinguish between two words which you might want to appear separately. For example, the last names `Knuth' and `Kant' are both mapped to the soundex code K530. If Text::Soundex does not do what you are looking for, you might want to consider the String::Approx module available at CPAN.

How can I expand variables in text strings?

Let's assume that you have a string like:

 
    $text = 'this has a $foo in it and a $bar';  

If those were both global variables, then this would suffice:

 
    $text =~ s/\$(\w+)/${$1}/g;  # no /e needed  

But since they are probably lexicals, or at least, they could be, you'd have to do this:

 
    $text =~ s/(\$\w+)/$1/eeg;
    die if $@;			# needed /ee, not /e  

It's probably better in the general case to treat those variables as entries in some special hash. For example:

 
    %user_defs = ( 
	foo  => 23,
	bar  => 19,
    );
    $text =~ s/\$(\w+)/$user_defs{$1}/g;  
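
With the hash approach, a name that has no entry is replaced by nothing. If you would rather leave unknown names untouched, a variation with /e can check for the key first. This is a sketch; the $baz placeholder is invented to show the fallback.

```perl
use strict;
use warnings;

my %user_defs = (
    foo => 23,
    bar => 19,
);

my $text = 'this has a $foo in it and a $bar, but no $baz';

# Substitute known names; put unknown ones back exactly as they were.
$text =~ s/\$(\w+)/exists $user_defs{$1} ? $user_defs{$1} : "\$$1"/ge;

print "$text\n";
```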

See also ``How do I expand function calls in a string?'' in this section of the FAQ.

What's wrong with always quoting "$vars"?

The problem is that those double-quotes force stringification, coercing numbers and references into strings, even when you don't want them to be strings. Think of it this way: double-quote expansion is used to produce new strings. If you already have a string, why do you need more?

If you get used to writing odd things like these:

 
    print "$var";   	# BAD
    $new = "$old";   	# BAD
    somefunc("$var");	# BAD  

You'll be in trouble. Those should (in 99.8% of the cases) be the simpler and more direct:

 
    print $var;
    $new = $old;
    somefunc($var);  

Otherwise, besides slowing you down, you're going to break code when the thing in the scalar is actually neither a string nor a number, but a reference:

 
    func(\@array);
    sub func {
	my $aref = shift;
	my $oref = "$aref";  # WRONG
    }  

You can also get into subtle problems on those few operations in Perl that actually do care about the difference between a string and a number, such as the magical ++ autoincrement operator or the syscall() function.

Stringification also destroys arrays.

 
    @lines = `command`;
    print "@lines";		# WRONG - extra blanks
    print @lines;		# right  
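
Both effects are easy to observe directly. In this sketch (variable names and data are made up), the quoted copy of a reference turns out to be a plain string, and the interpolated array picks up a blank between its elements.

```perl
use strict;
use warnings;

my @array = (1, 2, 3);
my $aref  = \@array;
my $copy  = "$aref";             # a plain string such as "ARRAY(0x...)"

print ref($aref) ? "still a reference\n" : "just a string\n";
print ref($copy) ? "still a reference\n" : "just a string\n";

# Interpolating an array joins its elements with $" (a space by default).
my @lines  = ("one\n", "two\n");
my $quoted = "@lines";           # "one\n two\n", note the extra blank
my $joined = join '', @lines;    # "one\ntwo\n"
```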

Why don't my <<HERE documents work?

Check for these three things:

1. There must be no space after the << part.
 
2. There (probably) should be a semicolon at the end.
 
3. You can't (easily) have any space in front of the tag.
 

If you want to indent the text in the here document, you can do this:

 
    # all in one
    ($VAR = <<HERE_TARGET) =~ s/^\s+//gm;
        your text
        goes here
    HERE_TARGET  

But the HERE_TARGET must still be flush against the margin. If you want that indented also, you'll have to quote in the indentation.

 
    ($quote = <<'    FINIS') =~ s/^\s+//gm;
            ...we will have peace, when you and all your works have
            perished--and the works of your dark master to whom you
            would deliver us. You are a liar, Saruman, and a corrupter
            of men's hearts.  --Theoden in /usr/src/perl/taint.c
        FINIS
    $quote =~ s/\s+--/\n--/;  
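
Newer perls offer a built-in alternative: from perl 5.26 on, the <<~ form of the here document strips the common leading whitespace and allows the terminator to be indented as well. A sketch, assuming such a perl is available; the sub name message is invented.

```perl
use strict;
use warnings;

# Requires perl 5.26 or later: <<~ removes the indentation shared by
# all lines, so relative indentation inside the text is preserved.
sub message {
    my $text = <<~END_TEXT;
        your text
          goes here
        END_TEXT
    return $text;
}

print message();
```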

A nice general-purpose fixer-upper function for indented here documents follows. It expects to be called with a here document as its argument. It looks to see whether each line begins with a common substring, and if so, strips that substring off. Otherwise, it takes the amount of leading whitespace found on the first line and removes that much off each subsequent line.

 
    sub fix {
        local $_ = shift;
        my ($white, $leader);  # common whitespace and common leading string
        if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\1\2?.*\n)+$/) {
            ($white, $leader) = ($2, quotemeta($1));
        } else {
            ($white, $leader) = (/^(\s+)/, '');
        }
        s/^\s*?$leader(?:$white)?//gm;
        return $_;
    }  

This works with leading special strings, dynamically determined:

 
    $remember_the_main = fix<<'    MAIN_INTERPRETER_LOOP';
	@@@ int
	@@@ runops() {
	@@@     SAVEI32(runlevel);
	@@@     runlevel++;
	@@@     while ( op = (*op->op_ppaddr)() );
	@@@     TAINT_NOT;
	@@@     return 0;
	@@@ }
    MAIN_INTERPRETER_LOOP  

Or with a fixed amount of leading whitespace, with remaining indentation correctly preserved:

 
    $poem = fix<<EVER_ON_AND_ON;
       Now far ahead the Road has gone,
	  And I must follow, if I can,
       Pursuing it with eager feet,
	  Until it joins some larger way
       Where many paths and errands meet.
	  And whither then? I cannot say.
		--Bilbo in /usr/src/perl/pp_ctl.c
    EVER_ON_AND_ON  


© 2002-2004 Active-Venture.com Web Site Hosting Service


Disclaimer: This documentation is provided only for the benefits of our web hosting customers.
For authoritative source of the documentation, please refer to http://www.perldoc.com