Extract email address and name

Tag: perl Author: maji87251211 Date: 2011-04-03

I'm trying to write a Perl script to parse a directory full of emails and extract an email address and the corresponding name.

At the moment I'm searching for the word "From:" and then extracting the line, but this is where I am stuck.

The data can be in the following formats:

> From: "Smith, John" <[email protected]>
> From: John Smith <[email protected]>
> From: Frank Smith [mailto:[email protected]]=20
> From: "Smith, Frank" [mailto:[email protected]]=20

So I need to format the strings too, so that I end up with three variables: Firstname, Lastname and Email.

Is there a better way to parse the files to get the email address and name? How do I deal with the strings and rearrange them? Usually the names with a comma need swapping around.

Can anyone help please?

This is my script so far...

#!/usr/bin/perl
use strict;
use warnings;

my @files = </storage/filters/*>;
foreach my $file (@files)
{
        # Three-argument open; stop if a file cannot be read.
        open(FILE, '<', $file) or die "Cannot open $file: $!";
        while (my $line = <FILE>)
        {
            print $line if $line =~ /. From:/;
        }
        close FILE;
}

comments:

Have you looked at CPAN? Email::Address may mostly fit the bill. As for swapping the name around, that's both trivial (to swap) and hard (to determine whether the swap is required - I don't think a comma alone is likely to be sufficient). Also, use glob instead of <...> - it's more readable.
@Tanktalus Email::Address is great, but it won't handle gibberish such as 'From: "Smith, Frank" [mailto:[email protected]]=20'.

Best Answer

If you’re sure that those are the only valid formats, write your script to handle just those, and discard the rest.

my ($first, $last, $email);
while( my $line = <FILE> ) {
    if( $line =~ /From:\s+"(.*?),\s*(.*?)"\s+<(.*?)>/ ) {
        # "Last, First" <address>
        ($first, $last, $email) = ($2, $1, $3);
    } elsif( $line =~ /From:\s+(\S+)\s+(.*?)\s+<(.*?)>/ ) {
        # First Last <address>
        ($first, $last, $email) = ($1, $2, $3);
    } elsif( $line =~ /From:\s+"(.*?),\s*(.*?)"\s+\[mailto:(.*?)\]/ ) {
        # "Last, First" [mailto:address]
        ($first, $last, $email) = ($2, $1, $3);
    } elsif( $line =~ /From:\s+(\S+)\s+(.*?)\s+\[mailto:(.*?)\]/ ) {
        # First Last [mailto:address]
        ($first, $last, $email) = ($1, $2, $3);
    } else {
        next;    # Not one of the known formats: skip this line.
    }
    # Do something with $first, $last and $email. . . .
}

That skips bad cases entirely. You could certainly tighten up the code:

my ($first, $last, $email);
while( my $line = <FILE> ) {
    if( $line =~ /From:\s+"(.*?),\s*(.*?)"\s+(?:<|\[mailto:)(.*?)(?:>|\])/ ) {
        # "Last, First", address in <...> or [mailto:...]
        ($first, $last, $email) = ($2, $1, $3);
    } elsif( $line =~ /From:\s+(\S+)\s+(.*?)\s+(?:<|\[mailto:)(.*?)(?:>|\])/ ) {
        # First Last, address in <...> or [mailto:...]
        ($first, $last, $email) = ($1, $2, $3);
    } else {
        next;    # Not one of the known formats: skip this line.
    }
    # Do something with $first, $last and $email. . . .
}

or other possibilities.

Now, granted, if you want to make sure the email addresses are in a valid format, that’s a different deal. This will also be defeated by names like “Martin van Buren” and the like.
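
For what it's worth, the same idea can be wrapped in a small subroutine that captures the whole name first and only then decides whether to swap it. This is only a sketch under a few assumptions: the name parse_from_line is made up for illustration, the example.com addresses are placeholders, and treating everything after the first word as the last name is a guess that will not suit every name. It does no validation of the address itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): parse one "From:" line.
# Returns (first, last, email), or an empty list if the line does not look
# like one of the four formats listed in the question.
sub parse_from_line {
    my ($line) = @_;

    # Capture the name (optionally double-quoted) and the address, which may
    # be wrapped in <...> or [mailto:...]. Anything after the closing
    # bracket, such as a trailing "=20", is ignored.
    my ($name, $email) =
        $line =~ /\bFrom:\s+"?([^"<\[]+?)"?\s*(?:<|\[mailto:)([^>\]\s]+)(?:>|\])/
        or return;

    my ($first, $last);
    if ($name =~ /^([^,]+?)\s*,\s*(.+)$/) {
        # "Last, First": swap the two halves.
        ($last, $first) = ($1, $2);
    } else {
        # "First Last": assume everything after the first word is the last name.
        ($first, $last) = split /\s+/, $name, 2;
    }
    return ($first, $last // '', $email);
}

# Placeholder input lines; real data would come from the files being read.
for my $line (
    'From: "Smith, John" <john@example.com>',
    'From: Frank Smith [mailto:frank@example.com]=20',
) {
    if (my ($first, $last, $email) = parse_from_line($line)) {
        print "$first|$last|$email\n";
    }
}
```

With this split, "Martin van Buren" comes out as first name "Martin" and last name "van Buren", which may or may not be what you want, so treat the name handling as a starting point rather than a rule.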

comments:

OK, as stupid as it sounds, I didn't even think of that... Thanks :)
Happens to all of us, ard. :)