How to refactor regex in Perl

Tag: regex, perl Author: binlangguo Date: 2013-07-07

I have the following sentences:

     text <MIR-1> GGG-33 <EXP-V-3> text text <VACCVIRUS-PROP-1> some other.
     text <MIR-1> text <ASSC-PHRASE-1> text <VACCVIRUS-PROP-1> some other <PATTERN-1> other.

What I want to do is create a single regular expression (regex) that matches both sentences above. Note that the only difference between the two patterns is the middle tag: <EXP-V-3> in the first sentence and <ASSC-PHRASE-1> in the second.

I'm stuck with my current attempt, which matches them with two redundant regexes. What's the right way to do it?

    use strict;
    use warnings;

    my @sent = ("text <MIR-1> GGG-33 <EXP-V-3> text text <VACCVIRUS-PROP-1> some other.",
                " text <MIR-1> text <ASSC-PHRASE-1> text <VACCVIRUS-PROP-1> some other <PATTERN-1> other.");

    foreach my $sent (@sent) {
        # First regex: the middle tag is <EXP-V-n>
        if ( $sent =~ /.*<MIR-\d+>.*<EXP-V-\d+>.*<VACCVIRUS-PROP-\d+>.*/i ) {
            print "$sent\n";
        }
        # Second regex: identical except for the middle tag <ASSC-PHRASE-n>
        elsif ( $sent =~ /.*<MIR-\d+>.*<ASSC-PHRASE-\d+>.*<VACCVIRUS-PROP-\d+>/i ) {
            print "$sent\n";
        }
    }


Best Answer

(?:xxx|yyy)\s*<MIR-1>\s*(?:xxx|yyy)\s*(?:<EXP-V-3>|<ASSC-PHRASE-1>)\s*(?:xxxx|yyy)\s*<VACCVIRUS-PROP-1>

This regexp may not be optimized, but it works.

OK, here is what it does:

First Magic:

(?:EXPR) - a non-capturing group: the ?: right after the opening parenthesis groups EXPR without capturing it

Second Magic:

(a|b|c) - alternation: the | metacharacter chooses between <a>, <b>, or <c>
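To make the first two points concrete, here is a minimal sketch that contrasts a capturing group with a non-capturing one, both using the same alternation. The helper names middle_tag_type and has_middle_tag are hypothetical and used only for illustration.

```perl
use strict;
use warnings;

# Capturing group: the alternative that matched is stored in $1.
# (Hypothetical helper name.)
sub middle_tag_type {
    my ($s) = @_;
    return $s =~ /<(EXP-V|ASSC-PHRASE)-\d+>/ ? $1 : undef;
}

# Non-capturing group: same alternation, but nothing is stored;
# we only learn whether the pattern matched. (Hypothetical helper name.)
sub has_middle_tag {
    my ($s) = @_;
    return $s =~ /<(?:EXP-V|ASSC-PHRASE)-\d+>/ ? 1 : 0;
}

print middle_tag_type("text <ASSC-PHRASE-1> text"), "\n";   # prints "ASSC-PHRASE"
print has_middle_tag("text <EXP-V-3> text"), "\n";          # prints "1"
```

Use the capturing form only when you need the matched text afterwards; otherwise (?:...) keeps $1, $2, ... free for the groups you do care about.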

Third Magic:

Here it works on Rubular.

Generalization:

.+?\s*<MIR-\d+>\s*.+?\s*(?:<EXP-V-\d+>|<ASSC-PHRASE-\d+>)\s*.+?\s*<VACCVIRUS-PROP-\d+>.+

And your example:

It works on Rubular too.
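Applied to the code in the question, the generalized idea might look like the sketch below. It is a simplification, not the exact pattern above: the leading and trailing .+ are dropped on the assumption that you only need to know whether a line matches, and the variable names $pattern and @matched are illustrative.

```perl
use strict;
use warnings;

# One pattern for both sentences: the middle tag is an alternation
# inside a non-capturing group. Assumption: we only test for a match,
# so the surrounding ".+" from the generalized regex is not needed.
my $pattern = qr/<MIR-\d+>.*?<(?:EXP-V|ASSC-PHRASE)-\d+>.*?<VACCVIRUS-PROP-\d+>/i;

my @sent = (
    "text <MIR-1> GGG-33 <EXP-V-3> text text <VACCVIRUS-PROP-1> some other.",
    " text <MIR-1> text <ASSC-PHRASE-1> text <VACCVIRUS-PROP-1> some other <PATTERN-1> other.",
);

# Keep only the sentences that match the single pattern.
my @matched = grep { $_ =~ $pattern } @sent;
print "$_\n" for @matched;
```

Both sentences are printed, and the if/elsif pair from the question collapses into a single test.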

To reject a string (here, one with [ between the first two tags or ] between the last two):

.+?\s*<MIR-\d+>\s*[^\[]+?\s*(?:<EXP-V-\d+>|<ASSC-PHRASE-\d+>)\s*[^\]]+?\s*<VACCVIRUS-PROP-\d+>.+

Fourth Magic:

[^SYMBOLS] - a negated character class. The ^ at the beginning means 'I DON'T want to match any of these'.

Here is an example:

[abc]{1} - matches exactly one character that is <a> or <b> or <c>
[^abc]{1} - matches exactly one character that is NOT <a>, <b> or <c>

Again, it works on Rubular.
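As a rough illustration of the negated class, the sketch below rejects a sentence when a [ appears between the MIR tag and the middle tag. The helper name accepts and the sample strings are made up for this example, and the pattern is a shortened version of the one above.

```perl
use strict;
use warnings;

# [^\[]*? matches any run of characters that contains no "[",
# so a "[" between the two tags makes the whole match fail.
my $no_bracket = qr/<MIR-\d+>[^\[]*?<(?:EXP-V|ASSC-PHRASE)-\d+>/;

# Hypothetical helper: 1 if the sentence is accepted, 0 if rejected.
sub accepts {
    my ($s) = @_;
    return $s =~ $no_bracket ? 1 : 0;
}

for my $s ("x <MIR-1> ok <EXP-V-3> y", "x <MIR-1> [bad] <EXP-V-3> y") {
    print accepts($s) ? "match:  " : "reject: ", $s, "\n";
}
```

The first sample is accepted and the second is rejected, because the text between its tags contains a [.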

comments:

How can I make it more general? 'xxx' or 'yyy' can actually be anything.
@neversaint I've updated the answer, please check.
Please check the update again. Don't forget to accept the answer at the end ;)
You saved my life. Thanks a million.