Counting unique strings in a column with uniq -c

Tag: uniq Author: verycoolbaby Date: 2014-01-21

I am trying to use "uniq -c" to count occurrences of the 2nd string (field) in each line.
My file A has around 500,000 lines and looks like this:

File_A

30-Nov 20714 GHI 235
30-Nov 10005 ABC 101
30-Nov 10355 DEF 111
30-Nov 10005 ABC 101
30-Nov 10005 ABC 101
30-Nov 10355 DEF 111
30-Nov 10005 ABC 101
30-Nov 20714 GHI 235
...

The command I used:

sort -k 2 File_A | uniq -c

I found that the result I get doesn't match the lines.
How can I fix this problem? Or is there another way to count the unique strings in a line?

The result I get looks similar to this (I just made up the numbers):

   70 30-Nov 10005 ABC 101
    5 30-Nov 10355 DEF 111
   55 30-Nov 20714 GHI 235

Best Answer

Here are a couple, or three, other ways to do it. These solutions have the benefit that the file does not need to be sorted - rather they rely on hashes (associative arrays) to keep track of unique occurrences.

Method 1:

perl -ane 'END{print scalar keys %h,"\n"}$h{$F[1]}++'  File_A

The "-ane" makes Perl loop through the lines in File_A (-n), split each line into fields in the array @F as it goes (-a), and run the given code on each line (-e). Fields are numbered from 0, so your unique numbers end up in $F[1]. %h is a hash. The hash element indexed by $F[1] is incremented as each line is processed. At the end, the END{} block is run, and it simply prints the number of keys in the hash, i.e. the number of distinct values in field 2.

Method 2:

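perl -ane 'END{print "$u\n"}$u++ if ++$h{$F[1]}==1'  File_A

Similar to the method above, but this time a variable $u is incremented each time incrementing the hash element results in it becoming 1 - i.e. the first time we see that number. Note the pre-increment (++$h{...}): with a post-increment the comparison would see the old value, so "==1" would only match the second time a number is seen and values that occur just once would not be counted.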

I am sure @mpapec or @fedorqui could do it in half the code, but you get the idea!

Method 3:

awk 'FNR==NR{a[$2]++;next}{print a[$2],$0}END{for(i in a)u++;print u}' File_A File_A

Result:

2 30-Nov 20714 GHI 235
4 30-Nov 10005 ABC 101
2 30-Nov 10355 DEF 111
4 30-Nov 10005 ABC 101
4 30-Nov 10005 ABC 101
2 30-Nov 10355 DEF 111
4 30-Nov 10005 ABC 101
2 30-Nov 20714 GHI 235
3

This uses awk and runs through your file twice - that is why File_A appears twice at the end of the command. On the first pass, FNR==NR is true, so the code in curly braces after it is run: it increments the element of associative array a[] indexed by field 2 ($2), so it is essentially counting the number of times each id in field 2 is seen, and "next" skips the rest of the program for that line. Then, on the second pass, the part in the second set of curly braces is run and it prints the total number of times the id was seen on the first pass, plus the current line. At the end, the END{} block is run and it counts the elements in associative array a[] and prints that out.
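
If only the per-id totals are needed, and not a count in front of every original line, one pass over the file is enough. A sketch under that assumption, using the sample lines from the question (the scratch file and the `summary` variable are illustrative; the trailing sort is there only because the order of awk's `for (k in a)` is unspecified):

```shell
# Recreate the eight sample lines from the question in a scratch file.
file=$(mktemp)
cat > "$file" <<'EOF'
30-Nov 20714 GHI 235
30-Nov 10005 ABC 101
30-Nov 10355 DEF 111
30-Nov 10005 ABC 101
30-Nov 10005 ABC 101
30-Nov 10355 DEF 111
30-Nov 10005 ABC 101
30-Nov 20714 GHI 235
EOF

# One pass: tally field 2, then print "count id" once per distinct id.
summary=$(awk '{ a[$2]++ } END { for (k in a) print a[k], k }' "$file" | sort -k2,2)
echo "$summary"
rm -f "$file"
```

This gives one line per id rather than one line per input line, which is closer to what "uniq -c" would show.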

Other Answer1

You also need to tell uniq to consider only that field, the same way you did with sort. Perhaps you can use -f or --skip-fields for that. The problem you then have is that uniq doesn't take a "number of fields to check", so everything after the skipped fields is still compared.
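
To make the point about -f concrete: "uniq -f 1" skips the first field but still compares everything after it, so it helps when only the date differs and not when later columns differ. A small sketch with made-up lines (the `01-Dec` line is invented for illustration and is not from the question's data; the variable names are arbitrary):

```shell
# Three illustrative lines; the first two differ only in the date field.
file=$(mktemp)
cat > "$file" <<'EOF'
30-Nov 10005 ABC 101
01-Dec 10005 ABC 101
30-Nov 10355 DEF 111
EOF

# Plain uniq -c compares whole lines, so the two 10005 lines stay separate.
whole=$(sort -k2,2 "$file" | uniq -c | wc -l)
# -f 1 ignores the first field (the date), so they collapse into one group.
skipped=$(sort -k2,2 "$file" | uniq -f 1 -c | wc -l)

# $(( )) strips any padding that wc may add around the number.
echo "whole-line groups: $((whole)), groups ignoring field 1: $((skipped))"
rm -f "$file"
```

Here the whole-line comparison yields three groups and the -f 1 variant yields two.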

Otherwise, if you don't need to keep the original string you can just:

cut -d' ' -f2 | sort ...

comments:

Thanks for your answer, but I tried "uniq -f" and it still outputs the wrong result. Is there another way to solve this?
@JOSS Don't accept in that case, it will deter other answerers.

Other Answer2

If your intention is to count the unique values in the second column, the one that has 20714, 10005, ... in it, then you need to extract it first using cut.

cut -d' ' -f 2 File_A | sort | uniq -c
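
The pipeline above prints one count per value. If only the number of distinct values is wanted, "sort -u" followed by "wc -l" gives it. A minimal sketch on the sample lines from the question (the scratch file and the variable names `per_value` and `distinct` are illustrative):

```shell
# Recreate the eight sample lines from the question in a scratch file.
file=$(mktemp)
cat > "$file" <<'EOF'
30-Nov 20714 GHI 235
30-Nov 10005 ABC 101
30-Nov 10355 DEF 111
30-Nov 10005 ABC 101
30-Nov 10005 ABC 101
30-Nov 10355 DEF 111
30-Nov 10005 ABC 101
30-Nov 20714 GHI 235
EOF

# Occurrences of each value in column 2; awk only normalises uniq's padding.
per_value=$(cut -d' ' -f2 "$file" | sort | uniq -c | awk '{ print $1, $2 }')
# Number of distinct values in column 2.
distinct=$(cut -d' ' -f2 "$file" | sort -u | wc -l)

echo "$per_value"
echo "distinct: $((distinct))"
rm -f "$file"
```

Note that cut -d' ' assumes the columns are separated by exactly one space, as in the sample; with runs of spaces or tabs, awk '{print $2}' would be a safer way to extract the column.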