AWK one-liner for multi-column comparision two unsorted files

Saturday, December 29, 2012 , 5 Comments

This awk one-liner works for multi-column on unsorted files. Comparision is done based on 1st,2nd,3rd,4th of the first file and 1st,3rd,6th,7th of the second file.
File1
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
File2
7000,john,2,0,0,1,6
7000,john,2,0,0,1,7
7000,john,2,0,0,1,8
7000,john,2,0,0,1,9
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
7003,mike,1,0,0,2,2
7003,mike,1,0,0,2,3
7003,mike,1,0,0,2,4
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
Output
awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' File1 File2
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

5 comments:

  1. Hi Vijay, Thanks for the post.
    Can you help me out in this. Actually, I have rows like

    50121 abc.com 28/1/2014-12:00:00
    52111 xyz.com 27/1/2014-12:00:00
    deusr abc.com 26/1/2014-12:00:00
    50121 abc.com 26/1/2014-12:00:00
    52111 abc.com 25/1/2014-12:00:00

    I removed duplicates based on first column and got the output as
    50121 abc.com 28/1/2014-12:00:00
    52111 xyz.com 27/1/2014-12:00:00
    deusr abc.com 26/1/2014-12:00:00

    but the issue here is I am willing to remove duplicates based on 2 columns comparison. i.e., 1st and 2nd one. I am trying using 'awk' command. But I am not getting it. Can you help me out in this please..

    ReplyDelete
  2. @Bharghav, try the below:

    awk '{a[$1,$2]=$0}END{for(i in a)print a[i]}' your_file

    ReplyDelete
  3. Hi Vijay - in your original post the comparison of the 4 columns are based on AND condition.

    If we need to use OR condition for comparing those columns, then what would be the AWK command for that?

    Thanks

    ReplyDelete
  4. Hi Vijay,
    Thank you for such a great posts.
    Can you please help with the below requirement:

    File1:
    23455|abc|xyz
    23455|abc|xsd
    34433|wer|sad
    45655|fdf|fcd

    File2:
    v343v|23455
    z565z|23455
    c9898|34433
    b2323|45655

    Output should be:
    23455|abc|xyz|v343v
    23455|abc|xyz|z565z
    23455|abc|xsd|v343v
    23455|abc|xsd|z565z
    34433|wer|sad|c9898
    45655|fdf|fcd|b2323

    I am using the below command :
    awk -F'|' 'NR==FNR { a[$1]=$1"|"$2"|"$3; next} $2 > 0 {print a[$2] "|" $1}' file1 file2 > result.txt

    But its only showing the result :
    23455|abc|xsd|v343v
    23455|abc|xsd|z565z
    34433|wer|sad|c9898
    45655|fdf|fcd|b2323

    ReplyDelete
  5. I Have two files as below,

    first file:

    Start State Next State Session Count Transition% LA_product_view home 694 28.660%

    LA_product_view searchresults 54 2.230%

    home 1101260 2 0.050%

    second file:

    Start State Next State Session Count Transition%

    LA_product_view home 618 27.560%

    LA_product_view searchresults 59 2.630%

    home price 25 0.360%

    i need to compare first two coloumns of first file and first two coloumns of second file and if they both match it should print 3,4 coloums of both file.if both the files has unique they should be listed with NA from the both files. my output should be like:

    LA_product_view home 694 28.660% 618 27.560%

    LA_product_view searchresults 54 2.230% 59 2.630%

    home 1101260 2 0.050% N/A N/A

    home price N/A N/A 25 0.360%

    can someone please help me in this..!

    ReplyDelete