Author Topic: Searching for duplicate names in a file..  (Read 588 times)

Agent007

  • Member
  • **
  • Posts: 120
  • Kudos: 0
Searching for duplicate names in a file..
« on: 20 January 2003, 16:23 »
Hi all,


I have this text file with tonnes of usernames... I need to search the file for duplicate usernames. Since doing that by hand is very tedious, is there some script that would do the trick? I was wondering if vi would be able to search and delete...?


thanks,
007
AMD Athlon processor
256MB SDRAM
Linux Distro - RedHat 9.0

flap

  • Member
  • **
  • Posts: 1,268
  • Kudos: 137
Searching for duplicate names in a file..
« Reply #1 on: 20 January 2003, 16:45 »
sort filename | uniq > newfilename

will remove duplicates from filename and output the new list to newfilename.
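
If you just want to see which names are duplicated, rather than remove them, "uniq -d" prints only the repeated lines (this assumes exact duplicates; lines that differ in case or spacing won't be caught):

sort filename | uniq -d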
"While envisaging the destruction of imperialism, it is necessary to identify its head, which is none other than the United States of America." - Ernesto Che Guevara

http://counterpunch.org
http://globalresearch.ca


voidmain

  • VIP
  • Member
  • ***
  • Posts: 5,605
  • Kudos: 184
    • http://voidmain.is-a-geek.net/
Searching for duplicate names in a file..
« Reply #2 on: 20 January 2003, 17:24 »
You can also just add the "-u" parameter to the "sort" command.

"sort -u" is basically the same as saying "sort | uniq".  Now "sort -u filename" will only throw out duplicates if "entire lines" are the same.

e.g. A file with data like this:

...
joe
bill
joe
mary
joe
...

running the "sort -u" command on it will output:

...
bill
joe
mary
...

Now if the file contained:

...
joe:Studly
bill:Nurdy
joe:studly
mary:Sexy
joe:studly dude
...

then running "sort -u" on it would output:

...
bill:Nurdy
joe:Studly
joe:studly
joe:studly dude
mary:Sexy
...

However, if there is some structure to the text file containing your usernames, you can use sort keys (see the sketch at the end of this post). And of course you could "cut" out the columns or fields you want to find unique values in, and you may want to ignore case. Sort is a really powerful tool, and here are the "info" and "man" pages, which don't do it full justice:

http://voidmain.kicks-ass.net/man/?parm=sort&docType=info
http://voidmain.kicks-ass.net/man/?parm=sort&docType=man

If your file doesn't just contain a simple userid, all in the same case, and you are having trouble getting the right command, just paste in a sample of the file and I can give you a command that accomplishes what you want.
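
For example, on the colon-delimited sample above, a keyed sort collapses the three "joe" lines down to one. A sketch (which of the "joe" lines survives depends on the sort order):

$ sort -t: -k1,1 -u filename

The "-t:" makes the colon the field separator, and "-k1,1" keys the comparison on the first field only, so "joe:Studly", "joe:studly", and "joe:studly dude" all count as duplicates. Throw in "-f" as well if you want "Joe" and "joe" treated as the same.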

[ January 20, 2003: Message edited by: void main ]

Someone please remove this account. Thanks...

Agent007

  • Member
  • **
  • Posts: 120
  • Kudos: 0
Searching for duplicate names in a file..
« Reply #3 on: 20 January 2003, 21:33 »
Void Main,

As you can see below, "user11" appears twice... I want only one instance of that
username.

 
quote:

  n0 tty n0    skdj@xyzt Async interface      00:00:06   PPP: 900.n.00.n82
  n0 tty n0    lala@xyz Async interface      00:00:0n   PPP: 900.n.00.n29
  n0 tty n0    user11@xyz Async interface      00:00:00   PPP: 900.n.00.n42
  n0 tty n0    user11@xyzt Async interface      00:0n:49   PPP: 900.n.00.n42
  n0 tty n0    weoi@xyzt Async interface      00:00:00   PPP: 900.n.00.n85
  nn tty nn    awer@xyz Async interface      00:00:52   PPP: 900.n.00.n00
  66 tty 66    it@xyzt Async interface      00:02:04   PPP: 900.6.00.649




I have tried the sort command below, and it did not remove the
duplicates... Please give me the correct syntax.

 
quote:

[root@localhost root]# sort -u test.txt | uniq > test1.txt




thanks & rgds,
007
AMD Athlon processor
256MB SDRAM
Linux Distro - RedHat 9.0

voidmain

  • VIP
  • Member
  • ***
  • Posts: 5,605
  • Kudos: 184
    • http://voidmain.is-a-geek.net/
Searching for duplicate names in a file..
« Reply #4 on: 20 January 2003, 22:30 »
A couple of things. If you are going to use "-u" on your sort, then don't also put "uniq" in your pipe. However, whether you use "sort -u" or "sort | uniq", duplicates are only spotted when entire lines match, unless you give a field parameter (see the note after the output below). If all you are concerned about is seeing the IDs and nothing else, then this command will do what you want:

$ cut -f4 -d' ' test.txt | cut -f1 -d'@' | sort -u

Which will list your IDs like:

awer
it
lala
skdj
user11
weoi
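
(A note on that field parameter: "uniq" can skip leading fields with "-f", e.g.

$ sort -k4 test.txt | uniq -f 3

skips the first three whitespace-separated fields before comparing. But it still compares everything from the fourth field to the end of the line, times and addresses included, so it wouldn't collapse your duplicates here; that's why the command above cuts the ID out first.)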

Do you want the entire line output? And what command did your data come from? Is the formatting in your sample exactly the same as in your actual file? If the @ sign weren't there it would be easy with a single sort command; in fact, we can change the @ sign to a space and have it do what you want:

$ cat test.txt | tr '@' ' ' | sort -k4,4 -u

which lists this:
nn tty nn awer xyz Async interface 00:00:52 PPP: 900.n.00.n00
66 tty 66 it xyzt Async interface 00:02:04 PPP: 900.6.00.649
n0 tty n0 lala xyz Async interface 00:00:0n PPP: 900.n.00.n29
n0 tty n0 skdj xyzt Async interface 00:00:06 PPP: 900.n.00.n82
n0 tty n0 user11 xyz Async interface 00:00:00 PPP: 900.n.00.n42
n0 tty n0 weoi xyzt Async interface 00:00:00 PPP: 900.n.00.n85

I would have to think a little more about doing it without changing the @ sign. This worked when I pasted the data you posted into a file.

[ January 20, 2003: Message edited by: void main ]

Someone please remove this account. Thanks...

Agent007

  • Member
  • **
  • Posts: 120
  • Kudos: 0
Searching for duplicate names in a file..
« Reply #5 on: 20 January 2003, 23:15 »
Thanks a million Void Main!! That really worked... You're right, I only wanted the IDs to be listed. By the way, how does it actually work? I mean, what is the "-f1 -d'@'" for? Also, why the need for pipes?

thanks & rgds,
007
AMD Athlon processor
256MB SDRAM
Linux Distro - RedHat 9.0

flap

  • Member
  • **
  • Posts: 1,268
  • Kudos: 137
Searching for duplicate names in a file..
« Reply #6 on: 20 January 2003, 23:30 »
quote:
Originally posted by void main:
I would have to think a little more about doing it without changing the @ sign.


How about this?

tr '@' ' ' < test.txt | sort -k4,4 -u | gawk '{print $1 " " $2 " " $3 " " $4 "@" $5 " " $6 " " $7 " " $8 " " $9 " " $10}'

Or is there a way of making that gawk statement smaller?
"While envisaging the destruction of imperialism, it is necessary to identify its head, which is none other than the United States of America." - Ernesto Che Guevara

http://counterpunch.org
http://globalresearch.ca


voidmain

  • VIP
  • Member
  • ***
  • Posts: 5,605
  • Kudos: 184
    • http://voidmain.is-a-geek.net/
Searching for duplicate names in a file..
« Reply #7 on: 20 January 2003, 23:34 »
quote:
Originally posted by flap:


How about this?

tr '@' ' ' < test.txt | sort -k4,4 -u | gawk '{print $1 " " $2 " " $3 " " $4 "@" $5 " " $6 " " $7 " " $8 " " $9 " " $10}'

Or is there a way of making that gawk statement smaller?



I'm sure there is; gawk/awk is very powerful. Unfortunately my brain was already full 10 years ago, before I reached the awk chapter. And I don't believe your command will actually prevent lines with duplicate IDs (before the @).

[ January 20, 2003: Message edited by: void main ]

Someone please remove this account. Thanks...

flap

  • Member
  • **
  • Posts: 1,268
  • Kudos: 137
Searching for duplicate names in a file..
« Reply #8 on: 20 January 2003, 23:40 »
Well, the duplicate IDs have already been removed by your command, the output of which is piped through gawk.
"While envisaging the destruction of imperialism, it is necessary to identify its head, which is none other than the United States of America." - Ernesto Che Guevara

http://counterpunch.org
http://globalresearch.ca


voidmain

  • VIP
  • Member
  • ***
  • Posts: 5,605
  • Kudos: 184
    • http://voidmain.is-a-geek.net/
Searching for duplicate names in a file..
« Reply #9 on: 20 January 2003, 23:42 »
quote:
Originally posted by Agent007:
Thanks a million Void Main!! That really worked... You're right, I only wanted the IDs to be listed. By the way, how does it actually work? I mean, what is the "-f1 -d'@'" for? Also, why the need for pipes?

thanks & rgds,
007



It's really simple once you play with some of the basic UNIX commands. The first part of the command, "cut -f4 -d' ' test.txt", says to break each line into columns separated by spaces (' ') and output only the fourth column. That output will be in the form "userid@host". You then pipe that output into "cut -f1 -d'@'", which splits its input into columns delimited by '@' characters, giving two columns; the "-f1" says to output only the first column, which is the userid. Take that output and pipe it directly into the "sort -u" command, which sorts the input, removes duplicates, and spits the result back at you. It would be wise to invest in a shell programming book. This will become second nature to you...
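
To see why the pipes are there, try building the pipeline up one stage at a time and watch what each stage does to the data (the comments describe each step; the exact lines depend on your real file):

$ cut -f4 -d' ' test.txt                            # fourth space-separated column: userid@host
$ cut -f4 -d' ' test.txt | cut -f1 -d'@'            # strip the @host part, leaving just the userid
$ cut -f4 -d' ' test.txt | cut -f1 -d'@' | sort -u  # sort the userids and drop duplicates

A pipe simply feeds the output of the command on its left into the command on its right as input, which is why you can chain small tools like this instead of needing one big program.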
Someone please remove this account. Thanks...

voidmain

  • VIP
  • Member
  • ***
  • Posts: 5,605
  • Kudos: 184
    • http://voidmain.is-a-geek.net/
Searching for duplicate names in a file..
« Reply #10 on: 20 January 2003, 23:43 »
quote:
Originally posted by flap:
Well, the duplicate IDs have already been removed by your command, the output of which is piped through gawk.


You're right. I'm having a bad hair day. That would indeed work.  
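
And about making that gawk statement smaller: I still haven't read the awk chapter, but I believe you could skip the tr and the field reassembly entirely and let gawk track the IDs itself. A sketch, assuming as before that the userid always sits in the fourth whitespace-separated field:

$ gawk '{ id = $4; sub(/@.*/, "", id); if (!seen[id]++) print }' test.txt | sort -k4

The sub() strips the @host part from a copy of the field, so the printed lines come out untouched; the trailing sort just puts the surviving lines back in ID order.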
Someone please remove this account. Thanks...