Searching for duplicate names in a file..
Agent007:
Hi all,
I have a text file with tonnes of usernames... I need to search the file for duplicate usernames. Since doing this by hand is very tedious, is there some script that would do the trick? I was wondering if vi would be able to search and delete...?
thanks,
007
flap:
sort filename | uniq > newfilename
will remove duplicates from filename and output the new list to newfilename.
voidmain:
You can also just add the "-u" param to the "sort" command:
"sort -u" is basically the same as saying "sort | uniq". Now "sort -u filename" will only throw out duplicates if "entire lines" are the same.
e.g. A file with data like this:
...
joe
bill
joe
mary
joe
...
after running the sort -u command on it will output:
...
bill
joe
mary
...
Now if the file contained:
...
joe:Studly
bill:Nurdy
joe:studly
mary:Sexy
joe:studly dude
...
would output:
...
bill:Nurdy
joe:Studly
joe:studly
joe:studly dude
mary:Sexy
...
However, if there is some format to the text file containing your usernames then you can use sort keys. And of course you could "cut" out the columns or fields you want to find unique patterns on, and you may want to ignore case. Sort is a really powerful tool; here are the "info" and "man" pages, which don't do it full justice:
http://voidmain.kicks-ass.net/man/?parm=sort&docType=info
http://voidmain.kicks-ass.net/man/?parm=sort&docType=man
If your file doesn't just contain the simple userid all in the same case and you are having trouble getting the right command, just paste in a sample of the file and I can give you a command to use to accomplish what you want.
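To make the sort-key idea above concrete, here is a small sketch (not from the original post; it assumes colon-separated data like the joe:Studly example, with the username in field 1):

```shell
# Sample data mirroring the joe:Studly example from the post.
printf 'joe:Studly\nbill:Nurdy\njoe:studly\nmary:Sexy\njoe:studly dude\n' > users.txt

# -t: sets the field separator, -k1,1 restricts the sort key to field 1,
# -f folds case while comparing, -u keeps one line per unique key.
# So all three "joe" lines collapse to a single line.
sort -t: -k1,1 -f -u users.txt
```

Note that when keys compare equal, which of the duplicate lines survives is not guaranteed, so don't rely on getting a particular "joe" line back.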
[ January 20, 2003: Message edited by: void main ]
Agent007:
Void Main,
As you can see below, "user11" appears twice... I want only one instance of that
username.
quote:
n0 tty n0 skdj@xyzt Async interface 00:00:06 PPP: 900.n.00.n82
n0 tty n0 lala@xyz Async interface 00:00:0n PPP: 900.n.00.n29
n0 tty n0 user11@xyz Async interface 00:00:00 PPP: 900.n.00.n42
n0 tty n0 user11@xyzt Async interface 00:0n:49 PPP: 900.n.00.n42
n0 tty n0 weoi@xyzt Async interface 00:00:00 PPP: 900.n.00.n85
nn tty nn awer@xyz Async interface 00:00:52 PPP: 900.n.00.n00
66 tty 66 it@xyzt Async interface 00:02:04 PPP: 900.6.00.649
--- End quote ---
I have tried the sort command below, and it did not remove the
duplicates... Please give me the correct syntax.
quote:
[root@localhost root]# sort -u test.txt | uniq > test1.txt
--- End quote ---
thanks & rgds,
007
voidmain:
A couple of things. If you are going to use "-u" on your sort then don't also pipe through "uniq". Either way, whether you use "sort -u" or "sort | uniq", duplicates are only thrown out when entire lines match, unless you give the command a field/key parameter. If all you are concerned about is seeing the IDs and nothing else, then this command will do what you want:
$ cut -f4 -d' ' test.txt | cut -f1 -d'@' | sort -u
Which will list your IDs like:
awer
it
lala
skdj
user11
weoi
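A related trick, not mentioned above but worth knowing: if you only want to see which IDs are duplicated (rather than the deduplicated list), swap "sort -u" for "sort | uniq -d". A sketch, assuming the same space-separated format with the ID in field 4 before the '@':

```shell
# Create a small file matching the posted format (single spaces between fields).
cat > test.txt <<'EOF'
n0 tty n0 skdj@xyzt Async interface 00:00:06 PPP: 900.n.00.n82
n0 tty n0 user11@xyz Async interface 00:00:00 PPP: 900.n.00.n42
n0 tty n0 user11@xyzt Async interface 00:0n:49 PPP: 900.n.00.n42
nn tty nn awer@xyz Async interface 00:00:52 PPP: 900.n.00.n00
EOF

# uniq -d prints only lines that occur more than once in sorted input,
# so this lists just the duplicated IDs (here: user11).
cut -f4 -d' ' test.txt | cut -f1 -d'@' | sort | uniq -d
```

"uniq -c" instead of "-d" would show a count next to every ID, which is handy for eyeballing how many times each user appears.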
Do you want the entire line output? And what command did your data come from? Is the formatting in your sample exactly the same as in your actual file? If the @ sign weren't there, this would be easy with a single sort command; in fact, we can change the @ sign to a space and have it do what you want:
$ cat test.txt | tr '@' ' ' | sort -k4,4 -u
which lists this:
nn tty nn awer xyz Async interface 00:00:52 PPP: 900.n.00.n00
66 tty 66 it xyzt Async interface 00:02:04 PPP: 900.6.00.649
n0 tty n0 lala xyz Async interface 00:00:0n PPP: 900.n.00.n29
n0 tty n0 skdj xyzt Async interface 00:00:06 PPP: 900.n.00.n82
n0 tty n0 user11 xyz Async interface 00:00:00 PPP: 900.n.00.n42
n0 tty n0 weoi xyzt Async interface 00:00:00 PPP: 900.n.00.n85
I would have to think a little more about doing it without changing the @ sign. This worked with the data you posted, pasted into a file.