Removing Duplicates with Python
20/08/09 12:45 Filed in: Python
I haven’t really written much about Python lately. I have a feeling that is about to change. Python is great because it is powerful and lets you get things done very quickly. I figured I would write a short post showing how to remove duplicates using just the set type. This is probably the quickest and easiest way of removing duplicates in Python.
I don’t think I need to get into how useful it can be to easily remove duplicates. I have used this many times in the past for everything from removing duplicate values from a list of SQL Injection checks to just determining how many unique occurrences I have for a given test.
Python has a type called a set. A set is basically an unordered collection of unique values. You can create a set by specifying a new empty set and adding values to it or by converting another type. The set conversion can be done over any iterable object.
Create a new empty set called myset:
myset = set()
You can add values to your set by using add or update:
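As a minimal sketch (the values are made up for illustration), add puts a single value in the set, while update pulls in every item from an iterable:

```python
myset = set()

# add() inserts one value; adding it again has no effect.
myset.add("apple")
myset.add("apple")

# update() adds every item from an iterable, skipping duplicates.
myset.update(["banana", "apple", "cherry"])

print(sorted(myset))  # ['apple', 'banana', 'cherry']
```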
Convert another type to a set called newset:
newset = set(another_type)
Sets in Python are nice for a couple of reasons. The first is that they only keep unique values. This means that whatever you convert to a set, or add to a set, is stored only once; duplicate values are discarded. Secondly, you can test for membership in the set. Testing for membership gives you a True / False response based on whether a value exists in the set.
Here are some examples.
Converting a list to a set.
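A small sketch of the idea, assuming a made-up list of numbers with repeats (the names values and unique_values are only for illustration):

```python
values = [3, 1, 3, 2, 1, 3]

# Converting the list to a set discards the repeated values.
unique_values = set(values)

print(len(values))            # 6
print(len(unique_values))     # 3
print(sorted(unique_values))  # [1, 2, 3]
```

The set itself has no guaranteed order, which is why the sketch sorts it before printing.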
Even strings are iterable objects in Python. String conversion to set.
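For instance, with an arbitrary sample word (my choice, not anything special), each character becomes one item in the set:

```python
# Iterating over a string yields its characters one at a time,
# so the set ends up holding each distinct character once.
letters = set("mississippi")

print(sorted(letters))  # ['i', 'm', 'p', 's']
```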
The following shows True / False values for membership tests from the previous string conversion.
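A sketch of what such tests might look like, assuming the set was built from the sample word "mississippi" (an illustrative choice):

```python
letters = set("mississippi")

# The "in" operator tests membership and returns a boolean.
print("s" in letters)      # True
print("z" in letters)      # False
print("z" not in letters)  # True
```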
Let’s say you wanted to write a small program that took a file, removed the duplicates, and created a new file with only unique values. The file that contains the duplicates has one value per line, which means there is a newline at the end of each item. You want to maintain the newline in the new unique file that you are writing to as well. You will see the newlines in the following specified by “\n”.
The following is an example:
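import sys

# Exit with a usage message unless both file names were given.
if len(sys.argv) < 3:
    print("Usage: remove_dups.py original_file.txt unique_file.txt")
    sys.exit(1)
file1 = open(sys.argv[1])
file2 = open(sys.argv[2], "w")
unique = set(file1.read().split("\n"))  # split drops newlines, set drops duplicates
file2.write("".join([line + "\n" for line in unique]))
file1.close()
file2.close()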
I will explain a bit of what’s happening here. Let’s say we have copied this into a file called remove_dups.py. This program takes two arguments: your original file and the name of the file you want to create without the duplicates. If it doesn’t get both arguments, the program prints a usage message and exits.
Next both files are opened, with the second file opened for writing. The first file is read in and split on newlines, and converting the result to a set leaves only the unique values in the unique variable. We then write every value to the second file, concatenating a newline on the end. This makes the second file contain the unique values one per line.
You now know how to remove duplicates in Python using the set type. Knowing is half the battle.
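One thing to keep in mind is that a set is unordered, so the unique values may come out in a different order than they appeared in the original file. If order matters, a variation on the same idea is to use the set only to remember what has been seen. The sketch below is one way to do that; the function name unique_in_order is hypothetical and not part of the program above:

```python
def unique_in_order(lines):
    """Return the items of lines without duplicates, keeping first-seen order."""
    seen = set()   # values encountered so far
    result = []    # unique values in their original order
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


print(unique_in_order(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']
```

The membership test against the set keeps each lookup fast, while the list preserves the order.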
I wrote this post very quickly and didn’t explain my use of read() vs. readlines(). Marcin pointed out yesterday that it wasn’t clear. I wanted to show how you could use read() and split on newline characters. My hope was that you would see how you could split on any character when reading a file, such as commas, semicolons, asterisks, or anything really.
In the code example above, if you wanted to read the file in line by line instead of splitting on the “\n” character, you could just use readlines() instead.
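As a rough illustration with a comma as the separator, the sketch below uses a made-up string standing in for the contents returned by read():

```python
# Stand-in for the text you would get back from file.read().
data = "red,green,red,blue,green"

# Split on the delimiter of your choice, then let the set drop duplicates.
unique = set(data.split(","))

print(sorted(unique))  # ['blue', 'green', 'red']
```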
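A sketch of the readlines() approach follows. It writes a small sample file to a temporary directory first so that it can run on its own; the file contents and path are assumptions for illustration only:

```python
import os
import tempfile

# Build a small sample file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "original_file.txt")
with open(path, "w") as f:
    f.write("a\nb\na\nc\nb\n")

# readlines() keeps the trailing "\n" on each line, so the values
# can be written back out without adding a newline.
with open(path) as f:
    unique = set(f.readlines())

print(sorted(unique))  # ['a\n', 'b\n', 'c\n']
```

Because the newline stays attached, a last line with no trailing newline would count as different from the same value followed by one.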