I’ve spent a fair amount of time today scratching my head trying to work out how to avoid what are ‘logically’ duplicate results being returned when using a self-join in a database query and thought it best to quickly document what my solution was before I forget it.
This is probably case-specific and may not work in other environments depending on the data being used. It is also going to seem somewhat contrived as I have deliberately tried to keep this as simple as possible and avoided all the supporting tables and also details as to why this is necessary to support the user scenario.
In this example I have a table of name details associated with people which I have created in MySQL 5.0.67 as follows:
DROP TABLE IF EXISTS `schema`.`person`;
CREATE TABLE `schema`.`person` (
`id` int(10) unsigned NOT NULL auto_increment,
`forename` varchar(30) default NULL,
`surname` varchar(30) default NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
and this is populated with the following five records
1, 'Arthur', 'Dent'
2, 'Albert', 'Einstein'
3, 'Alfred', 'Hitchcock'
4, 'Albus', 'Dumbledore'
5, 'Ethelred', 'The Unready'
If I wish to find all records in this table associated with two people using wildcards (one whose forename begins with ‘A’ and the other whose forename begins with ‘Al’), I ended up using a self-join and two table aliases, e.g. P1 = Person 1 and P2 = Person 2 as follows:
select p1.id, p1.forename, p1.surname, p2.id, p2.forename, p2.surname
from person p1, person p2
where p1.forename like 'A%' and p2.forename like 'Al%';
Now, unfortunately when this is executed this results in all the matching permutations being returned as follows:
1, 'Arthur', 'Dent', 2, 'Albert', 'Einstein'
2, 'Albert', 'Einstein', 2, 'Albert', 'Einstein'
3, 'Alfred', 'Hitchcock', 2, 'Albert', 'Einstein'
4, 'Albus', 'Dumbledore', 2, 'Albert', 'Einstein'
1, 'Arthur', 'Dent', 3, 'Alfred', 'Hitchcock'
2, 'Albert', 'Einstein', 3, 'Alfred', 'Hitchcock'
3, 'Alfred', 'Hitchcock', 3, 'Alfred', 'Hitchcock'
4, 'Albus', 'Dumbledore', 3, 'Alfred', 'Hitchcock'
1, 'Arthur', 'Dent', 4, 'Albus', 'Dumbledore'
2, 'Albert', 'Einstein', 4, 'Albus', 'Dumbledore'
3, 'Alfred', 'Hitchcock', 4, 'Albus', 'Dumbledore'
4, 'Albus', 'Dumbledore', 4, 'Albus', 'Dumbledore'
Not only has this returned duplicates for the individual themselves (e.g. Albert Einstein for both person 1 and person 2) but also the same combinations in a different order (e.g ID 3,2 and 2,3). The first one of these is easy to remove from the result set when an ‘id’ field is present by adding the following criteria in the query
and p1.id != p2.id;
i.e.
select p1.id, p1.forename, p1.surname, p2.id, p2.forename, p2.surname
from person p1, person p2
where p1.forename like 'A%' and p2.forename like 'Al%'
and p1.id != p2.id;
this results in the following result set:
1, 'Arthur', 'Dent', 2, 'Albert', 'Einstein'
3, 'Alfred', 'Hitchcock', 2, 'Albert', 'Einstein'
4, 'Albus', 'Dumbledore', 2, 'Albert', 'Einstein'
1, 'Arthur', 'Dent', 3, 'Alfred', 'Hitchcock'
2, 'Albert', 'Einstein', 3, 'Alfred', 'Hitchcock'
4, 'Albus', 'Dumbledore', 3, 'Alfred', 'Hitchcock'
1, 'Arthur', 'Dent', 4, 'Albus', 'Dumbledore'
2, 'Albert', 'Einstein', 4, 'Albus', 'Dumbledore'
3, 'Alfred', 'Hitchcock', 4, 'Albus', 'Dumbledore'
As can be seen however, this still has the same logical combinations duplicated (order not withstanding). The removal of these records is significantly harder to do. I initially wrongly thought that I would be able to remove this through the addition of a ‘distinct’ keyword to the query but this only operates across all columns in the record and therefore doesn’t deliver the required result.
I resolved this by the inclusion of a combination of the ‘if’ and ‘concat’ functions that I was not aware of in SQL and a ‘group by’ clause (the ‘if’ function may not in fact be part of the ANSI standard yet). The following was the result:
select p1.id, p1.forename, p1.surname, p2.id, p2.forename, p2.surname,
if (p1.id<p2.id,CONCAT(p1.id,p2.id),CONCAT(p2.id,p1.id)) as pid_concat
from person p1, person p2
where p1.forename like 'A%' and p2.forename like 'Al%'
and p1.id != p2.id
group by pid_concat;
1, 'Arthur', 'Dent', 2, 'Albert', 'Einstein', '12'
1, 'Arthur', 'Dent', 3, 'Alfred', 'Hitchcock', '13'
1, 'Arthur', 'Dent', 4, 'Albus', 'Dumbledore', '14'
3, 'Alfred', 'Hitchcock', 2, 'Albert', 'Einstein', '23'
4, 'Albus', 'Dumbledore', 2, 'Albert', 'Einstein', '24'
4, 'Albus', 'Dumbledore', 3, 'Alfred', 'Hitchcock', '34'
Now whilst the ‘if’ function may or may not be widely implemented, it seems from the quick look I had that there is typically an equivalent function in most of the widely used databases in use. I found an Oracle function called ‘decode’ which would seem to deliver the same functionality. Result!