I first came across Soundex in the late 1970s when working on a hospital IT system. The system had a surname index that used Soundex to generate the initial key into the index (the index then had the patient ID number as a secondary key, thus considerably reducing the time taken to search for patients by surname).
The reason for using the Soundex algorithm is that terms that are often misspelled can be a problem for database designers. Names, for example, are variable in length, can have strange spellings, and are not unique. Names also come from a wide range of ethnic origins, which gives us names that are pronounced the same way but spelled differently, and vice versa.
To solve this problem, we need some method of coding names that can find similar-sounding ones. Just such a family of coding algorithms exists and is called Soundex, after the first version, which was patented by Margaret K. Odell and Robert C. Russell in 1918.
A Soundex search algorithm takes a word, such as a person's name, as input and produces a character string which identifies a set of words that are (roughly) phonetically alike. It is very handy for searching large databases when the user has incomplete data.
The algorithm that I used in the late 1970s is actually fairly straightforward to code and requires just a single pass over the input word, as can be seen from the steps below:
1. Capitalize all letters in the word and drop all punctuation marks. Pad the word with trailing blanks as needed during each step of the procedure.
2. Retain the first letter of the word.
3. Change all occurrences of the following letters after the first letter to '0' (zero):
'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
4. Change the remaining letters from the following sets into the digit given:
1 = 'B', 'F', 'P', 'V'
2 = 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'
3 = 'D', 'T'
4 = 'L'
5 = 'M', 'N'
6 = 'R'
5. Collapse every run of identical adjacent digits in the string that resulted from step (4) into a single digit.
6. Remove all zeros (placed there in step 3) from the string that resulted from step (5).
7. Pad the string that resulted from step (6) with trailing zeros and return only the first four positions, which will be of the form letter, digit, digit, digit.
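The seven steps above can be sketched in Python roughly as follows. This is only a minimal illustration of the variant described here, not the original hospital code: the names `CODES` and `soundex` are my own, returning an empty string for input with no letters is an assumption, and the steps are kept separate for readability rather than folded into a single pass. Other Soundex variants treat H, W and the first letter differently, so their results may not match.

```python
import itertools

# Digit assigned to each letter in steps 3 and 4.
CODES = {}
for digit, letters in {
    "0": "AEIOUHWY",
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    # Step 1: upper-case and keep only the letters A-Z.
    letters = [ch for ch in word.upper() if ch in CODES]
    if not letters:
        return ""  # assumption: nothing to encode
    # Step 2: retain the first letter.
    first, rest = letters[0], letters[1:]
    # Steps 3 and 4: replace each remaining letter with its digit.
    digits = [CODES[ch] for ch in rest]
    # Step 5: collapse each run of identical adjacent digits to one digit.
    collapsed = [d for d, _ in itertools.groupby(digits)]
    # Step 6: drop the zeros.
    kept = [d for d in collapsed if d != "0"]
    # Step 7: pad with trailing zeros and keep the first four positions.
    return (first + "".join(kept) + "000")[:4]


print(soundex("Robert"))   # R163
print(soundex("O'Brien"))  # O165
```

The padding in step 7 is done by appending three zeros before slicing, which is enough because the retained first letter always fills the first position.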
To give an example using my surname "Mitchell":
1. Becomes MITCHELL
2. Keep the M
3. The rest becomes 0TC00LL (setting the first letter aside and changing the I, H and E to zeros)
4. Becomes M0320044
5. Becomes M03204
6. Becomes M324
7. Final result is M324 and this is stored in the index
If someone were searching for Mitchel, the rules above would return the same result, thus increasing the chance of finding the right person.
There are many uses for Soundex and it need not only be used for names. For example, it could also be used for addresses or any other free-format alphabetic string that needs to be searched quickly and efficiently.
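To show how such a lookup might work, here is a small Python sketch of a surname index keyed on the Soundex code. It is an illustration only: an in-memory dictionary stands in for the real database index, the patient IDs and surnames are made up, and `soundex`, `build_index` and `find` are hypothetical names of my own.

```python
from collections import defaultdict

# Letters A-Z mapped to the digits from steps 3 and 4.
_TABLE = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "01230120022455012623010202",
)


def soundex(word):
    """Compact version of the seven steps described in the text."""
    letters = "".join(ch for ch in word.upper() if "A" <= ch <= "Z")
    if not letters:
        return ""  # assumption: nothing to encode
    out = letters[0]
    previous = ""
    for digit in letters[1:].translate(_TABLE):
        # Skip repeats of the previous digit (step 5) and zeros (step 6).
        if digit != previous and digit != "0":
            out += digit
        previous = digit
    return (out + "000")[:4]


def build_index(patients):
    """Group patient IDs under the Soundex code of each surname."""
    index = defaultdict(list)
    for patient_id, surname in patients:
        index[soundex(surname)].append(patient_id)
    return index


def find(index, surname):
    """Return the IDs of all patients whose surname shares the code."""
    return index.get(soundex(surname), [])


# Made-up sample data.
patients = [(1001, "Mitchell"), (1002, "Mitchel"), (1003, "Smith"), (1004, "Smyth")]
index = build_index(patients)

print(find(index, "Mitchel"))  # [1001, 1002]
print(find(index, "Smith"))    # [1003, 1004]
```

Each lookup returns every patient whose surname shares the code, so the caller still has to pick the right person from a short candidate list.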