Charles S. Patridge 172 Monce Road Residence: (860) 673-9278 Burlington, CT 06013 Email: SAS Fuzzy Match Software Do you license the SAS System ? Then Read on! Do your files have unnecessary duplicate records ? Then Read on! Do you use outside commercial customer files ? Then Read on! Do you need to match files with no common keys ? Then Read on! Upon visual inspection and human reasoning, you can determine the following two names represent the same individual: A. Chuck S. Patridge, 172 Monce Road Burlington, CT 06013 B. Patridge, Charles Steven, RFD 2 Box 172 Farmington, CT 06085 However, most computer files without expensive sophisticated software can not determine these two records are the same individual You might very well need to have both records (same person) stored within your customer file for valid reasons. But, you also might like to know how many of your computer records contain multiple entries for similar or same people, item or records. In fact, you could be processing and/or storing these entries many times over at an expense to your business operations. My experience has found many customer base files contain as much as 5 to 7 percent duplicate records for the same entity without being easily detected. I have developed over the past ten years a rather common sense approach to isolating these duplicate records using the base SAS system. And at a cost far below the standard commercial prices which range from $60,000 and up.Have you ever needed to process commercial mailing files and match them against your own internal customer files to determine which ones are currently customers of yours and which ones are not. In most cases, these files are not set up in a fashion to easily perform this task without expensive application development from your own staff. Again, the routine I have developed will perform this task without much modification and at a hit ratio of as much as five times the standard commercial products. And finally, do you have files which need to be matched against each other but there is no common key to perform exact matching? The above routine I have described can perform this task, as long as there is some common field (name, address, birth date or some other entity) which exists in both files. If you have such a need as I have described and license the SAS System, call or email for an estimate, and have your data tested. Read on for some real life examples where this routine was implemented, along with an example of the source code.
Electronic Matching of Records without common keys by
Charles Patridge ITT Hartford Insurance Corporate Actuarial Hartford Plaza Hartford, CT 06115 860-547-6644 NESUG October 6-8, 1996 Boston PK Plz Boston, MA.
Contact Information:
Charles Patridge
172 Monce Road
Burlington, CT 06013
Home: 860-673-9278
Email: Charles_S_Patridge@prodigy.net
Title: The Fuzzy Feeling SAS Provides
Electronic Matching of Records without common keys
Introduction:
By briefly describing four real life situations where this application has been
implemented, I hope to provide a better understanding of when and where
Electronic Fuzzy Matching could be effective and utilized.
Application A:
The Vice President of Marketing comes down with a list of clients from Dunn &
Brad Street that are potentially ideal candidates for a special marketing
product campaign. The VP asks you which ones (from the list) are and are
not current customers. You ask if this list of clients exist on a floppy
diskette, as you can see this list encompasses several hundred entries.
The VP says "Sure, here is the diskette and this piece of paper describes
what, where and how the fields are contained on the diskette". You are relieved
you do not
have to key in these several hundred entries. You put the diskette into your
"hot little" PC and start to peruse the file. You notice several things
about the way the data is stored on this PC file:
1. There is no common "KEY" in this PC file that can be used to link these
entries with your company's customer file directly.
2. You determine this file has name, address, city, state and zip code for
each entry. You jump to the conclusion you can write a SAS program which will
read your customer base file and the D&B file, sort the name of client by zip
code and run a merge statement matching both files by zip code and name of
client.
After developing the initial SAS program and running your merge statement, you
discover you have SUCCESSFULLY matched 21% of the D&B file with the company
customer file. You deliver your report to the VP and state you have performed
magic with the D&B file by matching 21% of these records. The VP states he
appreciates your efforts but believes there should be more than 50% of these
D&B entries in the current company customer file.
You ask "How, can that be? I matched those records with a SAS program based on
name of client with in zip code!". The VP says "I had my secretary take the
first 100 entries and print them out, and she cross referenced them manually
with our customer list. You then asked to see her list and those she
matched. You review the secretary's list quickly and discovered the name
of clients on the D&B list are spelled or positioned differently than on the
company's customer list. For example:
D&B list Company customer list
Chuck Patridge P D P C, Ltd. Patridge, Charles
172 Monce Road PDPC, Ltd.
Burlington, Ct. 06013 Monce Rd.
Burlington, Ct. 06013-2545
You realize by looking at the two similar entries, they are in fact the same
client. You accept the fact these are the same client, and your SAS program
with the MERGE statement has failed to match records because the name of
the client is stored/spelled differently from your internal company customer
file versus the D&B external file. With a puzzled looked, you reply to the
VP "I am unable to match these two files with a high degree of accuracy
because
D&B does not store their clients' names the same way our company does!
We would need to purchase a matching software package to perform your request".
The VP says "How much would that cost?". You reply "About $60,000?". VP
asks "Can't we develop our own matching program?" You reply "Sure, but I
would need 3 or 4 months to develop such a program. But I will not be
able to get to it for another 3 or 4 months unless you say it is ok to
put the current projects on the back burner". The VP answers by saying
"No, continue with the current projects. I'll have my secretary do it
manually".
Application B:
Company A is a direct and reinsurance company, and its clients are other
primary insurance companies. Each month each primary insurance company
sends Company A a list of insureds, as well as the insurance agent who
placed the business. Company A also sells its own products to these same
insurance agents. Company A would like to know which agents of the
primary companies are also their own agents and which are not.
Company A's agent list has more than 55,000 members (agents). Each month
of submissions to Company A could easily be one to two thousand from the
primary companies. And as before, each primary company has its own way
of storing mailing data on its agency force.
Question at hand, how many agents of the primary companies are agents of
Company A's ?
Application C:
A Direct Mail Marketing Company (DMMC) sends over 10,000 catalogs to
consumers every month at a cost of $1.19 per catalog sent. The VP of
Catalog Distribution is asked to cut costs in his mailing budget without
sacrificing the marketing effort. On average, each catalog mailed generates
approximately $42 of sales.
With many options available to the VP of Distribution, he chooses to find how
many catalogs are sent to the same address. He approaches his MIS staff for
an alphabetical list of clients with the same address. The first run
produces about 3% (or 3,547) records with an exact address for a returned
savings of $4,200. The MIS analyst indicated this list only selected
clients where there were exact mailing addresses, and felt there were
as many records where there would be a "fuzzy" match. The VP says how
long
would it take to catch these fuzzy matches. The analyst states he will
research the software market to see what is available. Two weeks later,
the analyst returns and informs the VP it would cost DMMC $70,000 for the
matching software. The VP asks if there is any other alternative. The
analyst suggests
developing the software inhouse but does not have any idea how long it will
take nor how effective it would be. The VP decides against developing or
buying any matching software; instead, he decides to look into other ways
of cutting his mailing expenses.
Application D:
Two state agencies maintain files on certain federally funded programs.
Agency A maintains records on those state residents who are under the Long
Term Care program, and Agency B maintains files on state residents who have
or had cancer treatments. Patients can enter and leave from one program to
another as many times as necessary. The federal government has appropriated
funds for these two programs based on the number of patients in each
program. In addition, an extra percentage is granted for those
patients which utilize both care facilities.
The problem for the state agency is each built their own files independent
of the other. One agency chose to use Social Security as a key, while the
other agency used a computer generated number to track the patients. Hence,
there is no common link between the two files.
However, each file maintained data for the following information: Patient
First and Last Name, Date of Birth, Town of Residence, and Patient's gender.
The trick was to match these two files and find those patients which had
utilized both health care facilities.
Having developed elaborate Cobol and SAS programs, the director of agency A
believed there were more patients in both programs then previously matched.
Out of 500,000 records and with some modifications, I installed my matching
routine and discovered an additional 4.5% (about 22,500 records) were at
one time in both care facilities. This additional matching entitled the
state agencies to more Federal funds.
I chose to start this presentation with several real life business situations
I have encountered over the past 4 years. The names have been disguised to
protect the innocent and prevent any embrassment to the above companies on
the "Fortune 500" list. In addition, I believe an introduction to my
application would better explain its usefulness/effectiveness than
explaining
the actual SAS application. I have used this particular program for various
applications with only minor changes to the actual code.
The main purpose of this application is the ability to match records from
two separate files which do not have any common key to link records together.
Instead, the files have common fields but can not be matched exactly by
the contents of these fields due to the way the data is keyed or maintained.
For instance, the client name in file A is keyed as Charles Patridge;
in file B the client name is Chuck Patridge. As smart as a computer can
be, it can not reason that these two data fields are, in fact, the
same person. However, an individual can create a computer program(s) to
consider these two data fields to be close to being a match.
That is what I have attempted to create. A SAS program to look at fields
within a record and decide whether the contents are close to being the
"same" as the fields within another record. As with most applications,
the critical aspect to this particular problem was becoming familiar with
the data. To accomplish this task, I took approximately 1,200 records
from a real file and started to match each record with a file consisting
of more than 55,000 similar type records. As I matched each record, I
analyzed
the data from both files and eventually determined an algorythmn could be
created to simulate what the human brain does when comparing two similar
character strings.
To duplicate this human thought process and utilize the current tools built
within the base SAS system, I began the initial program one module at a
time. First, I decided the SCAN function would be crucial to the success
of my program. In using SCAN, I did not have to worry about the order in
which a client's name is stored. That is, Charles Patridge is the same
client name as
Patridge, Charles. All that is important is the "words" in one field are
"contained" in a character data string of another field. In addition, I
realized not all words from one field should or have to be contained in
another field. That is, "Charles Patridge" is the same person as "Patridge,
"Chuck". The difference between these two comparisons is "Chuck" and "Charles".
To compensate for these nicknames and/or abbreviations, one can build a SAS
macro to "normalize" the data. For instance, when dealing with addresses,
there are certain words which are commonly abbreviated: St for Street,
Rd for Road, Dr for Drive, etc. Hence, I had to build several SAS macros
to "normalize" my data depending on the field's use. For example,
addresses need one kind of normalizing versus the names of clients. A special
note is needed when trying to normalize the data. I have found data contents
have a certain characteristic based on industry and/or application. One
example is the insurance agent application described above. There are
numerous insurance agents throughout the United States whose business
/corporate name contain the word "AGENCY". However, in the DMMC example
above, "AGENCY" can be a rather unique word and be useful in the matching
process. The point is "normalizing" data should be done with care for
each application to maximize the matching (hit ratio) process.
The last ingredient needed for my application to work is the ability to read
one record from one SAS dataset and hold it open while reading through
another SAS dataset to look for potential matches with the record from the
first file. This third ingredient was accomplish using the POINT-- option
of the SET statement. A side benefit of the POINT option was the ability
to reduce the amount of I/O and CPU by breaking down the source file
(file to be used to determine if a match exists) into subsets based on
the application. For instance, the example Application B tries to match
insurance agents from all across the United States. I decided to subset
the source file into 50 datasets based on the state in which the agent's
address is located. This allow the program to search records based on
state instead of searching through all states for a possible match.
To enhance the matching process and reduce the manual effort in matching
records, I developed several additional techniques. First, I knew I
needed to
prioritized those matches found. That is, there could be numerous matches
in a given situation, and I wanted the most probable match to occur first
by some determined method. The method I chose was to count the number of
words found in each possible occurrence, and then rank these matches by
the number of words contained in descending order. In real life
applications, there could be numerous matches (especially in Metro
areas) for a given client, and I did not want the computer program to
select just one
match. I needed to list all possible matches for a Human to make the final
decision as to which one(s) are truly the real matches. This is an
application where the computer can do alot of the grunt work but is limited
in making the right/accurate choice. Hence, I needed to sort the final
matches according to the source record with the most words found.
Another technique added to the matching process was the occurrence of vowels
within the data. Due to human error, I determined many words are misspelled
because of vowels. For example, "Charlie" sounds the the same as "Charley"
but are spelled differently because of the vowels. So, to enhance the
matching process, I match each "normalized" word first looking for a
possible hit. And then I squeeze out the vowels and perfo the same
routine looking for a hit by using only the consonants. For example,
"Charlie"
is not contained in the string "Charley". However, after removing the
vowels, "Chrl" is contained in "Chrl". And I just have increased my hit
ratio by one word.
A side note for SAS users of release 6.07, I tried using the SOUND() function
in my application. It would have required to parse each and every word in
the source file adding a significant amount of processing to the matching
process. The gains I received did not seem to warrant utilizing it.
However, it could be used for those cases where it seems to make sense,
such as "BOOK" vs "HOOK", "COUCH" vs "POUCH", etc.
Having discussed the basic concepts and necessary ingredients to the
application, it is time to discuss the actual program using the example
Application B. First, it is assumed both files (source-master and target-to
be matched) have been "normalized" and edited for obvious errors (state
code matches zip code, all characters are upper case, etc).
This application involves several hundred to several thousand records each
month submitted to Company A from approximately 30 primary insurance
companies. The media for submission is either on paper (keyed) or on a
magnetic tape/diskette. The required data fields are agent number, agent
name, address, city, state, zip code, and company code. It should be
noted there are some intervening programs needed to separate certain
fields due to the way companies submit their data. For instance, many
companies code
city and state as one field, and Company A has program routines to
separate this into two separate fields. However, my major program does
not really need the state or city, as I use the zip code as the primary
key to determine if a possible match record is within a postal boundary
territory. More specifically, I use only the first three digits of
the zip code to determine if one record is within a geographic area
to search for a possible match. Once the files are in order, the
process of matching the
new monthly agent records against Company A's 55,000 agent master list
begins. First, determine if a record has already been matched for a
given company's agent. If so, then bypass matching it. If not, store
record in a separate SAS dataset for matching. Read the historical
matches to see if any prior records have not been matched and recycle
them for another round of matching.
Start the physical process of matching. If a possible match is found
then store the target and source record in another SAS dataset. If no
match is found then mark target record as no match. Due to the way the
data is entered into Company A's target files, several attempts of
matching are needed:
Match target name 1 field to source name 1 field.
Match target name 2 field to source name 1 field. *
Match target name 2 field to source addr 1 field. *
Match target addr 1 field to source addr 1 field.
* name 2 field may contain either name of client or 1st address line of
client (hence, need to match against name and addr).
After all target records have been through the matching process, the
matched records are then accumulated and sorted by the most probable
match first. These records are then printed and save to a SAS dataset
for an online selection system. Here, the operator using a FSP
application can select which records are truly a match. The online
system is an added component to the base application due to the
need to have SAS/FSP. Otherwise, the printed version can be used
to visually select which records are indeed
matched. once manually marked, an operator can use the system file
editor and go through and mark/update the source file with the appropriate
matches. Once source files are updated, the process can start all over
again for the next month.
In a typical month of about 1,500 records, I am able to run the entire
application from start to finish (includes selecting match records and
updating source file) in approximately 3 to 4 hours. Compare this to
manually matching 1200 records (using system file editor) and taking
approximately 8-10 hours with a rather responsive IBM 3090 VM/CMS
system, the application made me TWENTY-FIVE times more productive.
And in addition, the application ended up being more accurate
than manually doing it due to human
factors.
In order to keep this paper brief and to the point, I have intentionally
left out all macros and "normalizing" routines which do not directly
support the matching process. I hope this application and the other
three examples listed above have demonstrated the usefulness and
flexibility of the SAS system, as well as the fuzzy feeling SAS can
provide!
New Monthly Agency Records for Matching Process
===================================================================
AGT003 AGT005
HOLMES AGENCY SMYTHE, HOWARD K
RIO DRIVE BURLINGTON INSURANCE AGENCY
BOX 442 SPEILMAN HIGHWAY ROUTE 4
BURLINGTON BURLINGTON
CT 06013 CT 06085
===================================================================
AGT007 AGT008
PROFESSIONAL DATA PROCESSING PARTRIDGE, CHUCK
CONSULTANTS DBA INSURANCE AGENCY OF
. DRAWER 123 BURLINGTON
INTERSECTION RIO & MONCEE RD BOX 424 172 MONCE AVENUE
FARMINGTON BURLINGTON
CT 06085 CT 06001
===================================================================
AGT006
HOLME & SON FINANCIAL
ADVISORS
RIO GRANDE DRIVE
P.O. BOX 442
UNIONVILLE
CT 06099
Prior History of Matched and Unmatched Records
0000001 AGT001 0000002 AGT002
CHARLES S. PATRIDGE PRESIDENT H K SMITH AGENCY
PATRIDGE INSURANCE AGENCY, 12 MAIN STREET
INC. BOX 123
172 MONCE ROAD BOX 378 FARMINGTON
BURLINGTON CT 06013
CT 06013
===================================================================
_______ AGT003 0000003 AGT004
HOLMES AGENCY P.D.P.C. LIMITED
RIO DRIVE INSURANCE AGENCY OF
BOX 442 BURLINGTON
BURLINGTON CORNER OF MONCE AND RIO ROADS
CT 06013 UNIONVILLE CT 06085
===================================================================
0000006 0000003
CHUCK PATRIDGE PDPC LTD
NO ADDRESS BOX 865
BRISTOL 1 MAIN ST
MA 02019 MILES POND
VT 05815
===================================================================
0000001 0000005
PATRIDGE INSURANCE AGENCY, INC. CHUCK PATRIDGE AGENCIES
172 MONCE RD NO ADDRESS
BOX 378 BRISTOL
BURLINGTON CT 06019
CT 06013
===================================================================
0000002 0000004
HOWARD K SMITH AGENCY P D P C LTD
MAIN ST MONCE RD
FARMINGTON BOX 211
CT 06085 UNIONVILLE
CT 06085
The Output from the Matching Routine:
Agency File had some critical field in common with Co. Agency file
Incoming Record Company Agency Master File
===================================================================
AGT003 - 1 2 4-Key Word=AGENCY0000001
HOLMES AGENCY PATRIDGE INSURANCE AGENCY, INC.
RIO DRIVE 172 MONCE RD BOX 378
BOX 442 BURLINGTON, CT 06013
BURLINGTON, CT 06013 Erroneous Match
===================================================================
AGT003 - 1 2 4-Key Word=AGENCY0000002
HOLMES AGENCY HOWARD K SMITH AGENCY
RIO DRIVE MAIN ST
BOX 442 FARMINGTON, CT 06085
BURLINGTON, CT 06013 Erroneous Match
===================================================================
AGT003 - 1 2 3-Key Word=GNC (aGeNCy) 0000005
HOLMES AGENCY CHUCK PATRIDGE AGENCIES
RIO DRIVE NO ADDRESS
BOX 442 BRISTOL, CT 06019
BURLINGTON, CT 06013 Erroneous Match
===================================================================
AGT005 - 3 9 12- Key Word=SMTH, 0000002
HOWARD,AGENCY HOWARD K SMITH AGENCY
HOLMES AGENCY MAIN ST
RIO DRIVE FARMINGTON, CT 06085
BOX 442
BURLINGTON, CT 06013 Successful Match
===================================================================
AGT005 - 3 6 8- Key Word=INSURANCE, 0000001
AGENCY PATRIDGE INSURANCE AGENCY, INC.
SMYTHE, HOWARD K 172 Monce RD BOX 378
BURLINGTON INSURANCE AGENCY BURLINGTON, CT 06013
SPEILMAN HIGHWAY ROUTE 4
BURLINGTON, CT 06085 Erroneous Match
===================================================================
AGT005 - 2 6 7- Key Word=GNC 0000005
SMYTHE, HOWARD K CHUCK PATRIDGE AGENCIES
BURLINGTON INSURANCE AGENCY NO ADDRESS
SPEILMAN HIGHWAY ROUTE 4 BRISTOL, CT 06019
BURLINGTON, CT 06085 Erroneous Match
===================================================================
Following item(s) were unable to find a corresponding match
AGT006
HOLME & SON FINANCIAL ADVISORS
RIO GRANDE DRIVE
P.O. BOX 442
UNIONVILLE, CT 06099
===================================================================On-Line Selection for Matching System
New Incoming Unmatched Record Match Y/N ? MASTER File Record
1 GORDON MILLER INC INS MGR Y MILLER G G AGY INC
2 3211 N WASHINGTON ST BOX 326
3 BOX 326 COLUMBUS, IN 47202
4 COLUMBUS, IN 47201
===================================================================
1 M J LIBERMAN CO Y LIEBERMAN M J CO
2 8 WOOD HOLLOW RD
3 354 EISENHOWER PKWY PARSIPPANY, NJ 07054
4 LIVINGSTON, NJ 07093
===================================================================
1 ASSN INS SYSTEMS Y ASSN MGR SYSTEMS
2 3108 PARHAM RD 3023
3 3108 PORHAM RD STE 302 B RICHMON, VA 23229
4 RICHMOND, VA 23200
===================================================================
1 HALLISTER AGY Y HOLLISTER AGY INC
2 BOX 299
3 5739 W MAIN STR MEXICO, NY 13114
4 MEXICO, NY 13114
===================================================================
1 OWNES GRP Y OWENS GRP
2 619 PAISADE AVE
3 619 PALSADES AVE ENGELEWOOD CLIFFS,
4 ENGLEWOOD CLIF, NJ 07632 NJ 07632
===================================================================
1 CLABOUGH ASSN Y CLABOUGH ASSN INC
2 BOX B
3 1010 REED ST WITTENBERG, WI 54499
4 WITTENBERG, WI 54499
===================================================================
1 UNITED INS INVESTMENTS Y UNITED INS INVMTS
2 204 MONROE ST STE 210
3 204 C MONROE ST STE 210 ROCKVI, MD 20850
4 ROCHVILLE, MD 20850
===================================================================
1 CARROLL COUNTY INS AGY Y CARROLL COUNTY THOM
2 BOX 69
3 78 W MAIN ST BOX 69 WESTMINISTER, MD
4 WESTMINISTER, MD 21157 21157
===================================================================
1 BEALL GARNER SCREEN GEARE Y BEALL GARNER SCREEN
2 BOB 1626
3 44 BALTIMORE ST CUMBERLAND, MD 21502
4 CUMBERLAND, MD 21502
===================================================================
1 DEGNAN HOUGH AGY Y IPS INC DEGNAN DEWEY
2 32 MAIN ST
3 BOX 465 WEST LEBANON, NH
4 W LEBANON, NH 03784 03784
===================================================================
1 J H PORT AG INC Y PORT JACK H AGY INC
2 28 W MARKET ST
3 BOX 240 MT UNION, PA 17066
4 MOUNT UNION, PA 17066
===================================================================
1 DGK INS Y DAVIS GREGORY KYLE
2 BOX 337
3 BOX 337 FACTORYVILLE, PA
4 FACTORYVILLE, PA 18419 18419
===================================================================
1 N E C INS INC Y NOONAN EAST CENTRAL
2 BOX 478
3 308 NOONAN DR BOX 537 PACIFIC, MO 63069
4 P O BOX, MO 63069
===================================================================
1 CHEENEY INS AGY Y CHANEY INS
2 7500 E BUTHERUS DR
3 7500 E BUTHENUS DR SCOTTSDALE, AZ 85261
4 SCOTTSDALE, AZ 85260
===================================================================
1 E H ROERIND CO INC Y ROEDING E H CO INC
2 BOX 17 900
3 BOX 17900 CRESTVIEW HILL, KY
4 CRESTVIEW HILLS, KY 41017 41017
===================================================================
1 GEORGE JEE INS Y JEE GEORGE INS
2 434 18TH AVE BOX 210246
3 850 TARAVAL ST SAN FRANCISCO, CA
4 SAN FRANCISCO, CA 94100 94121
1 KELSO WHEELER ASSN Y KELSO WHEELER ASSN
2 1404 NEUSE BLVD DRAW
3 325 POLLOCK ST DRAW K NEW BERN, NC 28563
4 NEW BERN, NC 28560
===================================================================
1 ELKANAH MEAD Y MEAD ELKANAH AGY INC
2 BOX 812
3 BOX 1675 GREENWICH, CT 06830
4 GREENWICH, CT 06838
===================================================================
1 MC LAUGHLIN CO Y MCLAUGHLIN CO
2 1725 DESALES ST N W
3 9TH FL 1725 DE SALES ST N W WASHINGTON, DC 20036
4 WASHINGTON DC, DC 20036
===================================================================
Some "NORMALIZING" of Data
Business Name Addresses
AGENCY AGY AVENUE AVE
MANUFACTURING MFG BOULEVARD BLVD
MANUFACTURE MFG DRAWER BOX
INSURANCE INS P O BOX BOX
ENTERPRIZE ENTPZE POBOX BOX
ENTERPRISE ENTPZE CIRCLE CIR
CASUALTY CSLTY CENTER CTR
MANAGING MGMT DRIVE DR
MANAGEMENT MGMT EAST E
OFFICE OFCE FLOOR FL
FORT FT
HIGHWAY HWY
LANE LN
NORTH N
PARK PK
ROAD RD
SOUTH S
SQUARE SQ
STREET ST
TURNPIKE TPKE
TERRACE TR
WEST W
CROSSING XING
SOURCE CODE