SAS Coding Tips and Techniques



Send your comments and suggestions to Charles Patridge:
Charles_S_Patridge@prodigy.net

Tip / Technique Follows:

Charles S. Patridge
172 Monce Road                           Residence: (860) 673-9278
Burlington, CT 06013                    Email:

SAS Fuzzy Match Software

Do you license the SAS System ? Then Read on!
Do your files have unnecessary duplicate records ? Then Read on!
Do you use outside commercial customer files ? Then Read on!
Do you need to match files with no common keys ? Then Read on!

Upon visual inspection and human reasoning, you can determine the following
two names represent the same individual:
A. Chuck S. Patridge, 172 Monce Road Burlington, CT 06013
B. Patridge, Charles Steven, RFD 2 Box 172 Farmington, CT 06085

However, most computer files without expensive sophisticated software can not
determine these two records are the same individual You might very well need
to have both records (same person) stored within your customer file for
valid reasons. But, you also might like to know how many of your computer
records contain multiple entries for similar or same people, item or
records.  In fact, you could be processing and/or storing these entries
many times over at an expense to your business operations.
My experience has found many customer base files contain as much as 5 to
7 percent duplicate records for the same entity without being easily
detected.  I have developed over the past ten years a rather common sense
approach to isolating these duplicate records using the base SAS system.
And at a cost far below the standard commercial prices which range from
$60,000 and up.

Have you ever needed to process commercial mailing files and match them against your own internal customer files to determine which ones are currently customers of yours and which ones are not. In most cases, these files are not set up in a fashion to easily perform this task without expensive application development from your own staff. Again, the routine I have developed will perform this task without much modification and at a hit ratio of as much as five times the standard commercial products. And finally, do you have files which need to be matched against each other but there is no common key to perform exact matching? The above routine I have described can perform this task, as long as there is some common field (name, address, birth date or some other entity) which exists in both files. If you have such a need as I have described and license the SAS System, call or email for an estimate, and have your data tested. Read on for some real life examples where this routine was implemented, along with an example of the source code.


The Fuzzy Feeling SAS Provides

Electronic Matching of Records without common keys by

Charles Patridge ITT Hartford Insurance Corporate Actuarial Hartford Plaza Hartford, CT 06115 860-547-6644 NESUG October 6-8, 1996 Boston PK Plz Boston, MA.

Contact Information:

Charles Patridge
172 Monce Road
Burlington, CT 06013
Home: 860-673-9278
Email: Charles_S_Patridge@prodigy.net

Title: The Fuzzy Feeling SAS Provides
        Electronic Matching of Records without common keys

Introduction:
By briefly describing four real life situations where this application has been
implemented, I hope to provide a better understanding of when and where
Electronic Fuzzy Matching could be effective and utilized.

Application A:
The Vice President of Marketing comes down with a list of clients from Dunn &
Brad Street that are potentially ideal candidates for a special marketing
product campaign.  The VP asks you which ones (from the list) are and are
not current customers.  You ask if this list of clients exist on a floppy
diskette, as you can see this list encompasses several hundred entries.
The VP says "Sure, here is the diskette and this piece of paper describes
what, where and how the fields are contained on the diskette".  You are relieved
you do not have to key in these several hundred entries. You put the diskette into your "hot little" PC and start to peruse the file. You notice several things about the way the data is stored on this PC file: 1. There is no common "KEY" in this PC file that can be used to link these entries with your company's customer file directly. 2. You determine this file has name, address, city, state and zip code for each entry. You jump to the conclusion you can write a SAS program which will read your customer base file and the D&B file, sort the name of client by zip code and run a merge statement matching both files by zip code and name of client. After developing the initial SAS program and running your merge statement, you discover you have SUCCESSFULLY matched 21% of the D&B file with the company customer file. You deliver your report to the VP and state you have performed magic with the D&B file by matching 21% of these records. The VP states he appreciates your efforts but believes there should be more than 50% of these D&B entries in the current company customer file. You ask "How, can that be? I matched those records with a SAS program based on name of client with in zip code!". The VP says "I had my secretary take the first 100 entries and print them out, and she cross referenced them manually with our customer list. You then asked to see her list and those she matched. You review the secretary's list quickly and discovered the name of clients on the D&B list are spelled or positioned differently than on the company's customer list. For example: D&B list Company customer list Chuck Patridge P D P C, Ltd. Patridge, Charles 172 Monce Road PDPC, Ltd. Burlington, Ct. 06013 Monce Rd. Burlington, Ct. 06013-2545 You realize by looking at the two similar entries, they are in fact the same client. You accept the fact these are the same client, and your SAS program with the MERGE statement has failed to match records because the name of the client is stored/spelled differently from your internal company customer file versus the D&B external file. With a puzzled looked, you reply to the VP "I am unable to match these two files with a high degree of accuracy because D&B does not store their clients' names the same way our company does! We would need to purchase a matching software package to perform your request". The VP says "How much would that cost?". You reply "About $60,000?". VP asks "Can't we develop our own matching program?" You reply "Sure, but I would need 3 or 4 months to develop such a program. But I will not be able to get to it for another 3 or 4 months unless you say it is ok to put the current projects on the back burner". The VP answers by saying "No, continue with the current projects. I'll have my secretary do it manually". Application B: Company A is a direct and reinsurance company, and its clients are other primary insurance companies. Each month each primary insurance company sends Company A a list of insureds, as well as the insurance agent who placed the business. Company A also sells its own products to these same insurance agents. Company A would like to know which agents of the primary companies are also their own agents and which are not. Company A's agent list has more than 55,000 members (agents). Each month of submissions to Company A could easily be one to two thousand from the primary companies. And as before, each primary company has its own way of storing mailing data on its agency force. Question at hand, how many agents of the primary companies are agents of Company A's ? Application C: A Direct Mail Marketing Company (DMMC) sends over 10,000 catalogs to consumers every month at a cost of $1.19 per catalog sent. The VP of Catalog Distribution is asked to cut costs in his mailing budget without sacrificing the marketing effort. On average, each catalog mailed generates approximately $42 of sales. With many options available to the VP of Distribution, he chooses to find how many catalogs are sent to the same address. He approaches his MIS staff for an alphabetical list of clients with the same address. The first run produces about 3% (or 3,547) records with an exact address for a returned savings of $4,200. The MIS analyst indicated this list only selected clients where there were exact mailing addresses, and felt there were as many records where there would be a "fuzzy" match. The VP says how long would it take to catch these fuzzy matches. The analyst states he will research the software market to see what is available. Two weeks later, the analyst returns and informs the VP it would cost DMMC $70,000 for the matching software. The VP asks if there is any other alternative. The analyst suggests developing the software inhouse but does not have any idea how long it will take nor how effective it would be. The VP decides against developing or buying any matching software; instead, he decides to look into other ways of cutting his mailing expenses. Application D: Two state agencies maintain files on certain federally funded programs. Agency A maintains records on those state residents who are under the Long Term Care program, and Agency B maintains files on state residents who have or had cancer treatments. Patients can enter and leave from one program to another as many times as necessary. The federal government has appropriated funds for these two programs based on the number of patients in each program. In addition, an extra percentage is granted for those patients which utilize both care facilities. The problem for the state agency is each built their own files independent of the other. One agency chose to use Social Security as a key, while the other agency used a computer generated number to track the patients. Hence, there is no common link between the two files. However, each file maintained data for the following information: Patient First and Last Name, Date of Birth, Town of Residence, and Patient's gender. The trick was to match these two files and find those patients which had utilized both health care facilities. Having developed elaborate Cobol and SAS programs, the director of agency A believed there were more patients in both programs then previously matched. Out of 500,000 records and with some modifications, I installed my matching routine and discovered an additional 4.5% (about 22,500 records) were at one time in both care facilities. This additional matching entitled the state agencies to more Federal funds. I chose to start this presentation with several real life business situations I have encountered over the past 4 years. The names have been disguised to protect the innocent and prevent any embrassment to the above companies on the "Fortune 500" list. In addition, I believe an introduction to my application would better explain its usefulness/effectiveness than explaining the actual SAS application. I have used this particular program for various applications with only minor changes to the actual code. The main purpose of this application is the ability to match records from two separate files which do not have any common key to link records together. Instead, the files have common fields but can not be matched exactly by the contents of these fields due to the way the data is keyed or maintained. For instance, the client name in file A is keyed as Charles Patridge; in file B the client name is Chuck Patridge. As smart as a computer can be, it can not reason that these two data fields are, in fact, the same person. However, an individual can create a computer program(s) to consider these two data fields to be close to being a match. That is what I have attempted to create. A SAS program to look at fields within a record and decide whether the contents are close to being the "same" as the fields within another record. As with most applications, the critical aspect to this particular problem was becoming familiar with the data. To accomplish this task, I took approximately 1,200 records from a real file and started to match each record with a file consisting of more than 55,000 similar type records. As I matched each record, I analyzed the data from both files and eventually determined an algorythmn could be created to simulate what the human brain does when comparing two similar character strings. To duplicate this human thought process and utilize the current tools built within the base SAS system, I began the initial program one module at a time. First, I decided the SCAN function would be crucial to the success of my program. In using SCAN, I did not have to worry about the order in which a client's name is stored. That is, Charles Patridge is the same client name as Patridge, Charles. All that is important is the "words" in one field are "contained" in a character data string of another field. In addition, I realized not all words from one field should or have to be contained in another field. That is, "Charles Patridge" is the same person as "Patridge, "Chuck". The difference between these two comparisons is "Chuck" and "Charles". To compensate for these nicknames and/or abbreviations, one can build a SAS macro to "normalize" the data. For instance, when dealing with addresses, there are certain words which are commonly abbreviated: St for Street, Rd for Road, Dr for Drive, etc. Hence, I had to build several SAS macros to "normalize" my data depending on the field's use. For example, addresses need one kind of normalizing versus the names of clients. A special note is needed when trying to normalize the data. I have found data contents have a certain characteristic based on industry and/or application. One example is the insurance agent application described above. There are numerous insurance agents throughout the United States whose business /corporate name contain the word "AGENCY". However, in the DMMC example above, "AGENCY" can be a rather unique word and be useful in the matching process. The point is "normalizing" data should be done with care for each application to maximize the matching (hit ratio) process. The last ingredient needed for my application to work is the ability to read one record from one SAS dataset and hold it open while reading through another SAS dataset to look for potential matches with the record from the first file. This third ingredient was accomplish using the POINT-- option of the SET statement. A side benefit of the POINT option was the ability to reduce the amount of I/O and CPU by breaking down the source file (file to be used to determine if a match exists) into subsets based on the application. For instance, the example Application B tries to match insurance agents from all across the United States. I decided to subset the source file into 50 datasets based on the state in which the agent's address is located. This allow the program to search records based on state instead of searching through all states for a possible match. To enhance the matching process and reduce the manual effort in matching records, I developed several additional techniques. First, I knew I needed to prioritized those matches found. That is, there could be numerous matches in a given situation, and I wanted the most probable match to occur first by some determined method. The method I chose was to count the number of words found in each possible occurrence, and then rank these matches by the number of words contained in descending order. In real life applications, there could be numerous matches (especially in Metro areas) for a given client, and I did not want the computer program to select just one match. I needed to list all possible matches for a Human to make the final decision as to which one(s) are truly the real matches. This is an application where the computer can do alot of the grunt work but is limited in making the right/accurate choice. Hence, I needed to sort the final matches according to the source record with the most words found. Another technique added to the matching process was the occurrence of vowels within the data. Due to human error, I determined many words are misspelled because of vowels. For example, "Charlie" sounds the the same as "Charley" but are spelled differently because of the vowels. So, to enhance the matching process, I match each "normalized" word first looking for a possible hit. And then I squeeze out the vowels and perfo the same routine looking for a hit by using only the consonants. For example, "Charlie" is not contained in the string "Charley". However, after removing the vowels, "Chrl" is contained in "Chrl". And I just have increased my hit ratio by one word. A side note for SAS users of release 6.07, I tried using the SOUND() function in my application. It would have required to parse each and every word in the source file adding a significant amount of processing to the matching process. The gains I received did not seem to warrant utilizing it. However, it could be used for those cases where it seems to make sense, such as "BOOK" vs "HOOK", "COUCH" vs "POUCH", etc. Having discussed the basic concepts and necessary ingredients to the application, it is time to discuss the actual program using the example Application B. First, it is assumed both files (source-master and target-to be matched) have been "normalized" and edited for obvious errors (state code matches zip code, all characters are upper case, etc). This application involves several hundred to several thousand records each month submitted to Company A from approximately 30 primary insurance companies. The media for submission is either on paper (keyed) or on a magnetic tape/diskette. The required data fields are agent number, agent name, address, city, state, zip code, and company code. It should be noted there are some intervening programs needed to separate certain fields due to the way companies submit their data. For instance, many companies code city and state as one field, and Company A has program routines to separate this into two separate fields. However, my major program does not really need the state or city, as I use the zip code as the primary key to determine if a possible match record is within a postal boundary territory. More specifically, I use only the first three digits of the zip code to determine if one record is within a geographic area to search for a possible match. Once the files are in order, the process of matching the new monthly agent records against Company A's 55,000 agent master list begins. First, determine if a record has already been matched for a given company's agent. If so, then bypass matching it. If not, store record in a separate SAS dataset for matching. Read the historical matches to see if any prior records have not been matched and recycle them for another round of matching. Start the physical process of matching. If a possible match is found then store the target and source record in another SAS dataset. If no match is found then mark target record as no match. Due to the way the data is entered into Company A's target files, several attempts of matching are needed: Match target name 1 field to source name 1 field. Match target name 2 field to source name 1 field. * Match target name 2 field to source addr 1 field. * Match target addr 1 field to source addr 1 field. * name 2 field may contain either name of client or 1st address line of client (hence, need to match against name and addr). After all target records have been through the matching process, the matched records are then accumulated and sorted by the most probable match first. These records are then printed and save to a SAS dataset for an online selection system. Here, the operator using a FSP application can select which records are truly a match. The online system is an added component to the base application due to the need to have SAS/FSP. Otherwise, the printed version can be used to visually select which records are indeed matched. once manually marked, an operator can use the system file editor and go through and mark/update the source file with the appropriate matches. Once source files are updated, the process can start all over again for the next month. In a typical month of about 1,500 records, I am able to run the entire application from start to finish (includes selecting match records and updating source file) in approximately 3 to 4 hours. Compare this to manually matching 1200 records (using system file editor) and taking approximately 8-10 hours with a rather responsive IBM 3090 VM/CMS system, the application made me TWENTY-FIVE times more productive. And in addition, the application ended up being more accurate than manually doing it due to human factors. In order to keep this paper brief and to the point, I have intentionally left out all macros and "normalizing" routines which do not directly support the matching process. I hope this application and the other three examples listed above have demonstrated the usefulness and flexibility of the SAS system, as well as the fuzzy feeling SAS can provide! New Monthly Agency Records for Matching Process =================================================================== AGT003 AGT005 HOLMES AGENCY SMYTHE, HOWARD K RIO DRIVE BURLINGTON INSURANCE AGENCY BOX 442 SPEILMAN HIGHWAY ROUTE 4 BURLINGTON BURLINGTON CT 06013 CT 06085 =================================================================== AGT007 AGT008 PROFESSIONAL DATA PROCESSING PARTRIDGE, CHUCK CONSULTANTS DBA INSURANCE AGENCY OF . DRAWER 123 BURLINGTON INTERSECTION RIO & MONCEE RD BOX 424 172 MONCE AVENUE FARMINGTON BURLINGTON CT 06085 CT 06001 =================================================================== AGT006 HOLME & SON FINANCIAL ADVISORS RIO GRANDE DRIVE P.O. BOX 442 UNIONVILLE CT 06099 Prior History of Matched and Unmatched Records 0000001 AGT001 0000002 AGT002 CHARLES S. PATRIDGE PRESIDENT H K SMITH AGENCY PATRIDGE INSURANCE AGENCY, 12 MAIN STREET INC. BOX 123 172 MONCE ROAD BOX 378 FARMINGTON BURLINGTON CT 06013 CT 06013 =================================================================== _______ AGT003 0000003 AGT004 HOLMES AGENCY P.D.P.C. LIMITED RIO DRIVE INSURANCE AGENCY OF BOX 442 BURLINGTON BURLINGTON CORNER OF MONCE AND RIO ROADS CT 06013 UNIONVILLE CT 06085 =================================================================== 0000006 0000003 CHUCK PATRIDGE PDPC LTD NO ADDRESS BOX 865 BRISTOL 1 MAIN ST MA 02019 MILES POND VT 05815 =================================================================== 0000001 0000005 PATRIDGE INSURANCE AGENCY, INC. CHUCK PATRIDGE AGENCIES 172 MONCE RD NO ADDRESS BOX 378 BRISTOL BURLINGTON CT 06019 CT 06013 =================================================================== 0000002 0000004 HOWARD K SMITH AGENCY P D P C LTD MAIN ST MONCE RD FARMINGTON BOX 211 CT 06085 UNIONVILLE CT 06085 The Output from the Matching Routine: Agency File had some critical field in common with Co. Agency file Incoming Record Company Agency Master File =================================================================== AGT003 - 1 2 4-Key Word=AGENCY0000001 HOLMES AGENCY PATRIDGE INSURANCE AGENCY, INC. RIO DRIVE 172 MONCE RD BOX 378 BOX 442 BURLINGTON, CT 06013 BURLINGTON, CT 06013 Erroneous Match =================================================================== AGT003 - 1 2 4-Key Word=AGENCY0000002 HOLMES AGENCY HOWARD K SMITH AGENCY RIO DRIVE MAIN ST BOX 442 FARMINGTON, CT 06085 BURLINGTON, CT 06013 Erroneous Match =================================================================== AGT003 - 1 2 3-Key Word=GNC (aGeNCy) 0000005 HOLMES AGENCY CHUCK PATRIDGE AGENCIES RIO DRIVE NO ADDRESS BOX 442 BRISTOL, CT 06019 BURLINGTON, CT 06013 Erroneous Match =================================================================== AGT005 - 3 9 12- Key Word=SMTH, 0000002 HOWARD,AGENCY HOWARD K SMITH AGENCY HOLMES AGENCY MAIN ST RIO DRIVE FARMINGTON, CT 06085 BOX 442 BURLINGTON, CT 06013 Successful Match =================================================================== AGT005 - 3 6 8- Key Word=INSURANCE, 0000001 AGENCY PATRIDGE INSURANCE AGENCY, INC. SMYTHE, HOWARD K 172 Monce RD BOX 378 BURLINGTON INSURANCE AGENCY BURLINGTON, CT 06013 SPEILMAN HIGHWAY ROUTE 4 BURLINGTON, CT 06085 Erroneous Match =================================================================== AGT005 - 2 6 7- Key Word=GNC 0000005 SMYTHE, HOWARD K CHUCK PATRIDGE AGENCIES BURLINGTON INSURANCE AGENCY NO ADDRESS SPEILMAN HIGHWAY ROUTE 4 BRISTOL, CT 06019 BURLINGTON, CT 06085 Erroneous Match =================================================================== Following item(s) were unable to find a corresponding match AGT006 HOLME & SON FINANCIAL ADVISORS RIO GRANDE DRIVE P.O. BOX 442 UNIONVILLE, CT 06099 ===================================================================On-Line Selection for Matching System New Incoming Unmatched Record Match Y/N ? MASTER File Record 1 GORDON MILLER INC INS MGR Y MILLER G G AGY INC 2 3211 N WASHINGTON ST BOX 326 3 BOX 326 COLUMBUS, IN 47202 4 COLUMBUS, IN 47201 =================================================================== 1 M J LIBERMAN CO Y LIEBERMAN M J CO 2 8 WOOD HOLLOW RD 3 354 EISENHOWER PKWY PARSIPPANY, NJ 07054 4 LIVINGSTON, NJ 07093 =================================================================== 1 ASSN INS SYSTEMS Y ASSN MGR SYSTEMS 2 3108 PARHAM RD 3023 3 3108 PORHAM RD STE 302 B RICHMON, VA 23229 4 RICHMOND, VA 23200 =================================================================== 1 HALLISTER AGY Y HOLLISTER AGY INC 2 BOX 299 3 5739 W MAIN STR MEXICO, NY 13114 4 MEXICO, NY 13114 =================================================================== 1 OWNES GRP Y OWENS GRP 2 619 PAISADE AVE 3 619 PALSADES AVE ENGELEWOOD CLIFFS, 4 ENGLEWOOD CLIF, NJ 07632 NJ 07632 =================================================================== 1 CLABOUGH ASSN Y CLABOUGH ASSN INC 2 BOX B 3 1010 REED ST WITTENBERG, WI 54499 4 WITTENBERG, WI 54499 =================================================================== 1 UNITED INS INVESTMENTS Y UNITED INS INVMTS 2 204 MONROE ST STE 210 3 204 C MONROE ST STE 210 ROCKVI, MD 20850 4 ROCHVILLE, MD 20850 =================================================================== 1 CARROLL COUNTY INS AGY Y CARROLL COUNTY THOM 2 BOX 69 3 78 W MAIN ST BOX 69 WESTMINISTER, MD 4 WESTMINISTER, MD 21157 21157 =================================================================== 1 BEALL GARNER SCREEN GEARE Y BEALL GARNER SCREEN 2 BOB 1626 3 44 BALTIMORE ST CUMBERLAND, MD 21502 4 CUMBERLAND, MD 21502 =================================================================== 1 DEGNAN HOUGH AGY Y IPS INC DEGNAN DEWEY 2 32 MAIN ST 3 BOX 465 WEST LEBANON, NH 4 W LEBANON, NH 03784 03784 =================================================================== 1 J H PORT AG INC Y PORT JACK H AGY INC 2 28 W MARKET ST 3 BOX 240 MT UNION, PA 17066 4 MOUNT UNION, PA 17066 =================================================================== 1 DGK INS Y DAVIS GREGORY KYLE 2 BOX 337 3 BOX 337 FACTORYVILLE, PA 4 FACTORYVILLE, PA 18419 18419 =================================================================== 1 N E C INS INC Y NOONAN EAST CENTRAL 2 BOX 478 3 308 NOONAN DR BOX 537 PACIFIC, MO 63069 4 P O BOX, MO 63069 =================================================================== 1 CHEENEY INS AGY Y CHANEY INS 2 7500 E BUTHERUS DR 3 7500 E BUTHENUS DR SCOTTSDALE, AZ 85261 4 SCOTTSDALE, AZ 85260 =================================================================== 1 E H ROERIND CO INC Y ROEDING E H CO INC 2 BOX 17 900 3 BOX 17900 CRESTVIEW HILL, KY 4 CRESTVIEW HILLS, KY 41017 41017 =================================================================== 1 GEORGE JEE INS Y JEE GEORGE INS 2 434 18TH AVE BOX 210246 3 850 TARAVAL ST SAN FRANCISCO, CA 4 SAN FRANCISCO, CA 94100 94121 1 KELSO WHEELER ASSN Y KELSO WHEELER ASSN 2 1404 NEUSE BLVD DRAW 3 325 POLLOCK ST DRAW K NEW BERN, NC 28563 4 NEW BERN, NC 28560 =================================================================== 1 ELKANAH MEAD Y MEAD ELKANAH AGY INC 2 BOX 812 3 BOX 1675 GREENWICH, CT 06830 4 GREENWICH, CT 06838 =================================================================== 1 MC LAUGHLIN CO Y MCLAUGHLIN CO 2 1725 DESALES ST N W 3 9TH FL 1725 DE SALES ST N W WASHINGTON, DC 20036 4 WASHINGTON DC, DC 20036 =================================================================== Some "NORMALIZING" of Data Business Name Addresses AGENCY AGY AVENUE AVE MANUFACTURING MFG BOULEVARD BLVD MANUFACTURE MFG DRAWER BOX INSURANCE INS P O BOX BOX ENTERPRIZE ENTPZE POBOX BOX ENTERPRISE ENTPZE CIRCLE CIR CASUALTY CSLTY CENTER CTR MANAGING MGMT DRIVE DR MANAGEMENT MGMT EAST E OFFICE OFCE FLOOR FL FORT FT HIGHWAY HWY LANE LN NORTH N PARK PK ROAD RD SOUTH S SQUARE SQ STREET ST TURNPIKE TPKE TERRACE TR WEST W CROSSING XING
SOURCE CODE
END of WEB PAGE