/*** TIP 00201 ***/
Linking phone numbers

I've got a SAS file with 2 vars: OLD and NEW. They represent old and new phone
 numbers once the customer's phone has changed. When this happens the old-new obs is
 simply inserted somewhere in the file, no ordering/indexing of any kind, and the numbers
 have nothing to do with the time the record was inserted. Here's a snapshot of how data
 might look like:
  
   OLD        NEW
 8888888888 9999999999
 2222222222 3333333333
 5555555555 6666666666
 7777777777 8888888888
 0000011111 0000022222
 3333333333 4444444444
 9999999999 0000000000
 1111111111 2222222222
  
 A human eye will easily see that first-old to last-new chains are: 
  
 5555555555-6666666666
 7777777777-8888888888-9999999999-0000000000
 0000011111-0000022222
 1111111111-2222222222-3333333333-4444444444
  
 The file having upwards of 800000 obs, I need to devise a programmatic means of doing
 this. Actually with the sample like above the output would look like:
  
   OLD          NEW        NLINKS
 5555555555    6666666666      2
 7777777777    0000000000      4
 0000011111    0000022222      2
 1111111111    4444444444      4
  
 Where NLINKS represents the number of links in each "chain". So far my efforts have
 resulted in about fifteen steps of sorts and merges taking quite some time to run. Could
 anyone suggest a more elegant and/or efficient approach with fewer passes through the
 input?


Solution #1: Author: Paul M. Dorfman Email: sashole@BELLSOUTH.NET Max Zwingli asks how to traverse phone numbers randomly inserted in a file, when a number has been changed, and produce the first and last numbers in the chain and the number of links. This kind of question seems to have been asked before (I think sometime in 1998) and answered by a number of respondents including Ian Whitlock, Karsten Self, and myself. I recall using the question as a turf to compare SAS indexing, hand-coded binary search, and formatting as means of looking up new and old accounts (phones in your case), but because I have learned since that hashing provides for faster and more efficient searching, it is interesting to see how it might work in this case. The basic algorithm for solving the problem is quite simple: 1. Allocate 3 hash tables: 1) containing OLD phones; 2) containing NEW phones; 3) containing NEW phones *parallel* to the first table. 2. Read the file record by record and load the hash tables. 3. Read a record from the file again. Search OLD phone in the NEW hash table. If the search is unsuccessful, the OLD phone is the beginning of a chain, otherwise go to step 3. 4. Grab the corresponding NEW phone and search it in OLD table. If a match is found, grab the corresponding NEW phone from hash table 3 and search OLD table again. Otherwise it is the end of the chain, so stop and output the endpoints, then go to step 3. Now we see that the problem boils down to repeated hash searches of the same kind, so it makes sense to concoct a unified procedure and apply it with a degree of flexibility throughout the program. First, let us try to make use of the Macro Language (I assume that the input data set is called A) to create an %HSEARCH() routine that can chameleon depending upon the table and key it is searching: %let h = 1000003; * prime number => nobs*2; %macro hsearch (table=, key=); %if %upcase(&table) = %upcase(old) %then %let table = 1; %else %if %upcase(&table) = %upcase(new) %then %let table = 2; found = 0; do j=1+mod(&key,&h) by 1 until (h(&table,j)=. or found); if j = &h then j = 1; if h(&table,j) = &key then found = 1; end; %mend hsearch; data oldnew (keep=old new nlinks); array h (3,&h) _temporary_; *1=old,2=new,3=(old||new); do until (e1); set a end=e1; %hsearch (table=old, key=old); h(1,j) = old; h(3,j) = new; %hsearch (table=new, key=new); h(2,j) = new; end; do until (e2); set a end=e2; %hsearch (table=new, key=old); if found then continue; do nlinks=2 by 1 until (not found); %hsearch (table=old, key=new); if found then new = h(3,j); end; output; end; run;
Solution #2: Author: Paul M. Dorfman Email: sashole@BELLSOUTH.NET A good number of people dislike the Macro Language whenever a more conventional means of structured programming can be employed. I do not necessarily share that viewpoint given certain SAS limitations, but let us give the LINK subroutine a fair shot as well: data oldnew (keep=old new nlinks); array h (3,&h) _temporary_; *1=old,2=new,3=(old||new); do until (e1); set a end=e1; table = 1; key = old; link hsearch; h(1,j) = old; h(3,j) = new; table = 2; key = new; link hsearch; h(2,j) = new; end; do until (e2); set a end=e2; table = 2; key = old; link hsearch; if found then continue; table = 1; do nlinks=2 by 1 until (not found); key = new; link hsearch; if found then new = h(3,j); end; output; end; stop; hsearch: found = 0; do j=1+mod(key,&h) by 1 until (h(table,j) = . or found); if j = &h then j = 1; if h(table,j) = key then found = 1; end; run; In both cases, the problem is solved in a single step, albeit with two passes through the file. Above, hash table size is assumed about twice the number of the keys to search, in order to obtain the fastest searching speed with the wussiest collision resolution policy - linear probing.
Solution #3: /*** Paul Dorfman's solution is more efficient and elegant but I thought I would apply one of my fuzzy logic routines to your problem and base it upon looking for duplicate strings Not very efficient but it works Regards, Charles Patridge Email: Charles_S_Patridge@prodigy.net ***/ data phonenum; infile cards; length old new $10 ; input @1 old @12 new; cards; 8888888888 9999999999 2222222222 3333333333 5555555555 6666666666 7777777777 8888888888 0000011111 0000022222 3333333333 4444444444 9999999999 0000000000 1111111111 2222222222 ;;;; run; /*** create a format of phone numbers linked ***/ data phonefmt; retain fmtname "$phone"; set phonenum; start = old; label = new; drop old new; run; proc format cntlin=phonefmt; run; /*** assume no more than 6 phone numbers linked ***/ %macro linking; data linking; set phonenum; length link1 link2 link3 link4 link5 link6 $ 10; array link(6) link1-link6; link1 = old; nlink = 0; %do I = 2 %to 6; link(&I) = put( link(&I-1), $phone.) ; if link(&I) = link(&I-1) then do; link(&I) = ' '; goto finish; end; nlink + 1; %end; finish:; /*** create a key of all linked numbers ***/ alllinks = compress(link1||link2||link3||link4||link5||link6); run; proc sort data=linking out=linking; by descending nlink alllinks; run; /*** create another duplicate file with alllinks renamed ***/ data duptable; set linking(rename=(alllinks=balllink) drop=old new link1-link6 nlink); run; %mend linking; %linking; /*** get number of records ***/ %emptyyn( work.linking); /*** EMPTYYN macro can be found on my web site under Tips and Techniques***/ %let numrecds = &numobs; /*** read each record and compare it to all other records ***/ /*** and output only unique keys that are not contained in ***/ /*** other keys ***/ %macro findkeys( howmany ); data unique (drop=found balllink); %do i = 1 %to &howmany; %if &i gt &howmany %then stop; recd = &i; set linking point=recd; found = 0; %do k = 1 %to &howmany; brecd = &k; set duptable point=brecd; if (alllinks = balllink) then found + 1; if alllinks ne balllink then do; if (0 ne index(trim(balllink), trim(alllinks))) then found + 1; end; if (alllinks = balllink) and (found le 1 ) and (&k le &i) then output; %end; %end; stop; run; %mend findkeys; %findkeys(&numrecds);
Some comments from Max the person who ask the question: Max Zwingli maxzwingli@mail.nu Thanks Karsten, Paul and Charles for all your help. Paul's code in both macro and link versions is amazingly fast and works in a single step but needs an extra 30 Mb of RAM or so - as I understand to load the "hash" tables. Charles' code is a lot easier on memory but not nearly as fast. I've also dug out the old '98 discussion on the subject and tried the 'format version'. It worked satisfactorily fast but still on the average 3 times slower than Paul's "hashing" and more importantly required upwards of 100 Mb of extra RAM. That having been said, I do see how the main logic described by Paul works but he uses this routine to "load" and search the array(s): hsearch: found = 0; do j=1+mod(key,&h) by 1 until (h(table,j)=. or found); if j = &h then j = 1; if h(table,j) = key then found = 1; end; I have no clue how this works, to say nothing of how it can search 3 times faster than a format while using much less memory. Is it just me or anyone else has a problem with this "nonstandard" lookup? Is it described in any SAS book? Explanation by Ian Whitlock Given KEY, you want to store it at TABLE,MOD(KEY,&H)). The problem is that several values of KEY may yield the same value of MOD(KEY,&H). So where should they be stored? You need to have an algorithmic way of finding the position in the array. One possibility is to start and the next position and look element by element until you find either that where the KEY is stored or an empty slot. When empty this provides a place to put KEY. MOD(KEY,&H) provides a fast place to start looking. When the array is large enough in comparison with number of entries to be stored, you do not have to look very far before finding a place. Hence the seach is fast - short time to find MOD(KEY,&H) and a small number of array elements to look at. Now lets look at the code. hsearch: found = 0; do j=1+mod(key,&h) by 1 until (h(table,j)=. or found); if j = &h then j = 1; if h(table,j) = key then found = 1; end; FOUND is a flag to indicate whether KEY is found or not. The array for TABLE has &H elements. If you start some place and keep adding 1 you can fall off the end. In this case, just start back at 1 and keep adding. Why must you always stop? Since &H is larger than the number of keys, some elements must be missing. Hence you must always either run into KEY or an empty slot. I think Terjeson, Mark [mailto:TERJEMW@DSHS.WA.GOV] posted in the last month or two the address of a web site in Autralia that has a very good general discussion of hashing. You could search the archives at www.listserv.uga.edu for the message. It probably has "HASH" in the subject and you could narrow it down by Marks's name if I am correct about the name. Many computer science books on general algorithims will also have a discussion of hashing. I do not know of any SAS references other than papers that Paul has written. You can find at least two of those in the SUGI proceedings that are now available at www.sas.com. Ian Whitlock Email: whitloi1@westat.com /*** end of tip 00201 ***/