/*** TIP 00202 ***/
    Subject: Dynamic Table Lookup (was: Account transfer problem)
    November 1998 from SAS-L Archives
    Key Words: Table LookUp, Nested Formats, Linking,
               Binary / Indexed / Formatted Search 
    
Authors:  

Name            Email
Paul Dorfman  , paul_dorfman@hotmail.com
Karsten Self  , kmself@IX.NETCOM.COM
Ian Whitlock  , WHITLOI1@WESTAT.COM
Peter Crawford, Peter@CrawfordSoftware.demon.co.uk

     
    The "Account transfer problem: Help!" Ludwig Boltzmann posted a few days
    ago, resulted in an interesting thread extending beyond the original
    question. However, the problem which Ian Whitlock dubbed "Classic loop
    lookup problem" lends itself as an excellent test bed for different search
    methods. The original Ludwig's posting is fairly short, so I decided to
    repost it for the sake of easy reference:
     
     I hope to get some help with a complicated (I think) problem. In my
     organization, clients are assigned 9-digit account. Whenever an account
     number, for whatever reason, has to be changed, the old number and new
     number are inserted into a 'transfer' table. When I unloaded the table to a
     dataset XREF (and removed duplicates) I found out that records don't follow
     any specific order. For example, it might look like
     
     OLD    NEW
     116    119
     047    120
     122    124
     018    111
     029    114
     119    126
     111    113
     033    117
     112    118
     020    122
     065    121
     027    123
     011    112
     125    127
     113    115
     114    116
     115    125
     
     (I limited numbers to 3 digits for simplicity). The problem is that one
     account can go through many changes, and I need to reconstruct the whole
     chain for every physical account on the table. The only clue I have is that
     a new account is made higher than another new account previously assigned.
     Looking at above file I could see that 011 became 112 and then 118, 47
     became 120 (single transfer), etc. In other words, I would like to create a
     dataset LINKED with one row representing one chain and a pointer in that row
     showing how long the chain is. Done by hand, it would look like
     
     OBS    ACT1    ACT2    ACT3    ACT4    ACT5    ACT6    ACT7    PTR
     
      1      47      120       .       .       .       .      .      2
      2      18      111     113     115     125     127      .      6
      3      29      114     116     119     126       .      .      5
      4      33      117       .       .       .       .      .      2
      5      20      122     124       .       .       .      .      3
      6      65      121       .       .       .       .      .      2
      7      27      123       .       .       .       .      .      2
      8      11      112     118       .       .       .      .      3
     
     The transfer table has about 1,000,000 rows, so I need to figure out a way
     do it in a  program but got really stumped. I greatly appreciate every bit
     of help.
     
    I responded with a solution based on the format search idea: Load two
    formats, OLD mapping old items to new items, and NEW mapping new items to
    nothing. Use the latter to find an occurrence of the OLD value that does not
    have a match among NEW values, and start at this observation as a root.
    Then, in a loop, exploit the former format to reconstruct the entire chain
    until next new element has no match among old elements thus being a leaf.
     
    Ian Whitlock used the same principle but decided to invoke a different
    search mechanism, that is, indexed SAS file search. To avoid excessive
    lookups, Ian exploited SQL to return a START dataset containing only root
    observations. In the same SQL step, Ian constructed a lookup table by
    indexing the original dataset.
     
    Karsten Self offered the same method I had proposed but with a significant
    distinction: Instead of loading formats from a temporary system file
    containing Proc Format statements (as I did), he constructed a SAS dataset
    to be used with CNTLIN= option. When I started testing my method against
    Ian's, I noticed that the formats took quite long time to load even for
    moderate 10,000 observations. However, when I recoded the process of format
    loading Karsten's way, loading accelerated, on the average, 3 times. I tried
    sorting CNTLIN= datasets trying to make formats load faster but in fact,
    they loaded even slower.
     
    Both approaches consist of two stages: 1) the stage of table lookup
    preparation and 2) lookup itself. Ludwig's transfer table XREF is constantly
    updated as new pairs of NEW and OLD accounts are inserted, so what we have
    at hand is a typical dynamic table lookup problem, in which the first stage
    cannot be completed once and used in great many searches thereafter - the
    lookup table(s) have to be reconstructed every time. Hence, to make this
    kind of program perform best, some kind of equilibrium should be achieved
    between the speed of search and the speed of concocting a lookup object.
     
    During the first test, I noticed that in Ian's program, both parts are quite
    time-balanced; format method was very fast at the search stage but pretty
    slow to load even with sufficient memory resources. I knew from previous
    experience that explicitly coded binary search of an ordered array searches
    about 2 times slower than a format. However, it was apparent that with
    binary search, sorting a lookup source and reading it into an array is all
    needed for lookup object preparation, so I decided to give this method a
    shot, too. I placed all three programs (Ian's indexed search, my format
    lookup with Karsten's load technique, and binary search) in a single SAS job
    and tested them under OS/390 batch and NT/333MHz/128M for 1E4, 1E5, and 1E6
    observations. Note that the latter figure is the one Ludwig has in his
    'real' situation. I had inputs constructed after the following pattern:

     
    %LET INOBS = 10000;
    DATA XREF;
       DO OLD=1 TO &INOBS;
          NEW = OLD + &INOBS*.1;
         R = RANUNI(1);
         OUTPUT;
      END;
    RUN;
    PROC SORT OUT=XREF(DROP=R); BY R; RUN;
    %LET NA = 50;
     

    and then only varied INOBS. By design, it will yield chains of equal lengths
    whose number is 1/10 of &INOBS value. To assure that no array would runs out
    of bounds, I adopted the value 50 for maximum chain length and assigned it,
    as Ian suggested, to a macrovariable NA available through the job. Below,
    all the code is given (where the step above, and steps purging ACCTS dataset
    from the WORK library before next program, are omitted) followed by test
    results. The dataset ACCTS was purged before the commencement of each next
    method. These purge steps, as well as the step creating XREF, are omitted.
     

*** INDEXED SEARCH (IAN WHITLOCK) ***; PROC SQL ; CREATE TABLE START AS SELECT OLD FROM XREF WHERE OLD NOT IN ( SELECT NEW FROM XREF ) ORDER BY OLD ; CREATE UNIQUE INDEX OLD ON XREF ( OLD ) ; QUIT ; DATA ACCTS ( KEEP = ACT1 - ACT&NA NA ) ; ARRAY ACT (&NA) ; SET START ; LINK GETXREF ; NA = 1 ; ACT ( NA ) = OLD ; NA = 2 ; ACT ( NA ) = NEW ; DO WHILE ( 1 ) ; OLD = NEW ; LINK GETXREF ; NA + 1 ; IF NA GT &NA THEN ABORT ; ACT ( NA ) = NEW ; END ; RETURN ; GETXREF: SET XREF KEY = OLD / UNIQUE ; IF _IORC_ THEN DO ; OUTPUT ; _ERROR_ = 0 ; DELETE ; END ; RETURN ; RUN ;
*** FORMATTED SEARCH (P.DORFMAN/K.SELF) ***; DATA OLD NEW; RETAIN TYPE 'N' MAX 9; DO UNTIL(END); SET XREF END=END; FMTNAME = 'OLD'; START = OLD; LABEL = NEW; OUTPUT OLD; FMTNAME = 'NEW'; START = NEW; LABEL = .; OUTPUT NEW; END; HLO = 'O'; FMTNAME = 'OLD'; LABEL = .; OUTPUT OLD; FMTNAME = 'NEW'; LABEL = 1; OUTPUT NEW; STOP; RUN; PROC FORMAT CNTLIN=OLD; RUN; PROC FORMAT CNTLIN=NEW; RUN; DATA ACCTS (KEEP=ACT1-ACT&NA PTR); ARRAY ACT (&NA); SET XREF; IF INPUT(PUT(OLD,NEW.),9.); PTR = 1; ACT(PTR) = OLD; DO UNTIL (OLD = .); PTR + 1; ACT(PTR) = NEW; OLD = INPUT(PUT(NEW,OLD.),9.); IF OLD GT . THEN NEW = OLD; END; RUN;
*** EXPLICIT BINARY SEARCH (P.DORFMAN) ***; PROC SORT DATA=XREF OUT=OLD; BY OLD; PROC SORT DATA=XREF (KEEP=NEW) OUT=NEW; BY NEW; DATA ACCTS (KEEP=ACT1-ACT&NA PTR); ARRAY ACT (&NA); ARRAY LKUP (0:2,&INOBS) _TEMPORARY_; IF _N_ = 1 THEN DO I=1 TO NOBS; SET NEW NOBS=NOBS; LKUP (0,I) = NEW; *KEYED BY NEW ; SET OLD; LKUP (1,I) = OLD; *KEYED BY OLD; LKUP (2,I) = NEW; *SATTELLITE NEW; END; SET XREF; KEY = OLD; ROW = 0; LINK BSEARCH; IF FND_X = .; ACT(1) = OLD; PTR = 1; DO UNTIL (FND_X = .); PTR + 1; ACT(PTR) = NEW; KEY = NEW; ROW = 1; LINK BSEARCH; IF FND_X GT . THEN NEW = LKUP(2,FND_X); END; RETURN; BSEARCH: LO = 1; HI = DIM(LKUP,2); FND_X = .; DO WHILE (LO LE HI); X = INT((LO + HI)*.5); IF KEY LT LKUP(ROW,X) THEN HI = X - 1; ELSE IF KEY GT LKUP(ROW,X) THEN LO = X + 1; ELSE DO; FND_X = X; LEAVE; END; END; RETURN; RUN;

Test results: NT/333MHz/128M OS/390 (SECONDS) (CPU SECONDS) LOOKUP 1E4 1E5 1E6 1E4 1E5 1E6 METHOD INDEXED PREP 2.77 41.02 721.54 1.72 58.47 269.82 SEARCH LOOKUP 2.14 24.68 276.97 1.09 43.62 1160.25 TOTAL 4.91 65.70 998.51 2.81 102.09 1430.07 FORMAT PREP 2.03 20.47 293.73 2.29 23.34 . SEARCH LOOKUP 0.67 5.23 200.03 0.46 5.13 . TOTAL 2.70 25.70 493.76 2.75 28.47 . BINARY PREP 0.41 2.87 31.62 0.31 1.69 18.90 SEARCH LOOKUP 0.96 8.38 97.47 0.77 8.97 264.14 TOTAL 1.37 11.25 129.09 1.08 10.66 283.04
From these numbers, we can draw some conclusions. It should be no surprise that index file search is slowest since despite its being based on B-tree search, it is still a disk search. Besides, comparing PREP times clearly demonstrates that it takes much longer to build an index than create a format, to say nothing of just sorting XREF with a couple of variables. Juxtaposing LOOKUP times makes it clear that realistically speaking, only format search and explicit binary search can compete for performance: the fastest method got there 3.58 to whopping 9.58 times faster than indexed search. With 10,000 and 100,000 items to load, formats were loaded entirely into memory on both machines, in which cases format search performed the LOOKUP stage 1.43 to 1.74 times faster than binary search. However, this advantage was more than offset by format loading time, so in total, binary search completed the task 1.97 to 2.67 times more rapidly. With 1,000,000 search items, the behavior of format method changed drastically. As formats were being loaded, it was kind of amusing to watch memory utilization steadily climb up to all available machine limit. Under NT, when the limit (128M) was reached, SAS switched to disk and kept loading. It took about 300 seconds to finish the load process, but the search time - 493.76 seconds - indicates that it was clearly a disk lookup. In comparison, binary search completed the LOOKUP stage in 129.09 seconds in memory whose consumption never rose above 40M. But what are those missing values in the rightmost column of the table? They simply tell that Big Iron Brother brutally shut the load process off when real memory utilization topped 75M, and contemptuously disregarded all my attempts to trick it, no matter how hard I tried. S0C4. Using hyperspaces (extended memory) to house a format library resulted in all active production modules dumped into the auxiliary storage (meaning that my test SAS program was the ONLY program running in the system. The system folks were quick to appreciate the joke). Apparently, when it comes to memory, (in)formats are extremely greedy. Karsten asks, as a 'bonus' question, what kind of structure it uses to build a search object. Looks like a balanced binary tree would be a good guess. Remember, for each node stored in a binary tree, there are left, right, and parent pointers to be stored as well, which means that, roughly estimating, it should consume 4 times more memory than an array containing the same lookup information would. In lookup applications, intrinsic advantages of binary trees as dynamic structures, i.e. an ability to insert and delete nodes, remain completely unused, whilst building a balanced tree requires a lot of resources (unbalanced tree may make binary tree search degenerate into serial search). If instead of binary trees, tries (M-ary trees) were used, memory demands would be even more outrageous. So, if memory is abundant (probably not less that 256M for a 1E6-item format), and compiled formats can be kept there during searches, and the formats are used frequently without the necessity of recompiling them to reflect changes, then using formats may save a lot of computer resources in the long run. If not, then explicitly coded binary search provides the best of both worlds. Its DATA step code is short, simple, and bulletproof; lookup table preparation boils down to a plain sort of a relatively small file; its lookup speed is second only to formats operating in memory; loading even a 2-million plus search table should not present any hardware problems at the contemporary level of the latter. Lastly, binary search can be macroized as a DATA step function, stored in an autocall library, and then used in a straightforward manner. If anyone wants the macro I have been using, either refer to the SESUG'98 Proceedings or drop me a note. Kind regards, ++++++++++++++++++++++++++++++++++++++ Paul M. Dorfman Jacksonville, FL ++++++++++++++++++++++++++++++++++++++
From: WHITLOI1 WHITLOI1@WESTAT.COM Subject: Re: Dynamic Table Lookup (was: Account transfer problem: Hel Subject: Dynamic Table Lookup (was: Account transfer problem: Help!) Summary: Thanks and a look at deeply nested formats. Respondent: Ian Whitlock whitloi1@westat.com Thanks to Paul Dorfman pdorfma@UCS.ATT.COM for a very nice summary of the look up problem. Between you and Karsten M. Self Karsten.Self@schwab.com I have been forced to update my thinking on formats with millions of entries. Paul missed (just restoring my bruised ego) a trick in considering the cost of dynamic format look ups - nesting. Maybe the data set XREF is not stable, but it is always increasing and probably in relatively small increments, so building on previous formats by nesting looks reasonable. I haven't tested whether nesting will slow down the construction of large formats, or whether it will slow down the lookup process in a practical situation. I have never considered more than nesting one level. proc format ; value base ... ; value nest ... other = [base8.] ; run ; But the above observation prompted me to look further. How far can one nest formats? It looks like it may only be limited by memory and time to do the lookup. Here is a macro to grind out a nested sequence of formats and test them. %macro nesttest ( n = 1 ) ; %local i ; proc format ; value f001fmt 1 = "abc 1" ; %do i = 2 %to &n ; value f%sysfunc(putn(&i,z,3))fmt &i = "abc &i" other = [f%sysfunc(putn(&i-1,z,3))fmt8.] ; %end ; value all %do i = 1 %to &n ; &i = "abc &i" %end ; ; run ; data _null_ ; do x = 1 to &n ; put x= f%sysfunc(putn(&n,z,3))fmt. ; end ; run ; data _null_ ; do x = 1 to &n ; put x= all. ; end ; run ; %mend nesttest ; For the macro illiterate, the macro generates formats (f001fmt, f002fmt, ...) where each format nests the previous one in its definition and then tests the use of the last format built. A second DATA step compares the difference in lookup times with a single format. I stopped my testing with %nesttest ( n = 128 ) because I lost interest at this point. I haven't looked into how far one can reasonably go before the extra lookup time exceeds the saved preparation time, but it is clearly far short of 127. Ian Whitlock
For nesting formats, this offers more hope than tech rep P-222 which says (on page 211) "The SAS System will support up to five levels of nested depth, but the overhead requirements increase dramatically with each additional level." Peter Crawford
/*** end of tip 00202 ***/