/*** TIP000378 ***/

Quick data-completion check routine?

I have received a SAS dataset with a couple hundred columns and would like
to get an idea of the completion rate (how many rows have actual data, not
null) for each of the columns.  Has anyone out there had a similar need for
which they have already devised a solution?

Solution #1 Author: Paul Dorfman Email: paul_dorfman@HOTMAIL.COM It has been done more than once in the past (you can search the archives). Basically, there are two approaches. The simplest is to use a format as an exception lookup table, and then ask FREQ or TABULATE to kick out the frequencies, for example: data a ; input n1-n9 (c1-c9) (: $1.) ; cards ; 5 . . . . 0 0 8 . . Z . F . R . . . 5 3 . . . . 7 . 5 E E . . . . . W . . . . 4 . . 1 6 . B . N . . . . I A . . 2 6 . 4 . 0 . O . . . B . . L X 1 . . 1 . . . 2 . . W . . G . H . K 6 5 8 . . . . 5 7 . . J . N . S . J . . . 6 . . . 0 6 . . . K R . . M V . . . . 4 6 . 9 3 F . V . C . . . . 5 . . 4 4 2 . 2 . . N P . . . . . B 4 7 . . . 3 . . . R . . R A . T P . . 1 5 . 7 2 . 9 5 Z . A . Q . J F W ; run ; proc format ; value nn . = 'Null rows' Other = 'Not null rows' ; value $cc ' ' = 'Null rows' Other = 'Not null rows' ; run ; proc freq data = a ; format _numeric_ nn. _character_ $cc. ; tables _numeric_ _character_ / missing nocum nopercent ; run ; With large number of variables (you say 200?) the FREQ output, though clear, might be slightly more voluminous than desired. If the file contains only numeric variables, it can be cured by resorting to the amazingly concise MEANS instead, proc means data = a N NMiss ; format _numeric_ nn. ; run ; which for the test input above prints: N Variable N Miss --------------------------- n1 6 5 n2 4 7 n3 3 8 n4 5 6 n5 3 8 n6 6 5 n7 3 8 n8 9 2 n9 5 6 -------------------------- Otherwise if both numeric and character variables are present, there is always a Data step, for instance: data _null_ ; if 0 then set a ; array nn _numeric_ ; call symput ('n' , put (dim(nn), best.-l)) ; array cc _character_ ; call symput ('c' , put (dim(cc), best.-l)) ; run ; data comprate ( keep = VarName Type N NMiss ) ; set a nobs = nobs end = end ; array nn 8 _numeric_ ; array cc _character_ ; array nf ( &n ) 8 _temporary_ ; array cf ( &c ) 8 _temporary_ ; do over nn ; nf (_i_) ++ not missing(nn) ; end ; do over cc ; cf (_i_) ++ not missing(cc) ; end ; if end then do ; Type = 'N' ; do over nn ; VarName = vname (nn) ; N = nf (_i_) ; NMiss = nobs - N ; output ; end ; Type = 'C' ; do over cc ; VarName = vname (cc) ; N = cf (_i_) ; NMiss = nobs - N ; output ; end ; end ; run ; proc print ; run ; After having consumed the test input, this prints: Var Obs Type Name N NMiss 1 N n1 6 5 2 N n2 4 7 3 N n3 3 8 4 N n4 5 6 5 N n5 3 8 6 N n6 6 5 7 N n7 3 8 8 N n8 9 2 9 N n9 5 6 10 C c1 6 5 11 C c2 4 7 12 C c3 5 6 13 C c4 3 8 14 C c5 7 4 15 C c6 1 10 16 C c7 4 7 17 C c8 6 5 18 C c9 7 4
Solution #2 Author: Stig Eide Email: stigeide@YAHOO.COM What I do, is running a PROC FREQ on the variables of interest, using a format. But, if the dataset has "A couple hundred columns", it would be nice to have a generic method. Maybe this can help: %let libname=SASHELP; %let memname=CLASS; filename m 'c:\temp\missing.sas'; data _null_; attrib str length=$999; set sashelp.vcolumn; where libname="&libname" and memname="&memname"; file m; str=compress('missing_'!!name!!'=missing('!!name!!');'); put str; run; data missing(keep=missing_:); set &libname..&memname; %include m; run; proc freq data=missing; table missing_: / nopercent nocum; run;
/*** end of tip 00378 ***/