Transposing SAS flat file structure to normalized file structure



The following SAS programs allow the data manager at a CIMS partner agency to reformat the traditional, flat (horizontal) file structure to a normalized (vertical) file structure so that the data can be easily imported or loaded into a relational database. A typical data submission provided to the Chesapeake Bay Program Office (CBPO) is used to illustrate this process. The original format is a permanent SAS data set containing nearly 100 fields. The reason for the large number of fields is that there are associated descriptor (_D), method (_M), and/or analytical problem (_A) code fields for most parameters.

The resultant normalized EVENT file contains fields related to a particular sampling event that are not depth-dependent. Most of these parameters are measured when the monitoring vessel reaches a station, before physical parameters are measured and lab samples are collected. They include such things as station, date and time, total depth, pycnocline depths, air temperature, wind speed and direction, tidal stage, wave height, type of precipitation, and percent cloud cover. Because most of these parameters are described through only use of codes which provide units of measurement, separate units fields are not necessary. It is required that depths be recorded in meters and temperatures be reported in degrees Celsius.

In this file, the fields SOURCE, PROJECT, STATION, DATE, and TIME identify a unique record. In the WQ_EVENT table in the relational database, the DATE and TIME fields are called SAMPLE_DATE and SAMPLE_TIME because date and time are reserved words in most relational databases. The field names representing weather-related parameters have also been lengthened to make them more intuitive.

The resultant normalized DATA file contains fields related to a particular sampling event that are depth-dependent. Although secchi depth is not a depth-dependent parameter, it has been included in this file as directed by the Data Management and Acquisition Workgroup (DMAW) of the Chesapeake Bay Program. The TRANSPOSE procedure of SAS is used to create a normalized file structure in which there is a single record for each parameter value obtained for a particular sampling event at a particular depth in the water column layer from either a in-situ measurement (physical parameters) or an individual sample analyzed in the laboratory (nutrients and others). The descriptor method codes, as well as the analytical problem code, associated with each parameter value are also included in the record.

In this file, the fields SOURCE, PROJECT, STATION, DATE, TIME, SAMPLE_TYPE, SAMPLE_ID, DEPTH, LAYER, and PARAMETER identify a unique record. As in the WQ_EVENT table, the DATE and TIME fields in the WQ_DATA table are called SAMPLE_DATE and SAMPLE_TIME. The other filed names have also bee lengthened to make them more intuitive.

It is important to note that data submissions to the CBPO in SAS format vary from agancy to agency. Therefore, modifications to these programs may be necessary.

VIEW SAS CODE ONLY - Use Edit/Select All to cut and paste for your own use.

SAS Program for creating normalized EVENT file

LIBNAME IN 'SYS$MONITOR:[ODU.RAW]';

* EXTRACTS NON-DEPTH DEPENDENT FIELDS FOR THE SAMPLING EVENT TABLE ;
DATA EVENT KEEP=DATE CLOUD CRUISE PDEPTHL PDEPTHU PRECIP_K STATION TDEPTH TIDE TIME WAVHGT WINDIR WINDSPD);
SET IN.O266268;

* REFORMATS SUBMITTED TIME (NUMERIC DATA TYPE);
HR=INT(TIME/100);
MIN=TIME-HR*100;
TIME=HMS(HR,MIN,0);
FORMAT DATE MMDDYY10.;
FORMAT TIME TIME5.;

* RETRIEVES ONLY RECORDS THAT ARE NOT DEPTH-DEPENDENT (ZERO-DEPTH RECORD USED HISTORICALLY FOR EVENT DATA);
IF SDEPTH=0;

PROC SORT NODUP;
BY STATION;

DATA _NULL_;
LENGTH SOURCE $10.;
LENGTH AGENCY $10.;
LENGTH PROGRAM $10.;
LENGTH PROJECT $10.;
SET EVENT;

* ADDS DESCRIPTIVE FIELDS;
SOURCE='ODU';
AGENCY='VADEQ';
PROGRAM='WQMP';
PROJECT='MAIN';

* OUTPUTS FIXED-LENGTH ASCII FILE;
FILE O266268E;
PUT CRUISE @10 STATION @20 DATE MMDDYY10. @30 TIME TIME5. @35 TDEPTH 5.1 @40 PDEPTHL 5.1 @45 PDEPTHU 5.1 @50 ATEMP 5.1 @55 CLOUD @60 PRECIP_K @65 TIDE @70 WAVHGT @75 WINDIR @80 WINDSPD @85 SOURCE @95 AGENCY @105 PROGRAM @115 PROJECT;

SAS Program for creating normalized DATA file

LIBNAME IN 'SYS$MONITOR:[ODU.RAW]';

* EXTRACTS DEPTH DEPENDENT FIELDS FOR THE WATER QUALITY DATA TABLE;
* DATA VALUES WILL BE TRANSPOSED ACCORDING TO THE PRIMARY KEY FIELDS;
DATA WQ (KEEP=STATION DATE TIME SDEPTH LAYER REP_NUM CHLA COND DISOX2 DISOXY EXVOL_ML KD LIPAT_CM NH4 NO2 NO23 OD480B OD510B OD630B OD647B OD664B OD665A OD750A OD750B PH PHEA PHOSP PO4F POC PON SALIN SAMVOL_L SAVALUE SECCHI SI TDN TDP TSS UDVALUE UWVALUE WTEMP);
SET IN.O266268;

* SETS OPTICAL DENSITY READINGS TO ZERO IF NEGATIVE;
IF OD664B AND OD664B LT 0 THEN OD664B=0;
IF OD665A AND OD665A LT 0 THEN OD665A=0;
IF OD750A AND OD750A LT 0 THEN OD750A=0;
IF OD750B AND OD750B LT 0 THEN OD750B=0;

* SETS LENGTH OF LIGHT PATH FOR CHLA/PHEA CALCULATIONS;
IF OD664B AND OD665A AND OD750A AND OD750B THEN LIPAT_CM=1;

* CALCULATES ACTIVE CHLOROPHYLL-A;
CHLA=(26.7*((OD664B-OD750B)-(OD665A-OD750A))*EXVOL_ML)/(SAMVOL_L*LIPAT_CM);
IF CHLA AND CHLA LT 1 THEN CHLA=0;
CHLA=ROUND(CHLA,0.1);

* CALCULATES ACTIVE PHAEOPHYTIN-A;
PHEA=(26.7*((1.7*(OD665A-OD750A))-(OD664B-OD750B))*EXVOL_ML)/(SAMVOL_L*LIPAT_CM);
IF PHEA AND PHEA LT 0 THEN PHEA=0;
PHEA=ROUND(PHEA,0.1);

* SETS LAYER FOR SECCHI RECORDS TO SURFACE;
IF SDEPTH=0 THEN LAYER='S';

* SORT PROCEDURE ESTABLISHES THE PRIMARY KEY FIELDS IN THE RELATIONAL DATABASE DESIGN - VARIABLE ADDED TO KEY WHEN CODES ARE MERGED;
PROC SORT NODUP;
BY STATION DATE TIME SDEPTH LAYER REP_NUM;

* TRANSPOSES THE DATA BY THE PRIMARY KEY FIELDS;
PROC TRANSPOSE OUT=WQ NAME=VAR PREFIX=VAL;
BY STATION DATE TIME SDEPTH LAYER REP_NUM;

* EXTRACTS DESCRIPTOR CODE VARIABLES FOR TRANSPOSING AND THEN MERGING WITH THE TRANSPOSED DATA FILE;
DATA QUAL (KEEP=STATION DATE TIME SDEPTH LAYER REP_NUM DISOX2_D NH4_D NO2_D NO23_D PHOSP_D PO4F_D POC_D PON_D SI_D TDN_D TDP_D TSS_D);
SET IN.O266268;

PROC SORT NODUP;
BY STATION DATE TIME SDEPTH LAYER REP_NUM;

* TRANSPOSES THE DATA BY THE PRIMARY KEY FIELDS;
PROC TRANSPOSE OUT=QUAL NAME=VAR PREFIX=QUAL;
VAR DISOX2_D NH4_D NO2_D NO23_D PHOSP_D PO4F_D POC_D PON_D SI_D TDN_D TDP_D TSS_D;
BY STATION DATE TIME SDEPTH LAYER REP_NUM;


DATA QUAL(KEEP=STATION DATE TIME SDEPTH LAYER REP_NUM VAR QUAL1);
SET QUAL;

IF VAR='DISOX2_D' THEN VAR='DISOX2';
IF VAR='NH4_D' THEN VAR='NH4';
IF VAR='NO2_D' THEN VAR='NO2';
IF VAR='NO23_D' THEN VAR='NO23';
IF VAR='PHOSP_D' THEN VAR='PHOSP';
IF VAR='PO4F_D' THEN VAR='PO4F';
IF VAR='POC_D' THEN VAR='POC';
IF VAR='PON_D' THEN VAR='PON';
IF VAR='SI_D' THEN VAR='SI';
IF VAR='TDN_D' THEN VAR='TDN';
IF VAR='TDP_D' THEN VAR='TDP';
IF VAR='TSS_D' THEN VAR='TSS';
IF QUAL1='' THEN DELETE;

* EXTRACTS PROBLEM CODE VARIABLES FOR TRANSPOSING AND THEN MERGING WITH TRANSPOSED DATA FILE;
DATA PROBLEM (KEEP=STATION DATE TIME SDEPTH LAYER REP_NUM CHLA_A COND_A DISOX2_A DISOXY_A KD_A NH4_A NO2_A NO23_A PHEA_A PHOSP_A PH_A PO4F_A POC_A PON_A SALIN_A SAVAL_A SECCHI_A SI_A TDN_A TDP_A TSS_A UDVAL_A UWVAL_A WTEMP_A);
SET IN.O266268;

PROC SORT NODUP;
BY STATION DATE TIME SDEPTH LAYER REP_NUM;

* TRANSPOSES THE DATA BY THE PRIMARY KEY FIELDS;
PROC TRANSPOSE OUT=PROBLEM NAME=VAR PREFIX=PROB;
VAR CHLA_A COND_A DISOX2_A DISOXY_A KD_A NH4_A NO2_A NO23_A PHEA_A PHOSP_A PH_A PO4F_A POC_A PON_A SALIN_A SAVAL_A SECCHI_A SI_A TDN_A TDP_A TSS_A UDVAL_A UWVAL_A WTEMP_A;
BY STATION DATE TIME SDEPTH LAYER REP_NUM;

DATA PROBLEM(KEEP=STATION DATE TIME SDEPTH LAYER REP_NUM VAR PROB1);
SET PROBLEM;

IF VAR='CHLA_A' THEN VAR='CHLA';
IF VAR='COND_A' THEN VAR='COND';
IF VAR='DISOX2_A' THEN VAR='DISOX2';
IF VAR='DISOXY_A' THEN VAR='DISOXY';
IF VAR='KD_A' THEN VAR='KD';
IF VAR='NH4_A' THEN VAR='NH4';
IF VAR='NO2_A' THEN VAR='NO2';
IF VAR='NO23_A' THEN VAR='NO23';
IF VAR='PHEA_A' THEN VAR='PHEA';
IF VAR='PHOSP_A' THEN VAR='PHOSP';
IF VAR='PH_A' THEN VAR='PH';
IF VAR='PO4F_A' THEN VAR='PO4F';
IF VAR='POC_A' THEN VAR='POC';
IF VAR='PON_A' THEN VAR='PON';
IF VAR='SALIN_A' THEN VAR='SALIN';
IF VAR='SAVAL_A' THEN VAR='SAVALUE';
IF VAR='SECCHI_A' THEN VAR='SECCHI';
IF VAR='SI_A' THEN VAR='SI';
IF VAR='TDN_A' THEN VAR='TDN';
IF VAR='TDP_A' THEN VAR='TDP';
IF VAR='TSS_A' THEN VAR='TSS';
IF VAR='UDVAL_A' THEN VAR='UDVALUE';
IF VAR='UWVAL_A' THEN VAR='UWVALUE';
IF VAR='WTEMP_A' THEN VAR='WTEMP';
IF PROB1='' THEN DELETE;

* SORTS THE DATA SETS BY THE FOLLOWING KEY FIELDS: STATION DATE TIME SDEPTH LAYER REP_NUM VAR AND MERGES THEM TOGETHER BY THE COMBINED KEY;
PROC SORT DATA=WQ; BY STATION DATE TIME SDEPTH LAYER REP_NUM VAR;
PROC SORT DATA=PROBLEM; BY STATION DATE TIME SDEPTH LAYER REP_NUM VAR;
* PROC SORT DATA=METHOD; BY STATION DATE TIME SDEPTH LAYER REP_NUM VAR;
PROC SORT DATA=QUAL; BY STATION DATE TIME SDEPTH LAYER REP_NUM VAR;

DATA WQ_ODU(DROP=_LABEL_ HR MIN);
MERGE WQ QUAL PROBLEM;
BY STATION DATE TIME SDEPTH LAYER REP_NUM VAR;

* REFORMATS SUBMITTED TIME (NUMERIC DATA TYPE);
HR=INT(TIME/100);
MIN=TIME-HR*100;
TIME=HMS(HR,MIN,0);

* DELETES UNNECESSARY OBSERVATIONS CONTAINING NULL VALUES IF THERE IS NO ASSOCIATED PROBLEM CODE;
IF VAL1=. AND PROB1='' THEN DELETE;

PROC DATASETS NOLIST; DELETE WQ QUAL PROBLEM;

DATA _NULL_;
LENGTH SOURCE $10.;
LENGTH PROJECT $10.;
LENGTH UNITS $15.;
LENGTH METH $4.;
SET WQ_ODU;

* ADDS CURRENT METHOD CODES;
IF VAR='CHLA' THEN METH='101B';
ELSE IF VAR='COND' THEN METH='F01O';
ELSE IF VAR='DISOX2' THEN METH='U';
ELSE IF VAR='DISOXY' THEN METH='F01N';
ELSE IF VAR='KD' THEN METH='C01';
ELSE IF VAR='NH4' THEN METH='102A';
ELSE IF VAR='NO2' THEN METH='101';
ELSE IF VAR='NO23' THEN METH='102A';
ELSE IF VAR='OD480B' THEN METH='101B';
ELSE IF VAR='OD510B' THEN METH='101B';
ELSE IF VAR='OD630B' THEN METH='101B';
ELSE IF VAR='OD647B' THEN METH='101B';
ELSE IF VAR='OD664B' THEN METH='101B';
ELSE IF VAR='OD665A' THEN METH='101B';
ELSE IF VAR='OD750A' THEN METH='101B';
ELSE IF VAR='OD750B' THEN METH='101B';
ELSE IF VAR='PH' THEN METH='F01I';
ELSE IF VAR='PHEA' THEN METH='101A';
ELSE IF VAR='PHOSP' THEN METH='102';
ELSE IF VAR='PO4F' THEN METH='102';
ELSE IF VAR='POC' THEN METH='103A';
ELSE IF VAR='PON' THEN METH='103A';
ELSE IF VAR='SALIN' THEN METH='F01F';
ELSE IF VAR='SAVALUE' THEN METH='U';
ELSE IF VAR='SECCHI' THEN METH='F01C';
ELSE IF VAR='SI' THEN METH='101B';
ELSE IF VAR='TDN' THEN METH='101';
ELSE IF VAR='TDP' THEN METH='102A';
ELSE IF VAR='TSS' THEN METH='101B';
ELSE IF VAR='UDVALUE' THEN METH='U';
ELSE IF VAR='UWVALUE' THEN METH='U';
ELSE IF VAR='WTEMP' THEN METH='F01Q';

* ADDS SOURCE AND PROJECT CODES;
SOURCE='ODU';
PROJECT='MAIN';

* ADDS UNITS FOR INDIVIDUAL PARAMETERS;
IF VAR IN ('OD480B','OD510B','OD630B','OD647B','OD664B','OD665A','OD750A','OD750B') THEN UNITS='OD';
ELSE IF VAR='EXVOL_ML' THEN UNITS='ML';
ELSE IF VAR='LIPAT_CM' THEN UNITS='CM';
ELSE IF VAR='SAMVOL_L' THEN UNITS='L';
ELSE IF VAR IN ('CHLA','PHEA') THEN UNITS='UG/L';
ELSE IF VAR IN ('SAVALUE','UDVALUE','UWVALUE') THEN UNITS='UEINSTEINS/S/M';
ELSE IF VAR='KD' THEN UNITS='COEFFICIENT';
ELSE IF VAR IN ('WTEMP') THEN UNITS='DEG C';
ELSE IF VAR='SALIN' THEN UNITS='PPT';
ELSE IF VAR='PH' THEN UNITS='SU';
ELSE IF VAR='SECCHI' THEN UNITS='M';
ELSE IF VAR='COND' THEN UNITS='UMHOS/CM';

* ADDS UNITS FOR ALL OTHER PARAMETERS (NUTRIENTS, CONVENTIONAL POLLUTANTS, ETC.);
IF UNITS='' THEN UNITS='MG/L';

* ADDS SAMPLE TYPE (ISM=IN-SITU MEASUREMENT, D=DISCRETE SAMPLE) AND SAMPLE ID (NA=NOT APPLICABLE, S1=SAMPLE #1, S2=SAMPLE #2);
IF VAR IN ('COND','DISOXY','PH','SALIN','SECCHI','SALIN','WTEMP') THEN SAMPTYPE='ISM';
ELSE SAMPTYPE='D';
IF VAR IN ('COND','DISOXY','PH','SALIN','SECCHI','SALIN','WTEMP') THEN SAMPID='NA';
ELSE SAMPID='S1';

* OUTPUTS FIXED-LENGTH ASCII FILE WITH PARAMETER-SPECIFIC LEVELS OF PRECISION;
FILE O266268D;
PUT STATION @10 DATE MMDDYY10. @20 TIME TIME5. @25 SDEPTH 5.1 @30 LAYER @35 SAMPTYPE @40 SAMPID @45 VAR @;
IF VAR IN ('COND','LIPAT_CM')
THEN PUT @55 VAL1 8.0-L @;
ELSE IF VAR IN ('DISOX2','DISOXY','EXVOL_ML','PH','SAMVOL_L','SECCHI','TSS','WTEMP')
THEN PUT @55 VAL1 8.1-L @;
ELSE IF VAR IN ('SALIN','CHLA','PHEA')
THEN PUT @55 VAL1 8.2-L @;
ELSE IF VAR IN ('OD480B','OD510B','OD630B','OD647B','OD664B','OD665A','OD750A', 'OD750B','POC','SAVALUE','SI','UDVALUE','UWVALUE')
THEN PUT @55 VAL1 8.3-L @;
ELSE IF VAR IN ('KD','NH4','NO2','NO23','PHOSP','PO4F','PON','TDN','TDP')
THEN PUT @55 VAL1 8.4-L @;
PUT @65 QUAL1 @70 PROB1 @75 SOURCE @85 PROJECT @95 UNITS @110 METH;