Monday, November 3, 2008

SAS Clinical Interview QUESTIONS and ANSWERS

What therapeutic areas have you worked in?
A pharmaceutical company can work in many different therapeutic areas, including anti-viral (HIV), Alzheimer’s, respiratory, oncology, metabolic disorders (anti-diabetic), neurological and cardiovascular. A few more include:

Central nervous system
Orthopedics and pain control
Gene therapy
Immunology etc

What are your responsibilities?
Some of them include (not necessarily all of them):

· Extracting data from various internal and external databases (Oracle, DB2, Excel spreadsheets) using SAS/ACCESS and the INPUT statement.
· Developing programs in Base SAS to convert the Oracle data for a Phase II study into SAS datasets using the SQL Pass-Through facility and the LIBNAME facility.
· Creating and deriving the datasets, listings and summary tables for Phase-I and Phase-II of clinical trials.
· Developing the SAS programs for listings & tables for data review & presentation including ad-hoc reports, CRTs as per CDISC, patients listing mapping of safety database and safety tables.
· Involved in mapping, pooling and analysis of clinical study data for safety.
· Using the Base SAS (MEANS, FREQ, SUMMARY, TABULATE, REPORT etc) and SAS/STAT procedures (REG, GLM, ANOVA, and UNIVARIATE etc.) for summarization, Cross-Tabulations and statistical analysis purposes.
· Developing the Macros at various instances for automating listings and graphing of clinical data for analysis.
· Validating and QC of the efficacy and safety tables.
· Creating ad hoc reports with SAS procedures, using ODS statements and PROC TEMPLATE to generate output formats such as HTML, PDF and Excel that can be viewed in a web browser.
· Performing data extraction from various repositories and pre-process data when applicable.
· Creating the Statistical reports using Proc Report, Data _null_ and SAS Macro.
· Analyzing the data according to the Statistical Analysis Plan (SAP).
· Generating the demographic tables, adverse events and serious adverse events reports.

Can you tell me something about your last project study design?
If the interviewer asks this question, say which phase your current project is in (Phase 1, 2 or 3), and give the name of the drug and the therapeutic area. Here are some more details you should be ready to give:
a) Is it a single-blinded or double-blinded study?
b) Is it a randomized or non-randomized study?
c) How many patients are enrolled?
d) Safety parameters only (if it is a Phase 1 study).
e) Safety and efficacy parameters if the study is Phase 2, 3 or 4.
To get the all these details... visit .

How many subjects were there?
Subjects are the patients (or volunteers) involved in the clinical study.
The answer to this question depends on the type of study you were involved in.
If the study is Phase 1, the answer should be approximately 30-100.
If the study is Phase 2, the answer should be approximately 100-1000.
If the study is Phase 3, the answer should be approximately 1000-5000.
Note: These are just typical ranges, not exact numbers.

How many analyzed data sets did you create?

Again, it depends on the study and on the safety and efficacy parameters that need to be determined from it. Approximately 20-30 datasets are required to analyze a study for its safety and efficacy parameters. Here are some examples of the datasets.

DM (Demographics),
MH (Medical History),
AE (Adverse Events),
PE (Physical Examination),
VS (Vital Signs),
CM (Concomitant Medication),
LB (Laboratory),
QS (Questionnaire),
IE (Inclusion and Exclusion),
DS (Disposition),
DT (Death),
SV (Subject Visits),
SC (Subject Characteristics),
CO (Comments),
EX (Exposure),
PC (Pharmacokinetic Concentrations),
PP (Pharmacokinetic Parameters),
TI (Trial Inclusion/Exclusion Criteria),
and other Supplementary datasets like

How did you create analyzed data sets?
Analysis datasets are the datasets used for the statistical analysis of the data. They contain the raw data plus variables derived from the raw data. The derived variables are used to produce the TLGs (tables, listings and graphs) of the clinical study. The safety and efficacy endpoints (parameters) dictate which datasets the study requires for generating the statistical reports. Sometimes the analysis datasets also carry variables that are not needed for the planned statistical reports but may be needed for ad hoc reports.
Refer also to get the complete info about creation of datasets:

How many tables, listings and graphs?
Typically somewhere between 30 and 100 in total (tables, listings and graphs combined).

What do you mean by treatment-emergent adverse events and treatment-emergent serious adverse events?
Treatment-emergent adverse events (TEAEs) and treatment-emergent serious adverse events are adverse events and serious adverse events that either started after drug administration or, if the patient already had them before drug administration, worsened after it.
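As a minimal sketch of how a treatment-emergent flag might be derived, assume a hypothetical subject-level dataset adsl (with first-dose date trtsdt) and an adverse event dataset ae (with numeric start date astdt and seriousness flag aeser). All names are illustrative, and the handling of pre-existing events that worsen would need extra logic defined in the SAP.

```sas
/* Hypothetical inputs: work.adsl (usubjid, trtsdt) and
   work.ae (usubjid, aeterm, astdt, aeser).              */
proc sort data=adsl; by usubjid; run;
proc sort data=ae;   by usubjid; run;

data adae;
  merge ae (in=in_ae) adsl (keep=usubjid trtsdt);
  by usubjid;
  if in_ae;                        /* keep AE records only */
  length trtemfl tesaefl $1;

  /* Treatment emergent: event starts on or after first dose */
  if not missing(astdt) and not missing(trtsdt) and astdt >= trtsdt
    then trtemfl = 'Y';
  else trtemfl = 'N';

  /* Treatment-emergent serious adverse event */
  if trtemfl = 'Y' and aeser = 'Y' then tesaefl = 'Y';
  else tesaefl = 'N';
run;
```

Events with missing or partial start dates fall into the 'N' branch here; in practice the SAP usually specifies an imputation rule for them.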

Can you explain little bit about the datasets?

DEMOGRAPHIC analysis dataset contains all subjects’ demographic data (i.e., Age, Race, and Gender), disposition data (i.e., Date patient withdrew from the study), treatment groups and key dates such as date of first dose, date of last collected Case Report Form (CRF) and duration on treatment. The dataset has the format of one observation per subject.

LABORATORY analysis dataset contains all subjects’ laboratory data, in the format of one observation per subject per test code per visit per accession number. Here, we derive the study visits according to the study window defined in the SAP, as well as re-grade the laboratory toxicity per protocol. For a crossover study, the visit is derived both relative to the initial period and relative to the beginning of the new study period. If the laboratory data are collected from multiple local lab centers, this analysis dataset will also centralize the laboratory data and standardize measurement units by using conversion factors.

EFFICACY analysis dataset contains derived primary and secondary endpoint variables as defined in the SAP. In addition, this dataset can contain other efficacy parameters of interest, such as censor variables pertaining to the time to an efficacy event. This dataset has the format of one record per subject per analysis period.

SAFETY can be categorized into four analysis datasets:

VITAL SIGN analysis dataset captures all subjects’ vital signs collected during the trial. This dataset has the format of one observation per subject per vital sign per visit, similar to the structure for the laboratory analysis dataset.

ADVERSE EVENT analysis dataset contains all adverse events (AEs) reported including serious adverse events (SAEs) for all subjects. A treatment emergent flag, as well as a flag to indicate if an event is reported within 30 days after the subject permanently discontinued from the study, will be calculated. This dataset has a format of one record per subject per adverse event per start date. Partial dates and missing AEs start and/or stop dates will be imputed using logic defined in the SAP.

MEDICATION analysis dataset contains the subjects’ medication records including concomitant medications and other medications taken either prior to the beginning of study or during the study. This dataset has a format of one record per subject per medication taken per start date. Incomplete and missing medication start or stop dates will be imputed using instructions defined in the SAP.

SAFETY analysis dataset contains other safety variables, whether they are defined in the SAP or not. The Safety analysis dataset, similar to Efficacy analysis dataset in structure, consists of data with one record per subject per analysis period to capture safety parameters for all subjects.

It is crucial to generate analysis datasets in a specific order, as some variables derived from one particular analysis dataset may be used as the inputs to generate other variables in other analysis datasets. For example, the time to event variables in the efficacy and safety analysis datasets are calculated based on the date of the first dose derived in the demographic analysis dataset.

Analysis datasets are generated in sequence:

Demographic -> Laboratory -> Efficacy
Safety datasets: Vital Sign, Adverse Event, Concomitant Medications

What is your involvement while using CDISC standards? What is meant by CDISC, and where do you use it?
CDISC (Clinical Data Interchange Standards Consortium) is an organization that develops industry standards for pharmaceutical companies to use when submitting clinical data to the FDA.
Using CDISC standards has many advantages: reduced time for regulatory submissions, more efficient regulatory review of submissions, and savings in time and money on data transfers between business partners.

CDISC standards are used in the following activities:

Developing CRTs for submission to the FDA as part of an NDA.
Mapping, pooling and analysis of clinical study data for safety.
Creating the annotated case report form (aCRF) using CDISC SDTM mapping.
Creating analysis datasets in CDISC and non-CDISC standards for further SAS programming.

What do you mean when you say you created tables, listings and graphs for ISS and ISE?

How do you do data cleaning?
It is always important to check the data we are using, especially the variables that feed the analysis. Data cleaning is critical for any data we use and prepare.
I use PROC FREQ, PROC SQL, PROC MEANS, PROC UNIVARIATE, etc. to find problems in the data.

I use PROC PRINT with a WHERE statement to list the invalid date values.
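A rough sketch of such checks, assuming a hypothetical raw demographics dataset rawdm with variables usubjid, sex, age and a character date rfstdtc in YYYY-MM-DD form (none of these names come from a real study):

```sas
/* Unexpected or missing categories */
proc freq data=rawdm;
  tables sex / missing;
run;

/* Out-of-range or missing numeric values */
proc means data=rawdm n nmiss min max;
  var age;
run;

/* Convert the text date; the ?? modifier suppresses log messages
   for invalid values and leaves the result missing              */
data chkdates;
  set rawdm;
  rfstdt = input(rfstdtc, ?? yymmdd10.);
  format rfstdt date9.;
run;

/* List records where a date was entered but could not be read */
proc print data=chkdates;
  where rfstdt = . and rfstdtc ne ' ';
  var usubjid rfstdtc;
run;
```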

Can you tell me about CRTs?

Creating Case Report Tabulations (CRTs) for an NDA Electronic Submission to the FDA

ABSTRACT: The Food and Drug Administration (FDA) now strongly encourages all new drug applications (NDAs) be submitted electronically. Electronic submissions could help FDA application reviewers scan documents efficiently and check analyses by manipulating the very datasets and code used to generate them. The potential saving in reviewer time and cost is enormous while improving the quality of oversight. In January 1999, the FDA released the Guidance for Industry: Providing Regulatory Submissions in Electronic Format – NDAs. As described, one important part of the application package is the case report tabulations (CRTs), now serving as the instrument for submitting datasets. CRTs are made up of two parts: first, datasets in SAS® transport file format and second, the accompanying documentation for the datasets. Herein, we briefly review the content and conversion of datasets to SAS transport file format, and then elaborate on the code that makes easy work of the accompanying dataset documentation (in the form of data definition tables) using the SAS Output Delivery System (ODS). The intended audience is SAS programmers with an intermediate knowledge of the BASE product used under any operating system and who are involved in the biotechnology industries.
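The conversion to SAS transport format mentioned in the abstract can be sketched with the XPORT LIBNAME engine and PROC COPY. The paths, library names and the dataset name dm below are placeholders, not taken from the paper.

```sas
/* Source library holding the SAS datasets (placeholder path) */
libname sdtm 'C:\study\sdtm';

/* XPORT engine: the libref points at a single transport file */
libname xptout xport 'C:\study\crt\dm.xpt';

/* Copy one dataset into the transport file */
proc copy in=sdtm out=xptout memtype=data;
  select dm;
run;

libname xptout clear;
```

The XPORT engine writes the version 5 transport format, so dataset and variable names must not exceed 8 characters.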

Where do you use MedDRA and WHODrug? Can you write code? How do you use them?
What is MedDRA?
The Medical Dictionary for Regulatory Activities (MedDRA) has been developed as a pragmatic, clinically validated medical terminology with an emphasis on ease-of-use data entry, retrieval, analysis, and display, with a suitable balance between sensitivity and specificity, within the regulatory environment. MedDRA is applicable to all phases of drug development and the health effects of devices. By providing one source of medical terminology, MedDRA improves the effectiveness and transparency of medical product regulation worldwide.
MedDRA is used to report adverse event data from clinical trials, as well as post-marketing and pharmacovigilance.

What are the structural elements of the terminology in MedDRA?
The structural elements of the MedDRA terminology are as follows:

SOC (System Organ Class) - Highest level of the terminology, and distinguished by anatomical or physiological system, etiology, or purpose
HLGT( High Level Group Term) – Subordinate to SOC, superordinate descriptor for one or more HLTs
HLT (High Level Term) – Subordinate to HLGT, superordinate descriptor for one or more PTs
PT (Preferred Term) – Represents a single medical concept
LLT (Lower Level Term) – Lowest level of the terminology, related to a single PT as a synonym, lexical variant, or quasi-synonym (Note: All PTs have an identical LLT).
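In reporting, the hierarchy is most often used to summarize adverse events by SOC and PT. A minimal sketch, assuming a hypothetical coded dataset adae in which aebodsys holds the SOC, aedecod holds the PT and trt holds the treatment group:

```sas
/* Number of subjects with at least one event, by treatment, SOC and PT */
proc sql;
  create table ae_summary as
  select trt,
         aebodsys,
         aedecod,
         count(distinct usubjid) as n_subj
  from adae
  group by trt, aebodsys, aedecod
  order by trt, aebodsys, n_subj desc;
quit;
```

Counting distinct subjects rather than records means a subject with the same PT reported several times is counted once.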

In what format is MedDRA distributed?
MedDRA is distributed in sets of flat ASCII delimited files. There is a different set of files for each available language. The Czech translation is distributed in UTF-8 format. For detailed information on file names, data record scheme and record layout, see the MedDRA ASCII and Consecutive Files Documentation document, which can be downloaded from the MedDRA MSSO Web site. MedDRA is delivered in text file format. As of MedDRA Version 11.1, the total size of all ASCII files for the English version is 12,459KB.

The WHODrug dictionary was started in 1968. The dictionary contains information on both single and multiple ingredient medications. Drugs are classified according to the type of drug name being entered, (i.e. proprietary/trade name, nonproprietary name, chemical name, etc.). At present, 52 countries submit medication data to the WHO Collaborating Center, which is responsible for the maintenance and distribution of the drug dictionary. Updates to the dictionary are offered four times per year.

What do you mean when you say you used the macro facility to produce weekly and monthly reports?
The SAS macro facility can do a lot of things; in particular, it is used to:
• reduce code repetition
• increase control over program execution
• minimize manual intervention
• create modular code.
to get more info about macro facility.
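A small sketch of how one macro could serve both weekly and monthly reports. The macro name ae_report, the dataset adae and its variables are hypothetical.

```sas
%macro ae_report(start=, end=, label=);
  title "Adverse events reported &label (&start to &end)";
  proc print data=adae noobs;
    /* Date literals are built from the macro parameters */
    where "&start"d <= astdt <= "&end"d;
    var usubjid aedecod astdt;
  run;
  title;
%mend ae_report;

/* Same code, different reporting periods */
%ae_report(start=01NOV2008, end=07NOV2008, label=this week);
%ae_report(start=01NOV2008, end=30NOV2008, label=this month);
```

Only the parameters change between runs, which is what reduces repetition and manual intervention.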

How did you validate tables and listings, and what other things did you validate?
First, the output from the listing needs to be read into a SAS data set. Next, the validation results need to be calculated independently (you need to do this anyway) and then turned into a SAS data set with the same layout and properties as the one created from the original output. Last, the original and validation data sets are compared using PROC COMPARE. The results are concise, quick, accurate and 100% complete. The same procedure is used to validate the tables.
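The comparison step might look like the sketch below. The librefs prod and qc, the dataset t_demog and the key variables trt and param are assumptions for illustration, and both libraries are assumed to be assigned already.

```sas
/* Sort both versions by the same key variables */
proc sort data=prod.t_demog out=prod_tab; by trt param; run;
proc sort data=qc.t_demog   out=qc_tab;   by trt param; run;

/* Compare production output against the independent QC version */
proc compare base=prod_tab compare=qc_tab listall criterion=1e-8;
  id trt param;
run;

/* SYSINFO holds the PROC COMPARE return code; 0 means no differences.
   It must be read immediately after the PROC COMPARE step.           */
%put NOTE: PROC COMPARE return code = &sysinfo;
```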

We will also validate graphs made in SAS… but to do that we need to use SAS/GRAPH Network Visualization Workshop and using it we can validate graphs made with SAS automatically as well as manually.

Have you ever seen a patient randomized to one drug but given another? If so, which population would you put that patient into?
Before doing anything, I would check whether it is a data entry error or whether the patient was actually given the other drug. If the patient really received the other drug, I would put the patient in the group of the drug actually given for safety analyses (as treated); for intent-to-treat efficacy analyses, patients are normally analyzed in the group to which they were randomized. The SAP should define this.

What would you do if you had to pool the data from one parallel study and one crossover study?

Say you have the same subject in two groups taking two different drugs, and you had to pool these two groups. How would you do it?

This situation arises when the study has a crossover design. I would treat the same patient as two different patients, one in each treatment group.

Which phases are you good at?
Phase I, II and III.

How would you transpose a dataset using a data step?

The usual tool is PROC TRANSPOSE:

proc transpose data=old out=new prefix=DATE;
var date;
by name;
run;

The prefix= option controls the names for the transposed variables (DATE1, DATE2, etc.) Without it, the names of the new variables would be COL1, COL2, etc.

Actually, proc transpose creates an extra variable, _NAME_, indicating the name of the transposed variable. _NAME_ has a value of DATE on both observations. To eliminate the extra variable, modify a portion of the proc statement:

out=new (drop=_name_);

The equivalent data step code using arrays could be:

data new (keep=name date1-date3);
set old;
by name;
array dates {3} date1-date3;
retain date1-date3;
if first.name then i=1;
else i + 1;
dates{i} = date;
if last.name then output;
run;

This program assumes that old is sorted by name and that each name has exactly three observations. If a name had more, the program would generate an error message when hitting the fourth observation for that name. When i=4, this statement encounters an array subscript out of range:

dates{i} = date;

If a patient misses one lab, how would you assign values for the missing values? Can you write the code?
Same answer as the question below.

How do you deal with missing values?
Whenever SAS encounters an invalid or blank value in the file being read, the value is defined as missing. In all subsequent processes and output, the value is represented as a period (if the variable is numeric-valued) or is left blank (if the variable is character-valued).

In DATA step programming, use a period to refer to missing numeric values.
For example, to recode missing values in the variable A to the value 99, use the following statement:

if a = . then a = 99;

Use the MISSING statement to define certain characters to represent special missing values for all numeric variables. The special missing values can be any of the 26 letters of the alphabet, or an underscore. For example, with the statement MISSING a b; in the DATA step, the values 'a' and 'b' in the input data will be interpreted as special missing values for every numeric variable.
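A short sketch of the MISSING statement in use. The dataset labs and the meanings attached to the letters (A = not done, B = below the detection limit) are made up for illustration.

```sas
data labs;
  missing a b;                 /* letters A and B are read as .A and .B */
  input subjid visit result;
  length reason $12;
  if result = .a then reason = 'Not done';
  else if result = .b then reason = 'Below limit';
  else if result = . then reason = 'Not recorded';
  datalines;
101 1 5.2
101 2 a
102 1 b
102 2 .
;
run;

proc print data=labs noobs;
run;
```

Special missing values are still missing for arithmetic and statistics, but they let a program distinguish why a value is absent.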


Did you ever create efficacy tables?
Yes, I have created efficacy tables. Efficacy tables are developed to present the information about the primary objectives/parameters of the study.

What were the primary and secondary endpoints in your last project?

The primary and secondary endpoints of a clinical trial are specified in the protocol and the SAP. You can download the sample protocol as well as trial SAP from my blog ( ) or else go to , and then type the name of a pharmaceutical company. It will give you the list of clinical trials conducted by that company; if you click on any one study, you will be able to see the primary and secondary objectives and all other details.

What are the stat procedures you used?

Tell me something about PROC MIXED. (Sometimes they may ask you to write the syntax.)
PROC MIXED is a generalization of the GLM procedure in the sense that PROC GLM fits standard linear models, and PROC MIXED fits the wider class of mixed linear models. Both procedures have similar CLASS, MODEL, CONTRAST, ESTIMATE, and LSMEANS statements, but their RANDOM and REPEATED statements differ. Both procedures use the nonfull-rank model parameterization, although the sorting of classification levels can differ between the two. PROC MIXED computes only Type I to Type III tests of fixed effects, while PROC GLM offers Types I to IV. The RANDOM statement in PROC MIXED incorporates random effects constituting the γ vector in the mixed model. However, in PROC GLM, effects specified in the RANDOM statement are still treated as fixed as far as the model fit is concerned, and they serve only to produce corresponding expected mean squares.
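If asked for the syntax, something like the following repeated-measures sketch could be written. The dataset adeff and the variables chg (change from baseline), base, trt, visit and usubjid are hypothetical, and the covariance structure is only one possible choice.

```sas
proc mixed data=adeff;
  class usubjid trt visit;
  /* Fixed effects: treatment, visit, their interaction, baseline covariate */
  model chg = trt visit trt*visit base / solution ddfm=kr;
  /* Within-subject correlation across visits, unstructured covariance */
  repeated visit / subject=usubjid type=un;
  /* Adjusted means and treatment differences at each visit */
  lsmeans trt*visit / diff cl;
run;
```

A RANDOM statement (for example, random intercept / subject=usubjid;) would be used instead of or alongside REPEATED when subject-level random effects are part of the model.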

What would you do, if you have to use data step functions in macro definition? Can you use all the functions in data step in macro definition?
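One way to call DATA step functions from macro code is the %SYSFUNC macro function. A brief sketch with arbitrary macro variable names:

```sas
/* Second argument of %SYSFUNC is an optional format */
%let today  = %sysfunc(today(), date9.);
%let upname = %sysfunc(upcase(demog));
%let exists = %sysfunc(exist(sashelp.class));   /* 1 if the dataset exists */

%put today=&today upname=&upname exists=&exists;
```

Most, but not all, DATA step functions can be used this way. PUT and INPUT, for instance, are not available through %SYSFUNC; PUTN, PUTC, INPUTN and INPUTC are used instead.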

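If I have a dataset with different subjids and each subjid has many records, how can I obtain the last-but-one record for each patient?

proc sort data=old;
by subjid;
run;

/* First loop counts the records for a subject; */
/* second loop re-reads them and keeps record n-1 */
data new (drop=n i);
do n = 1 by 1 until (last.subjid);
set old;
by subjid;
end;
do i = 1 by 1 until (last.subjid);
set old;
by subjid;
if i = n - 1 then output;
end;
run;

Subjects with only one record are not output. For comparison, the first record per subject can be obtained with:

proc sort data=old out=first_rec nodupkey;
by subjid;
run;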

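Can you get the value of a data step variable to use in another program later in the same SAS session? How do you do that?
Yes. Inside the DATA step, use CALL SYMPUT (or CALL SYMPUTX) to store the value in a macro variable. It can then be referenced as &name in any later step of the same session, or displayed with a %PUT statement.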
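A minimal sketch of passing a DATA step value to later steps through a macro variable, using the SASHELP.CLASS sample dataset and an arbitrary macro variable name nsubj:

```sas
/* Count the records and store the count in a macro variable */
data _null_;
  set sashelp.class end=last;
  n + 1;
  if last then call symputx('nsubj', n);
run;

/* The value is now available to any later step in the session */
%put Number of subjects: &nsubj;
title "Listing of all &nsubj subjects";
proc print data=sashelp.class noobs;
run;
title;
```

CALL SYMPUTX also strips leading and trailing blanks from the value, which CALL SYMPUT does not.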

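What would you do if you had to access the previous record's values in the current record?
Use the LAG function, or a RETAIN statement to carry a value forward from one record to the next. (The ampersand, as in &var, references a macro variable, not a previous record.)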
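One common way to reach the previous record's value is the LAG function. A sketch, assuming a hypothetical vital signs dataset vs with variables usubjid, visitnum and aval:

```sas
proc sort data=vs;
  by usubjid visitnum;
run;

data vs2;
  set vs;
  by usubjid visitnum;
  /* Call LAG on every record; calling it conditionally gives wrong results */
  prev_val = lag(aval);
  /* Do not carry a value over from the previous subject */
  if first.usubjid then prev_val = .;
  if not missing(prev_val) then change = aval - prev_val;
run;
```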

What is a p-value? Why should you calculate it? Which procedures can you use for that?
A p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. It is calculated to judge whether an observed effect could plausibly be due to chance. In a regression setting, for example, if the overall p-value were greater than 0.05, you would say that the group of independent variables does not show a statistically significant relationship with the dependent variable, or that the group of independent variables does not reliably predict the dependent variable. Note that this is an overall significance test assessing whether the group of independent variables, when used together, reliably predicts the dependent variable; it does not address the ability of any particular independent variable to predict the dependent variable. PROC FREQ, PROC ANOVA, PROC GLM and PROC TTEST can all be used to calculate p-values.

What do you usually do with PROC LIFETEST?
PROC LIFETEST is used to obtain Kaplan-Meier and life-table survival estimates (and plots). A STRATA statement in PROC LIFETEST compares the survival estimates across different groups.

Can you get survival estimates with any other procedures?
PROC LIFEREG and PROC PHREG can also be used to get the survival estimates along with PROC LIFETEST.

Can you write code to get the survival estimates?

proc lifetest data=data method=km outsurv=newdata;
time survival*status(0);  /* status=0 marks censored observations */
strata study;
run;

What is the difference between the STRATA and BY statements in PROC LIFETEST?
You can specify a BY statement with PROC LIFETEST to obtain separate analyses on observations in groups defined by the BY variables.

The BY statement is more efficient than the STRATA statement for defining strata in large data sets. However, if you use the BY statement to define strata, PROC LIFETEST does not pool over strata for testing the association of survival time with covariates nor does it test for homogeneity across the BY groups.

The STRATA statement indicates which variables determine strata levels for the computations.
The strata are formed according to the non-missing values of the designated strata variables. The MISSING option can be used to allow missing values as a valid stratum level.

Which procedure do you usually use to create reports?
Proc Report, proc Tabulate and Data _null_.

What do you do if you have to get the column names and a title on every page of a report created with data _null_?

Give your data _null_ titles the "proc print" and "proc report" feel

The more you can make your "data _null_" behave like "proc print" or "proc report" when it comes to titles, the better. If the "byline" option is set, then put out a dashed byline; if not, then don't. Does your "by" variable have a label? If so, your dashed byline should show the variable label to the left of the equals sign. If the variable has no label, it should show the variable name. If that is the way "proc report" or "proc print" does it, then do it that way with your "data _null_". Get it to work with #byval and #byvar entries if they exist.

You are going to find yourself in a situation where you really must do the report using data _null_, but other people are not comfortable with it because they feel it is "too different" from "proc report". The more you can give it the same feel, the less opposition you will meet, and the more easily you can dip into "data _null_" when you have to without people worrying.
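For the page-by-page part of the question, one standard technique is the HEADER= option of the FILE statement, which runs a labeled block of statements at the top of every page. A minimal sketch using the SASHELP.CLASS sample dataset; the title text and column positions are arbitrary:

```sas
title1 "Listing 1: Demographics";   /* titles print on every page with FILE PRINT */

data _null_;
  set sashelp.class;
  file print header=colhdr;         /* run the COLHDR block at each new page */
  put @1 name $10. @15 sex $1. @20 age 3.;
  return;

  colhdr:                           /* column headings repeated on every page */
    put @1 'Name' @15 'Sex' @20 'Age';
    put @1 25*'-';
  return;
run;

title1;
```

The first RETURN keeps the header statements from running for every observation; they run only when a new page starts.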

How do you use a macro that was created by someone else and is stored in a folder outside SAS?
Through a SAS autocall library, set up with the MAUTOSOURCE and SASAUTOS= system options.
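A sketch of pointing SAS at a shared macro folder. The folder path, the macro name demog_table and its parameter are hypothetical; the folder is assumed to contain a file demog_table.sas that defines a macro of the same name.

```sas
/* Search the shared folder first, then the SAS-supplied autocall library */
options mautosource sasautos=('C:\shared\macros' sasautos);

/* On first use SAS finds demog_table.sas in the folder,
   compiles the macro and then executes it               */
%demog_table(pop=SAFETY)
```

Keeping the SASAUTOS fileref in the list preserves access to the autocall macros shipped with SAS.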

Can you tell me something about macro libraries?
Macro libraries are libraries that store all the macros required for developing the TLGs of a clinical trial. They are necessary for controlling and managing the macros. Stored macros can be brought into a program with a %INCLUDE statement, or called automatically when the library is set up as an autocall library.

Can you show me how the efficacy table looks like?

Can you show me how the safety table looks like?

Did you use ODS?
Yes, I have used ODS (Output Delivery System), which is normally used to make the output of tables, listings and graphs presentable. ODS can create output in HTML, PDF and RTF formats, among others.
General syntax (destination is HTML, PDF, RTF, etc.):

ods destination <options>;
SAS statements ...
ods destination close;
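A concrete sketch of that pattern, sending one PROC PRINT of the SASHELP.CLASS sample dataset to three destinations at once. The output file names are arbitrary.

```sas
/* Open the destinations */
ods html file='class.html';
ods rtf  file='class.rtf';
ods pdf  file='class.pdf';

title "Sample listing";
proc print data=sashelp.class noobs;
run;
title;

/* Close each destination so the files are written completely */
ods html close;
ods rtf  close;
ods pdf  close;
```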

Your resume says you created HTML, RTF and PDF output. Why did you have to create all three? Can you tell me specifically why each format is used?
SAS output can be created in several formats, and each serves a different purpose.

To publish the output on the Internet, we create it in HTML format. We generally create SAS output in RTF because RTF can be opened in Word or other word processors. If we need to send printable reports by email, we create the output in PDF. PDF output is also needed when we send the documents required to file an NDA to the FDA.

What are the graphs you created?
Survival estimate graphs.

What are the procedures you used to create them?


Can you generate statistics using PROC SQL?
Yes, we can generate statistics such as N, mean, median, max, min, STD and sum using PROC SQL. Unlike PROC MEANS, however, PROC SQL does not produce any of them by default; each statistic has to be requested explicitly in the SELECT clause.
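A small sketch of summary statistics in PROC SQL, using the SASHELP.CLASS sample dataset. The median is left out here because its support as a summary function in PROC SQL depends on the SAS version.

```sas
proc sql;
  select sex,
         count(height) as n_height,              /* non-missing count */
         mean(height)  as mean_height format=6.2,
         std(height)   as sd_height   format=6.2,
         min(height)   as min_height,
         max(height)   as max_height,
         sum(height)   as sum_height
  from sashelp.class
  group by sex;
quit;
```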

When do you prefer PROC SQL? Can you give me a situation?
The SQL procedure supports almost all the functions available in the DATA step for creating and manipulating data.
When the same result can be obtained with either PROC SQL or a DATA step, PROC SQL often requires less code, for example when joining tables that are not pre-sorted or when summarizing and merging in a single step. Whether it also runs faster depends on the data and the query.

How do you delete a macro variable?
Use the %SYMDEL statement, which deletes macro variables from the global symbol table. Multiple variables may be deleted by listing their names in the same %SYMDEL statement.
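For global macro variables, the %SYMDEL statement can be used. A short sketch with arbitrary variable names:

```sas
%let study  = ABC123;
%let cutoff = 03NOV2008;

%put _user_;              /* both variables are listed */

%symdel study cutoff;     /* delete both in one statement */

%put _user_;              /* neither variable is listed now */
```

Note that the names are written without the ampersand; adding / nowarn after the names suppresses the warning when a variable does not exist.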

Why do you have to use proc import and proc export wizards? Give me the situation?

Safety Dataset Examples:
The following 16 datasets are examples of safety datasets:

  • Adverse Events,

  • (Prior and) Concomitant Medications,

  • Comments,

  • Demographics,

  • Disposition/End of Study,

  • Drug Accountability,

  • ECG,

  • Exposure,

  • Inclusion and Exclusion Criteria,

  • Lab,

  • Medical History,

  • Physical Examination,

  • Protocol Violations,

  • Subject Characteristics,

  • Substance Use, and

  • Vital Signs.


Anonymous said...

Very helpful for people trying to land a job in clinical trial field. Thank you so much.


RAM said...

Can Anybody say entire procedure to follow any study from the scratch as a programmer.

I mean PLease take any recent clinical study you gone thru and explain how to proceed as soon as u recieve DATA and PROTOCOL.
I just want to know how all diferents reports can be derived from rawdata.
AND also please explain layouts/formats of safety and efficacy reports..

PLease help me..

sarath said...

Ram..... read these white papers..

SAS programmer work environment and guiding principles

The Changing Nature of SAS Programming in the Pharmaceuticals Industry

Clinical Trials Terminology for SAS Programmers

Anonymous said...

Hi this is Suresh,
plz tel me how to proceed after getting SAP for sas programmers.....Thank u

Anonymous said...

Hi this is nanditha...
I have a question about SAS. Is there any practica site in SAS clinical trails.
for interviw base point which topics I have to put concentrate. I sudied base SAS and I started clinical trail material.
Thank you

Anonymous said...

Hi can any one tell what you need to know as a Pk/PD programmer.I heard phase I of clinical trial needs this programmer.what do i need to know.plz help.

pooja said...

nice work ....i have one suggestion and request.i tried SAS Sample Projects...They are good now i want to test my code and result for all projects if it is right or wrong.... how can i do this?

sarath said...


You can post your code here or else send it to me at I will take a look at your code.


Reena said...

Do we need to follow Test plan and Functional Specification document while performing validation for sas programs and TLF's or we just do the validation directly without any documents .. plz let me know answer for this

sarath said...

It all depends on the company you are working for. Many of them (small companies) don't even have a Test Plan or Functional Specification Document. Big pharmaceutical companies and CROs maintain these kinds of documents. In the 21 CFR Part 11 world, the correct way of doing validation is to follow the guidelines/specifications given in the Test Plan / Functional Specification Document.

Harish said...

please answer this quoestion
Name one major important attribute of SAS 9----most used in real life.

sarath said...

Variables attributes (length, label, format and informats), I see all of them as equally important .

Geetha said...

Very useful information .
Thanks for sharing with us.......

sujji said...

can anyone send me a sample project... to
thanks in advance

Anonymous said...


anyone know how to combine two observations into single observation?

Anonymous said...

hi,what would u do if you hav to access previous record values in current record

Akshara said...

What is the main difference in b/w Base SAS,proc sql and macros....?

Anonymous said...

Hi this is pushpa...
I have a question about SAS. Is there any practica site in SAS clinical trails.
for interviw base point which topics I have to put concentrate. I sudied base SAS and I started clinical trail material.present i goto base sas classes.
Thank you

Anonymous said...

Hi everybody,
I am new to clinical sas programming, can anyone clarify me after coding a program , where do we save this ? or we need to report this to somebody else.

Anonymous said...

Hi all,

Is SAS certification mandatory to get job related to clinical sas programmers in abroad or india.

Thank u

Unknown said...

what is efficacy table..i mean variables datavalues in that tables...give me eg plzzzz?

arnam said...

It's helpful really..Thanks a lot.
can anybody having sas clinical sample projects kindly mail me at

Monica said...

hi everyone,
can anyone please help me with QC process in sas.If i get a code to check as a QC how would I proceed.I am alos interested in sas online freelancing projects if anyone could suggest me links .,,that would b great help.

