Seemingly Unrelated Regression

Ramin Mojab

2023-08-07

library(ldt)

seed <- 123
set.seed(seed)

Introduction

In this vignette, I will introduce you to the main features of the ldt package for dealing with Seemingly Unrelated Regression models. I will demonstrate how to perform common tasks such as estimating an SUR model and automatically finding significant coefficients. I will also discuss model uncertainty and how to define an SUR model set and automatically search for the best models within this set. Additionally, we will explore the use of Principal Component Analysis as an alternative approach when dealing with a large number of potential explanatory variables.

One of the main ideas behind ldt is to minimize user discretion. As such, an analysis in ldt is generally based on a dataset and a set of rules that convert this dataset into a list of potential regressors and/or predictors. This rule-based approach to selecting data not only avoids discretion but is also expected due to the word “automatically” used in the previous paragraph.

In this example, I will create an artificial dataset with dependent variables and both relevant and irrelevant explanatory variables. The dependent and relevant explanatory variables are a sample from a known SUR model. While we can discuss how well the estimation process can find the true parameters, this is not the main goal, here. Instead, I will focus on how to estimate, search, project, and report the results.

Let’s get started!

A simple experiment

Let’s start by assuming that we know the structure of the system. We can do this by simulating data from a known SUR model. The following command generates a sample from such a system of equations:

num_obs <- 100
num_eq <- 2L
num_exo <- 3L
sample <- sim.sur(num_eq, num_exo, num_obs, TRUE)

print(sample$coef)
  >                   Y1         Y2
  > Intercept -0.5604756 0.07050839
  > X1        -0.2301775 0.12928774
  > X2         1.5587083 1.71506499

We know the parameters of the system because they are included in the output of the sim.sur function. This system has 2 equations or dependent variables. Each equation has an intercept and 2 exogenous variables. There are no restrictions imposed on the system. The sample size is 100. All coefficients of the model are generated randomly and are listed in the output sample. This includes a positive definite matrix of regression covariance and a matrix of coefficients (where the first row includes the intercept coefficients). The LaTeX code for the equations of the system is in the eqsLatex element. It results in the following representation:

\[\begin{aligned} Y_1 = -0.56 + -0.23 X_1 + 1.56 X_2 + E_1, \sigma_1^2 = 1.81 \\ Y_2 = 0.07 + 0.13 X_1 + 1.72 X_2 + E_2, \sigma_2^2 = 0.67 \end{aligned}\]

Furthermore, the matrix representation is in the eqsLatexSys element and it results in the following formula:

\[\begin{aligned} \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{bmatrix} -0.56 & -0.23 & 1.56 \\ 0.07 & 0.13 & 1.72 \end{bmatrix} \begin{pmatrix} 1 \\ X_1 \\ X_2 \end{pmatrix} + \begin{pmatrix} E_1 \\ E_2 \end{pmatrix}, \Sigma = \begin{bmatrix} 1.81 & 0.25 \\ 0.25 & 0.67 \end{bmatrix} \end{aligned}\]

Remember that these are the parameters of the system.

We can use the systemfit package to estimate the model. This includes preparing the equations of the system (the first two lines of the following code) and estimating (the third line):

exp_names <- paste0(colnames(sample$x), collapse = " + ")

fmla <- lapply(1:ncol(sample$y), 
               function(i) as.formula(paste0("Y", i, " ~ -1 +", exp_names)))

fit_sf <- systemfit::systemfit(fmla, data = data.frame(sample$y, sample$x), 
                               method = "SUR")

fit_sf_sys <- sim.sur(fit_sf$residCov, 
                      matrix(fit_sf$coefficients, ncol = 2), 10, TRUE)

The last line is used for reporting purposes. Like before, I report the LaTeX formula:

\[\begin{aligned} \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{bmatrix} -0.47 & -0.28 & 1.59 \\ 0.05 & 0.10 & 1.71 \end{bmatrix} \begin{pmatrix} 1 \\ X_1 \\ X_2 \end{pmatrix} + \begin{pmatrix} E_1 \\ E_2 \end{pmatrix}, \Sigma = \begin{bmatrix} 1.72 & 0.19 \\ 0.19 & 0.67 \end{bmatrix} \end{aligned}\]

Note that this representation is not very appropriate for the estimated model because it does not report the estimated standard errors and there are better ways to do so. However, it suits our purpose here. Also, note that you can get better results (in terms of lower difference between actual and estimated coefficients) by increasing the sample size.

The following code does the same by using the estim.sur function in the ldt package:

fit <- estim.sur(sample$y, sample$x[,-1], addIntercept = TRUE) 

res_table <- coefs.table(fit, regInfo = c("obs", "sigma2"), 
                             formatLatex = FALSE)

The second line converts the estimated result into a table for presentation. It has some other arguments that can be used to control the format or level of information in the table. Here is the result of reporting the contents of the table with kable function:

Results of estimation using ldt::estim.sur function.
Y1 Y2
Intercept -0.47*** 0.05
X1 -0.28* 0.10
X2 1.59*** 1.71***
obs 100 100
sigma2 1.67 0.65

Difference between the results of systemfit and estim.sur functions (apart from other factors such as difference in the estimator or the initialization of covariance matrix in FGLS estimation) here is probably due to the adjustment of degrees of freedom.

Before we move on from this section, let me introduce the coefs.plot function in the ldt package. This function can be used to graphically plot the estimated coefficients or to compare them with the actual parameters. In this example, we will focus on the coefficient of X2 in the first equation. The following code can be used to plot it:

coefs <- fit$estimations$coefs
stds <- fit$estimations$stds
coefs.plot(intervals = list(list(value = coefs[3,1], label = "X2->Y1", col = "red",
                                 xmin = coefs[3,1] - 2 * stds[3,1],
                                 xmax = coefs[3,1] + 2 * stds[3,1])),
           points = list(list(value = sample$coef[3,1], col = "blue",
                              label = "Actual")),
           xlab = "Value", ylab = "")
Plot of an estimated coefficient, its confidence interval, and the actual value.
Plot of an estimated coefficient, its confidence interval, and the actual value.

Variable selection

In the previous subsection, we had all the relevant information about the system, except for the error term generated in sim.sur. Unless you have selected a very low number of observations (num_obs parameter) or the system is relatively large or complex, you should be able to find the structure or actual parameters. Note that we are not discussing the power of t-test or identification issues.

However, this setup may not be realistic enough. In reality, there is much uncertainty about the choice of variables and theory might not be enough to reduce it significantly. As mentioned in the introduction, ldt tries to minimize user discretion and automate things. This might include defining rules over an external dataset. Although I used external datasets in previous versions of ldt vignettes, I decided to rely on simulated data to reduce confusion. To simulate a dataset, I will add some irrelevant data to the analysis in this section. Let’s use the data from the previous subsection, but with an additional line that overrides the matrix of explanatory data in the sample and adds many irrelevant data:

sample$x <- cbind(sample$x, matrix(rnorm(num_obs * 50), ncol = 50, 
                                   dimnames = list(NULL,paste0("z",1:50))))

Therefore, there are 53 irrelevant and 2 relevant variables in the sample. Note that the number of irrelevant data is relatively large.

First, let’s see how significance search works in this situation. As before, we use:

fit <- estim.sur(sample$y, sample$x[,-1], addIntercept = TRUE,
                 searchSigMaxIter = 10, searchSigMaxProb = 0.05) 

res_table <- coefs.table(fit, regInfo = c("obs", "sigma2"), formatLatex = FALSE,
                             expList = c("Intercept", "X1", "X2", "X3", "X4", 
                                         "z3", "z6", "z14", "z32", "z42", "z47"))
Results of estimation using ldt::estim.sur function with significance search and variable selection uncertainty.
Y1 Y2
Intercept -0.59*** 0.18**
X1 1.05*** 0(r)
X2 0(r) 1.18***
X3 0(r) -0.94***
X4 0(r) 0.49***
z3 0.46*** 0.24***
z6 -0.20*** 0(r)
z14 -0.35** -0.21***
z32 0(r) 0.09**
z42 -0.50*** -0.25***
z47 -0.38** -0.18**
obs 100 100
sigma2 2.70 0.64

To reduce the length of the table, I ran coefs.table command twice: first without specifying expList, and after finding the name of unrestricted coefficients, with expList. With this top-to-bottom approach, we can get a more parsimonious model. However, you can see the effect of variable selection uncertainty in the table: a relevant variable is omitted and some irrelevant variables’ coefficients are significant.

Given a specific goodness-of-fit criterion or performance metric, there is a function in ldt that can do brute-force search to find the best model. It systematically and efficiently tries to estimate and evaluate all possible combinations of variables and selects the best one. We can discuss whether this best model is the true model or not. A more important issue is that while theoretically attractive, for large datasets this approach is not computationally feasible due to the large number of possible combinations of variables and therefore the relatively large size of the theoretical model set.

Assuming that we know the maximum number of relevant variables, the code to search the theoretical model set is as follows:

search_res <- search.sur(sample$y, sample$x, numTargets = ncol(sample$y),
                         xSizes = c(1:4), 
                         metricOptions = get.options.metric(typesIn = c("sic")))
search_res

This code is time-consuming and is not evaluated here. However, on my system, the elapsed time is 17 seconds and the number of searched models is 368830. Note that if we change our previous guess and assume that the maximum number of relevant variables is larger, for example 5, the size of the practical model set becomes 3847591 (10 times larger) and this is estimated in 210 seconds (12 times larger). We are using SIC as the performance metric and these elapsed times increase if we rely on out-of-sample simulation and their corresponding metrics. Therefore, we need to focus on some subsets of the model set. But how should we prioritize one part of the model set over another before the estimation and evaluation stage?

One might reduce the number of potential explanatory variables by using theory or statistical testing, similar to the top-to-bottom approach we talked about before. Since ldt dislikes user discretion, it provides a more systematic approach. The idea behind it is simple: smaller models are estimated, variables are selected, larger models are estimated with a lower number of potential explanatory variables. Here is the code:

x_size_steps = list(c(1, 2), c(3), c(4), c(5))
count_steps = c(NA, 20, 10, 9)

search_step_res <-
  search.sur.stepwise(sample$y, sample$x, numTargets = ncol(sample$y),
                      xSizeSteps = x_size_steps, countSteps = count_steps,
                      metricOptions = get.options.metric(typesIn = c("sic", "aic")),
                      searchItems = get.items.search(bestK = 10)
  )
search_step_res
  > method: sur 
  > expected: 1,876, searched: 1,876 (100%), failed: 0 (0%)
  > elapsed time: 0.06728618 minutes 
  > --------
  > 1. sic:
  >  Y1 (best=514.36)
  >  Y2 (best=514.36)
  > 2. aic:
  >  Y1 (best=488.308)
  >  Y2 (best=488.308)

The first two lines define the steps. We use all variables (NA in count_steps means all) to estimate models with sizes defined as the first element of x_size_steps (1, 2). Then we select 20 number of variables (2nd element in count_steps) from the information provided by the best models and estimate models with sizes determined by the second element of x_size_steps (3). And so on.

The size of the model subset and running time are greatly reduced. However, let’s see its performance.

To study or report results, we should use the summary function. The output of a search project in ldt does not contain estimation results but only the minimum level of information to replicate them. The summary function does the job and estimates the models. Here is the code:

ssum <- summary(search_step_res, y = sample$y, x = sample$x)

Usually, there is more than one model in the summary. This is because the output is first “target-variable-specific” and second “evaluation-specific”. I report some of them by creating a list of estimated models and like before by using coefs.table function:

mod_list <- list("SIC" = ssum$sic$target1$model$bests$best1,
                 "AIC" = ssum$aic$target1$model$bests$best1)
res_table <- coefs.table(mod_list, regInfo = c("obs", "sigma2"), formatLatex = FALSE)

Like before, the second line gets the coefficient table for presentation.

Estimation result of best models found using ldt::search.sur.stepwise function.
SIC SIC AIC AIC
dep. Y1 Y2 Y1 Y2
Intercept -0.74*** 0.11 -0.74*** 0.11
X1 1.11*** 0.04 1.11*** 0.04
X2 -0.45** 1.01*** -0.45** 1.01***
X3 -0.31* -1.09*** -0.31* -1.09***
X4 -0.06 0.45*** -0.06 0.45***
obs 100 100 100 100
sigma2 3.28 0.82 3.28 0.82

Note that the headers indicate the metric used to compare the models. The first row shows the name of the target variable in the search.

The algorithm has omitted all irrelevant variables and found the restrictions on the second equation. However, for the first equation, X3 is significant at the 90% confidence level, even though it should be insignificant.

It’s important to note that the results are specific to the selected seed. Drawing general conclusions is beyond the scope of this text.

Out-of-sample evaluations

In this section, we will discuss another aspect of ldt: out-of-sample evaluation. Since ldt is rather atheoretical, it tries to compare models based on their predictive power. Of course, we are not generally talking about time and future in SUR models, and by out-of-sample evaluation, we are referring to the cross-validation practice.

The following code is similar to the code in the previous section, but we define a out-of-sample process:

metric_options <- get.options.metric(typesOut = c("rmse", "crps"), 
                                     seed = -seed, simFixSize = 5, trainRatio = 0.75)
search_step_res <-
  search.sur.stepwise(sample$y, sample$x, numTargets = ncol(sample$y),
                      xSizeSteps = x_size_steps, countSteps = count_steps,
                      metricOptions = metric_options,
                      searchItems = get.items.search(bestK = 10)
  )
search_step_res
  > method: sur 
  > expected: 1,926, searched: 1,926 (100%), failed: 0 (0%)
  > elapsed time: 0.06706197 minutes 
  > --------
  > 1. rmse:
  >  Y1 (best=1.704)
  >  Y2 (best=0.966)
  > 2. crps:
  >  Y1 (best=0.969)
  >  Y2 (best=0.551)

We use 0.75 ratio of the observations (determined by trainRatio) for estimating and the rest for testing, and we repeat this experiment 5 times (determined by simFixSize). I report the result similar to the previous discussion:

ssum <- summary(search_step_res, y = sample$y, x = sample$x)
 
mod_list <- list("RMSE_T1" = ssum$rmse$target1$model$bests$best1,
                 "RMSE_T2" = ssum$rmse$target2$model$bests$best1,
                 "CRPS_T1" = ssum$crps$target1$model$bests$best1,
                 "CRPS_T2" = ssum$crps$target2$model$bests$best1)
res_table <- coefs.table(mod_list, regInfo = c("obs", "sigma2"), formatLatex = FALSE)
res_table <- res_table[,c(1,4,5,8)]

Please note that a column name such as RMSE_T1 indicates that the column belongs to a model selected based on the “RMSE” criteria, and the evaluation pertains to the first target variable. The last line is included for presentation purposes and serves to remove any unimportant columns. Below is the table of results:

Estimation result of best models found using ldt::search.sur.stepwise function and out-of-sample evaluation.
RMSE_T1 RMSE_T2 CRPS_T1 CRPS_T2
dep. Y1 Y2 Y1 Y2
Intercept -0.61*** -0.61***
X1 1.16*** 1.16***
z35 -0.35* -0.35*
z42 -0.49*** -0.49***
z47 -0.55*** -0.55***
X2 1.00*** 1.00***
X3 -1.05*** -1.05***
obs 100 100 100 100
sigma2 3.00 1.02 3.00 1.02

You may achieve better results by modifying the arguments of the get.options.metric function and using the get.items.modelcheck function to change the default value of searchItems (Recall that our goal is to exclude irrelevant z? variables and find zero restrictions). One important parameter to consider is the number of simulations, simFixSize. Its current value is 5. By increasing this value, you can allow the models to demonstrate their capabilities more effectively during the search process. Of course, increasing the simFixSize is expensive!

A Closer Look at the Coefficients

When searching and estimating a model set, ldt gathers various types of information, including the coefficients in the context of seemingly unrelated systems. You can adjust the arguments of the search.sur function to report the evaluation metric of all estimations or to summarize the estimation information for a coefficient. For example, you might want to know the combined distribution of a coefficient in all estimated models or perform an extreme bound analysis.

Let’s use the sample from the previous example for this purpose. We can use the get.items.search function to request other types of information:

search_items <- get.items.search(type1 = TRUE, bestK = 10,
                                 inclusion = TRUE, 
                                 cdfs = seq(-0.8, 0.3, 0.01),
                                 extremeMultiplier = 1.95, 
                                 mixture4 = TRUE)

Apart from the bestK parameter, which is set to a high value to meet step-wise search requirements, we changed the default value of five parameters. We set type1 to TRUE, meaning we need the first type of information (coefficients in SUR). inclusion = TRUE means inclusion weights are saved and returned. cdfs = seq(...) means we want the weighted average of CDFs in all estimated models at the given points. The sequence value helps us get the correct shape of the combined distribution. Also note that this sequence is selected after running the code a second time when we have information about extreme bounds. extremeMultiplier = 1.95 calculates and returns the combined confidence interval of a coefficient in all models. mixture4 = TRUE returns the first four moments of the combined distribution, which we will use to estimate a GLD distribution.

Let’s repeat the estimation with these new options:

search_step_res <-
  search.sur.stepwise(sample$y, sample$x, numTargets = ncol(sample$y),
                      xSizeSteps = x_size_steps, countSteps = count_steps,
                      metricOptions = get.options.metric(typesIn = c("aic")),
                      searchItems = search_items)
For convenience, let’s print the true model here: \[\begin{aligned} r sample$eqsLatexSys \end{aligned}\]

Inclusion Weights

Inclusion weight is a simple concept: it is the average weight of models that contain a specific variable. Therefore, if a variable has a higher inclusion weight, it contributes more information to explain the dependent variable.

Focusing on the second target variable, Y2, we can see from search_step_res$aic$target2$model$inclusion matrix that the average inclusion weight in all models is 4.9210088^{-110}. Let’s plot some of the highest inclusion weights:

Sorted Inclusion weights.
Sorted Inclusion weights.

Recall that the second dependent variable is not a function of z? variables (i.e., irrelevant ones), intercept and X1.

Combined Distribution

The CDF, extreme bounds, and mixture4 convey different levels of information about the combined density of a coefficient across all estimated models. Let’s focus on the effect of X2 on Y1.

First, we extract the actual and interval estimation:

actual <- list(value = sample$coef[3,1], label = "Actual")

b <- search_step_res$aic$target1$coefs$bests$item3$best1
interval <- list(value = b$mean, label = "X2 -> Y1",
                 xmin = b$mean - 1.95 * sqrt(b$var),
                 xmax = b$mean + 1.95 * sqrt(b$var))

Since we will use the coefs.plot function discussed earlier, we set the values as elements of a list. Next, we get the extreme bounds:

e <- search_step_res$aic$target1$coefs$extremeBounds
bound <- list(xmin = e[3,1], xmax = e[3,2], label= "E.B.")

Next, we use the CDFs to plot the true density function:

cdfs <- list(type = "cdfs", 
             cdfs = sapply(search_step_res$aic$target1$coefs$cdfs, function(c)c[3,1]),
             xs = search_items$cdfs, 
             label = "CDFs", smoothFun = function(y) smooth(y), col = "brown", lty = 1)

Then we get the GLD distribution using the gld package:

g <- search_step_res$aic$target1$coefs$mixture[3,]
g <- s.gld.from.moments(g[1], g[2], g[3], g[4], type = 3)
glds <- list(type = "gld",
                      p1 = g[1], p2 = g[2], p3 = g[3], p4 = g[4],
                      label = "GLD", quantiles = seq(0.001, 0.999, 0.001), lty = 2, col = "black")

Plotting is done using the coefs.plot function, similar to the previous example.

coefs.plot(intervals = list(interval), bounds = list(bound),
           points = list(actual), distributions = list(cdfs, glds),
           xlab = "Value", ylab = "")
Plot of an estimated coefficient in the search process.
Plot of an estimated coefficient in the search process.

Role of principle components

Consider a system of equations where there are many relevant variables, meaning that we cannot divide the data into relevant and irrelevant and each variable has something to offer. Estimating a large model might not be wise due to low degrees of freedom or other problems such as multicollinearity. In such a system, we might still want to search for the most significant variables or select a subset of the most important variables. However, we can also use principal component analysis (PCA).

As before, let’s start by defining our system:

num_exo <- 60L
sample <- sim.sur(num_eq, num_exo, num_obs, TRUE)

As you can see, we assume that there are 60 relevant exogenous variables in the system. The following code shows the estimation with PCA:

fit <- estim.sur(sample$y, sample$x, addIntercept = FALSE,
                  pcaOptionsX = get.options.pca(2,4))

By using get.options.pca, I am asking ldt not to include the first 2 variables in the PCA analysis. Also, I want it to include 4 principal components in the regression. The following table shows the result:

Results of estimation using ldt::estim.sur function with significance search and variable selection uncertainty.
Y1 Y2
Intercept -1.09 -0.06
X1 -0.27 3.51***
X_PC1 -0.91** 1.97***
X_PC2 -1.44*** 1.64***
X_PC3 1.05** 0.77
X_PC4 0.02 0.60
obs 100 100
sigma2 53.38 81.09