1  Introduction

It is instructive to start with the first sentence of Train (2009) Section 1.3, “Discrete choice analysis consists of two interrelated tasks: specification of the behavioral model and estimation of the parameters of that model” (emphasis added). In particular, chapters 1–7 of Train (2009) only specify models; there is not even a hint of model estimation. And since I follow Train (2009), we too begin with a focus on model specification.

Let me be clear: many students will see \(y\) and \(x\) later in this section and immediately think about “fitting” a model; that is, they assume they have data that they will put into an estimation routine to find estimates of the parameters of the model. We are not there yet! At this early stage in the book, we are only specifying models; that is, listing sets of assumptions about data generating processes and exploring the implications of those assumptions. There are no data yet. There will be parameters introduced in our choice of model specification, but we are not yet estimating those parameters. Have patience; we will eventually do these things.

1.1 Recap of Train Ch. 1

Train denotes the outcome in any given choice situation as \(y\), determined by some observable factors collected in the vector \(\mathbf{x}\) and some unobservable factors collected in the vector \(\boldsymbol{\varepsilon}\). The factors (\(\mathbf{x}\) and \(\boldsymbol{\varepsilon}\)) relate to the agent’s choice (\(y\)) through a function \(y = h(\mathbf{x}, \boldsymbol{\varepsilon})\). We assume for the moment that we know \(h(\cdot)\) and that \(\mathbf{x}\) and \(\boldsymbol{\varepsilon}\) are length-one vectors (i.e., scalars) denoted \(x\) and \(\varepsilon\).

Since we do not observe \(\varepsilon\), we can’t predict \(y\) exactly. Instead, we focus on the probability of \(y\), that is:

\[ \begin{aligned} p(y|x) &= \Pr \left( \varepsilon \textrm{ such that } h(x,\varepsilon)=y \right) \\ &= \Pr \left( I \left[ h(x,\varepsilon)=y \right] = 1 \right) \\ &= \int I \left[ h(x,\varepsilon)=y \right] f(\varepsilon) \, d\varepsilon \end{aligned} \tag{1.1}\]

For certain special choices of \(h\) and \(f\), a closed-form expression1 for the integral is available. But more generally, for almost any choice of \(h\) and \(f\), we can approximate the integral through simulation. Train provides pseudo code on how to do so:

  1. Repeat the following two steps many (\(r=1, \ldots, R\)) times:
    • Draw \(\varepsilon^{(r)}\) from \(f(\varepsilon)\).
    • Determine whether \(h(x,\varepsilon^{(r)}) = y\). If so, set \(I^{(r)}=1\); else set \(I^{(r)}=0\).
  2. Average the \(R\) values of \(I\); this average is our simulated approximation of \(p(y|x)\).
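
Before turning to the examples, here is one way to express this pseudo-code as a generic R function. This is my own sketch, not Train’s: the name sim_prob() and its arguments (a behavioral function h_fn and a function draw_fn that returns a single draw from \(f\)) are choices of this implementation.

sim_prob <- function(y, x, h_fn, draw_fn, R=1000) {
    I <- vector(length=R)                       # storage for the indicator values
    for(r in 1:R) {
        eps  <- draw_fn()                       # draw eps^(r) from f
        I[r] <- as.integer(h_fn(x, eps) == y)   # 1 if h(x, eps^(r)) = y, else 0
    }
    mean(I)                                     # average of the R indicator values
}

We will put this helper to work on the binary logit example at the end of the chapter.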

Next we look at two examples where we use this procedure to approximate the \(p(y|x)\) integral. The first example I made up. The second example is the binary logit model discussed by Train, but for which I’ll show you how to simulate the \(p(y|x)\) integral.

1.2 A Simple Example

Let’s first set up a toy example to demonstrate how simulation can approximate the \(p(y|x)\) integral. Suppose \(x=0.5\) and \(\varepsilon\) is uniformly distributed between \(-1\) and \(1\). Define \(h(x, \varepsilon)\) to be:

\[ h(x, \varepsilon) = \begin{cases} 0 & \text{if } x + \varepsilon < 0 \\ 1 & \text{if } x + \varepsilon \in [0,1] \\ 2 & \text{if } x + \varepsilon > 1 \end{cases} \tag{1.2}\]

We’ll focus on the outcome \(y=2\). You can probably intuit that \(p(y=2 | x) = 0.25\), since only one quarter of the time will \(\varepsilon\) be sufficiently positive to make \(x + \varepsilon > 1\).2 Nevertheless, let’s approximate the integral representation of \(p(y=2|x)\) through simulation to ensure we understand the process.
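
Before simulating, we can confirm the exact value with R’s built-in uniform CDF, punif():

punif(0.5, min=-1, max=1, lower.tail=FALSE)
[1] 0.25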

To walk you through the code, we first set a seed so that the pseudo-random numbers generated by runif() can be replicated exactly each time the code is run (even on different computers). We then specify that we will use 1,000 draws in the simulation and create a vector I to hold our results. The simulation occurs via a for() loop: each time through the loop we take a draw of \(\varepsilon\), calculate \(0.5 + \varepsilon\), and check whether that sum is greater than one. If so, then \(h(x,\varepsilon)=2\), matching the value of \(y\) for the choice probability we want to assess — i.e., \(p(y=2|x)\) — and thus we store a \(1\) in the \(r^\textrm{th}\) position of I; otherwise we store a 0. We then average the values in I to get our approximation of \(p(y=2|x)\).

set.seed(1234)

R <- 1000                 # number of simulation draws
I <- vector(length=R)     # storage for the indicator values

for(r in 1:R) {
    eps  <- runif(1, min=-1, max=1)   # draw eps^(r) from f
    h    <- 0.5 + eps                 # x + eps with x = 0.5
    I[r] <- as.integer(h > 1)         # 1 if h(x, eps) = 2, else 0
}
mean(I)                   # our approximation of p(y=2|x)
[1] 0.258

The simulated value 0.258 approximates the exact value 0.25 and can be made closer by increasing the number of draws used in the simulation.

Notice that we took draws of \(\varepsilon\) to empirically approximate the integral representation of \(p(y|x)\), but we should not think of these draws as “data” in the sense of a dataset. They are ancillary numbers generated from the distribution of \(\varepsilon\) that we use to approximate the integral.

R users will recognize that we can shorten the code by taking advantage of R’s vectorized functions and its automatic conversion of logical (TRUE/FALSE) values to 1/0 in mathematical operations. Here is a shorter implementation of the simulation; whether it’s “better” code is a matter of preference.

set.seed(1234)
R <- 1000
mean( runif(R, min=-1, max=1) + 0.5 > 1 )
[1] 0.258
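
To see that claim about the number of draws in action, we can rerun the one-liner with increasing values of R. The Monte Carlo error shrinks roughly at rate \(1/\sqrt{R}\), so the printed estimates should tighten around the exact value 0.25:

set.seed(1234)
for(R in c(1e3, 1e4, 1e5)) {
    print( mean( runif(R, min=-1, max=1) + 0.5 > 1 ) )
}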

That’s it. If you can generate pseudo-random draws from the density \(f\) and you know \(h\), approximating a choice probability by simulation might only require a handful of lines of code.

1.3 A Binary Logit Model Example

As an example of a model whose choice probability has a closed-form solution, Train provides the binary logit model. In this example, we’ll let \(\mathbf{x}\) (and \(\boldsymbol{\beta}\)) be vectors of length two while \(\varepsilon\) remains a scalar, and we’ll specify \(h\) and \(f\) below.

The “binary” part refers to the fact that the decision maker does one of two things: they either take an action (\(y=1\)) or not (\(y=0\)). To tie this model into a framework of behavior, we start with a utility function \(U\). In Train’s specific example, utility is specified as

\[ U(\mathbf{x}, \boldsymbol{\beta}, \varepsilon) = \mathbf{x}'\boldsymbol{\beta}+ \varepsilon \tag{1.3}\]

where \(\mathbf{x}\) is a vector of observable explanatory variables, \(\boldsymbol{\beta}\) is a vector of parameters that through the functional form \(\mathbf{x}'\boldsymbol{\beta}\) effectively serve as weights on the covariates, and \(\varepsilon\) is a scalar collecting the influence of factors used by the decision maker but unobserved by the researcher. Notice that we’re allowing \(\mathbf{x}\) to be a vector in this example, whereas it was a scalar (or equivalently a length-one vector) in the previous example.

In this model, the threshold for action is 0 because the decision maker takes the action (\(y=1\)) when utility is positive; conversely, when utility is not positive, the decision maker elects not to take the action (i.e., \(y=0\)). Therefore we can specify \(h\) as:

\[ h(\mathbf{x}, \boldsymbol{\beta}, \varepsilon) = \begin{cases} 0 & \text{if} \hspace{1ex} U(\mathbf{x}, \boldsymbol{\beta}, \varepsilon) \le 0 \\ 1 & \text{if} \hspace{1ex} U(\mathbf{x}, \boldsymbol{\beta}, \varepsilon) > 0 \end{cases} \tag{1.4}\]

The “logit” part of the model’s name refers to the choice of \(f\). The binary logit model assumes \(f\) is the density of the standard logistic distribution:

\[ f(\varepsilon) = \frac{e^{-\varepsilon}}{(1+e^{-\varepsilon})^2} \tag{1.5}\]
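
As an aside, this density is built into R: dlogis() with its default location 0 and scale 1 is exactly Equation 1.5, as a quick check confirms:

eps <- 0.7
all.equal( dlogis(eps), exp(-eps) / (1 + exp(-eps))^2 )
[1] TRUE

Its companions plogis() (the CDF) and rlogis() (random draws) appear below.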

Having specified \(h\) and \(f\), let’s choose some values for \(\mathbf{x}\) and \(\boldsymbol{\beta}\) and use simulation to approximate the integral for \(p(y|\mathbf{x},\boldsymbol{\beta})\).3 Let’s pick \(\mathbf{x}=(0.5, 2)\) and \(\boldsymbol{\beta}=(3,-1)\) such that \(\mathbf{x}'\boldsymbol{\beta}= (0.5)(3)+(2)(-1) = 1.5 - 2 = -0.5\). We know from the closed-form solution to this model provided by Train that, with these values of \(\mathbf{x}\) and \(\boldsymbol{\beta}\), the probability the decision maker takes action is:

\[ p(y=1 | \mathbf{x}, \boldsymbol{\beta}) = \frac{e^{\mathbf{x}'\boldsymbol{\beta}}}{1 + e^{\mathbf{x}'\boldsymbol{\beta}}} = \frac{e^{-0.5}}{1+e^{-0.5}} = 0.3775407 \tag{1.6}\]
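
This formula is the CDF of the standard logistic distribution evaluated at \(\mathbf{x}'\boldsymbol{\beta}\), which R provides as plogis(), so we can compute the exact value directly:

plogis(-0.5)
[1] 0.3775407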

We can approximate this integral as before. Below I use the function rlogis() to take R=1000 draws from the standard logistic distribution, and I approximate the integral with the proportion of times \(\mathbf{x}'\boldsymbol{\beta} + \varepsilon\) is greater than the threshold for action (\(0\)):

set.seed(2345)
R <- 1000

x    <- c(0.5, 2)    # observable explanatory variables
beta <- c(3, -1)     # parameter (weight) vector

U <- as.vector(x %*% beta) + rlogis(R)   # R draws of U = x'beta + eps
mean(U > 0)                              # proportion of draws with U > 0
[1] 0.364

Our simulated value 0.364 approximates the exact value of the integral 0.378.
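
As a final connection, the sim_prob() helper sketched in Section 1.1 (my own construction, not Train’s) can run the same simulation once we package this model’s \(h\) and \(f\):

set.seed(2345)
beta    <- c(3, -1)
h_logit <- function(x, eps) as.integer( sum(x * beta) + eps > 0 )   # h from Equation 1.4
sim_prob(y=1, x=c(0.5, 2), h_fn=h_logit, draw_fn=function() rlogis(1))

With the same seed, the draws are consumed in the same order as in the vectorized code above, so this should reproduce the same 0.364.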

1.4 Key Learnings

The key learning from this chapter is that with discrete choice models our focus is on the probability of the choice outcome \(y\) – that is, \(p(y|x)\). This choice outcome \(y\) results from the joint distribution \(f\) of unobserved factors \(\boldsymbol{\varepsilon}\) and the behavioral model \(h\) that relates \(y\) to \(\mathbf{x}\) (and \(\boldsymbol{\varepsilon}\)). The probability of the choice outcome \(p(y|x)\) can be written in closed form for only very special choices of \(f\) and \(h\), but for almost any choice of \(f\) and \(h\) we can simulate \(p(y|x)\), as the two examples in this chapter demonstrate.


  1. In this context, a closed-form expression means a way of writing the integral so that the anti-derivative sign is not part of the solution. For example, the integral \(\int x \, dx\) has the closed-form expression \(x^2/2\) plus some constant. We will see later that the Extreme Value distribution is often chosen for \(f\) predominantly because it leads to a closed-form expression for the choice probability \(p(y|x)\).↩︎

  2. More precisely, \(p(y=2|x) = \Pr(x+\varepsilon > 1|x=0.5) = \Pr(\varepsilon > 0.5) = \int_{0.5}^1 f(\varepsilon) \, d\varepsilon = \int_{0.5}^1 \tfrac{1}{2} \, d\varepsilon = (0.5\varepsilon)\vert_{0.5}^1 = 0.25\), where \(f(\varepsilon)=\tfrac{1}{2}\) is the uniform density on \([-1,1]\).↩︎

  3. Only at the early stage of simulating data from a model to better understand the model do we pick values for \(\mathbf{x}\) and \(\boldsymbol{\beta}\). Typically, \(\mathbf{x}\) will be part of the data you collect and \(\boldsymbol{\beta}\) will be parameters whose values you seek to estimate.↩︎