Discrete Choice Model Estimation with R

Author

Dan Yavorsky

Warning

This is an early draft. The book is incomplete. Every aspect of it is subject to change. Many things might be incorrect. Are you sure you want to read this yet?

Welcome

You don’t understand it, until you have coded it.

It’s a mantra that my mentors instilled in me in graduate school, and one that I propagate on my students today. I believe it in deeply. If you want to understand a statistical model, it is insufficient to interact with the model only through pen and paper. No matter how well you organize the subscripts or how masterfully you interweave elements of the Greek and Roman alphabets, you won’t really “get it.” To deeply understand a statistical model, you need to be able to simulate data from the model, and then take that simulated data to an estimation routine that enables you to recover the parameters of interest.

This is something you learn to do in graduate school. But, unless you have one of the most caring and pedagogical advisers, no one teaches it to you! Instead, it’s a skill that students develop independently and inefficiently between classes and assignment due dates or, in my case, after I was knee-deep in my dissertation research. There is a tremendous imbalance between how immensely important this skill is and the time and instruction dedicated to it.¹ That imbalance is the motivation for this book.

I address this gap in the specific domain of Discrete Choice Modeling. I do so alongside both Kenneth Train’s masterful text Discrete Choice Methods with Simulation (Second Edition, 2009) freely available online at https://eml.berkeley.edu/books/choice2.html and the underappreciated tome Applied Choice Analysis by David Hensher, John Rose, and William Greene (Second Edition, 2015). These books are properly cited as Train (2009) and Hensher, Rose, and Greene (2015) but will hereafter be referred to as DCMS and ACA. I will reference them throughout.

Train and I align in our thinking:

“…the true value of [] choice modeling is the ability to create tailor-made models. The computation and programming steps that are needed to implement a new model are usually not difficult. An important goal [] is to teach these skills as an integral part of the exposition of the models themselves.”²

However, what is “usually not difficult” for an experienced researcher can be immensely difficult for a typical graduate student. The aim of this book is to ease that difficulty. I demonstrate R code alongside the math. You will see the mathematical derivations and pseudo-code from DCMS and ACA transformed into working R code. No mysteries will remain. To borrow a quote from Quintilian, Cobbett, Cooke, and many others: my goal is to take you through the process of programming the simulation and estimation of common discrete choice models not so that you can understand, but so that you cannot possibly misunderstand.

How to read this book

The active reader will benefit most from securing copies of DCMS and ACA alongside this book. While I introduce each model before diving into the associated programming, DCMS and ACA provide substantially more context regarding their development and use in economics and psychology. They also offer more thorough descriptions and derivations of the properties of the models. My recommendation is that you should read — or most likely, re-read — the referenced chapters from DCMS and ACA as we begin to explore each specific model, and then read and work through the corresponding chapter here, alternating between the texts. Do not simply read this book from start to finish without frequently returning to DCMS and/or ACA. Doing so only deprives you of the intended experience and likely dramatically reduces the amount you will learn from the process.

In addition, I strongly encourage you write or copy the code as you work though this book. I would mandate this if I could, but I’m not sitting there next to you and so I must settle for simply sharing my encouragement. Write the code. Play with the code. Break the code. Make it your own. That behavior is where deep understanding comes from, not from highlighting the occasional sentence or equation you find important.

Intended Audience

This book is intended for a narrow audience, predominantly current or former graduate students with an interest in discrete choice modelling who will find value from seeing and interacting with the programmatic implementation of the multinomial logit and its extended family of related models. In other words, someone who read DCMS and thought how would I code that? Or someone who used nlogit while working through the second half of ACA and wants to “see under the hood” of its estimation routines.

We will simulate data from the statistical models and then estimate the parameters of those models from the simulated data. Doing so will require us to generate random draws from distributions, approximate integrals with monte carlo simulation, maximize likelihood functions with optimization routines, and (when taking a Bayesian approach) explore posterior distributions with Markov Chain Monte Carlo methods.

We will do all of this in R, a freely available software environment for statistical computing and graphics. As a result, I assume you are reasonably familiar with R. If not, there is a tremendous set of free online R resources collected and organized at https://www.bigbookofr.com/. Popular books include R for Data Science properly cited as Wickham, Cetinka-Rundel, and Grolemund (2023) and Advanced R properly cited as Wickham (2019). If you are new to R or your familiarity with it has waned, you should consult these resources before proceeding here.

I also assume you have taken introductory statistics or econometrics courses, as the concepts and techniques taught there are foundational for understanding and estimating discrete choice models. In particular, you should have no uncertainty about the difference between a model, an estimator, and an estimate. To briefly review:

a model is the set of mathematical assumptions about how data are generated,
an estimator (or equivalently, the estimation routine) is an algorithm or function of the data, and
an estimate is the result of applying the estimator to a particular dataset.

In my experience, students are often given a model and an estimator, after which much time in the classroom is spent deriving properties of the estimator for that particular model. Often a homework assignment follows which asks students to implement the estimator on a dataset to find an estimate. While this approach succeeds in developing the student’s skills and understanding of the particular estimator for the particular model, it does so with blinders on, obscurring consideration about the choice or specification of the model as well as the choice of estimator.

I’m sure you can recall your introduction to linear regression where you were handed a model with a linear conditional mean and a Normally distributed “error” and told that you would estimate the parameters of the model via least squares. Hopefully this left you wondering about less-particular situations such as those with a non-continous outcome variable, a non-Normally distributed error term, a non-linear conditional mean, correlation between the observed covariates and the error, a causal interpretation, or a different estimation approach.

In this book, we will focus on models with non-continous outcome variables and we will use Maximum Likelihood and Bayesian methods to estimate the parameters of those models. If you would like to review key ideas from statistics and econometrics in preparation for tackling our discrete choice models with likelihood and Bayeian approaches, I highly recommend Abramovich and Ritov (2023, link) on the topic of mathematical statistics, Goldberger (1991, link) and Kennedy (2008, link) for econometrics, and McElreath (2018, link) and Gelman et al. (2013, link) for Bayesian statistics. Bruce Hansen’s recently released two-volume series Hansen (2022b, link) and Hansen (2022a, link) on probability, statistics, and econometrics are an excellent set of alternative resources.

Acknowledgments

I gratefully aknowledge the academic giants who pioneered and developed the field of Discrete Choice Analysis. The models offer flexible forms to study human behavior, the estimation methods are fascinating, their implementation usually proceeds from vexing to fun, and their application has been immensely useful in my consulting practice where I seek to help firms better understand consumers in order to improve consumer well-being and firm profitability.

I am also immensely grateful to the teams that work on the R-Project, those at Posit who provide RStudio, Positron, and Quarto, and the related communities of developers, academics, and R users. The free tools they provide are exceptional and I use them daily. The welcoming communities they have established provide the necessary support humans need, be it technical troubleshooting or emotional encouragement.

More personally, I would like to thank my academic mentors Ella Honka, Peter Rossi, and Eric Bradlow, as well as the folks that have or continue to tolerate me through my consulting work, including Prachi Bhalerao, June Wu, Chris Diener, Keith Chrzan, and the hosts and attendees of Sawtooth Software’s annual Analytics and Insights Summit. I have learned so much from these researcher’s guidance and often their work.

Special thanks go to generous people who reviewed and assisted with drafts of this book, including Darren Aeillo, Kalyan Rallabandi, and Geoff Zheng.

Unquestionably, my deepest thanks go to Alison, Zach, and Ben for their patience and support.

Colophon

An online version of this book is available at https://dcme-r.danyavorsky.com. The source of the book is available at https://github.com/dyavorsky/dcme-r. The book is authored using Quarto, an open-source scientific and technical publishing system that makes it easy to create articles, presentations, websites, books, and other publications that combine text and executable code.

What I’m referring to are select core skills from the fields of statistical computing and computational statistics. My experience was that graduate students in the social sciences – notably economics, psychology, and business – have little, if any, formal exposure to these ideas. See Rizzo (2019) for a related text that emphasizes computational statistics, and McElreath (2018) and Gelman et al. (2020) for a Bayesian perspective that emphasize a “Bayesian Workflow.”↩︎
DCMS pg 2. Train provides pseudo-code throughout his book and generously makes Matlab code available on his website.↩︎