Forecasting Bond Upgrades and Downgrades

How We Built Models to Predict Rating Changes in the Bond Market

By: Greg Obenshain & Brian Chingono

Can bond upgrades and downgrades be predicted? And can bond investors make money by overweighting bonds likely to be upgraded and avoiding those likely to be downgraded?

Over the past two years we have sought to answer these questions quantitatively, applying a range of different statistical tools to build a robust model to predict changes in credit quality.

Last week, we showed that upgrades and downgrades drive significant dispersion in bond returns, far beyond what can be explained by yield alone. This week, we explain how we built models to predict upgrades and downgrades. And next week, we’ll show how these models translate into returns in the bond market.

Our Methodology

The greatest challenge in quantitative analysis of credit markets is building a database that links corporate bond returns to company financials over a long period of time. The Verdad Bond Database has 30+ years of descriptive and financial data on over 5,000 US investment-grade and high-yield bonds per month. For this project, we aggregated the data to the company level with separate levels for secured, unsecured and subordinated debt. The resulting database contains data on 3,257 unique companies and more than 90 financial metrics, including standard financial metrics, debt-specific metrics, and constructed variables that have proved useful to us in previous quantitative work. In most cases, we also transformed the data to natural log scale, which gave a more symmetric distribution, making it easier to model.
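
As a rough illustration of this data-preparation step, the sketch below aggregates a hypothetical bond-level table to the company level and applies the log transform. The file name, column names, and choice of metrics are illustrative assumptions, not the actual schema of the Verdad Bond Database.

```python
import numpy as np
import pandas as pd

# Hypothetical bond-level frame; the file and column names are illustrative,
# not the actual Verdad Bond Database schema.
bonds = pd.read_parquet("bond_level_data.parquet")

# Aggregate bond-level records to one row per company, month, and seniority
# tier (secured / unsecured / subordinated).
company = (
    bonds.groupby(["company_id", "month", "seniority"], as_index=False)
         .agg(rating=("rating", "mean"),        # continuous market-implied rating
              spread=("spread", "mean"),
              ebitda=("ebitda", "first"),       # issuer financials repeat across bonds
              total_debt=("total_debt", "first"))
)

# Log-transform right-skewed metrics so their distributions are closer to
# symmetric, which makes them easier to model.
for col in ["spread", "ebitda", "total_debt"]:
    company[f"log_{col}"] = np.log1p(company[col].clip(lower=0))
```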

To build a model based on this data, we then had to define the variables we were attempting to predict. We decided to forecast changes in market-implied rating, defining an upgrade as a greater-than-one-unit increase in rating over twelve months and a downgrade as a greater-than-one-unit decrease over the same horizon. This corresponds to a one-notch upgrade (from BB2 to BB1, for example) or a one-notch downgrade (from A3 to BBB1, for example). Note that our underlying market-implied credit ratings are continuous, so they can change by decimal increments. The figure below shows the distribution of 12-month rating changes in our dataset.

Figure 1: Distribution of Changes in NTM Market-Implied Ratings, 1996–2020

Source: Verdad Bond Database. NTM represents the next twelve months.
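
To make the label definition concrete, here is a minimal sketch of how the upgrade and downgrade flags could be constructed from the continuous market-implied rating, continuing the hypothetical company-level frame above. It assumes a numeric scale on which higher values mean better credit quality (a sign convention we impose for illustration) and contiguous monthly rows per company and seniority tier.

```python
# Construct the upgrade / downgrade labels from the continuous market-implied
# rating, continuing the hypothetical `company` frame above. Assumes higher
# numeric values mean better credit quality and contiguous monthly rows.
df = company.sort_values(["company_id", "seniority", "month"]).copy()

# Rating twelve months ahead for the same company and seniority tier.
df["rating_ntm"] = df.groupby(["company_id", "seniority"])["rating"].shift(-12)
df["rating_change"] = df["rating_ntm"] - df["rating"]

# Upgrade: more than a one-unit increase over the next twelve months, roughly a
# full notch (e.g., BB2 -> BB1). A downgrade is the mirror image.
df["upgrade"] = (df["rating_change"] > 1).astype(int)
df["downgrade"] = (df["rating_change"] < -1).astype(int)
```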

Across the entire dataset, 21% of bonds were upgraded and 24% were downgraded over the next 12 months. The likelihood of an upgrade or downgrade, however, depends on the starting rating.

Interestingly, when we map the base rates of upgrades and downgrades by credit rating, we see different patterns. Upgrade probabilities are roughly linear in rating: the lower the starting rating, the higher the probability of an upgrade. Downgrade probabilities follow a U-shaped pattern, with the highest- and lowest-rated bonds most likely to be downgraded. The base rates of these upgrades and downgrades by market-implied rating are shown below.

Figure 2: Base Rate of Upgrades (Dec 1996 – Aug 2020)

Source: Verdad Bond Database. Gray is investment grade, green is high yield.

Figure 3: Base Rate of Downgrades (Dec 1996 – Aug 2020)

Source: Verdad Bond Database. Gray is investment grade, green is high yield.
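
A short sketch of how these conditional base rates could be computed from the labeled data, mirroring Figures 2 and 3 (again using the hypothetical frame from the sketches above):

```python
# Base rates of upgrades and downgrades conditional on the starting rating.
base_rates = (
    df.dropna(subset=["rating_change"])
      .assign(rating_bucket=lambda x: x["rating"].round())  # bucket the continuous rating
      .groupby("rating_bucket")[["upgrade", "downgrade"]]
      .mean()
)
print(base_rates)  # upgrade rates rise as ratings fall; downgrade rates are U-shaped
```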

To capture the importance of the starting rating when predicting upgrades and downgrades, we decided to rely on a non-parametric statistical model: a model that is not tethered to fixed assumptions about the relationships between variables. This is important because the initial rating interacts with different variables in different ways. To take a simple example, momentum measures based on credit spreads tend to mean revert for higher-rated bonds but tend to trend for lower-rated bonds (i.e., lower-rated bonds that are getting worse tend to continue getting worse, whereas higher-rated bonds that run into tough times can often turn things around). The ideal model would need to capture this dynamic, penalizing lower-rated bonds with negative momentum while rewarding higher-rated bonds with negative momentum, a task for which machine-learning models are well suited.
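
The toy example below illustrates the point with synthetic data (not Verdad's): a shallow decision tree fit on a made-up rule where downgrades concentrate in low-rated issuers with negative momentum ends up splitting on rating first and on momentum only within the low-rated branch, capturing the interaction without any hand-built interaction term.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 5_000
rating = rng.uniform(1, 20, n)     # hypothetical numeric rating, higher = better
momentum = rng.normal(0, 1, n)     # hypothetical spread-momentum signal

# Synthetic rule: downgrades concentrate in low-rated issuers with negative momentum.
downgrade = ((rating < 8) & (momentum < 0)).astype(int)

X = np.column_stack([rating, momentum])
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, downgrade)
print(export_text(tree, feature_names=["rating", "momentum"]))
# The fitted tree splits on rating first, then on momentum only within the
# low-rated branch, i.e. it learns the interaction without an explicit term.
```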

To ensure that we were developing a robust model that could predict out of sample, we created a hold-out sample of companies to evaluate out-of-sample predictive accuracy. That way, we could judge the model based on new data it has never seen, rather than the sample it has already learned from during training. To make the hold-out sample completely independent from the training sample, we randomly selected 40% of the companies to hold out. For example, across all years from 1996 to 2020, the training sample might include Apple, GM, and Exxon, but those companies will never show up in the hold-out sample. Similarly, the hold-out sample might include Boeing, Nike, and Verizon, but those three companies will never show up in the training sample.

The diagram below summarizes our approach to sample splitting.

Figure 4: Sample Splitting through Random Selection by Company

Source: Verdad
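
A minimal sketch of this company-level split, assuming the hypothetical labeled frame from the sketches above; scikit-learn's GroupShuffleSplit with the company identifier as the grouping variable would produce an equivalent grouped split.

```python
import numpy as np

# Hold out 40% of companies entirely, so no issuer ever appears in both the
# training sample and the hold-out sample.
rng = np.random.default_rng(42)
companies = df["company_id"].unique()
holdout_ids = rng.choice(companies, size=int(0.4 * len(companies)), replace=False)

is_holdout = df["company_id"].isin(holdout_ids)
train, holdout = df[~is_holdout], df[is_holdout]
```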

Finally, we trained two machine-learning models using the training sample: one to predict downgrades and one to predict upgrades. These random forest models work by building hundreds of independent decision trees, then averaging the predictions across those trees to make a combined forecast for each company. The goal was for the models to provide a probability of downgrade or upgrade for each new company in the hold-out sample. When we evaluated our models against the hold-out sample, we wanted to see companies with a high probability of downgrade actually being downgraded at high rates, and companies with a high probability of upgrade actually being upgraded at high rates. And indeed, this is what happened. The charts below divide the hold-out sample into deciles based on the forecast downgrade probability and the forecast upgrade probability from our two models. The actual outcomes for each forecast decile are shown in the solid bars. Notice how the actual outcomes closely align with the forecast probabilities in each decile.

Figure 5: Out-of-Sample Upgrade Forecast Accuracy (Dec 1996 – Aug 2020)

Source: Verdad Bond Database

Figure 6: Out-of-Sample Downgrade Forecast Accuracy (Dec 1996 – Aug 2020)

Source: Verdad Bond Database
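
For readers who want to see the mechanics, the sketch below trains the two classifiers and reproduces the decile comparison behind Figures 5 and 6. The feature list and hyperparameters are illustrative assumptions, not Verdad's actual specification.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature list; drop rows without a full 12-month forward window.
features = ["rating", "log_spread", "log_ebitda", "log_total_debt"]
train_ = train.dropna(subset=["rating_change"] + features)
holdout_ = holdout.dropna(subset=["rating_change"] + features).copy()

# One random forest per outcome: hundreds of trees, predictions averaged.
downgrade_model = RandomForestClassifier(n_estimators=500, min_samples_leaf=50, random_state=0)
upgrade_model = RandomForestClassifier(n_estimators=500, min_samples_leaf=50, random_state=0)
downgrade_model.fit(train_[features], train_["downgrade"])
upgrade_model.fit(train_[features], train_["upgrade"])

# Forecast probabilities for companies the models have never seen.
holdout_["p_downgrade"] = downgrade_model.predict_proba(holdout_[features])[:, 1]
holdout_["p_upgrade"] = upgrade_model.predict_proba(holdout_[features])[:, 1]

# Group the hold-out sample into forecast deciles and compare the average
# forecast probability with the realized downgrade rate in each decile.
holdout_["decile"] = pd.qcut(holdout_["p_downgrade"], 10, labels=False, duplicates="drop")
print(holdout_.groupby("decile")[["p_downgrade", "downgrade"]].mean())
```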

For the top decile, where the downgrade model forecasts a 45% probability of downgrade, 43% of companies are actually downgraded. This is 1.8x better than random guessing, as the base rate of downgrades is 24% in the data set.

Similarly, in the top decile of upgrade probability, where the upgrade model forecasts a 44% probability of upgrade, 38% of companies are actually upgraded. This is also 1.8x better than random guessing, as the base rate of upgrades is 21% in the data.
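
The lift figures follow directly from the rates quoted above:

```python
# The "lift" over random guessing implied by the figures quoted above.
downgrade_lift = 0.43 / 0.24   # top-decile realized downgrade rate vs. 24% base rate ≈ 1.8x
upgrade_lift = 0.38 / 0.21     # top-decile realized upgrade rate vs. 21% base rate ≈ 1.8x
```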

In sum, we found that it’s possible to predict bond upgrades and downgrades with significantly higher accuracy than random chance. Next week, we will discuss the return implications of forecasting upgrades and downgrades in bonds and whether we can use these predictions to generate higher returns.
