Using the Data Science Pipeline to Save Money Smarter - Jack Wilcom

Introduction

Hello! My name is Jack Wilcom, and I am a sophomore computer science student at the University of Maryland. Welcome to my final project for CMSC320 - Introduction to Data Science.

This project has two purposes:

1. To introduce the data science pipeline and demonstrate how you would use data to answer deep questions.
2. To answer a question I have been asking myself for a long time.

A little background is necessary to understand what question I am asking and why. A few years ago, I had my main savings account transferred entirely into my name, and now I have to make financial decisions with little help from my parents. The stress of this responsibility is significant and is no doubt exacerbated by my lack of financial knowledge. My friends and I discuss financial strategies. One of my friends is obsessed with the gold standard and frequently advertises gold as an investment. This aligns with something I often say regarding the inflation of the US dollar: if your money is sitting in a savings account, you might as well be gradually losing money. Now, I have a savings account in which a lot of money sits dormant for long periods of time. While I was previously content to pull that saying out of thin air, I would now like to answer the question legitimately and recommend a good course of action for myself and others.

So the question is... Is money that sits in savings accounts constantly losing buying power? Furthermore, what is the minimum risk required to prevent yourself from losing money over time? Certificates of deposit? Gold?

Preface/Disclaimer

First and foremost, I absolve myself of all responsibility for actions taken by others based on the insights from this project. If you decide to make financial decisions based on this information, that is your choice. Remember that the economy is enormously complicated and influenced by policies and disasters that are more often than not arbitrary. All insights and predictions generated from this project unrealistically assume that no policy change, economic crisis, or act of God influences the data from recent history onward. You will see examples of the data fluctuating due to this missing information later on.

This is a good opportunity to discuss the types of missing data. These unknown data (such as the interest data before 2009) would be classified as "Missing at Random" because they certainly exist and could have been recorded, but their absence is explained by factors other than the missing values themselves. Note that they are not "Missing Completely at Random," which would mean the missingness has nothing to do with the data at all, nor "Missing Not at Random," which would indicate that the missing values themselves are causing the data to go missing.

Anyway, on with the show!

Data Collection I

The most important dataset in this analysis must quantify inflation somehow, which presents a problem because inflation is not an intuitive quantity to measure. I found an API maintained by the US Bureau of Labor Statistics (BLS) that supplies a value known as the consumer price index (CPI) each month. CPI is based on the average prices of many consumer products at a given time and is commonly used as a metric for the "buying power" of the US dollar domestically. This is perfect for our analysis.

The first step is to import the libraries necessary for connecting to and using a JSON API. The requests library is useful for communicating with web servers, and the json library is used to decode what the API sends back.

Next, we use the provided API URL to request the data in JSON form.

I do this independently of the other steps because you can get a maximum time range of 20 years per request, and I want all available years. Additionally, the API is most functional when using an API key (please don't use my API key). There is a maximum number of requests per day (as is common with most free APIs), so you have to be careful with the number of requests you make.
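Here is a minimal sketch of what that request step could look like. The endpoint, the series ID (CUUR0000SA0, i.e. CPI-U for all items), and the payload layout are assumptions based on the public BLS v2 API documentation, not necessarily the exact code used in this project.

```python
import requests
import json

# Assumed BLS v2 endpoint and series ID; register for your own key.
BLS_URL = "https://api.bls.gov/publicAPI/v2/timeseries/data/"
API_KEY = "YOUR_BLS_API_KEY"

responses = []
# The API caps each request at a 20-year window, so we pull the data in chunks.
for start in range(1913, 2022, 20):
    payload = json.dumps({
        "seriesid": ["CUUR0000SA0"],
        "startyear": str(start),
        "endyear": str(min(start + 19, 2021)),
        "registrationkey": API_KEY,
    })
    responses.append(requests.post(BLS_URL,
                                   data=payload,
                                   headers={"Content-type": "application/json"}))
```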

Pandas is the cornerstone of the data science pipeline we use. It stores data of varying types in tables and provides ways to quickly perform calculations on rows and columns.

Unlike pandas, numpy handles same-type (homogeneous) array data very efficiently. It also provides useful values like np.nan to indicate non-numbers.

This gives us an array of responses, which contain strings that can be read as JSON.

Next, we run through these requests, convert them to JSON, and excavate the meaningful data out of the JSON. It is also important to provide error handling for this because if data is not available, we want to know.
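Something like the following sketch does this, building on the list of responses from the earlier request sketch. The response structure (status, then Results -> series -> data) follows the BLS v2 format, and the column names chosen here are illustrative rather than the exact ones used in the notebook.

```python
import pandas as pd

rows = []
for resp in responses:
    parsed = json.loads(resp.text)
    if parsed.get("status") != "REQUEST_SUCCEEDED":
        # If data is not available, we want to know about it rather than skip silently.
        print("Request failed:", parsed.get("message"))
        continue
    for obs in parsed["Results"]["series"][0]["data"]:
        rows.append({"year": obs["year"],
                     "month": obs["periodName"],   # e.g. "January"
                     "CPI": float(obs["value"])})

cpi_df = pd.DataFrame(rows)
```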

We built a pandas table containing the information for further processing. While this table is far more manageable than JSON, it is still not yet in a useful form from a data science perspective. For this, we need to do the next step.

Data Cleaning/Management

Usually, data received from outside sources has something wrong with it which makes it difficult to manage. For this source, the dates are two columns of strings.

It is important to make the dates machine-readable for further steps. We will use an intermediate mapping step to convert the month string into an integer, then use that integer to create a DateTime object for each row. We can then throw away the original year/month data, as it is no longer useful to us.
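A sketch of that cleaning step, using the illustrative column names from the frame above, might look like this:

```python
# Map month names to integers, assemble a proper datetime for each row,
# then drop the original year/month columns.
month_map = {"January": 1, "February": 2, "March": 3, "April": 4,
             "May": 5, "June": 6, "July": 7, "August": 8,
             "September": 9, "October": 10, "November": 11, "December": 12}

cpi_df["date"] = pd.to_datetime(pd.DataFrame({
    "year": cpi_df["year"].astype(int),
    "month": cpi_df["month"].map(month_map),
    "day": 1,                      # anchor every observation to the first of the month
}))
cpi_df = cpi_df.drop(columns=["year", "month"])
```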

Luckily, there is no missing data in this case, but there often is. If there were missing dates, we would just want to drop the whole row in this situation.

We will come back to this data soon enough, but there is more data to collect.

Data Collection II

The next thing we have to do is get data about interest rates. I searched up and down for an easier way to get this data, but I was completely unable to find a webpage or API version of all interest rate data. Instead, I found a spreadsheet in xlsx format on the FDIC website. I converted this into xls format so Pandas could read it.

First, we define a function for converting string date ranges into DateTimes. This keeps only the start of each range and discards the day-level detail, which simplifies the dates. This results in a loss of accuracy, but it will make merging the tables possible later on.
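A sketch of such a helper is below. The input format assumed here (something like "4/6/2009 - 4/12/2009") and the function name are illustrative; unparseable cells are turned into missing dates so they can be dropped later.

```python
from datetime import datetime
import pandas as pd

def range_to_datetime(date_range):
    try:
        start = str(date_range).split("-")[0].strip()     # keep only the start of the range
        parsed = datetime.strptime(start, "%m/%d/%Y")
        return parsed.replace(day=1)                       # truncate to the first of the month
    except ValueError:
        return pd.NaT                                      # unparseable cells become missing dates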

Now, we can read in the spreadsheet file and get it into a data frame. We have to iterate through each of the sheets, read them each as a data frame, drop the extra columns, properly index and name the columns and rows, and append the result to the table. We do a little bit of cleanup as well, removing missing dates with dropna.
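A rough sketch of that loop is below. The real sheets required manual inspection of columns and headers, so the details here are simplified; only the iterate-parse-concatenate pattern is meant to carry over, and the assumption that the first column holds the date range is mine.

```python
# Read every sheet at once: sheet_name=None returns a dict of sheet name -> DataFrame.
sheets = pd.read_excel("archive.xls", sheet_name=None)

frames = []
for name, sheet in sheets.items():
    sheet = sheet.dropna(how="all")                              # drop fully empty rows
    sheet["date"] = sheet.iloc[:, 0].apply(range_to_datetime)    # assumed: first column is the date range
    frames.append(sheet)

rates_df = pd.concat(frames, ignore_index=True)
rates_df["date"] = pd.to_datetime(rates_df["date"])
rates_df = rates_df.dropna(subset=["date"])                      # remove rows with missing dates
```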

The table is still really messy after this step, so further cleanup is required.

Data Cleaning/Management II

We don't want separate rows for the rate and the rate cap; that makes them completely unmanageable. In this part, we fix that by combining them by date. To do that, we first average the rates that fall into each month. This further simplifies the data for use with the other data we have. We then match each rate cap with its rate and merge the columns together. I fill the missing values with adjacent values because we want to assume that interest rates are not changing if we have no reason to believe they did. They are not continuous values, even though they may look like it.
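One way this reshaping could look is sketched below. It assumes, purely for illustration, that rates_df has a "kind" column marking each row as a "rate" or "rate_cap" and one column per account type (e.g. "Savings", "Checking", "Money Market", "12 Month CD"); the real column names may differ.

```python
# Average each (month, kind) group, then pivot so every account type / kind pair
# becomes its own column, e.g. "Savings_rate" and "Savings_rate_cap".
monthly = (rates_df
           .groupby([rates_df["date"].dt.to_period("M"), "kind"])
           .mean(numeric_only=True)
           .reset_index())
monthly["date"] = monthly["date"].dt.to_timestamp()

rates_wide = monthly.pivot(index="date", columns="kind")
rates_wide.columns = [f"{acct}_{kind}" for acct, kind in rates_wide.columns]

# Assume rates hold steady when a month has no observation: fill from neighbors.
rates_wide = rates_wide.ffill().bfill()
```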

Data Collection/Cleaning III

I promise this is the last dataset we need. This is the price history of gold directly from a webpage of JSON data. The process of converting it to a usable data frame is similar to the CPI data.

Finally, we can merge these tables together by date to make one cohesive table.
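A sketch of both steps, continuing from the frames built above, could look like this. The gold file is a plain JSON list of records; the field names assumed here ("Date", "Price") come from the datahub file and may need checking.

```python
gold_resp = requests.get(
    "https://pkgstore.datahub.io/core/gold-prices/monthly_json/data/"
    "3b036045b9090ddacfaacb2deb5f19de/monthly_json.json")
gold_df = pd.DataFrame(gold_resp.json())
gold_df["date"] = pd.to_datetime(gold_df["Date"])
gold_df = gold_df.rename(columns={"Price": "gold_price"}).drop(columns=["Date"])

# One cohesive table: CPI, interest rates, and gold keyed on the shared date column.
merged = (cpi_df
          .merge(rates_wide.reset_index(), on="date", how="left")
          .merge(gold_df, on="date", how="left"))
```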

This table is finally clean enough for the next step in the data science pipeline.

Data Exploration

This is the step where we graph some of the data and look for patterns and interesting things to explore further. It is useful to look for correlations between variables in most circumstances because we want to ultimately make more general statements about relations in the data.

I'll start by graphing some of the values over time, particularly the average interest rates of the three most common bank accounts.

Immediately we see a problem; this plot looks incomplete. This is where we must recognize the problem with the interest rates dataset, in that it only goes back to 2009.

We need to restrict our dates to 2009-present if we want to see patterns easily. The first interest rate data is from April 2009, so we'll start there.
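For example, the restriction and plot could be done as in the sketch below; the account-type column names follow the illustrative naming from the earlier reshaping sketch.

```python
import matplotlib.pyplot as plt

# Keep only the window where the FDIC data exists (April 2009 onward).
recent = merged[merged["date"] >= "2009-04-01"].reset_index(drop=True)

plt.figure(figsize=(10, 5))
for col in ["Checking_rate", "Savings_rate", "Money Market_rate"]:   # assumed column names
    plt.plot(recent["date"], recent[col], label=col)
plt.xlabel("Date")
plt.ylabel("Average APY (%)")
plt.legend()
plt.show()
```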

I was initially puzzled by how the data doesn't go back further, and I asked around for a reasonable explanation for this. I have a strong suspicion that it is related to the 2008 housing and loan crisis, which is why they keep records of this data now. I also was told that interest rates "never recovered", which is what we see in the data. This is a good example of data that is "missing at random".

It is also obvious that all of the rates rise and fall together. We will show this more clearly in a correlation matrix later on.

What is more interesting is the relationship between inflation rates and interest rates. It seems intuitive that as money becomes less valuable, interest rates would rise to compensate for lost value. This is something we want to find out.

Interest rates don't seem to correlate with inflation rates month-by-month. Incredibly, interest rates appear to have decreased overall in response to low buying power (high CPI).

This is to say that money devalues somewhat independently from the means to passively gain value back. This is a major problem because it suggests that money is not necessarily safe in bank accounts. This is a hint for one of our questions.

What about the value of a commodity like gold? Does it follow CPI?

While there appears to be some positive relationship between gold and CPI, gold does not appear to follow CPI on a small (month-by-month) scale. It is interesting to see how the dollar value of gold is constant until about 1970. This is because the US dollar's link to gold was abandoned around that time (gold convertibility formally ended in 1971), and the values of gold and the dollar diverged.

It is important to remember that even though the plots look similar, the values of gold and APY are very different and should not be compared directly. APY is a percentage RATE of value appreciation, whereas we are looking at PRICES of gold. We'll come back to the PRICE value of a bank account compared to gold in the next part.

We should graph the CPI to see if there is anything interesting going on.

We can see that somewhere around 1970, the shape of the CPI over time changes from one linear slope to another. Perhaps this hints at some kind of policy change? Regardless, it signals to us that we may be able to predict what CPI will be in the future using linear regression.

Before I move on, I wanted to demonstrate an effective way of generating a chart of correlation values between the variables. This may confirm or refute what we have posited from the graphs: CPI change is not correlated with interest rates, and the interest rates for different account types are strongly correlated.

You can make a correlation matrix that looks like this to find these relationships.

Seaborn has a great variety of advanced visualization functions - https://seaborn.pydata.org/tutorial.html

Matplotlib underlies the visualizations generated by many other libraries; we use it here to adjust the size of the matrix (otherwise it would be tiny).
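A minimal sketch of the heatmap is below. CPI_change is assumed here to be the month-over-month percent change in CPI, derived on the fly; everything else follows the illustrative table built earlier.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Add the derived CPI_change column and drop the date so only numeric columns remain.
corr_df = (recent
           .assign(CPI_change=recent["CPI"].pct_change())
           .drop(columns=["date"]))

plt.figure(figsize=(12, 10))                       # make the matrix large enough to read
sns.heatmap(corr_df.corr(), annot=True, cmap="coolwarm", center=0)
plt.show()
```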

This matrix confirms what we have seen already. Focus particularly on the "CPI_change" row, which has correlation values very close to 0. I suspect that the negative correlation between CPI and interest rates comes from the time period. The economy gets worse, interest rates dive, and inflation skyrockets - this is the unfortunate reality of our modern economy.

We have more than enough insight to move on to the next steps. It may seem like we didn't find anything helpful, but that would be a misconception. We were able to demonstrate a correlation among the interest rates of different account types and rule out any immediate month-by-month connection between the values of investments and inflation.

Data Analysis/Transformation (ML Preprocessing)

This is where the rubber starts to meet the road. By transforming the data and performing analysis steps, we will be able to predict how much commodities will cost in the future, find the minimum interest rate needed to maintain buying power over time, find the minimum time we must keep our money in the bank to make sure it stays safe, and determine whether gold is a good way to avoid the repercussions of inflation.

We start by making some new functions for transforming data. The buying power function takes a starting CPI and cost, then calculates (using an equation from Stack Exchange) the cost of the average product in the current year. While at first the relative value function appears to do the same thing, it calculates a fraction instead of a cost. Its anchor value is the "buying power" you have in a given year, and it is able to determine the buying power in different years.

If you are having trouble understanding this, don't worry. I will interpret the results in a moment.

We use a modified form of the equation from https://money.stackexchange.com/questions/88234/how-to-calculate-worth-of-money-n-year-ago-from-given-inflation-value:
old_price = new_price * (old_CPI/new_CPI)
new_price = old_price / (old_CPI/new_CPI)
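A minimal sketch of the two helpers, directly applying the equation above; the function and argument names here are illustrative rather than the exact ones used in the notebook.

```python
def buying_power(old_price, old_cpi, new_cpi):
    """Cost today of a product that cost old_price when the CPI was old_cpi."""
    return old_price / (old_cpi / new_cpi)

def relative_value(amount, old_cpi, new_cpi):
    """Buying power the amount has now, expressed relative to when the CPI was old_cpi."""
    return amount * (old_cpi / new_cpi)
```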

Now we see that on average, a product that cost 3 cents in 1913 would cost one dollar in 2021.

We also see that the buying power (higher is better) of 100 dollars from 1913 has diminished to a measly 3. This is to say that if you kept 100 dollars in your pocket for 100 years, the amount (value) of actual goods it could buy would fall by about 97 percent. Amazing!

Using this new data, we can perform calculations to determine the buying power of money when left in accounts over time.

We can start by looking at a fixed interest rate of 0.50% APY - approximately the average rate of a 3-year CD today. We use the monthly compound interest equation that is frequently taught in algebra classes to accomplish this. Relative_value_interest returns the relative value calculations from the previous cell, applied to money that is appreciating in value.

APY to monthly rate: https://www.thebalance.com/calculate-monthly-interest-315421
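A sketch of the fixed-rate experiment, building on the illustrative frames and helpers above: convert the APY to a monthly rate, compound $100 month by month, and adjust each month's balance by the CPI to get its buying power relative to the start.

```python
apy = 0.005                                  # 0.50% APY, roughly today's average 3-year CD
monthly_rate = (1 + apy) ** (1 / 12) - 1     # APY -> equivalent monthly rate

start_cpi = recent["CPI"].iloc[0]
fixed_rate_value = [relative_value(100 * (1 + monthly_rate) ** m, start_cpi, cpi)
                    for m, cpi in enumerate(recent["CPI"])]
```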

We can see that when factoring in inflation, the interest rate is negligible, even for today's 3-year CD rate.

While this gives us a unique perspective of what is happening over a long period, it is quite unrealistic: we can't assume that the interest rate stays the same every month. Luckily, we have the data to see what this looks like with actual interest rates instead. Unluckily, each calculation relies on the previous one, so we iterate through each individual row of the table, which is relatively slow.

This function is able to do it - it was a little tricky to write.
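A sketch of such a function is below. Each month's balance compounds at that month's actual rate, so every step depends on the previous one. It assumes the rate columns hold APY as a percentage (hence the division by 100) and reuses relative_value from the earlier sketch; the function name is illustrative.

```python
def value_with_real_rates(df, rate_col, principal=100):
    start_cpi = df["CPI"].iloc[0]
    balance = principal
    values = []
    for apy, cpi in zip(df[rate_col], df["CPI"]):
        monthly_rate = (1 + apy / 100) ** (1 / 12) - 1   # APY (%) -> monthly rate
        balance *= 1 + monthly_rate                      # compound this month's interest
        values.append(relative_value(balance, start_cpi, cpi))
    return values
```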

Now we use the function for the different interest rates.
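For example, under the column names assumed in the earlier sketches:

```python
savings_value = value_with_real_rates(recent, "Savings_rate")
checking_value = value_with_real_rates(recent, "Checking_rate")
cd12_value = value_with_real_rates(recent, "12 Month CD_rate")
```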

This does not look as impressive when adjusted for relative value as a result of monthly inflation.

Remember what "Value" means in this context. It is relative to the starting value of 100 dollars. We are mostly interested in the slope of these graphs, not necessarily the bias. This is how we will generalize the findings to a model allowing us to predict future values.

Finally, we have to add gold into the mix for comparison. The calculation is simpler because we already have the true value.
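A sketch of that comparison: start with $100 worth of gold, track its dollar value as the price moves, then apply the same CPI adjustment so it sits on the same buying-power scale as the accounts above.

```python
start_price = recent["gold_price"].iloc[0]
start_cpi = recent["CPI"].iloc[0]

gold_value = [relative_value(100 * price / start_price, start_cpi, cpi)
              for price, cpi in zip(recent["gold_price"], recent["CPI"])]
```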

Gold looks like a good way to preserve value over time, about on par (at minimum) with a one-year CD. It is best to take this with a grain of salt since gold has been relatively unstable in recent times (the past 10 years).

Machine Learning and Hypothesis Testing

The next step is to create models from the data that allow us to reliably calculate how much money you would lose or gain in the future. I'm going to use linear regression to find the slopes of the models.

Linear regression is perhaps the most straightforward machine learning technique: the computer generates a line of best fit by minimizing the total distance (residual) from the true data. It gives us an estimated slope and bias (y-intercept). This may not be helpful if the data does not actually have a linear relationship, since linearity is assumed. We would expect the data to be more exponential than linear, but the difference between exponential and linear over this short a time period is negligible. If we wanted to better fit something exponential, we could use polyfit or interaction terms, but this won't be necessary.

We will use the linear model from SKLearn - SKLearn docs: https://scikit-learn.org/stable/user_guide.html

We are not necessarily concerned with the biases in this specific use case, as they are not really important to make future predictions. We keep them anyway so we can check the fits later on.
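A sketch of the fitting step: x is months since the start of the window, y is each relative-value series computed above, and each fit's slope and bias go into a dict. The set of series included here is illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

series = {"savings": savings_value, "checking": checking_value,
          "12-month CD": cd12_value, "gold": gold_value}

X = np.arange(len(recent)).reshape(-1, 1)       # months since the first observation
models = {}
for name, y in series.items():
    reg = LinearRegression().fit(X, np.array(y))
    models[name] = (reg.coef_[0], reg.intercept_)   # (slope per month, bias)
```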

We have this map of slopes and biases.

Now we graph the models against the actual data to see how well they fit.

All but the last regression seem to fit well.

We saw already that gold did not look very linear in the 10-year span, so it is unsurprising that the linear regression is confused.

Let's try again using gold data from a larger range of values - this should help reduce the effect of the large dip in the center of the 10-year span.

This model looks slightly more reasonable.

We can test this using hypothesis testing.

Hypothesis testing determines whether two sets of values are statistically close enough to be considered the same, or whether the difference between them is significant. The value that quantifies this is the P-value. In most applications, the significance threshold is 0.05, so we will use that.

We have two hypotheses

Null Hypothesis - the default assumption, which we reject only if we can support the alternative: the 70-year model is NOT better at predicting gold value.

Alternative Hypothesis - what we are trying to prove through the test: The 70-year model is better at predicting gold value.

Significance threshold (maximum P-value): 0.05 - i.e. 5%

We also need values to test. We will use a method that looks similar to k-fold cross-validation, in which we split the data into a training and testing set, fit the values, then perform a t-test to get the P values. In this case k=8, but k can be any reasonable number that denotes the number of groups the data is split into. We also need to divide by variance because the variances of the smaller and larger sets are quite different.

This is somewhat different from real k-fold cross-validation, which is typically used on a single dataset to compare multiple modeling algorithms. In this circumstance, I am using it to test different subsets of the dataset with the same modeling algorithm. Focus more on the concept of hypothesis testing than on the generation of the normalized residuals, because my process is unorthodox.

Now we have the values, so we can use a T-test to test the hypothesis. SciPy provides a function that performs T-tests in one line.
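A sketch of the whole comparison, under the unorthodox scheme described above and reusing numpy and LinearRegression from the regression sketch: for each of k=8 folds we fit a line on the remaining data, collect the held-out residuals divided by that fold's variance, and then compare the two models with a two-sample t-test. X_10yr/gold_10yr and X_70yr/gold_70yr are hypothetical arrays holding each window's month indices and relative gold values.

```python
from scipy import stats

def normalized_residuals(X, y, k=8):
    scores = []
    indices = np.arange(len(y))
    for test_idx in np.array_split(indices, k):
        train_idx = np.setdiff1d(indices, test_idx)
        reg = LinearRegression().fit(X[train_idx], y[train_idx])
        resid = y[test_idx] - reg.predict(X[test_idx])
        scores.extend(resid / np.var(y[test_idx]))   # normalize by the fold's variance
    return np.array(scores)

t_stat, p_value = stats.ttest_ind(normalized_residuals(X_10yr, gold_10yr),
                                  normalized_residuals(X_70yr, gold_70yr))
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
```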

The P-value is significantly smaller than 0.05, indicating that the two are significantly different. The associated statistic suggests that the old model has a higher adjusted residual than the new model (because it is positive). A higher residual indicates a worse fit.

We can reject our null hypothesis and accept our alternative hypothesis, at least for this example.

Conclusion - Insight

The ultimate goal of this project was to get the slopes of the linear regressions, which help us answer the questions we asked in the beginning. Perhaps the most important part of data science is communicating the insights we learned from the data.

Let's remind ourselves of the central questions of the project:

"Is money that sits in savings accounts constantly losing buying power? Furthermore, what is the minimum risk required to prevent yourself from losing money over time? Certificate of deposit? Gold?"

Let's look at the slopes we found, including the new slope for gold. This is where we drop the bias because we want to generalize to any principal investment.

These numbers show the estimated percentage of the principal value you will earn back over 1 month and 1 year. NOTE: This will not reflect the USD AMOUNT of money in the account! This is a measure of buying power!

For example, in a 12 month CD, you will be able to buy on average 1.68% more potatoes than you could when you started. In that year span, the price of potatoes has changed, but so has the money in your account. This is a way to keep value currency agnostic.

We can finally answer our questions. (With the stipulation that we are basing our projections on the last 10 years of data and using the perfectly average interest rates; data science always comes with assumptions)

"Is money that sits in savings accounts constantly losing buying power?" -

Yes! Money is losing buying power over time regardless of interest (assuming average interest rate). This is true of average checking, savings, and money market accounts. It is possible to find better-than-average rates for these accounts, so it is possible to outperform the model if we are lucky.

"What is the minimum risk required to prevent your account from losing money over time? Certificate of deposit? Gold?" -

The minimum risk would be giving up our money to a bank for 6 months. The minimum appears to be an average 6-month CD, which earns next to nothing but is able to preserve value. Any longer CD has the potential to earn increasingly higher returns in the same time period, but locks in the money for longer.

Gold is a much riskier investment because it strays far above and below the regression line over time. There is a chance it outperforms a 5-year CD within a year, but there is also a chance that it underperforms a savings account. It is entirely dependent on when we buy and sell. Over a 70-year timescale, it is approximately equivalent to a 1-year CD. Its benefit is that we can trade it at any time and might get lucky by buying it at a good time.

In conclusion, we must decide beforehand how quickly we need to use our money. If the money will be unused for decades, it is best to get a 60 month CD, but if we need absolute flexibility, gold may be a better (albeit more risky) option.

I hope that this walkthrough/project has taught you some of the ways data scientists leverage the data science pipeline to generate insights. I would appreciate your feedback and comments. I can be contacted at:

Jack T. Wilcom
Email: jtwilcom@wilcomweb.net
Twitter: @FlyingPeng11


Data sources:

https://data.bls.gov/cgi-bin/surveymost?cu

https://www.fdic.gov/regulations/resources/rates/historical/archive.xlsx

https://pkgstore.datahub.io/core/gold-prices/monthly_json/data/3b036045b9090ddacfaacb2deb5f19de/monthly_json.json