Hello! My name is Jack Wilcom, a sophomore computer science student at the University of Maryland. I would like to welcome you to my final project for CMSC320 - Introduction to Data Science.
This project has two purposes:
1. To introduce the data science pipeline and demonstrate how you would use data to answer deep questions.
2. To answer a question I have been asking myself for a long time.
A little background is necessary to understand what question I am asking and why. A few years ago, my main savings account was transferred entirely into my name, and now I have to make financial decisions with little help from my parents. Of course, the stress of these decisions is significant and is no doubt exacerbated by my lack of financial knowledge. My friends and I discuss financial strategies. One of my friends is obsessed with the gold standard and frequently advertises gold as an investment. This aligns with something I frequently say regarding the inflation of the US dollar: if your money is sitting in a savings account, you might as well be gradually losing money. I have a savings account in which a lot of money sits dormant for long periods of time. While I was previously content to pull that saying out of thin air, I would now like to answer the question legitimately and recommend a good course of action for myself and others.
So the question is... Is money that sits in savings accounts constantly losing buying power? Furthermore, what is the minimum risk required to prevent yourself from losing money over time? Certificates of deposit? Gold?
First and foremost, I absolve myself of all responsibility for actions taken by others based on the insights from this project. If you decide to make financial decisions based on this information, that is your choice. Remember that the economy is enormously complicated and influenced by policies and disasters that are more often than not arbitrary. All insights and predictions generated from this project unrealistically assume that no policy change, economic crisis, or act of God influences the data from recent history onward. You will see examples of the data fluctuating due to this missing information later on.
This is a good opportunity to discuss the types of missing data. These unknown data (such as the interest data before 2009) would be considered "Missing at Random" because the missingness depends on something we can observe (the date), not on the missing values themselves. Note that they are not "Missing Completely at Random," which would mean the gaps are unrelated to any of the data, observed or unobserved.
Anyway, on with the show!
The most important dataset in this analysis must quantify inflation somehow. This immediately presents a problem, because inflation is not an intuitive thing to measure. I found an API maintained by the US Bureau of Labor Statistics (BLS) that supplies a value known as the consumer price index (CPI) for each month. CPI is based on the average price history of many consumer products in a given year and is commonly used as a metric for the "buying power" of the US dollar domestically. This is perfect for our analysis.
The first step is to import the libraries necessary for connecting to and using a JSON API. The requests library is useful for communicating with websites, and the json module is used to decipher what the API sends back.
import requests
import json
Next, we use the provided API URL to request the data in JSON form.
I do this independently of the other steps because you can get a maximum time range of 20 years per request, and I want all available years. Additionally, the API is most functional when using an API key (please don't use my API key). There is a maximum number of requests per day (as is common with most free APIs), so you have to be careful with the number of requests you make.
Pandas is the cornerstone of the data science pipeline we use. It stores data of varying types in tables and provides ways to quickly perform calculations on rows and columns.
Different from pandas, numpy handles same-type (homogeneous) array data very efficiently. It also provides useful values like np.nan to indicate non-numbers.
import pandas as pd
import numpy as np
# Getting the CPI for all of the years I can
# U.S. city average, All items - CUUR0000SA0
# https://data.bls.gov/cgi-bin/surveymost?cu
# Querying for inflation data
# DO NOT HAMMER THE API, THERE IS A DAILY REQUEST LIMIT!!!
sid = 'CUUR0000SA0'
key = "34547219aac541318824d07dd92cb278"
responses = []
headers = {'Content-type': 'application/json'}
year = 1913
while (year < 2022):
    data = json.dumps({"seriesid": [sid],"startyear":str(year), "endyear":str(year + 19), "registrationkey":key})
    p = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
    responses.append(p)
    year+=20
responses
[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]
This gives us an array of responses, which contain strings that can be read as JSON.
Next, we run through these requests, convert them to JSON, and excavate the meaningful data out of the JSON. It is also important to provide error handling for this because if data is not available, we want to know.
# Gathering inflation data into a table
idx = 0
inflation_data = pd.DataFrame(columns=["year", "month", "CPI"])
for response in reversed(responses):
    doc = json.loads(response.text)
    if (doc['status'] == "REQUEST_SUCCEEDED"):
        data = doc['Results']['series'][0]['data']
        for entry in data:
            inflation_data = inflation_data.append(pd.Series(data=[entry['year'], entry['periodName'], entry['value']], index=["year", "month", "CPI"], name=idx))
            idx = idx + 1
    else:
        print("*** REQUEST FAILED ***")
        print(doc)
inflation_data
| | year | month | CPI |
---|---|---|---|
0 | 2021 | March | 264.877 |
1 | 2021 | February | 263.014 |
2 | 2021 | January | 261.582 |
3 | 2020 | December | 260.474 |
4 | 2020 | November | 260.229 |
... | ... | ... | ... |
1294 | 1913 | May | 9.7 |
1295 | 1913 | April | 9.8 |
1296 | 1913 | March | 9.8 |
1297 | 1913 | February | 9.8 |
1298 | 1913 | January | 9.8 |
1299 rows × 3 columns
We built a pandas table containing the information for further processing. While this table is far more manageable than JSON, it is still not yet in a useful form from a data science perspective. For this, we need to do the next step.
Usually, data received from outside sources has something wrong with it which makes it difficult to manage. For this source, the dates are two columns of strings.
It is important to make the dates machine-readable for further steps. We will use an intermediate mapping step to convert the month string into an integer, then use that integer to create a DateTime object for each row. We can then throw away the original year/month data, as it is no longer useful to us.
Luckily, there is no missing data in this case, but there often is. If there were missing dates, we would just want to drop the whole row in this situation.
inflation_data_clean = inflation_data.copy()
month_dict = {"January":"1", "February":"2", "March":"3", "April":"4", "May":"5", "June":"6", "July":"7", "August":"8", "September":"9", "October":"10", "November":"11", "December":"12"}
inflation_data_clean["month_int"] = inflation_data_clean["month"].map(lambda x: month_dict[x])
def make_date(year, month):
    try:
        return pd.to_datetime(str(month) + "/1/" + str(year))
    except:
        return np.nan
inflation_data_clean["date"] = inflation_data_clean.apply(lambda x: make_date(x["year"], x["month_int"]), axis=1)
# get rid of extra columns
inflation_data_clean = inflation_data_clean.dropna().drop(["year", "month", "month_int"], axis=1)
inflation_data_clean = inflation_data_clean[["date", "CPI"]]
inflation_data_clean
| | date | CPI |
---|---|---|
0 | 2021-03-01 | 264.877 |
1 | 2021-02-01 | 263.014 |
2 | 2021-01-01 | 261.582 |
3 | 2020-12-01 | 260.474 |
4 | 2020-11-01 | 260.229 |
... | ... | ... |
1294 | 1913-05-01 | 9.7 |
1295 | 1913-04-01 | 9.8 |
1296 | 1913-03-01 | 9.8 |
1297 | 1913-02-01 | 9.8 |
1298 | 1913-01-01 | 9.8 |
1299 rows × 2 columns
We will come back to this data soon enough, but there is more data to collect.
The next thing we have to do is get data about interest rates. I searched up and down for an easier way to get this data, but I was completely unable to find a webpage or API version of all interest rate data. Instead, I found a spreadsheet in xlsx format on the FDIC website. I converted this into xls format so Pandas could read it.
First, we define a function for converting the string date ranges into DateTimes. It keeps only the month and year of the range's start date, discarding the day and the rest of the range. This results in a loss of accuracy, but it will make merging the tables possible later on.
# https://www.fdic.gov/regulations/resources/rates/historical/archive.xlsx
# must manually download and convert to xls format to be read
# Simplify date range into month and year (this is pretty rough, results in a loss of accuracy)
def simplify_date(date_range):
    try:
        dvals = date_range.split("-")[0].split("/")
        return pd.to_datetime(dvals[0] + "/1/" + dvals[2])
    except:
        return np.nan
Now, we can read in the spreadsheet file and get it into a data frame. We have to iterate through each of the sheets, read them each as a data frame, drop the extra columns, properly index and name the columns and rows, and append the result to the table. We do a little bit of cleanup as well, removing missing dates with dropna.
# new interest data frame
final_interest_frame = pd.DataFrame()
for i in reversed(range(2009,2022)):
    interest_table = pd.read_excel("archive.xls", sheet_name= str(i) + " Archive", header=1).transpose()
    # get rid of all other values, we don't care about them
    range_of_drop_columns = [2] + [x for x in range(14, len(interest_table.columns))]
    interest_table = interest_table.drop(range_of_drop_columns, axis=1)
    interest_table.columns = ["date range", "type", "savings", "checking", "money market",
                              "1 month CD","3 month CD","6 month CD","12 month CD",
                              "24 month CD", "36 month CD","48 month CD", "60 month CD"]
    interest_table = interest_table.reset_index().drop("index", axis=1)
    interest_table = interest_table.drop(0, axis=0)
    # Get rid of na date value rows
    interest_table = interest_table.dropna(axis=0, subset=["date range"])
    if (final_interest_frame.empty):
        final_interest_frame = interest_table
    else:
        final_interest_frame = final_interest_frame.append(interest_table)
final_interest_frame["simple_date"] = final_interest_frame["date range"].map(simplify_date)
final_interest_frame
| | date range | type | savings | checking | money market | 1 month CD | 3 month CD | 6 month CD | 12 month CD | 24 month CD | 36 month CD | 48 month CD | 60 month CD | simple_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
81 | 3/29/21-3/31/21 | National Rate | 0.04 | 0.03 | 0.06 | 0.04 | 0.06 | 0.09 | 0.14 | 0.18 | 0.22 | 0.25 | 0.3 | 2021-03-01 |
82 | 3/29/21-3/31/21 | National Rate Cap | 0.79 | 0.78 | 0.81 | 0.79 | 0.81 | 0.84 | 0.89 | 0.93 | 0.97 | 1 | 1.05 | 2021-03-01 |
83 | 3/22/21-3/28/21 | National Rate | 0.04 | 0.03 | 0.06 | 0.04 | 0.06 | 0.09 | 0.14 | 0.19 | 0.22 | 0.25 | 0.3 | 2021-03-01 |
84 | 3/22/21-3/28/21 | National Rate Cap | 0.79 | 0.78 | 0.81 | 0.79 | 0.81 | 0.84 | 0.89 | 0.94 | 0.97 | 1 | 1.05 | 2021-03-01 |
85 | 3/15/21-3/21/21 | National Rate | 0.04 | 0.03 | 0.06 | 0.04 | 0.06 | 0.09 | 0.14 | 0.19 | 0.23 | 0.25 | 0.3 | 2021-03-01 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
62 | 6/01/09 - 6/07/09 | National Rate Cap | 0.97 | 0.89 | 1.21 | NaN | 1.42 | 1.7 | 2 | 2.29 | 2.53 | 2.78 | 2.94 | 2009-06-01 |
63 | 5/25/09 - 5/31/09 | National Rate | 0.22 | 0.14 | 0.46 | NaN | 0.71 | 0.98 | 1.27 | 1.56 | 1.8 | 2.04 | 2.21 | 2009-05-01 |
64 | 5/25/09 - 5/31/09 | National Rate Cap | 0.97 | 0.89 | 1.21 | NaN | 1.46 | 1.73 | 2.02 | 2.31 | 2.55 | 2.79 | 2.96 | 2009-05-01 |
65 | 5/18/09 - 5/24/09 | National Rate | 0.22 | 0.14 | 0.48 | NaN | 0.71 | 0.99 | 1.29 | 1.58 | 1.82 | 2.07 | 2.23 | 2009-05-01 |
66 | 5/18/09 - 5/24/09 | National Rate Cap | 0.97 | 0.89 | 1.23 | NaN | 1.46 | 1.74 | 2.04 | 2.33 | 2.57 | 2.82 | 2.98 | 2009-05-01 |
1238 rows × 14 columns
The table is still really messy after this step, so further cleanup is required.
We don't want separate rows for the rate and the rate cap; that makes the table unmanageable. In this part, we fix that by combining them by date. To do that, we first average the rates that fall into each month, which further simplifies the data for use with the other data we have. We then match each rate cap with its rate and merge the columns together. I fill the missing values with adjacent values because we want to assume that the interest rates are not changing if we have no reason to believe they did. They are not continuous values, even though they may look like it.
# group the rows by date
group = final_interest_frame.groupby(["simple_date"])
idx = 0
labels = ["date",
"savings","savings cap",
"checking", "checking cap",
"money market", "money market cap",
"1 month CD", "1 month CD cap",
"3 month CD", "3 month CD cap",
"6 month CD", "6 month CD cap",
"12 month CD", "12 month CD cap",
"24 month CD", "24 month CD cap",
"36 month CD", "36 month CD cap",
"48 month CD", "48 month CD cap",
"60 month CD", "60 month CD cap"]
interest_data_clean = pd.DataFrame(columns=labels)
# split by type
for key in group.groups.keys():
    type_group = group.get_group(key).groupby("type")
    rate_row = type_group.get_group("National Rate")
    cap_row = type_group.get_group("National Rate Cap")
    # averaging all of the data for each date and merging rate and cap lines.
    new_row_data = [key,
                    rate_row["savings"].mean(), cap_row["savings"].mean(),
                    rate_row["checking"].mean(), cap_row["checking"].mean(),
                    rate_row["money market"].mean(), cap_row["money market"].mean(),
                    rate_row["1 month CD"].mean(), cap_row["1 month CD"].mean(),
                    rate_row["3 month CD"].mean(), cap_row["3 month CD"].mean(),
                    rate_row["6 month CD"].mean(), cap_row["6 month CD"].mean(),
                    rate_row["12 month CD"].mean(), cap_row["12 month CD"].mean(),
                    rate_row["24 month CD"].mean(), cap_row["24 month CD"].mean(),
                    rate_row["36 month CD"].mean(), cap_row["36 month CD"].mean(),
                    rate_row["48 month CD"].mean(), cap_row["48 month CD"].mean(),
                    rate_row["60 month CD"].mean(), cap_row["60 month CD"].mean()]
    interest_data_clean = interest_data_clean.append(pd.Series(data=new_row_data, index=labels, name=idx))
    idx += 1
# remove a few typo values and fill missing values with nearest neighbor
interest_data_clean = interest_data_clean.drop([0,144]).fillna(method="pad").fillna(method="bfill")
# flip the rows the other way around so they look more like the other table
interest_data_clean = interest_data_clean.reindex(index=interest_data_clean.index[::-1]).reset_index().drop("index", axis=1)
# ready-to-use interest data
interest_data_clean
| | date | savings | savings cap | checking | checking cap | money market | money market cap | 1 month CD | 1 month CD cap | 3 month CD | ... | 12 month CD | 12 month CD cap | 24 month CD | 24 month CD cap | 36 month CD | 36 month CD cap | 48 month CD | 48 month CD cap | 60 month CD | 60 month CD cap |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2021-03-01 | 0.0400 | 0.7900 | 0.0320 | 0.7820 | 0.0600 | 0.8100 | 0.040000 | 0.790000 | 0.0600 | ... | 0.1400 | 0.8900 | 0.1880 | 0.9380 | 0.2260 | 0.9760 | 0.2520 | 1.0020 | 0.3040 | 1.0540 |
1 | 2021-02-01 | 0.0475 | 0.7975 | 0.0375 | 0.7875 | 0.0600 | 0.8100 | 0.040000 | 0.790000 | 0.0700 | ... | 0.1500 | 0.9000 | 0.1975 | 0.9475 | 0.2350 | 0.9850 | 0.2600 | 1.0100 | 0.3150 | 1.0650 |
2 | 2021-01-01 | 0.0500 | 0.8000 | 0.0400 | 0.7900 | 0.0700 | 0.8200 | 0.040000 | 0.790000 | 0.0700 | ... | 0.1550 | 0.9050 | 0.2050 | 0.9550 | 0.2450 | 0.9950 | 0.2700 | 1.0200 | 0.3250 | 1.0750 |
3 | 2020-12-01 | 0.0500 | 0.8000 | 0.0400 | 0.7900 | 0.0700 | 0.8200 | 0.047500 | 0.797500 | 0.0700 | ... | 0.1600 | 0.9100 | 0.2100 | 0.9600 | 0.2550 | 1.0050 | 0.2775 | 1.0275 | 0.3350 | 1.0850 |
4 | 2020-11-01 | 0.0500 | 0.8000 | 0.0400 | 0.7900 | 0.0700 | 0.8200 | 0.050000 | 0.800000 | 0.0720 | ... | 0.1720 | 0.9220 | 0.2220 | 0.9720 | 0.2700 | 1.0200 | 0.2940 | 1.0440 | 0.3500 | 1.1000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
138 | 2009-09-01 | 0.2125 | 0.9625 | 0.1300 | 0.8800 | 0.4075 | 1.1575 | 0.235000 | 0.985000 | 0.5000 | ... | 1.0475 | 1.7975 | 1.5150 | 2.2650 | 1.7700 | 2.5200 | 1.9950 | 2.7450 | 2.2525 | 3.0025 |
139 | 2009-08-01 | 0.2100 | 0.9600 | 0.1320 | 0.8820 | 0.4060 | 1.1560 | 0.220000 | 0.970000 | 0.5340 | ... | 1.1040 | 1.8540 | 1.5340 | 2.2840 | 1.7660 | 2.5160 | 2.0100 | 2.7600 | 2.2060 | 2.9560 |
140 | 2009-07-01 | 0.2200 | 0.9700 | 0.1425 | 0.8925 | 0.4200 | 1.1700 | 0.225000 | 0.975000 | 0.5775 | ... | 1.1550 | 1.9050 | 1.5550 | 2.3050 | 1.7675 | 2.5175 | 2.0225 | 2.7725 | 2.1825 | 2.9325 |
141 | 2009-06-01 | 0.2120 | 0.9620 | 0.1400 | 0.8900 | 0.4400 | 1.1900 | 0.236667 | 0.986667 | 0.6340 | ... | 1.2180 | 1.9680 | 1.5380 | 2.2880 | 1.7680 | 2.5180 | 2.0240 | 2.7740 | 2.1940 | 2.9440 |
142 | 2009-05-01 | 0.2200 | 0.9700 | 0.1400 | 0.8900 | 0.4700 | 1.2200 | 0.236667 | 0.986667 | 0.7100 | ... | 1.2800 | 2.0300 | 1.5700 | 2.3200 | 1.8100 | 2.5600 | 2.0550 | 2.8050 | 2.2200 | 2.9700 |
143 rows × 23 columns
I promise this is the last dataset we need. This is the price history of gold directly from a webpage of JSON data. The process of converting it to a usable data frame is similar to the CPI data.
gold_json = requests.get('https://pkgstore.datahub.io/core/gold-prices/monthly_json/data/3b036045b9090ddacfaacb2deb5f19de/monthly_json.json')
gold_df = pd.DataFrame(columns=["date_string", "price"]);
entries = json.loads(gold_json.text)
for entry in entries:
    gold_df = gold_df.append(pd.Series(data=[entry['Date'], entry['Price']], index=["date_string", "price"], name=0))
# for turning the string dates into date objects
def to_date(date_string):
    try:
        date_split = date_string.split("-")
        return pd.to_datetime(date_split[1] + "/1/" + date_split[0])
    except:
        return np.nan
gold_df["date"] = gold_df["date_string"].map(to_date)
gold_df["gold_price"] = gold_df["price"]
gold_df = gold_df.drop(["date_string", "price"], axis=1)[["date", "gold_price"]]
gold_df
| | date | gold_price |
---|---|---|
0 | 1950-01-01 | 34.730 |
0 | 1950-02-01 | 34.730 |
0 | 1950-03-01 | 34.730 |
0 | 1950-04-01 | 34.730 |
0 | 1950-05-01 | 34.730 |
... | ... | ... |
0 | 2020-03-01 | 1593.764 |
0 | 2020-04-01 | 1680.030 |
0 | 2020-05-01 | 1715.697 |
0 | 2020-06-01 | 1734.032 |
0 | 2020-07-01 | 1840.807 |
847 rows × 2 columns
Finally, we can merge these tables together by date to make one cohesive table.
df_clean = inflation_data_clean.merge(interest_data_clean, on="date", how='left')
df_clean = df_clean.merge(gold_df, on="date", how='left')
df_clean
| | date | CPI | savings | savings cap | checking | checking cap | money market | money market cap | 1 month CD | 1 month CD cap | ... | 12 month CD cap | 24 month CD | 24 month CD cap | 36 month CD | 36 month CD cap | 48 month CD | 48 month CD cap | 60 month CD | 60 month CD cap | gold_price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2021-03-01 | 264.877 | 0.0400 | 0.7900 | 0.0320 | 0.7820 | 0.06 | 0.81 | 0.0400 | 0.7900 | ... | 0.890 | 0.1880 | 0.9380 | 0.226 | 0.976 | 0.2520 | 1.0020 | 0.304 | 1.054 | NaN |
1 | 2021-02-01 | 263.014 | 0.0475 | 0.7975 | 0.0375 | 0.7875 | 0.06 | 0.81 | 0.0400 | 0.7900 | ... | 0.900 | 0.1975 | 0.9475 | 0.235 | 0.985 | 0.2600 | 1.0100 | 0.315 | 1.065 | NaN |
2 | 2021-01-01 | 261.582 | 0.0500 | 0.8000 | 0.0400 | 0.7900 | 0.07 | 0.82 | 0.0400 | 0.7900 | ... | 0.905 | 0.2050 | 0.9550 | 0.245 | 0.995 | 0.2700 | 1.0200 | 0.325 | 1.075 | NaN |
3 | 2020-12-01 | 260.474 | 0.0500 | 0.8000 | 0.0400 | 0.7900 | 0.07 | 0.82 | 0.0475 | 0.7975 | ... | 0.910 | 0.2100 | 0.9600 | 0.255 | 1.005 | 0.2775 | 1.0275 | 0.335 | 1.085 | NaN |
4 | 2020-11-01 | 260.229 | 0.0500 | 0.8000 | 0.0400 | 0.7900 | 0.07 | 0.82 | 0.0500 | 0.8000 | ... | 0.922 | 0.2220 | 0.9720 | 0.270 | 1.020 | 0.2940 | 1.0440 | 0.350 | 1.100 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1294 | 1913-05-01 | 9.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1295 | 1913-04-01 | 9.8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1296 | 1913-03-01 | 9.8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1297 | 1913-02-01 | 9.8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1298 | 1913-01-01 | 9.8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1299 rows × 25 columns
This table is finally clean enough for the next step in the data science pipeline.
I'll start by graphing some of the values over time, particularly the average interest rates of the three most common bank accounts.
df_clean.plot(x="date", y=["savings", "checking", "money market"], ylabel="Annual Percentage Yield (APY)", title="APY over time")
<AxesSubplot:title={'center':'APY over time'}, xlabel='date', ylabel='Annual Percentage Yield (APY)'>
Immediately we see a problem; this plot looks incomplete. This is where we must recognize the problem with the interest rates dataset, in that it only goes back to 2009.
We need to restrict our dates to 2009-present if we want to see patterns easily. The interest rate data begins in spring 2009, so we'll start from May 2009.
df_2009_clean = df_clean.loc[df_clean["date"] >= pd.to_datetime("5/1/2009")]
df_2009_clean.plot(x="date", y=["savings", "checking", "money market"], ylabel="Annual Percentage Yield (APY)", title="APY over time")
df_2009_clean.plot(x="date", y=["1 month CD", "3 month CD", "6 month CD"], ylabel="Annual Percentage Yield (APY)", title="APY over time")
df_2009_clean.plot(x="date", y=["36 month CD", "48 month CD", "60 month CD"], ylabel="Annual Percentage Yield (APY)", title="APY over time")
<AxesSubplot:title={'center':'APY over time'}, xlabel='date', ylabel='Annual Percentage Yield (APY)'>
I was initially puzzled by why the data doesn't go back further, and I asked around for a reasonable explanation. I have a strong suspicion that it is related to the 2008 housing and loan crisis, which is why these records have been kept since then. I was also told that interest rates "never recovered", which is what we see in the data. This is a good example of data that is "missing at random".
It is also obvious that all of the rates rise and fall together. We will show this more clearly in a correlation matrix later on.
What is more interesting is the relationship between inflation rates and interest rates. It seems intuitive that as money becomes less valuable, interest rates would rise to compensate for lost value. This is something we want to find out.
# calculating CPI delta
df_clean["CPI"] = df_clean["CPI"].map(lambda x: float(x))
df_clean["CPI_change"] = df_clean["CPI"].diff(periods=-1)
df_2009_clean = df_clean.loc[df_clean["date"] >= pd.to_datetime("5/1/2009")]
df_2009_clean.plot(x="CPI", y=["money market"], ylabel="Money Market Rate", title="Money Market Rate vs Inflation", style=["x"])
df_2009_clean.plot(x="CPI_change", y=["money market"], style=["x"], xlabel="Change in CPI", ylabel="Monthly APY",
title="Monthly Interest Rate vs Change in CPI")
<AxesSubplot:title={'center':'Monthly Interest Rate vs Change in CPI'}, xlabel='Change in CPI', ylabel='Monthly APY'>
Interest rates don't seem to correlate with inflation rates month-by-month. Surprisingly, interest rates appear to have decreased overall even as buying power fell (CPI rose).
This is to say that money devalues somewhat independently from the means to passively gain value back. This is a major problem because it suggests that money is not necessarily safe in bank accounts. This is a hint for one of our questions.
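To put a rough number on that impression before moving on, here is a quick aside (a minimal sketch, assuming the df_2009_clean frame built above) that computes the Spearman correlation between the monthly change in CPI and the money market rate; a value near zero would back up what the scatter plot suggests.
# Quick aside, assuming df_2009_clean from above: Spearman correlation between
# monthly CPI change and the money market rate (a value near 0 means no monotonic link)
print(df_2009_clean["CPI_change"].astype(float)
      .corr(df_2009_clean["money market"].astype(float), method="spearman"))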
What about the value of a commodity like gold? Does it follow CPI?
df_clean.plot(x="date", y=["gold_price"], ylabel="Gold price", title="Gold price over time")
df_clean["CPI"] = df_clean["CPI"].map(lambda x: float(x))
df_clean.plot(x="CPI", y=["gold_price"], ylabel="Gold price", title="Gold price vs Inflation", style=["x"])
# calculating Gold delta
df_clean["gold_price"] = df_clean["gold_price"].map(lambda x: float(x))
df_clean["gold_change"] = df_clean["gold_price"].diff(periods=-1)
df_clean.plot(x="CPI_change", y=["gold_price"], style=["x"], xlabel="Change in CPI", ylabel="Change in Gold Price",
title="Monthly Change in Gold Price vs Change in CPI")
<AxesSubplot:title={'center':'Monthly Change in Gold Price vs Change in CPI'}, xlabel='Change in CPI', ylabel='Change in Gold Price'>
While there appears to be some positive relationship between gold and CPI, gold does not appear to follow CPI on a small (month-by-month) scale. It is interesting to see how the dollar price of gold is essentially constant until around 1970. This is because the US dollar was taken off the gold standard in the early 1970s, after which the values of gold and the dollar diverged.
It is important to remember that even though the plots look similar, the values of gold and APY are very different and should not be compared directly. APY is a percentage RATE of value appreciation, whereas here we are looking at PRICES of gold. We'll come back to comparing the dollar value of a bank account with gold in the next part.
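As a small aside, one way to put gold on the same footing as an APY would be to express its price series as a year-over-year percentage rate. This is a minimal sketch assuming the gold_df frame built above; the yoy_pct column name is just for illustration.
# Sketch: express gold PRICES as a year-over-year percentage RATE, assuming gold_df from above
gold_rate = gold_df.dropna().copy()
gold_rate["yoy_pct"] = gold_rate["gold_price"].astype(float).pct_change(periods=12) * 100
gold_rate.plot(x="date", y="yoy_pct", ylabel="Year-over-year change (%)",
               title="Gold price growth expressed as an annual rate")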
We should graph the CPI to see if there is anything interesting going on.
df_clean.plot(x="date", y="CPI", title="CPI over time (Higher is probably worse)")
<AxesSubplot:title={'center':'CPI over time (Higher is probably worse)'}, xlabel='date'>
We can see that somewhere around 1970, the shape of the CPI over time changes from one linear slope to another. Perhaps this hints at some kind of policy change? Regardless, it signals to us that we may be able to predict what CPI will be in the future using linear regression.
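As a quick preview of that idea (a minimal sketch, assuming the df_clean frame from the merge above; the real modeling comes later with different targets), fitting a straight line to CPI against elapsed months would look something like this:
# Sketch: a simple linear fit of CPI against time, assuming df_clean from above
from sklearn.linear_model import LinearRegression

cpi_df = df_clean.copy()
cpi_df["CPI"] = cpi_df["CPI"].astype(float)  # make sure CPI is numeric
# months elapsed since January 1913
cpi_df["months"] = (cpi_df["date"].dt.year - 1913) * 12 + (cpi_df["date"].dt.month - 1)
cpi_reg = LinearRegression().fit(cpi_df[["months"]], cpi_df["CPI"])
print("CPI change per month:", cpi_reg.coef_[0], "intercept:", cpi_reg.intercept_)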
Before I move on, I wanted to demonstrate an effective way of generating a chart of correlation values between the variables. This may confirm or deny what we have posited from the graphs: that CPI change is not correlated with interest rates, and that the interest rates for different account types are strongly correlated.
You can make a correlation matrix that looks like this to find these relationships.
Seaborn has a great variety of advanced visualization functions - https://seaborn.pydata.org/tutorial.html
Matplotlib is used by other libraries for generating visualizations; we use it here to adjust the size of the matrix (otherwise it would be tiny).
import seaborn as sns
import matplotlib.pyplot as plt
#Generating a correlation matrix
plt.subplots(figsize = (18,16))
g = sns.heatmap(df_clean.corr("spearman"), annot = True)
g.set_title("Correlation Matrix")
g.set_xticklabels(g.get_xticklabels(), rotation=30, horizontalalignment='right')
g.set_yticklabels(g.get_yticklabels(), rotation=30, horizontalalignment='right')
g
<AxesSubplot:title={'center':'Correlation Matrix'}>
This matrix confirms what we have seen already. Focus particularly on the "CPI_change" row, which has correlation values very close to 0. I suspect that the negative correlation between CPI and interest rates comes from the time period. The economy gets worse, interest rates dive, and inflation skyrockets - this is the unfortunate reality of our modern economy.
We have more than enough insight to move on to the next steps. It may seem like we didn't find anything helpful, but that would be a misconception. We were able to demonstrate a strong correlation between the interest rates of different account types and to rule out any immediate month-by-month connection between investment values and inflation.
This is where the rubber starts to meet the road. By transforming the data and performing analysis steps, we will be able to predict how much commodities will cost in the future, estimate the minimum interest rate needed to maintain buying power over time, find the shortest term we must keep our money in the bank to make sure it stays safe, and determine whether gold is a good way to avoid the repercussions of inflation.
We start by making some new functions for transforming the data. The buying power function takes a starting CPI and cost, then calculates (using an equation from Stack Exchange) the cost of the average product in the current year. While at first the relative value function appears to do the same thing, it calculates a fraction instead of a cost. Its anchor value is the "buying power" you have in a given year, and from that it determines the buying power in other years.
If you are having trouble understanding this, don't worry. I will interpret the results in a moment.
Use modified equation from https://money.stackexchange.com/questions/88234/how-to-calculate-worth-of-money-n-year-ago-from-given-inflation-value
old_price = new_price * (old_CPI/new_CPI)
new_price = old_price / (old_CPI/new_CPI)
# function for calculating relative value for a given CPI given a current value and CPI
# uses modified equation from https://money.stackexchange.com/questions/88234/how-to-calculate-worth-of-money-n-year-ago-from-given-inflation-value
# old_price = new_price * (old_CPI/new_CPI)
# new_price = old_price / (old_CPI/new_CPI)
def buying_power(anchor_CPI, anchor_value, current_CPI):
    new_price = anchor_value * (current_CPI/anchor_CPI)
    return new_price

def relative_value(anchor_CPI, anchor_value, current_CPI):
    new_price = anchor_value / (current_CPI/anchor_CPI)
    return new_price
test = inflation_data_clean.copy()
test["modern_cost_multiplier"] = test["CPI"].map(lambda x: (buying_power(264.877, 1, float(x))))
test["relative_value"] = test["CPI"].map(lambda x: (relative_value(9.8, 100, float(x))))
test.plot(x="date", y="modern_cost_multiplier", title="How Much Something That Costs $1 Today Would Cost at Other Times", ylabel="US Dollars")
test.plot(x="date", y="relative_value", title="Buying power of $100 after 1913", ylabel="Relative buying power")
# modern cost multiplier - I.E. Something that costs $1 today would have cost $0.03 in 1913
# relative_value - If I had $1 in 1913 and waited until 2021, It would be as if I lost $0.96 of my dollar
<AxesSubplot:title={'center':'Buying power of $100 after 1913'}, xlabel='date', ylabel='Relative buying power'>
Now we see that on average, a product that cost 3 cents in 1913 would cost one dollar in 2021.
We also see that the amount of buying power (higher is better) that 100 dollars had in 1913 has diminished to a measly 3. This is to say that if you kept 100 dollars in your pocket from 1913 until now, the amount (value) of actual goods it could buy would be reduced by about 97 percent. Amazing!
Using this new data, we can perform calculations to determine the buying power of money when left in accounts over time.
We can start by looking at a fixed interest rate of 0.50 - approximately the average interest rate of a 3 year CD today. We use the monthly compounding interest equation that is frequently taught in algebra classes to accomplish this. The relative_value_interest function applies the relative value calculation from the previous cell to money that is appreciating with interest.
APY to monthly rate: https://www.thebalance.com/calculate-monthly-interest-315421
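Here is a tiny worked example of that conversion. The cell below uses the simple divide-by-12 approximation from the link; the exact conversion would instead solve (1 + monthly)^12 = 1 + APY for the monthly rate. At rates this small, the two are nearly identical.
# Worked example: converting a 0.50% APY into a monthly rate
apy_example = 0.50 / 100                          # 0.50% expressed as a decimal
approx_monthly = apy_example / 12                 # the simple approximation used below
exact_monthly = (1 + apy_example) ** (1 / 12) - 1 # the exact conversion
print(approx_monthly, exact_monthly)              # nearly identical at rates this small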
import math
# APY to monthly rate
# https://www.thebalance.com/calculate-monthly-interest-315421
apy = 0.50
monthly = (apy/100)/12
def get_month_delta(start_date, end_date):
    return (end_date.year - start_date.year)*12 + (end_date.month - start_date.month)

# compounds monthly
def interest_calculator(principle, start_date, end_date, interest_rate):
    months = get_month_delta(start_date, end_date)
    return principle * math.pow((1 + interest_rate), months)

def relative_value_interest(start_CPI, start_date, end_CPI, end_date, principle, interest_rate):
    usd = interest_calculator(principle, start_date, end_date, interest_rate)
    return relative_value(start_CPI, usd, end_CPI)
test["relative_value_interest"] = test.apply(lambda x: (relative_value_interest(9.8, pd.to_datetime("1/1/1913"), float(x["CPI"]), x["date"], 100, monthly)), axis=1)
test["true_usd_amount"] = test.apply(lambda x: (interest_calculator(100, pd.to_datetime("1/1/1913"), x["date"], monthly)), axis=1)
test.plot(x="date", y=["relative_value", "relative_value_interest", "true_usd_amount"], title="Insights about $100 over 100 years", ylabel="Value")
<AxesSubplot:title={'center':'Insights about $100 over 100 years'}, xlabel='date', ylabel='Value'>
We can see that when factoring in inflation, the interest rate is negligible, even for today's 3-year CD rate.
While this gives us a unique perspective on what is happening over a long period, it is quite unrealistic. We can't assume that the interest rate stays the same every month. Luckily, we have the data to see what this looks like with actual interest rates instead. Unluckily, each calculation relies on the previous one, so we iterate through each individual row of the table, which is relatively slow.
This function is able to do it - it was a little tricky to write.
# calculate interest with rates that change every month
# takes a starting amount (principle), a start date, the names of the date and rate
# columns, and the data frame to read from; returns a data frame of dates and accumulated value
def dynamic_interest(principle, start_date, date_column, rate_column, df):
    interest_df = pd.DataFrame(columns=[date_column, "interest"])
    interest_df = interest_df.append(pd.Series(data=[start_date, principle], index=[date_column, "interest"], name=0))
    # going through each date since start in reverse order
    for date in reversed(df.loc[df[date_column] > start_date][date_column]):
        # getting the current and previous row
        current_row = df.loc[df[date_column] == date]
        prev_date = df[date_column].shift(periods=-1)[current_row.index[0]]
        prev_row = df.loc[df[date_column] == prev_date]
        prev_interest_row = interest_df.loc[interest_df[date_column] == prev_date]
        monthly_rate = current_row[rate_column].to_list()[0]/120 # converting APY into monthly rate
        prev_interest = prev_interest_row["interest"].to_list()[0]
        # actually calculating the interest and adding it
        new_value = interest_calculator(prev_interest, prev_date, date, monthly_rate)
        interest_df = interest_df.append(pd.Series(data=[date, new_value], index=[date_column, "interest"], name=0))
    return interest_df
Now we use the function for the different interest rates.
CPI_at_date = df_clean.loc[df_clean["date"] == pd.to_datetime("5/1/2009")]["CPI"].to_list()[0]
cols = ["savings", "checking", "money market","1 month CD","3 month CD","6 month CD","12 month CD",
"24 month CD", "36 month CD","48 month CD", "60 month CD"]
df_clean_2009 = (df_clean.loc[df_clean["date"] >= pd.to_datetime("5/1/2009")]).copy()
for col in cols:
    df_clean_2009 = df_clean_2009.merge(dynamic_interest(1, pd.to_datetime("5/1/2009"), "date", col, df_clean_2009), on="date", how="left")
    df_clean_2009[col + " interest"] = df_clean_2009["interest"]
    df_clean_2009 = df_clean_2009.drop(["interest"], axis=1)
df_clean_2009.plot(x="date", y=[x + " interest" for x in cols], title="USD Average Interest on $1 over Approximately 10 years", ylabel="USD")
<AxesSubplot:title={'center':'USD Average Interest on $1 over Approximately 10 years'}, xlabel='date', ylabel='USD'>
This does not look as impressive when adjusted for relative value as a result of monthly inflation.
for col in cols:
    df_clean_2009[col + " interest (adjusted)"] = df_clean_2009.apply(lambda x: relative_value(float(CPI_at_date), float(x[col + " interest"]), float(x["CPI"])), axis=1)
df_clean_2009.plot(x="date", y=[x + " interest (adjusted)" for x in cols][0:3], title="True Average Value on $1 over Approximately 10 years (Adjusted)", ylabel="Value")
df_clean_2009.plot(x="date", y=[x + " interest (adjusted)" for x in cols][3:6], title="True Average Value on $1 over Approximately 10 years (Adjusted)", ylabel="Value")
df_clean_2009.plot(x="date", y=[x + " interest (adjusted)" for x in cols][6:], title="True Average Value on $1 over Approximately 10 years (Adjusted)", ylabel="Value")
<AxesSubplot:title={'center':'True Average Value on $1 over Approximately 10 years (Adjusted)'}, xlabel='date', ylabel='Value'>
Remember what "Value" means in this context. It is relative to the starting value of 100 dollars. We are mostly interested in the slope of these graphs, not necessarily the bias. This is how we will generalize the findings to a model allowing us to predict future values.
Finally, we have to add gold into the mix for comparison. The calculation is simpler because we already have the true value.
def gold_investment_USD(start_amount, start_value, current_value):
    return (float(current_value)/float(start_value)) * float(start_amount)
# USD investment calculation from 1950
start_value = (df_clean.loc[df_clean["date"] == pd.to_datetime("1/1/1950")])["gold_price"].to_list()[0]
df_clean["gold interest"] = df_clean["gold_price"].map(lambda x: gold_investment_USD(1, start_value, x))
df_clean.plot(x="date", y="gold interest", title="Accumulated USD on $1 over Approximately 70 Years Invested in Gold", ylabel="USD")
# USD investment calculation from 2009
start_value = (df_clean_2009.loc[df_clean_2009["date"] == pd.to_datetime("5/1/2009")])["gold_price"].to_list()[0]
df_clean_2009["gold interest"] = df_clean_2009["gold_price"].map(lambda x: gold_investment_USD(1, start_value, x))
df_clean_2009.plot(x="date", y="gold interest", title="Accumulated USD on $1 over Approximately 10 Years Invested in Gold", ylabel="USD")
df_clean_2009["gold interest (adjusted)"] = df_clean_2009.apply(lambda x: relative_value(float(CPI_at_date), float(x["gold interest"]), float(x["CPI"])), axis=1)
df_clean_2009.plot(x="date", y=["gold interest (adjusted)", "24 month CD interest (adjusted)", "12 month CD interest (adjusted)", "36 month CD interest (adjusted)"], title="True Average Value on $1 over Approximately 10 Years Invested (Adjusted)", ylabel="Value")
<AxesSubplot:title={'center':'True Average Value on $1 over Approximately 10 Years Invested (Adjusted)'}, xlabel='date', ylabel='Value'>
Gold looks like a good way to preserve value over time, roughly on par (at minimum) with a one-year CD. It is best to take this with a grain of salt, since gold has been relatively unstable in recent times (the past 10 years).
The next step is to create models from the data that allow us to reliably estimate how much money you would lose or gain in the future. I'm going to use linear regression to find the slopes of the models.
Linear regression is perhaps the most straightforward machine learning technique: the computer generates a line of best fit, minimizing the total distance (residuals) to the true data. It gives us an estimated slope and bias (y-intercept). This may not be helpful if the data does not have a linear relationship, since linearity is assumed. We would expect the data to be more exponential than linear, but over a time period this short the difference between exponential and linear is negligible. If we wanted to better fit something exponential, we could use polyfit or interaction terms (see the short sketch below), but this won't be necessary.
We will use the linear model from SKLearn - SKLearn docs: https://scikit-learn.org/stable/user_guide.html
We are not necessarily concerned with the biases in this specific use case, as they are not really important to make future predictions. We keep them anyway so we can check the fits later on.
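As an aside, here is a rough sketch of the polyfit alternative mentioned above, applied to the inflation-adjusted gold series (the one that will turn out to be the least linear). It assumes the df_clean_2009 frame and the get_month_delta function from the earlier cells; it is an illustration, not part of the main analysis.
# Aside: degree-2 np.polyfit on the adjusted gold series, assuming df_clean_2009
# and get_month_delta from the earlier cells
poly_df = df_clean_2009.dropna(subset=["gold interest (adjusted)"]).copy()
poly_df["month_delta"] = poly_df["date"].map(lambda x: get_month_delta(pd.to_datetime("5/1/2009"), x) + 1)
poly_coeffs = np.polyfit(poly_df["month_delta"].astype(float), poly_df["gold interest (adjusted)"].astype(float), deg=2)
poly_df["gold poly estimate"] = np.polyval(poly_coeffs, poly_df["month_delta"].astype(float))
poly_df.plot(x="date", y=["gold interest (adjusted)", "gold poly estimate"], style=["x", "-"],
             title="Degree-2 polynomial sketch for adjusted gold value", ylabel="Value")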
from sklearn.linear_model import LinearRegression
df_train = df_clean_2009.copy()
df_train["month_delta"] = df_train["date"].map(lambda x: get_month_delta(pd.to_datetime("5/1/2009"), x) + 1)
x_vals = np.array(df_train["month_delta"].to_list()).reshape(-1, 1)
results = {}
# Linear regression for the accounts
for col in cols:
    reg = LinearRegression()
    reg = reg.fit(x_vals, np.array(df_train[col + " interest (adjusted)"].to_list()).reshape(-1, 1))
    results[col] = (reg.coef_[0][0], reg.intercept_[0])
# Linear regression for gold investment
df_train_gold = (df_train.loc[df_train["gold interest (adjusted)"] > 0]).copy()
reg = LinearRegression()
reg = reg.fit(np.array(df_train_gold["month_delta"].to_list()).reshape(-1, 1), np.array(df_train_gold["gold interest (adjusted)"].to_list()).reshape(-1, 1))
results["gold"] = (reg.coef_[0][0], reg.intercept_[0])
results
{'savings': (-0.0006615069968276128, 1.0083307653031142), 'checking': (-0.0008653214119036393, 1.0002032346371315), 'money market': (-0.00037518654810559667, 1.021729359548015), '1 month CD': (-0.000668030720890915, 1.0079613393379816), '3 month CD': (-0.0003526810067646925, 1.0265295883018988), '6 month CD': (0.0002539673516912097, 1.043067912318345), '12 month CD': (0.0013998010825920313, 1.0592085023133733), '24 month CD': (0.0034300683446121237, 1.0848252913561671), '36 month CD': (0.005687232564269877, 1.0929201304147524), '48 month CD': (0.008037294923394334, 1.0900064274733425), '60 month CD': (0.012253749849245179, 1.0514582667245298), 'gold': (-0.0013110124771311136, 1.3997662501745414)}
We have this map of slopes and biases.
Now we graph the models against the actual data to see how well they fit.
# run through and graph all of the models
for col in cols + ["gold"]:
df_train[col + " estimate"] = df_train["month_delta"].map(lambda x : results[col][0] * x + results[col][1])
df_train.plot(x="date", y=[col+" interest (adjusted)", col+" estimate"], style=["x"], title="Predicted Value vs Calculated value of " + col, ylabel="Value")
All but the last regression seem to fit well.
We saw already that gold did not look very linear in the 10-year span, so it is unsurprising that the linear regression is confused.
Let's try again using gold data from a larger range of values - this should help reduce the effect of the large dip in the center of the 10-year span.
# getting all rows where gold value is > 0 and after 1970
new_gold_train = ((df_clean.loc[df_clean["date"] >= pd.to_datetime("1/1/1970")]).loc[df_clean["gold_price"] > 0]).copy()
# repeating a lot of our calculations again with this new range
# start in the same place as before so everything matches up
starting_row = new_gold_train.loc[new_gold_train["date"] == pd.to_datetime("5/1/2009")]
start_CPI = starting_row["CPI"].to_list()[0]
start_gold_price = starting_row["gold_price"].to_list()[0]
new_gold_train["month_delta"] = new_gold_train["date"].map(lambda x: get_month_delta(pd.to_datetime("5/1/2009"), x) + 1)
new_gold_train["gold interest"] = new_gold_train["gold_price"].map(lambda x: gold_investment_USD(1, start_gold_price, x))
new_gold_train["gold interest (adjusted)"] = new_gold_train.apply(lambda x: relative_value(float(start_CPI), float(x["gold interest"]), float(x["CPI"])), axis=1)
reg = LinearRegression()
reg = reg.fit(np.array(new_gold_train["month_delta"].to_list()).reshape(-1, 1), np.array(new_gold_train["gold interest (adjusted)"].to_list()).reshape(-1, 1))
new_result = (reg.coef_[0][0], reg.intercept_[0])
print(new_result)
new_gold_train["gold estimate"] = new_gold_train["month_delta"].map(lambda x : new_result[0] * x + new_result[1])
new_gold_train.plot(x="date", y=["gold interest (adjusted)", "gold estimate"], style=["x"], title="New Predicted Value vs Calculated value of " + col, ylabel="Value")
(0.0010105559443416498, 0.9697902860047505)
<AxesSubplot:title={'center':'New Predicted Value vs Calculated value of gold'}, xlabel='date', ylabel='Value'>
This model looks slightly more reasonable.
We can test this using hypothesis testing.
Hypothesis testing determines whether an observed difference between two sets of values is statistically significant or could plausibly have occurred by chance. The value that quantifies this is the P-value: the probability of seeing a difference at least this extreme if there were no real difference. In most applications, the significance threshold is 0.05, so we will use that.
We have two hypotheses
Null Hypothesis - what we assume by default and reject only if the evidence supports the alternative: The 70-year model is NOT better at predicting gold value.
Alternative Hypothesis - what we are trying to support through the test: The 70-year model is better at predicting gold value.
Significance level: 0.05 - i.e. 5%
We also need values to test. We will use a method that looks similar to k-fold cross-validation, in which we split the data into a training and testing set, fit the values, then perform a t-test to get the P values. In this case k=8, but k can be any reasonable number that denotes the number of groups the data is split into. We also need to divide by variance because the variances of the smaller and larger sets are quite different.
This is somewhat different from real k-fold cross-validation, which is normally used on a single dataset to compare multiple modeling algorithms. In this circumstance, I am using it to compare different subsets of the dataset with the same modeling algorithm. Focus more on the concept of hypothesis testing than on the generation of the normalized residuals, because my process is unorthodox.
from sklearn.model_selection import train_test_split
# Splits the data into 8 random, independent subsets for 8-fold cross-validation
# Edited code from project 3 EC
def ksplit(X, y):
    x_split = []
    y_split = []
    X_temp = X
    y_temp = y
    remaining = 100
    while ((12.5/remaining) != 1):
        nd = remaining * (12.5/remaining)
        X_train, X_test, y_train, y_test = train_test_split(X_temp, y_temp, test_size= (10/remaining))
        x_split.append(X_test)
        y_split.append(y_test)
        X_temp = X_train
        y_temp = y_train
        remaining = (remaining-nd)
    x_split.append(X_train)
    y_split.append(y_train)
    return (x_split, y_split)

def validation(x_split, y_split):
    average_residuals = []
    for index, subset in enumerate(x_split):
        # creating the train and test sets
        temp_x = np.array(x_split.copy(), dtype=object)
        temp_y = np.array(y_split.copy(), dtype=object)
        # taking out the test group
        test_set_x = subset
        temp_x = np.delete(temp_x, index)
        test_set_y = temp_y[index]
        temp_y = np.delete(temp_y, index)
        # re-combining the other 7 groups
        train_set_x = np.array([])
        train_set_y = np.array([])
        for train_index, train_subset in enumerate(temp_x):
            if (not train_set_x.any()):
                train_set_x = train_subset
                train_set_y = temp_y[train_index]
            else:
                train_set_x = np.concatenate((train_set_x, train_subset))
                train_set_y = np.concatenate((train_set_y, temp_y[train_index]))
        # Testing
        reg = LinearRegression()
        reg.fit(train_set_x, train_set_y)
        residual_num = 0
        residual_sum = 0
        for idx, val in enumerate(test_set_x):
            residual_sum += abs(test_set_y[idx][0] - reg.predict([val])[0])
            residual_num += 1
        average_residuals.append((residual_sum/residual_num).tolist()[0])
    return average_residuals
# old gold model
gold_X = np.array(df_train_gold["month_delta"].to_list()).reshape(-1, 1)
gold_y = np.array(df_train_gold["gold interest (adjusted)"].to_list()).reshape(-1, 1)
gold_group = ksplit(gold_X, gold_y)
gold_residual = validation(gold_group[0], gold_group[1])
# new gold model
new_gold_X = np.array(new_gold_train["month_delta"].to_list()).reshape(-1, 1)
new_gold_y = np.array(new_gold_train["gold interest (adjusted)"].to_list()).reshape(-1, 1)
new_gold_group = ksplit(new_gold_X, new_gold_y)
new_gold_residual = validation(new_gold_group[0], new_gold_group[1])
# normalize by variance
gold_residual_variance = df_train_gold["gold interest (adjusted)"].var()
new_gold_residual_variance = new_gold_train["gold interest (adjusted)"].var()
gold_residual = [x/gold_residual_variance for x in gold_residual]
new_gold_residual = [x/new_gold_residual_variance for x in new_gold_residual]
Now we have the values, so we can use a t-test to test the hypothesis. SciPy provides a function that performs t-tests in one line.
import scipy
print("T-test for \"Is the new model better fit than the old one?\" - \n" + str(scipy.stats.ttest_ind(gold_residual, new_gold_residual)) + "\n")
T-test for "Is the new model better fit than the old one?" - Ttest_indResult(statistic=8.236903784486579, pvalue=9.734288291659962e-07)
The P-value is significantly smaller than 0.05, indicating that the two are significantly different. The associated statistic suggests that the old model has a higher adjusted residual than the new model (because it is positive). A higher residual indicates a worse fit.
For this example, we can reject our null hypothesis and accept our alternative hypothesis.
The ultimate goal of this project was to get the slopes of the linear regressions, which help us answer the questions we asked in the beginning. Perhaps the most important part of data science is communicating the insights we learned from the data.
Let's remind ourselves of the central questions of the project:
"Is money that sits in savings accounts constantly losing buying power? Furthermore, what is the minimum risk required to prevent yourself from losing money over time? Certificate of deposit? Gold?"
Let's look at the slopes we found, including the new slope for gold. This is where we drop the bias, because we want to generalize to any principal amount.
for key in results.keys():
if (key != "gold"):
rounded = round(results[key][0]* 100, 4)
print(key + ": " + str(rounded)+ "% per month " + str(round(rounded*12, 6)) + "% per year")
rounded = round(new_result[0]* 100, 4)
print("gold (new): " + str(rounded) + "% per month " + str(round(rounded * 12, 6)) + "% per year")
savings: -0.0662% per month -0.7944% per year
checking: -0.0865% per month -1.038% per year
money market: -0.0375% per month -0.45% per year
1 month CD: -0.0668% per month -0.8016% per year
3 month CD: -0.0353% per month -0.4236% per year
6 month CD: 0.0254% per month 0.3048% per year
12 month CD: 0.14% per month 1.68% per year
24 month CD: 0.343% per month 4.116% per year
36 month CD: 0.5687% per month 6.8244% per year
48 month CD: 0.8037% per month 9.6444% per year
60 month CD: 1.2254% per month 14.7048% per year
gold (new): 0.1011% per month 1.2132% per year
These numbers show the estimated percentage of the principal value you will earn back over 1 month and 1 year. NOTE: This will not reflect the USD AMOUNT of money in the account! This is a measure of buying power!
For example, in a 12 month CD, you will be able to buy on average 1.68% more potatoes than you could when you started. In that year span, the price of potatoes has changed, but so has the money in your account. This is a way to keep value currency agnostic.
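Here is a small sketch of how these slopes can be used directly, assuming the results dict from the regression cell above. Since the models are linear, the projected relative buying power after n months is roughly 1 + slope * n.
# Sketch: projecting relative buying power from a fitted slope, assuming the
# results dict from the regression cell above (the models are linear in months)
def projected_buying_power(account, months):
    slope = results[account][0]
    return 1 + slope * months

print(projected_buying_power("12 month CD", 12))  # about 1.0168, i.e. roughly +1.68%
print(projected_buying_power("savings", 12))      # slightly below 1: losing buying power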
We can finally answer our questions. (With the stipulation that we are basing our projections on the last 10 years of data and using the perfectly average interest rates; data science always comes with assumptions)
"Is money that sits in savings accounts constantly losing buying power?" -
Yes! Money is losing buying power over time regardless of interest (assuming average interest rate). This is true of average checking, savings, and money market accounts. It is possible to find better-than-average rates for these accounts, so it is possible to outperform the model if we are lucky.
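As a rough cross-check (a sketch assuming the df_clean_2009 frame from above), we can estimate the break-even APY an account would have needed just to keep pace with average CPI growth since May 2009; any account whose average rate sits below that number is losing buying power on average.
# Rough sketch, assuming df_clean_2009 from above: annualized CPI growth since
# May 2009, i.e. approximately the APY needed just to break even on buying power
cpi_series = df_clean_2009.sort_values("date")["CPI"].astype(float)
months_elapsed = len(cpi_series) - 1                  # rows are monthly
total_growth = cpi_series.iloc[-1] / cpi_series.iloc[0]
breakeven_apy = (total_growth ** (12 / months_elapsed) - 1) * 100
print("Approximate break-even APY: " + str(round(breakeven_apy, 2)) + "%")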
"What is the minimum risk required to prevent your account from losing money over time? Certificate of deposit? Gold?" -
The minimum risk appears to be an average 6 month CD: it earns next to nothing, but it is able to preserve value, at the cost of giving up our money to a bank for 6 months. Any longer CD has the potential to earn increasingly more over the same time period, but locks the money in for longer.
Gold is a much riskier investment because it strays far above and below the regression line over time. There is a chance it outperforms a 5 year CD within a year, but there is also a chance that it underperforms a savings account. It is entirely dependent on when we buy and sell. Over a grand 70-year timescale, it is approximately equivalent to a 1 year CD. Its benefit is that we can trade it at any time and might get lucky by buying at a good time.
In conclusion, we must decide beforehand how quickly we need to use our money. If the money will be unused for decades, it is best to get a 60 month CD, but if we need absolute flexibility, gold may be a better (albeit more risky) option.
I hope that this walkthrough/project has taught you some of the ways data scientists leverage the data science pipeline to generate insights. I would appreciate your feedback and comments. I can be contacted at:
Jack T. Wilcom
Email: jtwilcom@wilcomweb.net
Twitter: @FlyingPeng11