The total Gross Domestic Product (GDP) of the United States is derived by summing four components: Consumption Expenditure, Private Investment, Government Spending, and Net Exports. Consumption Expenditure makes up about 70% of GDP, Private Investment about 20%, Government Spending about 9%, and Net Exports less than 1%.
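In equation form, this is the standard expenditure identity:

$$GDP = C + I + G + NX$$

where C is Consumption Expenditure, I is Private Investment, G is Government Spending, and NX is Net Exports.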
Now, Private Investment is only about 20% of GDP, but it is the most volatile of the four components: it changes significantly and rapidly during economic booms and downturns. Private Investment includes the capital spent by businesses to purchase assets like machinery, equipment, and land. More details about the terminology can be found here.
Traditionally, when politicians and economists argue about Private Investment, the debate is for or against raising corporate taxes or giving tax breaks to big businesses. Those who favor lower taxes on businesses argue that firms invest more within the U.S. when their after-tax corporate income is higher, and that this also leads to higher employment within the country and other benefits. This theory is called Trickle-Down Economics. You can read more about it here.
However, many recent publications, like this one, argue that the trickle-down effect does not work in reality. So, in this tutorial we build a model using data queried from Wharton Research Data Services (WRDS), and then test the model against real aggregate variables for the economy using data from FRED. The model takes in multiple independent variables, including businesses' income after taxes, and predicts capital expenditure for a given period. If we have a reasonable model, we will have shown that taxes are not the only variable to be considered, and hence moved in a direction toward disproving 'trickle-down economics'.
Data Collection - source: WRDS
Data Wrangling
Exploratory Data Analysis
Building Linear Model
Data Collection (for testing) - source: FRED
Data Organizing & Visualizing
Testing
Policy and Insight
#Libraries needed for the data collection and visualization process.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
This data from WRDS ranges across the years 1950 to 2020 and is annual, with data on roughly 11,000 companies for each year. It contains items that would appear on the firms' balance sheets; the variables we query are described below. Also, note that these are all public companies: private businesses, small businesses, and 'mom & pop shops' are not included in this data.
The query I used to pull the needed variables from WRDS into a CSV file can be found here.
#Reading the CSV file into a pandas DataFrame
firms = pd.read_csv('1950-2020.csv')
firms
Brief Description of the column titles that are relevant for this tutorial:
fyear: denotes the fiscal year
tic: denotes the ticker symbol under which the company is listed on its stock exchange
assets: denotes total assets of the company in that fiscal year
capx: denotes the capital expenditure by the company in that fiscal year
depr: denotes the dollar value of the depreciation of the company's assets in that year
income: denotes the income after accounting for costs and taxes
liabl: denotes total liabilities of the company in that year
First, we divide all the values in the columns 'assets', 'capx', 'cost', 'depr', 'income', and 'liabl' by 1000. This converts those dollar amounts from millions of dollars (as reported by WRDS) to billions of dollars, because the data we test our model on later is in billions of dollars.
#Convert each dollar-denominated column from millions to billions
for col in ['assets', 'capx', 'cost', 'depr', 'income', 'liabl']:
    firms[col] = firms[col].div(1000)
'net_savings' is one of our independent variables, and the formula to calculate it from a company's balance sheet is total assets minus total liabilities.
firms['net_savings'] = firms['assets'] - firms['liabl']
Note: going forward, 'capx' is the dependent variable and the independent variables are 'depr', 'income', and 'net_savings'.
firms
Now, we group the data by year and sum across all companies, so that every row represents one fiscal year with the total capital expenditure, depreciation, etc. for that year.
yearly = firms.groupby('fyear', as_index = False).sum()
yearly
Now we plot capx against each of the independent variables to see what the general relationship is.
#This plot is just fiscal year vs capx. Fiscal year is not an independent variable here; this is just to show the trend.
sns.scatterplot(x = 'fyear', y = 'capx', data = yearly)
sns.scatterplot(x = 'depr', y = 'capx', data = yearly)
plt.grid(True)
sns.scatterplot(x = 'income', y = 'capx', data = yearly)
plt.grid(True)
sns.scatterplot(x = 'net_savings', y = 'capx', data = yearly)
plt.grid(True)
An increase in capx with an increase in income (after adjusting for taxes) is the relationship we were expecting based on intuition, but we see a similar relationship between capx and the other two variables as well.
Now, we begin to build a model that will predict total capital expenditure (capx) for a year given income, depreciation (depr), and net_savings for that year. We use the full dataset to train our model, i.e. we do not split the data into train and test portions. Instead of using the traditional cross-validation approach to see how good our model is, we will use a domain-specific test. It will become clear when we reach the testing phase!
# Libraries needed for data analysis and building the model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from statsmodels.formula.api import ols
We use Ordinary Least Squares (OLS) to fit a linear regression to the data from the 'yearly' table.
dependent variable: capx
regressors: net_savings, depr, income
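In equation form, the regression we are fitting is:

$$\text{capx}_t = \beta_0 + \beta_1\,\text{depr}_t + \beta_2\,\text{income}_t + \beta_3\,\text{net\_savings}_t + \varepsilon_t$$

where t indexes fiscal years and the last term is the error term.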
Below is the full summary of our results from the regression analysis using the statsmodels library
reg = ols('capx ~ depr + income + net_savings', data = yearly)
model = reg.fit()
model.summary()
From the stats above, we see that the R^2 value is 0.979, which means the model is a good fit. Since there were only 71 samples it might have overfitted, but we will find out in the testing section.
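These statistics can also be read directly off the fitted statsmodels results object; the adjusted R^2 penalizes additional regressors and is a quick sanity check against overfitting:

#R^2 and adjusted R^2 from the fitted statsmodels results object
print(model.rsquared)
print(model.rsquared_adj)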
Below we apply the sklearn library's linear regression. The advantage of this is that the fitted linear model is stored in an object, which we will need to predict values in the testing section.
X = yearly[['income', 'depr', 'net_savings']]
y = yearly['capx']
#Fit the linear model to the X, y above.
lin_model = LinearRegression().fit(X,y)
print(lin_model.score(X,y))
print("intercept is " + str(lin_model.intercept_))
print("Coefficient for income is " + str(lin_model.coef_[0]))
print("Coefficient for depr is " + str(lin_model.coef_[1]))
print("Coefficient for net_savings is " + str(lin_model.coef_[2]))
Below we try to fit polynomial models instead of Linear models
#Polynomial model of degree 2
poly = PolynomialFeatures(degree=2)
poly_X = poly.fit_transform(X)
poly_model = LinearRegression().fit(poly_X, y)
poly_model.score(poly_X,y)
#Polynomial model of degree 3
poly = PolynomialFeatures(degree=3)
poly_X = poly.fit_transform(X)
poly_model = LinearRegression().fit(poly_X, y)
poly_model.score(poly_X,y)
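To illustrate what PolynomialFeatures does under the hood (a minimal sketch with made-up numbers): for 3 input features at degree 2, it produces a bias term, the 3 original features, and all squared and pairwise product terms, 10 columns in total.

#Expanding a single made-up row [2, 3, 5] with degree-2 polynomial features.
#Columns are: 1, x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2
demo = PolynomialFeatures(degree=2)
print(demo.fit_transform(np.array([[2.0, 3.0, 5.0]])))
#[[ 1.  2.  3.  5.  4.  6. 10.  9. 15. 25.]]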
We see that the R^2 value increases as we increase the degree of the polynomial. However, we stick with the linear model for two reasons: the gain in R^2 is marginal, and with only 71 samples a higher-degree polynomial is more likely to overfit.
Now, for testing, we use actual aggregate data about the whole U.S. economy from the Federal Reserve Bank of St. Louis (FRED) website, hyperlinked in the introduction above. This is a government website with data on macroeconomic variables like total consumer expenditure, change in GDP, etc. over the years. So, we grab some more data from FRED.
Here are the links for data that we are interested in:
https://fred.stlouisfed.org/series/GPDIC1
https://fred.stlouisfed.org/series/CP
https://fred.stlouisfed.org/series/TNWMVBSNNCB
https://fred.stlouisfed.org/series/M1PTOTL1ES000
All of these are quarterly data from 1947 to 2020, except depreciation (the last link above), which is annual.
# While downloading, each variable had its own page, so we have multiple CSVs.
#We read each CSV into a dataframe and then combine the data into one dataframe.
one = pd.read_csv('FRED_gross_investment.csv')
two = pd.read_csv('FRED_gross_private_income.csv')
three = pd.read_csv('FRED_total_depreciation.csv')
four = pd.read_csv('FRED_savings.csv')
#Data on total depreciation in the US is annual, so we pad each year's value
#across its four quarters to match the quarterly data of the other variables.
x = [item for item in three['depreciation'] for i in range(4)]
#The last three quarters are missing from the depreciation column, so we pad
#them with the depreciation value (in billions) from the last available quarter.
for i in range(3):
    x.append(2960)
macro = pd.DataFrame()
macro['date'] = one['DATE']
macro['investment'] = one['gross_investment']
macro['private_income'] = two['corporate_profits']
macro['depreciation'] = x
macro['savings_level'] = four['gross_net_savings']
macro
'investment' represents capx at the economy level here; it is literally Gross Domestic Investment for each quarter.
'private_income' is analogous to income in the firms dataframe, at the national level.
'depreciation' is analogous to depreciation in the firms dataframe, at the national level.
'savings_level' is analogous to net_savings in the firms dataframe, at the national level.
Also, we should keep in mind that we took data from only public firms, while Gross Investment in the U.S. comprises both private and public businesses. So, in general, the capx in our 'firms' dataframe should be less than the actual value of Gross Investment in the U.S.
#CHARTS
#Just like we did for the firms data, we plot these analogous variables against investment.
#We eyeball the charts to check that the types of correlation are within the scope of
#what our model can predict.
sns.scatterplot(x = 'private_income', y = 'investment', data = macro)
plt.grid(True)
sns.scatterplot(x = 'depreciation', y = 'investment', data = macro)
plt.grid(True)
sns.scatterplot(x = 'savings_level', y = 'investment', data = macro)
plt.grid(True)
X_test = macro[['private_income', 'depreciation', 'savings_level']]
y_test = macro['investment']
#We predict macroeconomic variable: investment. (actual values are stored in y_test)
y_pred = lin_model.predict(X_test)
#Finding the average absolute deviation to see how much a predicted value
#deviates from the actual value in a quarter on average.
total_dev = 0
for i in range(0, 295):
    total_dev += abs(y_test[i] - y_pred[i])
total_dev/295
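Equivalently, the same average absolute deviation can be computed in one vectorized line:

#Vectorized version of the loop above
print(np.abs(y_test - y_pred).mean())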
So we see that, on average, our predicted value deviates by approximately $366 billion from the actual Gross Investment in the U.S. This is a huge dollar amount, but minor compared to the thousands of billions of dollars invested as part of domestic investment in the U.S.
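To put that in relative terms (a quick supplementary calculation, not in the original write-up), we can express the deviation as a fraction of actual investment in each quarter:

#Average absolute deviation as a fraction of actual quarterly investment
print((np.abs(y_test - y_pred) / y_test).mean())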
plt.figure(figsize=(15,10))
plt.scatter(macro['date'], y_test)
plt.plot(macro['date'], y_pred, 'r')
The red line in the chart above connects the values predicted by our model for the corresponding fiscal quarters. We can see that the general trend predicted by our model matches the data on Gross Private Investment from FRED, though the predicted values deviate somewhat from the actual values depicted by the blue dots. Since the data ranges across 295 fiscal quarters, the x-axis is a bit crowded in the chart above. Note that the deviation decreases significantly after Q3 of 2011, so we plot that range below to get a better look:
#Plotting same chart as above from 3rd Quarter of 2011 to most recent
new = macro.iloc[-38:]
plt.figure(figsize=(15,10))
plt.scatter(new['date'], y_test[-38:])
plt.plot(new['date'], y_pred[-38:], 'r')
FUN FACT: Note the second-to-last dot, which looks like an outlier. That is the third quarter of 2020, and it is intuitive why gross domestic investment is so low then: businesses were shut down for the most part.
Our model can be considered 'successful' by the standard that it predicts the same general trend of Gross Domestic Investment from the three variables chosen. So, if some exogenous variable changes and we know its effect on the depreciation, income, or net savings of corporate America, then we have a general idea of which direction GDI will move. However, we have not proven any causal relationships. With my limited domain expertise coming from an Economics major, my judgement is restricted, so we leave causal explanations to the experts!
Also, since our model takes as input two variables other than income after taxes, we have shown that after-tax income is not the only variable that drives investment, which suggests that Trickle-Down Economics is most likely not a practical theory!
Below are references to some literature in this field of Economics and Statistics:
Voss, Graham M., 2002. "Public and private investment in the United States and Canada," Economic Modelling, Elsevier, vol. 19(4), pages 641-664, August.
Domar, E. (1953). Depreciation, Replacement and Growth. The Economic Journal, 63(249), 1-32. doi:10.2307/2226748
TED Talk: Hanauer, N. The dirty secret of capitalism -- and a new way forward. Retrieved 13 December 2020, from https://www.ted.com/talks/nick_hanauer_the_dirty_secret_of_capitalism_and_a_new_way_forward?language=en