Christyan Jean-Charles

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as pdr
import seaborn as sns

1. Summary:

In this repo, I sifted through dictionaries of positive and negative sentiment words and created three of my own variables to measure the sentiment of S&P 500 firms’ 10-Ks. Using these sentiment variables, I examined the correlation between sentiment and stock returns. I conducted a cross-sectional study to find out whether positive or negative sentiment in 10-K filings has an effect on returns. With a wide variety of industries in the sample, it was interesting to see to what extent the sentiment of each company’s 10-K correlated with the cumulative returns of its stock. I wanted to see how much the sentiment of 10-K filings really matters and whether it has a real impact on returns.

Throughout this repo, I found myself re-evaluating the 10-Ks based on their sentiment. I attempted to create a comprehensive word list that would give me a good basis for determining the sentiment score of each firm’s 10-K. I found that while the LM dictionary showed no real correlation with returns, the ML dictionary showed a slight positive trend between sentiment and returns. With my hand-made sentiment measures I found some interesting trends that I share in the latter half of this file.

2. Data Section:

The sample data consists of S&P 500 firms and their returns. The return measures were designed around two windows: the cumulative return from the filing date through two days after it (t to t+2, inclusive), and the cumulative return from three days to ten days after the filing date (t+3 to t+10).

Sample Data:

sample_df = pd.read_csv('output/analysis_sample.csv')
sample_df
Symbol Security GICS Sector GICS Sub-Industry Headquarters Location Date added CIK Founded LM_pos_score LM_neg_score ... mb prof_a ppe_a cash_a xrd_a dltt_a invopps_FG09 sales_g dv_a short_debt
0 MMM 3M Industrials Industrial Conglomerates Saint Paul, Minnesota 1957-03-04 66740 1902 0.003977 0.023249 ... 2.838265 0.197931 0.218538 0.101228 0.042361 0.355625 2.564301 0.098527 0.072655 0.086095
1 AOS A. O. Smith Industrials Building Products Milwaukee, Wisconsin 2017-07-26 91142 1916 0.003756 0.012984 ... 4.368153 0.197847 0.183974 0.181729 0.027113 0.061075 NaN 0.222291 0.048958 0.080191
2 ABT Abbott Health Care Health Care Equipment North Chicago, Illinois 1957-03-04 1800 1888 0.003726 0.012793 ... 3.825614 0.166285 0.134475 0.136297 0.036465 0.242726 3.559664 0.244654 0.042582 0.051893
3 ABBV AbbVie Health Care Pharmaceuticals North Chicago, Illinois 2012-12-31 1551152 2013 (1888) 0.006481 0.015448 ... 2.528878 0.194433 0.040074 0.067086 0.054911 0.442929 2.144449 0.227438 0.063203 0.163364
4 ACN Accenture Information Technology IT Consulting & Other Services Dublin, Ireland 2011-07-06 1467373 1989 0.008642 0.016861 ... 5.474851 0.195625 0.111674 0.189283 0.025902 0.063702 5.023477 0.140013 0.051790 0.215661
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
347 XYL Xylem Inc. Industrials Industrial Machinery White Plains, New York 2011-11-01 1524472 2011 0.007049 0.017495 ... 3.225061 0.103432 0.114548 0.163001 0.024650 0.324190 2.909645 0.065422 0.024529 0.025073
348 YUM Yum! Brands Consumer Discretionary Restaurants Louisville, Kentucky 1997-10-06 1041061 1997 0.006078 0.016549 ... 9.129993 0.395240 0.337915 0.123366 0.000000 1.019505 8.944086 0.164897 0.099229 0.012864
349 ZBRA Zebra Technologies Information Technology Electronic Equipment & Instruments Lincolnshire, Illinois 2019-12-23 877212 1969 0.006258 0.014964 ... 5.635335 0.192759 0.064843 0.055350 0.091231 0.167820 5.301699 0.265063 0.000000 0.089083
350 ZBH Zimmer Biomet Health Care Health Care Equipment Warsaw, Indiana 2001-08-07 1136869 1927 0.004591 0.021783 ... 1.592191 0.092759 0.097530 0.020400 0.021892 0.242318 1.415104 0.115553 0.008531 0.227553
351 ZTS Zoetis Health Care Pharmaceuticals Parsippany, New Jersey 2013-06-21 1555280 1952 0.005036 0.019980 ... 8.969729 0.236475 0.187266 0.250719 0.036547 0.485108 8.792744 0.164349 0.034101 0.006044

352 rows × 89 columns

Although my build_sample file ultimately uses only each firm’s return around the filing day, I first attempted to build the two return variables by merging the sp500 data frame with the CRSP data frame that contained the stock returns. After merging, I:

  1. Converted filing_date to a date time
  2. Created a new column for the time difference in days between filing_date and date
  3. Filtered the data frame to include a variable for only data for t, t+1, and t+2
  4. Filtered the data frame to include a variable for only data for t+3 to t+10 as well
  5. Calculated the cumulative returns
  6. Attempted to merge them back into the data frame

# Convert filing_date and date to datetime so they can be subtracted
inner_merged['filing_date'] = pd.to_datetime(inner_merged['filing_date'])
inner_merged['date'] = pd.to_datetime(inner_merged['date'])

# Create a new column for the time difference in days between filing_date and date
# (calendar days, so weekends and holidays count toward the window)
inner_merged['days_diff'] = (inner_merged['date'] - inner_merged['filing_date']).dt.days

# Keep only the rows for t, t+1, and t+2
Ret02 = inner_merged.loc[inner_merged['days_diff'].between(0, 2)]

# Keep only the rows for t+3 to t+10
Ret310 = inner_merged.loc[inner_merged['days_diff'].between(3, 10)]

This is where things got tricky for me as I could not figure out the best way to go about merging my two return variables back into the larger data frame.
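
One possible way to finish steps 5 and 6 is sketched below. It is an illustration under assumptions, not the code used in build_sample: the toy `inner_merged` frame, the `Symbol` key, the `ret` column name, and the helper `window_return` are all hypothetical, and the windows are measured in calendar days as in the filters above. Each window is compounded per firm with a groupby, which leaves one row per firm that can be left-merged onto a firm-level frame.

```python
import pandas as pd

# Toy stand-in for inner_merged (hypothetical values): one row per firm-day.
inner_merged = pd.DataFrame({
    "Symbol": ["AAA"] * 4 + ["BBB"] * 4,
    "filing_date": pd.to_datetime(["2022-02-01"] * 4 + ["2022-03-01"] * 4),
    "date": pd.to_datetime(["2022-02-01", "2022-02-02", "2022-02-03", "2022-02-04",
                            "2022-03-01", "2022-03-02", "2022-03-03", "2022-03-04"]),
    "ret": [0.01, 0.02, -0.01, 0.03, -0.02, 0.00, 0.01, 0.02],
})
inner_merged["days_diff"] = (inner_merged["date"] - inner_merged["filing_date"]).dt.days


def window_return(df, start, end, name):
    """Compound daily returns per firm over days_diff in [start, end]."""
    window = df.loc[df["days_diff"].between(start, end)]
    cumulative = (1 + window["ret"]).groupby(window["Symbol"]).prod() - 1
    return cumulative.rename(name).reset_index()


ret_0_2 = window_return(inner_merged, 0, 2, "ret_0_2")
ret_3_10 = window_return(inner_merged, 3, 10, "ret_3_10")

# Each return frame has one row per firm, so a left merge cannot duplicate rows;
# validate raises an error if that assumption is ever violated.
firms = inner_merged[["Symbol", "filing_date"]].drop_duplicates()
firms = (firms.merge(ret_0_2, on="Symbol", how="left", validate="one_to_one")
              .merge(ret_3_10, on="Symbol", how="left", validate="one_to_one"))
print(firms)
```

Collapsing to one row per firm before merging is the key design choice: merging the filtered firm-day rows directly would multiply rows in the firm-level frame.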

The sentiment variables were built by turning the LM and ML dictionary files into lists of positive and negative words that could be joined into regex patterns.

My Code to Build the Sentiment Variables:

# ML (BHR) Positive and Negative:

BHR_negative = pd.read_csv('inputs/ML_negative_unigram.txt',
            names=['word'])['word'].to_list()   

with open('inputs/ML_positive_unigram.txt', 'r') as file:
    BHR_positive = [line.strip() for line in file]
    
BHR_negative_regex = ['('+'|'.join(BHR_negative)+')']
BHR_positive_regex = ['('+'|'.join(BHR_positive)+')']

# LM Positive and Negative:

LM = pd.read_csv('inputs/LM_MasterDictionary_1993-2021.csv')
LM_negative = LM.query('Negative > 0')['Word'].to_list()
LM_positive = LM.query('Positive > 0')['Word'].to_list()

LM_neg_regex = ['('+'|'.join(LM_negative)+')']
LM_pos_regex = ['('+'|'.join(LM_positive)+')']
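
The lists above only define the patterns; a score still has to be computed from each filing. The sketch below shows one plausible way to do that, as the share of words in a document that match a dictionary. The helper `sentiment_score`, the mini word list, and the excerpt are all hypothetical, and the word boundaries (`\b`) are an assumption added here so that a short entry such as "loss" does not match inside a longer word such as "glossary".

```python
import re


def sentiment_score(text, words):
    """Share of tokens in `text` that match any dictionary word (case-insensitive)."""
    # \b keeps each entry from matching inside a longer word.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b",
                         re.IGNORECASE)
    tokens = re.findall(r"\b\w+\b", text)
    if not tokens:
        return 0.0
    return len(pattern.findall(text)) / len(tokens)


# Hypothetical mini dictionary and filing excerpt.
negative_words = ["loss", "decline", "impairment"]
excerpt = "The glossary notes a decline in revenue and an impairment loss."

print(sentiment_score(excerpt, negative_words))
```

Dividing by the document length keeps long filings from scoring higher simply because they contain more words.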

I chose to base my contextual sentiment measures on supply chain, R&D, and financial performance because I felt these three components are essential to the success of many companies in the S&P 500. I wanted to see whether positive contextual sentiment about these components would also help explain the relationship between sentiment and returns. If these three components are as important as they are made out to be, then positive sentiment about them should be positively correlated with stock returns.
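
To make the idea of a contextual sentiment concrete, here is a minimal sketch that counts sentiment words appearing within a few words of a topic term. It is not the method used to build the scores in this repo; the function `contextual_hits`, the window size, and the word lists are assumptions chosen purely for illustration.

```python
import re


def contextual_hits(text, topic_words, sentiment_words, window=5):
    """Count sentiment words found within `window` tokens of any topic word."""
    tokens = re.findall(r"\w+", text.lower())
    topic_positions = [i for i, tok in enumerate(tokens) if tok in topic_words]
    hits = 0
    for i, tok in enumerate(tokens):
        near_topic = any(abs(i - p) <= window for p in topic_positions)
        if tok in sentiment_words and near_topic:
            hits += 1
    return hits, len(tokens)


# Hypothetical supply chain topic terms and positive words.
topic = {"supplier", "suppliers", "logistics"}
positive = {"improved", "strong", "efficient"}
text = ("Our suppliers delivered strong results and logistics became more efficient. "
        "Separately, the legal team reported that litigation costs improved.")

hits, n_tokens = contextual_hits(text, topic, positive, window=5)
print(hits / n_tokens)
```

In this toy text, "improved" is not counted because it sits far from any supply chain term, which is the distinction a contextual measure is meant to capture.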

test_df = pd.read_csv('output/analysis_sample.csv')
final_sample = test_df[['Symbol', 'GICS Sector', 'ret', 'LM_pos_score', 'LM_neg_score', 'BHR_pos_score', 'BHR_neg_score', 'Positive_SC_score', 'Negative_SC_score', 'Positive_RD_score', 'Negative_RD_score', 'Positive_FP_score', 'Negative_FP_score', 'cash_a', 'prof_a','dv_a','capx_a','xrd_a']].reset_index(drop=True)

final_sample.to_csv('output/final_analysis.csv', index=False)
final_analysis = pd.read_csv('output/final_analysis.csv')
final_analysis.describe()
ret LM_pos_score LM_neg_score BHR_pos_score BHR_neg_score Positive_SC_score Negative_SC_score Positive_RD_score Negative_RD_score Positive_FP_score Negative_FP_score cash_a prof_a dv_a capx_a xrd_a
count 352.000000 352.000000 352.000000 352.000000 352.000000 352.000000 352.000000 352.000000 352.000000 352.000000 352.000000 352.000000 352.000000 352.000000 352.000000 352.000000
mean 0.001998 0.005190 0.016126 0.024245 0.026095 0.004186 0.002663 0.004465 0.002981 0.011976 0.004936 0.131501 0.156031 0.023759 0.031429 0.028012
std 0.036578 0.001368 0.003386 0.003770 0.003386 0.001100 0.001040 0.001450 0.000757 0.002673 0.001235 0.121036 0.085127 0.026821 0.026099 0.043721
min -0.242779 0.000272 0.007327 0.003530 0.014692 0.001059 0.000000 0.001629 0.000543 0.004141 0.001184 0.003713 -0.099432 0.000000 0.001387 0.000000
25% -0.014722 0.004361 0.013670 0.022367 0.023993 0.003404 0.001936 0.003528 0.002447 0.010346 0.004136 0.043401 0.099332 0.000000 0.013221 0.000000
50% -0.000509 0.005113 0.016010 0.024452 0.026107 0.004112 0.002529 0.004304 0.002909 0.011922 0.004873 0.096322 0.142072 0.017442 0.023601 0.008529
75% 0.017986 0.005864 0.018104 0.026422 0.028141 0.004921 0.003250 0.005248 0.003447 0.013669 0.005603 0.171776 0.201154 0.035275 0.040391 0.040744
max 0.162141 0.010899 0.026658 0.037982 0.038030 0.007194 0.006580 0.012059 0.006560 0.021191 0.010717 0.607837 0.405925 0.164573 0.170436 0.295576

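The summary statistics for my final analysis sample show that, on average, the stocks in the sample have a very small positive return (about 0.2%), while the median return is slightly negative (about -0.05%). The standard deviation of about 3.7% is large relative to the mean, so returns vary widely around it. The percentiles show that 75% of the stocks in this sample had returns at or below 0.0180 (1.8%), which means only a quarter of the firms returned more than that.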

For the most part, I believe that my contextual sentiments pass the “basic smell tests.” Looking at the graphical evidence from my final analysis sample, positive sentiment showed up in the industries where I most expected it. For financial performance, the financial sector had the highest positive contextual sentiment score, with consumer staples second and the industrial sector third.

# set the figure size up front so it applies to this plot
fig, ax = plt.subplots(figsize=(12, 8))

# create the barplot (mean Positive_FP_score per sector, no error bars)
sns.barplot(data=final_analysis, x="GICS Sector", y="Positive_FP_score", errorbar=None, saturation=.5, edgecolor=".2", width=0.5, ax=ax)

# rotate the x-axis labels so the sector names stay readable
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
plt.show()

[Figure: bar plot of Positive_FP_score by GICS Sector]

The sectors with the highest positive R&D scores were Health Care, IT, Financials, and Real Estate, which makes sense, especially for Health Care. Because these firms are constantly under pressure to develop cures and vaccines, especially after the onset of COVID-19, it is not surprising that the health care sector tops the list. The same applies to the IT sector: as technology continues to evolve, it makes sense for R&D sentiment to be positive as firms invest in R&D to stay ahead of the curve.

# set the figure size up front so it applies to this plot
fig, ax = plt.subplots(figsize=(12, 8))

# create the barplot (mean Positive_RD_score per sector, no error bars)
sns.barplot(data=final_analysis, x="GICS Sector", y="Positive_RD_score", errorbar=None, saturation=.5, edgecolor=".2", width=0.5, ax=ax)

# rotate the x-axis labels so the sector names stay readable
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
plt.show()

[Figure: bar plot of Positive_RD_score by GICS Sector]

3. Results:

Correlation table between the return variable and the 10 sentiment measures:

sentiment_cols = ['LM_pos_score', 'LM_neg_score', 'BHR_pos_score', 'BHR_neg_score','Positive_SC_score', 'Negative_SC_score', 'Positive_RD_score','Negative_RD_score', 'Positive_FP_score', 'Negative_FP_score']

correlations = final_analysis[sentiment_cols + ['ret']].corr()['ret'].to_frame()
correlations.columns = ['correlation with ret']

correlations
correlation with ret
LM_pos_score -0.090945
LM_neg_score -0.003743
BHR_pos_score 0.059411
BHR_neg_score 0.046202
Positive_SC_score -0.023058
Negative_SC_score -0.035132
Positive_RD_score -0.065353
Negative_RD_score 0.022159
Positive_FP_score 0.084587
Negative_FP_score 0.016209
ret 1.000000
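
As a rough check on how much weight these correlations can bear, the sketch below converts each coefficient into a t-statistic using t = r * sqrt(n - 2) / sqrt(1 - r^2), with n = 352 taken from the describe() output above. This is a back-of-the-envelope test that assumes independent observations; it was not part of the original analysis, and the helper `t_stat` is introduced here only for illustration. With about 350 degrees of freedom, |t| needs to exceed roughly 1.97 for significance at the 5% level.

```python
import math

n = 352  # number of firms, from the describe() output

# Correlations with ret, copied from the table above.
correlations = {
    "LM_pos_score": -0.090945,
    "LM_neg_score": -0.003743,
    "BHR_pos_score": 0.059411,
    "BHR_neg_score": 0.046202,
    "Positive_SC_score": -0.023058,
    "Negative_SC_score": -0.035132,
    "Positive_RD_score": -0.065353,
    "Negative_RD_score": 0.022159,
    "Positive_FP_score": 0.084587,
    "Negative_FP_score": 0.016209,
}


def t_stat(r, n):
    """t-statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t_stats = {name: t_stat(r, n) for name, r in correlations.items()}

# Print from strongest to weakest relationship.
for name, t in sorted(t_stats.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>18}: t = {t:6.2f}")
```

None of the ten measures clears that threshold in this sample; the largest in magnitude is LM_pos_score, at about -1.71.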

Scatter Plot of Each Sentiment Measure Against the Return:

# Select columns of interest
sentiment_cols = ['LM_pos_score', 'LM_neg_score', 'BHR_pos_score', 'BHR_neg_score', 
                  'Positive_SC_score', 'Negative_SC_score', 'Positive_RD_score', 
                  'Negative_RD_score', 'Positive_FP_score', 'Negative_FP_score']
return_col = 'ret'
data = final_analysis[sentiment_cols + [return_col]]

# Create scatterplots
fig, axs = plt.subplots(nrows=5, ncols=2, figsize=(10, 20))
axs = axs.flatten()
for i, col in enumerate(sentiment_cols):
    axs[i].scatter(data[col], data[return_col], alpha=0.5)
    axs[i].set_xlabel(col)
    axs[i].set_ylabel(return_col)
    axs[i].set_title(f'{col} vs {return_col}')

plt.tight_layout()
plt.show()

[Figure: scatter plots of each sentiment measure against ret]

Topic 1:

The correlation table shows that all four dictionary-based measures have only a weak relationship with the return variable, with every correlation below 0.1 in absolute value. The LM positive score is, if anything, negatively related to returns (-0.09), while the ML (BHR) positive score shows a slight positive relationship (0.06).

The negative measures are weaker still. The LM negative score is essentially uncorrelated with returns (-0.004), and the ML (BHR) negative score has a small positive correlation (0.05) rather than a negative one.

Overall, there is no consistent pattern in which positive sentiment goes with higher returns and negative sentiment with lower returns. The two positive measures disagree in sign, and neither negative measure shows a meaningful negative relationship. The ML (BHR) measures are the only ones that lean positive, which matches the slight positive trend noted in the summary, but the relationship is small.

Topic 2:

I found that my results were in agreement with the Garcia, Hu, and Rohrer paper. One reason they may have used such a large data set is that a relatively small sample like mine can contain extreme values that heavily skew the data and lead to biased results. Their data encompassed many more firms and years, plus additional controls, which reduces the influence of outliers and strengthens the validity of the study. When conducting a study such as this cross-sectional event study of sentiment and stock returns, it is important to collect enough data to credibly support or reject your hypothesis.

Topic 3:

Unfortunately, my contextual sentiment measures don’t look different enough in their relationship with the return variable. A regression line shows no meaningful trend between any of the contextual sentiment variables and the return variable, so there is little to investigate further on that front.

1. Negative RD Sentiment:

sns.regplot(data=final_analysis, x='Negative_RD_score', y='ret')
plt.show()

[Figure: regression plot of Negative_RD_score against ret]

2. Positive RD Sentiment:

sns.regplot(data=final_analysis, x='Positive_RD_score', y='ret')
plt.show()

[Figure: regression plot of Positive_RD_score against ret]

3. Negative Supply Chain Sentiment:

sns.regplot(data=final_analysis, x='Negative_SC_score', y='ret')
plt.show()

[Figure: regression plot of Negative_SC_score against ret]

4. Positive Supply Chain Sentiment:

sns.regplot(data=final_analysis, x='Positive_SC_score', y='ret')
plt.show()

[Figure: regression plot of Positive_SC_score against ret]

5. Negative Financial Performance Sentiment:

sns.regplot(data=final_analysis, x='Negative_FP_score', y='ret')
plt.show()

[Figure: regression plot of Negative_FP_score against ret]

6. Positive Financial Performance Sentiment:

sns.regplot(data=final_analysis, x='Positive_FP_score', y='ret')
plt.show()

[Figure: regression plot of Positive_FP_score against ret]

As the graphs above show, any trend in the data is slight, and the correlations aren’t high enough to warrant further investigation of these topics. However, when comparing the sentiment scores with financial ratios, there are some noticeable trends.

sns.regplot(data=final_analysis, x='Positive_RD_score', y='cash_a')
plt.show()

[Figure: regression plot of Positive_RD_score against cash_a]

A good example is the graph above. 10-Ks rich in positive R&D sentiment tended to come from firms with a higher cash-to-assets ratio. With a noticeably positive regression line, it’s hard to ignore the association between positive R&D sentiment and a company’s cash position, although this plot alone does not show which way the relationship runs.

sns.regplot(data=final_analysis, x='Positive_SC_score', y='prof_a')
plt.show()

[Figure: regression plot of Positive_SC_score against prof_a]

One last example for this topic is the graph above. While there was no correlation between returns and positive supply chain sentiment, I noticed a positive trend between profit per asset and 10-Ks that displayed positive sentiment toward the supply chain. As supply chain costs and obstacles decrease and the business thrives in its supply chain function, it makes sense that profitability would be moderately positively correlated with positive supply chain sentiment in the 10-K.

Topic 4:

When comparing the three contextual sentiment measures with the return variable, there isn’t a real difference in sign or magnitude: all six correlations are below 0.1 in absolute value. I believe I made a comprehensive list to detect the contextual sentiment of the 10-K. However, looking at the graphs against returns, I suspect that companies soften the language of their 10-K when the underlying news is negative, which would explain why none of my negative contextual measures shows a clear negative relationship with returns. Among the positive contextual measures, only financial performance has a positive correlation with the return variable (0.08), and it is slight.