ArcticDB_demo_equity_analytics
Equity Analytics with ArcticDB: A Practical Example

A simple equity price time-series workflow
In this notebook we:
- Fetch historical tech stock and S&P 500 prices from Yahoo! using the yfinance package
- Store 10 years of raw daily price data in a temporary local ArcticDB library set up within the notebook
- Fetch the data in two steps: an initial historical load followed by an update with recent prices
- This mirrors a typical production workflow: an initial backfill plus daily updates with recent data
- Take the Adjusted Close prices, remove missing data and compute returns, then save the results back to the library
- Read the cleaned returns and use them to compute rolling betas of the stocks versus the S&P 500
Note: the data for this notebook is sourced from Yahoo!
Please read the Yahoo terms before using it.
Installation and imports
In [2]:
!pip install arcticdb yfinance
In [3]:
import arcticdb as adb
import yfinance as yf
import pandas as pd
from typing import List, Tuple
from datetime import datetime
import matplotlib.pyplot as plt
ArcticDB setup
In [4]:
arctic = adb.Arctic("lmdb://arcticdb_equity")
lib = arctic.get_library('demo', create_if_missing=True)
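The `lmdb://` URI above creates a throwaway library on local disk, which is ideal for a demo. For shared or production use, ArcticDB can point at object storage instead. A minimal sketch, assuming an S3 bucket and credentials (the endpoint, bucket name, region and keys below are placeholders):

```python
import arcticdb as adb

# Hypothetical S3-backed setup: swap the placeholders for a real endpoint, bucket and credentials.
# The rest of the notebook would be unchanged - only the connection string differs.
arctic_s3 = adb.Arctic(
    "s3s://s3.eu-west-2.amazonaws.com:my-arcticdb-bucket?region=eu-west-2&access=MY_ACCESS_KEY&secret=MY_SECRET_KEY"
)
lib_s3 = arctic_s3.get_library("demo", create_if_missing=True)
```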
Fetch historical prices from Yahoo!
In [24]:
start = datetime(2013, 1, 1)
end = datetime(2022, 12, 31)
freq = '1d'
symbols = {
'^GSPC': 'S&P 500',
'AAPL': 'Apple',
'MSFT': 'Microsoft',
'GOOG': 'Google',
'AMZN': 'Amazon',
'NVDA': 'Nvidia',
'META': 'Meta',
'TSLA': 'Tesla',
'TSM': 'TSMC',
'TCEHY': 'Tencent',
'005930.KS': 'Samsung',
'ORCL': 'Oracle',
'ADBE': 'Adobe',
'ASML': 'ASML',
'CSCO': 'Cisco'
}
In [25]:
hist = yf.download(list(symbols.keys()), interval=freq, start=start, end=end)
[*********************100%%**********************] 15 of 15 completed
In [26]:
# the column levels[0] are the fields for each stock
print(hist.columns.levels[0])
hist['Volume'].head(5)
Index(['Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume'], dtype='object')
Out[26]:
| Date | 005930.KS | AAPL | ADBE | AMZN | ASML | CSCO | GOOG | META | MSFT | NVDA | ORCL | TCEHY | TSLA | TSM | ^GSPC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-02 | 11449650.0 | 560518000.0 | 6483800.0 | 65420000.0 | 1824000.0 | 40304500.0 | 102033017.0 | 69846400.0 | 52899300.0 | 47883600.0 | 33758400.0 | 362500.0 | 17922000.0 | 10226100.0 | 4.202600e+09 |
| 2013-01-03 | 14227400.0 | 352965200.0 | 3906000.0 | 55018000.0 | 1725400.0 | 50603500.0 | 93075567.0 | 63140600.0 | 48294400.0 | 29888800.0 | 21819500.0 | 355000.0 | 11130000.0 | 13148600.0 | 3.829730e+09 |
| 2013-01-04 | 12999800.0 | 594333600.0 | 3809300.0 | 37484000.0 | 3170800.0 | 36378900.0 | 110954331.0 | 72715400.0 | 52521100.0 | 52496800.0 | 21687300.0 | 101000.0 | 10110000.0 | 7464200.0 | 3.424290e+09 |
| 2013-01-07 | 12610950.0 | 484156400.0 | 3632100.0 | 98200000.0 | 2066100.0 | 30790700.0 | 66476239.0 | 83781800.0 | 37110400.0 | 61073200.0 | 14008300.0 | 83000.0 | 6630000.0 | 9429900.0 | 3.304970e+09 |
| 2013-01-08 | 13822250.0 | 458707200.0 | 3080900.0 | 60214000.0 | 1182400.0 | 33218100.0 | 67295297.0 | 45871300.0 | 44703100.0 | 46642400.0 | 17408900.0 | 49000.0 | 19260000.0 | 8112900.0 | 3.601600e+09 |
Write the data into the library
ArcticDB cannot currently store dataframes with multi-level columns directly, so we use one symbol per price field.
In [27]:
def field_name_to_sym(field_name: str) -> str:
return f"hist/price_{field_name.replace(' ', '')}"
In [28]:
for l in hist.columns.levels[0]:
lib.write(field_name_to_sym(l), hist[l])
In [29]:
# read back and check that the data round-trips precisely
for l in hist.columns.levels[0]:
hist_check_db = lib.read(field_name_to_sym(l)).data
if not hist[l].equals(hist_check_db):
print(f"Field '{l}' does not round-trip")
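Since we split the multi-level frame into one symbol per field, it can be useful to rebuild the original `(field, ticker)` column layout after reading. A minimal sketch using pandas, assuming the symbols written above:

```python
import pandas as pd

# Read each price field back from ArcticDB and rebuild a two-level column frame,
# mirroring the (field, ticker) layout returned by yf.download.
fields = ['Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume']
frames = {field: lib.read(field_name_to_sym(field)).data for field in fields}
hist_rebuilt = pd.concat(frames, axis=1)  # outer column level = field, inner = ticker
```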
In [30]:
update_start = datetime(2022, 7, 1)
update_end = datetime(2023, 12, 31)
update_hist = yf.download(list(symbols.keys()), interval=freq, start=update_start, end=update_end)
[*********************100%%**********************] 15 of 15 completed
In [31]:
for l in update_hist.columns.levels[0]:
lib.update(field_name_to_sym(l), update_hist[l])
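`lib.update` replaces the stored rows that fall inside the date range of the new data, so the history before July 2022 is untouched while the overlapping and new dates come from the fresh download. As a sanity check, we can read back just the updated window using the same `date_range` option used later in this notebook:

```python
# Read back only the updated window for one field and inspect the most recent rows.
close_updated = lib.read(field_name_to_sym('Close'),
                         date_range=(update_start, update_end)).data
close_updated.tail()
```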
In [32]:
# these are the symbols we have created
lib.list_symbols()
Out[32]:
['hist/price_High', 'hist/price_Low', 'hist/price_Volume', 'hist/price_Close', 'hist/price_AdjClose', 'hist/price_Open']
In [33]:
# each symbol contains data for one price field, with the stock tickers as columns and dates as the index
lib.head(field_name_to_sym('Close')).data
Out[33]:
| Date | 005930.KS | AAPL | ADBE | AMZN | ASML | CSCO | GOOG | META | MSFT | NVDA | ORCL | TCEHY | TSLA | TSM | ^GSPC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-02 | 31520.0 | 19.608213 | 38.340000 | 12.8655 | 66.779999 | 20.340000 | 18.013729 | 28.000000 | 27.620001 | 3.1800 | 34.689999 | 6.720 | 2.357333 | 18.100000 | 1462.420044 |
| 2013-01-03 | 30860.0 | 19.360714 | 37.750000 | 12.9240 | 65.379997 | 20.450001 | 18.024191 | 27.770000 | 27.250000 | 3.1825 | 34.310001 | 6.660 | 2.318000 | 18.090000 | 1459.369995 |
| 2013-01-04 | 30500.0 | 18.821428 | 38.130001 | 12.9575 | 64.709999 | 20.480000 | 18.380356 | 28.760000 | 26.740000 | 3.2875 | 34.610001 | 6.694 | 2.293333 | 17.959999 | 1466.469971 |
| 2013-01-07 | 30400.0 | 18.710714 | 37.939999 | 13.4230 | 63.660000 | 20.290001 | 18.300158 | 29.420000 | 26.690001 | 3.1925 | 34.430000 | 6.600 | 2.289333 | 17.700001 | 1461.890015 |
| 2013-01-08 | 30000.0 | 18.761070 | 38.139999 | 13.3190 | 63.139999 | 20.309999 | 18.264042 | 29.059999 | 26.549999 | 3.1225 | 34.439999 | 6.570 | 2.245333 | 17.540001 | 1457.150024 |
Read and process the whole price dataset
- Read the data, using the Adj Close field as our primary price source
- Remove missing data by forward filling. This is a simple, if unsophisticated, approach
- Compute daily returns from the prices (see the formula after this list)
- Write the returns back to ArcticDB as another symbol
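The daily return used below is the simple one-period percentage change of the cleaned adjusted close, which is what `pct_change(1)` computes:

$$ r_t = \frac{P_t}{P_{t-1}} - 1 $$

where $P_t$ is the (forward-filled) adjusted close price on day $t$.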
In [36]:
hist_adj_close = lib.read(field_name_to_sym('Adj Close')).data
In [37]:
# ffill to remove nans (missing data)
hist_adj_close_clean = hist_adj_close.ffill()
# the following line will return True if there are any nans
hist_adj_close_clean.isnull().any().any()
Out[37]:
False
In [38]:
hist_daily_returns = hist_adj_close_clean.pct_change(1).iloc[1:]
hist_daily_returns.iloc[:5, :5]
Out[38]:
| Date | 005930.KS | AAPL | ADBE | AMZN | ASML |
|---|---|---|---|---|---|
| 2013-01-03 | -0.020939 | -0.012622 | -0.015389 | 0.004547 | -0.020964 |
| 2013-01-04 | -0.011666 | -0.027854 | 0.010066 | 0.002592 | -0.010248 |
| 2013-01-07 | -0.003278 | -0.005882 | -0.004983 | 0.035925 | -0.016226 |
| 2013-01-08 | -0.013158 | 0.002691 | 0.005272 | -0.007748 | -0.008168 |
| 2013-01-09 | 0.000000 | -0.015629 | 0.013634 | -0.000113 | 0.005702 |
In [40]:
returns_sym = 'hist/returns_AdjClose_clean'
lib.write(returns_sym, hist_daily_returns)
Out[40]:
VersionedItem(symbol='hist/returns_AdjClose_clean', library='demo', data=n/a, version=1, metadata=None, host='LMDB(path=/users/isys/nclarke/jupyter/arctic/demos/arcticdb_equity)')
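The `VersionedItem` returned by `write` shows that this call created version 1 of the symbol; ArcticDB keeps earlier versions readable as well. A minimal sketch of inspecting and reading back an older version, assuming a version 0 exists from a previous write:

```python
# List the stored versions of the returns symbol, then read a specific one.
print(lib.list_versions(returns_sym))
returns_v0 = lib.read(returns_sym, as_of=0).data  # as_of selects the version to read
```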
End of data processing - create a snapshot
Creating a snapshot is optional, but it is useful for recording the state of all the data and marking the end of the daily update process.
In [41]:
snapshot_name = f"eod:{hist_daily_returns.iloc[-1].name.date()}"
print(snapshot_name)
if snapshot_name in lib.list_snapshots():
lib.delete_snapshot(snapshot_name)
lib.snapshot(snapshot_name, metadata="EOD audit point")
snapshot_name = f"eod:{hist_daily_returns.iloc[-1].name.date()}" print(snapshot_name) if snapshot_name in lib.list_snapshots(): lib.delete_snapshot(snapshot_name) lib.snapshot(snapshot_name, metadata="EOD audit point")
eod:2023-11-17
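As a usage note, the snapshot name can later be passed as the `as_of` argument to `read`, returning a symbol exactly as it was when the snapshot was taken, even if the live data has since been updated. A minimal sketch:

```python
# Read the cleaned returns as captured in the end-of-day snapshot.
returns_at_snapshot = lib.read(returns_sym, as_of=snapshot_name).data
returns_at_snapshot.tail()
```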
Visualise the returns
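The plot below shows cumulative returns, which compound the daily returns from the start of the plotting window; this is what `(1 + returns_plot).cumprod() - 1` computes:

$$ R_T = \prod_{t=1}^{T} (1 + r_t) - 1 $$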
In [42]:
plot_start = datetime(2021, 1, 1)
plot_end = datetime(2023, 12, 31)
returns_plot = lib.read(returns_sym, date_range=(plot_start, plot_end)).data
daily_cum_returns = ((1 + returns_plot).cumprod() - 1)
daily_cum_returns.rename(columns=symbols).plot(figsize=(20, 10), grid=True, linewidth=0.9, title="Daily Cumulative Stock Returns")
Out[42]:
<Axes: title={'center': 'Daily Cumulative Stock Returns'}, xlabel='Date'>
Analytics: compute rolling betas versus the S&P 500
Note the large shift in the betas starting in 2020 Q2, which drops out of the betas 130 days (the length of the rolling window) later.
Probably a result of the extreme market moves at the start of Covid?
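The rolling beta computed below follows the standard definition, estimated over each 130-day window of daily returns:

$$ \beta_i = \frac{\operatorname{Cov}(r_i, r_m)}{\operatorname{Var}(r_m)} $$

where $r_i$ are a stock's daily returns and $r_m$ are the S&P 500 index returns.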
In [43]:
index_ticker = "^GSPC"
roll_days = 130
beta_start = datetime(2018, 1, 1)
beta_end = datetime(2022, 12, 31)
beta_returns = lib.read(returns_sym, date_range=(beta_start, beta_end)).data
index_returns = beta_returns[index_ticker]
stock_returns = beta_returns.drop(columns=index_ticker)
index_ticker = "^GSPC" roll_days = 130 beta_start = datetime(2018, 1, 1) beta_end = datetime(2022, 12, 31) beta_returns = lib.read(returns_sym, date_range=(beta_start, beta_end)).data index_returns = beta_returns[index_ticker] stock_returns = beta_returns.drop(columns=index_ticker)
In [44]:
rolling_cov = stock_returns.rolling(roll_days).cov(index_returns).iloc[roll_days-1:]
rolling_index_var = index_returns.rolling(roll_days).var().iloc[roll_days-1:]
rolling_beta = rolling_cov.divide(rolling_index_var, axis='index').rename(columns=symbols)
In [45]:
ax = rolling_beta.plot(figsize=(20, 10), grid=True, linewidth=0.9, title=f"Rolling {roll_days}-day beta")
ax.legend(loc='upper left')
ax = rolling_beta.plot(figsize=(20, 10), grid=True, linewidth=0.9, title=f"{roll_days} 日滚动 Beta 值") ax.legend(loc='upper left')
Out[45]:
<matplotlib.legend.Legend at 0x7ef5d42b1fd0>
Conclusions
- We have demonstrated a simple data collection and analysis workflow
- Although simple, this pattern scales well to much larger datasets
- In a real system, the data collection and storage would be separated from the analysis
- Overall, using ArcticDB was simple, clear and straightforward
- Researchers and data scientists like its simplicity - it lets them focus on the data and their research
For more information on equity beta, see https://en.wikipedia.org/wiki/Beta_(finance)