ArcticDB_demo_lmdb
In [ ]
!pip install arcticdb
In [2]
import time
import numpy as np
import pandas as pd
from datetime import datetime
import arcticdb as adb
ArcticDB concepts and terminology¶
- Namespace – a collection of libraries, used to keep logical environments separate from one another. Analogous to a database server.
- Library – contains multiple symbols grouped in a particular way (by user, market, etc.). Analogous to a database.
- Symbol – the atomic unit of data storage, identified by a string name. The data stored under a symbol strongly resembles a Pandas DataFrame. Analogous to a table.
- Version – every modifying operation (write, append, update) performed on a symbol creates a new version of that object.
- Snapshot – the data associated with some or all symbols at a particular point in time can be snapshotted and retrieved later via the read methods.
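These concepts map directly onto the API used in the rest of this notebook. A minimal sketch, assuming a local LMDB path and hypothetical names ('demo_lib', 'my_symbol', 'eod'):

import pandas as pd
import arcticdb as adb

ac = adb.Arctic("lmdb://arcticdb_demo")                    # one Arctic instance per namespace / storage
lib = ac.get_library("demo_lib", create_if_missing=True)   # library: a group of symbols (hypothetical name)
lib.write("my_symbol", pd.DataFrame({"x": [1.0, 2.0]}))    # symbol: one named table; every write creates a new version
lib.snapshot("eod")                                        # snapshot: pins the current versions of all symbols (hypothetical name)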
ArcticDB is designed for time-series data¶
Let's create a few small DataFrames of daily data to get a feel for things.
In [3]
daily1 = pd.DataFrame(np.ones((4, 3))*1, index=pd.date_range('1/1/2023', periods=4, freq="D"), columns=list('ABC'))
daily1
Out[3]
| | A | B | C |
|---|---|---|---|
2023-01-01 | 1.0 | 1.0 | 1.0 |
2023-01-02 | 1.0 | 1.0 | 1.0 |
2023-01-03 | 1.0 | 1.0 | 1.0 |
2023-01-04 | 1.0 | 1.0 | 1.0 |
In [4]
daily2 = pd.DataFrame(np.ones((4, 3))*2, index=pd.date_range('1/5/2023', periods=4, freq="D"), columns=list('ABC'))
daily2
Out[4]
| | A | B | C |
|---|---|---|---|
2023-01-05 | 2.0 | 2.0 | 2.0 |
2023-01-06 | 2.0 | 2.0 | 2.0 |
2023-01-07 | 2.0 | 2.0 | 2.0 |
2023-01-08 | 2.0 | 2.0 | 2.0 |
In [5]
daily3 = pd.DataFrame(np.ones((4, 3))*3, index=pd.date_range('1/3/2023', periods=4, freq="D"), columns=list('ABC'))
daily3
Out[5]
| | A | B | C |
|---|---|---|---|
2023-01-03 | 3.0 | 3.0 | 3.0 |
2023-01-04 | 3.0 | 3.0 | 3.0 |
2023-01-05 | 3.0 | 3.0 | 3.0 |
2023-01-06 | 3.0 | 3.0 | 3.0 |
Library management¶
For this demo we configure an LMDB file-based backend. ArcticDB achieves high performance and scale when configured with an object-store backend such as S3.
In [6]
arctic = adb.Arctic("lmdb://arcticdb_demo")
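For comparison, pointing the same code at an object store only changes the connection string. A minimal sketch with a hypothetical S3 endpoint and bucket (here aws_auth=true picks up credentials from the environment; see the ArcticDB docs for the full set of URI options):

# hypothetical endpoint and bucket name
arctic_s3 = adb.Arctic("s3s://s3.eu-west-1.amazonaws.com:my-bucket?aws_auth=true")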
You can have an unlimited number of libraries, but we will start with just one.
In [7]
lib = arctic.get_library('sample', create_if_missing=True)
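As a quick check, you can list the libraries that exist in this Arctic instance (output will vary):

arctic.list_libraries()  # e.g. ['sample']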
Reading and writing data¶
Read a Pandas DataFrame from a source and write it to a destination¶
ArcticDB generally follows a Pandas In, Pandas Out philosophy: both read and write work with Pandas DataFrames.
Note - a single library typically holds many thousands of symbols.
In [8]
write_record = lib.write("DAILY", daily1)
write_record
Out[8]
VersionedItem(symbol='DAILY', library='sample', data=n/a, version=0, metadata=None, host='LMDB(path=/content/arcticdb_demo)')
In [9]
read_record = lib.read("DAILY")
read_record
Out[9]
VersionedItem(symbol='DAILY', library='sample', data=<class 'pandas.core.frame.DataFrame'>, version=0, metadata=None, host='LMDB(path=/content/arcticdb_demo)')
Note: you can use library-level snapshots to version multiple symbols/tables together!
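A minimal snapshot sketch, using a hypothetical snapshot name; as_of accepts a snapshot name as well as a version number:

lib.snapshot("demo_snapshot")                  # hypothetical name: pins the current version of every symbol in the library
lib.list_snapshots()                           # snapshot names mapped to their metadata
lib.read("DAILY", as_of="demo_snapshot").data  # the data exactly as it was when the snapshot was taken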
In [10]
read_record.data
Out[10]
| | A | B | C |
|---|---|---|---|
2023-01-01 | 1.0 | 1.0 | 1.0 |
2023-01-02 | 1.0 | 1.0 | 1.0 |
2023-01-03 | 1.0 | 1.0 | 1.0 |
2023-01-04 | 1.0 | 1.0 | 1.0 |
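With data written, you can enumerate the symbols in the library; a minimal sketch (output will vary):

lib.list_symbols()       # e.g. ['DAILY']
lib.has_symbol("DAILY")  # True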
Modifying data¶
ArcticDB supports data modification, such as update and append.
In [11]
lib.append("DAILY", daily2)
lib.read("DAILY").data
lib.append("DAILY", daily2) lib.read("DAILY").data
Out[11]
| | A | B | C |
|---|---|---|---|
2023-01-01 | 1.0 | 1.0 | 1.0 |
2023-01-02 | 1.0 | 1.0 | 1.0 |
2023-01-03 | 1.0 | 1.0 | 1.0 |
2023-01-04 | 1.0 | 1.0 | 1.0 |
2023-01-05 | 2.0 | 2.0 | 2.0 |
2023-01-06 | 2.0 | 2.0 | 2.0 |
2023-01-07 | 2.0 | 2.0 | 2.0 |
2023-01-08 | 2.0 | 2.0 | 2.0 |
In [12]
lib.update("DAILY", daily3)
lib.read("DAILY").data
lib.update("DAILY", daily3) lib.read("DAILY").data
Out[12]
| | A | B | C |
|---|---|---|---|
2023-01-01 | 1.0 | 1.0 | 1.0 |
2023-01-02 | 1.0 | 1.0 | 1.0 |
2023-01-03 | 3.0 | 3.0 | 3.0 |
2023-01-04 | 3.0 | 3.0 | 3.0 |
2023-01-05 | 3.0 | 3.0 | 3.0 |
2023-01-06 | 3.0 | 3.0 | 3.0 |
2023-01-07 | 2.0 | 2.0 | 2.0 |
2023-01-08 | 2.0 | 2.0 | 2.0 |
ArcticDB is bitemporal¶
Every ArcticDB operation is versioned - you can rewind to earlier revisions and perform point-in-time analysis of your data!
In [13]
# Rewind to version...
lib.read("DAILY", as_of=write_record.version).data
Out[13]
| | A | B | C |
|---|---|---|---|
2023-01-01 | 1.0 | 1.0 | 1.0 |
2023-01-02 | 1.0 | 1.0 | 1.0 |
2023-01-03 | 1.0 | 1.0 | 1.0 |
2023-01-04 | 1.0 | 1.0 | 1.0 |
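You can also inspect a symbol's version history directly; a minimal sketch (the exact shape of the returned mapping may vary between releases):

lib.list_versions("DAILY")  # maps (symbol, version) to creation time, deletion status and containing snapshots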
ArcticDB supports very large DataFrames¶
A typical use case is storing the history of more than 100k metrics in a single DataFrame for convenient time-series and cross-sectional analysis.
For this demo notebook we will work with just 10,000 rows of hourly data across 10,000 metric columns.
In [14]
n = 10_000
large = pd.DataFrame(np.linspace(1, n, n)*np.linspace(1, n, n)[:,np.newaxis], columns=[f'c{i}' for i in range(n)], index=pd.date_range('1/1/2020', periods=n, freq="H"))
large.tail()
Out[14]
| | c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | ... | c9990 | c9991 | c9992 | c9993 | c9994 | c9995 | c9996 | c9997 | c9998 | c9999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2021-02-20 11:00:00 | 9996.0 | 19992.0 | 29988.0 | 39984.0 | 49980.0 | 59976.0 | 69972.0 | 79968.0 | 89964.0 | 99960.0 | ... | 99870036.0 | 99880032.0 | 99890028.0 | 99900024.0 | 99910020.0 | 99920016.0 | 99930012.0 | 99940008.0 | 99950004.0 | 99960000.0 |
2021-02-20 12:00:00 | 9997.0 | 19994.0 | 29991.0 | 39988.0 | 49985.0 | 59982.0 | 69979.0 | 79976.0 | 89973.0 | 99970.0 | ... | 99880027.0 | 99890024.0 | 99900021.0 | 99910018.0 | 99920015.0 | 99930012.0 | 99940009.0 | 99950006.0 | 99960003.0 | 99970000.0 |
2021-02-20 13:00:00 | 9998.0 | 19996.0 | 29994.0 | 39992.0 | 49990.0 | 59988.0 | 69986.0 | 79984.0 | 89982.0 | 99980.0 | ... | 99890018.0 | 99900016.0 | 99910014.0 | 99920012.0 | 99930010.0 | 99940008.0 | 99950006.0 | 99960004.0 | 99970002.0 | 99980000.0 |
2021-02-20 14:00:00 | 9999.0 | 19998.0 | 29997.0 | 39996.0 | 49995.0 | 59994.0 | 69993.0 | 79992.0 | 89991.0 | 99990.0 | ... | 99900009.0 | 99910008.0 | 99920007.0 | 99930006.0 | 99940005.0 | 99950004.0 | 99960003.0 | 99970002.0 | 99980001.0 | 99990000.0 |
2021-02-20 15:00:00 | 10000.0 | 20000.0 | 30000.0 | 40000.0 | 50000.0 | 60000.0 | 70000.0 | 80000.0 | 90000.0 | 100000.0 | ... | 99910000.0 | 99920000.0 | 99930000.0 | 99940000.0 | 99950000.0 | 99960000.0 | 99970000.0 | 99980000.0 | 99990000.0 | 100000000.0 |
5 rows × 10000 columns
In [15]
t1 = time.time()
lib.write('large', large)
t2 = time.time()
print(f'Wrote {n*n/(t2-t1)/1e6:.2f} million floats per second.')
Wrote 13.30 million floats per second.
You can select rows and columns efficiently, which is essential when the data does not fit in memory.
In [16]
subframe = lib.read(
"large",
columns=["c0", "c1", "c5000", "c5001", "c9998", "c9999"],
date_range=(datetime(2020, 6, 13, 8), datetime(2020, 6, 13, 13))
).data
subframe
Out[16]
| | c0 | c1 | c5000 | c5001 | c9998 | c9999 |
|---|---|---|---|---|---|---|
2020-06-13 08:00:00 | 3945.0 | 7890.0 | 19728945.0 | 19732890.0 | 39446055.0 | 39450000.0 |
2020-06-13 09:00:00 | 3946.0 | 7892.0 | 19733946.0 | 19737892.0 | 39456054.0 | 39460000.0 |
2020-06-13 10:00:00 | 3947.0 | 7894.0 | 19738947.0 | 19742894.0 | 39466053.0 | 39470000.0 |
2020-06-13 11:00:00 | 3948.0 | 7896.0 | 19743948.0 | 19747896.0 | 39476052.0 | 39480000.0 |
2020-06-13 12:00:00 | 3949.0 | 7898.0 | 19748949.0 | 19752898.0 | 39486051.0 | 39490000.0 |
2020-06-13 13:00:00 | 3950.0 | 7900.0 | 19753950.0 | 19757900.0 | 39496050.0 | 39500000.0 |
In [17]
n = 100_000_000
long = pd.DataFrame(np.linspace(1, n, n), columns=['Price'], index=pd.date_range('1/1/2020', periods=n, freq="S"))
long.tail()
Out[17]
| | Price |
|---|---|
2023-03-03 09:46:35 | 99999996.0 |
2023-03-03 09:46:36 | 99999997.0 |
2023-03-03 09:46:37 | 99999998.0 |
2023-03-03 09:46:38 | 99999999.0 |
2023-03-03 09:46:39 | 100000000.0 |
In [18]
t1 = time.time()
lib.write('long', long)
t2 = time.time()
print(f'Wrote {n/(t2-t1)/1e6:.2f} million floats per second.')
Wrote 12.20 million floats per second.
You can query the data with familiar Pandas syntax and efficient C++ performance¶
For more information, see our LazyDataFrame and QueryBuilder documentation.
In [19]
%%time
lazy_df = lib.read("long", lazy=True)
lazy_df = lazy_df[(lazy_df["Price"] > 49e6) & (lazy_df["Price"] < 51e6)]
filtered = lazy_df.collect().data
CPU times: user 3.07 s, sys: 525 ms, total: 3.6 s
Wall time: 2.3 s
In [20]
len(filtered)
Out[20]
1999999
In [21]
filtered.tail()
Out[21]
| | Price |
|---|---|
2021-08-13 06:39:54 | 50999995.0 |
2021-08-13 06:39:55 | 50999996.0 |
2021-08-13 06:39:56 | 50999997.0 |
2021-08-13 06:39:57 | 50999998.0 |
2021-08-13 06:39:58 | 50999999.0 |
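The same filter can also be written with QueryBuilder, building the query up front and passing it to read. A minimal sketch equivalent to the lazy example above:

q = adb.QueryBuilder()
q = q[(q["Price"] > 49e6) & (q["Price"] < 51e6)]      # same predicate as the LazyDataFrame version
filtered_qb = lib.read("long", query_builder=q).data
filtered_qb.equals(filtered)                          # both approaches should return the same rows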