Encoding ODB-2 Data

Trivial Example

To encode a pandas DataFrame, simply pass it to the encode_odb() function:

[1]:
import sys
import os
sys.path.insert(0, os.path.abspath('../../..'))
[2]:
import pandas as pd
import pyodc as odc

df = pd.read_csv('data-1.csv')

odc.encode_odb(df, 'example-1.odb')

File Type Object

Encoding of ODB-2 data works with file-like objects as well as with file names:

[3]:
with open('example-1.odb', 'wb') as f:
    odc.encode_odb(df, f)
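
Accepting either a file name or an open file object is a common pattern for writers. The sketch below shows one way such a function can be structured using only the standard library. It is an illustration, not pyodc's implementation: binary_writer and write_payload are hypothetical helpers, and the byte strings stand in for encoded data.

```python
import contextlib
import io
import os


@contextlib.contextmanager
def binary_writer(target):
    """Yield a writable binary stream for a path or an already-open file object."""
    if isinstance(target, (str, os.PathLike)):
        # A path was given: open the file here and close it when done.
        with open(target, 'wb') as f:
            yield f
    else:
        # A file-like object was given: the caller owns it, so leave it open.
        yield target


def write_payload(payload, target):
    """Write bytes to a file name or to a file-like object."""
    with binary_writer(target) as f:
        f.write(payload)


# An in-memory buffer behaves like a file opened with mode 'wb'.
buffer = io.BytesIO()
write_payload(b'frame-1', buffer)
write_payload(b'frame-2', buffer)
print(buffer.getvalue())
```

Because the caller keeps the file object open, successive writes append to the same stream.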

Configuring Encoded Columns

By default, pyodc will always encode ODB-2 data in a lossless manner. In particular, most values are encoded as 8-byte DOUBLE values.

Typically, the encoder will automatically select a data type and corresponding encoder to use. This data type can be overridden by supplying a types dictionary, for example to encode a column as a 4-byte REAL value:

[4]:
odc.encode_odb(df, 'example-3.odb', types={'obsvalue@body': odc.REAL})

Inspecting the frame headers shows that the data type has changed:

[5]:
r1 = odc.Reader('example-1.odb', aggregated=False)
r3 = odc.Reader('example-3.odb', aggregated=False)

print('original:', r1.frames[0].column_dict['obsvalue@body'].dtype)
print('updated: ', r3.frames[0].column_dict['obsvalue@body'].dtype)
original: DataType.DOUBLE
updated:  DataType.REAL

Decoded data also confirms that the precision has been appropriately reduced:

[6]:
df_decoded = odc.read_odb('example-3.odb', single=True)
print(df_decoded)
   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0       1  20210420     stat00  0-12345-0-67890       0.000000
1       1  20210420     stat01  0-12345-0-67891      12.345600
2       1  20210420     stat02  0-12345-0-67892      24.691200
3       1  20210420     stat03  0-12345-0-67893      37.036800
4       1  20210420     stat04  0-12345-0-67894      49.382401
5       1  20210420     stat05  0-12345-0-67895      61.728001
6       1  20210420     stat06  0-12345-0-67896      74.073601
7       1  20210420     stat07  0-12345-0-67897      86.419197
8       1  20210420     stat08  0-12345-0-67898      98.764801
9       1  20210420     stat09  0-12345-0-67899     111.110397

   integer_missing  double_missing
0           1234.0           12.34
1           4321.0           43.21
2              NaN             NaN
3           1234.0           12.34
4           4321.0           43.21
5              NaN             NaN
6           1234.0           12.34
7           4321.0           43.21
8              NaN             NaN
9           1234.0           12.34
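
The small deviations in obsvalue@body (for example 49.382401 where the input was 49.3824) are what one would expect from storing a value in 4 bytes instead of 8. The following is a minimal standard-library sketch of that rounding, assuming REAL corresponds to an IEEE 754 single-precision float. The helper to_float32 is hypothetical and not part of pyodc.

```python
import struct


def to_float32(value):
    """Round-trip a Python float through a 4-byte IEEE 754 representation."""
    # Assumption: a 4-byte REAL behaves like the struct format code 'f'.
    return struct.unpack('<f', struct.pack('<f', value))[0]


original = 49.3824
reduced = to_float32(original)

print(f'{original:.6f}')  # the 8-byte value as entered
print(f'{reduced:.6f}')   # the value after being squeezed into 4 bytes
```

Values that are exactly representable in 4 bytes, such as 0.0, survive the round trip unchanged.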

Configuring Frame Structure

ODB-2 data is broken down into frames. By default, at most 10 000 rows of data are encoded into each frame. If more than 10 000 rows are supplied, the data is split into a sequence of frames, each containing at most 10 000 rows.

To modify this threshold, pass the rows_per_frame argument:

[7]:
odc.encode_odb(df, 'example-4.odb', rows_per_frame=3)

Examination of the frame structure clearly shows that the data now contains multiple frames:

[8]:
r1 = odc.Reader('example-1.odb', aggregated=False)
r4 = odc.Reader('example-4.odb', aggregated=False)

print('original frames:', r1.frames)
print('updated  frames:', r4.frames)

print('original row counts:', [f.nrows for f in r1.frames])
print('updated  row counts:', [f.nrows for f in r4.frames])
original frames: [<pyodc.frame.Frame object at 0x7f85a83d99d0>]
updated  frames: [<pyodc.frame.Frame object at 0x7f85b8818df0>, <pyodc.frame.Frame object at 0x7f85a83b6c70>, <pyodc.frame.Frame object at 0x7f85a8570940>, <pyodc.frame.Frame object at 0x7f85a8570be0>]
original row counts: [10]
updated  row counts: [3, 3, 3, 1]
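
The row counts follow from simple chunking of the input rows. As a rough sketch, assuming the encoder fills each frame up to the limit before starting the next one, the expected counts can be computed as below. The function frame_row_counts is a hypothetical helper, not a pyodc API.

```python
def frame_row_counts(n_rows, rows_per_frame=10000):
    """Return the number of rows each frame would hold under simple chunking."""
    if rows_per_frame <= 0:
        raise ValueError('rows_per_frame must be positive')
    counts = []
    remaining = n_rows
    while remaining > 0:
        # Each frame takes as many rows as allowed, the last one takes the rest.
        size = min(rows_per_frame, remaining)
        counts.append(size)
        remaining -= size
    return counts


print(frame_row_counts(10))                    # default threshold
print(frame_row_counts(10, rows_per_frame=3))  # threshold used for example-4.odb
```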

Despite these differences, the decoded data is the same:

[9]:
df_decoded = odc.read_odb('example-4.odb', single=True)
print(df_decoded)
   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0       1  20210420     stat00  0-12345-0-67890         0.0000
1       1  20210420     stat01  0-12345-0-67891        12.3456
2       1  20210420     stat02  0-12345-0-67892        24.6912
3       1  20210420     stat03  0-12345-0-67893        37.0368
4       1  20210420     stat04  0-12345-0-67894        49.3824
5       1  20210420     stat05  0-12345-0-67895        61.7280
6       1  20210420     stat06  0-12345-0-67896        74.0736
7       1  20210420     stat07  0-12345-0-67897        86.4192
8       1  20210420     stat08  0-12345-0-67898        98.7648
9       1  20210420     stat09         0-12345-       111.1104

   integer_missing  double_missing
0           1234.0           12.34
1           4321.0           43.21
2              NaN             NaN
3           1234.0           12.34
4           4321.0           43.21
5              NaN             NaN
6           1234.0           12.34
7           4321.0           43.21
8              NaN             NaN
9           1234.0           12.34

Additional Properties

To encode additional properties as part of a frame’s data, pass a dictionary of the properties you want to include as the properties parameter of the encode_odb() function:

[10]:
metadata = {
    'encoded_by': 'ECMWF',
    'data_source': 'pyodc_docs',
}
odc.encode_odb(df, 'example-1.odb', properties=metadata)

Encoded properties are accessible via the properties attribute of the frame object:

[11]:
r1 = odc.Reader('example-1.odb')
print([f.properties for f in r1.frames])
[{}]

A Sequence of (Unrelated) Data

ODB-2 frames are self-contained and passed as a stream of data, so there is no requirement that they are related to each other.

For example, we can encode frames with two different structures (also known as incompatible data) into the same file:

[12]:
df2 = pd.read_csv('data-2.csv')

with open('example-2.odb', 'wb') as f:
    odc.encode_odb(df, f)
    odc.encode_odb(df2, f)

The trivial decoder now produces a DataFrame with a substantial number of missing values:

[13]:
with open('example-2.odb', 'rb') as f:
    df_decoded = odc.read_odb(f, single=True)

print(df_decoded)
    expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0        1  20210420     stat00  0-12345-0-67890         0.0000
1        1  20210420     stat01  0-12345-0-67891        12.3456
2        1  20210420     stat02  0-12345-0-67892        24.6912
3        1  20210420     stat03  0-12345-0-67893        37.0368
4        1  20210420     stat04  0-12345-0-67894        49.3824
5        1  20210420     stat05  0-12345-0-67895        61.7280
6        1  20210420     stat06  0-12345-0-67896        74.0736
7        1  20210420     stat07  0-12345-0-67897        86.4192
8        1  20210420     stat08  0-12345-0-67898        98.7648
9        1  20210420     stat09  0-12345-0-67899       111.1104
10       2  20210420     stat00             None         0.0000
11       2  20210420     stat01             None        12.3456
12       2  20210420     stat02             None        24.6912
13       2  20210420     stat03             None        37.0368
14       2  20210420     stat04             None        49.3824
15       2  20210420     stat05             None        61.7280
16       2  20210420     stat06             None        74.0736
17       2  20210420     stat07             None        86.4192
18       2  20210420     stat08             None        98.7648
19       2  20210420     stat09             None       111.1104

    integer_missing  double_missing
0            1234.0           12.34
1            4321.0           43.21
2               NaN             NaN
3            1234.0           12.34
4            4321.0           43.21
5               NaN             NaN
6            1234.0           12.34
7            4321.0           43.21
8               NaN             NaN
9            1234.0           12.34
10              NaN             NaN
11              NaN             NaN
12              NaN             NaN
13              NaN             NaN
14              NaN             NaN
15              NaN             NaN
16              NaN             NaN
17              NaN             NaN
18              NaN             NaN
19              NaN             NaN
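
The missing values appear because a single table needs every column for every row, while each frame only carries its own columns. The plain-Python sketch below illustrates that effect on two tiny frames with different columns. It is not how pyodc decodes data: combine_frames is a hypothetical helper, and the rows are shortened stand-ins for the data shown here.

```python
def combine_frames(frames):
    """Concatenate lists of row dicts, filling absent columns with None."""
    # Collect the union of column names, preserving first-seen order.
    columns = []
    for rows in frames:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Rebuild every row against the full column list.
    return [
        {name: row.get(name) for name in columns}
        for rows in frames
        for row in rows
    ]


frame_a = [{'expver': 1, 'statid@hdr': 'stat00', 'wigos@hdr': '0-12345-0-67890'}]
frame_b = [{'expver': 2, 'statid@hdr': 'stat00'}]  # no wigos@hdr column

combined = combine_frames([frame_a, frame_b])
for row in combined:
    print(row)
```

The row taken from the second frame ends up with None in the column it never had, which mirrors the None and NaN entries in rows 10 to 19.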