
Decoding ODB-2 Data

The high-level decoding API in pyodc is compatible with pandas and is designed to be as straightforward as possible.

Trivial Example

To decode the data, read it directly with the read_odb() function:

[1]:
import sys
import os
sys.path.insert(0, os.path.abspath('../../..'))
[2]:
import pandas as pd
import pyodc as odc

df_decoded = odc.read_odb('example-1.odb', single=True)
print(df_decoded)
   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0       1  20210420     stat00  0-12345-0-67890         0.0000
1       1  20210420     stat01  0-12345-0-67891        12.3456
2       1  20210420     stat02  0-12345-0-67892        24.6912
3       1  20210420     stat03  0-12345-0-67893        37.0368
4       1  20210420     stat04  0-12345-0-67894        49.3824
5       1  20210420     stat05  0-12345-0-67895        61.7280
6       1  20210420     stat06  0-12345-0-67896        74.0736
7       1  20210420     stat07  0-12345-0-67897        86.4192
8       1  20210420     stat08  0-12345-0-67898        98.7648
9       1  20210420     stat09  0-12345-0-67899       111.1104

   integer_missing  double_missing
0           1234.0           12.34
1           4321.0           43.21
2              NaN             NaN
3           1234.0           12.34
4           4321.0           43.21
5              NaN             NaN
6           1234.0           12.34
7           4321.0           43.21
8              NaN             NaN
9           1234.0           12.34

Note

Passing the single=True argument to read_odb() ensures that the data is aggregated and returned as a single frame if possible. For more information on aggregation, see the section on aggregated and non-aggregated decoding.

File-like Objects

Decoding of ODB-2 data works with file-like objects as well as with file names:

[3]:
with open('example-1.odb', 'rb') as f:
    df_decoded = odc.read_odb(f, single=True)
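Accepting either a file name or an open file object is a common pattern. The following is a minimal, generic sketch of how such a dispatch can be written; it is not pyodc's implementation, and the helper name open_binary is hypothetical. The bytes in the buffer are placeholders, not real ODB-2 data.

```python
import contextlib
import io
import os


@contextlib.contextmanager
def open_binary(source):
    """Yield a binary file object for a path or an already-open file-like object."""
    if isinstance(source, (str, bytes, os.PathLike)):
        # A file name: open it here and close it when the caller is done.
        with open(source, 'rb') as f:
            yield f
    else:
        # Assumed to be file-like: the caller keeps ownership and closes it.
        yield source


# Placeholder bytes standing in for encoded data held in memory.
buffer = io.BytesIO(b'placeholder bytes, not real ODB-2 data')
with open_binary(buffer) as f:
    payload = f.read()

print(len(payload), buffer.closed)
```

In this sketch the helper closes only the files it opened itself, so a caller that passes in an open file or an in-memory buffer can keep using it afterwards.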

Decoding a Subset of the Data

For large ODB-2 files, decoding only the columns that are needed can save considerable time and memory. The decoding functions accept a list or tuple of column names through the columns argument.

This is especially helpful when the structure of ODB-2 frames in a file is not constant, but all of the frames supply the desired data:

[4]:
df_decoded = odc.read_odb('example-1.odb', single=True, columns=('statid@hdr', 'obsvalue@body'))
print(df_decoded)
  statid@hdr  obsvalue@body
0     stat00         0.0000
1     stat01        12.3456
2     stat02        24.6912
3     stat03        37.0368
4     stat04        49.3824
5     stat05        61.7280
6     stat06        74.0736
7     stat07        86.4192
8     stat08        98.7648
9     stat09       111.1104
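When frames differ in structure, one way to find a safe set of columns to request is to intersect the column names of all frames. The sketch below does this in plain Python. The helper name common_columns is hypothetical and not part of pyodc, and the two column layouts are written out by hand for illustration rather than read from a file.

```python
def common_columns(frame_columns):
    """Return the columns present in every frame, in the order of the first frame."""
    if not frame_columns:
        return []
    shared = set(frame_columns[0])
    for columns in frame_columns[1:]:
        shared &= set(columns)
    return [col for col in frame_columns[0] if col in shared]


# Illustrative column layouts of two frames with different structure.
frame_a = ['expver', 'date@hdr', 'statid@hdr', 'wigos@hdr', 'obsvalue@body',
           'integer_missing', 'double_missing']
frame_b = ['expver', 'date@hdr', 'statid@hdr', 'obsvalue@body']

print(common_columns([frame_a, frame_b]))
```

Any subset of the resulting names could then be supplied as the columns argument.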

Note

For historical reasons, column references can omit the @ sign and everything after it, but only if the resulting short name is unique and unambiguous. For example, the line above could also refer to the two columns in the following format:

columns=('statid', 'obsvalue')
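The uniqueness requirement can be made concrete with a small sketch of how short names might be matched against full column names. This is only an illustration of the rule described in the note; the function resolve_columns is hypothetical and does not reflect how pyodc resolves names internally.

```python
def resolve_columns(requested, available):
    """Map each requested name to exactly one full column name from `available`."""
    resolved = []
    for name in requested:
        if name in available:
            # Exact match on the full name.
            resolved.append(name)
            continue
        # Short form: compare against the part before the '@' sign.
        matches = [col for col in available if col.split('@', 1)[0] == name]
        if len(matches) != 1:
            raise KeyError(f'column {name!r} is missing or ambiguous: {matches}')
        resolved.append(matches[0])
    return resolved


available = ['expver', 'date@hdr', 'statid@hdr', 'wigos@hdr', 'obsvalue@body']
print(resolve_columns(('statid', 'obsvalue'), available))
```

In this sketch, a file that contained both obsvalue@body and a second column such as obsvalue@hdr would make the short name obsvalue ambiguous, and the full name would have to be used instead.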

Decoding a Sequence of Frames

If ODB-2 data is extremely large, it is undesirable to attempt to decode it into memory in its entirety. Furthermore, if the frames within the file are not compatible, it may be a better idea to consider each of the frames separately.

By default, the read_odb() function returns an iterable sequence that lazily decodes ODB-2 frames as they are needed:

[5]:
for idx, df_decoded in enumerate(odc.read_odb('example-2.odb')):
   if idx > 0: print()
   print('Decoded data frame:', idx)
   print(df_decoded)
Decoded data frame: 0
   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0       1  20210420     stat00  0-12345-0-67890         0.0000
1       1  20210420     stat01  0-12345-0-67891        12.3456
2       1  20210420     stat02  0-12345-0-67892        24.6912
3       1  20210420     stat03  0-12345-0-67893        37.0368
4       1  20210420     stat04  0-12345-0-67894        49.3824
5       1  20210420     stat05  0-12345-0-67895        61.7280
6       1  20210420     stat06  0-12345-0-67896        74.0736
7       1  20210420     stat07  0-12345-0-67897        86.4192
8       1  20210420     stat08  0-12345-0-67898        98.7648
9       1  20210420     stat09  0-12345-0-67899       111.1104

   integer_missing  double_missing
0           1234.0           12.34
1           4321.0           43.21
2              NaN             NaN
3           1234.0           12.34
4           4321.0           43.21
5              NaN             NaN
6           1234.0           12.34
7           4321.0           43.21
8              NaN             NaN
9           1234.0           12.34

Decoded data frame: 1
   expver  date@hdr statid@hdr  obsvalue@body
0       2  20210420     stat00         0.0000
1       2  20210420     stat01        12.3456
2       2  20210420     stat02        24.6912
3       2  20210420     stat03        37.0368
4       2  20210420     stat04        49.3824
5       2  20210420     stat05        61.7280
6       2  20210420     stat06        74.0736
7       2  20210420     stat07        86.4192
8       2  20210420     stat08        98.7648
9       2  20210420     stat09       111.1104
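Lazy iteration makes it possible to compute summary statistics without holding every frame in memory at once. The sketch below accumulates a mean over a sequence of frames one frame at a time. The function streaming_mean and the generator stand_in_frames are hypothetical; the generator yields plain dictionaries of lists as stand-ins for decoded frames, so that the example runs without an ODB-2 file. The function only assumes that each frame supports frame[column] and that the column can be summed and measured with len(), which should also hold for pandas data frames.

```python
def streaming_mean(frames, column):
    """Compute the mean of `column` across frames, one frame at a time."""
    total = 0.0
    count = 0
    for frame in frames:
        values = frame[column]
        total += sum(values)
        count += len(values)
    return total / count if count else float('nan')


def stand_in_frames():
    """Hypothetical stand-in for a lazily decoded sequence of frames."""
    for expver in (1, 2):
        yield {
            'expver': [expver] * 10,
            'obsvalue@body': [i * 12.3456 for i in range(10)],
        }


mean_obsvalue = streaming_mean(stand_in_frames(), 'obsvalue@body')
print(round(mean_obsvalue, 4))
```

Only the running total and count are kept between iterations, so memory use is bounded by the size of the largest single frame.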

Aggregated or Non-aggregated Decoding

A sequence of frames may really be one logical frame that has been split for technical reasons, for example so that the data can be paged through memory without exhausting the available resources. The library can group such compatible frames into one logical, aggregated frame, and it does so by default. Decoding an aggregated logical frame in one step significantly improves decoder performance when decoding is offloaded to odc.

Both Reader and read_odb() accept two arguments that control aggregation:

  • aggregated - (default: True) enables or disables aggregation of compatible frames.

  • max_aggregated - (default: None) sets a maximum number of rows to be combined into one logical frame, before the library will split them anyway (for pagination purposes).

For example, first we encode a file that contains several real frames, which together form a smaller number of logical frames:

[6]:
df = pd.read_csv('data-1.csv')
df2 = pd.read_csv('data-2.csv')

with open('example-5.odb', 'wb') as f:
   odc.encode_odb(df, f, rows_per_frame=3)
   odc.encode_odb(df2, f, rows_per_frame=3)

The structure of the file can be inspected with two different readers:

[7]:
r5a = odc.Reader('example-5.odb')
r5b = odc.Reader('example-5.odb', aggregated=False)

print('aggregated row counts:', [f.nrows for f in r5a.frames])
print('separate   row counts:', [f.nrows for f in r5b.frames])
aggregated row counts: [10, 10]
separate   row counts: [3, 3, 3, 1, 3, 3, 3, 1]
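The relationship between the two lists of row counts can be sketched in plain Python. This illustrates the grouping idea only and is not pyodc's implementation: the function aggregate_row_counts is hypothetical, compatibility is approximated as "identical tuple of column names", and the handling of max_aggregated (start a new logical frame when adding the next real frame would exceed the limit) is an assumption.

```python
def aggregate_row_counts(frames, max_aggregated=None):
    """Group consecutive compatible frames and return the row count of each group.

    `frames` is a list of (columns, nrows) pairs describing the real frames.
    """
    logical = []  # each entry is [columns, nrows] for one logical frame
    for columns, nrows in frames:
        if logical:
            last_columns, last_rows = logical[-1]
            fits = max_aggregated is None or last_rows + nrows <= max_aggregated
            if last_columns == columns and fits:
                # Compatible and within the limit: extend the current logical frame.
                logical[-1][1] += nrows
                continue
        logical.append([columns, nrows])
    return [nrows for _, nrows in logical]


cols_1 = ('expver', 'date@hdr', 'statid@hdr', 'wigos@hdr', 'obsvalue@body',
          'integer_missing', 'double_missing')
cols_2 = ('expver', 'date@hdr', 'statid@hdr', 'obsvalue@body')

# Two data sets of 10 rows each, written with 3 rows per frame.
real_frames = [(cols_1, n) for n in (3, 3, 3, 1)] + [(cols_2, n) for n in (3, 3, 3, 1)]

print(aggregate_row_counts(real_frames))
print(aggregate_row_counts(real_frames, max_aggregated=6))
```

Under these assumptions the first call yields the aggregated row counts [10, 10] shown above, while the second splits each logical frame again once six rows have been collected.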

By default, data is decoded in an aggregated fashion:

[8]:
for idx, df_decoded in enumerate(odc.read_odb('example-5.odb')):
   if idx > 0: print()
   print('Decoded data frame:', idx)
   print(df_decoded)
Decoded data frame: 0
   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0       1  20210420     stat00  0-12345-0-67890         0.0000
1       1  20210420     stat01  0-12345-0-67891        12.3456
2       1  20210420     stat02  0-12345-0-67892        24.6912
0       1  20210420     stat03  0-12345-0-67893        37.0368
1       1  20210420     stat04  0-12345-0-67894        49.3824
2       1  20210420     stat05  0-12345-0-67895        61.7280
0       1  20210420     stat06  0-12345-0-67896        74.0736
1       1  20210420     stat07  0-12345-0-67897        86.4192
2       1  20210420     stat08  0-12345-0-67898        98.7648
0       1  20210420     stat09         0-12345-       111.1104

   integer_missing  double_missing
0           1234.0           12.34
1           4321.0           43.21
2              NaN             NaN
0           1234.0           12.34
1           4321.0           43.21
2              NaN             NaN
0           1234.0           12.34
1           4321.0           43.21
2              NaN             NaN
0           1234.0           12.34

Decoded data frame: 1
   expver  date@hdr statid@hdr  obsvalue@body
0       2  20210420     stat00         0.0000
1       2  20210420     stat01        12.3456
2       2  20210420     stat02        24.6912
0       2  20210420     stat03        37.0368
1       2  20210420     stat04        49.3824
2       2  20210420     stat05        61.7280
0       2  20210420     stat06        74.0736
1       2  20210420     stat07        86.4192
2       2  20210420     stat08        98.7648
0       2  20210420     stat09       111.1104

However, the real frames can also be decoded separately:

[9]:
for idx, df_decoded in enumerate(odc.read_odb('example-5.odb', aggregated=False)):
   if idx > 0: print()
   print('Decoded data frame:', idx)
   print(df_decoded)
Decoded data frame: 0
   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0       1  20210420     stat00  0-12345-0-67890         0.0000
1       1  20210420     stat01  0-12345-0-67891        12.3456
2       1  20210420     stat02  0-12345-0-67892        24.6912

   integer_missing  double_missing
0           1234.0           12.34
1           4321.0           43.21
2              NaN             NaN

Decoded data frame: 1
   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0       1  20210420     stat03  0-12345-0-67893        37.0368
1       1  20210420     stat04  0-12345-0-67894        49.3824
2       1  20210420     stat05  0-12345-0-67895        61.7280

   integer_missing  double_missing
0           1234.0           12.34
1           4321.0           43.21
2              NaN             NaN

Decoded data frame: 2
   expver  date@hdr statid@hdr        wigos@hdr  obsvalue@body  \
0       1  20210420     stat06  0-12345-0-67896        74.0736
1       1  20210420     stat07  0-12345-0-67897        86.4192
2       1  20210420     stat08  0-12345-0-67898        98.7648

   integer_missing  double_missing
0           1234.0           12.34
1           4321.0           43.21
2              NaN             NaN

Decoded data frame: 3
   expver  date@hdr statid@hdr wigos@hdr  obsvalue@body  integer_missing  \
0       1  20210420     stat09  0-12345-       111.1104             1234

   double_missing
0           12.34

Decoded data frame: 4
   expver  date@hdr statid@hdr  obsvalue@body
0       2  20210420     stat00         0.0000
1       2  20210420     stat01        12.3456
2       2  20210420     stat02        24.6912

Decoded data frame: 5
   expver  date@hdr statid@hdr  obsvalue@body
0       2  20210420     stat03        37.0368
1       2  20210420     stat04        49.3824
2       2  20210420     stat05        61.7280

Decoded data frame: 6
   expver  date@hdr statid@hdr  obsvalue@body
0       2  20210420     stat06        74.0736
1       2  20210420     stat07        86.4192
2       2  20210420     stat08        98.7648

Decoded data frame: 7
   expver  date@hdr statid@hdr  obsvalue@body
0       2  20210420     stat09       111.1104