PyTOA5: Utilities for TOA5 Files

This library contains routines for the processing of data files in the TOA5 format. Since this format is basically a CSV file with a specific header, this library primarily provides functions to handle the header; the rest of the file can be read with Python’s csv module. A function to read a TOA5 file into a Pandas DataFrame is also provided.

Documentation

TOA5 files are essentially CSV files that have four header rows:

  1. The “environment line”: EnvironmentLine

  2. The column header names: ColumnHeader.name

  3. The columns’ units: ColumnHeader.unit

  4. The columns’ “data process”: ColumnHeader.prc
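As a rough illustration of this layout, the four header rows can be pulled apart with nothing but Python's csv module. The sample text is made up for this sketch (modeled on the Example.dat output used in this documentation), and the code performs none of the validation that read_header() does.

```python
import csv
import io

# Hypothetical sample data, modeled on the Example.dat output in this documentation.
SAMPLE = (
    '"TOA5","TestLogger","CR1000X","12342","CR1000X.Std.03.02",'
    '"CPU:TestLogger.CR1X","2438","Example"\r\n'
    '"TIMESTAMP","RECORD","BattV_Min"\r\n'
    '"TS","RN","Volts"\r\n'
    '"","","Min"\r\n'
    '"2021-06-19 00:00:00",0,12.99\r\n'
    '"2021-06-20 00:00:00",1,12.96\r\n'
)

reader = csv.reader(io.StringIO(SAMPLE, newline=''), strict=True)
env_row = next(reader)    # 1. environment line; its first field is the "TOA5" marker
names = next(reader)      # 2. column names
units = next(reader)      # 3. units
prcs = next(reader)       # 4. "data process"
data_rows = list(reader)  # everything after the header is ordinary CSV

# Group the three column rows into one (name, unit, prc) triple per column.
columns = list(zip(names, units, prcs))
print(env_row[1:])
print(columns)
```

In practice, read_header() replaces the four next() calls and returns proper EnvironmentLine and ColumnHeader objects.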

The following two functions can be used to read files with this header:

toa5.read_header(csv_reader: Iterator[Sequence[str]], *, allow_dupes: bool = False) -> tuple[EnvironmentLine, tuple[ColumnHeader, ...]]

Read the header of a TOA5 file.

A typical way to read a TOA5 file is shown below; the only difference from reading a regular CSV file is the additional call to this function.

>>> import csv, toa5
>>> with open('Example.dat', encoding='ASCII', newline='') as fh:
...     csv_rd = csv.reader(fh, strict=True)
...     env_line, columns = toa5.read_header(csv_rd)
...     print([ toa5.short_name(col) for col in columns ])
...     for row in csv_rd:
...         print(row)
['TIMESTAMP', 'RECORD', 'BattV_Min[V]']
['2021-06-19 00:00:00', '0', '12.99']
['2021-06-20 00:00:00', '1', '12.96']

This also works with csv.DictReader:

>>> import csv, toa5
>>> with open('Example.dat', encoding='ASCII', newline='') as fh:
...     env_line, columns = toa5.read_header(csv.reader(fh, strict=True))
...     for row in csv.DictReader(fh, strict=True,
...             fieldnames=[toa5.short_name(col) for col in columns]):
...         print(row)
{'TIMESTAMP': '2021-06-19 00:00:00', 'RECORD': '0', 'BattV_Min[V]': '12.99'}
{'TIMESTAMP': '2021-06-20 00:00:00', 'RECORD': '1', 'BattV_Min[V]': '12.96'}

See also:

short_name(), used in the examples above, is an alias for default_col_hdr_transform().

Parameters:
  • csv_reader – The csv.reader() object to read the header rows from. Only the header is read from the file, so after you call this function, you can use the reader to read the data rows from the input file.

  • allow_dupes – Whether or not to allow duplicates in the ColumnHeader.name values.

Returns:

Returns an EnvironmentLine object and a tuple of ColumnHeader objects.

Raises:

Toa5Error – In case any error is encountered while reading the TOA5 header.

toa5.read_pandas(filepath_or_buffer, *, encoding: str = 'UTF-8', encoding_errors: str = 'strict', col_trans: Callable[[ColumnHeader], str] = default_col_hdr_transform, **kwargs)

A helper function to read TOA5 files into a pandas.DataFrame. Uses read_header() and pandas.read_csv() internally.

>>> import toa5, pandas
>>> df = toa5.read_pandas('Example.dat', low_memory=False)
>>> print(df)  
            RECORD  BattV_Min[V]
TIMESTAMP                       
2021-06-19       0         12.99
2021-06-20       1         12.96
>>> print(df.attrs['toa5_env_line'])  
EnvironmentLine(station_name='TestLogger', logger_model='CR1000X',
    logger_serial='12342', logger_os='CR1000X.Std.03.02',
    program_name='CPU:TestLogger.CR1X', program_sig='2438',
    table_name='Example')

Parameters:
  • filepath_or_buffer

    A filename or file object from which to read the TOA5 data.

    Note

    Unlike pandas.read_csv(), URLs are not accepted, only such filenames that Python’s open() accepts.

  • col_trans – The ColumnHeaderTransformer used to convert the ColumnHeader objects into column names. Defaults to default_col_hdr_transform().

  • kwargs – Any additional keyword arguments are passed through to pandas.read_csv(). It is not recommended to set header and names, since they are provided by this function. Other options that this function provides by default, such as na_values or index_col, may be overridden.

Returns:

A pandas.DataFrame. The EnvironmentLine is stored in pandas.DataFrame.attrs under the key "toa5_env_line".

Note

At the time of writing, pandas.DataFrame.attrs is documented as being experimental.

class toa5.EnvironmentLine(station_name: str, logger_model: str, logger_serial: str, logger_os: str, program_name: str, program_sig: str, table_name: str)

Named tuple representing a TOA5 “Environment Line”, giving details about the data logger and its program.

station_name: str

Station (data logger) name

logger_model: str

Model number of the data logger

logger_serial: str

Serial number of the data logger

logger_os: str

Data logger operating system and version

program_name: str

The name of the program on the data logger

program_sig: str

The program’s signature (checksum)

table_name: str

The name of the table contained in this TOA5 file

class toa5.ColumnHeader(name: str, unit: str = '', prc: str = '')

Named tuple representing a column header.

This class represents a column header as it would be read from a text (CSV) file. Therefore, empty optional fields are represented by empty strings, not by None.
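The empty-string convention can be illustrated with a small stand-in class. ColumnHeaderSketch is hypothetical and only mirrors the documented fields and defaults; it is not the library's class.

```python
from typing import NamedTuple

class ColumnHeaderSketch(NamedTuple):
    """Hypothetical stand-in mirroring the documented fields of toa5.ColumnHeader."""
    name: str
    unit: str = ''
    prc: str = ''

col = ColumnHeaderSketch('RECORD', 'RN')
# The omitted "prc" field is an empty string, so a plain truthiness test is enough.
has_prc = bool(col.prc)
print(col)
print(has_prc)
```

Because missing values are never None, code that consumes the fields can call string methods on them without a None check first.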

name: str

Column name.

unit: str

Scientific/engineering units (optional)

prc: str

“Data process” (optional; examples: "Smp", "Avg", "Max", etc.)

simple_checks(*, strict: bool = True) -> str

Validates the values in this object against some rules mostly derived from experience:

  • name must start with a letter, an underscore, or a dollar sign, and otherwise consist only of letters, numbers, underscores, and dollar signs, optionally followed by indices (integers separated by commas) in parentheses. It may not be longer than 255 characters in total.

  • unit is fairly lenient and currently allows most printable ASCII characters except backslash. May not be longer than 64 characters in total.

  • prc is fairly strict and currently allows only up to 32 letters, numbers, underscores, and dashes.
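The rules above could be approximated with regular expressions roughly as follows. This is only a sketch: the patterns and the check_sketch() helper are assumptions made for illustration (for instance, "letters" is taken to mean ASCII letters, and "most printable ASCII" is taken to mean all of it except the backslash), and they are not the patterns the library uses.

```python
import re

# Assumed approximations of the documented rules; not the library's own patterns.
NAME_RE = re.compile(r'[A-Za-z_$][A-Za-z0-9_$]*(?:\([0-9]+(?:,[0-9]+)*\))?')
UNIT_RE = re.compile(r'[\x20-\x5B\x5D-\x7E]*')  # printable ASCII except backslash
PRC_RE = re.compile(r'[A-Za-z0-9_-]*')

def check_sketch(name: str, unit: str = '', prc: str = '') -> str:
    """Return '' if all values look fine, else a description of the problems."""
    problems = []
    if len(name) > 255 or not NAME_RE.fullmatch(name):
        problems.append(f'unusual name {name!r}')
    if len(unit) > 64 or not UNIT_RE.fullmatch(unit):
        problems.append(f'unusual unit {unit!r}')
    if len(prc) > 32 or not PRC_RE.fullmatch(prc):
        problems.append(f'unusual prc {prc!r}')
    return '; '.join(problems)

print(check_sketch('Temp(1,2)', 'Deg C', 'Avg'))
print(check_sketch('1abc', 'V\\m', 'Sm p'))
```

Like simple_checks() with strict turned off, the helper returns an empty string when nothing looks unusual and a description otherwise.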

Important

Since these rules are derived from experience, they may be adapted in the future, and they may not accurately reflect the rules your data logger imposes on the values. This should normally not be a problem, because within this library, this function is currently only used to generate warnings in default_col_hdr_transform(), and you are free to disable its strict option.

Please feel free to suggest changes.

Parameters:

strict – Whether or not to raise an error for invalid values.

Returns:

Returns the empty string if there are no problems. When strict is off and problems are detected, returns a string describing the problems.

Raises:

ValueError – When strict is on and any unusual values are detected.

toa5.write_header(env_line: EnvironmentLine, columns: Sequence[ColumnHeader]) -> Generator[Sequence[str], None, None]

Convert an EnvironmentLine and sequence of ColumnHeader objects back into the four TOA5 header rows, suitable for use in e.g. writerows().
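To show the shape of the rows involved, here is a standard-library sketch that builds four header rows and hands them to csv.writer.writerows(). The header_rows_sketch() generator is a hypothetical stand-in that works on plain tuples; it is not write_header() itself, and the choice of csv.QUOTE_ALL is an assumption made for this example.

```python
import csv
import io

def header_rows_sketch(env_fields, columns):
    """Yield four header rows from 7 environment fields and (name, unit, prc) triples."""
    yield ('TOA5', *env_fields)          # environment line, led by the "TOA5" marker
    yield tuple(c[0] for c in columns)   # column names
    yield tuple(c[1] for c in columns)   # units
    yield tuple(c[2] for c in columns)   # "data process"

env = ('TestLogger', 'CR1000X', '12342', 'CR1000X.Std.03.02',
       'CPU:TestLogger.CR1X', '2438', 'Example')
cols = [('TIMESTAMP', 'TS', ''), ('RECORD', 'RN', ''), ('BattV_Min', 'Volts', 'Min')]

out = io.StringIO(newline='')
writer = csv.writer(out, quoting=csv.QUOTE_ALL)
writer.writerows(header_rows_sketch(env, cols))
lines = out.getvalue().splitlines()
print(out.getvalue())
```

With the library, the same writerows() call would take the generator returned by write_header() instead.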

toa5.ColumnHeaderTransformer

A type for a function that takes a ColumnHeader and turns it into a single string. See default_col_hdr_transform().

toa5.default_col_hdr_transform(col: ColumnHeader, *, short_units: dict[str, str] | None = None, strict: bool = True) -> str

The default function used to transform a ColumnHeader into a single string.

This conversion is slightly opinionated and will:

  • strip all whitespace from ColumnHeader values,

  • append ColumnHeader.prc to the name with a slash (unless the name already ends with it),

  • append the units in square brackets, shortening some units, and

  • ignore the “TS” and “RN” units on the “TIMESTAMP” and “RECORD” columns, respectively.
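The four steps can be sketched in a few lines. This is an approximation under stated assumptions, not the library's implementation: transform_sketch() and SHORT_UNITS_SKETCH are hypothetical names, "strip" is taken to mean removing leading and trailing whitespace, and the "already ends with it" test is assumed to be case-insensitive.

```python
# Hypothetical lookup table; the real defaults live in toa5.SHORTER_UNITS.
SHORT_UNITS_SKETCH = {'Volts': 'V'}

def transform_sketch(name: str, unit: str = '', prc: str = '') -> str:
    # Assumption: "strip" means removing leading and trailing whitespace.
    name, unit, prc = name.strip(), unit.strip(), prc.strip()
    # The "TS" and "RN" units are ignored on these two columns.
    if (name, unit) in (('TIMESTAMP', 'TS'), ('RECORD', 'RN')):
        unit = ''
    # Assumption: the "name already ends with it" test is case-insensitive.
    if prc and not name.lower().endswith(prc.lower()):
        name += '/' + prc
    if unit:
        name += '[' + SHORT_UNITS_SKETCH.get(unit, unit) + ']'
    return name

print(transform_sketch('BattV_Min', 'Volts', 'Min'))
print(transform_sketch('AirT', 'degC', 'Avg'))
print(transform_sketch('TIMESTAMP', 'TS'))
```

The sketch leaves out the strict checks described under Parameters.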

Parameters:
  • col – The ColumnHeader to process.

  • short_units – A lookup table in which the keys are the original unit names as they appear in the TOA5 file, and the values are a shorter version of that unit. If not provided, defaults to SHORTER_UNITS.

  • strict – When this is enabled (the default), raise a ValueError if the column name contains the characters /[], which might cause duplicate column names in a table, and warn if ColumnHeader.simple_checks() fails.

toa5.short_name(col: ColumnHeader, *, short_units: dict[str, str] | None = None, strict: bool = True) -> str

A short alias for default_col_hdr_transform().

toa5.SHORTER_UNITS: dict[str, str]

A table of shorter versions of common units, used in default_col_hdr_transform().

toa5.sql_col_hdr_transform(col: ColumnHeader) -> str

An alternative function that transforms a ColumnHeader into a string suitable for use in SQL. It does the following:

  • appends ColumnHeader.prc to the name (unless the name already ends with it),

  • converts any characters that are not ASCII letters or numbers to underscores (and reduces consecutive underscores to a single one),

  • returns the name in all-lowercase, and

  • omits the units (these could be stored in an SQL column comment, for example).

Warning

This transformation can result in two columns in the same table having the same name. For example, ColumnHeader("Test_1","Volts","Smp") and ColumnHeader("Test(1)","","Smp") both result in "test_1_smp".

Therefore, it is strongly recommended that you check for duplicate column names after using this transformer. For example, see more_itertools.classify_unique().
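The collision described in this warning, together with a simple duplicate check, can be reproduced with the standard library alone. The sketch makes some assumptions: sql_name_sketch() is a hypothetical approximation of the transformation (joining the name and the "data process" with an underscore is a guess), and collections.Counter is used here in place of more_itertools.

```python
import re
from collections import Counter

def sql_name_sketch(name: str, prc: str = '') -> str:
    """Hypothetical approximation of the documented SQL name transformation."""
    # Assumption: name and prc are joined with an underscore.
    if prc and not name.lower().endswith(prc.lower()):
        name = name + '_' + prc
    name = re.sub(r'[^A-Za-z0-9]', '_', name)  # anything else becomes "_"
    name = re.sub(r'_+', '_', name)            # collapse runs of underscores
    return name.lower()

headers = [('Test_1', 'Smp'), ('Test(1)', 'Smp'), ('BattV_Min', 'Min')]
names = [sql_name_sketch(name, prc) for name, prc in headers]
# Any transformed name that occurs more than once is a collision.
dupes = sorted(n for n, count in Counter(names).items() if count > 1)
print(names)
print(dupes)
```

A check of this kind is best run before creating a table, since the collision would otherwise only surface as a database error.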

Parameters:

col – The ColumnHeader to process.

exception toa5.Toa5Error

An error class for read_header().

Command-Line TOA5-to-CSV Tool

The following is a command-line interface to convert a TOA5 file’s headers to a single row, which makes it more suitable for processing in other programs that expect CSV files with a single header row.

If this module and its scripts have been installed correctly, you should be able to run toa5-to-csv --help or python -m toa5.to_csv --help for details.

usage: toa5.to_csv [-h] [-o OUT_FILE] [-l ENV_LINE_FILE]
                   [-d {excel,excel-tab,unix}] [-n] [-s] [-a] [-e IN_ENCODING]
                   [-c OUT_ENCODING] [-t] [-j]
                   [TOA5FILE]

TOA5 to CSV Converter

positional arguments:
  TOA5FILE              The TOA5 file to process ("-"=STDIN)

options:
  -h, --help            show this help message and exit
  -o, --out-file OUT_FILE
                        Output filename ("-"=STDOUT)
  -l, --env-line ENV_LINE_FILE
                        JSON file for environment line ("-"=STDOUT)
  -d, --out-dialect {excel,excel-tab,unix}
                        Output CSV dialect (see Python `csv` module)
  -n, --simple-names    Simpler column names (no units etc.)
  -s, --sql-names       Transform column names to be suitable for SQL
  -a, --allow-dupes     Allow duplicate column names (in input and output)
  -e, --in-encoding IN_ENCODING
                        Input file encoding (default UTF-8)
  -c, --out-encoding OUT_ENCODING
                        Output file encoding (default UTF-8)
  -t, --require-timestamp
                        Require first column to be TIMESTAMP
  -j, --allow-jagged    Allow rows to have differing column counts

Details can be found at https://haukex.github.io/pytoa5/
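Conceptually, the conversion collapses the four header rows into one. The following standard-library sketch shows that idea only, roughly in the spirit of the --simple-names and --out-dialect unix options; the to_single_header_sketch() function and its sample input are hypothetical, and the sketch is not the tool's implementation.

```python
import csv
import io

def to_single_header_sketch(toa5_text: str) -> str:
    """Keep only the column names as the header row; copy the data rows as-is."""
    reader = csv.reader(io.StringIO(toa5_text, newline=''), strict=True)
    next(reader)          # drop the environment line
    names = next(reader)  # keep the column names
    next(reader)          # drop the units
    next(reader)          # drop the "data process" row
    out = io.StringIO(newline='')
    writer = csv.writer(out, dialect='unix')
    writer.writerow(names)
    writer.writerows(reader)
    return out.getvalue()

# Hypothetical sample input for this sketch.
sample = (
    '"TOA5","TestLogger","CR1000X","12342","CR1000X.Std.03.02",'
    '"CPU:TestLogger.CR1X","2438","Example"\r\n'
    '"TIMESTAMP","RECORD","BattV_Min"\r\n'
    '"TS","RN","Volts"\r\n'
    '"","","Min"\r\n'
    '"2021-06-19 00:00:00",0,12.99\r\n'
)
result = to_single_header_sketch(sample)
print(result)
```

The real tool additionally handles encodings, column-name transformations, duplicate names, and the environment line output, as listed in its options.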

Changelog

v0.9.2 - 2024-10-21

v0.9.1 - 2024-10-19

  • Actually allow overriding toa5.read_pandas() arguments (didn’t work as documented)

  • Made toa5.read_pandas() arguments more flexible: accept filename as well, and allow overriding all arguments.

  • Added --sql-names and --allow-dupes to CLI

  • A few documentation updates.

v0.9.0 - 2024-10-18

  • Initial release