PyTOA5: Utilities for TOA5 Files

This library contains routines for the processing of data files in the TOA5 format. Since this format is basically a CSV file with a specific header, this library primarily provides functions to handle the header; the rest of the file can be read with Python’s csv module. A function to read a TOA5 file into a Pandas DataFrame is also provided.


TL;DR

Code examples with csv: toa5.read_header()
Code example with pandas: toa5.read_pandas()
Command-Line TOA5-to-CSV Tool

Documentation


TOA5 files are essentially CSV files that have four header rows:

The “environment line”: EnvironmentLine
The column header names: ColumnHeader.name
The columns’ units: ColumnHeader.unit
The columns’ “data process”: ColumnHeader.prc
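As a rough illustration of this layout, the following sketch splits a TOA5 text into its four header rows and its data rows using only Python's csv module. The file contents are hypothetical and merely modeled on the example output shown in this documentation. The sketch also assumes that the environment line starts with the literal format tag TOA5 followed by the seven EnvironmentLine fields. It is not the library's implementation and performs none of its validation.

```python
import csv
import io

# Hypothetical file contents, modeled on the example output in this documentation.
SAMPLE = (
    '"TOA5","TestLogger","CR1000X","12342","CR1000X.Std.03.02",'
    '"CPU:TestLogger.CR1X","2438","Example"\r\n'
    '"TIMESTAMP","RECORD","BattV_Min"\r\n'
    '"TS","RN","Volts"\r\n'
    '"","","Min"\r\n'
    '"2021-06-19 00:00:00",0,12.99\r\n'
    '"2021-06-20 00:00:00",1,12.96\r\n'
)

reader = csv.reader(io.StringIO(SAMPLE, newline=''), strict=True)

env_row = next(reader)  # 1. the "environment line"
names = next(reader)    # 2. column header names
units = next(reader)    # 3. column units
prcs = next(reader)     # 4. column "data process"

# One (name, unit, prc) triple per column, analogous to ColumnHeader.
columns = list(zip(names, units, prcs))

# Everything that remains in the reader is ordinary CSV data.
data_rows = list(reader)

print(env_row)
print(columns)
print(data_rows)
```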

The following two functions can be used to read files with this header:

toa5.read_header(csv_reader: Iterator[Sequence[str]], *, allow_dupes: bool = False) → tuple[EnvironmentLine, tuple[ColumnHeader, ...]]

Read the header of a TOA5 file.

A typical way to read a TOA5 file is shown in the following example; the only difference from reading a regular CSV file is the additional call to this function.

>>> import csv, toa5
>>> with open('Example.dat', encoding='ASCII', newline='') as fh:
...     csv_rd = csv.reader(fh, strict=True)
...     env_line, columns = toa5.read_header(csv_rd)
...     print([ toa5.short_name(col) for col in columns ])
...     for row in csv_rd:
...         print(row)
['TIMESTAMP', 'RECORD', 'BattV_Min[V]']
['2021-06-19 00:00:00', '0', '12.99']
['2021-06-20 00:00:00', '1', '12.96']

This also works with csv.DictReader:

>>> import csv, toa5
>>> with open('Example.dat', encoding='ASCII', newline='') as fh:
...     env_line, columns = toa5.read_header(csv.reader(fh, strict=True))
...     for row in csv.DictReader(fh, strict=True,
...             fieldnames=[toa5.short_name(col) for col in columns]):
...         print(row)
{'TIMESTAMP': '2021-06-19 00:00:00', 'RECORD': '0', 'BattV_Min[V]': '12.99'}
{'TIMESTAMP': '2021-06-20 00:00:00', 'RECORD': '1', 'BattV_Min[V]': '12.96'}

See also:

short_name(), used in the examples above, is an alias for default_col_hdr_transform().

Parameters:

csv_reader – The csv.reader() object to read the header rows from. Only the header is read from the file, so after you call this function, you can use the reader to read the data rows from the input file.
allow_dupes – Whether or not to allow duplicates in the ColumnHeader.name values.

Returns:

Returns an EnvironmentLine object and a tuple of ColumnHeader objects.

Raises:

Toa5Error – In case any error is encountered while reading the TOA5 header.

toa5.read_pandas(filepath_or_buffer, *, encoding: str = 'UTF-8', encoding_errors: str = 'strict', col_trans: Callable[[ColumnHeader], str] = default_col_hdr_transform, **kwargs)

A helper function to read TOA5 files into a pandas.DataFrame. Uses read_header() and pandas.read_csv() internally.

>>> import toa5, pandas
>>> df = toa5.read_pandas('Example.dat', low_memory=False)
>>> print(df)  
            RECORD  BattV_Min[V]
TIMESTAMP                       
2021-06-19       0         12.99
2021-06-20       1         12.96
>>> print(df.attrs['toa5_env_line'])  
EnvironmentLine(station_name='TestLogger', logger_model='CR1000X',
    logger_serial='12342', logger_os='CR1000X.Std.03.02',
    program_name='CPU:TestLogger.CR1X', program_sig='2438',
    table_name='Example')

Parameters:

filepath_or_buffer – A filename or file object from which to read the TOA5 data.

Note

Unlike pandas.read_csv(), URLs are not accepted; only filenames that Python’s open() accepts are supported.

col_trans – The ColumnHeaderTransformer used to convert the ColumnHeader objects into column names. Defaults to default_col_hdr_transform().
kwargs – Any additional keyword arguments are passed through to pandas.read_csv(). It is not recommended to set header and names, since they are provided by this function. Other options that this function provides by default, such as na_values or index_col, may be overridden.

Returns:

A pandas.DataFrame. The EnvironmentLine is stored in pandas.DataFrame.attrs under the key "toa5_env_line".

Note

At the time of writing, pandas.DataFrame.attrs is documented as being experimental.

class toa5.EnvironmentLine(station_name: str, logger_model: str, logger_serial: str, logger_os: str, program_name: str, program_sig: str, table_name: str)

Named tuple representing a TOA5 “Environment Line”, giving details about the data logger and its program.

station_name: str – Station (data logger) name

logger_model: str – Model number of the data logger

logger_serial: str – Serial number of the data logger

logger_os: str – Data logger operating system and version

program_name: str – The name of the program on the data logger

program_sig: str – The program’s signature (checksum)

table_name: str – The name of the table contained in this TOA5 file

class toa5.ColumnHeader(name: str, unit: str = '', prc: str = '')

Named tuple representing a column header.

This class represents a column header as it would be read from a text (CSV) file; therefore, empty optional fields are represented by empty strings, not by None.

name: str – Column name

unit: str – Scientific/engineering units (optional)

prc: str – “Data process” (optional; examples: "Smp", "Avg", "Max", etc.)

simple_checks(*, strict: bool = True) → str

Validates the values in this object against some rules mostly derived from experience:

name must start with a letter, an underscore, or a dollar sign, and may otherwise consist only of letters, numbers, underscores, and dollar signs, optionally followed by indices (integers separated by commas) in parentheses. May not be longer than 255 characters in total.
unit is fairly lenient and currently allows most printable ASCII characters except backslash. May not be longer than 64 characters in total.
prc is fairly strict and currently allows only up to 32 letters, numbers, underscores, and dashes.
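The name and prc rules above could be approximated with regular expressions as in the following sketch. The patterns, the function name check_column, and the restriction to ASCII letters are assumptions made for illustration; the library's own checks may differ in detail, and the unit rule is left out because the text does not pin down its character set exactly.

```python
import re

# Approximations of the documented rules, NOT the library's own patterns.
# Name: starts with a letter, "_" or "$"; then letters, digits, "_" or "$";
# optionally followed by indices such as "(1)" or "(1,2)".
NAME_RE = re.compile(r'[A-Za-z_$][A-Za-z0-9_$]*(?:\([0-9]+(?:,[0-9]+)*\))?')
# Data process: up to 32 letters, digits, underscores, and dashes.
PRC_RE = re.compile(r'[A-Za-z0-9_-]{0,32}')


def check_column(name: str, prc: str = '', *, strict: bool = True) -> str:
    """Hypothetical helper mirroring the documented return/raise behavior."""
    problems = []
    if len(name) > 255 or not NAME_RE.fullmatch(name):
        problems.append(f"unusual name {name!r}")
    if not PRC_RE.fullmatch(prc):
        problems.append(f"unusual prc {prc!r}")
    message = '; '.join(problems)
    if strict and message:
        raise ValueError(message)
    return message  # empty string when no problems were found


print(repr(check_column('BattV_Min', 'Min')))
print(repr(check_column('1abc', strict=False)))
```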

Important

Since these rules are derived from experience, they may be adapted in the future, and they may not accurately reflect the rules your data logger imposes on the values. This should normally not be a problem, because within this library, this function is currently only used to generate warnings in default_col_hdr_transform(), and you are free to disable its strict option.

Please feel free to suggest changes.

Parameters: strict – Whether or not to raise an error for invalid values.
Returns: Returns the empty string if there are no problems. When strict is off and problems are detected, returns a string describing the problems.
Raises: ValueError – When strict is on and any unusual values are detected.

toa5.write_header(env_line: EnvironmentLine, columns: Sequence[ColumnHeader]) → Generator[Sequence[str], None, None]

Convert an EnvironmentLine and a sequence of ColumnHeader objects back into the four TOA5 header rows, suitable for passing to e.g. writerows().

toa5.ColumnHeaderTransformer

A type for a function that takes a ColumnHeader and turns it into a single string. See default_col_hdr_transform().
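To illustrate the shape of such a function, here is a minimal sketch of a custom transformer. The ColumnHeader class below is only a stand-in with the same three fields as toa5.ColumnHeader, so that the example runs without the library, and name_and_unit is a hypothetical function, not part of the library.

```python
from typing import Callable, NamedTuple


class ColumnHeader(NamedTuple):
    """Stand-in with the same fields as toa5.ColumnHeader (illustration only)."""
    name: str
    unit: str = ''
    prc: str = ''


def name_and_unit(col: ColumnHeader) -> str:
    """Hypothetical transformer: 'name (unit)', ignoring the data process."""
    name = col.name.strip()
    unit = col.unit.strip()
    return f"{name} ({unit})" if unit else name


# Any callable with this signature fits the transformer type.
transformer: Callable[[ColumnHeader], str] = name_and_unit

cols = [
    ColumnHeader('TIMESTAMP', 'TS'),
    ColumnHeader('RECORD', 'RN'),
    ColumnHeader('BattV_Min', 'Volts', 'Min'),
]
labels = [transformer(col) for col in cols]
print(labels)  # ['TIMESTAMP (TS)', 'RECORD (RN)', 'BattV_Min (Volts)']
```

Since the col_trans parameter of read_pandas() is documented as a ColumnHeaderTransformer, a function of this shape written against the real toa5.ColumnHeader should be usable there.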

toa5.default_col_hdr_transform(col: ColumnHeader, *, short_units: dict[str, str] | None = None, strict: bool = True) → str

The default function used to transform a ColumnHeader into a single string.

This conversion is slightly opinionated and will:

strip all whitespace from ColumnHeader values,
append ColumnHeader.prc to the name with a slash (unless the name already ends with it),
append the units in square brackets (and shorten some units), and
ignore the “TS” and “RN” units on the “TIMESTAMP” and “RECORD” columns, respectively.

Parameters:

col – The ColumnHeader to process.
short_units – A lookup table in which the keys are the original unit names as they appear in the TOA5 file, and the values are a shorter version of that unit. If not provided, defaults to SHORTER_UNITS.
strict – When this is enabled (the default), raise a ValueError if the column name contains the characters /[], which might cause duplicate column names in a table, and warn if ColumnHeader.simple_checks() fails.
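The conversion rules listed above can be sketched as follows. This is an approximation written from the description, not the library's code: the function name sketch_transform and the unit table are hypothetical (the real SHORTER_UNITS may contain different entries), whitespace is only stripped at the ends of each value, and the strict checks are omitted.

```python
# Hypothetical unit table; the real toa5.SHORTER_UNITS may differ.
SHORT_UNITS = {'Volts': 'V'}


def sketch_transform(name: str, unit: str = '', prc: str = '') -> str:
    """Approximate the documented rules for building a single column name."""
    # Strip surrounding whitespace from all values.
    name, unit, prc = name.strip(), unit.strip(), prc.strip()
    # Ignore the implied units of the two standard columns.
    if (name, unit) in (('TIMESTAMP', 'TS'), ('RECORD', 'RN')):
        unit = ''
    # Append the data process with a slash, unless the name already ends with it.
    if prc and not name.endswith(prc):
        name += '/' + prc
    # Append the (possibly shortened) units in square brackets.
    if unit:
        name += '[' + SHORT_UNITS.get(unit, unit) + ']'
    return name


print(sketch_transform('TIMESTAMP', 'TS'))
print(sketch_transform('BattV_Min', 'Volts', 'Min'))
print(sketch_transform('Temp', 'K', 'Avg'))
```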

toa5.short_name(col: ColumnHeader, *, short_units: dict[str, str] | None = None, strict: bool = True) → str

A short alias for default_col_hdr_transform().

toa5.SHORTER_UNITS: dict[str, str]

A table of shorter versions of common units, used in default_col_hdr_transform().

toa5.sql_col_hdr_transform(col: ColumnHeader) → str

An alternative function that transforms a ColumnHeader to a string suitable for use in SQL.

appends ColumnHeader.prc (unless the name already ends with it)
any characters that are not ASCII letters or numbers are converted to underscores (and consecutive underscores are reduced to a single one)
the returned name is all-lowercase
units are omitted (these could be stored in an SQL column comment, for example)

Warning

This transformation can result in two columns of the same table having the same name. For example, ColumnHeader("Test_1","Volts","Smp") and ColumnHeader("Test(1)","","Smp") would both result in "test_1_smp".

Therefore, it is strongly recommended that you check for duplicate column names after using this transformer. For example, see more_itertools.classify_unique().

Parameters: col – The ColumnHeader to process.
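One simple way to perform the duplicate check recommended in the warning, using only the standard library, is to count the transformed names. The helper name find_duplicates is hypothetical, and the list of names stands in for the results of transforming a table's columns.

```python
from collections import Counter


def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, count in counts.items() if count > 1]


# Stand-in for the transformed column names of one table.
sql_names = ['timestamp', 'record', 'test_1_smp', 'test_1_smp']

dupes = find_duplicates(sql_names)
if dupes:
    print(f"duplicate column names: {dupes}")
```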

exception toa5.Toa5Error

An error class for read_header().

Command-Line TOA5-to-CSV Tool

The following is a command-line interface to convert a TOA5 file’s headers to a single row, which makes it more suitable for processing in other programs that expect CSV files with a single header row.

If this module and its scripts have been installed correctly, you should be able to run toa5-to-csv --help or python -m toa5.to_csv --help for details.

usage: toa5.to_csv [-h] [-o OUT_FILE] [-l ENV_LINE_FILE]
                   [-d {excel,excel-tab,unix}] [-n] [-s] [-a] [-e IN_ENCODING]
                   [-c OUT_ENCODING] [-t] [-j]
                   [TOA5FILE]

TOA5 to CSV Converter

positional arguments:
  TOA5FILE              The TOA5 file to process ("-"=STDIN)

options:
  -h, --help            show this help message and exit
  -o, --out-file OUT_FILE
                        Output filename ("-"=STDOUT)
  -l, --env-line ENV_LINE_FILE
                        JSON file for environment line ("-"=STDOUT)
  -d, --out-dialect {excel,excel-tab,unix}
                        Output CSV dialect (see Python `csv` module)
  -n, --simple-names    Simpler column names (no units etc.)
  -s, --sql-names       Transform column names to be suitable for SQL
  -a, --allow-dupes     Allow duplicate column names (in input and output)
  -e, --in-encoding IN_ENCODING
                        Input file encoding (default UTF-8)
  -c, --out-encoding OUT_ENCODING
                        Output file encoding (default UTF-8)
  -t, --require-timestamp
                        Require first column to be TIMESTAMP
  -j, --allow-jagged    Allow rows to have differing column counts

Details can be found at https://haukex.github.io/pytoa5/
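Once a file has a single header row, generic CSV tooling can consume it directly. The sketch below reads such a file with csv.DictReader, which takes its field names from the first row. The converted contents are hypothetical, modeled on the column names in the examples above rather than captured from the tool.

```python
import csv
import io

# Hypothetical converter output with a single header row.
CONVERTED = (
    'TIMESTAMP,RECORD,BattV_Min[V]\r\n'
    '2021-06-19 00:00:00,0,12.99\r\n'
    '2021-06-20 00:00:00,1,12.96\r\n'
)

# Without explicit fieldnames, DictReader uses the first row as the header.
rows = list(csv.DictReader(io.StringIO(CONVERTED, newline='')))

for row in rows:
    print(row['TIMESTAMP'], row['BattV_Min[V]'])
```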

Changelog

v0.9.2 - 2024-10-21

Added toa5.ColumnHeader.simple_checks()
Potentially incompatible changes:
Added strict to toa5.default_col_hdr_transform() and enabled it by default, so the characters /[] are now not allowed in column names
toa5.default_col_hdr_transform() now strips whitespace
toa5.default_col_hdr_transform() and toa5.sql_col_hdr_transform() now no longer drop “Smp” from toa5.ColumnHeader.prc
Therefore, temporarily marked this project as “Beta”

v0.9.1 - 2024-10-19

Actually allow overriding toa5.read_pandas() arguments (didn’t work as documented)
Made toa5.read_pandas() arguments more flexible: accept filename as well, and allow overriding all arguments.
Added --sql-names and --allow-dupes to CLI
A few documentation updates.

v0.9.0 - 2024-10-18

Initial release

Author, Copyright, and License

This library is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program. If not, see https://www.gnu.org/licenses/