comprehensive binary encoding format for dates and times
Temporenc encodes any combination of date, time, and time zone offset information: year, month, day, hour, minute, second, sub-second precision, and UTC offset. Each field is optional.
Temporenc uses a compact binary encoding scheme. Encoded values have a variable size between 3 and 10 bytes. Values of the same type always have the same size.
Encoded temporenc values are self-describing, and can be consumed from a stream without framing. Lexicographically sorting encoded values of the same type puts the values in chronological order.
Temporenc is a comprehensive binary encoding format for dates and times. It provides a low level building block for higher level protocols and file formats.
Temporenc only deals with the encoding of date and time related information, and is designed for embedding into other encoding schemes. The only requirement is that embedding formats have support for arbitrary byte strings. This makes temporenc a perfect companion for encoding schemes that encode arbitrary data structures but lack a flexible date/time type (if any), such as MessagePack, Protocol Buffers, and Thrift. Due to its compactness and ordering properties, temporenc is also a perfect fit for (partial) keys in key/value stores.
The format is very flexible and supports any combination of a date, a time, and a time zone offset. Within each of these components, each field is also optional, e.g. it is possible to encode a year and a month without a day. Times can have sub-second precision using either milliseconds, microseconds, or nanoseconds. Time zones use a UTC offset with 15 minute granularity, allowing any time zone in use in the world to be represented.
Temporenc values have a variable size between 3 and 10 bytes, depending on the components being included. Values of the same type (and precision) always have the same size. For example, an encoded date uses 3 bytes, an encoded time also takes 3 bytes, and an encoded date with time uses 5 bytes. At the other extreme, it takes only 10 bytes to encode a date with time using nanosecond precision and a time zone offset.
Temporenc values are self-describing; consuming applications do not need to know which variant was used for encoding. Since all information needed for decoding can be derived from the first byte, values can be read from streams without framing. Encoded values of the same type (and precision) can be sorted using normal lexicographical sorting routines, i.e. without decoding. Earlier dates sort first, missing values sort last. This makes temporenc values well suited for use in search trees or as (partial) keys in key/value stores.
Temporenc is built around two main concepts: components and types. This specification defines four components, each representing a single aspect of the temporenc date/time model:
Component D (date)
This component contains year, month, and day information. Each field is optional.
Component T (time)
This component contains hour, minute, and second information. Each field is optional.
Component S (sub-second precision)
This is a refinement to the time component that allows for a more precise time representation, expressed as either milliseconds, microseconds, or nanoseconds.
Component Z (time zone offset)
This component specifies the UTC offset.
The above components can be combined to create a complete date/time value. Temporenc defines six types for common combinations. Each type represents a particular combination of date, time, sub-second time precision, and time zone offset information:
D: Date only, encoded as 3 bytes.
T: Time only, encoded as 3 bytes.
DT: Date + time, encoded as 5 bytes.
DTZ: Date + time + time zone offset, encoded as 6 bytes.
DTS: Date + time with sub-second precision, encoded as 6–9 bytes (precision dependent).
DTSZ: Date + time with sub-second precision + time zone offset, encoded as 7–10 bytes (precision dependent).
The canonical type DTSZ contains all components, making it a superset of the other types. Since any component (and each field within) can be left blank, it can represent all possible combinations of components (dates, times, and so on). This makes the DTSZ type the most flexible, but also the most space-consuming.
Applications can use a different type to save space, at the cost of reduced expressiveness. The types are chosen in such a way that both sub-second precision and time zone support are completely optional. By using the correct type the storage overhead for unused components can be eliminated completely, since temporenc uses a different packing format for each type.
This section describes how the components and types of the temporenc model are encoded into a byte string. Encoding is done in two stages: encoding individual components, followed by packing the encoded components together to construct the encoded value as a byte string.
In the first stage, each component is encoded separately, resulting in an array of bits. The rules for encoding components are the same for all types. For representing numbers as bit strings, temporenc always uses unsigned big-endian notation, e.g. encoding the number 13 into 5 bits results in the bit string 01101 (8 + 4 + 1).
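The number-to-bits rule above can be sketched in a few lines of Python. The helper name `to_bits` is hypothetical and not part of any temporenc library; it only illustrates the unsigned big-endian notation used throughout this specification.

```python
def to_bits(value: int, width: int) -> str:
    """Render a non-negative integer as a fixed-width, big-endian bit string."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return format(value, f"0{width}b")


print(to_bits(13, 5))  # prints 01101
```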
The date component (D) always uses 21 bits, divided in three groups:
Year (12 bits)
An integer in the range 0–4094 (both inclusive); the special value 4095 means no value is set.
Month (4 bits)
An integer in the range 0–11 (both inclusive); the special value 15 means no value is set. January is encoded as 0, February as 1, and so on. Note that this is off-by-one compared to human month numbering.
Day (5 bits)
An integer in the range 0–30 (both inclusive); the special value 31 means no value is set. The first day of the month is encoded as 0, the next as 1. Note that this is off-by-one compared to human day numbering.
Example: 1983-01-15 (year, month, day) is encoded as 011110111111 (year), 0000 (month), and 01110 (day).
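As an illustration of the date component rules, the following sketch builds the 21-bit D value as an integer. The function name `encode_date` and its calling convention (human 1-based months and days, `None` for unset fields) are assumptions made for this example, not something the specification prescribes.

```python
from typing import Optional


def encode_date(year: Optional[int] = None,
                month: Optional[int] = None,
                day: Optional[int] = None) -> int:
    """Return the 21-bit D component as an integer; None marks an unset field."""
    if year is not None and not 0 <= year <= 4094:
        raise ValueError("year must be in 0..4094")
    if month is not None and not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    if day is not None and not 1 <= day <= 31:
        raise ValueError("day must be in 1..31")
    y = 4095 if year is None else year
    m = 15 if month is None else month - 1  # stored zero-based
    d = 31 if day is None else day - 1      # stored zero-based
    return (y << 9) | (m << 5) | d          # 12 + 4 + 5 bits


print(format(encode_date(1983, 1, 15), "021b"))  # 011110111111000001110
print(format(encode_date(1983, 1), "021b"))      # 011110111111000011111
```

The second call shows a year and month without a day: the day bits are all ones.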
The time component (T) always uses 17 bits, divided in three groups:
Hour (5 bits)
An integer in the range 0–23 (both inclusive); the special value 31 means no value is set.
Minute (6 bits)
An integer in the range 0–59 (both inclusive); the special value 63 means no value is set.
Second (6 bits)
An integer in the range 0–60 (both inclusive); the special value 63 means no value is set. Note that the value 60 is supported because it is required to correctly represent leap seconds.
Example: 18:25:12 (hour, minute, second) is encoded as 10010 (hour), 011001 (minute), and 001100 (second).
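The time component can be sketched the same way. The name `encode_time` is hypothetical; the field widths and the special "not set" values come from the rules above.

```python
from typing import Optional


def encode_time(hour: Optional[int] = None,
                minute: Optional[int] = None,
                second: Optional[int] = None) -> int:
    """Return the 17-bit T component as an integer; None marks an unset field."""
    if hour is not None and not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if minute is not None and not 0 <= minute <= 59:
        raise ValueError("minute must be in 0..59")
    if second is not None and not 0 <= second <= 60:  # 60 allows leap seconds
        raise ValueError("second must be in 0..60")
    h = 31 if hour is None else hour
    m = 63 if minute is None else minute
    s = 63 if second is None else second
    return (h << 12) | (m << 6) | s  # 5 + 6 + 6 bits


print(format(encode_time(18, 25, 12), "017b"))  # 10010011001001100
```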
The sub-second time precision component (S) is expressed as either milliseconds (ms), microseconds (µs), or nanoseconds (ns). Each precision requires a different number of bits of storage space. This means that unlike the other components, this component uses a variable number of bits, indicated by a 2-bit precision tag, referred to as P.
Milliseconds (10 bits value, 2 bits tag, 12 bits in total)
An integer in the range 0–999 (both inclusive) represented as 10 bits. The precision tag P is 00.
Microseconds (20 bits value, 2 bits tag, 22 bits in total)
An integer in the range 0–999999 (both inclusive) represented as 20 bits. The precision tag P is 01.
Nanoseconds (30 bits value, 2 bits tag, 32 bits in total)
An integer in the range 0–999999999 (both inclusive) represented as 30 bits. The precision tag P is 10.
Empty sub-second precision (0 bits value, 2 bits tag, 2 bits in total)
The precision tag P is 11, and no additional information is encoded. Note that if no sub-second precision time component is required, using a type that does not include this component at all is more space efficient, e.g. DTZ instead of DTSZ.
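A minimal sketch of the S component follows. It returns the precision tag and the value bits separately, because the byte layouts later in this document place P right after the type tag rather than next to the value. The names `PRECISIONS` and `encode_subsecond`, and the unit strings `"ms"`, `"us"`, and `"ns"`, are assumptions for illustration.

```python
from typing import Optional, Tuple

# unit -> (precision tag P, value width in bits, exclusive upper bound)
PRECISIONS = {
    "ms": ("00", 10, 1_000),
    "us": ("01", 20, 1_000_000),
    "ns": ("10", 30, 1_000_000_000),
}


def encode_subsecond(value: Optional[int] = None,
                     unit: Optional[str] = None) -> Tuple[str, str]:
    """Return (precision tag, value bits); unit=None means no sub-second value."""
    if unit is None:
        return "11", ""
    tag, width, limit = PRECISIONS[unit]
    if value is None or not 0 <= value < limit:
        raise ValueError(f"value out of range for {unit}")
    return tag, format(value, f"0{width}b")


print(encode_subsecond(123, "ms"))  # ('00', '0001111011')
print(encode_subsecond())           # ('11', '')
```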
The time zone offset component (Z) always uses 7 bits. When a temporenc type with a time zone offset component is used, the date (D) and time (T) components are stored in UTC. This means that implementations must convert a date/time value to its UTC equivalent first. This ensures that the encoded values can be sorted properly, regardless of their time zone.
Temporenc uses UTC offsets (usually written as ±HH:MM) to represent time zone information. The UTC offset is expressed as the number of 15 minute increments from UTC, with the constant 64 added to it to produce a positive integer, i.e. (offset_in_minutes / 15) + 64. The resulting number must be in the range 0–125 (both inclusive). The special value 127 means no value is set.
The special value 126 means that this value does carry time zone information, but that it is not expressed as an embedded UTC offset. This makes it possible to use more elaborate time zone handling with temporenc values, for example using geographical identifiers from the tzdata project. The actual inclusion of additional time zone information is outside the scope of temporenc; the value 126 is just an indicator that time zone information is handled externally.
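The offset arithmetic can be sketched as follows. The function name `encode_offset` and the constant names are hypothetical; the formula and the special values 126 and 127 are taken from the text above.

```python
from typing import Optional

TZ_NOT_SET = 127   # no time zone information at all
TZ_EXTERNAL = 126  # time zone information exists, but is handled externally


def encode_offset(offset_in_minutes: Optional[int]) -> int:
    """Return the 7-bit Z value for a UTC offset given in minutes."""
    if offset_in_minutes is None:
        return TZ_NOT_SET
    if offset_in_minutes % 15 != 0:
        raise ValueError("offset must be a multiple of 15 minutes")
    value = offset_in_minutes // 15 + 64
    if not 0 <= value <= 125:
        raise ValueError("offset out of range")
    return value


print(encode_offset(60))                  # 68, i.e. +01:00
print(format(encode_offset(60), "07b"))   # 1000100
print(encode_offset(-300))                # 44, i.e. -05:00
```

With this formula, UTC itself (offset 0) is encoded as 64.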
The second encoding stage is about packing the encoded components into the final byte string. An encoded temporenc value is basically a concatenation of the bit strings for each component. The exact packing format depends on the type, which means each type has its own bit packing rules. Each type is assigned a unique type tag, which is a short identifying bit string included in the first byte of the encoded value. The advantages of this approach are:
The table below specifies the type tag for each type, and the order used for the concatenation of the encoded components:
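| Type | Type tag | Precision tag (P) | D | T | S | Z | Padding |
| D | 100 | | ✓ | | | | |
| T | 1010000 | | | ✓ | | | |
| DT | 00 | | ✓ | ✓ | | | |
| DTZ | 110 | | ✓ | ✓ | | ✓ | |
| DTS | 01 | ✓ | ✓ | ✓ | ✓ | | ✓ (if needed) |
| DTSZ | 111 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ (if needed) |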
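Because no type tag is a prefix of another, a decoder can tell the type, and therefore the total length, from the first byte alone. The sketch below uses the type tags from the per-type layouts in this section (100, 1010000, 00, 110, 01, 111) and the sizes stated for each precision. The function names `encoded_length` and `split_stream` are hypothetical and only illustrate how values could be read from a stream without framing.

```python
from typing import Iterator


def encoded_length(first_byte: int) -> int:
    """Return the total size in bytes of the value that starts with first_byte."""
    if first_byte >> 6 == 0b00:             # DT
        return 5
    if first_byte >> 6 == 0b01:             # DTS: tag 01, then P
        return (7, 8, 9, 6)[(first_byte >> 4) & 0b11]
    if first_byte >> 5 == 0b100:            # D
        return 3
    if first_byte >> 1 == 0b1010000:        # T
        return 3
    if first_byte >> 5 == 0b110:            # DTZ
        return 6
    if first_byte >> 5 == 0b111:            # DTSZ: tag 111, then P
        return (8, 9, 10, 7)[(first_byte >> 3) & 0b11]
    raise ValueError(f"unknown type tag in byte 0x{first_byte:02x}")


def split_stream(data: bytes) -> Iterator[bytes]:
    """Yield consecutive encoded values from a buffer of concatenated values."""
    pos = 0
    while pos < len(data):
        size = encoded_length(data[pos])
        if pos + size > len(data):
            raise ValueError("truncated value at end of input")
        yield data[pos:pos + size]
        pos += size


stream = bytes.fromhex("8f7e0e" "1efc1d264c" "fbdf83a2c99100")
print([value.hex() for value in split_stream(stream)])
# ['8f7e0e', '1efc1d264c', 'fbdf83a2c99100']
```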
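The general approach for creating the final byte strings, as detailed in the next subsection, is as follows: start with the type tag, append the precision tag P for types that include the S component, append the encoded components in the order D, T, S, Z, and finally zero-pad on the right up to the next byte boundary if needed.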
The remainder of this section specifies the exact byte layout for each encoded temporenc type, including examples showing both bit strings and bytes (hexadecimal notation).
Type D (date only): the type tag is 100. Encoded values use 3 bytes in this format:
100DDDDD DDDDDDDD DDDDDDDD
Example: 1983-01-15 is encoded as 10001111 01111110 00001110 (bits) or 8f 7e 0e (hex bytes).
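A minimal sketch of packing a type D value, assuming a hypothetical helper named `pack_d` that takes human 1-based months and days and `None` for unset fields. It also demonstrates the ordering property: sorting the raw bytes gives chronological order, with a missing day sorting after every day of the same month.

```python
from typing import Optional


def pack_d(year: Optional[int] = None,
           month: Optional[int] = None,
           day: Optional[int] = None) -> bytes:
    """Pack a type D value: 3-bit tag 100 followed by the 21-bit date component."""
    y = 4095 if year is None else year
    m = 15 if month is None else month - 1  # stored zero-based
    d = 31 if day is None else day - 1      # stored zero-based
    n = (0b100 << 21) | (y << 9) | (m << 5) | d
    return n.to_bytes(3, "big")


print(pack_d(1983, 1, 15).hex())  # 8f7e0e

# Plain byte-wise sorting yields chronological order.
values = [pack_d(2001, 9, 9), pack_d(1983, 1), pack_d(1983, 1, 15), pack_d(1970, 1, 1)]
ordered = sorted(values)
```

Range validation is omitted here for brevity; a real encoder would reject out-of-range fields before packing.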
Type T (time only): the type tag is 1010000. Encoded values use 3 bytes in this format:
1010000T TTTTTTTT TTTTTTTT
Example: 18:25:12 is encoded as 10100001 00100110 01001100 (bits) or a1 26 4c (hex bytes).
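Decoding works in the opposite direction. The sketch below, with the hypothetical name `unpack_t`, checks the 7-bit type tag and extracts the three fields with shifts and masks, mapping the special "not set" values to `None`. It does not reject other out-of-range field values, which a full decoder would probably want to do.

```python
from typing import Optional, Tuple


def unpack_t(data: bytes) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Decode a 3-byte type T value into (hour, minute, second)."""
    if len(data) != 3:
        raise ValueError("type T values are exactly 3 bytes")
    n = int.from_bytes(data, "big")
    if n >> 17 != 0b1010000:
        raise ValueError("not a type T value")
    hour = (n >> 12) & 0b11111
    minute = (n >> 6) & 0b111111
    second = n & 0b111111
    return (
        None if hour == 31 else hour,
        None if minute == 63 else minute,
        None if second == 63 else second,
    )


print(unpack_t(bytes.fromhex("a1264c")))  # (18, 25, 12)
```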
Type DT (date + time): the type tag is 00. Encoded values use 5 bytes in this format:
00DDDDDD DDDDDDDD DDDDDDDT TTTTTTTT TTTTTTTT
Example: 1983-01-15T18:25:12 is encoded as 00011110 11111100 00011101 00100110 01001100 (bits) or 1e fc 1d 26 4c (hex bytes).
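For complete values, a type DT encoder can take a standard `datetime.datetime` directly. This is a sketch under the assumption that all six fields are present. The name `pack_dt` is hypothetical, and the object's `microsecond` and `tzinfo` attributes are deliberately ignored because type DT has no place for them.

```python
import datetime


def pack_dt(value: datetime.datetime) -> bytes:
    """Pack a complete date and time as type DT (2-bit tag 00, then D and T)."""
    if value.year > 4094:
        raise ValueError("year must be in 0..4094")
    d = (value.year << 9) | ((value.month - 1) << 5) | (value.day - 1)
    t = (value.hour << 12) | (value.minute << 6) | value.second
    n = (0b00 << 38) | (d << 17) | t  # 2 + 21 + 17 = 40 bits
    return n.to_bytes(5, "big")


print(pack_dt(datetime.datetime(1983, 1, 15, 18, 25, 12)).hex())  # 1efc1d264c
```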
Type DTZ (date + time + time zone offset): the type tag is 110. Encoded values use 6 bytes in this format:
110DDDDD DDDDDDDD DDDDDDDD TTTTTTTT TTTTTTTT TZZZZZZZ
Note that the D and T components must be stored as UTC.
Example: 1983-01-15T18:25:12+01:00 is encoded as 11001111 01111110 00001110 10001011 00100110 01000100 (bits) or cf 7e 0e 8b 26 44 (hex bytes).
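The UTC conversion is the step that is easiest to get wrong, so here is a sketch of a type DTZ encoder for timezone-aware `datetime.datetime` objects. The name `pack_dtz` is hypothetical, and partial values are not handled.

```python
import datetime


def pack_dtz(value: datetime.datetime) -> bytes:
    """Pack an aware datetime as type DTZ; D and T are stored in UTC."""
    offset = value.utcoffset()
    if offset is None:
        raise ValueError("an aware datetime is required")
    increments, remainder = divmod(offset, datetime.timedelta(minutes=15))
    if remainder:
        raise ValueError("offset must be a multiple of 15 minutes")
    z = increments + 64
    if not 0 <= z <= 125:
        raise ValueError("offset out of range")
    utc = value.astimezone(datetime.timezone.utc)  # convert before encoding
    if utc.year > 4094:
        raise ValueError("year must be in 0..4094")
    d = (utc.year << 9) | ((utc.month - 1) << 5) | (utc.day - 1)
    t = (utc.hour << 12) | (utc.minute << 6) | utc.second
    n = (0b110 << 45) | (d << 24) | (t << 7) | z  # 3 + 21 + 17 + 7 = 48 bits
    return n.to_bytes(6, "big")


cet = datetime.timezone(datetime.timedelta(hours=1))
print(pack_dtz(datetime.datetime(1983, 1, 15, 18, 25, 12, tzinfo=cet)).hex())
# cf7e0e8b2644
```

Because the UTC instant comes first in the byte string, values with different offsets still sort by the moment they represent.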
Type DTS (date + time with sub-second precision): the type tag is 01, followed by the precision tag P. Values are zero-padded on the right up to the next byte boundary.
For millisecond (ms) precision, encoded values use 7 bytes in this format:
01PPDDDD DDDDDDDD DDDDDDDD DTTTTTTT TTTTTTTT TTSSSSSS SSSS0000
Example: 1983-01-15T18:25:12.123 (millisecond precision) is encoded as 01000111 10111111 00000111 01001001 10010011 00000111 10110000 (bits) or 47 bf 07 49 93 07 b0 (hex bytes).
For microsecond (µs) precision, encoded values use 8 bytes in this format:
01PPDDDD DDDDDDDD DDDDDDDD DTTTTTTT TTTTTTTT TTSSSSSS SSSSSSSS SSSSSS00
Example: 1983-01-15T18:25:12.123456 (microsecond precision) is encoded as 01010111 10111111 00000111 01001001 10010011 00000111 10001001 00000000 (bits) or 57 bf 07 49 93 07 89 00 (hex bytes).
For nanosecond (ns) precision, encoded values use 9 bytes in this format:
01PPDDDD DDDDDDDD DDDDDDDD DTTTTTTT TTTTTTTT TTSSSSSS SSSSSSSS SSSSSSSS SSSSSSSS
Example: 1983-01-15T18:25:12.123456789 (nanosecond precision) is encoded as 01100111 10111111 00000111 01001001 10010011 00000111 01011011 11001101 00010101 (bits) or 67 bf 07 49 93 07 5b cd 15 (hex bytes).
In case the sub-second precision component has no value, encoded values use 6 bytes in this format:
01PPDDDD DDDDDDDD DDDDDDDD DTTTTTTT TTTTTTTT TT000000
Example: 1983-01-15T18:25:12 (no precision) is encoded as 01110111 10111111 00000111 01001001 10010011 00000000 (bits) or 77 bf 07 49 93 00 (hex bytes).
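The four DTS layouts differ only in the precision tag, the width of S, and the amount of padding, so one routine can cover them all. The sketch below concatenates bit strings and pads to a whole number of bytes. The names `PRECISIONS`, `bits_to_bytes`, and `pack_dts` and the unit strings are assumptions for this example. It expects all date and time fields to be present and in range, and it validates only the sub-second value.

```python
from typing import Optional

# unit -> (precision tag P, value width in bits, exclusive upper bound)
PRECISIONS = {
    None: ("11", 0, 0),
    "ms": ("00", 10, 1_000),
    "us": ("01", 20, 1_000_000),
    "ns": ("10", 30, 1_000_000_000),
}


def bits_to_bytes(bits: str) -> bytes:
    """Zero-pad a bit string on the right to a byte boundary and convert it."""
    bits += "0" * (-len(bits) % 8)
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


def pack_dts(year: int, month: int, day: int,
             hour: int, minute: int, second: int,
             subsecond: Optional[int] = None,
             unit: Optional[str] = None) -> bytes:
    """Pack a complete date and time with optional sub-second value as type DTS."""
    p, width, limit = PRECISIONS[unit]
    if width and (subsecond is None or not 0 <= subsecond < limit):
        raise ValueError("sub-second value out of range")
    bits = "01" + p
    bits += format(year, "012b") + format(month - 1, "04b") + format(day - 1, "05b")
    bits += format(hour, "05b") + format(minute, "06b") + format(second, "06b")
    if width:
        bits += format(subsecond, f"0{width}b")
    return bits_to_bytes(bits)


print(pack_dts(1983, 1, 15, 18, 25, 12, 123, "ms").hex())  # 47bf07499307b0
print(pack_dts(1983, 1, 15, 18, 25, 12).hex())             # 77bf07499300
```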
Type DTSZ (date + time with sub-second precision + time zone offset): the type tag is 111, followed by the precision tag P. Values are zero-padded on the right up to the next byte boundary. Note that the D and T components must be stored as UTC.
For millisecond (ms) precision, encoded values use 8 bytes in this format:
111PPDDD DDDDDDDD DDDDDDDD DDTTTTTT TTTTTTTT TTTSSSSS SSSSSZZZ ZZZZ0000
Example: 1983-01-15T18:25:12.123+01:00 (millisecond precision) is encoded as 11100011 11011111 10000011 10100010 11001001 10000011 11011100 01000000 (bits) or e3 df 83 a2 c9 83 dc 40 (hex bytes).
For microsecond (µs) precision, encoded values use 9 bytes in this format:
111PPDDD DDDDDDDD DDDDDDDD DDTTTTTT TTTTTTTT TTTSSSSS SSSSSSSS SSSSSSSZ ZZZZZZ00
Example: 1983-01-15T18:25:12.123456+01:00 (microsecond precision) is encoded as 11101011 11011111 10000011 10100010 11001001 10000011 11000100 10000001 00010000 (bits) or eb df 83 a2 c9 83 c4 81 10 (hex bytes).
For nanosecond (ns) precision, encoded values use 10 bytes in this format:
111PPDDD DDDDDDDD DDDDDDDD DDTTTTTT TTTTTTTT TTTSSSSS SSSSSSSS SSSSSSSS SSSSSSSS SZZZZZZZ
Example: 1983-01-15T18:25:12.123456789+01:00 (nanosecond precision) is encoded as 11110011 11011111 10000011 10100010 11001001 10000011 10101101 11100110 10001010 11000100 (bits) or f3 df 83 a2 c9 83 ad e6 8a c4 (hex bytes).
In case the sub-second precision component has no value, encoded values use 7 bytes in this format:
111PPDDD DDDDDDDD DDDDDDDD DDTTTTTT TTTTTTTT TTTZZZZZ ZZ000000
Example: 1983-01-15T18:25:12+01:00 (no precision) is encoded as 11111011 11011111 10000011 10100010 11001001 10010001 00000000 (bits) or fb df 83 a2 c9 91 00 (hex bytes).
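Putting all components together, the following sketch packs a type DTSZ value from a timezone-aware `datetime.datetime`. The sub-second value is passed separately, since `datetime` cannot hold nanoseconds, and the object's own `microsecond` attribute is ignored. The name `pack_dtsz` and its parameters are hypothetical, and partial values are not handled.

```python
import datetime
from typing import Optional

# unit -> (precision tag P, value width in bits)
PRECISIONS = {None: ("11", 0), "ms": ("00", 10), "us": ("01", 20), "ns": ("10", 30)}


def pack_dtsz(value: datetime.datetime,
              subsecond: Optional[int] = None,
              unit: Optional[str] = None) -> bytes:
    """Pack an aware datetime plus optional sub-second value as type DTSZ."""
    offset = value.utcoffset()
    if offset is None:
        raise ValueError("an aware datetime is required")
    increments, remainder = divmod(offset, datetime.timedelta(minutes=15))
    if remainder or not 0 <= increments + 64 <= 125:
        raise ValueError("unsupported UTC offset")
    utc = value.astimezone(datetime.timezone.utc)  # D and T are stored in UTC
    if utc.year > 4094:
        raise ValueError("year must be in 0..4094")
    p, width = PRECISIONS[unit]
    if width and (subsecond is None or not 0 <= subsecond < (1 << width)):
        raise ValueError("sub-second value does not fit")
    bits = "111" + p
    bits += format(utc.year, "012b") + format(utc.month - 1, "04b") + format(utc.day - 1, "05b")
    bits += format(utc.hour, "05b") + format(utc.minute, "06b") + format(utc.second, "06b")
    if width:
        bits += format(subsecond, f"0{width}b")
    bits += format(increments + 64, "07b")
    bits += "0" * (-len(bits) % 8)  # zero-pad to the next byte boundary
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


cet = datetime.timezone(datetime.timedelta(hours=1))
moment = datetime.datetime(1983, 1, 15, 18, 25, 12, tzinfo=cet)
print(pack_dtsz(moment, 123, "ms").hex())  # e3df83a2c983dc40
print(pack_dtsz(moment).hex())             # fbdf83a2c99100
```

This sketch only checks that the sub-second value fits in its bit width; a stricter encoder would also enforce the decimal ranges given for the S component.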
A Python library for temporenc, conveniently named temporenc, is available from PyPI. The online documentation is a good place to start.
Implementations for other languages are most welcome!
Why the name temporenc?
Temporenc is a contraction of the words tempore (declension of Latin tempus, meaning time) and enc (abbreviation for encoding). The name temporenc should only be capitalized when normal spelling rules dictate so, e.g. at the start of a sentence.
What's so novel about temporenc?
Not much. Many ancient civilizations had their methods for representing dates and times, and digital schemes for doing the same have been around for decades.
Temporenc is just an attempt to cleverly combine what others have been doing for a very long time. Temporenc uses common bit packing techniques and builds upon international standards for representing dates, times, and time zones. All temporenc is about is combining existing ideas into a comprehensive encoding format.
Why another format when there are already so many of them?
Indeed, there are many (semi-)standardized formats to represent dates and times. Examples include Unix time (elapsed time since an epoch), ISO 8601 strings (a very extensive ISO standard with many different string formats), and SQL DATETIME strings.
Each of these formats, including temporenc, has its own strengths and weaknesses. Some formats allow for missing values (e.g. temporenc), while others do not (e.g. Unix time). Some can represent leap seconds (e.g. ISO 8601), while others cannot (e.g. Unix time). Some are human readable (e.g. ISO 8601), some are not (e.g. temporenc).
Temporenc simply offers a different trade-off, one that favours compact encoding and flexibility over human readability and parsing convenience.
Is temporenc just a binary ISO 8601 representation?
Yes and no. ISO 8601 is a very extensive standard that defines many string representations. The temporenc type DTSZ is conceptually similar to the canonical string format in ISO 8601, but differs in two important ways. First, temporenc allows any field to be empty (instead of only the least significant fields). Second, temporenc always uses UTC for time zone aware values, so you cannot blindly translate one into the other without date arithmetic.
Why does temporenc use so many variable-sized components?
The type tags and packing formats are designed to minimize the size of the encoded byte string. For example, because DT values (date with time) use a 2-bit type tag, the date (21 bits) and time (17 bits) fit exactly into 5 bytes (2 + 21 + 17 = 40 bits).
How does temporenc relate to other serialization formats like MessagePack, Thrift, or Protocol Buffers?
Temporenc only concerns itself with the encoding of date and time information into byte strings, not with the serialization of nested data structures. This means encoded temporenc values can simply be used inside larger data structures, which can then be serialized using a generic serialization format like MessagePack (which supports raw byte strings). Upon decoding, the raw byte string is made available again, which a temporenc decoder can then parse into the original date and time information.
Who came up with this format?
Temporenc was created by Wouter Bolsterlee. I'm wbolster on GitHub (star my repositories!), and @wbolster on Twitter (follow me!).
How can I contribute to temporenc?
By using it! The temporenc specification itself is maintained in the temporenc repository on GitHub. Do get in touch if you feel like it!