cdxcore.uniquehash

Framework for producing unique hashes for various Python elements. Hashing is key for caching strategies and managing data pipelines effectively. The module contains a range of utility functions to ease implementation of pipelines and other tasks where hashes of data are required.

Overview

By default, the functionality here follows a few design principles, which are discussed in cdxcore.uniquehash.UniqueHash():

Members of objects, and elements of dictionaries which start with “_” are ignored.
Member functions of objects or dictionaries are ignored.
Dictionaries are assumed to be order-invariant, even though Python now maintains construction order for dictionaries and therefore also for objects.

Example:

class A(object):
    def __init__(self, x):
        self.x = x
        self._y = x*2  # protected member will not be hashed by default

from cdxcore.uniquehash import UniqueHash
uniqueHash = UniqueHash(length=12)
a = A(2)
print( uniqueHash(a) ) # --> "2d1dc3767730"
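
The three principles can be illustrated without the library. The following is a minimal sketch using only hashlib; toy_hash and _update are hypothetical names, and this is not the hashing scheme used by UniqueHash, so the resulting hash values will differ from the ones shown above.

```python
import hashlib

def _update(h, x):
    """Feed a canonical description of x into the hashlib object h."""
    if isinstance(x, dict):
        h.update(b"dict")
        # Sort keys so that construction order does not matter.
        for k in sorted(x, key=repr):
            # Skip keys starting with "_" and skip function-like values.
            if (isinstance(k, str) and k.startswith("_")) or callable(x[k]):
                continue
            _update(h, k)
            _update(h, x[k])
    elif isinstance(x, (list, tuple)):
        h.update(type(x).__name__.encode())
        for item in x:
            _update(h, item)
    elif hasattr(x, "__dict__"):
        # Objects are described by their type name and their members.
        h.update(type(x).__qualname__.encode())
        _update(h, vars(x))
    else:
        h.update(repr(x).encode())

def toy_hash(x, length=12):
    """Return the first `length` hex digits of a SHA-256 over x (toy only)."""
    h = hashlib.sha256()
    _update(h, x)
    return h.hexdigest()[:length]

class A:
    def __init__(self, x):
        self.x = x
        self._y = x * 2   # ignored by the sketch, as in the default above

a1, a2 = A(2), A(2)
a2._y = 99                # changing a protected member does not change the hash
print(toy_hash(a1) == toy_hash(a2))
print(toy_hash({"x": 1, "y": 2}) == toy_hash({"y": 2, "x": 1}))
```

Both comparisons print True in this sketch, because the "_" member is skipped and dictionary keys are sorted before hashing.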

The module contains a few pre-defined hash functions with different hash lengths:

cdxcore.uniquehash.unique_hash8()
cdxcore.uniquehash.unique_hash16()
cdxcore.uniquehash.unique_hash32()
cdxcore.uniquehash.unique_hash48()
cdxcore.uniquehash.unique_hash64()

Import

import cdxcore.uniquehash as uniquehash

Documentation

Functions

`NamedUniqueHash`([max_length, id_length, ...])	Generate user-readable unique hashes and filenames.
`UniqueLabel`([max_length, id_length, ...])	Returns a function.
`named_unique_filename48_8`(label, *args, **kwargs)	Returns a unique and valid filename which is composed of label and a unique ID computed using all of label, args, and kwargs.
`unique_hash16`(*args, **kwargs)	Short-cut for the hash function returned by `cdxcore.uniquehash.UniqueHash` with parameter `length=16`.
`unique_hash32`(*args, **kwargs)	Short-cut for the hash function returned by `cdxcore.uniquehash.UniqueHash` with parameter `length=32`.
`unique_hash48`(*args, **kwargs)	Short-cut for the hash function returned by `cdxcore.uniquehash.UniqueHash` with parameter `length=48`.
`unique_hash64`(*args, **kwargs)	Short-cut for the hash function returned by `cdxcore.uniquehash.UniqueHash` with parameter `length=64`.
`unique_hash8`(*args, **kwargs)	Short-cut for the hash function returned by `cdxcore.uniquehash.UniqueHash` with parameter `length=8`.

Classes

`DebugTrace`()	Base class for tracing hashing operations.
`DebugTraceCollect`([tostr])	Keep track of everything parsed during hashing.
`DebugTraceVerbose`([strsize, verbose])	Live printing of tracing information with `cdxcore.verbose.Context`.
`UniqueHash`([length, parse_underscore, ...])	A calculator class which computes unique hashes of a fixed length.

class cdxcore.uniquehash.DebugTrace

Bases: object

Base class for tracing hashing operations.

Use either cdxcore.uniquehash.DebugTraceCollect or cdxcore.uniquehash.DebugTraceVerbose for debugging. The latter prints out tracing during the computation of a hash, while the former collects all this information in a simplistic data structure. Note that collecting the trace can be quite memory intensive.

class cdxcore.uniquehash.DebugTraceCollect(tostr=None)

Bases: DebugTrace

Keep track of everything parsed during hashing.

The result of the trace is contained in cdxcore.uniquehash.DebugTraceCollect.trace.

Note that DebugTraceCollect itself implements Collection and Sequence semantics so you can iterate it directly.

Parameters:

tostr: int

If set to a positive integer, then any object encountered will be represented as a string with repr(), and the length of the string will be limited to tostr. This avoids generation of large amounts of data if the objects hashed are large (e.g. numpy arrays).

If set to None then the function collects the actual elements.

trace

Trace of the hashing operation. Upon completion of cdxcore.uniquehash.UniqueHash.__call__() this list contains elements with the following fields:

if tostr is a positive integer:
- typex: type of the element
- reprx: repr of the element, up to tostr length.
- msg: message that occurred during hashing, if any
- child: if the element was a container or object
if tostr is None:
- x: the element
- msg: message that occurred during hashing, if any
- child: if the element was a container or object
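
To make the memory trade-off between the two modes concrete, here is a minimal sketch of how such an entry could be built. The function trace_entry and the use of a plain dict are assumptions for illustration only; the field names follow the list above, the child field is omitted, and the library's actual data structure may differ.

```python
def trace_entry(x, tostr=None, msg=None):
    """Build a toy trace record for element x (hypothetical helper)."""
    if tostr is None:
        # Keep a reference to the element itself.
        return {"x": x, "msg": msg}
    # Keep only the type name and a truncated repr of the element.
    return {"typex": type(x).__name__, "reprx": repr(x)[:tostr], "msg": msg}

big = list(range(10_000))
short = trace_entry(big, tostr=20)   # stores at most 20 characters of repr
full = trace_entry(big)              # stores the list itself
print(short["typex"], len(short["reprx"]))
print(full["x"] is big)
```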

class cdxcore.uniquehash.DebugTraceVerbose(strsize=50, verbose=None)

Bases: DebugTrace

Live printing of tracing information, using cdxcore.verbose.Context for formatting. All objects will be reported by type and their string representation, truncated if necessary.

Parameters:

strsize (int, optional): Maximum string size when using repr() on reported objects. Default is 50.
verbose (cdxcore.verbose.Context, optional): Context object or None for a new context object with full visibility (it prints everything).

cdxcore.uniquehash.NamedUniqueHash(max_length=60, id_length=16, *, separator=' ', filename_by=None, **unique_hash_arguments)

Generate user-readable unique hashes and filenames.

Returns a function:

f( label, *args, **kwargs )

which generates unique strings of length at most max_length in the format label + separator + ID, where ID has length id_length. Since label heads the resulting string, this function is suited for use cases where a user might want an indication of what a hash refers to.

This function does not assume that label is unique, hence the ID is prioritized. See cdxcore.uniquehash.UniqueLabel() for a function which assumes the label is unique.

The maximum length of the returned string is max_length; if need be label will be truncated: the returned string will always end in ID.

The function optionally makes sure that the returned string is a valid file name using cdxcore.util.fmt_filename().
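
The label + separator + ID layout and the truncation rule can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: make_named_hash is a hypothetical name, the ID is derived with hashlib.sha256 over a repr of the inputs, and the filename conversion is left out.

```python
import hashlib

def make_named_hash(max_length=60, id_length=16, separator=" "):
    """Return f(label, *args, **kwargs) -> label + separator + ID (toy version)."""
    if id_length + len(separator) > max_length:
        raise ValueError("max_length is too small for separator and ID")

    def named_hash(label, *args, **kwargs):
        # The ID depends on the label, the separator, and all arguments.
        payload = repr((label, separator, args, sorted(kwargs.items())))
        uid = hashlib.sha256(payload.encode()).hexdigest()[:id_length]
        # Truncate the label if needed; the ID is always kept in full.
        room = max_length - id_length - len(separator)
        return label[:room] + separator + uid

    return named_hash

f = make_named_hash(max_length=20, id_length=8)
name = f("a very long experiment label", 1, x=2)
print(len(name), name[:11])
```

Here the label is cut to 11 characters so that the separator and the 8-character ID still fit into 20 characters.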

Short Cut

Consider cdxcore.uniquehash.named_unique_filename48_8() if the defaults used for that function are suitable for your use case.

Important

It is strongly recommended to read the documentation for cdxcore.uniquehash.UniqueHash for details on the hashing logic and the available parameters.

Parameters:

max_length (int, optional): Total length of the returned string including the ID. Defaults to 60 to allow file names with extensions of up to three letters.
id_length (int, optional): Intended length of the hash ID. Default is 16.
separator (str, optional): Separator between label and ID. Note that the separator will be included in the ID calculation, hence different separators lead to different IDs. Default is ' '.
filename_by (str, optional): If not None, use cdxcore.util.fmt_filename() with by=filename_by to ensure the returned string is a valid filename for both Windows and Linux, of at most max_length size. If set to the string "default", use cdxcore.util.DEF_FILE_NAME_MAP as the default mapping for cdxcore.util.fmt_filename().
**unique_hash_arguments (optional): Parameters passed to cdxcore.uniquehash.UniqueHash.

Returns:

uniqueHash (Callable): hash function with signature (label, *args, **kwargs).

class cdxcore.uniquehash.UniqueHash(length=32, *, parse_underscore='none', sort_dicts=True, parse_functions=False, pd_ignore_column_order=True, np_nan_equal=False, f_include_defaults=True, f_include_closure=True, f_include_globals=True)

Bases: object

A calculator class which computes unique hashes of a fixed length.

A number of parameters control the exact semantics of the hashing algorithm as it iterates through collections and objects; they are discussed below.

The base use case is to only specify the length of the unique ID string to be computed:

class A(object):
    def __init__(self, x):
        self.x = x
        self._y = x*2  # protected member will not be hashed by default

from cdxcore.uniquehash import UniqueHash
uniqueHash = UniqueHash(length=12)
a = A(2)
print( uniqueHash(a) ) # --> "2d1dc3767730"

The callable uniqueHash can be applied to “any” Python construct.

Private and Protected members

When an object is passed to this callable its members are iterated using __dict__ or __slots__, respectively. By default this process ignores any fields in objects or dictionaries which start with “_”. The idea here is that “functional” parameters are stored as members, but any derived data is stored in protected members. This behaviour can be changed with parse_underscore.

Objects can optionally implement their own hashing scheme by implementing:

__unique_hash__( self, uniqueHash : UniqueHash, debug_trace : DebugTrace  )

This function may return a unique string, or any other non-None Python object which will then again be hashed. A common use case is to ignore the parameters to this function and return a tuple of members of the class which are pertinent for hashing.
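
As an illustration of the common use case just described, the following sketch defines a class whose __unique_hash__ ignores both parameters and returns a tuple of the members relevant for hashing. The class Experiment and its members are hypothetical; the direct call at the end only shows what the method returns and does not involve the library.

```python
class Experiment:
    def __init__(self, rate, steps):
        self.rate = rate
        self.steps = steps
        self.cache = {}   # derived data, not part of the object's identity

    def __unique_hash__(self, uniqueHash, debug_trace):
        # Ignore both parameters and return only the defining members;
        # per the description above, the returned object is hashed in turn.
        return (self.rate, self.steps)

e = Experiment(0.1, 100)
e.cache["result"] = 42
print(e.__unique_hash__(None, None))   # (0.1, 100)
```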

Dictionaries

Since Python 3.6 dictionaries preserve the order in which they were constructed. However, Python semantics remain otherwise order-invariant, i.e. {'x':1, 'y':2} tests equal to {'y':2, 'x':1}. For this reason the default behaviour here for dictionaries is to sort them before hashing their content. This also applies to objects processed via their __dict__.

This can be turned off with sort_dicts. OrderedDicts or any classes derived from them (such as cdxcore.prettydict.pdct) are processed in order and not sorted in any case.
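
The Python semantics behind this default can be checked with the standard library alone. The snippet below only demonstrates plain dict and OrderedDict comparison behaviour; it does not call the library.

```python
from collections import OrderedDict

d1 = {"x": 1, "y": 2}
d2 = {"y": 2, "x": 1}
print(d1 == d2)                # True: dict equality ignores order
print(list(d1) == list(d2))    # False: iteration order differs

o1 = OrderedDict(x=1, y=2)
o2 = OrderedDict(y=2, x=1)
print(o1 == o2)                # False: OrderedDict equality is order-sensitive

# Sorting the items first yields an order-independent sequence to hash.
print(sorted(d1.items()) == sorted(d2.items()))   # True
```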

Functions

By default function members of objects and dictionaries (which include @properties) are ignored. You can set parse_functions to True to parse a reduced text of the function code. There are a number of additional expert settings for handling functions, see below.

Numpy, Pandas

Hashing of large datasets is not advised. Use hashes on the generating parameter set instead where possible.

Parameters:

length: int, optional

Intended length of the hash. Default is 32.

parse_underscore: str, optional

How to handle object members starting with “_”.

"none" : ignore members starting with “_” (the default).
"protected" : ignore ‘private’ members declared starting with “_” and containing “__”.
"private" : consider all members.

Default is "none".
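
The three modes can be read as a filter on member names. The sketch below is one interpretation of the descriptions above; keep_member is a hypothetical helper, not a library function. It relies on Python's name mangling, which stores a member declared as self.__z in class A under the name _A__z.

```python
def keep_member(name, parse_underscore="none"):
    """Return True if a member with this name would be hashed (toy filter)."""
    if parse_underscore == "private":
        return True                                   # consider all members
    if parse_underscore == "protected":
        # Drop only names that start with "_" and contain "__".
        return not (name.startswith("_") and "__" in name)
    if parse_underscore == "none":
        return not name.startswith("_")               # drop every "_" member
    raise ValueError(f"unknown mode: {parse_underscore!r}")

class A:
    def __init__(self):
        self.x = 1
        self._y = 2
        self.__z = 3   # stored as "_A__z" due to name mangling

names = list(vars(A()))
for mode in ("none", "protected", "private"):
    print(mode, [n for n in names if keep_member(n, mode)])
```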

sort_dicts: bool, optional

Since Python 3.6 dictionaries are ordered. That means that strictly speaking the two dictionaries {'x':1, 'y':2} and {'y':2, 'x':1} are not identical; however, Python still treats them as semantically equal, as == between the two returns True. Accordingly, by default this hash function assumes the order of dictionaries does _not_ matter unless they are, or are derived from, OrderedDict (as is cdxcore.prettydict.pdct). Practically, that means the function first sorts the keys of mappings before hashing their items.

This can be turned off by setting sort_dicts=False. Default is True.

parse_functions: bool, optional

If True, then the function will attempt to generate unique hashes for functions. Default is False.

pd_ignore_column_order: bool, optional

(Advanced parameter). Whether to ignore the order of pandas columns. The default is True.

np_nan_equal: bool, optional

(Advanced parameter). Whether to ignore the specific type of a NaN. The default is False.

f_include_defaults: bool, optional

(Advanced parameter). When parsing functions, whether to include default values. Default is True.

f_include_closure: bool, optional

(Advanced parameter). When parsing functions, whether to include the function closure. This can be expensive. Default is True.

f_include_globals: bool, optional

(Advanced parameter). When parsing functions, whether to include globals used by the function. This can be expensive. Default is True.
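
To see what default values and closures contribute to a function's identity, consider the sketch below. It is an assumption-laden illustration: function_fingerprint is a hypothetical helper that hashes the compiled bytecode rather than a reduced text of the function code, and it does not handle globals. It uses only standard function attributes (__code__, __defaults__, __kwdefaults__, __closure__).

```python
import hashlib

def function_fingerprint(f, include_defaults=True, include_closure=True):
    """Toy fingerprint of a function built from standard function attributes."""
    # Bytecode and constants; nested code objects in co_consts are not handled.
    parts = [f.__code__.co_code, repr(f.__code__.co_consts)]
    if include_defaults:
        parts.append(repr((f.__defaults__, f.__kwdefaults__)))
    if include_closure and f.__closure__ is not None:
        # Values captured from the enclosing scope.
        parts.append(repr([c.cell_contents for c in f.__closure__]))
    h = hashlib.sha256()
    for p in parts:
        h.update(p if isinstance(p, bytes) else p.encode())
    return h.hexdigest()[:16]

def make_adder(n):
    def add(x, scale=1):
        return (x + n) * scale
    return add

add2, add3 = make_adder(2), make_adder(3)   # same code, different closure
print(function_fingerprint(add2) == function_fingerprint(add3))
print(function_fingerprint(add2, include_closure=False)
      == function_fingerprint(add3, include_closure=False))
```

The two functions share the same code and defaults, so in this sketch they can only be told apart when the closure is included.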

Attributes:

name: Returns a descriptive name of self.

Methods

`__call__`(*args[, debug_trace])
`clone`()	Return copy of self.

__call__(*args, debug_trace=None, **kwargs)

Returns a unique hash for the args and kwargs parameters passed to this function.

Example:

class A(object):
    def __init__(self, x):
        self.x = x
        self._y = x*2  # protected member will not be hashed by default

from cdxcore.uniquehash import UniqueHash
uniqueHash = UniqueHash(12)
a = A(2)
print( uniqueHash(a) ) # --> "2d1dc3767730"

Parameters:

args, kwargs:

Parameters to hash.

debug_trace: cdxcore.uniquehash.DebugTrace

Allows tracing of hashing activity for debugging purposes. Two implementations of DebugTrace are available:

cdxcore.uniquehash.DebugTraceVerbose simply prints out hashing activity to stdout.
cdxcore.uniquehash.DebugTraceCollect collects an array of tracing information. The object itself is an iterable which contains the respective tracing information once the hash function has returned.

Returns:

Hash (str): String of at most length characters.

clone(): Return copy of self.

property name: str: Returns a descriptive name of self.

cdxcore.uniquehash.UniqueLabel(max_length=60, id_length=8, separator=' ', filename_by=None)

Returns a function:

f( unique_label )

which generates strings of at most max_length based on a provided unique_label; essentially:

If len(unique_label) <= max_length:
    unique_label
else:
    unique_label + separator + ID

where ID is a unique hash computed from unique_label of maximum length id_length.

This function assumes that unique_label is unique, hence the ID is dropped if the length of unique_label does not exceed max_length. Use cdxcore.uniquehash.NamedUniqueHash() if the label is not unique; that function always appends the dynamically calculated unique ID.

Note that if filename_by conversion is used, then this function will always attach the unique ID to the filename because after the conversion of the label to a filename it is no longer guaranteed that the result is unique. If your label is unique as a filename, do not use filename_by. The function will return valid file names if label is a valid file name.
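
The branching logic above, without the filename conversion, can be sketched as follows. make_unique_label is a hypothetical name, the ID is derived with hashlib.sha256 as a stand-in for the library's hash, and the sketch assumes that max_length leaves room for the separator and the ID.

```python
import hashlib

def make_unique_label(max_length=60, id_length=8, separator=" "):
    """Return f(unique_label) producing strings of at most max_length (toy version)."""
    if id_length + len(separator) > max_length:
        raise ValueError("max_length is too small for separator and ID")

    def label_fn(unique_label):
        if len(unique_label) <= max_length:
            return unique_label               # short labels are returned unchanged
        # The separator enters the ID, so different separators give different IDs.
        payload = (unique_label + separator).encode()
        uid = hashlib.sha256(payload).hexdigest()[:id_length]
        room = max_length - id_length - len(separator)
        return unique_label[:room] + separator + uid

    return label_fn

f = make_unique_label(max_length=20, id_length=8)
print(f("short label"))
print(len(f("x" * 30)))
```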

Parameters:

max_length: int

Total length of the returned string including the ID. Defaults to 60 to allow file names with extensions of up to three letters.

id_length: int

Indicative length of the hash function, default 8. id_length will be reduced to max_length if necessary.

separator: str

Separator between the label and the unique ID.

Note that the separator will be included in the ID calculation, hence different separators lead to different IDs.

filename_by: str

If not None, use cdxcore.util.fmt_filename() with by=filename_by to ensure the returned string is a valid filename for both Windows and Linux, of at most max_length size. If set to the string "default", use cdxcore.util.DEF_FILE_NAME_MAP as the default mapping for cdxcore.util.fmt_filename().

Returns:

Hash function (Callable): Hash function with signature (unique_label).

cdxcore.uniquehash.named_unique_filename48_8(label, *args, **kwargs)

Returns a unique and valid filename which is composed of label and a unique ID computed using all of label, args, and kwargs.

Consider a use case where an experiment defined by definition has produced results which we wish to pickle to disk. Assume further that str(definition) provides an informative user-readable but not necessarily unique description of definition.

Example:

import pickle
from cdxcore.uniquehash import named_unique_filename48_8

def store_experiment( num : int, definition : object, results : object ):
    label    = f"Experiment {str(definition)}"
    # valid filename made of the label and an ID hashed from all arguments
    filename = named_unique_filename48_8( label, (num, definition) )
    with open(filename, "wb") as f:
        pickle.dump(results, f)  # write the pickled results to the file

This is the hash function returned by cdxcore.uniquehash.NamedUniqueHash with parameters max_length=48, id_length=8, filename_by="default".