cdxcore.uniquehash
Framework for producing unique hashes for various Python elements. Hashing is key for caching strategies and managing data pipelines effectively. The module contains a range of utility functions to ease implementation of pipelines and other tasks where hashes of data are required.
Overview
By default, the functionality here follows a few important design principles, which are discussed in cdxcore.uniquehash.UniqueHash(), such as:
Members of objects, and elements of dictionaries which start with “_” are ignored.
Member functions of objects or dictionaries are ignored.
Dictionaries are assumed to be order-invariant, even though Python now maintains construction order for dictionaries and therefore also for objects.
Example:
class A(object):
    def __init__(self, x):
        self.x = x
        self._y = x*2  # protected member will not be hashed by default

from cdxcore.uniquehash import UniqueHash
uniqueHash = UniqueHash(length=12)
a = A(2)
print( uniqueHash(a) )  # --> "2d1dc3767730"
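To make the three principles above concrete, here is a minimal, self-contained sketch of a hash that skips members starting with "_", skips callables, and sorts dictionary keys. It uses only hashlib; the helpers canonical and sketch_hash are hypothetical names introduced for illustration. This is not the cdxcore implementation, so its output will not match UniqueHash.

```python
import hashlib

def canonical(x):
    """Reduce x to a nested tuple of plain values (illustrative only)."""
    if isinstance(x, dict):
        # Skip keys starting with "_" and callable values, then sort by key.
        items = [(k, canonical(v)) for k, v in x.items()
                 if not (isinstance(k, str) and k.startswith("_"))
                 and not callable(v)]
        return ("dict", tuple(sorted(items, key=lambda kv: repr(kv[0]))))
    if isinstance(x, (list, tuple)):
        return (type(x).__name__, tuple(canonical(v) for v in x))
    if hasattr(x, "__dict__"):
        # Objects are processed via their __dict__.
        return (type(x).__qualname__, canonical(vars(x)))
    return (type(x).__name__, repr(x))

def sketch_hash(x, length=12):
    """Hex digest of the canonical form, cut to the requested length."""
    digest = hashlib.sha256(repr(canonical(x)).encode("utf-8")).hexdigest()
    return digest[:length]

class A:
    def __init__(self, x):
        self.x = x
        self._y = x * 2   # ignored by the sketch

a, b = A(2), A(2)
b._y = 999                # differs only in a protected member
print(sketch_hash(a) == sketch_hash(b))
print(sketch_hash({"x": 1, "y": 2}) == sketch_hash({"y": 2, "x": 1}))
```

Both comparisons hold in this sketch because the protected member and the key order never reach the digest.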
The module contains a few pre-defined hash functions with different hash lengths:
cdxcore.uniquehash.unique_hash8()
Import
import cdxcore.uniquehash as uniquehash
Documentation
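Functions
| NamedUniqueHash | Generate user-readable unique hashes and filenames. |
| UniqueLabel | Returns a function. |
| named_unique_filename48_8 | Returns a unique and valid filename which is composed of label and a unique ID computed using all of label, args, and kwargs. |
| unique_hash16 | Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash. |
| unique_hash32 | Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash. |
| unique_hash48 | Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash. |
| unique_hash64 | Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash. |
| unique_hash8 | Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash. |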
Classes
| DebugTrace | Base class for tracing hashing operations. |
| DebugTraceCollect | Keep track of everything parsed during hashing. |
| DebugTraceVerbose | Live printing of tracing information with cdxcore.verbose.Context. |
| UniqueHash | A calculator class which computes unique hashes of a fixed length. |
- class cdxcore.uniquehash.DebugTrace
Bases: object
Base class for tracing hashing operations.
Use either cdxcore.uniquehash.DebugTraceCollect or cdxcore.uniquehash.DebugTraceVerbose for debugging. The latter prints tracing information during the computation of a hash, while the former collects all this information in a simple data structure. Note that collecting can be quite memory intensive.
- class cdxcore.uniquehash.DebugTraceCollect(tostr=None)
Bases: DebugTrace
Keep track of everything parsed during hashing.
The result of the trace is contained in cdxcore.uniquehash.DebugTraceCollect.trace. Note that DebugTraceCollect itself implements Collection and Sequence semantics, so you can iterate it directly.
- Parameters:
- tostr : int, optional
If set to a positive integer, then any object encountered will be represented as a string with repr(), and the length of the string will be limited to tostr. This avoids generating large amounts of data if the objects hashed are large (e.g. numpy arrays).
If set to None, the collector stores the actual elements.
- trace
Trace of the hashing operation. Upon completion of cdxcore.uniquehash.UniqueHash.__call__() this list contains elements of the following type:
- if tostr is a positive integer:
typex: type of the element.
reprx: repr of the element, truncated to tostr characters.
msg: message which occurred during hashing, if any.
child: set if the element was a container or object.
- if tostr is None:
x: the element.
msg: message which occurred during hashing, if any.
child: set if the element was a container or object.
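The two entry layouts described above can be pictured with a small stand-in collector. This is a sketch only: SketchTraceCollect, its record method, and the use of plain dictionaries for entries are assumptions made for illustration, while the field names typex, reprx, x, msg and child are taken from the description above.

```python
from collections.abc import Sequence

class SketchTraceCollect(Sequence):
    """Hypothetical stand-in for a collecting trace; not the cdxcore class."""

    def __init__(self, tostr=None):
        if tostr is not None and tostr <= 0:
            raise ValueError("tostr must be a positive integer or None")
        self.tostr = tostr
        self.trace = []

    def record(self, x, msg=None):
        """Append one entry and return a child collector for nested elements."""
        child = SketchTraceCollect(self.tostr)
        if self.tostr is None:
            # Keep the element itself (can be memory intensive).
            entry = {"x": x, "msg": msg, "child": child}
        else:
            # Keep only the type and a truncated repr.
            entry = {"typex": type(x), "reprx": repr(x)[: self.tostr],
                     "msg": msg, "child": child}
        self.trace.append(entry)
        return child

    # Sequence semantics: the collector can be indexed and iterated directly.
    def __getitem__(self, i):
        return self.trace[i]

    def __len__(self):
        return len(self.trace)

t = SketchTraceCollect(tostr=10)
child = t.record(list(range(100)))   # a large container
child.record(0)                      # one of its elements
for entry in t:
    print(entry["typex"].__name__, entry["reprx"])
```

With tostr=10 the hundred-element list is stored as a ten-character string, which is the memory saving the tostr parameter is meant to provide.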
- class cdxcore.uniquehash.DebugTraceVerbose(strsize=50, verbose=None)
Bases: DebugTrace
Live printing of tracing information, using cdxcore.verbose.Context for formatting. All objects are reported by type and by their string representation, truncated if necessary.
- Parameters:
- strsize : int, optional
Maximum string size when using repr() on reported objects. Default is 50.
- verbose : cdxcore.verbose.Context, optional
Context object, or None for a new context object with full visibility (it prints everything).
- cdxcore.uniquehash.NamedUniqueHash(max_length=60, id_length=16, *, separator=' ', filename_by=None, **unique_hash_arguments)
Generate user-readable unique hashes and filenames.
Returns a function:
f( label, *args, **kwargs )
which generates unique strings of at most max_length characters in the format
label + separator + ID
where ID has length id_length. Since label heads the resulting string, this function is suited for use cases where a user might want an indication of what a hash refers to.
This function does not assume that label is unique, hence the ID is prioritized. See cdxcore.uniquehash.UniqueLabel() for a function which assumes the label is unique.
The maximum length of the returned string is max_length; if need be, label will be truncated: the returned string will always end in ID.
The function optionally makes sure that the returned string is a valid file name using cdxcore.util.fmt_filename().
Short Cut
Consider cdxcore.uniquehash.named_unique_filename48_8() if the defaults used for that function are suitable for your use case.
Important
It is strongly recommended to read the documentation for cdxcore.uniquehash.UniqueHash for details on the hashing logic and the available parameters.
- Parameters:
- max_length : int, optional
Total length of the returned string including the ID. Defaults to 60 to allow file names with extensions of up to three letters.
- id_length : int, optional
Intended length of the hash ID. Default is 16.
- separator : str, optional
Separator between label and ID. Note that the separator is included in the ID calculation, hence different separators lead to different IDs. Default is ' '.
- filename_by : str, optional
If not None, use cdxcore.util.fmt_filename() with by=filename_by to ensure the returned string is a valid filename for both Windows and Linux, of at most max_length size. If set to the string "default", use cdxcore.util.DEF_FILE_NAME_MAP as the default mapping for cdxcore.util.fmt_filename().
- **unique_hash_arguments : optional
Parameters passed to cdxcore.uniquehash.UniqueHash.
- Returns:
- uniqueHash : Callable
Hash function with signature (label, *args, **kwargs).
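The label + separator + ID format, with the label truncated so that the result always ends in the full ID, can be sketched as follows. The factory sketch_named_hash is a hypothetical helper built on hashlib; it only mimics the documented output format and does not reproduce the hashing logic or the filename conversion of NamedUniqueHash.

```python
import hashlib

def sketch_named_hash(max_length=60, id_length=16, separator=" "):
    """Hypothetical helper mimicking the documented output format only."""
    if id_length + len(separator) >= max_length:
        raise ValueError("max_length leaves no room for a label")

    def f(label, *args, **kwargs):
        # The ID depends on the label, the separator and all arguments.
        payload = repr((label, separator, args, sorted(kwargs.items())))
        uid = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:id_length]
        # Truncate the label, never the ID.
        room = max_length - id_length - len(separator)
        return label[:room] + separator + uid

    return f

name = sketch_named_hash(max_length=24, id_length=8)
s = name("a rather long experiment label", 1, alpha=0.5)
print(s, len(s))
```

Here the label is cut to 15 characters so that label, separator and the 8-character ID together fill exactly max_length=24 characters.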
- class cdxcore.uniquehash.UniqueHash(length=32, *, parse_underscore='none', sort_dicts=True, parse_functions=False, pd_ignore_column_order=True, np_nan_equal=False, f_include_defaults=True, f_include_closure=True, f_include_globals=True)
Bases: object
A calculator class which computes unique hashes of a fixed length.
There are a number of parameters which control the exact semantics of the hashing algorithm as it iterates through collections and objects; these are discussed below.
The base use case is to only specify the length of the unique ID string to be computed:
class A(object):
    def __init__(self, x):
        self.x = x
        self._y = x*2  # protected member will not be hashed by default

from cdxcore.uniquehash import UniqueHash
uniqueHash = UniqueHash(length=12)
a = A(2)
print( uniqueHash(a) )  # --> "2d1dc3767730"
The callable uniqueHash can be applied to “any” Python construct.
Private and Protected members
When an object is passed to this callable, its members are iterated using __dict__ or __slots__, respectively. By default this process ignores any fields in objects or dictionaries which start with “_”. The idea here is that “functional” parameters are stored as members, but any derived data is stored in protected members. This behaviour can be changed with parse_underscore.
Objects can optionally implement their own hashing scheme by implementing:
__unique_hash__( self, uniqueHash : UniqueHash, debug_trace : DebugTrace )
This function may return a unique string, or any other non-None Python object which will then again be hashed. A common use case is to ignore the parameters to this function and return a tuple of the members of the class which are pertinent for hashing.
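As an illustration of the common use case just described, the sketch below shows a class which returns a tuple of its identity-defining members. The class Model and its fields are hypothetical. The method is called directly with None placeholders only to show its return value; in practice it would be invoked by the hashing machinery.

```python
class Model:
    """Hypothetical class: name and version define identity, cache_dir does not."""

    def __init__(self, name, version, cache_dir):
        self.name = name
        self.version = version
        self.cache_dir = cache_dir   # storage location, irrelevant for hashing

    def __unique_hash__(self, uniqueHash, debug_trace):
        # Ignore both arguments and return the members pertinent for hashing.
        return (self.name, self.version)

m1 = Model("ridge", 3, "scratch/a")
m2 = Model("ridge", 3, "scratch/b")

# Direct calls with placeholder arguments, for illustration only.
print(m1.__unique_hash__(None, None) == m2.__unique_hash__(None, None))
```

Because cache_dir is left out of the returned tuple, two models which differ only in where they store results would be treated as the same object for hashing purposes.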
Dictionaries
Since Python 3.6 dictionaries preserve the order in which they were constructed. However, Python semantics remain otherwise order-invariant, i.e. {'x':1, 'y':2} tests equal to {'y':2, 'x':1}. For this reason the default behaviour here for dictionaries is to sort them before hashing their content. This also applies to objects processed via their __dict__.
This can be turned off with sort_dicts. OrderedDicts or any classes derived from them (such as cdxcore.prettydict.pdct) are processed in order and not sorted in any case.
Functions
By default function members of objects and dictionaries (which include @properties) are ignored. You can set parse_functions to True to parse a reduced text of the function code. There are a number of additional expert settings for handling functions, see below.
Numpy, Pandas
Hashing of large datasets is not advised. Use hashes on the generating parameter set instead where possible.
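A minimal sketch of this advice: derive the cache key from the small parameter dictionary which generates the data, not from the data itself. The helpers generate, params_key and cached_generate are hypothetical, and the key is computed with hashlib and json rather than with UniqueHash so that the example is self-contained.

```python
import hashlib
import json
import random

def generate(params):
    """Stand-in for an expensive, deterministic data generation step."""
    rng = random.Random(params["seed"])
    return [rng.gauss(params["mean"], 1.0) for _ in range(params["n"])]

def params_key(params, length=16):
    """Hash the parameter dictionary (keys sorted) instead of the data."""
    text = json.dumps(params, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]

cache = {}

def cached_generate(params):
    key = params_key(params)
    if key not in cache:
        cache[key] = generate(params)   # computed once per parameter set
    return cache[key]

p = {"seed": 1, "mean": 0.0, "n": 1000}
data1 = cached_generate(p)
data2 = cached_generate({"n": 1000, "mean": 0.0, "seed": 1})  # same set, other order
print(data1 is data2)
```

The key is computed from three numbers instead of a thousand samples, and the second call is served from the cache.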
- Parameters:
- length : int, optional
Intended length of the hash. Default is 32.
- parse_underscore : str, optional
How to handle object members starting with “_”.
"none": ignore members starting with “_” (the default).
"protected": ignore ‘private’ members declared starting with “_” and containing “__”.
"private": consider all members.
Default is "none".
- sort_dicts : bool, optional
Since Python 3.6 dictionaries are ordered. That means that, strictly speaking, the two dictionaries {'x':1, 'y':2} and {'y':2, 'x':1} are not identical; however, Python will semantically still treat them as equal, as == between the two returns True. Accordingly, by default this hash function assumes the order of dictionaries does not matter unless they are, or are derived from, OrderedDict (as is cdxcore.prettydict.pdct). Practically, that means the function first sorts the keys of mappings before hashing their items.
This can be turned off by setting sort_dicts=False. Default is True.
- parse_functions : bool, optional
If True, then the function will attempt to generate unique hashes for functions. Default is False.
- pd_ignore_column_order : bool, optional
(Advanced parameter). Whether to ignore the order of pandas columns. Default is True.
- np_nan_equal : bool, optional
(Advanced parameter). Whether to ignore the specific type of a NaN. Default is False.
- f_include_defaults : bool, optional
(Advanced parameter). When parsing functions, whether to include default values. Default is True.
- f_include_closure : bool, optional
(Advanced parameter). When parsing functions, whether to include the function closure. This can be expensive. Default is True.
- f_include_globals : bool, optional
(Advanced parameter). When parsing functions, whether to include globals used by the function. This can be expensive. Default is False.
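The rationale given for sort_dicts can be checked with plain Python, independent of this module: regular dictionaries compare equal regardless of insertion order, whereas two OrderedDict instances compare order-sensitively. Sorting the items removes the dependence on insertion order.

```python
from collections import OrderedDict

d1 = {"x": 1, "y": 2}
d2 = {"y": 2, "x": 1}
print(d1 == d2)               # True: equality ignores insertion order
print(list(d1) == list(d2))   # False: iteration order differs

o1 = OrderedDict(x=1, y=2)
o2 = OrderedDict(y=2, x=1)
print(o1 == o2)               # False: OrderedDict equality is order-sensitive

# Sorting the items yields the same sequence for both plain dictionaries.
print(sorted(d1.items()) == sorted(d2.items()))   # True
```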
- Attributes:
name
Returns a descriptive name of self.
Methods
- __call__(*args, debug_trace=None, **kwargs)
Returns a unique hash for the args and kwargs parameters passed to this function.
Example:
class A(object):
    def __init__(self, x):
        self.x = x
        self._y = x*2  # protected member will not be hashed by default

from cdxcore.uniquehash import UniqueHash
uniqueHash = UniqueHash(12)
a = A(2)
print( uniqueHash(a) )  # --> "2d1dc3767730"
- Parameters:
- args, kwargs:
Parameters to hash.
- debug_trace : cdxcore.uniquehash.DebugTrace
Allows tracing of hashing activity for debugging purposes. Two implementations of DebugTrace are available:
cdxcore.uniquehash.DebugTraceVerbose simply prints out hashing activity to stdout.
cdxcore.uniquehash.DebugTraceCollect collects an array of tracing information. The object itself is an iterable which contains the respective tracing information once the hash function has returned.
- Returns:
- Hash : str
String of at most length
- cdxcore.uniquehash.UniqueLabel(max_length=60, id_length=8, separator=' ', filename_by=None)
Returns a function:
f( unique_label )
which generates strings of at most max_length based on a provided unique_label; essentially:
if len(unique_label) <= max_length:
    unique_label
else:
    unique_label + separator + ID
where ID is a unique hash computed from unique_label of maximum length id_length.
This function assumes that unique_label is unique, hence the ID is dropped if unique_label does not exceed max_length. Use cdxcore.uniquehash.NamedUniqueHash() if the label is not unique; that function always appends the dynamically calculated unique ID.
Note that if filename_by conversion is used, then this function will always attach the unique ID to the filename, because after the conversion of the label to a filename it is no longer guaranteed that the result is unique. If your label is unique as a filename, do not use filename_by. The function will return valid file names if label is a valid file name.
- Parameters:
- max_length : int
Total length of the returned string including the ID. Defaults to 60 to allow file names with extensions of up to three letters.
- id_length : int
Indicative length of the hash ID. Default is 8. id_length will be reduced to max_length if necessary.
- separator : str
Separator between the label and the unique ID. Note that the separator is included in the ID calculation, hence different separators lead to different IDs.
- filename_by : str
If not None, use cdxcore.util.fmt_filename() with by=filename_by to ensure the returned string is a valid filename for both Windows and Linux, of at most max_length size. If set to the string "default", use cdxcore.util.DEF_FILE_NAME_MAP as the default mapping for cdxcore.util.fmt_filename().
- Returns:
- Hash function : Callable
Hash function with signature (unique_label).
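The behaviour described for UniqueLabel can be sketched in a few lines. The factory sketch_unique_label is hypothetical and uses hashlib for the ID. It assumes that an over-long label is truncated so that the result respects max_length, and it leaves out the filename_by conversion entirely.

```python
import hashlib

def sketch_unique_label(max_length=60, id_length=8, separator=" "):
    """Hypothetical helper illustrating the documented logic only."""
    id_length = min(id_length, max_length)   # reduce id_length if necessary

    def f(unique_label):
        if len(unique_label) <= max_length:
            return unique_label              # label fits: no ID needed
        # The ID is computed from the label and the separator.
        payload = (unique_label + separator).encode("utf-8")
        uid = hashlib.sha256(payload).hexdigest()[:id_length]
        room = max_length - id_length - len(separator)
        if room <= 0:
            return uid                       # no space left for any label text
        # Assumption: the label is truncated to make room for separator and ID.
        return unique_label[:room] + separator + uid

    return f

label = sketch_unique_label(max_length=20, id_length=8)
print(label("short label"))
print(label("x" * 50))
```

Labels which fit are returned unchanged, while longer labels are shortened and made distinguishable by the appended ID.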
- cdxcore.uniquehash.named_unique_filename48_8(label, *args, **kwargs)
Returns a unique and valid filename which is composed of label and a unique ID computed using all of label, args, and kwargs.
Consider a use case where an experiment defined by definition has produced results which we wish to pickle to disk. Assume further that str(definition) provides an informative, user-readable, but not necessarily unique description of definition.
Pseudo-Code:
import pickle
from cdxcore.uniquehash import named_unique_filename48_8

def store_experiment( num : int, definition : object, results : object ):
    label = f"Experiment {str(definition)}"
    filename = named_unique_filename48_8( label, (num, definition) )
    with open(filename, "wb") as f:
        pickle.dump(results, f)  # write the pickled results to the file

This is the hash function returned by cdxcore.uniquehash.NamedUniqueHash with parameters max_length=48, id_length=8, filename_by="default".
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.
- cdxcore.uniquehash.unique_hash16(*args, **kwargs)
Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=16.
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.
- cdxcore.uniquehash.unique_hash32(*args, **kwargs)
Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=32.
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.
- cdxcore.uniquehash.unique_hash48(*args, **kwargs)
Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=48.
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.
- cdxcore.uniquehash.unique_hash64(*args, **kwargs)
Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=64.
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.