cdxcore.uniquehash#
Framework for producing unique hashes for various Python elements. Hashing is key for caching strategies and managing data pipelines effectively. The module contains a range of utility functions to ease implementation of pipelines and other tasks where hashes of data are required.
Overview#
By default, the functionality here follows a few important design principles, which are discussed in cdxcore.uniquehash.UniqueHash(), such as:
- Members of objects and elements of dictionaries whose names start with “_” are ignored.
- Member functions of objects or dictionaries are ignored.
- Dictionaries are assumed to be order-invariant, even though Python now maintains construction order for dictionaries and therefore also for objects.
Example:
class A(object):
    def __init__(self, x):
        self.x = x
        self._y = x*2  # protected member will not be hashed by default

from cdxcore.uniquehash import UniqueHash
uniqueHash = UniqueHash(length=12)
a = A(2)
print( uniqueHash(a) )  # --> "2d1dc3767730"
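The module contains a few pre-defined hash functions with different hash lengths: cdxcore.uniquehash.unique_hash8(), cdxcore.uniquehash.unique_hash16(), cdxcore.uniquehash.unique_hash32(), cdxcore.uniquehash.unique_hash48(), and cdxcore.uniquehash.unique_hash64().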
Often it is desirable to generate readable labels or filenames which are still unique. This can be accomplished with the following:
- cdxcore.uniquehash.NamedUniqueHash() appends a unique hash to a descriptive name which does not need to be unique in itself.
- cdxcore.uniquehash.UniqueLabel() cuts already unique labels down to a given maximum length. The function essentially adds a hash code to the end of the label if the label exceeds the maximum length.
Import#
import cdxcore.uniquehash as uniquehash
Documentation#
Functions
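NamedUniqueHash | Generate user-readable unique hashes and filenames.
UniqueLabel | Extend unique labels by a hash ID if they exceed max_length.
named_unique_filename48_8 | Returns a unique and valid filename which is composed of label and a unique ID computed using all of label, args, and kwargs.
unique_hash16 | Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=16.
unique_hash32 | Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=32.
unique_hash48 | Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=48.
unique_hash64 | Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=64.
unique_hash8 | Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=8.
unique_label48_8 | Returns a unique label.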
Classes
DebugTrace | Base class for tracing hashing operations.
DebugTraceCollect | Keep track of everything parsed during hashing.
DebugTraceVerbose | Live printing of tracing information with cdxcore.verbose.Context.
UniqueHash | A calculator class which computes unique hashes of a fixed length.
- class cdxcore.uniquehash.DebugTrace[source]#
Bases: object
Base class for tracing hashing operations.
Use either cdxcore.uniquehash.DebugTraceCollect or cdxcore.uniquehash.DebugTraceVerbose for debugging. The latter prints tracing information during the computation of a hash, while the former collects all this information in a simple data structure. Note that collecting can be quite memory intensive.
- class cdxcore.uniquehash.DebugTraceCollect(tostr=None)[source]#
Bases: DebugTrace
Keep track of everything parsed during hashing.
The result of the trace is contained in cdxcore.uniquehash.DebugTraceCollect.trace.
Note that DebugTraceCollect itself implements Collection and Sequence semantics, so you can iterate it directly.
- Parameters:
- tostr: int
If set to a positive integer, then any object encountered will be represented as a string with repr(), and the length of the string will be limited to tostr. This avoids generating large amounts of data if the objects hashed are large (e.g. numpy arrays).
If set to None, the collector stores the actual elements.
- trace#
Trace of the hashing operation. Upon completion of cdxcore.uniquehash.UniqueHash.__call__() this list contains elements of the following type:
- if tostr is a positive integer:
typex: type of the element
reprx: repr of the element, up to tostr length
msg: message which occurred during hashing, if any
child: set if the element was a container or object
- if tostr is None:
x: the element
msg: message which occurred during hashing, if any
child: set if the element was a container or object
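The shape of such a trace can be illustrated with a small standard-library sketch. This is not the cdxcore implementation: the names TraceRecord and collect_trace are hypothetical, and only the field names typex, reprx, msg and child are taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class TraceRecord:
    """Hypothetical record mirroring the fields listed for a positive tostr."""
    typex: str                      # type name of the element
    reprx: str                      # repr of the element, cut to tostr characters
    msg: str = ""                   # optional message
    child: list = field(default_factory=list)  # records of contained elements


def collect_trace(x, tostr=20):
    """Build a nested trace for x, truncating each repr to tostr characters."""
    record = TraceRecord(typex=type(x).__name__, reprx=repr(x)[:tostr])
    if isinstance(x, dict):
        # Visit values in a stable key order (an assumption of this sketch).
        record.child = [collect_trace(x[k], tostr) for k in sorted(x, key=repr)]
    elif isinstance(x, (list, tuple)):
        record.child = [collect_trace(v, tostr) for v in x]
    return record


trace = collect_trace({"b": "x" * 100, "a": [1, 2]}, tostr=10)
print(trace.typex, [c.typex for c in trace.child])
```

Truncating the repr keeps the trace small even when large objects are visited, which is the motivation given for tostr above.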
- class cdxcore.uniquehash.DebugTraceVerbose(strsize=50, verbose=None)[source]#
Bases: DebugTrace
Live printing of tracing information with cdxcore.verbose.Context for some formatting. All objects will be reported by type and their string representation, sufficiently reduced if necessary.
- Parameters:
- strsize: int, optional
Maximum string size when using repr() on reported objects. Default is 50.
- verbose: cdxcore.verbose.Context, optional
Context object or None for a new context object with full visibility (it prints everything).
- cdxcore.uniquehash.NamedUniqueHash(max_length=60, id_length=16, *, separator=' ', filename_by=None, **unique_hash_arguments)[source]#
Generate user-readable unique hashes and filenames.
Returns a function:
f( label, *args, **kwargs )
which generates unique strings of at most max_length characters in the format label + separator + ID, where ID has length id_length. Since label heads the resulting string, this function is suited for use cases where a user might want an indication of what a hash refers to.
This function does not assume that label is unique, hence the ID is prioritized. See cdxcore.uniquehash.UniqueLabel() for a function which assumes the label is unique.
The maximum length of the returned string is max_length; if need be, label will be truncated so that the returned string always ends in ID.
The function optionally makes sure that the returned string is a valid file name using cdxcore.util.fmt_filename().
Short Cut
Consider cdxcore.uniquehash.named_unique_filename48_8() if the defaults used for that function are suitable for your use case.
Important
It is strongly recommended to read the documentation for cdxcore.uniquehash.UniqueHash for details on hashing logic and the available parameters.
- Parameters:
- max_length: int, default 60
Total length of the returned string including the ID. Defaults to 60 to allow file names with extensions of up to three letters.
- id_length: int, default 16
Intended length of the hash ID, default 16.
- separator: str, default ' '
Separator between label and ID. Note that the separator will be included in the ID calculation, hence different separators lead to different IDs. Default ' '.
- filename_by: str | None, default None
If not None, use cdxcore.util.fmt_filename() with by=filename_by to ensure the returned string is a valid filename for both Windows and Linux, of at most max_length size.
If set to the string "default", use cdxcore.util.DEF_FILE_NAME_MAP as the default mapping for cdxcore.util.fmt_filename().
- **unique_hash_arguments, optional
Parameters passed to cdxcore.uniquehash.UniqueHash.
- Returns:
- uniqueHash
Callable hash function with signature (label, *args, **kwargs).
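The label + separator + ID layout can be sketched with the standard library alone. The following is an illustration under stated assumptions, not the cdxcore code: make_named_hash is a hypothetical name, SHA-256 over a repr() of the inputs stands in for the library's hashing logic, and filename conversion is left out.

```python
import hashlib


def make_named_hash(max_length=60, id_length=16, separator=" "):
    """Return f(label, *args, **kwargs) -> label + separator + ID (sketch only)."""
    room = max_length - id_length - len(separator)
    if room <= 0:
        raise ValueError("max_length leaves no room for any part of the label")

    def named_hash(label, *args, **kwargs):
        # The separator is part of the hashed payload, so different
        # separators lead to different IDs.
        payload = repr((label, separator, args, sorted(kwargs.items())))
        uid = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:id_length]
        # The label is truncated if needed; the result always ends in the ID.
        return label[:room] + separator + uid

    return named_hash


f = make_named_hash(max_length=20, id_length=8)
print(f("experiment with a long name", 1, a=2))
```

Because the ID is computed from the full label and all arguments, two calls with the same truncated label but different inputs still produce different strings.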
- class cdxcore.uniquehash.UniqueHash(length=32, *, parse_underscore='none', sort_dicts=True, parse_functions=False, pd_ignore_column_order=True, np_nan_equal=False, f_include_defaults=True, f_include_closure=True, f_include_globals=True)[source]#
Bases: object
A calculator class which computes unique hashes of a fixed length.
There are a number of parameters which control the exact semantics of the hashing algorithm as it iterates through collections and objects; these are discussed below.
The base use case is to only specify the length of the unique ID string to be computed:
class A(object):
    def __init__(self, x):
        self.x = x
        self._y = x*2  # protected member will not be hashed by default

from cdxcore.uniquehash import UniqueHash
uniqueHash = UniqueHash(length=12)
a = A(2)
print( uniqueHash(a) )  # --> "2d1dc3767730"
The callable uniqueHash can be applied to “any” Python construct.
Private and Protected members
When an object is passed to this callable, its members are iterated using __dict__ or __slots__, respectively. By default this process ignores any fields in objects or dictionaries whose names start with “_”. The idea here is that “functional” parameters are stored as members, but any derived data is stored in protected members. This behaviour can be changed with parse_underscore.
Objects can optionally implement their own hashing scheme by implementing:
__unique_hash__( self, unique_hash : UniqueHash, debug_trace : DebugTrace )
This function may return a unique string, or any other non-None Python object which will then again be hashed. A common use case is to ignore the parameters to this function and return a tuple of members of the class which are pertinent for hashing:
from cdxcore.uniquehash import UniqueHash, DebugTrace

class CustomHash(object):
    def __init__(self, x):
        self.x = x
        self.x2 = x*2  # derived data; no need to hash
    def __unique_hash__( self, unique_hash : UniqueHash, debug_trace : DebugTrace ):
        return ( self.x, )
More generally, the unique_hash argument can be used to hash any elements in the object. If used, you should also pass debug_trace:
class CustomHash(object):
    def __init__(self, x):
        self.x = x
        self.x2 = x*2  # derived data; no need to hash
    def __unique_hash__( self, unique_hash : UniqueHash, debug_trace : DebugTrace ):
        return unique_hash(self.x, debug_trace=debug_trace)
Finally, users may also simply set __unique_hash__ to a given unique string computed ahead of time:
class CustomHash(object):
    def __init__(self, x):
        self.x = x
        self.x2 = x*2  # derived data; no need to hash
        self.__unique_hash__ = str(x)
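How a hasher might honour these three variants can be sketched as follows. This is a hypothetical stand-in, not the cdxcore dispatch logic: sketch_hash and the two example classes are invented for illustration, and SHA-256 over repr() is an assumption made to keep the example self-contained.

```python
import hashlib


def sketch_hash(x, *, debug_trace=None, length=12):
    """Hypothetical hasher which looks for a __unique_hash__ member first."""
    custom = getattr(x, "__unique_hash__", None)
    if isinstance(custom, str):
        x = custom                            # pre-computed unique string
    elif callable(custom):
        x = custom(sketch_hash, debug_trace)  # method returning a hashable object
    return hashlib.sha256(repr(x).encode("utf-8")).hexdigest()[:length]


class ByMethod:
    def __init__(self, x):
        self.x = x
        self.x2 = x * 2  # derived data, left out of the hash

    def __unique_hash__(self, unique_hash, debug_trace):
        return (self.x,)


class ByString:
    def __init__(self, x):
        self.x = x
        self.__unique_hash__ = str(x)


a, b = ByMethod(2), ByMethod(2)
b.x2 = 99  # changing derived data does not change the hash
print(sketch_hash(a) == sketch_hash(b), sketch_hash(ByString(2)))
```

Names with both leading and trailing double underscores are not subject to Python's private name mangling, which is why the attribute assignment in ByString is found by getattr.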
Dictionaries
Since Python 3.6 dictionaries preserve the order in which they were constructed. However, Python semantics remain otherwise order-invariant, i.e. {'x':1, 'y':2} tests equal to {'y':2, 'x':1}. For this reason the default behaviour here for dictionaries is to sort them before hashing their content. This also applies to objects processed via their __dict__.
This can be turned off by setting sort_dicts to False. OrderedDicts or any classes derived from them (such as cdxcore.prettydict.pdct) are processed in order and not sorted in any case.
Functions
By default function members of objects and dictionaries (which include @properties) are ignored. You can set parse_functions to True to parse a reduced text of the function code. There are a number of additional expert settings for handling functions, see below.
Numpy, Pandas
Hashing of large datasets is not advised. Use hashes on the generating parameter set instead where possible.
- Parameters:
- length: int, optional
Intended length of the hash. Default is 32.
- parse_underscore: str, optional
How to handle object members starting with “_”.
"none": ignore members starting with “_” (the default).
"protected": ignore ‘private’ members, i.e. those declared starting with “_” and containing “__”.
"private": consider all members.
Default is "none".
- sort_dicts: bool, optional
Since Python 3.6 dictionaries are ordered. That means that strictly speaking the two dictionaries {'x':1, 'y':2} and {'y':2, 'x':1} are not identical; however, Python will semantically still treat them as equal, as == between the two returns True. Accordingly, by default this hash function assumes the order of dictionaries does not matter unless they are, or are derived from, OrderedDict (or have their own implementation such as cdxcore.prettydict.PrettyObject).
Practically that means this function first sorts the keys of mappings before hashing their items.
This can be turned off by setting sort_dicts=False. Default is True.
- parse_functions: bool, optional
If True, then the function will attempt to generate unique hashes for functions. Default is False.
- pd_ignore_column_order: bool, optional
(Advanced parameter). Whether to ignore the order of pandas columns. The default is True.
- np_nan_equal: bool, optional
(Advanced parameter). Whether to ignore the specific type of a NaN. The default is False.
- f_include_defaults: bool, optional
(Advanced parameter). When parsing functions, whether to include default values. Default is True.
- f_include_closure: bool, optional
(Advanced parameter). When parsing functions, whether to include the function closure. This can be expensive. Default is True.
- f_include_globals: bool, optional
(Advanced parameter). When parsing functions, whether to include globals used by the function. This can be expensive. Default is True, as per the class signature.
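The default semantics for underscores, function members and dictionary order can be illustrated with a minimal standard-library sketch. It is not the cdxcore algorithm: canonical and sketch_unique_hash are hypothetical names, only the behaviour of parse_underscore="none", parse_functions=False and sort_dicts is mimicked, and SHA-256 over repr() is an assumption.

```python
import hashlib


def canonical(x, sort_dicts=True):
    """Reduce x to a nested tuple, skipping "_" keys and callable members."""
    if isinstance(x, dict):
        items = [
            (k, canonical(v, sort_dicts))
            for k, v in x.items()
            if not (isinstance(k, str) and k.startswith("_")) and not callable(v)
        ]
        if sort_dicts:
            items.sort(key=lambda kv: repr(kv[0]))  # order-invariant dictionaries
        return ("dict", tuple(items))
    if isinstance(x, (list, tuple)):
        return (type(x).__name__, tuple(canonical(v, sort_dicts) for v in x))
    if hasattr(x, "__dict__"):
        # Objects are processed via their __dict__, like dictionaries.
        return (type(x).__name__, canonical(vars(x), sort_dicts))
    return x


def sketch_unique_hash(x, length=12, sort_dicts=True):
    """Hash the canonical form of x and cut the digest to the requested length."""
    digest = hashlib.sha256(repr(canonical(x, sort_dicts)).encode("utf-8"))
    return digest.hexdigest()[:length]


class A:
    def __init__(self, x):
        self.x = x
        self._y = x * 2  # ignored, as its name starts with "_"


print(sketch_unique_hash({"x": 1, "y": 2}) == sketch_unique_hash({"y": 2, "x": 1}))
```

With sort_dicts=False the two dictionaries in the last line are reduced to different tuples and therefore hash differently in this sketch.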
- __call__(*args, debug_trace=None, **kwargs)[source]#
Returns a unique hash for the args and kwargs parameters passed to this function.
Example:
class A(object):
    def __init__(self, x):
        self.x = x
        self._y = x*2  # protected member will not be hashed by default

from cdxcore.uniquehash import UniqueHash
uniqueHash = UniqueHash(12)
a = A(2)
print( uniqueHash(a) )  # --> "2d1dc3767730"
- Parameters:
- args, kwargs:
Parameters to hash.
- debug_trace: cdxcore.uniquehash.DebugTrace | None, default None
Allows tracing of hashing activity for debugging purposes. Two implementations of DebugTrace are available:
- cdxcore.uniquehash.DebugTraceVerbose simply prints out hashing activity to stdout.
- cdxcore.uniquehash.DebugTraceCollect collects an array of tracing information. The object itself is an iterable which contains the respective tracing information once the hash function has returned.
- Returns:
- Hash: str
String of at most self.length characters.
- cdxcore.uniquehash.UniqueLabel(max_length=60, id_length=8, separator=' ', filename_by=None)[source]#
Extend unique labels by a hash ID if they exceed max_length.
This function returns a function:
f( unique_label )
which generates strings of at most max_length, based on a provided unique_label. Essentially this function performs:
if len(unique_label) <= max_length:
    unique_label
else:
    unique_label + separator + ID
where ID is a unique hash computed from unique_label of maximum length id_length.
This function assumes that unique_label is unique, hence the ID is dropped if unique_label does not exceed max_length. Use cdxcore.uniquehash.NamedUniqueHash() if the label is not unique; that function always appends the dynamically calculated unique ID.
Note that if filename_by conversion is used, then this function will always attach the unique ID to the filename because after the conversion of the label to a filename it is no longer guaranteed that the result is unique. If your label is unique as a filename, do not use filename_by. The function will return valid file names if label is a valid file name.
- Parameters:
- max_length: int, default 60
Total length of the returned string including the ID. Defaults to 60 to allow file names with extensions with three letters.
- id_length: int, default 8
Indicative length of the hash ID, default 8. id_length will be reduced to max_length if necessary.
- separator: str, default ' '
Separator between the label and the unique ID.
Note that the separator will be included in the ID calculation, hence different separators lead to different IDs.
- filename_by: str | None, default None
If not None, use cdxcore.util.fmt_filename() with by=filename_by to ensure the returned string is a valid filename for both Windows and Linux, of at most max_length size. Note that if filename_by is not None, then the function will always append a hash to unique_label because it cannot rule out that the filename conversion creates overlapping labels.
If set to the string "default", use cdxcore.util.DEF_FILE_NAME_MAP as the default mapping for cdxcore.util.fmt_filename().
- Returns:
- Hash function
Callable hash function with signature (unique_label).
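The rule described above can be sketched with the standard library. This is an illustration only, not the cdxcore code: make_unique_label is a hypothetical name, SHA-256 is an assumed stand-in for the library's hash, filename conversion is omitted, and the truncation of over-long labels follows the description of unique_label48_8 rather than a confirmed implementation detail.

```python
import hashlib


def make_unique_label(max_length=60, id_length=8, separator=" "):
    """Return f(unique_label) which only adds an ID to over-long labels (sketch)."""
    room = max_length - id_length - len(separator)
    if room <= 0:
        raise ValueError("max_length leaves no room for any part of the label")

    def unique_label(label):
        if len(label) <= max_length:
            return label  # already unique and short enough: returned unchanged
        # The separator is included in the hashed text, as for the ID above.
        uid = hashlib.sha256((label + separator).encode("utf-8")).hexdigest()[:id_length]
        return label[:room] + separator + uid

    return unique_label


f = make_unique_label(max_length=20, id_length=8, separator="-")
print(f("short"), f("a" * 40))
```

Labels which share a long common prefix stay distinguishable after truncation because the ID is computed from the full, untruncated label.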
- cdxcore.uniquehash.named_unique_filename48_8(label, *args, **kwargs)[source]#
Returns a unique and valid filename which is composed of label and a unique ID computed using all of label, args, and kwargs.
label is not assumed to be unique.
Consider a use case where an experiment defined by definition has produced results which we wish to pickle to disk. Assume further that str(definition) provides an informative, user-readable but not necessarily unique description of definition.
Example:
import pickle
from cdxcore.uniquehash import named_unique_filename48_8

def store_experiment( num : int, definition : object, results : object ):
    label = f"Experiment {str(definition)}"
    filename = named_unique_filename48_8( label, (num, definition) )
    with open(filename, "wb") as f:
        pickle.dump(results, f)  # write the pickled results to the file
This is the hash function returned by cdxcore.uniquehash.NamedUniqueHash with parameters max_length=48, id_length=8, filename_by="default".
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.
- cdxcore.uniquehash.unique_hash16(*args, **kwargs)[source]#
Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=16.
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.
- cdxcore.uniquehash.unique_hash32(*args, **kwargs)[source]#
Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=32.
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.
- cdxcore.uniquehash.unique_hash48(*args, **kwargs)[source]#
Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=48.
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.
- cdxcore.uniquehash.unique_hash64(*args, **kwargs)[source]#
Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=64.
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.
- cdxcore.uniquehash.unique_hash8(*args, **kwargs)[source]#
Short-cut for the hash function returned by cdxcore.uniquehash.UniqueHash with parameter length=8.
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.
- cdxcore.uniquehash.unique_label48_8(label, as_file_name=False)[source]#
Returns a unique label.
This function assumes that label is inherently unique, but might exceed the maximum length of 48. In that case a unique hash of length 8 is added to the truncated label and returned.
This function may also convert any unique label into a unique, valid file name if as_file_name is True. In that case it always adds an ID (because it cannot guarantee that the conversion to a file name is unique). By default the function does not convert to a file name.
This is the hash function returned by cdxcore.uniquehash.UniqueLabel with parameters max_length=48, id_length=8, filename_by=filename_by if as_file_name else None.
Important: please make sure you are aware of the functional considerations discussed in cdxcore.uniquehash.UniqueHash around elements starting with _ or function members.