⚠️ Stale document. This doc predates the v38 rename of ContentAddressable to HasContent and references module paths that no longer exist. The authoritative treatment of content-addressed identity is in engine/src/tangl/core/CORE_DESIGN.md (Trait Axes and Identity Model sections). This file is retained for historical context only and should not be used as a reference for current API usage.

Content-Addressable Records

Overview

ContentAddressable is a mixin for Record types that need content-based identity in addition to UID-based identity. It automatically computes a content_hash from the record’s content, enabling deduplication, provenance tracking, and content-based lookups.

When to Use

Use ContentAddressable when your Record type needs:

  1. Deduplication - Same content should be recognized as identical

  2. Provenance - Track exactly what content was used

  3. Content Lookups - Find records by their content, not just UID

  4. Immutability Verification - Detect if content changes

Usage

Basic Usage (Default Hashing)

from tangl.core.entity import Record
from tangl.core.record.content_addressable import ContentAddressable


class MyTemplate(Record, ContentAddressable):
    name: str
    archetype: str
    hp: int
    # content_hash is auto-computed from all fields except uid and the
    # other excluded metadata listed under Default Behavior

Custom Hashing

from pathlib import Path

from tangl.utils.hashing import compute_data_hash


class MyResource(Record, ContentAddressable):
    path: Path
    metadata: dict

    @classmethod
    def _get_hashable_content(cls, data: dict):
        # Hash only the file's content, not the metadata field
        if 'path' in data:
            return compute_data_hash(Path(data['path']))
        return None  # No path supplied: skip hashing

Accessing the Hash

template = MyTemplate(name="guard", archetype="soldier", hp=50)

# Full hash (bytes)
full_hash: bytes = template.content_hash

# Truncated hex (for display/logging)
short_id: str = template.get_content_identifier()  # First 16 hex chars
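
As a rough illustration of how the two forms relate, the sketch below derives a 16-character hex identifier from a full digest using only hashlib. That get_content_identifier() is a plain hex prefix of content_hash is an assumption based on the comment above; the payload and the choice of Blake2b are illustrative as well.

```python
import hashlib

# Hypothetical payload standing in for a record's hashable content.
payload = b'{"archetype": "soldier", "hp": 50, "name": "guard"}'

# Full digest as bytes (assumed to correspond to content_hash).
full_hash: bytes = hashlib.blake2b(payload).digest()

# First 16 hex characters, suitable for display or logging.
short_id: str = full_hash.hex()[:16]

print(len(full_hash), short_id)
```

The short form is convenient for logs, but it carries only 64 bits of the digest, so the full hash remains the better choice for equality checks.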

How It Works

Automatic Computation

  1. When you construct a Record with ContentAddressable:

    record = MyRecord(field1="value", field2=42)
    
  2. The @model_validator calls _get_hashable_content(data)

  3. Result is passed to hashing_func() (from tangl.utils.hashing)

  4. Computed hash is set as content_hash field
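
The four steps above can be sketched without the framework. This is a minimal stand-in, not the actual mixin: it uses a frozen dataclass and __post_init__ in place of the @model_validator, and both SketchRecord and the local hashing_func() are hypothetical names whose behaviour is assumed.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass, field, fields
from typing import Optional


def hashing_func(content) -> bytes:
    """Stand-in for the real hashing_func(); its actual behaviour is assumed."""
    if not isinstance(content, bytes):
        # Canonical JSON so that key order cannot change the digest.
        content = json.dumps(content, sort_keys=True, default=str).encode()
    return hashlib.blake2b(content).digest()


@dataclass(frozen=True)
class SketchRecord:
    field1: str
    field2: int
    uid: uuid.UUID = field(default_factory=uuid.uuid4)
    content_hash: Optional[bytes] = field(default=None, init=False)

    @classmethod
    def _get_hashable_content(cls, data: dict):
        # Step 2: decide which fields contribute to content identity.
        return {k: v for k, v in data.items() if k not in {"uid", "content_hash"}}

    def __post_init__(self):
        # Plays the role of the @model_validator hook in this sketch.
        data = {f.name: getattr(self, f.name) for f in fields(self)}
        content = self._get_hashable_content(data)
        if content is not None:
            # Steps 3 and 4: hash the content and store the result.
            # The dataclass is frozen, so bypass its __setattr__ guard.
            object.__setattr__(self, "content_hash", hashing_func(content))


a = SketchRecord(field1="value", field2=42)
b = SketchRecord(field1="value", field2=42)
print(a.uid != b.uid, a.content_hash == b.content_hash)
```

Two instances built from the same field values receive distinct UIDs but the same content hash, which is the property the mixin is meant to provide.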

Default Behavior

By default, ContentAddressable hashes the entire record except:

  • uid (instance-specific)

  • content_hash (would be circular)

  • created_at, updated_at (temporal metadata)
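
A minimal sketch of that default filter, assuming the exclusion is a simple key filter over the input data. The field names come from the list above; DEFAULT_EXCLUDED and default_hashable_content() are hypothetical names used only for illustration.

```python
# Field names taken from the list above; the constant and function are hypothetical.
DEFAULT_EXCLUDED = frozenset({"uid", "content_hash", "created_at", "updated_at"})


def default_hashable_content(data: dict) -> dict:
    """Drop instance-specific and temporal fields before hashing."""
    return {k: v for k, v in data.items() if k not in DEFAULT_EXCLUDED}


first = {"uid": "a1", "created_at": "2024-01-01", "name": "guard", "hp": 50}
second = {"uid": "b2", "created_at": "2024-06-30", "name": "guard", "hp": 50}

# Both reduce to the same hashable content despite different uid and timestamp.
print(default_hashable_content(first) == default_hashable_content(second))
```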

Customization

Override _get_hashable_content() to:

  • Exclude additional fields (like scope, label for templates)

  • Include only specific fields

  • Hash external content (file data, URLs)

  • Skip hashing entirely (return None)

Examples

Template Hashing

class ActorScript(Record, ContentAddressable):
    name: str
    archetype: str
    hp: int
    scope: ScopeSelector | None = None  # Metadata, don't hash
    label: str | None = None  # Metadata, don't hash
    
    @classmethod
    def _get_hashable_content(cls, data: dict):
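        # Hash structure, not metadata. An override replaces the default
        # filter, so the timestamp fields are excluded here as well.
        exclude = {'uid', 'content_hash', 'created_at', 'updated_at',
                   'scope', 'label'}
        return {k: v for k, v in data.items() if k not in exclude}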

Result: templates with the same name, archetype, and hp get the same hash, regardless of scope or label.

Media Resource Hashing

from pathlib import Path

from tangl.utils.hashing import compute_data_hash


class MediaRIT(Entity, ContentAddressable):
    path: Path | None = None
    data: bytes | None = None

    @classmethod
    def _get_hashable_content(cls, data: dict):
        # Prefer in-memory bytes; otherwise hash the file at path
        if data.get('data') is not None:
            return data['data']
        if data.get('path') is not None:
            return compute_data_hash(Path(data['path']))
        raise ValueError("Must provide data or path")

Result: files with the same content get the same hash, even if their paths differ.
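
The path-independence property can be checked with the standard library alone. The sketch below does not use compute_data_hash(), whose signature is not shown here; hash_file_bytes() is a hypothetical helper that digests a file in chunks with Blake2b.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file_bytes(path: Path, chunk_size: int = 65536) -> bytes:
    """Hypothetical helper: digest a file's bytes, ignoring its name and location."""
    digest = hashlib.blake2b()
    with path.open("rb") as fh:
        # Read in chunks so large media files are never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "portrait.png"
    second = Path(tmp) / "copies" / "portrait_backup.png"
    second.parent.mkdir()
    first.write_bytes(b"same image bytes")
    second.write_bytes(b"same image bytes")

    same_content = hash_file_bytes(first) == hash_file_bytes(second)

print(same_content)
```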

Integration with Registry

Because content_hash is marked as an identifier (is_identifier=True), Registry can find records by hash:

# Add templates to registry
template1 = ActorScript(name="guard", archetype="soldier", hp=50)
template2 = ActorScript(name="guard", archetype="soldier", hp=50)  # Same content
registry.add(template1)
registry.add(template2)  # Duplicate - same hash

# Find by content hash identifier
matches = registry.find_all(Selector.from_identifier(template1.content_hash))
assert len(matches) == 2  # Both instances
assert matches[0].content_hash == matches[1].content_hash
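
One way to picture this lookup is a secondary index that sits beside the UID map. The sketch below is not the Registry implementation; Item and HashIndex are hypothetical names, and the structure is an assumption about how an identifier alias could be served.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    uid: int
    content_hash: bytes


class HashIndex:
    """Hypothetical secondary index: content_hash -> records sharing it."""

    def __init__(self):
        self._by_uid = {}
        self._by_hash = defaultdict(list)

    def add(self, item: Item) -> None:
        self._by_uid[item.uid] = item                    # UID stays the primary key
        self._by_hash[item.content_hash].append(item)    # hash acts as an alias

    def find_all(self, content_hash: bytes) -> list:
        return list(self._by_hash.get(content_hash, []))


index = HashIndex()
index.add(Item(uid=1, content_hash=b"guard-hash"))
index.add(Item(uid=2, content_hash=b"guard-hash"))   # same content, new UID
index.add(Item(uid=3, content_hash=b"other-hash"))

matches = index.find_all(b"guard-hash")
print([m.uid for m in matches])
```

Keeping the UID map primary and the hash map secondary means duplicates can coexist in the registry while still being discoverable as a group.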

Provenance Tracking

Record the content hash in a BuildReceipt to track exactly which template was used:

# In provisioner
template = world.template_registry.find_one(Selector.from_identifier("guard"))

receipt = BuildReceipt(
    destination_uid=actor.uid,
    metadata={
        'template_ref': 'guard',
        'template_hash': template.get_content_identifier(),
        # Can verify later that exact template was used
    }
)
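
The later verification step might look like the following sketch. It assumes the receipt metadata is a plain dict shaped like the one above; content_identifier() is a hypothetical stand-in for get_content_identifier(), and template_unchanged() is an illustrative helper, not part of the library.

```python
import hashlib
import json


def content_identifier(content: dict, length: int = 16) -> str:
    """Hypothetical stand-in for get_content_identifier(): truncated hex digest."""
    canonical = json.dumps(content, sort_keys=True).encode()
    return hashlib.blake2b(canonical).hexdigest()[:length]


def template_unchanged(receipt_metadata: dict, template_content: dict) -> bool:
    """Compare the identifier stored at build time with a freshly computed one."""
    return receipt_metadata.get("template_hash") == content_identifier(template_content)


guard = {"name": "guard", "archetype": "soldier", "hp": 50}
metadata = {"template_ref": "guard", "template_hash": content_identifier(guard)}

print(template_unchanged(metadata, guard))                 # original template
print(template_unchanged(metadata, {**guard, "hp": 60}))   # edited template
```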

Best Practices

DO:

  • ✅ Use for immutable content (templates, resources)

  • ✅ Exclude metadata from hash (scope, labels, timestamps)

  • ✅ Document what fields are hashed in _get_hashable_content()

  • ✅ Use get_content_identifier() for logging

DON’T:

  • ❌ Use for frequently-mutating records (defeats caching)

  • ❌ Hash sensitive data without considering privacy

  • ❌ Assume hash uniqueness (collisions theoretically possible)

  • ❌ Use hash as primary key (UID is primary, hash is alias)

Performance Notes

  • Hash computation happens once at construction

  • Records are frozen (immutable), so the hash never changes

  • No separate cache is needed: the hash is computed once and stored on the record

  • hashing_func() is fast (Blake2b or SHA224)

Troubleshooting

“Hash not computed”

  • _get_hashable_content() returned None

  • Check your override implementation

“Same content, different hashes”

  • Metadata fields are being included in the hash

  • Add them to the exclude set in _get_hashable_content()
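
To find which fields are responsible, it can help to compare the hashable content of two records directly. The helper below is a hypothetical debugging aid, assuming you can obtain each record's hashable content as a dict.

```python
def differing_fields(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    keys = a.keys() | b.keys()
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}


# Hypothetical hashable content of two templates expected to match.
left = {"name": "guard", "archetype": "soldier", "hp": 50, "label": "gate-guard"}
right = {"name": "guard", "archetype": "soldier", "hp": 50, "label": "tower-guard"}

print(differing_fields(left, right))
```

Any field reported here that is metadata rather than content is a candidate for the exclude set.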

“Different content, same hash” (collision)

  • Astronomically unlikely with Blake2b/SHA224

  • Report as bug if confirmed

See Also

  • MediaResourceInventoryTag - Example using ContentAddressable

  • BaseScriptItem - Templates using ContentAddressable

  • Registry - Content-based lookups