Creating Speckle classes at runtime with pydantic & inefficient serialization

Hi @gjedlicska, @Jedd,

Here I am again with two additional issues / topics to discuss regarding the specklepy client:

  • Creating Speckle classes at runtime using pydantic.create_model()
  • Inefficient serialization due to repeated traversal of the same objects

Creating Speckle classes at runtime
In our company we create some Speckle classes at runtime. These basically mimic the classes we set up ourselves, but as subclasses of Base, so that they can be serialized and sent to Speckle. For that purpose we use pydantic.create_model(); see a basic example below:

import pydantic
from specklepy.api.client import SpeckleClient
from specklepy.api.operations import send
from specklepy.objects import Base
from specklepy.transports.server import ServerTransport

# Create new Speckle type with pydantic
speckle_type = pydantic.create_model(
    "TestClass",
    __base__=Base,
    __cls_kwargs__={"detachable": {"test"}},
    **{"test": (str, None)})

# Create object of new Speckle type
test_object = speckle_type(test="test")

# Create Speckle client
client = SpeckleClient(host="https://speckle.xyz")

# Authenticate Speckle client
client.authenticate_with_token(token="YOURTOKEN")

# Initiate server transport
transport = ServerTransport(client=client, stream_id="7c1140bb24")

# Send object to server
obj_id = send(test_object, transports=[transport], use_default_cache=False)

# Create commit
client.commit.create(stream_id="7c1140bb24", object_id=obj_id, branch_name="main")

The issue is that this no longer works since pydantic version 2.0. That is because pydantic now passes an additional parameter, __pydantic_reset_parent_namespace__, in its procedure for creating a new class, see this line:

This results in a TypeError, as the additional argument cannot be handled by the __init_subclass__() method called from _RegisteringBase:
TypeError: TestClass.__init_subclass__() takes no keyword arguments

An ugly workaround is to simply pop the additional parameter out during _RegisteringBase.__init_subclass__(), which does seem to work. However, it doesn’t feel like a proper solution, so I was hoping you could provide some help with this.

I have added you both to a stream where I tested the script above, together with the additional change in the _RegisteringBase class.
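Roughly, that change looks like the following (a minimal sketch only; it assumes _RegisteringBase forwards leftover class keywords to super().__init_subclass__(), and the real method in specklepy of course does more than this):

    class _RegisteringBase:
        def __init_subclass__(cls, **kwargs):
            # pydantic >= 2.0 injects this extra class keyword when
            # create_model() builds the subclass; object.__init_subclass__()
            # takes no keyword arguments, so drop it before forwarding.
            kwargs.pop("__pydantic_reset_parent_namespace__", None)
            # ... existing registration logic ...
            super().__init_subclass__(**kwargs)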

Inefficient serialization
Another issue I dealt with is very slow serialization of large objects. The structure that we send is heavily nested, for example with results that contain lots of structural elements, meshes, etc. This caused the serialization to run for hours without ever finishing. We do have detaching and chunking in place, so that wasn’t the problem. The problem is that, due to our nested structure, the same objects are serialized and hashed over and over again, causing serious delays or even unsuccessful sending of data.
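To make that concrete with a tiny, made-up example (hypothetical objects, not our actual model): as soon as two results reference the same mesh, the object graph is no longer a pure tree, and without a cache the shared mesh is traversed and hashed once for every reference to it.

    from specklepy.objects import Base

    # Hypothetical minimal example of the sharing pattern in our data:
    # the same mesh instance is referenced by two result objects. Without a
    # serializer cache, shared_mesh is traversed and hashed once per reference.
    shared_mesh = Base(applicationId="mesh-1")

    result_a = Base(applicationId="result-a")
    result_a["@mesh"] = shared_mesh  # detached dynamic prop

    result_b = Base(applicationId="result-b")
    result_b["@mesh"] = shared_mesh  # same object referenced again

    model = Base(applicationId="model")
    model["@results"] = [result_a, result_b]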

Looking into your source code, I was a bit surprised that nested objects which have already been serialized and/or detached at an earlier stage are still serialized and hashed again each time they are encountered. I also noticed that for recompose_base() you do have caching in place, with a simple deserialized dictionary that stores previously deserialized objects. I now did a very similar thing for the serialization procedure, which instead of running for hours without success now finishes in 1.5 minutes! I store the serialized objects keyed on their applicationId. Maybe this could be handled differently, but I definitely think some kind of caching should be in place to drastically speed up the serialization of nested objects. Of course, if I’m overlooking an important reason against implementing such a mechanism, I’d be happy to hear it. See the updated _traverse_base() method here:

    def _traverse_base(self, base: Base) -> Tuple[str, Dict]:
        if not self.detach_lineage:
            self.detach_lineage = [True]

        # ADDED
        # Return from cache if already traversed
        if base.applicationId in self.serialized:
            self.detach_lineage.pop()
            return self.serialized[base.applicationId]

        self.lineage.append(uuid4().hex)
        object_builder = {"id": "", "speckle_type": "Base", "totalChildrenCount": 0}
        object_builder.update(speckle_type=base.speckle_type)
        obj, props = base, base.get_serializable_attributes()

        while props:
            prop = props.pop(0)
            value = getattr(obj, prop, None)
            chunkable = False
            detach = False

            # skip props marked to be ignored with "__" or "_"
            if prop.startswith(("__", "_")):
                continue

            # don't prepopulate id as this will mess up hashing
            if prop == "id":
                continue

            # only bother with chunking and detaching if there is a write transport
            if self.write_transports:
                dynamic_chunk_match = prop.startswith("@") and re.match(
                    r"^@\((\d*)\)", prop
                )
                if dynamic_chunk_match:
                    chunk_size = dynamic_chunk_match.groups()[0]
                    base._chunkable[prop] = (
                        int(chunk_size) if chunk_size else base._chunk_size_default
                    )

                chunkable = prop in base._chunkable
                detach = bool(
                    prop.startswith("@") or prop in base._detachable or chunkable
                )

            # 1. handle None and primitives (ints, floats, strings, and bools)
            if value is None or isinstance(value, PRIMITIVES):
                object_builder[prop] = value
                continue

            # NOTE: for dynamic props, this won't be re-serialised as an enum but as an int
            if isinstance(value, Enum):
                object_builder[prop] = value.value
                continue

            # 2. handle Base objects
            elif isinstance(value, Base):
                child_obj = self.traverse_value(value, detach=detach)
                if detach and self.write_transports:
                    ref_id = child_obj["id"]
                    object_builder[prop] = self.detach_helper(ref_id=ref_id)
                else:
                    object_builder[prop] = child_obj

            # 3. handle chunkable props
            elif chunkable and self.write_transports:
                chunks = []
                max_size = base._chunkable[prop]
                chunk = DataChunk()
                for count, item in enumerate(value):
                    if count and count % max_size == 0:
                        chunks.append(chunk)
                        chunk = DataChunk()
                    chunk.data.append(item)
                chunks.append(chunk)

                chunk_refs = []
                for c in chunks:
                    self.detach_lineage.append(detach)
                    ref_id, _ = self._traverse_base(c)
                    ref_obj = self.detach_helper(ref_id=ref_id)
                    chunk_refs.append(ref_obj)
                object_builder[prop] = chunk_refs

            # 4. handle all other cases
            else:
                child_obj = self.traverse_value(value, detach)
                object_builder[prop] = child_obj

        closure = {}
        # add closures & children count to the object
        detached = self.detach_lineage.pop()
        if self.lineage[-1] in self.family_tree:
            closure = {
                ref: depth - len(self.detach_lineage)
                for ref, depth in self.family_tree[self.lineage[-1]].items()
            }
        object_builder["totalChildrenCount"] = len(closure)

        obj_id = hash_obj(object_builder)

        object_builder["id"] = obj_id
        if closure:
            object_builder["__closure"] = self.closure_table[obj_id] = closure

        # write detached or root objects to transports
        if detached and self.write_transports:
            for t in self.write_transports:
                t.save_object(id=obj_id, serialized_object=ujson.dumps(object_builder))

        del self.lineage[-1]

        # ADDED
        # Add to cache
        if obj.applicationId:
            self.serialized[obj.applicationId] = obj_id, object_builder

        return obj_id, object_builder

Let me know what you think about the topics!


Hi @Rob,
Thanks for the detailed report!

I’ll have to admit, I’m not super familiar with how pydantic classes are created at runtime, but if it’s as simple as ensuring that we don’t pass all kwargs through to the __init_subclass__ call, this doesn’t seem too bad a fix from our end. I’ll see what @gjedlicska thinks of this; maybe we can pop props without being pydantic-specific?
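To sketch the non-pydantic-specific idea (just an illustration, not a tested patch, and assuming our __init_subclass__ keeps consuming roughly the class keywords it does today): accept the keywords we know about and silently drop anything else, rather than forwarding unknown ones to object.__init_subclass__(), which takes no keyword arguments.

    class _RegisteringBase:
        def __init_subclass__(
            cls,
            speckle_type=None,
            detachable=None,
            chunkable=None,
            **kwargs,  # any other class keyword (pydantic's or otherwise) lands here
        ):
            # ... existing registration logic using the known keywords ...
            # Deliberately do not forward **kwargs, since
            # object.__init_subclass__() accepts no keyword arguments.
            super().__init_subclass__()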


Regarding your second topic on serialization.
It would be great if you could share some benchmark models/streams with us. Having a clear picture of the performance characteristics of your data would be very useful for informing what sort of optimisations we need to make.

I’m guessing the performance troubles you’re encountering are because of the structure of the data you’re sending.

Traditionally, most of our connectors have sent very tree-like data, with only simple objects like RenderMaterial being referenced by multiple other objects (still conforming to a directed acyclic topology). Thus, we’ve optimised our serializer for trees and accepted the slight inefficiency of re-serializing a couple of objects multiple times.

There are definitely some optimisations we can make for non-tree-like structured data such as yours. We are also seeing a need to send some non-tree structures from our other connectors: instances, networks, structural results, etc.

I would be a fan of us exploring the optimisations we can make for all these cases, both in our Python and C# SDKs. I think you’re on the right lines with some sort of internal cache for the serializer, though we’d need to weigh up how this affects serialization performance for the other (more tree-like) data that currently makes up the majority of what is exchanged through Speckle.


Thanks @Jedd!

Happy to hear that you can probably land a small fix for class creation at runtime. Our use of Speckle largely revolves around this functionality, as it enables us to send any of our own classes to Speckle.


I follow your point about the structure of the data. Our data is definitely more nested than the case you’re describing. In particular, our results store quite a few references to other data (analysis steps, mesh, etc.) in order to retain the full meaning of the stored results. We also found that these results made it slow, or even impossible, to send our data without the proposed fix. However, I do think our data mostly or completely complies with the directed acyclic topology: it’s strongly nested, but there are no strange mutual references or other problematic patterns, as far as I know.

Great that you can look into it, as an efficiency increase is definitely needed. We’re currently setting up a test model with an extensive set of results that we can share with you, as we cannot share real project data. I’ll push that to the stream I’ve already invited you to, and send you a message once it’s there. Thanks in advance!

Hi @Jedd,

Sorry for the late reply. It took a bit of time to properly set up and run a model.
I guess you were busy preparing SpeckleCon anyway :stuck_out_tongue:

Anyway, I’ve pushed some example data to a stream I shared with you:
Speckle commit with example data

It’s a structural model containing results of an extensive structural analysis. For the structural objects (e.g. beams, columns, floors, loads, supports), we use native Speckle objects where possible. For objects that can’t be mapped so easily to native Speckle objects (e.g. results, connections), we dynamically create custom objects. Hope this helps!

To come back to the original issues: the pydantic issue is a bug that prevents us from upgrading to a newer specklepy version, which is a major problem for us. The serialization issue is also significant for medium to large models, especially when results are included; we can at least send some smaller models, but it remains a big limitation.