Creating Speckle classes at runtime with pydantic & inefficient serialization

Hi @gjedlicska, @Jedd,

Here I am again with two additional issues / topics to discuss regarding the specklepy client:

  • Creating Speckle classes during runtime using pydantic.create_model()
  • Inefficient serialization due to repeated traversal of the same objects

Creating Speckle classes during runtime
In our company we create some Speckle classes during runtime. These basically mimic the classes we set up ourselves, but as subclasses of Base, so they can be serialized and sent to Speckle. For that purpose, we use pydantic.create_model(), see a basic example below:

import pydantic
from specklepy.api.client import SpeckleClient
from specklepy.api.operations import send
from specklepy.objects import Base
from specklepy.transports.server import ServerTransport

# Create new Speckle type with pydantic
speckle_type = pydantic.create_model(
    "TestClass",
    __base__=Base,
    __cls_kwargs__={"detachable": {"test"}},
    **{"test": (str, None)})

# Create object of new Speckle type
test_object = speckle_type(test="test")

# Create Speckle client
client = SpeckleClient(host="https://speckle.xyz")

# Authenticate Speckle client
client.authenticate_with_token(token="YOURTOKEN")

# Initiate server transport
transport = ServerTransport(client=client, stream_id="7c1140bb24")

# Send object to server
obj_id = send(test_object, transports=[transport], use_default_cache=False)

# Create commit
client.commit.create(stream_id="7c1140bb24", object_id=obj_id, branch_name="main")

The issue is that this no longer works since pydantic version 2.0. That is because pydantic now passes an additional parameter __pydantic_reset_parent_namespace__ in its procedure to create a new class, see this line:

This results in a TypeError, as the additional argument cannot be handled by the __init_subclass__() method called from _RegisteringBase:
TypeError: TestClass.__init_subclass__() takes no keyword arguments

An ugly solution is to simply pop the additional parameter during _RegisteringBase.__init_subclass__(), which does seem to work. Still, it doesn’t feel like a proper solution, so I was hoping you could provide some help with this.

I have added you both to a stream where I tested the script above, with the additional change in the _RegisteringBase class:
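Roughly, that change looks like this (a minimal sketch of what I tried, with an abbreviated parameter list; the actual __init_subclass__ in specklepy takes more arguments):

    class _RegisteringBase:
        def __init_subclass__(cls, speckle_type=None, detachable=None, chunkable=None, **kwargs):
            # Workaround sketch: pydantic >= 2.0 passes an extra class keyword,
            # __pydantic_reset_parent_namespace__, which specklepy does not expect.
            # Dropping it here avoids the TypeError, because anything left in kwargs
            # eventually reaches object.__init_subclass__(), which accepts no
            # keyword arguments.
            kwargs.pop("__pydantic_reset_parent_namespace__", None)
            # ... existing type registration logic ...
            super().__init_subclass__(**kwargs)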

Inefficient serialization
Another issue I dealt with is very, very slow serialization for large objects. The structure that we send is particularly nested, for example results that contain lots of structural elements, meshes, etc. This literally caused the serialization to run for hours without ever finishing. We do have detaching and chunking in place, so that wasn’t the issue. The issue is that, due to our nested structure, objects are serialized and hashed over and over again, causing serious delays or even unsuccessful sending of data.

Looking into your source code, I was a bit surprised that nested objects that have already been serialized and/or detached at an earlier stage are still serialized and hashed again each time. I also noticed that for recompose_base() you do have caching in place, with a simple deserialized dictionary that stores previously deserialized objects. I did a very similar thing for the serialization procedure, which instead of running for hours without success now finishes in 1.5 minutes! I store the serialized objects keyed by their applicationId. Maybe this could be handled differently, but I definitely think there should be some kind of caching in place to drastically speed up the serialization of nested objects. Of course, if I’m overlooking an important reason against implementing such a mechanism, I’d be happy to hear it. See the updated _traverse_base() method here:

    def _traverse_base(self, base: Base) -> Tuple[str, Dict]:
        if not self.detach_lineage:
            self.detach_lineage = [True]

        # ADDED
        # Return from cache if already traversed
        if base.applicationId in self.serialized:
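            # the caller has already pushed a detach flag for this object onto
            # detach_lineage, so pop it here to keep the lineage stack balanced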
            self.detach_lineage.pop()
            return self.serialized[base.applicationId]

        self.lineage.append(uuid4().hex)
        object_builder = {"id": "", "speckle_type": "Base", "totalChildrenCount": 0}
        object_builder.update(speckle_type=base.speckle_type)
        obj, props = base, base.get_serializable_attributes()

        while props:
            prop = props.pop(0)
            value = getattr(obj, prop, None)
            chunkable = False
            detach = False

            # skip props marked to be ignored with "__" or "_"
            if prop.startswith(("__", "_")):
                continue

            # don't prepopulate id as this will mess up hashing
            if prop == "id":
                continue

            # only bother with chunking and detaching if there is a write transport
            if self.write_transports:
                dynamic_chunk_match = prop.startswith("@") and re.match(
                    r"^@\((\d*)\)", prop
                )
                if dynamic_chunk_match:
                    chunk_size = dynamic_chunk_match.groups()[0]
                    base._chunkable[prop] = (
                        int(chunk_size) if chunk_size else base._chunk_size_default
                    )

                chunkable = prop in base._chunkable
                detach = bool(
                    prop.startswith("@") or prop in base._detachable or chunkable
                )

            # 1. handle None and primitives (ints, floats, strings, and bools)
            if value is None or isinstance(value, PRIMITIVES):
                object_builder[prop] = value
                continue

            # NOTE: for dynamic props, this won't be re-serialised as an enum but as an int
            if isinstance(value, Enum):
                object_builder[prop] = value.value
                continue

            # 2. handle Base objects
            elif isinstance(value, Base):
                child_obj = self.traverse_value(value, detach=detach)
                if detach and self.write_transports:
                    ref_id = child_obj["id"]
                    object_builder[prop] = self.detach_helper(ref_id=ref_id)
                else:
                    object_builder[prop] = child_obj

            # 3. handle chunkable props
            elif chunkable and self.write_transports:
                chunks = []
                max_size = base._chunkable[prop]
                chunk = DataChunk()
                for count, item in enumerate(value):
                    if count and count % max_size == 0:
                        chunks.append(chunk)
                        chunk = DataChunk()
                    chunk.data.append(item)
                chunks.append(chunk)

                chunk_refs = []
                for c in chunks:
                    self.detach_lineage.append(detach)
                    ref_id, _ = self._traverse_base(c)
                    ref_obj = self.detach_helper(ref_id=ref_id)
                    chunk_refs.append(ref_obj)
                object_builder[prop] = chunk_refs

            # 4. handle all other cases
            else:
                child_obj = self.traverse_value(value, detach)
                object_builder[prop] = child_obj

        closure = {}
        # add closures & children count to the object
        detached = self.detach_lineage.pop()
        if self.lineage[-1] in self.family_tree:
            closure = {
                ref: depth - len(self.detach_lineage)
                for ref, depth in self.family_tree[self.lineage[-1]].items()
            }
        object_builder["totalChildrenCount"] = len(closure)

        obj_id = hash_obj(object_builder)

        object_builder["id"] = obj_id
        if closure:
            object_builder["__closure"] = self.closure_table[obj_id] = closure

        # write detached or root objects to transports
        if detached and self.write_transports:
            for t in self.write_transports:
                t.save_object(id=obj_id, serialized_object=ujson.dumps(object_builder))

        del self.lineage[-1]

        # ADDED
        # Add to cache
        if obj.applicationId:
            self.serialized[obj.applicationId] = obj_id, object_builder

        return obj_id, object_builder

Let me know what you think about the topics!


Hi @Rob,
Thanks for the detailed report!

I’ll have to admit I’m not super familiar with how pydantic classes are created at runtime, but if it’s as simple as ensuring that we don’t pass all kwargs through to the __init_subclass__ call, this doesn’t seem too bad of a fix from our end. I’ll see what @gjedlicska thinks of this; maybe we can pop the props without being pydantic specific?


Regarding your second topic on serialization.
It would be great if you could share some benchmark models/streams with us. Having a clear picture of the performance characteristics of your data would be very useful for informing what sort of optimisations we need to make.

I’m guessing the performance troubles you’re encountering are because of the structure of the data you’re sending.

Traditionally, most of our connectors have sent very tree-like data, with only simple objects like RenderMaterial being referenced by multiple other objects (still conforming to a directed acyclic topology). Thus, we’ve optimised our serializer for trees and accepted the slight inefficiency of re-serializing a couple of objects multiple times.

There are definitely some optimisations we can make for non-tree structured data such as yours.
We are also seeing a need to send some non-tree structures from our other connectors:
instances, networks, structural results, etc.

I would be a fan of us exploring the optimisations we can do for all these cases, both in our Python and C# SDKs. I think you’re on the right lines with some sort of internal cache for the serializer. Though we’d need to weigh up how this affects the serialization performance of the other (more tree-like) data that makes up the majority of the data exchanged through Speckle currently.


Thanks @Jedd!

Happy to hear that you’re probably able to ship a small fix for class creation at runtime. Our use of Speckle largely revolves around this functionality; it enables us to send any of our own classes to Speckle.


I follow your reasoning on the structure of the data. Our data is definitely more nested than the case you’re describing. Our results in particular store quite a few references to other data (analysis steps, mesh, etc.) in order to retain the full meaning of the stored results. We also found that these results made it slow, and without the proposed fix impossible, to send our data. However, I do think that our data mostly or completely complies with the directed acyclic topology. It’s strongly nested, but there are no strange mutual references or other problematic patterns, as far as I know.

Great that you can have a look into it, as an efficiency increase is definitely needed. We’re currently setting up a test model with an extensive set of results that we can share with you, as we cannot share real project data. I will push it to the stream I’ve already invited you to and send you a message once it’s there. Thanks in advance!

Hi @Jedd,

Sorry for the late reply. Took a bit of time to properly set up and run a model.
Guess you were busy preparing SpeckleCon anyway :stuck_out_tongue:

Anyway, I’ve pushed some example data to a stream I shared with you:
Speckle commit with example data

It’s a structural model containing results of an extensive structural analysis. For the structural objects (e.g. beams, columns, floors, loads, supports), we use native Speckle objects where possible. For objects that can’t be mapped so easily to native Speckle objects (e.g. results, connections), we dynamically create custom objects. Hope this helps!

To come back to the original issues: the pydantic bug prevents us from upgrading to a newer specklepy version, which is a major problem. The serialization issue is also significant for medium- to large-sized models, especially with results included; we can at least send some smaller models, but it remains a big limitation for us.

Hey @Rob

we took a look at the issues you’ve raised. The serialization issue is inflated by the structure of your object model, where the objects form a graph that potentially references the same object multiple times. This is not in line with how we usually try to structure our commit objects, where we aim for tree structures for the sake of simplicity wherever possible. As you know, your structure is also supported and possible, but it’s not on our current hot path, so it’s not optimized at all. We’ve picked up the conversation to discuss this issue, but we need to take a look at the other SDKs too when implementing new behavior. So we’re not planning a quick fix, but we’ll keep the topic open and see how we can implement something.

For the dynamic class creation issue we think there is an even easier fix than your suggestion:
we just don’t pass the kwargs down to the super call, since we know the super of _RegisteringBase is object itself, which doesn’t take any arguments on its __init_subclass__ method call. So here’s a PR that does just that: fix: remove unnecessary kwargs from init subclass call chain by gjedlicska · Pull Request #323 · specklesystems/specklepy · GitHub

Your dynamic class creation example, however, raised a question for us: what is the reason for using pydantic’s dynamic class creation? It mixes the Speckle Base and pydantic’s BaseModel behavior, making your dynamically generated pydantic class inherit from a regular Python class, the Speckle Base. While things seem to work after our fix (due to Python being super flexible), there can be some dragons hiding in the corners.
With the exception of runtime validation, you can achieve the same result with the Python type() function, like so: Built-in Functions — Python 3.12.0 documentation
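For example, something roughly like this should be equivalent to your create_model() snippet, minus the validation (a sketch against your example above, not something we’ve battle tested):

    from specklepy.objects import Base

    # Rough equivalent of the pydantic.create_model() call above, using the
    # built-in type() function. Keyword arguments after the namespace dict are
    # forwarded to __init_subclass__, just like class keywords in a normal
    # class statement.
    TestClass = type(
        "TestClass",
        (Base,),
        {
            "__annotations__": {"test": str},  # declare the attribute's type
            "test": None,                      # and its default value
        },
        detachable={"test"},
    )

    test_object = TestClass()
    test_object.test = "test"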


Hi @gjedlicska,

Thanks for looking into our data, it’s much appreciated!


First of all, your fix to allow dynamic class creation works and would enable us to upgrade to the latest version of specklepy, so we’re looking forward to having that released : )

I believe we use pydantic for class creation because your Base model originally inherited from pydantic’s BaseModel. At that point, pydantic was needed to create classes dynamically. It also enforces type checking, which is a nice addition imo. I discussed this a few years ago with Izzy, see the discussion here: 🐍 PySpeckle 2.0 is starting to take shape...be an early tester! - #6 by Rob

So essentially it evolved into using pydantic; it’s not so much a deliberate choice. However, the type checking it performs is definitely an advantage worth keeping, though it’s good to know there are some caveats. We can transition to using type() to generate the classes if we bump into other issues.


Regarding serialization, I can definitely understand that our data structure is far more nested and cross-referenced than you generally intend, and I agree that this inflates the issue to a large extent. So it makes total sense that you’re not optimizing serialization specifically for such nested data structures. However, I do think that the root cause is the redundant, repeated serialization, which is not just an issue for our data structure: data structures that adhere more to your intended structure will also see some delays, though of course less severe than ours. Therefore I think the proposed change to cache already serialized objects would add value for any data structure. And in fact, you already have the same mechanism implemented for deserialization.

Altogether, I see it not so much as an update tailored to our specific data structure, but as an update that could benefit all users. So I hope you’re willing to consider implementing some caching mechanism. It would be nice to hear your opinion on it!

Hey @Rob,

on the serialization issue, I completely agree with you that this will benefit more than just your use case, so we’re discussing with @Jedd when we can put it on our list of improvements. We need to make sure our SDKs are aligned on this behavior, so that we don’t increase the (already too big) discrepancy between our different toolkits, so we just need a bit of coordination on this.

As always, thanks for the constructive feedback :slight_smile:


Okay, all clear! Thanks for discussing :slight_smile:

Hi @gjedlicska,

Will the hotfix you mentioned be released soon? Our company’s Speckle server was moved to the latest version, but we can’t update our own Python API to the latest specklepy version as long as the proposed fix is not included. This of course has several disadvantages, like an outdated StreamWrapper that can’t deal with the updated URLs from your new viewer. So we’re hoping you can make a quick release containing this fix.

Hey @Rob

I just published a fix for you; it should land on PyPI with version 2.17.17.

Hope this fixes your issue


Perfect, thanks @gjedlicska!