Sending large matrices from py -> speckle -> gh

Hey all,
I ran into some issues with chunking/sending/receiving larger files.

  • Objective:
    Receive large files/data in Grasshopper from a commit that is assembled from referenced objects

  • Issue:
    After a lot of trial and error I managed to send large matrices (numbers and metadata, ca. 30 MB each) to Speckle by:

  1. manually chunking the matrices
  2. sending individual chunks of rows to Speckle
  3. collecting and referencing the object IDs of the chunks on a “container” Speckle object (representing the complete matrix)
  4. referencing the container objects on another container object that holds references to all matrices and their corresponding chunked rows
  5. sending this container as a commit to Speckle

In the web viewer this works: I see not only the reference IDs but the actual data (there is an icon indicating when data is referenced). This works as intended. However, if I fetch that branch and commit in Grasshopper, I only get the referenced object IDs.

My goal is to fetch the complete object/data and parse it back into its original form using Python in Grasshopper.

  • Question:
    Is there perhaps a smarter and simpler way to send and receive larger objects from gh/python to gh/python?

  • Example:
    Attached are two images of how the commit data looks in Grasshopper and in Speckle's web viewer.


  • Python code for chunking

from specklepy.api import operations
from specklepy.objects import Base
from specklepy.transports.server import ServerTransport


def send_row_bundle(rows, indices, transport):
    # Wrap a chunk of rows (plus their original indices) in a Base object and send it
    bundle_object = Base()
    bundle_object.rows = rows
    bundle_object.indices = indices
    bundle_id = operations.send(base=bundle_object, transports=[transport])
    return bundle_id

def send_matrix(matrix_df, transport, rows_per_chunk):
    matrix_object = Base(metaData="Some metadata")
    batch_index = 0  # Maintain a separate counter for batch indexing

    # Bundle rows together
    rows = []
    indices = []
    for index, row in matrix_df.iterrows():
        rows.append([round(r,4) for r in row.tolist()])
        indices.append(str(index))
        if len(rows) == rows_per_chunk:
            bundle_id = send_row_bundle(rows, indices, transport)
            # Set the reference to the bundle in the matrix object using setattr
            setattr(matrix_object, f"@batch_{batch_index}", {"referencedId": bundle_id})
            rows, indices = [], []  # Reset for the next bundle
            batch_index += 1  # Increment the batch index
            print(f"{rows_per_chunk} rows have been sent")

    # Send the last bundle if it's not empty
    if rows:
        bundle_id = send_row_bundle(rows, indices, transport)
        setattr(matrix_object, f"@batch_{batch_index}", {"referencedId": bundle_id})

    # Send the matrix object to Speckle
    matrix_object_id = operations.send(base=matrix_object, transports=[transport])
    return matrix_object_id

# Main function to send all matrices and create a commit
def send_matrices_and_create_commit(matrices, client, stream_id, branch_name, commit_message, rows_per_chunk, containerMetadata):
    transport = ServerTransport(client=client, stream_id=stream_id)
    matrix_ids = {}

    # Send each matrix in chunks and store its object ID
    for k, df in matrices.items():
        matrix_ids[k] = send_matrix(df, transport, rows_per_chunk)
        print("object: " + k + " has been sent")

    # Container object that will hold references to all the matrix objects
    container_object = Base()

    for k, v in containerMetadata.items():
        container_object[k] = v

    # Reference each matrix object on the container by its ID.
    # You might need to adjust this based on how your Speckle server expects to receive references.
    for matrix_name, matrix_id in matrix_ids.items():
        print("matrix id:", matrix_id)
        setattr(container_object, matrix_name, {"referencedId": matrix_id})


    # Send the container object
    container_id = operations.send(base=container_object, transports=[transport])

    # use the container_id when creating the commit
    commit_id = client.commit.create(
        stream_id=stream_id,
        object_id=container_id,  # Use the container's ID here
        branch_name=branch_name,
        message=commit_message,
      )

In your pythonic Send routine, have you tried the dynamic chunking option?

You can declare a property as detachable and chunkable with a prefix like @(1000), or in your case @(rows_per_chunk).

The out-of-the-box Speckle serializer will then manage a lot of that chunking for you.
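
For illustration, a minimal sketch of that pattern with specklepy might look like the following (a MemoryTransport stands in for your ServerTransport so the snippet is self-contained, and the data is dummy; swap in your own transport and dataframe):

# A rough sketch of dynamic chunking, not a definitive implementation
from specklepy.api import operations
from specklepy.objects import Base
from specklepy.transports.memory import MemoryTransport

transport = MemoryTransport()  # stand-in for ServerTransport(client=client, stream_id=stream_id)

matrix_object = Base(metaData="Some metadata")

# Dummy data standing in for one of your dataframes
rows = [[round(v * 0.1, 4) for v in range(10)] for _ in range(5000)]

# "@" marks the dynamic property as detachable; "(1000)" asks the serializer
# to split the list into chunks of 1000 items on send
matrix_object["@(1000)rows"] = rows

object_id = operations.send(base=matrix_object, transports=[transport])
print("sent matrix object:", object_id)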

The exact strategy for structuring your objects can be nuanced, but it will depend mostly on your expectations for onward use. I presume in an ideal world your payload has no batches?

Could we describe a little of the What/Why before getting to the How? I’m asking as I’m hesitant to go too far and create an overly detailed example of how you could achieve things.

Hello, thank you for the answer.

Yes, in an ideal world I wouldn't need to chunk the payload. Let me describe the background in a bit more detail. For a project we compute, let's say, 5 large tables (dataframes; distance and trip matrices, 10-30 MB when saved as .csv) with a Python script. Now I want to store all of these tables on a Speckle branch, ideally grouped together in a structure like this:

{
"matrix1":[
  {"row":1, "attr1":34, "attr2":0},
  {"row":1, "attr1":34, "attr2":0},
 ] ,
"matrix2":[
  {"row":1, "attr1":24, "attr2":0},
  {"row":1, "attr1":24, "attr2":0},
 ] }

Here “matrix1” and “matrix2” are Speckle objects holding a list of Speckle objects with the row data.
Ideally I would like to be able to fetch all matrices and rows with the Grasshopper Speckle connector directly, then iterate through the rows and assemble a matrix object.

Let's take the example of two of those tables/dataframes. Assuming I have already split each into a list of dictionaries holding the data for each row, what would the syntax look like?

Update:
I tried a bit more; the brackets in @(rows_per_chunk) were the hint I was missing in my first attempt. The Python side now follows the pattern below. When received in Grasshopper, the “chunks” are automatically parsed back into one object, which is great.

# Initialize transport for the server communication
transport = ServerTransport(client=client, stream_id=stream_id)

# Create a container object to hold all data
container_object = Base()

# iterate through all matrices
for key, df in matrices.items():
    # Initialize a placeholder for each matrix within the container
    container_object[key] = Base()

    # Extract rows as plain lists
    rows = [row.tolist() for _, row in df.iterrows()]
    container_object[key]["@(300)rows"] = rows

# Send the prepared container object to the server
container_object_id = operations.send(base=container_object, transports=[transport])

# Create a commit with the sent container object on the specified branch with a commit message
commit_id = client.commit.create(stream_id=stream_id, object_id=container_object_id, branch_name=branch_name, message=commit_message)
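
To sanity-check the round trip from Python as well, here is a minimal receive sketch (it assumes the client, stream_id and container_object_id from the snippet above); receiving resolves the detached and chunked children back into one object tree:

from specklepy.api import operations
from specklepy.transports.server import ServerTransport

transport = ServerTransport(client=client, stream_id=stream_id)

# operations.receive pulls the container and resolves referenced/chunked children
received = operations.receive(obj_id=container_object_id, remote_transport=transport)

# Each dynamic member of the container is one matrix object
for name in received.get_dynamic_member_names():
    matrix = received[name]
    for member in matrix.get_dynamic_member_names():
        print(name, member, "->", len(matrix[member]), "items")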


While this “works”, chunking was intended for primitives; that said, your matrices make sense to me, particularly if it unblocks you.

At some point, as described in the docs, detaching too much carries a performance penalty of a different kind, but with custom data types such as yours it makes sense to experiment.

i.e. is it better to have:

{ "matrix1": [
    { "row":1, "attr1":34, "attr2":0 },
    { "row":1, "attr1":34, "attr2":0 },
  ],
  "matrix2":[
    { "row":1, "attr1":24, "attr2":0 },
    { "row":1, "attr1":24, "attr2":0 },
  ]
}

or

{ "@matrices": [
  { "name": "matrix1",
    "@(300)rows": [
      { "row":1, "attr1":34, "attr2":0 },
      { "row":1, "attr1":34, "attr2":0 },
    ]
  },
  { "name": "matrix2", 
    "@(300)rows": [
      { "row":1, "attr1":24, "attr2":0 },
      { "row":1, "attr1":24, "attr2":0 },
    ]
  }
]

If you only have two matrices, then the former approach (as in your code) may be fine.

For onward manipulation, I'd tend toward the latter if there is an arbitrary number of matrices, and possibly detach these as well. As I say, chunking was originally for primitives but can work for your use case. @dimitrie may descend on me like a tonne of bricks for this.
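
For illustration, a rough sketch of that latter structure with specklepy could look like this (dummy dataframes and a MemoryTransport keep it self-contained; in practice you would use your own matrices dict and the ServerTransport from your snippets):

import pandas as pd
from specklepy.api import operations
from specklepy.objects import Base
from specklepy.transports.memory import MemoryTransport

# Dummy stand-ins for your dataframes and server transport
matrices = {
    "matrix1": pd.DataFrame([[34, 0], [34, 0]], columns=["attr1", "attr2"]),
    "matrix2": pd.DataFrame([[24, 0], [24, 0]], columns=["attr1", "attr2"]),
}
transport = MemoryTransport()

container_object = Base()
matrix_list = []

for name, df in matrices.items():
    matrix_object = Base()
    matrix_object.name = name
    # Detachable list of row lists, chunked into groups of 300 on send
    matrix_object["@(300)rows"] = [row.tolist() for _, row in df.iterrows()]
    matrix_list.append(matrix_object)

# "@matrices" detaches the list of matrix objects from the container object
container_object["@matrices"] = matrix_list

container_object_id = operations.send(base=container_object, transports=[transport])
print("container object id:", container_object_id)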

In case you’ve not seen it already, take a look at Decomposition API | Speckle Docs; it describes the opportunities and also the potential footguns when structuring data.