Visualizing Data → Process → Data → ... → Data → Process → Data Pipelines

Setting the stage

AEC projects can be mapped out as “pipelines”:
Input data → process → data → … → data → process → Deliverables

  • A pipeline starts with input data (project requirements, drawings, GIS data, Ground Investigation data, etc.)
  • A pipeline ends with the deliverables (BIMs, drawings, reports, presentations, etc.)
  • Processes can have multiple Datasets as inputs and/or outputs.
  • A pipeline CANNOT start or end with a Process.
  • Data CANNOT directly flow into Data.
  • Processes CANNOT flow into other Processes.
  • The whole project can be seen as one big process, but this huge process can be mapped out into smaller and smaller
    data → process → data → … → data → process → data
    pipelines (see the plain-Python sketch after this list). In software engineering this is called refactoring.
  • Of course, some parts of a pipeline are iterative, and others might be “messy”. AEC projects don’t have “unidirectional data flow”, and never will, but that is fine.
  • Of course, we’re going to put all our datasets on Speckle :speckle:, which allows us to swap out cumbersome manual processes for scripts incrementally, one process at a time.
  • A script will then mature over the course of one or more projects, and will be split (i.e. refactored) into multiple smaller scripts such that each script serves only a single purpose (this is called the Single Responsibility Principle in software engineering).
  • Because our scripts each serve only a single purpose, they are more maintainable and become reusable on other projects.
  • Once a script is mature, it can become a Speckle automation.
  • Once we have a library of modular, single-purpose scripts on Speckle Automate, we can start chaining them into automated pipelines.
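
To make this concrete, here is a minimal sketch of one such small pipeline in plain Python. All of the names (Ground Investigation data, ground model, pile layout, piles_per_cap) are made up purely for illustration; the point is only the shape: every process is a single-purpose function, and everything flowing into or out of it is a dataset.

```python
# A minimal data -> process -> data -> process -> data pipeline.
# Every process is a single-purpose function; everything flowing into or out
# of a process is a dataset (plain dicts here; on a real project these would
# live on Speckle).

def build_ground_model(gi_data: dict) -> dict:
    """Process: interpret raw Ground Investigation data into a ground model."""
    return {"layers": sorted(gi_data["boreholes"], key=lambda bh: bh["depth"])}

def generate_pile_layout(ground_model: dict, requirements: dict) -> dict:
    """Process: combine two input datasets into a pile layout dataset."""
    return {"pile_count": len(ground_model["layers"]) * requirements["piles_per_cap"]}

# Input data: a pipeline always starts with data, never with a process
gi_data = {"boreholes": [{"id": "BH01", "depth": 12.0}, {"id": "BH02", "depth": 8.5}]}
requirements = {"piles_per_cap": 4}

# data -> process -> data -> process -> data (deliverable)
ground_model = build_ground_model(gi_data)
pile_layout = generate_pile_layout(ground_model, requirements)
print(pile_layout)  # {'pile_count': 8}
```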

Data engineering / science pipelines

In data engineering and data science it is common to talk about data processing pipelines. One open-source Python library I like for building reproducible, maintainable, and modular pipelines is kedro [2]:

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
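
For anyone who hasn’t used kedro, a pipeline definition looks roughly like this. This is only a sketch, reusing the made-up AEC names from the example above; a real kedro project would also register these datasets in a data catalog.

```python
from kedro.pipeline import node, pipeline

# The same two made-up single-purpose processes as in the sketch above
def build_ground_model(gi_data: dict) -> dict:
    return {"layers": gi_data["boreholes"]}

def generate_pile_layout(ground_model: dict, requirements: dict) -> dict:
    return {"pile_count": 4 * len(ground_model["layers"])}

# Each node wraps one process; its inputs and outputs are *named* datasets,
# which kedro wires together into a data -> process -> data graph.
aec_pipeline = pipeline([
    node(build_ground_model, inputs="gi_data", outputs="ground_model",
         name="build_ground_model"),
    node(generate_pile_layout, inputs=["ground_model", "requirements"],
         outputs="pile_layout", name="generate_pile_layout"),
])
```

Because inputs and outputs are referenced by dataset name rather than passed directly between functions, each node stays reusable on its own, and the whole definition can be drawn as a graph of datasets and processes.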

I especially like kedro because of kedro-viz, which is fantastic for visualizing such data processing pipelines. See this demo:

What do you think?

I think that it could be very useful to be able to visualize AEC project pipelines, and the parts of such pipelines that are automated with Speckle Automate.

I discussed this stuff with @KatherineC earlier this week at SpeckleCon, and thought it would be worth bringing to the attention of the Speckle Community and @Automatons.

Credit where credit is due

Apart from kedro, @RamonvanderHeijden, Evan Levelle and Martin Riese (2015) must also be credited for proposing this concept under the name “Building Information Generation” [1].

Sources:

  1. Van Der Heijden, R., Levelle, E., & Riese, M. (2015). Parametric building information generation for design and construction. In Computational Ecologies: Design in the Anthropocene – 35th Annual Conference of the Association for Computer Aided Design in Architecture (ACADIA) (pp. 417–429).
  2. Alam, S., Chan, N. L., Couto, L., Dada, Y., Danov, I., Datta, D., DeBold, T., Gundaniya, J., Honoré-Rougé, Y., Kaiser, S., Kanchwala, R., Katiyar, A., Pilla, R. K., Nguyen, H., Cano Rodríguez, J. L., Schwarzmann, J., Sorokin, D., Theisen, M., Zabłocki, M., & Brugman, S. (2024). Kedro (Version 0.19.9) [Computer software]. https://github.com/kedro-org/kedro

this deserves a more thorough response, but in short - yes, 1000%

“chaining functions” and visualizing your pipelines are very much in the mindshare, ty for the kedro ref. On a more sentimental note, I really hope that this is the place where we finally pull off a public library of these “modular, single-purpose scripts,” but we’re all well aware of the hurdles there. A damn good set of tools for doing this internally is a good step, imo, and would fill my heart a little
