Graphing Your Speckle Contributions

With a Speckle in-person retreat over, I thought I’d sun myself in glorious Athens for a few days. The snow and freezing rain had other ideas. So my mind wandered, as I did through the gridded streets, around to Speckle Server metadata.

We have seen a few ways you can access your Speckle data outside of 3D authoring apps: the Power BI and Excel apps, the GraphQL and REST APIs, or, for the true data nerds knee-deep in the data, the C# and specklepy SDKs.

But in addition to your data, there is also data about the data – how meta!

In this simple Python or Jupyter notebook example, we will use the Speckle GraphQL API to get your server contributions and plot them in a nice graph akin to the GitHub contributions graph. This could form part of a Project or Stream activity dashboard.

Firstly some packages we’ll need:

%pip install graphqlclient
%pip install pandas
%pip install calmap

With all that boilerplate done with, we can now get our data from the API. We’ll use the userSearch query to get our user object and then the timeline query to get our contributions.

First, we can set up our client. This is a super easy method and doesn’t need us to mess around with an HTTP client or anything. We need to pass in the URL of the server we want to query and a Personal Access Token (PAT) that we can get from our user profile page.

NB: As ever, don’t share your PATs with anyone - this includes publishing them on GitHub!

from graphqlclient import GraphQLClient
client = GraphQLClient('') # notice the trailing endpoint
client.inject_token('58...long bit of string...f6')

Because I’m a special snowflake, I have multiple accounts on the same server, so I might need to specify which one I want to get the data for. I can do this by passing the user parameter to the userSearch query.

Or, brute force it by passing in all the usernames I want to get data for. I could then filter the results to get the one I want.

For those unfamiliar with GraphQL, it allows us to specify what data we want a server to respond with when we query it. It could still be a lot, but it puts you in control of the data shape coming back.

import json

emails = ['', 'jonathon@speckle.sys__s']
data = []

for email in emails:
    result = client.execute(f'''{{
      userSearch(query: "{email}") {{
        items {{ id
          timeline(limit: 999999, after: "2022-01-01T00:00:00Z") {{
            items {{ time userId streamId actionType }}
          }} }} }} }}''')

Get the JSON of the result and add it to the data list. Some actions in a user’s timeline are the actions of others. We should filter those out too.

    loaded = json.loads(result)
    filtered_data = [item for item in loaded['data']['userSearch']['items'][0]['timeline']['items']
         if item['userId'] == loaded['data']['userSearch']['items'][0]['id']]
    data = data + filtered_data

With the raw data grabbed, a helpful pandas DataFrame is created for us to play with.

As the comments in the code show, all those individual events are a bit too granular to plot. We are interested in a daily count, so we’ll group the events by year, month and day and count them, mimicking the GitHub contributions graph.

The count column is the number of contributions on that day. In this case, any action on the Speckle server counts as a contribution. Because the query above includes the actionType field, we could filter the results to include only specific actions. For example, if we only wanted to count commit_created events, we could. I have left actionType in the server query for this reason.
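For instance, a post-hoc filter over the items we already fetched might look like this (a sketch with hypothetical sample values - the exact actionType strings depend on your server's data):

```python
# Sample timeline items shaped like the query result above (hypothetical values)
data = [
    {'actionType': 'commit_created', 'userId': 'u1'},
    {'actionType': 'stream_create', 'userId': 'u1'},
    {'actionType': 'commit_created', 'userId': 'u1'},
]

# Keep only commit creation events
commit_events = [item for item in data if item.get('actionType') == 'commit_created']
print(f"{len(commit_events)} of {len(data)} events are new commits")
```

Using `.get()` means items missing an actionType field are simply skipped rather than raising a KeyError.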

import pandas as pd
import datetime as dt

# Load it into a pandas dataframe
df = pd.DataFrame(data)

# Convert timestamp to datetime and extract year, month, and day
df['time'] = pd.to_datetime(df['time'])
df['year'] = df['time'].dt.year
df['month'] = df['time'].dt.month
df['day'] = df['time'].dt.day

# Group by year, month, and day, and count the number of events
df_grouped = df.groupby(['year', 'month', 'day']).size().reset_index(name='count')

I know a previous forum demo involving pesky Pokemon skews my results by a whopping 2000 contributions, so I’ll clip the data to the 90th percentile to get a more typical graph. You may not need this, or you may find a different algorithm more helpful for massaging the result.

You can see where the phrase “Only trust the statistics you made up yourself” comes from!

threshold = df_grouped['count'].quantile(0.90)
df_grouped['count'] = df_grouped['count'].clip(upper=threshold)
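One alternative to clipping (a sketch, not from the original post, with made-up counts) is a log transform, which compresses outlier days while preserving the ordering of busy days:

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts with one outlier day (e.g. the Pokemon demo)
df_grouped = pd.DataFrame({'count': [1, 3, 5, 2, 2000]})

# log1p compresses the outlier without flattening ordinary days,
# and log1p(0) == 0, so empty days stay at zero
df_grouped['scaled'] = np.log1p(df_grouped['count'])
```

Unlike clipping, this keeps the outlier distinguishable from merely busy days, at the cost of a less literal colour scale.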

The previous grouping smooshed our timestamps out of existence - we need that to map to a calendar later. So we’ll add it back in. It will be more coarse than the original data, but that’s ok.

# Create a timestamp column from the year, month, and day values
df_grouped['timestamp'] = pd.to_datetime(df_grouped[['year', 'month', 'day']])
df_grouped = df_grouped.set_index('timestamp')

For the graph, we’ll use the calmap package. It is a little crusty and could benefit from a bit of TLC to bring it up to date with modern pandas, but it does the job.

Open source loves contributions. If you want to help out, you can find the repo here.

The graph will need to know what value to use to grade our contributions, so we’ll use a Pandas Series command on the grouped data. The calculation for the relative colouring is handled by the calmap package.

events = pd.Series(df_grouped['count'])

GitHub’s all-too-familiar chart uses YellowGreen; we’ll plot ours in lovely Speckle Blue instead (all varieties of it):

import calmap

# Plot the data using `calmap`; a Blues colormap stands in for Speckle Blue
plot = calmap.calendarplot(
    events,
    cmap='Blues',
    dayticks=[0, 2, 4, 6],
    fig_kws=dict(figsize=(20, 10)),
)

Only the events argument is strictly mandatory here; designers gonna design.

So, for the last two years, these are my contributions:

These could be further partitioned by any of the four data points I included in my query: time, userId, streamId and actionType.
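A partition by action type, say, is a one-line groupby (a sketch with hypothetical event rows - the actionType values are illustrative):

```python
import pandas as pd

# Hypothetical events carrying the four queried fields
df = pd.DataFrame([
    {'time': '2022-03-01', 'userId': 'u1', 'streamId': 's1', 'actionType': 'commit_created'},
    {'time': '2022-03-01', 'userId': 'u1', 'streamId': 's2', 'actionType': 'stream_create'},
    {'time': '2022-03-02', 'userId': 'u1', 'streamId': 's1', 'actionType': 'commit_created'},
])

# One contribution count per action type, ready for separate calendar plots
by_action = df.groupby('actionType').size()
```

The same pattern works for streamId, giving a per-stream activity graph.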

Others on the Speckle Team may scoff that they fill the chart :D, but if nothing else, testing and releasing a Connector certainly ups one’s game :smiley:


So cool. High quality content Jonathon, thanks. Hahahah!


This should get js-ified and make its way in the user profile page in FE2 :star_struck:


When I first posted the example above it included this line:

…which was problematic for a couple of the gang :blush:

In the interest of brevity, I was greedily grabbing an entire bunch of timelines in one go. (Incidentally, if you make more than 1M commits in a year, get in touch, you are a hero.)

Nevertheless, this is not good practice, and like many of the other endpoints the Speckle server has, it is better to request paginated resources, i.e. get the first n results and keep requesting n more until there are none left. This makes your code less fragile in the face of server pauses and connection drops, and protects our servers from almighty loads.

My simple revision is a straightforward implementation of receiving paginated resources.

The revised query object inside the emails loop:

page_size = 100  # number of timeline items to fetch per request

for email in emails:
    # Set the initial cursor to null
    cursor = "null"
    while True:
        # Construct the GraphQL query with the `first` and `after` arguments
        query = f'''{{
          userSearch(query: "{email}") {{
            items {{ id
              timeline(first: {page_size}, after: {cursor}) {{
                cursor
                pageInfo {{ hasNextPage }}
                items {{ time userId streamId actionType }}
              }} }} }} }}'''

        # Execute the query and load the JSON result
        result = client.execute(query)
        loaded = json.loads(result)

        # .... other filters and qualifications ....

        # If there are no more pages, break out of the loop
        timeline = loaded['data']['userSearch']['items'][0]['timeline']
        if not timeline.get('pageInfo', {}).get('hasNextPage', False):
            break

        # Otherwise, set the cursor to the end of the current page and continue
        cursor = f'"{timeline["cursor"]}"'

This is a better practice than the original, so you can all now implement this without fear of breaking the server. You are welcome @fabians .