Back to blog

Dev Diary #15 - DuckDB and Independent Models

2024-08-09

This month we have mostly been working on behind-the-scenes improvements like telemetry, refactoring, and performance. However, we have also made some big changes to the dataset explorer and are working on making models independent of datasets.

DuckDB Explorer

Exploring datasets is an important part of checking the quality of the data and ensuring any model trained on it will be useful. Previously, datasets were around 3k-6k rows and could be easily sent as a big JSON object to the frontend and then rendered in a table. However, as we have started to generate larger datasets (20k+ rows), this approach has become problematic.

We really liked the speed of local frontend exploration and didn't want to lose that by turning every query or filter into a backend request. So rather than move everything to the backend, our solution was to shrink the amount of data we send to the frontend and run more efficient queries on it once it arrives.

To improve load and query times we now send the dataset parquet file directly to the frontend and load it into DuckDB. There is some overhead in loading DuckDB, but this is a one-time cost, and it's more important to us that the user experience is fast and responsive once the data is loaded.

With DuckDB you can now run arbitrary SQL queries on the dataset and get better search results using DuckDB full-text search. This is a big improvement over the previous system and we're excited to see how it helps you explore datasets.

Independent Models

An exciting upcoming change that we are making in response to internal feedback is to decouple models from datasets. Currently, models are trained on a specific dataset version, and only one kind of model can be trained on a dataset version at a time. This is limiting in a few ways:

  • Comparing base models on the same dataset version is difficult
  • Models can't be trained on multiple datasets (often needed for more complex use-cases where multiple types of synthetic data are needed)
  • It is difficult to design diverse dataset versions for a single model

To address these issues, we are making models independent of datasets. This means a model can be trained on any number of dataset versions, rather than being tied to a single one.

This change should make it possible to create much more complex models with Glaive.

Changelog

  • Re-implemented the dataset explorer using DuckDB
  • Redesigned and improved the tutorial
  • Improved clustering quality
  • Better performance for all data generation tasks
  • Fixed an incorrect parameter issue with our OpenAPI spec
  • Added Llama 3.1 as a base model
  • Download counts now available for published datasets
  • Updated pricing and free tier
  • Billing usage and plan details now available in app
  • Actions that cause usage now indicate this in the UI
  • Versions can now have descriptions
  • Mobile improvements on main site
  • Numerous bug fixes