Back to blog

Dev Diary #15 - DuckDB and Independent Models

2024-08-09

This month we have mostly been working on behind-the-scenes improvements like telemetry, refactoring, and performance. However, we have also made some big changes to the dataset explorer and are working on making models independent of datasets.

DuckDB Explorer

Exploring datasets is an important part of checking the quality of the data and ensuring any model trained on it will be useful. Previously, datasets were around 3k-6k rows and could be easily sent as a big JSON object to the frontend and then rendered in a table. However, as we have started to generate larger datasets (20k+ rows), this approach has become problematic.

We really liked the speed of local frontend exploration and didn't want to lose that by turning every query or filter into a backend request. So rather than move everything to the backend, our solution was to shrink the amount of data we send to the frontend and run more efficient queries on it once it arrives.

To improve load and query times we now send the dataset parquet file directly to the frontend and load it into DuckDB. There is some overhead in loading DuckDB, but this is a one-time cost, and it's more important to us that the user experience is fast and responsive once the data is loaded.

With DuckDB you can now run arbitrary SQL queries on the dataset and get better search results using DuckDB full-text search. This is a big improvement over the previous system and we're excited to see how it helps you explore datasets.

Independent Models

An exciting upcoming change that we are making in response to internal feedback is to decouple models from datasets. Currently, models are trained on a specific dataset version, and only one kind of model can be trained on a dataset version at a time. This is limiting in a few ways:

  • Comparing base models on the same dataset version is difficult
  • Models can't be trained on multiple datasets (often needed for more complex use-cases where multiple types of synthetic data are needed)
  • It is difficult to design diverse dataset versions for a single model

To address these issues, we are making models independent of datasets. This means a model can be trained on any number of dataset versions, rather than being tied to a single one.

This change should make it possible to create much more complex models with Glaive.

Changelog

  • Re-implemented the dataset explorer using DuckDB
  • Redesigned and improved the tutorial
  • Improved clustering quality
  • Better performance for all data generation tasks
  • Fixed an incorrect parameter issue with our OpenAPI spec
  • Added Llama 3.1 as a base model
  • Download counts now available for published datasets
  • Updated pricing and free tier
  • Billing usage and plan details now available in app
  • Actions that cause usage now indicate this in the UI
  • Versions can now have descriptions
  • Mobile improvements on main site
  • Numerous bug fixes