Dev Diary #11 - Sources v2 and Logging API

2024-05-31

This week, our ML team was hard at work on a behind-the-scenes project that we hope to share more about soon. In the meantime, we shipped an overhaul of the sources system that adds a suite of new functionality, along with quality-of-life and UI changes that improve the overall custom source experience. Our logging API also now supports bulk uploads, which will enable future features for measuring model performance.

Sources

A few weeks ago, we introduced custom external sources (see Dev Diary #6). This week, we built out a view for individual sources, enabled custom source-only dataset generation, and added a new source type. Additionally, you can now specify sources on the initial version of a dataset, rather than needing to link them after the first version has already been created.

Source view

The new source view shows a source's configuration along with the list of datasets the source is linked to. We've also changed the "Delete" action to "Archive", which helps prevent accidental loss of work and makes it easier to look back at the sources a dataset used.

Source-only generation

Our normal dataset generation pipeline uses our own private common crawl content to generate datasets in tandem with any external sources you provide. With source-only generation, you can exclude our common crawl content and generate a dataset exclusively from your own sources. This is another big step towards the goal of generating datasets custom-tailored to any use case.

Web sources

While our existing source system allowed for file sources (.txt, .json) and S3 bucket sources, we now allow you to configure a list of URLs as an external source. The provided URLs will be scraped and used to generate your dataset. The web is your oyster!
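Since a web source is just a list of URLs, it can help to tidy that list before configuring it. The snippet below is a minimal sketch of that idea, not part of our product or API: `clean_url_list` is a hypothetical helper name, and the rules it applies (http/https only, exact-match de-duplication) are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit


def clean_url_list(urls):
    """Return the usable http(s) URLs from `urls`, de-duplicated, in order.

    Hypothetical helper for illustration only; the filtering rules here
    are assumptions, not requirements of the web source feature.
    """
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        try:
            parts = urlsplit(url)
        except ValueError:
            # urlsplit raises on some malformed inputs (e.g. bad IPv6 hosts).
            continue
        # Keep only absolute web URLs that name a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        # Drop exact duplicates while preserving the original order.
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned
```

Calling `clean_url_list` on a raw list pasted from a spreadsheet or text file would drop blank entries, non-web schemes, and repeats, leaving a list that is ready to paste into a source configuration.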

Bulk Logging

Over the coming weeks, one of our areas of focus will be enabling accurate, precise, and customizable measurement of how a model is performing. A key prerequisite for these measurement features is for you to supply us with large amounts of data from your Glaive models! Our logging API now supports bulk uploads, which should make this process much more efficient and convenient.
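To give a feel for what a bulk upload looks like on the client side, here is a minimal sketch of grouping log records into batches before sending them. Everything specific in it is an assumption made for illustration: the `{"logs": [...]}` payload shape, the default batch size, and the `send` callable (a stand-in for whatever HTTP call you use) are hypothetical and are not the actual logging API.

```python
import json
from itertools import islice


def iter_batches(records, batch_size):
    """Yield consecutive lists of at most `batch_size` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def upload_in_bulk(records, send, batch_size=100):
    """Serialize records in batches and pass each payload to `send`.

    `send` is a hypothetical callable supplied by the caller, for example
    a function that POSTs the JSON string to a bulk logging endpoint.
    The {"logs": [...]} payload shape is an assumption, not the real schema.
    Returns the number of records handed to `send`.
    """
    sent = 0
    for batch in iter_batches(records, batch_size):
        send(json.dumps({"logs": batch}))
        sent += len(batch)
    return sent
```

Passing the sender in as a parameter keeps the batching logic independent of any particular HTTP client, and makes it easy to try out with a plain list's `append` method before wiring up a real request.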

Changelog

  • Added source viewer.
  • Sources now linkable in dataset designer.
  • Added source-only option to dataset designer.
  • Added web sources.
  • Added bulk upload to logging API.
  • Fixed a bug related to model training retries.
  • Fixed several form input bugs.
  • Fixed a bug related to routing team source lists.