Dev Diary #4 - Adding Complexity Controls and External Data Sources

2024-03-29

Last week, we talked about the complexity of a dataset and how some use cases require higher complexity than others. This week, we shipped complexity controls, along with the ability to edit the knowledge graph by adding or removing nodes.

Complexity

We shipped the ability to control the complexity of a dataset, giving users more control over how complex the reasoning in each sample is. This matters most for logical or reasoning-heavy use cases (code, math, etc.), where you want the dataset to contain complex examples that teach a model how to navigate and reason through them. Right now, complexity is defined at four discrete levels, and you can preview what each level means for your use case.
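
To make the idea concrete, here is a rough sketch of what selecting a complexity level could look like in code. Everything here (the Complexity enum, the request shape) is our own illustration, not the platform's actual API.

```python
# A rough sketch of what selecting a complexity level could look like.
# The Complexity enum and request shape are illustrative assumptions,
# not the platform's actual API.
from enum import Enum

class Complexity(Enum):
    BASIC = 1
    INTERMEDIATE = 2
    ADVANCED = 3
    EXPERT = 4

def build_generation_request(use_case: str, complexity: Complexity) -> dict:
    """Assemble a generation request with an explicit complexity level."""
    return {
        "use_case": use_case,
        # Higher levels ask the generator for deeper, multi-step reasoning.
        "complexity": complexity.name.lower(),
    }

print(build_generation_request("code review assistant", Complexity.ADVANCED))
```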

Editing the knowledge graph

The knowledge graph is a tool that helps users visualize what knowledge the dataset covers and, by extension, what goes into the model. Until now, users could only view the knowledge graph and iterate on it indirectly by editing the description of the use case. This week we shipped the ability to edit the graph directly by adding or removing nodes. This lets users control what knowledge the dataset covers, remove unwanted data, and add topics that we might otherwise miss.
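
As a rough mental model, node-level edits look something like the sketch below, which uses networkx as a stand-in for the platform's internal graph structure; the node names are made up for illustration.

```python
# Illustrative only: networkx stands in for the platform's internal graph,
# and the node names are made up.
import networkx as nx

graph = nx.DiGraph()
graph.add_edge("REST APIs", "authentication")
graph.add_edge("REST APIs", "rate limiting")
graph.add_edge("REST APIs", "GraphQL")  # suppose this topic is out of scope

# Remove a node the dataset should not cover...
graph.remove_node("GraphQL")
# ...and add a topic the automatic extraction missed.
graph.add_edge("REST APIs", "pagination")

print(sorted(graph.nodes))
# ['REST APIs', 'authentication', 'pagination', 'rate limiting']
```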

Adding external data sources

We keep seeing requests from users who want to ground the synthetic dataset we generate with seed data from external sources such as websites or internal knowledge bases. This matters for use cases where no information on the topic exists outside your internal knowledge base, so we can't accurately generate synthetic data for it, or where you want to target very specific data for your model, like the documentation for a particular set of APIs.

This week we built out support for external data sources in our data generation pipeline, and you can expect this feature to ship and become available to you next week.
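
To illustrate what grounding on an external source means, here is a minimal fetch-and-chunk sketch. The chunking scheme and the use of requests are assumptions made for illustration, not a description of our actual ingestion pipeline.

```python
# A minimal fetch-and-chunk sketch; the platform's actual ingestion
# pipeline may work differently.
import requests

def fetch_seed_chunks(url: str, chunk_size: int = 1000) -> list[str]:
    """Download a page and split its raw text into fixed-size seed chunks."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    text = response.text
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Each chunk can then be attached to a generation request as grounding data.
chunks = fetch_seed_chunks("https://example.com/docs/api")
```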

Better improvement workflow for models

Over the past few weeks, we have been working hard to improve the process of building a dataset and model on our platform. Now we are also starting to focus on a better flow for improving your datasets, and by extension your models, based on what you learn from measurements.

We built an MVP of an improvement flow that lets you update a dataset, and then train a model on it, in the following ways:

  1. Create a new dataset from scratch (already supported today).
  2. Add knowledge to the dataset, useful when the model performs badly on certain inputs because the relevant information is missing from the dataset.
  3. Edit the style schema, and we apply the changes to every sample in the existing dataset.
  4. Remove examples that meet a certain condition, which helps you filter out unwanted noise (a rough sketch of this and the next step follows the list).
  5. Add samples of a certain type to balance the dataset. For example, if you see that your model performs worse on a RAG use case when there are more than 5 document sources in the input, you can add samples of exactly that type to improve the model in that area.
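
As a rough illustration of steps 4 and 5, the sketch below filters samples by a condition and then rebalances an underrepresented type. The sample fields and thresholds are invented for the example and are not our actual data format.

```python
# Sketch of steps 4 and 5 on a dataset held as a list of dicts.
# The 'num_sources' field and the thresholds are assumptions for illustration.
samples = [
    {"text": "short RAG answer", "num_sources": 2},
    {"text": "noisy sample", "num_sources": 0},
    {"text": "long multi-doc answer", "num_sources": 7},
]

# Step 4: remove examples that meet a condition (here: no grounding sources).
cleaned = [s for s in samples if s["num_sources"] > 0]

# Step 5: rebalance by adding samples of an underrepresented type,
# e.g. inputs with more than 5 document sources.
def needs_more_multi_doc(data: list[dict], threshold: int = 5) -> bool:
    multi_doc = sum(1 for s in data if s["num_sources"] > threshold)
    return multi_doc / len(data) < 0.25  # target: at least 25% multi-doc

if needs_more_multi_doc(cleaned):
    # In practice this would call the generation pipeline with a constraint
    # like num_sources > 5; a literal append stands in for that here.
    cleaned.append({"text": "generated multi-doc sample", "num_sources": 6})
```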

We expect to ship this feature within the next week.

Changelog

Schema Builder

  • Added complexity control to the schema builder.
  • Several bug fixes related to schema parsing.

Knowledge Graph

  • Knowledge graph can now be edited by adding or removing nodes.
  • Keywords can now be edited inline.
  • The graph is now laid out as a horizontal tree.

Note: You will need to create a new version and regenerate the graph to see these changes on existing datasets.

Other

  • We shipped a new and updated landing page; check it out here.
  • Data previews are now 5x faster.

Fixes

  • Some confusing labels in the dataset designer have been removed.
  • Previews no longer require a knowledge graph to be generated.