
Dev Diary #3 - Building Schemas to Control Your Datasets


Last week we showed off how we are going to let you design the perfect dataset for your use case, using a schema to describe what your generated data should look like. This week we cover the schema builder, explain how schemas help make better datasets, and share how we are thinking about complexity in datasets.

Schema builder

If you saw the horrible JSON last week, you can imagine it might not be the most intuitive thing to write, so we went ahead and built a UI for creating schemas. Hopefully this tool gives you everything you need to create both simple and complex schemas, for both inputs and outputs.

The best way to understand how it works is to use it, but here are a few highlights:

  • Create typed variables
  • Template those variables together in a description
  • Certain types have additional properties (e.g. set ranges for numbers, or token lengths for strings)
  • Use structs to format data as JSON
  • Use arrays to force multiple generations of the same data
  • Use enums to scope inputs and outputs to specific values
  • Drag and drop to reorganize variables into structs and arrays

In the coming weeks we will add more helpful tooltips and validation so your schemas are more reliable and easier to work with.

Of course, if you do feel like writing horrible JSON instead of using the UI, let us know and we might add that option as well.

Better datasets with schemas

Having more control over your synthetic datasets is important for the quality of your models, because you know your use case best. Schemas give you finer control over the dataset generation process, and they let us add features such as type checking, forced JSON outputs, and defined input/output structures.

For example, say you need to generate a dataset of movie reviews, where each example contains a review, a rating from 0-10, and a yes/no indicating whether someone would recommend the movie based on the review and rating. You can define the exact structure and length of the review, and ensure that the generated rating is an integer. Before schemas you would have to prompt-engineer the correct inputs/outputs; now you can define exactly what you need.
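A minimal sketch of that movie-review case, assuming a hypothetical schema layout (the names here are illustrative, not the product's actual format):

```python
# Hypothetical movie-review schema; field names are illustrative,
# not the product's actual schema format.
review_schema = {
    "review": {"type": "string", "max_tokens": 120},
    "rating": {"type": "int", "min": 0, "max": 10},
    "recommend": {"type": "enum", "values": ["yes", "no"]},
}

def check_example(example):
    """Verify one generated example matches the movie-review schema."""
    ok_review = (
        isinstance(example["review"], str)
        and len(example["review"].split()) <= review_schema["review"]["max_tokens"]
    )
    # The rating must be an integer in range -- not a float
    # or a string like "8/10".
    ok_rating = (
        isinstance(example["rating"], int)
        and review_schema["rating"]["min"]
        <= example["rating"]
        <= review_schema["rating"]["max"]
    )
    ok_recommend = example["recommend"] in review_schema["recommend"]["values"]
    return ok_review and ok_rating and ok_recommend

good = {"review": "Tense, well acted, a little long.", "rating": 8, "recommend": "yes"}
bad = {"review": "Great fun.", "rating": 8.5, "recommend": "yes"}  # float rating
print(check_example(good), check_example(bad))  # True False
```

This is the kind of constraint that previously had to be enforced through prompt engineering alone.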

In short, schemas prevent unnecessary hallucinations and give you finer control over your synthetic dataset.

Adding further control over your synthetic datasets: adjusting the complexity of your outputs

We are also working on adding complexity controls for your dataset outputs. Complexity control will let you adjust the depth and detail of your generated outputs while still adhering to the schema you provide.

Taking a story-writing use case as an example, both of the following samples share the same knowledge and style but differ in complexity:

  • Example 1 (lower complexity response) Once upon a time, a boy named Jack traded his cow for magic beans. The beans grew into a giant beanstalk reaching up to the sky. Jack climbed the beanstalk and found a castle with a giant living in it. He stole a hen that laid golden eggs and a harp that played by itself. Jack escaped down the beanstalk, cut it down, and he and his mother lived happily ever after with their riches.
  • Example 2 (higher complexity response) Jack, a curious boy from a poor village, traded his last cow for magic beans from a mysterious old woman. Overnight, the beans grew into a giant beanstalk reaching up into the clouds. Bravely, Jack climbed it, discovering a hidden world above where a giant lived with his stolen treasures, including a goose that laid golden eggs and a magical harp. Using his wit, Jack managed to take the goose and the harp, but the giant chased him down the beanstalk. Once safely back, Jack chopped down the beanstalk, defeating the giant and saving his village from further threat. With the golden eggs, Jack and his mother became wealthy, but he often looked up at the sky, wondering about the secrets beyond the clouds.

Expect this feature to be shipped and available to you next week.


Schema Builder

See above and these other improvements:

  • Reworked dataset creation and editing into a single dataset designer.
  • Added a side navigation so that you can easily get around the designer.
  • Fixed a few bugs relating to creating datasets.

Knowledge Graph

  • Greatly improved key topics generation for knowledge graphs when dealing with complex use cases like information extraction.
  • Made improvements to the diversity of the topics generated for a use case.


Other fixes and improvements:

  • Fixed the training model button disappearing on small screens.
  • Dataset views will now display schemas.
  • GetModel requests will now stop running if they 404 on the dataset view.