Dev Diary #2 - Designing Synthetic Datasets

2024-03-15

The number one thing we hear from people is that they want more control over their datasets and models. Last week we gave some insight into what your model "knows", but this week the focus is on controlling the process of generating that synthetic knowledge.

The current problem

When you are creating a dataset with Glaive you can describe the types of inputs and outputs you want the model to be trained on. This is fine for certain tasks where a simple description suffices, but for more complex tasks you need more control.

For example, it's not easy to force the samples to be a certain length, to be formatted as JSON, or to choose from a defined set of values. This is where a schema comes in.

Describing the structure of synthetic data

Since we are generating data, it's still really important that you can describe the nature of the data, but being able to also describe its structure is a big step forward. A schema lets you do both.

Here is a simple example for a model that generates fairy tales:

  "input_variables": {
    "mythology": {
      "name": "mythology",
      "description": "A mythological setting",
      "type": {
        "enum": {
          "values": ["greek", "norse", "egyptian", "mayan", "chinese"]
        }
      }
    },
    "moral": {
      "name": "moral",
      "description": "A moral lesson",
      "type": {
        "string": {
          "size": {
            "max": 40
          }
        }
      }
    }
  },
  "output_variables": {
    "story": {
      "name": "story",
      "description": "A creative fairy tale set in {{mythology}} mythology with the moral lesson: {{moral}}",
      "type": {
        "string": {
          "size": {
            "min": 500
          }
        }
      }
    }
  }
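The `{{mythology}}` and `{{moral}}` placeholders in the output description suggest that the input values get substituted into it before generation. How Glaive does this internally isn't covered here, so the following is only a minimal sketch of that idea. The helper name `render_description` is hypothetical.

```python
import re


def render_description(template: str, values: dict[str, str]) -> str:
    """Replace each {{name}} placeholder with the matching input value.

    Hypothetical helper: a guess at how placeholders could be filled in,
    not Glaive's actual implementation.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no input variable named {name!r}")
        return values[name]

    # Match {{name}}, tolerating spaces inside the braces.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = (
    "A creative fairy tale set in {{mythology}} mythology "
    "with the moral lesson: {{moral}}"
)
prompt = render_description(
    template,
    {"mythology": "norse", "moral": "pride comes before a fall"},
)
print(prompt)
```

Raising on an unknown placeholder, rather than leaving it in the text, is a design choice in this sketch: a typo in a variable name fails loudly instead of leaking `{{...}}` into the generated samples.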

By describing each of these parameters in more detail, you can generate more focused and specific datasets, which improves both quality and reliability.
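To make the constraints concrete, here is a minimal sketch of checking a candidate value against the `enum` and `string` types from the schema above. The function `check_value` is hypothetical. The sketch also assumes that `size` counts characters, which the schema itself does not specify.

```python
def check_value(variable: dict, value) -> list[str]:
    """Return the constraint violations for one variable (empty if valid).

    Hypothetical checker covering only the "enum" and "string" types.
    """
    name = variable["name"]
    type_spec = variable["type"]
    problems = []

    if "enum" in type_spec:
        allowed = type_spec["enum"]["values"]
        if value not in allowed:
            problems.append(f"{name}: {value!r} is not one of {allowed}")
    elif "string" in type_spec:
        if not isinstance(value, str):
            problems.append(f"{name}: expected a string")
        else:
            # Assumption: "size" is measured in characters.
            size = type_spec["string"].get("size", {})
            if "min" in size and len(value) < size["min"]:
                problems.append(f"{name}: shorter than {size['min']}")
            if "max" in size and len(value) > size["max"]:
                problems.append(f"{name}: longer than {size['max']}")

    return problems


mythology = {
    "name": "mythology",
    "description": "A mythological setting",
    "type": {"enum": {"values": ["greek", "norse", "egyptian", "mayan", "chinese"]}},
}
moral = {
    "name": "moral",
    "description": "A moral lesson",
    "type": {"string": {"size": {"max": 40}}},
}

print(check_value(mythology, "norse"))  # no violations
print(check_value(mythology, "roman"))  # one violation: not in the enum
print(check_value(moral, "x" * 41))     # one violation: too long
```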

And if you thought that JSON was bad, just look at these possible variables:

// Example JSON or struct variables
"jsonVariables": {
  "name": "input",
  "description": "The JSON input",
  "type": {
    "struct": {
      "fields": {
        "property": {
          "name": "property",
          "description": "A property",
          "type": {
            "float": {}
          }
        },
        "substruct": {
          "name": "substruct",
          "description": "A substruct",
          "type": {
            "struct": {
              "fields": {
                "subproperty": {
                  "name": "subproperty",
                  "description": "A subproperty",
                  "type": {
                    "bool": {}
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

// Example array variable
"arrayVariable": {
  "name": "array",
  "description": "An array",
  "type": {
    "array": {
      "variable": {
        "name": "array",
        "description": "Array values",
        "type": {
          "int32": {}
        }
      }
    }
  }
}
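Because `struct` and `array` types contain further variables, a schema like this is naturally handled by recursion. The sketch below shows one way to test whether a plain Python value matches a type. The function `matches` is hypothetical, and it assumes that every struct field is required and that `int32` means a signed 32-bit range, neither of which is stated above.

```python
def matches(type_spec: dict, value) -> bool:
    """Recursively check a value against a type from the schema.

    Hypothetical sketch covering only the types shown in the examples.
    """
    if "struct" in type_spec:
        fields = type_spec["struct"]["fields"]
        # Assumption: all fields are required and no extras are allowed.
        return (
            isinstance(value, dict)
            and value.keys() == fields.keys()
            and all(matches(f["type"], value[k]) for k, f in fields.items())
        )
    if "array" in type_spec:
        element_type = type_spec["array"]["variable"]["type"]
        return isinstance(value, list) and all(
            matches(element_type, item) for item in value
        )
    if "bool" in type_spec:
        return isinstance(value, bool)
    if "int32" in type_spec:
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (
            isinstance(value, int)
            and not isinstance(value, bool)
            and -2**31 <= value < 2**31
        )
    if "float" in type_spec:
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    raise ValueError(f"unsupported type: {list(type_spec)}")


struct_type = {
    "struct": {
        "fields": {
            "property": {"name": "property", "type": {"float": {}}},
            "substruct": {
                "name": "substruct",
                "type": {
                    "struct": {
                        "fields": {
                            "subproperty": {"name": "subproperty", "type": {"bool": {}}}
                        }
                    }
                },
            },
        }
    }
}
array_type = {"array": {"variable": {"name": "array", "type": {"int32": {}}}}}

print(matches(struct_type, {"property": 1.5, "substruct": {"subproperty": True}}))
print(matches(array_type, [1, 2, 3]))
```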

Anyway, the point is that you can now design nearly arbitrary synthetic datasets that closely match the data you want your model to act upon.

We are going to be shipping a way to build these schemas next week, so stay tuned for that!

Changelog

Playground

  • We have reworked the playground infrastructure to hopefully be a whole lot more reliable, so try it out and let us know if you run into any problems.
  • It should also save your settings between sessions.

Previews

  • Dataset previews are now much more stable and able to handle a more diverse set of inputs and outputs.
  • Added the ability to control the number of previews generated.

Fixes

  • Added a rate limiter.
  • Fixed a web app crash related to nil knowledge graphs.