This week we finally shipped an early version of external sources, which lets you optionally provide your own data for use during synthetic generation. Llama3 has also been released, and we now support it as our second base model!

External Sources

Until now, we supplied all the data used for synthetic generation, which has two limitations. First, we have to guess what the best information is, and second, we have no access to proprietary data. With external sources, you can now provide additional data to be used during synthetic generation. This is a big step towards making the platform more flexible, and we are excited to see how you use it.

The first two ways to add data are uploading a file (only JSON and TXT for now) or pointing us to an S3 bucket (where, again, we only look for JSON and TXT files). We will add more ways to bring in data in the future, but for now, this should be enough to get you started.
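To get a feel for which files in a bucket would be picked up, here is a minimal sketch that filters a list of object keys down to JSON and TXT. It assumes files are matched by extension, case-insensitively; that matching rule, the `supported_keys` helper, and the example keys are illustrative assumptions, not a description of how the platform actually scans a bucket.

```python
from pathlib import PurePosixPath

# Assumption: files are recognised by extension, ignoring case.
SUPPORTED_EXTENSIONS = {".json", ".txt"}


def supported_keys(keys):
    """Return only the keys whose extension is .json or .txt."""
    return [
        key
        for key in keys
        if PurePosixPath(key).suffix.lower() in SUPPORTED_EXTENSIONS
    ]


# Hypothetical object keys, as you might see when listing a bucket.
example_keys = [
    "docs/faq.txt",
    "docs/products.JSON",
    "images/logo.png",
    "notes/README.md",
    "archive/",
]

print(supported_keys(example_keys))
```

Running a check like this over your own bucket listing before linking it is a quick way to spot data that would be skipped and needs converting first.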

Alternative Base Models

Thanks to the great open-source work from the folks at Meta, there is now another fantastic base model available in the sub-10B parameter range. You can now choose between Llama3 and Mistral 7B when training a model.

As a slightly larger model (8B parameters versus 7B), Llama3 is the better choice for more complex tasks, but be warned: its weights are larger too.
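For a rough sense of the difference, weight size is approximately the parameter count times the bytes stored per parameter. The sketch below uses the nominal counts from the model names (7B and 8B) and assumes 16-bit weights (2 bytes per parameter); the real checkpoints differ slightly, and the `weights_size_gb` helper is purely illustrative.

```python
def weights_size_gb(num_params, bytes_per_param=2):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts taken from the model names (an approximation).
mistral_7b_gb = weights_size_gb(7e9)
llama3_8b_gb = weights_size_gb(8e9)

print(f"Mistral 7B: ~{mistral_7b_gb:.0f} GB")
print(f"Llama3 8B:  ~{llama3_8b_gb:.0f} GB")
```

Under these assumptions the gap is about 2 GB, which matters mostly when downloading the weights or fitting them on a single GPU.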

Changelog

Sources

Create sources on the page
Link and unlink them inside a dataset
Currently, only JSON and TXT files are supported via File Upload and S3

Models

Llama3 is now available as a base model

Edits