Stability AI has announced Stable Diffusion XL 1.0, a text-to-image model that the company describes as its "most advanced" version to date.
Stability AI says SDXL 1.0 produces more vivid and accurate colors, has improved contrast, lighting, and shadows, and can generate megapixel images (1024×1024). It also supports editing generated images directly on the web page.
Prompts can also be simpler than before. This is because SDXL 1.0's base model has 3.5 billion parameters and a stronger ability to understand language; the base version of Stable Diffusion, by comparison, has only about 1 billion parameters. As a result, SDXL 1.0 is one of the largest open image models available today.
The Stability AI blog presents more technical details of SDXL 1.0. First, the model breaks new ground in both scale and architecture: it innovatively pairs a base model with a refiner model, with 3.5 billion parameters in the base model and 6.6 billion in the combined two-model pipeline.
Emad Mostaque, founder of Stability AI, said that a larger parameter count lets the model grasp more concepts and learn deeper relationships. RLHF-based refinement was also already applied in the SDXL 0.9 release.
This is why SDXL 1.0 now works with short prompts, and can distinguish between the Red Square and a red square.
In the synthesis pipeline, the base model first generates a (noisy) latent, which the refiner model then denoises in a final step.
The base model can also be used as a standalone module. Combining the two models produces better-quality images without consuming substantially more computing resources.
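Conceptually, the base/refiner handoff looks like the following toy sketch. This is purely illustrative: the function names are hypothetical, the "denoiser" just interpolates toward a target, and the real pipeline operates on SDXL latents with learned noise predictors.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 8, 8))   # stand-in for a clean latent
noise = rng.standard_normal((4, 8, 8))

def denoise_steps(x, target, steps):
    # Toy "denoiser": each step moves the latent a fixed fraction of
    # the way toward the target (a real model predicts the noise).
    for _ in range(steps):
        x = x + 0.5 * (target - x)
    return x

# Base model: runs most of the reverse process; its output still
# contains some residual noise.
x = clean + noise
partially_denoised = denoise_steps(x, clean, steps=4)

# Refiner: specialized in the final denoising steps, applied to the
# latent handed off by the base model.
final = denoise_steps(partially_denoised, clean, steps=6)

err_base = np.abs(partially_denoised - clean).max()
err_final = np.abs(final - clean).max()
```

The point of the split is that the refiner only has to model the last, low-noise portion of the diffusion process, so it can specialize in high-frequency detail.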
Test results:
Install:
1. Clone the repo
git clone [email protected]:Stability-AI/generative-models.git
cd generative-models
2. Set up a virtual environment
This is assuming you've navigated to the generative-models root after the clone.
NOTE: This was tested under python3.8 and python3.10. For other python versions you may have version conflicts.
PyTorch 1.13
# install required packages from pypi
python3 -m venv .pt13
source .pt13/bin/activate
pip3 install -r requirements/pt13.txt
PyTorch 2.0
# install required packages from pypi
python3 -m venv .pt2
source .pt2/bin/activate
pip3 install -r requirements/pt2.txt
3. Install sgm
pip3 install .
4. Install sdata for training
pip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata
Package
This repository uses hatch, a PEP 517-compliant packaging tool.
To build a distributable wheel, install hatch and run hatch build (specifying -t wheel skips building the sdist, which is not necessary).
pip install hatch
hatch build -t wheel
You will find the built package in dist/. You can install wheels with pip install dist/*.whl.
Note that this package is not currently specified as a dependency; depending on your use case and PyTorch version, you will need to manually install the required packages.
Inference
We provide a streamlit demo for text-to-image and image-to-image sampling in scripts/demo/sampling.py. We provide file hashes for the complete files, as well as hashes covering only the tensors stored in each file (see the model spec for the script that computes these). The following models are currently supported:
· SDXL-base-1.0
File Hash (sha256): 31e35c80fc4829d14f90153f4c74cd59c90b779f6afe05a74cd6120b893f7e5b
Tensordata Hash (sha256): 0xd7a9105a900fd52748f20725fe52fe52b507fd36bee4fc107b1550a26e6ee1d7
· SDXL-Refiner-1.0
File Hash (sha256): 7440042bbdc8a24813002c09b6b69b64dc90fded4472613437b7f55f9b7d9c5f
Tensordata Hash (sha256): 0x1a77d21bebc4b4de78c474a90cb74dc0d2217caf4061971dbfa75ad406b75d81
· SDXL-base-0.9
· SDXL-Refiner-0.9
· SD-2.1-512
· SD-2.1-768
SDXL weights:
SDXL-1.0: SDXL-1.0 weights are available (under the CreativeML Open RAIL++-M license) here:
· Base model: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/
· Refiner model: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/
SDXL-0.9: The SDXL-0.9 weights are available under a research license. To access these models for research, apply through one of the links below: SDXL-base-0.9 and SDXL-refiner-0.9. Applying through either link, once approved, grants access to both models. Please log in to your Hugging Face account with your organizational email to request access.
After obtaining the weights, place them in checkpoints/. Then start the demo with
streamlit run scripts/demo/sampling.py --server.port <your_port>
Invisible Watermark Detection
Images generated with our code use the invisible-watermark library to embed an invisible watermark in the model output. We also provide a script for easily detecting that watermark. Note that this watermark differs from those in previous Stable Diffusion 1.x/2.x releases.
To run the script, you need to have a working installation as above or try an experimental one using only a minimal number of package imports:
python -m venv .detect
source .detect/bin/activate
pip install "numpy>=1.17" "PyWavelets>=1.1.1" "opencv-python>=4.1.0.25"
pip install --no-deps invisible-watermark
The script can then be used in the following way (don't forget to activate your virtual environment first, e.g. source .pt2/bin/activate):
# test a single file
python scripts/demo/detect.py <your filename here>
# test multiple files at once
python scripts/demo/detect.py <filename 1> <filename 2> ... <filename n>
# test all files in a specific folder
python scripts/demo/detect.py <your folder name here>/*
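To illustrate the general idea of embedding and recovering an invisible payload, here is a deliberately simplified sketch. It is not the actual algorithm: the invisible-watermark library embeds bits in a DWT-DCT transform domain (which survives mild compression), whereas this toy writes into raw least-significant bits and would not survive any re-encoding.

```python
import numpy as np

WATERMARK = 0b10110010  # hypothetical 8-bit payload

def embed(img, bits):
    # Toy illustration only: write one payload bit into the least
    # significant bit of each of the first 8 pixels.
    out = img.copy()
    flat = out.reshape(-1)
    for i in range(8):
        bit = (bits >> (7 - i)) & 1
        flat[i] = (flat[i] & 0xFE) | bit
    return out

def detect(img):
    # Read the payload bits back out in the same order.
    flat = img.reshape(-1)
    bits = 0
    for i in range(8):
        bits = (bits << 1) | (flat[i] & 1)
    return bits

image = np.random.default_rng(1).integers(0, 256, (16, 16), dtype=np.uint8)
marked = embed(image, WATERMARK)
```

The detection script in the repo performs the transform-domain analogue of `detect`: it extracts the bit pattern and checks it against the known Stable Diffusion payload.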
Training:
We provide an example training configuration in configs/example_training. To start training, run
python main.py --base configs/<config1.yaml> configs/<config2.yaml>
The configurations are merged from left to right, with later configurations overriding values set earlier. This can be used to combine model, training and data configurations, although everything can also be defined in a single configuration. For example, to run class-conditional pixel-based diffusion model training on MNIST, run
python main.py --base configs/example_training/toy/mnist_cond.yaml
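The left-to-right merge semantics can be sketched with plain dicts (the actual entrypoint merges YAML configs via a config library, but the override behavior is the same in spirit; the `merge` helper below is hypothetical):

```python
def merge(*configs):
    # Later configs override earlier ones, recursing into nested
    # dicts so unrelated keys from both sides are preserved.
    out = {}
    for cfg in configs:
        for key, value in cfg.items():
            if isinstance(value, dict) and isinstance(out.get(key), dict):
                out[key] = merge(out[key], value)
            else:
                out[key] = value
    return out

model_cfg = {"model": {"channels": 64, "depth": 4}}
train_cfg = {"model": {"depth": 8}, "lr": 1e-4}

# depth comes from the later config; channels survives from the earlier one
merged = merge(model_cfg, train_cfg)
```

This is why passing a model config followed by a training config works: shared keys take the later value, everything else is combined.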
Note 1: To use the non-toy configurations configs/example_training/imagenet-f8_cond.yaml, configs/example_training/txt2img-clipl.yaml and configs/example_training/txt2img-clipl-legacy-ucg-training.yaml for training, the data section must be edited for the dataset you use (which is expected to be stored in webdataset format). To find the parts that need modifying, search for comments containing USER: in the respective configuration.
Note 2: This repository supports both pytorch1.13 and pytorch2 for training generative models. However, for autoencoder training, such as configs/example_training/autoencoder/kl-f4/imagenet-attnfree-logvar.yaml, only pytorch1.13 is supported.
Note 3: Training the underlying generative model (e.g. configs/example_training/imagenet-f8_cond.yaml) requires retrieving the checkpoint from Hugging Face and replacing the CKPT_PATH placeholder in the config. Do the same for the provided text-to-image configurations.
Build a new diffusion model
Conditioner
The GeneralConditioner is configured through conditioner_config. Its only attribute is emb_models, a list of embedders (all inheriting from AbstractEmbModel) used to condition the generative model. All embedders should define whether or not they are trainable (is_trainable, default False), a classifier-free guidance dropout rate (ucg_rate, default 0), and an input key (input_key), e.g. txt for text conditioning or cls for class conditioning. When computing the conditioning, the embedder receives batch[input_key] as input. We currently support two- to four-dimensional conditionings, and conditionings from different embedders are concatenated appropriately. Note that the order of the embedders in conditioner_config matters.
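A minimal sketch of this pattern, with hypothetical class names (the real embedders produce learned tensors and the conditioner handles 2D–4D shapes; this toy uses fixed-size vectors just to show the mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyEmbedder:
    # Stand-in for an AbstractEmbModel subclass.
    def __init__(self, input_key, dim, ucg_rate=0.0, is_trainable=False):
        self.input_key = input_key
        self.dim = dim
        self.ucg_rate = ucg_rate
        self.is_trainable = is_trainable

    def __call__(self, value):
        # Fake embedding: a deterministic vector derived from the input.
        seed = abs(hash(value)) % (2**32)
        return np.random.default_rng(seed).standard_normal(self.dim)

class ToyConditioner:
    def __init__(self, emb_models):
        self.emb_models = emb_models

    def __call__(self, batch):
        outs = []
        for emb in self.emb_models:
            vec = emb(batch[emb.input_key])
            # Classifier-free guidance dropout: with probability
            # ucg_rate the conditioning is zeroed out during training.
            if rng.random() < emb.ucg_rate:
                vec = np.zeros_like(vec)
            outs.append(vec)
        # Embedder outputs are concatenated, so their order in
        # emb_models determines the layout of the conditioning.
        return np.concatenate(outs)

cond = ToyConditioner([ToyEmbedder("txt", 8), ToyEmbedder("cls", 4)])
c = cond({"txt": "a beautiful image", "cls": 3})
```

Each embedder pulls its own input from the batch by input_key, which is why a single batch dict can drive text, class, and other conditionings at once.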
network
The neural network is set through network_config. This used to be called unet_config, which is no longer general enough, since we plan to experiment with transformer-based diffusion backbones.
Loss
The loss is configured through loss_config. For standard diffusion model training, you must set sigma_sampler_config.
Sampler configuration
As mentioned above, the sampler is model-independent. In sampler_config we set the type of numerical solver, the number of steps, the type of discretization, and, for example, a guidance wrapper for classifier-free guidance.
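The solver/steps/discretization split can be sketched as follows. This is a toy Euler solver over a geometric sigma schedule, with an idealized denoiser and hypothetical function names; the guidance wrapper is omitted for brevity:

```python
import numpy as np

def make_sigmas(num_steps, sigma_max=10.0, sigma_min=0.1):
    # Toy discretization: geometric spacing from high to low noise.
    return np.geomspace(sigma_max, sigma_min, num_steps)

def euler_sample(denoiser, x, sigmas):
    # Model-independent Euler solver: the denoiser is a black box,
    # just as sampler_config chooses solver, step count, and
    # discretization independently of the network.
    sigmas = np.append(sigmas, 0.0)
    for i in range(len(sigmas) - 1):
        d = (x - denoiser(x, sigmas[i])) / sigmas[i]  # noise direction
        x = x + d * (sigmas[i + 1] - sigmas[i])
    return x

target = np.ones(4)
denoiser = lambda x, sigma: target   # idealized: always predicts the clean sample
x0 = np.random.default_rng(0).standard_normal(4) * 10.0
sample = euler_sample(denoiser, x0, make_sigmas(20))
```

Because the solver only ever calls the denoiser as a black box, swapping in a different network, step count, or discretization requires no change to the sampling loop itself.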
Dataset processing
For large-scale training, we recommend using our datapipelines project, which is included in the requirements and installed by following the installation section. Small map-style datasets should be defined in the repository (e.g. MNIST, CIFAR-10, ...) and return a dict of data keys/values, e.g.,
example = {"jpg": x,  # this is a tensor -1...1 chw
           "txt": "a beautiful image"}
We expect images in -1...1, channel-first format.
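A minimal map-style dataset in this format might look like the following sketch (class name and data are hypothetical; a real dataset would load actual images rather than random tensors):

```python
import numpy as np

class ToyImageTextDataset:
    # Minimal map-style dataset: each example is a dict holding an
    # image tensor in [-1, 1], channel-first (CHW), under "jpg",
    # and a text caption under "txt".
    def __init__(self, n=10, size=32):
        self.n = n
        self.size = size

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        rng = np.random.default_rng(idx)
        img = rng.uniform(-1.0, 1.0, (3, self.size, self.size))
        return {"jpg": img.astype(np.float32),
                "txt": f"a beautiful image #{idx}"}

ds = ToyImageTextDataset()
ex = ds[0]
```

The key names ("jpg", "txt") are what the conditioner's input_key values refer to, so they must match between the dataset and conditioner_config.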
According to the official announcement, SDXL 1.0 can run on a consumer-grade GPU with 8GB of VRAM, or in the cloud. SDXL 1.0 has also improved fine-tuning, and can be used to generate custom LoRAs or checkpoints.
The Stability AI team is also building the next generation of task-specific structure, style, and composition controls, with T2I/ControlNet tailored to SDXL.