Large Language Model (LLM) Tutorial with Julia’s Transformers.jl

By Peter Cheng | Aug 08, 2023

JuliaHub recently added a ChatGPT integration that lets users ask questions about the Julia language, covering documentation, package information, and code examples. You can try the AskAI feature right now by signing up for JuliaHub for free. To learn more about large language models in Julia, follow this walkthrough of the Transformers.jl package.

You can access the original notebook here:


To Start the Walkthrough:  

Start by loading the following packages:

using Transformers, CUDA

After loading the packages, we need to set up the GPU. Multi-GPU execution is currently not supported. If your machine has multiple GPU devices, use CUDA.devices() to list them and CUDA.device!(device_number) to select the device the model should run on.



For demonstration, we disable scalar indexing on the GPU, so any operation that would fall back to slow element-by-element access raises an error instead of silently hurting performance. With enable_gpu set, the todevice function provided by Transformers.jl moves the data/model to the GPU device.
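The GPU setup described in the two paragraphs above might look like the following. This is a minimal sketch rather than the original notebook cell: it assumes a CUDA-capable machine, the device ordinal 0 is an arbitrary choice, and enable_gpu is used as the text names it (newer Transformers.jl releases may configure the GPU backend differently).

```julia
using Transformers, CUDA

# List the GPUs visible to CUDA.jl, then select one by its 0-based ordinal.
# The ordinal 0 is an assumption; pick whichever device you want to use.
CUDA.devices()
CUDA.device!(0)

# Raise an error on scalar indexing of GPU arrays instead of running slowly.
CUDA.allowscalar(false)

# Make `todevice` move data and models to the GPU.
enable_gpu(true)
```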


In this tutorial, we show how to use dolly-v2-12b in Julia. Dolly is an instruction-following large language model trained on the Databricks machine learning platform and licensed for commercial use. It is based on the EleutherAI pythia model family and fine-tuned exclusively on databricks-dolly-15k, a new, high-quality, human-generated instruction-following dataset crowdsourced among Databricks employees. Databricks provides three model sizes: dolly-v2-3b, dolly-v2-7b, and dolly-v2-12b. More information can be found in Databricks' blog post. The process should also work for other causal LM based models.

With Transformers.jl, we can get the tokenizer and model by using the hgf"" macro or HuggingFace.load_tokenizer/HuggingFace.load_model. The required files, such as the model weights, are downloaded and managed automatically.

using Transformers.HuggingFace

textenc = hgf"databricks/dolly-v2-12b:tokenizer"
model = todevice(hgf"databricks/dolly-v2-12b:ForCausalLM") # move to gpu with `todevice` (or `Flux.gpu`)

using Flux
using StatsBase

# Softmax over the logits after dividing them by the temperature.
function temp_softmax(logits; temperature = 1.2)
    return softmax(logits ./ temperature)
end

# Sample one index among the k most probable entries, weighted by probability.
function top_k_sample(probs; k = 1)
    sorted = sort(probs, rev = true)
    indexes = partialsortperm(probs, 1:k, rev=true)
    index = sample(indexes, ProbabilityWeights(sorted[1:k]), 1)
    return index
end
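To see what the temperature and k parameters do without loading Flux, StatsBase, or a model, here is a small self-contained sketch in plain Julia. The names temp_softmax_sketch and top_k_sample_sketch are hypothetical stand-ins for the functions above, and the logits are made-up values chosen only for illustration.

```julia
using Random

# Numerically stable softmax with temperature scaling (plain Julia, no Flux).
function temp_softmax_sketch(logits::AbstractVector{<:Real}; temperature = 1.2)
    scaled = logits ./ temperature
    exps = exp.(scaled .- maximum(scaled))  # subtract the max to avoid overflow
    return exps ./ sum(exps)
end

# Draw one index among the k most probable entries, proportionally to their
# probabilities, using an inverse-CDF draw over the unnormalised weights.
function top_k_sample_sketch(rng::AbstractRNG, probs::AbstractVector{<:Real}; k = 1)
    top = partialsortperm(probs, 1:k, rev = true)  # indices of the k largest entries
    weights = probs[top]
    threshold = rand(rng) * sum(weights)
    acc = zero(threshold)
    for (idx, w) in zip(top, weights)
        acc += w
        acc >= threshold && return idx
    end
    return last(top)  # guard against floating-point round-off
end

logits = [2.0, 1.0, 0.5, -1.0]                           # made-up example values
sharp = temp_softmax_sketch(logits; temperature = 0.5)   # low temperature: peaked
flat = temp_softmax_sketch(logits; temperature = 2.0)    # high temperature: flatter
greedy = top_k_sample_sketch(Random.default_rng(), flat; k = 1)
```

A lower temperature concentrates probability on the largest logit, while a higher one spreads it out. With k = 1 only the most probable index can be drawn, so sampling becomes deterministic.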

The main generation loop is defined as follows:

  1. The prompt is first preprocessed and encoded with the tokenizer textenc. The encode function returns a NamedTuple where .token is the one-hot representation of our context tokens.
  2. At each iteration, we copy the tokens to the GPU and feed them into the model. The model also returns a NamedTuple, where .logit holds the model's predictions. We then apply temperature scaling and top-k sampling to the logits of the last position to pick the next token; with the default k = 1 this reduces to greedy decoding. The token is appended to the end of the context tokens. Iteration stops when we exceed the maximum generation length or the predicted token is an end token.
  3. After the loop, we decode the one-hot encoding back to text tokens. The decode function converts the onehots to texts and also performs some post-processing to get the final list of strings.
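The control flow of these three steps can be sketched on plain integer token ids, with no tokenizer, GPU, or real model involved. Everything here is hypothetical: toy_model is a stand-in that simply favours the next id in a 5-token vocabulary, and greedy_generate is an illustrative name rather than part of Transformers.jl.

```julia
# Toy stand-in for the language model: returns logits over a vocabulary of
# size `vocab_size` that strongly favour the id following the last token.
function toy_model(ids::Vector{Int}; vocab_size = 5)
    logits = fill(-10.0, vocab_size)
    logits[mod1(last(ids) + 1, vocab_size)] = 10.0
    return logits
end

# Greedy generation: append the most likely next token until `end_id` is
# produced or `max_length` new tokens have been generated.
function greedy_generate(model, prompt_ids::Vector{Int}; max_length = 10, end_id = 5)
    ids = copy(prompt_ids)
    for _ in 1:max_length
        logits = model(ids)       # "forward pass" on the current context
        new_id = argmax(logits)   # greedy choice of the next token
        push!(ids, new_id)        # extend the context
        new_id == end_id && break # stop at the end token
    end
    return ids
end

generated = greedy_generate(toy_model, [1, 2])
```

The real loop differs mainly in that the context is a one-hot array produced by the tokenizer, the forward pass runs on the GPU, and the final ids are decoded back to text.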

using Transformers.TextEncoders

function generate_text(textenc, model, context = ""; max_length = 512, k = 1, temperature = 1.2, ends = textenc.endsym)
    encoded = encode(textenc, context).token
    ids = encoded.onehots  # token ids backing the one-hot array; push! below extends `encoded`
    ends_id = lookup(textenc.vocab, ends)
    for i in 1:max_length
        input = (; token = encoded) |> todevice
        outputs = model(input)
        logits = @view outputs.logit[:, end, 1]  # logits of the last position
        probs = temp_softmax(logits; temperature)
        new_id = top_k_sample(collect(probs); k)[1]
        push!(ids, new_id)
        new_id == ends_id && break
    end
    return decode(textenc, encoded)
end

We use the same prompt as Dolly, defined in instruct_pipeline.py:

function generate(textenc, model, instruction; max_length = 512, k = 1, temperature = 1.2)
    # Prompt template; the instruction is interpolated between the two markers.
    prompt = """
    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    $instruction

    ### Response:
    """
    text_token = generate_text(textenc, model, prompt; max_length, k, temperature, ends = "### End")
    gen_text = join(text_token)
    return gen_text
end

generate(textenc, model, "Explain to me the difference between nuclear fission and fusion.")
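To inspect the prompt text without loading the 12B model, the template can be pulled into a small helper. This is only a sketch: build_dolly_prompt is a hypothetical name, and the layout (blank lines around the markers) is assumed to follow the instruct_pipeline.py prompt mentioned above.

```julia
# Build the instruction prompt as a plain string. Julia's triple-quoted
# strings strip the common leading indentation, so the result starts at
# column one.
function build_dolly_prompt(instruction::AbstractString)
    return """
    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    $instruction

    ### Response:
    """
end

prompt = build_dolly_prompt("Explain to me the difference between nuclear fission and fusion.")
print(prompt)
```

Printing the prompt makes it easy to confirm that the instruction lands between the "### Instruction:" and "### Response:" markers before any GPU time is spent.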

JuliaHub is a unified platform for modeling, simulation, and user built applications with the Julia language. Follow along with us and learn more about the Julia language and ecosystem and try our AskAI for free by signing up today.
