
Transitioning Legacy Compute to JuliaHub: A Practical Guide

By JuliaHub | Jul 15, 2024

Julia - the Gateway to Fast Computing

The Julia language earned its renown by providing an accessible front end to high-performance scientific computing.  The requirements of high-performance scientific computing overlap with many of the requirements of fast computing in general, making Julia a competitive programming language well beyond science.

Within the application stack, it makes a lot of sense to use Julia for the “compute” layer of the application:  whether running OLAP queries, high-dimensional data analytics, or maximizing an objective function, Julia as the compute layer delivers high performance across many computing applications.  And with the extensive universe of open-source packages in the Julia ecosystem, developers can move quickly, building on well-established, community-vetted dependencies.

JuliaHub - the Canonical Platform for the Compute in Your Stack

JuliaHub provides an observable, secure, consistent, and replicable compute platform for Julia applications.  By combining Julia with JuliaHub, clients have been able to retire legacy code or drop licenses, gaining performance, maintainability, and even cost savings.

Advantages - Plugging Julia Into Your Application With JuliaHub

Below is a simple diagram of the relationship between your application layer and your JuliaHub compute layer:  commands and/or data are sent from your application to the JuliaHub compute layer, and the response from JuliaHub contains the desired evaluation of those commands and/or data.

[Diagram: commands and data flowing between your application layer and the JuliaHub compute layer]

Advantage #1:  Use Your Existing SSO Authentication

Before diving into serializing/deserializing data structures, or even parsing application commands into Julia code, the application itself needs to connect securely with the Julia compute layer.  It is here that JuliaHub begins to demonstrate the power of its API, first by enabling authentication, including optional SSO authentication from the application.

Without any additional work, a developer can seamlessly authenticate into JuliaHub using the application’s existing SSO.  Other options include long-lived tokens or JuliaHub-specific user credentials, but best practice remains to defer to the SSO authentication of the main application.
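
For cases where interactive SSO is not practical (e.g. a CI pipeline), a long-lived token can be used instead.  A minimal sketch, assuming JuliaHub.jl’s two-argument authenticate method and a token stored in a JULIAHUB_TOKEN environment variable of our own naming:

using JuliaHub

# Long-lived token generated in the JuliaHub UI; the JULIAHUB_TOKEN variable
# name is our own convention for this sketch, not a JuliaHub standard.
auth = JuliaHub.authenticate("juliahub.com", ENV["JULIAHUB_TOKEN"])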

Advantage #2: Compute Jobs Are Isolated From One Another 

The JuliaHub API enables running a compute job in an isolated namespace, sharing no data with other jobs or users.  Thus, a single JuliaHub instance can securely compute across an application’s users without violating SLAs or compromising audit certification.

Advantage #3: JuliaHub Computes Are Consistent

Architecture-specific variations in native Julia code can sometimes result in discrepancies - this risk is eliminated with JuliaHub, where the compiler’s target architecture is the same for every job, every time.

Advantage #4: Re-running a Compute is Easy

In addition to providing a stable target architecture that assures consistent computing outcomes, JuliaHub makes re-running a given compute job a built-in feature: users can re-run compute jobs with exactly the same state and settings.

Advantage #5: Observability Is Built-in and Customizable

Easily observe jobs with the JuliaHub API and view metrics on memory usage, CPU utilization, and custom metrics built from any logged variable you desire.
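
As a small sketch of the idea (the metric names here are our own, purely for illustration), anything a job logs can serve as a custom metric:

using Logging

# Structured log records emitted inside a job script show up in the job's
# logs, so any variable you log can be tracked as a custom metric.
for epoch in 1:3
    loss = 1.0 / epoch   # stand-in for a real application metric
    @info "training progress" epoch loss
end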

Advantage #6:  Powerful Datasets API

If your compute job requires large datasets, JuliaHub has a powerful datasets API with which to push your data into your JuliaHub compute instance.
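
A rough sketch of that round trip with the JuliaHub.jl package (the dataset name and file paths are hypothetical; see the datasets documentation for the full API):

using JuliaHub
JuliaHub.authenticate("juliahub.com")

# Upload a local file as a named dataset (name and path are illustrative)...
JuliaHub.upload_dataset("pricing-inputs", "inputs.csv")

# ...and pull it back down later, e.g. from inside a compute job.
JuliaHub.download_dataset("pricing-inputs", "inputs-local.csv")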

Your New Compute Layer - Migrating to Julia With JuliaHub

To illustrate how a migration might work, we will use a hypothetical example:  imagine your application currently depends on the (hypothetical) “ProTRAN” application for its computes.

ProTRAN has its own syntax and semantics, which differ significantly from those in Julia.


ProTRAN has its own data types and structure, which also differ significantly from those in Julia.

Hence, before we even consider network connections, deployment, or security, we must first find out how to:

  • Serialize data structures sent to ProTRAN by the application so they can be ingested by Julia
  • Deserialize data sent to the new Julia compute into native Julia data structures (a minimal sketch of this round trip follows below)

[Diagram: serializing application data for the Julia compute layer and deserializing responses into native Julia structures]
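
As a minimal sketch of that round trip, assuming JSON as the wire format (the field names here are illustrative only):

using JSON

# Application side: serialize the inputs the legacy layer used to receive.
payload = JSON.json(Dict("Index1" => [100, 200, 300], "Index2" => [1, 2]))

# Julia compute side: deserialize the payload into native Julia structures.
inputs = JSON.parse(payload)
Index1 = Vector{Int}(inputs["Index1"])
Index2 = Vector{Int}(inputs["Index2"])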

To make this example more concrete, suppose ProTRAN is designed around OLAP and tabular data.  We could then expect some way of establishing “relations” between indices for dimensions and the underlying data in their respective columns.

Before building out a comprehensive solution, which typically requires metaprogramming (see https://docs.julialang.org/en/v1/manual/metaprogramming/) and the development of a Domain-Specific Language, it often accelerates development to build intuition for how Julia and JuliaHub work: take a typical compute job from the legacy application, convert it to Julia, and run it on JuliaHub.  That is what we will do now.
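
To give a taste of that metaprogramming (this macro is purely illustrative, not part of any real ProTRAN bridge), a DSL could mimic ProTRAN’s Index declarations like so:

# A hypothetical macro that mimics ProTRAN's Index declaration syntax.
macro protran_index(name, values)
    return :( $(esc(name)) = collect($(esc(values))) )
end

@protran_index Index1 (100, 200, 300)   # expands to Index1 = [100, 200, 300]
@protran_index Index2 (1, 2)            # expands to Index2 = [1, 2]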

For our demonstration, we will use a contrived example in which three ProTRAN types are defined:

  • Index
  • Table
  • Eval

These three types will be used to create a tabular dataset with a function applied to its columns.  The function applied to the columns of a Table is determined by the Eval declaration, while the columns of the table itself are specified by Index.

Legacy ProTRAN Code


def Index 
  Index1 = (100,  200, 300)
  Index2 = (1,2)
end

This creates two “indices” with which to index an anticipated tabular dataset.  The Index objects must eventually be associated with a Table.  Furthermore, we need to define a function, price, that takes all (in this case, both) Index entities as arguments.


def Table PriceTable
  INPUTS
  Index1
  Index2
  OUTPUTS
  price(Index1,Index2)
end

Function   Index1   Index2   price(Index1,Index2)
price      100      1        price(100, 1)
price      100      2        price(100, 2)
price      200      1        price(200, 1)
price      200      2        price(200, 2)
price      300      1        price(300, 1)
price      300      2        price(300, 2)

This excruciatingly contrived ProTRAN code defines a two-dimensional function, price, over Index1 and Index2.  Because price is an Eval-type object, it must be bound to a Table: the price(PriceTable) portion of the Eval declaration below implies that price takes its indices from the Table PriceTable.


def Eval price(PriceTable)
  Index1 * Index2 + sqrt(Index1)
end

Again, note the implied arguments of the price function: it automatically takes both (in general, all) indices as arguments.  This is the default behavior of ProTRAN, and the behavior we seek to model in our example:

Function   Index1   Index2   price(Index1,Index2)
price      100      1        price(100, 1)
price      100      2        price(100, 2)
price      200      1        price(200, 1)
price      200      2        price(200, 2)
price      300      1        price(300, 1)
price      300      2        price(300, 2)

With this mildly agonizing exercise complete, we can finally begin translating the above legacy code into its corresponding Julia and dispatching it as a Job to JuliaHub.

Firstly, we will address the Index creation - a pair of vectors in Julia:


def Index 
  Index1 = (100,200,300)
  Index2 = (1,2)
end

in Julia…


Index1 = [100,200,300]
Index2 = [1,2]

Now we run into a question that requires a decision:  Julia does not have a “native” tabular data structure.  There are many proven and effective packages that provide one; we will use the popular DataFrames.jl package, along with broadcasting, to make our table compute as expected for our example.

The question, then, is how to turn this ProTRAN...


def Table PriceTable
  INPUTS
  Index1
  Index2

  OUTPUTS
  price(Index1,Index2)
end

def Eval price(PriceTable)
  Index1 * Index2 + sqrt(Index1)
end

… into well-performing Julia code.  

With DataFrames.jl (https://dataframes.juliadata.org/stable/), we simply need to do the following:

  1. Create a dataframe with our data (the vectors specified as Index variables)
  2. Apply our transformation (the price function, in our case) to the data
  3. Return the new dataframe

Because our vectors do not automatically “denormalize” the way Index objects do in a ProTRAN Table, we must first “denormalize” our indices before using them to construct our dataframe object in Julia:


# repeat each Index1 value once per Index2 value: 100, 100, 200, 200, 300, 300
index1_df = repeat(Index1, inner = length(Index2))
# cycle Index2 across all Index1 values: 1, 2, 1, 2, 1, 2
index2_df = repeat(Index2, outer = length(Index1))

The repeat function (a built-in Julia function) lets us repeat the data as needed for our denormalization, so when we use the DataFrame constructor, we get the desired result:


df = DataFrame(Index1 = index1_df, Index2=index2_df)

… will then give…


6×2 DataFrame
 Row │ Index1  Index2
     │ Int64   Int64
─────┼────────────────
   1 │    100       1
   2 │    100       2
   3 │    200       1
   4 │    200       2
   5 │    300       1
   6 │    300       2

With our dataframe ready, we are now prepared for the transformation: it will add a new column, price, which is a function of both of the prior columns.  Note the . prefix on the operators: it vectorizes (broadcasts) their evaluation, e.g.


[1,2,3] .* 3
3-element Vector{Int64}:
 3
 6
 9

In our case, we use vectorized operators for the price function:


function price(x,y)
  x .* y .+ sqrt(x)  
end

Finally, we transform our DataFrame to show the result of this price function applied to our DataFrame, in accordance with the standard from our design specification:


df = transform(df, [:Index1, :Index2] => ByRow(price) => :price)

Recall that our goal is to run this compute and return the result from a JuliaHub job.

We have two options for running our script, which we will call `app_compute.jl`:

  1. Use the JuliaHub “jobs API”
  2. Host a web server on JuliaHub to route and process the computes

Option #2 would require setting up a Julia webserver (e.g. using the Oxygen.jl package), and while that is straightforward, it is certainly simpler to just use the JuliaHub “jobs API” for our example; a rough sketch of what option #2 could look like follows.
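
For the curious, a minimal Oxygen.jl server might look roughly like this; the route, payload handling, and port are all illustrative assumptions, not a prescribed JuliaHub pattern:

using Oxygen
import JSON

# Hypothetical route: receive the index vectors as JSON and return a result.
@post "/price" function(req)
    inputs = Oxygen.json(req)   # parse the JSON request body
    # ... run the same compute as app_compute.jl on inputs["Index1"] and inputs["Index2"] ...
    return JSON.json(Dict("status" => "ok"))
end

serve(port=8080)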

With option #1, we can, for illustrative purposes, pass inputs to our job as ENV parameters.  In general, when sending data to JuliaHub for a compute, there are three options:

  1. ENV parameters for small data (e.g. our prototype example, here)
  2. Using an appbundle, which contains the data as file(s)
  3. Upload the data to JuliaHub as a dataset and access it using the JuliaHub datasets API 

For the sake of brevity, we will use only the first option, ENV parameters, for sending data to our compute; appbundles (https://help.juliahub.com/julia-api/stable/guides/jobs/#jobs-batch-appbundles) and the JuliaHub datasets API (https://help.juliahub.com/julia-api/stable/guides/datasets/) are both well documented for users who need them.

Below is a schematic of how we initially interact with our compute job on JuliaHub:

[Schematic: submitting app_compute.jl to JuliaHub with ENV parameters and reading back the results]

To capture these ENV values, we use the following code:


# default values in third argument of get
get(ENV, "Idx1", "[100,200,300]")
get(ENV, "Idx2", "[1,2]")

Next, we will need to parse both Index1 and Index2: the vectors are passed as strings, and these strings must become vectors of integers for our compute to run.  We use a vectorized parse to achieve this:


# chop strips the leading '[' (head=1) and the trailing ']' (tail=1, the default)
Index1 = parse.(Int, split(chop(Index1, head=1),','))
Index2 = parse.(Int, split(chop(Index2, head=1),','))

The app_compute.jl code should look like the following:


import Pkg;
Pkg.add("DataFrames")

using DataFrames

# default values in third argument of get
get(ENV, "Idx1", "[100,200,300]")
get(ENV, "Idx2", "[1,2]")

Index1 = parse.(Int, split(chop(Index1, head=1),','))
Index2 = parse.(Int, split(chop(Index2, head=1),','))

index1_df = repeat(Index1, inner = length(Index2))
index2_df = repeat(Index2, outer = length(Index1))

df = DataFrame(Index1 = index1_df, Index2=index2_df)

function price(x,y)
  x .* y .+ sqrt(x)  
end

df = transform(df, [:Index1, :Index2] => ByRow(price) => :price)

To accommodate the JuliaHub API, we will need to make a few changes to this code:

  1. Add the JSON package
  2. Load the JSON package with using
  3. Cast the df dataframe to JSON
  4. Set ENV["RESULTS"] to the value of json(df)

The code with these additions is below:


import Pkg;

Pkg.add("DataFrames")
Pkg.add("JSON")

using DataFrames
using JSON

# default values in third argument of get
get(ENV, "Idx1", "[100,200,300]")
get(ENV, "Idx2", "[1,2]")

Index1 = parse.(Int, split(chop(Index1, head=1),','))
Index2 = parse.(Int, split(chop(Index2, head=1),','))

index1_df = repeat(Index1, inner = length(Index2))
index2_df = repeat(Index2, outer = length(Index1))

df = DataFrame(Index1 = index1_df, Index2=index2_df)

function price(x,y)
  x .* y .+ sqrt(x)
end

df = transform(df, [:Index1, :Index2] => ByRow(price) => :price)

ENV["RESULTS"] = json(df)

Next, from a local Julia REPL, we will authenticate our JuliaHub user with the JuliaHub.jl authentication API:


using JuliaHub
JuliaHub.authenticate("juliahub.com")

(the URL passed to the JuliaHub.authenticate() function may differ for your specific JuliaHub instance - if so, use the URL of your JuliaHub instance)

Next, you will be prompted to open the JuliaHub UI to authenticate your account:

[Screenshot: browser prompt to authenticate your JuliaHub account]

Once you authenticate, you can make API calls to your JuliaHub instance using the JuliaHub.jl package.  In our case, we want to do three things:

  1. Set environment variables for the job inputs
  2. Dispatch a compute
  3. Query the updated Job object to get the ENV["RESULTS"] value - note that the result will be returned as a JSON object

Defining the environment variables for this job is simply a matter of constructing a Dict object with string values for each key.  Again, from within your REPL, define env_var, the environment Dict to be used by our anticipated Job on JuliaHub:


env_var = Dict("Idx1" => "[100,200,300]", "Idx2" => "[1,2]")

Next, we dispatch the compute from the REPL:


job = JuliaHub.submit_job(JuliaHub.script("app_compute.jl"), env = env_var)

Note that the env values Idx1 and Idx2, as defined in env_var above, are submitted as strings, even though they look like vectors: it is the parsing code in app_compute.jl that makes them usable.

To access results, we simply set the ENV["RESULTS"] environment variable to the desired value within `app_compute.jl`; the results are then accessed from the Job object created when submit_job() was invoked.  From the REPL:


job = JuliaHub.job(job)

It may be helpful to query the status of the Job, to confirm it has completed before querying the result:


job.status

Once the status shows the job has completed, the refreshed Job object gives us the final table we desired, in JSON format:

[Screenshot: the final table returned as JSON]
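
Rather than polling job.status by hand, JuliaHub.jl can also block until the job finishes.  A short sketch, assuming the Job object exposes the stored results string via its results field:

job = JuliaHub.wait_job(job)   # block until the job has finished
println(job.results)           # the JSON string we stored in ENV["RESULTS"]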

Within the JuliaHub UI (Jobs >> Completed Jobs >> Details), all of the ENV inputs and the outputs can be viewed:

ENV inputs, with Idx1 and Idx2:

[Screenshot: ENV inputs Idx1 and Idx2 in the job details view]

ENV["RESULTS"], with the JSON-ified dataframe data cleanly displayed within the UI:

[Screenshot: the JSON-ified dataframe displayed in the JuliaHub UI]

Up until now, we have been using ENV parameters to set the input values and to access output - when working with larger data, you will use the JuliaHub datasets API for input and/or output.

In particular, the ENV["RESULTS"] environment variable is limited to roughly 1 KB; for larger results, you will need to write an output file and download it from JuliaHub.

Because we are working with a DataFrame object, we will use the CSV package, which allows us to easily write our output as a .csv file.

First, we need to add the necessary dependency and load it:


Pkg.add("CSV")
using CSV

Next, we need to define the path and filename of our desired output file and set ENV["RESULTS_FILE"] to it:


ENV["RESULTS_FILE"] = joinpath(@__DIR__,"output.csv")

Finally, we write the CSV file, using the write function defined within the CSV package:


CSV.write(ENV["RESULTS_FILE"], df)

Our app_compute.jl should now look like this:


import Pkg;
Pkg.add("DataFrames")
Pkg.add("JSON")
Pkg.add("CSV")

using DataFrames
using JSON
using CSV

# default values in third argument of get
get(ENV, "Idx1", "[100,200,300]")
get(ENV, "Idx2", "[1,2]")

Index1 = parse.(Int, split(chop(Index1, head=1),','))
Index2 = parse.(Int, split(chop(Index2, head=1),','))

index1_df = repeat(Index1, inner = length(Index2))
index2_df = repeat(Index2, outer = length(Index1))

df = DataFrame(Index1 = index1_df, Index2=index2_df)

function price(x,y)
  x .* y .+ sqrt(x)
end

df = transform(df, [:Index1, :Index2] => ByRow(price) => :price)

ENV["RESULTS"] = json(df)

# define the path and filename
ENV["RESULTS_FILE"] = joinpath(@__DIR__,"output.csv")

# write the file to the path defined above, using the df object
CSV.write(ENV["RESULTS_FILE"], df)

When we submit the job, we can easily download the file from within the JuliaHub UI (Jobs >> Completed Jobs >> Details):

[Screenshot: the job's Output Files list with Download buttons in the job details view]

As expected, output.csv appears as one of our “Output Files”, and we simply click the “Download” button to retrieve it locally.
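
The UI download is the simplest route, but the same files can also be fetched programmatically.  A sketch, assuming JuliaHub.jl's job-file helpers (check the job reference documentation for exact signatures):

# Locate the result file on the completed job and download it locally.
file = JuliaHub.job_file(job, :result, "output.csv")
JuliaHub.download_job_file(file, joinpath(pwd(), "output.csv"))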

It would be remiss not to mention what the other “Output Files” are, so in order:

  • code.jl - the code we submitted, simply renamed to code.jl
  • Project.toml & Manifest.toml - define the dependencies associated with executing our code (for example, download Project.toml and check which packages it lists - there should be only three, and they should match the ones we added in this blog!)
  • output.csv - the desired dataframe result, in CSV format

Downloading and opening output.csv yields the following:


Index1,Index2,price
100,1,110.0
100,2,210.0
200,1,214.14213562373095
200,2,414.14213562373095
300,1,317.3205080756888
300,2,617.3205080756887

Review

We have taken a legacy application, “ProTRAN”, and built a prototype that executes one of its core functions in Julia.  We demonstrated how to modify the code to use environment variables (ENV) for both input and output of a Job run on JuliaHub.  Finally, we showed how to produce a downloadable output file from this Job in JuliaHub and how to view it locally.

Scope of Work & Future Work

The scope of this article has been deliberately narrow: to demonstrate the workflow for those seeking to prototype their migration from a legacy application to a compute on JuliaHub.  Typically, such a migration requires writing a Domain-Specific Language (DSL) using metaprogramming.  DSLs are an active discussion topic in the Julia community, and many resources are available on their development, best practices, and design patterns.

Furthermore, input data often proves too large to be set as an ENV parameter.  Instead, one needs either to use the JuliaHub Datasets API (https://help.juliahub.com/julia-api/stable/reference/datasets/) or, for datasets too large for ENV parameters but small enough to ship alongside the code, to include the data as file(s) within an appbundle (https://help.juliahub.com/julia-api/stable/reference/job-submission/#JuliaHub.appbundle).

Executing Julia compute jobs on JuliaHub, a role previously filled by legacy applications, is a common pattern among clients who require both high performance and open-source solutions.

