JuliaDB is our take on a package for writing succinct, expressive and fast data processing pipelines. It includes tools to load data from CSV files, index them, iterate over row or column subsets, perform relational queries and save intermediate results.
In this post, we will work with a particular dataset published by LendingClub. Our goal will be to train a model that classifies loans as good or bad, depending on their likelihood of being repaid in a timely manner. The post also serves as a case study of how to use JuliaDB in a non-trivial application. It owes its choice of dataset and methodology to a blog post by Microsoft's R Server Tiger team.
This example will use datasets provided by LendingClub, the world’s largest online marketplace for connecting borrowers and investors. On their website, they provide publicly available, detailed datasets that contain anonymous data regarding all loans that have been issued through their system, including the current loan status and latest payment information.
The first step then is to download some files from this page. For the purpose of this post, we have used the files 2007 through 2016 (8 files in total).
Place the files in a folder named data.
The files contain lines with copyright info and comments; these can be removed with this neat sed command, courtesy of Doug Bates.
cd data/
sed -i -n -e '/^\"/p' *.csv
The code below is to be executed from the same directory that contains the data directory.
To load the data in parallel, you can use multiple worker processes.
addprocs(4)
will add 4 worker processes.
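JuliaDB itself must also be available on every worker for the parallel load to work. A minimal sketch of the setup, relying on the behaviour that a top-level using issued after addprocs also loads the package on the workers:

addprocs(4)       # start 4 worker processes
using JuliaDB     # load JuliaDB on the master process and on every worker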
Then a call to loadfiles will load the files into a single table:
julia> @time data = loadfiles("data", escapechar='"',usecache=false);
Metadata for 0 / 8 files can be loaded from cache.
Reading 8 csv files totalling 995.147 MiB...
57.671541 seconds (475.41 k allocations: 28.874 MiB, 0.03% gc time)
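The usecache=false keyword above forces a fresh parse of the CSV files. If caching is left on (which appears to be the default, given the explicit flag and the cache message above), a later call can reuse the metadata saved during the first load and be much faster; a sketch:

# Re-load the same directory, this time allowing the metadata cache to be used
data = loadfiles("data", escapechar='"')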
JuliaDB figures out the correct types of the columns and loads the data into a table in a completely type-stable way. You can look at the element type of the table by running eltype(data). We will spare you the eyesore of the rather extensive bookkeeping by not showing the output here.
To assess which features to use in our prediction model, we can see density plots of numeric quantities in the dataset for good loans versus bad ones. Let's start by partitioning the data into good loans and bad loans:
const bad_statuses = ("Late (16-30 days)","Late (31-120 days)","Default","Charged Off")
good_loans = filter(x->!(x.loan_status in bad_statuses), data)
bad_loans = filter(x->x.loan_status in bad_statuses, data);
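Before plotting, it is worth checking how unbalanced the two classes are. A quick, purely illustrative sanity check, assuming length works on the filtered tables the same way it does on the full table:

nbad, ngood = length(bad_loans), length(good_loans)
println("bad: ", nbad, "  good: ", ngood, "  bad fraction: ", nbad / (nbad + ngood))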
One can use the eltype of the table to introspect the field types and extract only the numeric fields:
fields = fieldnames(eltype(data))
ftypes = eltype(data).parameters
numeric_cols = [fields[i] for i = 1:length(fields) if (ftypes[i] <: Number ||
ftypes[i] <: Nullable{<:Number}) && !(fields[i] in [:id, :member_id, :dti_joint])];
Here, we filtered the field names based on the type of each field. The filtered list of names is in numeric_cols:
86-element Array{Symbol,1}:
:loan_amnt
:funded_amnt
:funded_amnt_inv
⋮
:total_bal_ex_mort
:total_bc_limit
:total_il_high_credit_limit
using Gadfly
import NullableArrays: dropnull

# Gather just the numeric columns of each subset to the master process.
# (We assume here the same values(table, cols) accessor that is used further below.)
good_numbers = collect(values(good_loans, numeric_cols))
bad_numbers  = collect(values(bad_loans, numeric_cols))

# Density plot for bad and good loans
plots = Gadfly.Plot[]
for (name, g, b) in zip(numeric_cols, columns(good_numbers), columns(bad_numbers))
    g = dropnull(g)
    b = dropnull(b)
    p = plot(layer(x=g, Geom.density, Theme(default_color=colorant"green")),
             layer(x=b, Geom.density, Theme(default_color=colorant"red")),
             Guide.title(string(name)), Guide.ylabel("density"))
    push!(plots, p)
end
Next we render the plots and lay them out in a 9x10 grid:
grid = reshape(vcat(render.(plots), fill(Compose.context(), 90-length(plots))), 9,10);
draw(SVG(24inch, 24inch), gridstack(grid))
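To keep the figure around instead of (or in addition to) rendering it inline, Compose can write the same grid straight to a file; a short sketch (the filename is our own choice):

# Write the 9x10 grid of density plots to an SVG file on disk
draw(SVG("loan_densities.svg", 24inch, 24inch), gridstack(grid))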
To validate our model, we need to segregate the data into a training dataset and a testing dataset. We will use 75% and 25% of the data, randomly chosen, as these respective subsets:
perm = randperm(length(data))                  # random permutation of the row indices
train_till = round(Int, length(data) * 3/4)    # 75% cut-off point
training_subidx = sort!(perm[1:train_till])
testing_subidx = sort!(perm[train_till+1:end]);
training_subset = data[training_subidx]
testing_subset = data[testing_subidx]
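A quick sanity check on the split: the two index sets are disjoint and together cover every row, so their lengths should add up to the size of the full table.

@assert length(training_subidx) + length(testing_subidx) == length(data)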
In order to make sure that our analysis stays as close as possible to that conducted by Microsoft, we'll select the same set of predictor variables that they did:
numeric_features = [:revol_util, :int_rate, :mths_since_last_record,
                    :annual_inc_joint, :total_rec_prncp, :all_util]
Further, we need a vector of labels marking each loan as good or bad:
labels = collect(map(x -> x.loan_status in bad_statuses, data))
import NullableArrays: dropnull

# Indices of the rows of a Columns collection in which none of the fields are null
findnonnulls(xs::Columns) = find(x -> !any(map(isnull, x)), xs)

# Build the feature matrix and the matching label vector for a subset of the table,
# dropping any row in which one of the selected fields is null
function input_matrix(table, fields)
    float_features = collect(values(table, fields))
    tmp = findnonnulls(float_features)      # indices of the rows where all fields are non-null
    nzidxs = collect(keys(table, 1)[tmp])   # corresponding indices in the table
    reduce(hcat, map(dropnull, columns(float_features[tmp]))), values(labels[nzidxs])  # matrix, label vector
end
The random forest model needs two arrays: a vector of labels and the corresponding feature matrix. For the label vector, we reuse the index vector used above when extracting the training subset of the original data, and take the corresponding subset of the label vector. For the feature matrix, we extract the columns for our selected features from the distributed JuliaDB table, gather those columns on the master process, and finally concatenate the resulting vectors into the feature matrix.
Having done this, we can now call the build_forest function from the DecisionTree.jl package.
using DecisionTree

training_matrix, train_labels = input_matrix(training_subset, numeric_features)
# arguments: labels, features, n_subfeatures, n_trees, partial_sampling, max_depth
model = build_forest(train_labels, training_matrix, 3, 10, 0.8, 6)
# 0.512169 seconds (7.67 k allocations: 440.766 KiB)
# Out[98]:
# Ensemble of Decision Trees
# Trees: 10
# Avg Leaves: 41.8
# Avg Depth: 8.0
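As a quick, purely illustrative check that the trained forest behaves sensibly, we can predict the label of a single row of the training matrix:

# Predict the label of the first training example with the trained forest
apply_forest(model, training_matrix[1, :])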
Should we want to reuse our model at a later time, we can serialize it to disk.
f = open("loanmodel.jls", "w")
serialize(f, model)
close(f)
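The serialized model can later be read back with deserialize; a minimal sketch:

# Load the model back from disk in a later session
f = open("loanmodel.jls")
model = deserialize(f)
close(f)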
We can now test our model on the rest of the data. To do this, we generate predictions by applying the apply_forest function to every row of the test feature matrix.
features_test, test_labels = input_matrix(testing_subset, numeric_features)
@time predictions = mapslices(features_test, 2) do fs
DecisionTree.apply_forest(model, fs)
end;
# => 0.234474 seconds (36.23 k allocations: 1.964 MiB)
With our set of predictions, we construct a ROC curve using the ROC.jl package and calculate the area under the curve to find a single measure of how predictive our trained model is on the dataset.
# Receiver Operating Characteristics curve
using ROC
curve = roc(convert(Vector{Float64}, predictions[:]), convert(BitArray{1}, test_labels))
# An ROC plot in Gadfly with data calculated using ROC.jl
plot(layer(x=curve.FPR, y=curve.TPR, Geom.line),
layer(x = linspace(0.0,1.0,101), y = linspace(0.0,1.0,101),
Geom.line, Theme(default_color=colorant"red")), Guide.title("ROC"),
Guide.xlabel("False Positive Rate"),Guide.ylabel("True Positive Rate"))
The ROC curve looks like this:
The area under the curve is:
# Area Under Curve
AUC(curve)
# => 0.7291976225854385
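Since apply_forest returns hard good/bad labels here, a complementary, purely illustrative measure is the raw accuracy of the predictions against the held-out test labels:

# Fraction of test rows whose predicted label matches the true label
accuracy = mean(predictions[:] .== test_labels)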
We have shown how to use JuliaDB to create a model predicting the quality of a loan, combining data access and sophisticated analysis in a single environment.