Data Engineering in Julia
Everything you need to start creating Data Pipelines 🪠🧑‍🔧

Julia is a natural choice for Data Science and Machine Learning. But what about Data Engineering? In this post, we go over how you can use Julia for various Data Engineering tasks, from the basics to creating full pipelines. As you will see, Julia’s performance and ease of use can be leveraged across the entire ML and DS stack.
Specifically, we will cover the following topics:
- Reading 🤓 and Writing ✍️ Files in Julia
- Working with Databases 💽
- An overview of other Data Engineering tools
If you want to learn about why Julia is well suited for Data Science, check out this article:
And if you are interested in why Julia is the future of machine learning, check out:
What is Data Engineering? 🧐
Before we dive into the details, it is worth going over what Data Engineering really is. If you already know this stuff and want to get straight to the Julia examples, feel free to skip this section. At the highest level, Data Engineering is about moving data from one place to another. Let’s say for example that you have a website running in production with multiple servers, each storing data for specific product sales. One way to check the overall sales is to run queries on the production databases and aggregate the results to get the total sales. But what if you want to do a more complex analysis?
In this case, a Data Engineer might connect all of the production databases to a different database which could be used specifically for analysis purposes. So far, we have covered the Extract and Load functions of Data Engineering. In many cases, however, we might want the data in a different format than it was originally in. Perhaps you were logging total transactions but now want the line-by-line item cost. This is where a Data Engineer would perform data transformations, the 3rd main function of Data Engineering.
So in Data Engineering terms, what have we actually done in this example? Well, we defined a pipeline! To make the example work, we had a data source (the original production servers) where we did an extraction, then augmented / transformed the data, followed by loading it into the data warehouse (storage) where the data science team can now do further analysis.
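To make this concrete, here is a minimal sketch of what such a pipeline could look like in Julia using DataFrames.jl. The tables, the numbers, and the add_unit_cost helper are all made up for illustration, and the extract and load steps are stubbed with in-memory data rather than real database calls:
using DataFrames

# Extract: in practice, each of these would be a query against a production database
raw_tables = [
    DataFrame(item = ["Widget", "Gadget"], total = [100.0, 90.0], count = [4, 3]),
    DataFrame(item = ["Gizmo"], total = [42.0], count = [7]),
]

# Transform: derive a per-unit cost column from the logged totals
add_unit_cost(df) = transform(df, [:total, :count] => ByRow(/) => :unit_cost)

# Load: append everything into a single table bound for the warehouse
warehouse_table = reduce(vcat, add_unit_cost.(raw_tables))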
Now that you have a sense of what the Data Engineering process might look like, it may be more clear as to why having the Data Science team and the Data Engineering team using the same tool (i.e. Julia) could be to everyone’s advantage.
Edit: My co-author and I are thrilled to share that pre-orders of our new book, Julia Crash Course, are now live:
Reading 🤓 and Writing ✍️ Files in Julia
One common data storage technique is saving information in comma-separated value (CSV) files. Let’s explore how to load, modify, and write CSV data with Julia. To do this, we are going to use the DataFrames.jl and CSV.jl packages. We begin by adding them:
(@v1.7) pkg> add DataFrames # type "]" to enter pkg mode
(@v1.7) pkg> add CSV
If you have never used the Package manager in Julia and need a quick overview, check out this article:
Now that we have DataFrames and CSV installed, we will load them in:
julia> using DataFrames, CSV
Next, let’s load in a CSV file into DataFrames.jl:
julia> df = DataFrame(CSV.File("/Users/logankilpatrick/Desktop/QS World University Rankings combined.csv"))
In this example, we will be playing around with the QS World University Rankings 2017–2022 dataset from Kaggle, which is in the Public Domain.
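We can take a quick look at the results right in the REPL; first previews the leading rows and describe summarizes each column:
julia> first(df, 3)      # preview the first three rows

julia> describe(df)      # per-column summary (min, max, number missing, element type, ...)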

If you wanted to select a specific column, you could do something like:
julia> df.university
6482-element Vector{String}:
 "Massachusetts Institute of Technology (MIT) "
 "Stanford University"
 "Harvard University"
 "University of Cambridge"
 ⋮
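The same column can also be pulled out with indexing syntax, where ! and : differ in whether you get a copy:
julia> df[!, :university]   # no copy; the same vector that df.university returns
julia> df[:, :university]   # a fresh copy of the column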
To get all of the column names, we can use the names or propertynames functions:
julia> names(df)
9-element Vector{String}:
 "year"
 "rank_display"
 "university"
 "score"
 "link"
 "country"
 "city"
 "region"
 "logo"
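propertynames returns the same list as Symbols rather than Strings, which is the form you typically use for indexing:
julia> propertynames(df)
9-element Vector{Symbol}:
 :year
 :rank_display
 :university
 :score
 :link
 :country
 :city
 :region
 :logo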
And if we wanted to iterate through all of the rows one by one, we would do:
for row in eachrow(df)
    print(row)
end
Okay, now that we know how to do some basic operations, let us look next at how we can modify the data frame and then write it back to a CSV file. In DataFrames.jl, we can access a specific row by doing df[index, columns], so if we want to access the 2nd row and all of the columns that are part of it, we would do df[2, :]. Note that indexing a data frame this way returns a view, so we copy the row before editing it. Let’s now create a new variable:
julia> new_row = copy(df[2, :])   # copy the row (as a NamedTuple) so we don't overwrite row 2 of df

julia> new_row = merge(new_row, (university = "Julia Academy", city = "Remote", link = "juliaacademy.com"))
And then to add the new row back into the data frame, we can use the push! function:
julia> push!(df, new_row)
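If you want to double-check that the row landed, nrow should now report one row more than the 6482 we started with:
julia> nrow(df)
6483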
Now that we have the updated data frame, we can write it to a CSV file by doing:
julia> CSV.write("/Users/logankilpatrick/Desktop/QS World University Rankings combined and Edited.csv", df)
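Before opening the file, you can also verify the write from the REPL by reading it straight back; a quick sketch reusing the same path:
julia> df_check = DataFrame(CSV.File("/Users/logankilpatrick/Desktop/QS World University Rankings combined and Edited.csv"))

julia> last(df_check)    # should be the "Julia Academy" row we appended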
Either way, you should see a new CSV file with the updated row at the end. This should give you a nice intro to working with data in CSV format. There is of course a lot more than we can cover in this single post, but I hope this example gives you confidence that working with CSVs in DataFrames.jl is quite easy. If you want to read more, check out the DataFrames.jl docs or watch this workshop tutorial from JuliaCon 2021:
Working with Databases 💽 in Julia
Let’s begin this database section by working with MySQL in Julia. Just like we did with the other packages, to add MySQL.jl, we can simply type ] add MySQL. Then, run using MySQL to load the package.
Note that this section assumes you already have a MySQL database created on your computer and running. We can connect to that DB by doing:
julia> con = DBInterface.connect(MySQL.Connection, "localhost", "root", "my_password", db = "teacher_sample")
MySQL.Connection(host="localhost", user="root", port="3306", db="teacher_sample")
In the code above, we connected to the database at the host “localhost”, with a username of “root”, a password of “my_password”, and a schema name of “teacher_sample”. Again, some of these properties were pre-set when I created the MySQL database itself.
In order to properly see the results that we can return from our SQL commands, we need to also load in the DataFrames.jl package via:
using DataFrames
Then, we can try and execute a sample command such as SELECT *:
julia> DataFrame(DBInterface.execute(con, "SELECT * FROM teacher_sample.teachers;"))
This produces the following data frame:

The DBInterface.execute() function takes two inputs:
- The connection we initially defined, which points to the database we want to work with
- The SQL command in string format (just like in the Python MySQL equivalent)
From here, you can execute pretty much any SQL command you want by using the same execute function with different SQL strings. This wraps up the core functionality you will get from a database package.
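Two more patterns are worth knowing: prepared statements, which let you splice values into a query safely instead of interpolating them into the SQL string, and closing the connection when you are done. A sketch, assuming the teachers table has an id column:
# A prepared statement with a ? placeholder; values are passed separately,
# which avoids SQL-injection problems with string interpolation
stmt = DBInterface.prepare(con, "SELECT * FROM teacher_sample.teachers WHERE id = ?;")
df_one = DataFrame(DBInterface.execute(stmt, [2]))

# Close the connection once you are finished with it
DBInterface.close!(con)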
Other Databases in Julia
In addition to MySQL, there are many other Database bindings written in Julia. You might want to check out the likes of TypeDB.jl:
Or JuliaDB (a pure Julia Database):
And many others! I suggest checking out https://github.com/JuliaData and https://github.com/JuliaDatabases for a comprehensive list of all the databases available in Julia.
Other Data Engineering Tools ⚒️
What you use for Data Engineering really depends on what you are working on. Let’s briefly go over how you can use some of the most common Data Engineering Tools in Julia:
RedShift in Julia
If you want to connect to AWS RedShift in Julia, you can use LibPQ.jl
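Since Redshift is Postgres-compatible on the wire, connecting looks like any other LibPQ connection. A sketch with placeholder cluster credentials (5439 is Redshift’s default port):
using LibPQ, DataFrames

# Placeholder endpoint and credentials — substitute your own cluster's values
conn = LibPQ.Connection("host=my-cluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=dev user=awsuser password=my_password")

df = DataFrame(execute(conn, "SELECT 1 AS one;"))
close(conn)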
BigQuery in Julia
BigQuery by Google can be accessed via GCP.jl.
Tableau in Julia
While not fully supported, you can connect with Tableau in Julia via extensions as detailed here:
Apache Spark in Julia
Julia bindings for Apache Spark exist in Spark.jl
Apache Kafka in Julia
A wrapper for librdkafka can be found in RDKafka.jl
Dask in Julia
While not entirely the same, Dagger.jl provides parallelism heavily inspired by Dask: https://github.com/JuliaParallel/Dagger.jl
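For a taste of how it feels, Dagger’s @spawn builds a task graph much like Dask’s delayed; a minimal sketch:
using Dagger

# Each @spawn returns immediately with a lazy handle; Dagger
# schedules the work across available threads/workers
a = Dagger.@spawn 1 + 2
b = Dagger.@spawn a * 3   # depends on a; runs once a is done

fetch(b)   # 9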
Apache Hive in Julia
Hive is available via Hive.jl
Could Julia be the future of Data Engineering? 🧐
The main goal of this post is to highlight that many if not all of the tools you need to do productive Data Engineering exist today in one form or another. While there might be sharp edges here and there, many of the building blocks are now in place so that folks learning Julia and wanting to use it have what they need to be productive.
The advantage I see of Julia in the Data Engineering space is the blurring of lines between those wanting to do Software Engineering, Data Engineering, and Data Science. Right now, because many of the tools are context-specific, there aren’t many folks operating in this domain. But I see a future where “full-stack data engineers” work across all three disciplines using Julia as the single tool to help them.
While I have done a fair amount of higher-level Data Engineering, there is still a lot I don’t know and might have missed in this post. You might also be interested in reading about ML pipelines with Lathe.jl.
Feel free to reach out to me on Twitter to share your thoughts: https://twitter.com/OfficialLoganK