Let’s say I have an array of bytes and I want to save it to disk, encrypted with a secret password. Let’s also assume that I’ve looked at existing solutions such as age and for some reason decided they’re not going to work in my particular context. What do I do?
Whenever you aren’t sure about something, it might be worth writing about it on the Internet. That’s the best way to learn.
Mind you: this entry features cryptography. All code and recommendations in this article come with no warranty or claims of fitness for any purpose whatsoever. You shouldn’t copy code samples from the Internet. You probably should hire a security expert or retain a cryptographer. They’ll know better than I do.
The article does not claim to be exhaustive. It should not, however, be misleading or factually wrong. If you see anything that you consider incorrect, don’t hesitate to get in touch.
I’ll illustrate the article using code samples in Clojure. We won’t be using any third-party libraries. All we need is built into the Java platform.
Our goal is to implement two functions: `encrypt` and `decrypt`. They both take three arguments:
- an `InputStream` with the input data,
- an `OutputStream` for the result, and
- a `String` with an arbitrary password.
This way we are flexible when it comes to the source and the sink of the data we process. If we read from a byte array, we can use a `ByteArrayInputStream`. If we write to a file, we instantiate a `FileOutputStream`.
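To make the stream-based shape concrete, here is a minimal sketch. The `pass-through` function is a hypothetical stand-in of mine that merely copies bytes; it only shows how the same function can be fed from memory or pointed at a file.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io ByteArrayInputStream ByteArrayOutputStream
                  InputStream OutputStream))

(defn pass-through
  "Hypothetical stand-in with the stream-based shape used in this article:
  copies every byte from `in` to `out` unchanged."
  [^InputStream in ^OutputStream out]
  (io/copy in out))

(defn through-memory
  "Runs `pass-through` with both the source and the sink held in memory."
  [^bytes data]
  (let [out (ByteArrayOutputStream.)]
    (pass-through (ByteArrayInputStream. data) out)
    (.toByteArray out)))

;; Writing to a file instead only changes how the sink is constructed, e.g.
;; (java.io.FileOutputStream. "secret.bin") ; hypothetical file name
```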
Our requirements can be expressed using an automated property-based test.
In plain English: given an arbitrary byte array and an arbitrary password—neither of them empty—we can encrypt the input, decrypt it again, and end up with an identical byte array.
The details of how this test works aren’t essential to the rest of the entry. That’s why I’m not going to explain it in detail here. If you’d like to know more about what exactly is going on here, don’t hesitate to shoot me an email.
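For readers who would like something to run, the round-trip property can be approximated without a generator library. What follows is a hand-rolled sketch, not the test this article relies on: it uses a fixed seed, does no shrinking, and all the names in it (`round-trips?`, `check-round-trip`, and so on) are my own. It accepts the two functions as arguments, so it makes no assumptions about how they are implemented.

```clojure
(import '(java.io ByteArrayInputStream ByteArrayOutputStream)
        '(java.util Arrays Random))

(defn round-trips?
  "True when decrypting the encrypted `data` yields identical bytes.
  `encrypt` and `decrypt` are functions of [in out password]."
  [encrypt decrypt ^bytes data ^String password]
  (let [encrypted (ByteArrayOutputStream.)
        decrypted (ByteArrayOutputStream.)]
    (encrypt (ByteArrayInputStream. data) encrypted password)
    (decrypt (ByteArrayInputStream. (.toByteArray encrypted)) decrypted password)
    (Arrays/equals data (.toByteArray decrypted))))

(defn random-data
  "A non-empty byte array of at most `max-len` bytes."
  [^Random rng max-len]
  (let [bs (byte-array (inc (.nextInt rng (int (dec max-len)))))]
    (.nextBytes rng bs)
    bs))

(defn random-password
  "A non-empty password of at most `max-len` printable ASCII characters."
  [^Random rng max-len]
  (apply str (repeatedly (inc (.nextInt rng (int (dec max-len))))
                         #(char (+ 32 (.nextInt rng (int 95)))))))

(defn check-round-trip
  "Runs the round-trip property against `trials` pseudo-random inputs."
  [encrypt decrypt trials]
  (let [rng (Random. 42)] ; fixed seed keeps failures reproducible
    (every? (fn [_]
              (round-trips? encrypt decrypt
                            (random-data rng 1024)
                            (random-password rng 32)))
            (range trials))))
```

Once both functions exist, `(check-round-trip encrypt decrypt 10)` should return `true`. A small number of trials is sensible, because every trial derives two keys from the password and that is slow on purpose.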
We’ll use the built-in `javax.crypto` package for our implementation. Encryption and decryption are the responsibility of the `Cipher` class. To obtain an instance we need to decide on some specific aspects of our encryption, namely: the block cipher algorithm, the mode, and the padding.
We will use the AES block cipher. We can employ various key lengths. We’ll use 256 bits, but to the best of my knowledge using 128 bits should make the resulting encryption strong enough as of 2023.
We’ll use the AES block cipher implementation from the JVM standard library. We’re not going to implement it on our own; that would be both risky and unnecessary.
We can use the AES block cipher in various modes of operation. Some modes that the JVM implementation comes with are ECB, CBC, and GCM. Not all of them are a good choice. Let’s take a closer look at each of those modes before making a decision.
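AES used in the ECB mode is a bad choice. ECB encrypts every block independently of the others. If the data you encrypt are longer than a single block, identical blocks of plain text will get encrypted to identical blocks of cipher text, so patterns in the input remain visible in the output. A picture is worth a thousand words. MySQL’s built-in AES functions have been using the cipher in ECB mode.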
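The weakness is easy to reproduce. The sketch below encrypts two identical blocks with AES in ECB mode and checks whether the two resulting blocks of cipher text are identical as well. The helper names are mine, and the all-zero key and plain text in the usage line are for illustration only.

```clojure
(import '(java.util Arrays)
        '(javax.crypto Cipher)
        '(javax.crypto.spec SecretKeySpec))

(defn ecb-encrypt
  "Encrypts `plaintext` with AES in ECB mode. For demonstration only."
  ^bytes [^bytes key-bytes ^bytes plaintext]
  (let [cipher (Cipher/getInstance "AES/ECB/NoPadding")]
    (.init cipher Cipher/ENCRYPT_MODE (SecretKeySpec. key-bytes "AES"))
    (.doFinal cipher plaintext)))

(defn first-two-blocks-equal?
  "True when the first and the second 16-byte block of `ciphertext` match."
  [^bytes ciphertext]
  (Arrays/equals (Arrays/copyOfRange ciphertext 0 16)
                 (Arrays/copyOfRange ciphertext 16 32)))

;; Two identical plain text blocks (32 zero bytes) under an all-zero key:
(first-two-blocks-equal? (ecb-encrypt (byte-array 32) (byte-array 32)))
;; => true
```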
Another mode we can use is CBC. In the vast majority of cases it’s a better choice than ECB. CBC encrypts blocks one after another, the cipher text of one affecting the encryption of the next. This addresses the ECB weakness. One shortcoming is that blocks need to be encrypted in order, one after another. That makes poor use of modern CPUs, which are capable of processing multiple blocks simultaneously.
GCM is a mode that allows us to encrypt several blocks in parallel. That’s not its only advantage. It also allows us to verify the authenticity of the cipher text. That is, decrypting cipher text that has been tampered with will lead to an error. Let’s use GCM.
The last piece of information we need to instantiate our encryption mechanism is how to pad data that don’t fill a whole block. AES works on blocks of 128 bits (16 bytes). For example, if we have 36 bytes of data to encrypt, we’ve got only four bytes to encrypt in the third block.
There are various types of padding. The choice depends on the mode that we use. Unlike CBC, GCM does not require any padding at all.
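One way to observe the difference is to compare output sizes for the 36-byte example. This is a sketch under my own assumptions: the key and the IVs are throwaway random values, `PKCS5Padding` is picked as one example of a CBC padding scheme, and the helper names are hypothetical.

```clojure
(import '(java.security SecureRandom)
        '(java.security.spec AlgorithmParameterSpec)
        '(javax.crypto Cipher)
        '(javax.crypto.spec GCMParameterSpec IvParameterSpec SecretKeySpec))

(defn random-bytes ^bytes [n]
  (let [bs (byte-array n)]
    (.nextBytes (SecureRandom.) bs)
    bs))

(defn ciphertext-length
  "Number of bytes produced when encrypting `plaintext` with a throwaway key."
  [^String transformation ^AlgorithmParameterSpec params ^bytes plaintext]
  (let [cipher   (Cipher/getInstance transformation)
        aes-key  (SecretKeySpec. (random-bytes 32) "AES")]
    (.init cipher Cipher/ENCRYPT_MODE aes-key params)
    (count (.doFinal cipher plaintext))))

(def plaintext (byte-array 36)) ; two full blocks plus four bytes

;; CBC pads the last block up to 16 bytes, giving 48 bytes of cipher text.
(ciphertext-length "AES/CBC/PKCS5Padding"
                   (IvParameterSpec. (random-bytes 16))
                   plaintext)
;; => 48

;; GCM does not pad: 36 bytes of cipher text plus a 16-byte tag.
(ciphertext-length "AES/GCM/NoPadding"
                   (GCMParameterSpec. 128 (random-bytes 12))
                   plaintext)
;; => 52
```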
Now we can reify our choice into a concrete implementation. We use the stringly-typed API of `javax.crypto` to ask for a `Cipher` instance described as `"AES_256/GCM/NoPadding"`.
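In Clojure this could look roughly as follows. The names `transformation` and `new-cipher` are my own. Because the description is a plain string, a misspelt name is not caught by the compiler and only surfaces as an exception at run time.

```clojure
(import '(javax.crypto Cipher))

(def transformation "AES_256/GCM/NoPadding")

(defn new-cipher
  "Returns a fresh, uninitialised Cipher for AES-256 in GCM mode."
  ^Cipher []
  (Cipher/getInstance transformation))

;; The instance still has to be initialised with a key and GCM parameters
;; before it can encrypt or decrypt anything.
(.getBlockSize (new-cipher))
;; => 16
```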
The cipher needs a key to perform encryption. The password we pass to our functions is not a valid encryption key. We need to turn our password into a byte array of a particular length. The JVM has built-in key derivation mechanisms that can turn a password into a byte array of a length we require.
The key derivation function we use is PBKDF2. In order to use it we need to provide the password, an array of random bytes called salt, and the desired key length. We also have to decide on the number of iterations of the algorithm. We choose to run the function 600,000 times, as recommended by OWASP. The salt and the high number of iterations will make it more costly in terms of CPU time to find out what our password was in the event of the generated keys being compromised.
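A sketch of the derivation step might look like this. The text above only says PBKDF2; the choice of HMAC-SHA256 as the underlying function (the `"PBKDF2WithHmacSHA256"` algorithm name) is my assumption, made because that is the variant OWASP pairs with 600,000 iterations. The function and constant names are hypothetical.

```clojure
(import '(javax.crypto SecretKeyFactory)
        '(javax.crypto.spec PBEKeySpec SecretKeySpec))

(def iterations 600000)   ; number of PBKDF2 iterations
(def key-length-bits 256) ; matches the AES-256 cipher

(defn derive-key
  "Derives an AES key from `password` and `salt` using PBKDF2.
  Assumes HMAC-SHA256 as the pseudo-random function."
  ^SecretKeySpec [^String password ^bytes salt]
  (let [spec    (PBEKeySpec. (.toCharArray password) salt
                             (int iterations) (int key-length-bits))
        factory (SecretKeyFactory/getInstance "PBKDF2WithHmacSHA256")]
    (try
      ;; Wrap the raw derived bytes so that the cipher accepts them as an AES key.
      (SecretKeySpec. (.getEncoded (.generateSecret factory spec)) "AES")
      (finally
        ;; Wipe the copy of the password held by the key spec.
        (.clearPassword spec)))))
```

The same password and salt always yield the same key, which is what allows decryption later on. A different salt yields an unrelated key.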
Salt should be an array of bytes indistinguishable from random. We can create such an array using a strong source of randomness, such as `SecureRandom`.
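Generating a salt takes only a few lines. The length of 16 bytes is an assumption of mine, since the text above does not prescribe one, and the function names are hypothetical.

```clojure
(import '(java.security SecureRandom))

(def salt-length 16) ; bytes; assumed, not prescribed by the article

(defn random-bytes
  "Returns `n` bytes from the platform's default SecureRandom."
  ^bytes [n]
  (let [bs (byte-array n)]
    (.nextBytes (SecureRandom.) bs)
    bs))

(defn new-salt
  "Returns a fresh salt. Call it once for every encryption."
  ^bytes []
  (random-bytes salt-length))
```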
To offer us confidentiality guarantees, GCM needs an array of bytes that will be used as an initialisation vector (IV) for encryption. We also need to specify the number of bits GCM will use for a field that verifies the integrity of our data. That field is referred to as a tag.
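Both values end up in a `GCMParameterSpec`. In the sketch below the 12-byte IV (96 bits is the length recommended for GCM) and the 128-bit tag are my choices, as the text above does not fix either number. The names are hypothetical.

```clojure
(import '(java.security SecureRandom)
        '(javax.crypto.spec GCMParameterSpec))

(def iv-length 12)        ; bytes; assumed
(def tag-length-bits 128) ; assumed; the longest tag GCM offers

(defn new-gcm-params
  "Returns GCM parameters with a freshly generated random IV."
  ^GCMParameterSpec []
  (let [iv (byte-array iv-length)]
    (.nextBytes (SecureRandom.) iv)
    (GCMParameterSpec. tag-length-bits iv)))

;; The IV can be read back when it needs to be stored next to the cipher text.
(count (.getIV (new-gcm-params)))
;; => 12
```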
Now we can combine the steps above and define a function that encrypts the contents of the input stream into the output stream.
Notice that the function writes the salt and the initialisation vector unencrypted into the output stream. We need them both to decrypt the data. We do not need to keep them secret. The only secret components are the password and the derived keys. It is important, though, that IVs and salts are not reused. For that reason we generate unpredictable arrays every time we encrypt.
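Put together, an encrypting function along these lines could look as follows. Treat it as a sketch under assumptions rather than a reference implementation: the output layout (16 bytes of salt, then 12 bytes of IV, then the cipher text with its tag), the HMAC-SHA256 variant of PBKDF2, and the helper names are all my own choices.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io InputStream OutputStream)
        '(java.security SecureRandom)
        '(javax.crypto Cipher CipherOutputStream SecretKeyFactory)
        '(javax.crypto.spec GCMParameterSpec PBEKeySpec SecretKeySpec))

(defn random-bytes ^bytes [n]
  (let [bs (byte-array n)]
    (.nextBytes (SecureRandom.) bs)
    bs))

(defn derive-key
  "PBKDF2 with 600,000 iterations producing a 256-bit AES key."
  ^SecretKeySpec [^String password ^bytes salt]
  (let [spec (PBEKeySpec. (.toCharArray password) salt (int 600000) (int 256))]
    (-> (SecretKeyFactory/getInstance "PBKDF2WithHmacSHA256")
        (.generateSecret spec)
        (.getEncoded)
        (SecretKeySpec. "AES"))))

(defn encrypt
  "Reads plain text from `in` and writes salt, IV, and cipher text to `out`.
  Closes `out` when done, because that is what flushes the GCM tag."
  [^InputStream in ^OutputStream out ^String password]
  (let [salt   (random-bytes 16) ; fresh for every call
        iv     (random-bytes 12) ; fresh for every call
        cipher (Cipher/getInstance "AES_256/GCM/NoPadding")]
    (.init cipher Cipher/ENCRYPT_MODE
           (derive-key password salt)
           (GCMParameterSpec. 128 iv))
    ;; Salt and IV are not secret, so they go out unencrypted.
    (.write out salt)
    (.write out iv)
    (with-open [encrypting (CipherOutputStream. out cipher)]
      (io/copy in encrypting))))
```

With this layout, encrypting 36 bytes produces 80 bytes of output: 16 for the salt, 12 for the IV, 36 of cipher text, and 16 for the tag.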
The decryption code performs the same operation in reverse. It first reads the salt and the initialisation vector. Then it configures the cipher and decrypts the input stream.
If the data have been tampered with, we expect GCM to catch that manipulation and throw an exception.
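A matching sketch of the decrypting side is below. It assumes the input starts with 16 bytes of salt followed by 12 bytes of IV, and that the key was derived with the HMAC-SHA256 variant of PBKDF2. Those numbers, that variant, and the helper names are my assumptions, and they have to agree with whatever the encrypting side did.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io DataInputStream InputStream OutputStream)
        '(javax.crypto Cipher CipherInputStream SecretKeyFactory)
        '(javax.crypto.spec GCMParameterSpec PBEKeySpec SecretKeySpec))

(defn derive-key
  "PBKDF2 with 600,000 iterations producing a 256-bit AES key."
  ^SecretKeySpec [^String password ^bytes salt]
  (let [spec (PBEKeySpec. (.toCharArray password) salt (int 600000) (int 256))]
    (-> (SecretKeyFactory/getInstance "PBKDF2WithHmacSHA256")
        (.generateSecret spec)
        (.getEncoded)
        (SecretKeySpec. "AES"))))

(defn read-exactly
  "Reads exactly `n` bytes, throwing an EOFException if the stream is shorter."
  ^bytes [^DataInputStream in n]
  (let [bs (byte-array n)]
    (.readFully in bs)
    bs))

(defn decrypt
  "Reads salt, IV, and cipher text from `in` and writes plain text to `out`.
  Throws if the tag does not match, for instance after tampering or when
  the password is wrong. Closes `in` when done."
  [^InputStream in ^OutputStream out ^String password]
  (let [data-in (DataInputStream. in)
        salt    (read-exactly data-in 16)
        iv      (read-exactly data-in 12)
        cipher  (Cipher/getInstance "AES_256/GCM/NoPadding")]
    (.init cipher Cipher/DECRYPT_MODE
           (derive-key password salt)
           (GCMParameterSpec. 128 iv))
    (with-open [decrypting (CipherInputStream. data-in cipher)]
      (io/copy decrypting out))))
```

Flipping a single bit anywhere in the cipher text should make `decrypt` throw instead of returning corrupted plain text, so callers are wise to treat any exception as a sign that the data cannot be trusted.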
This sums up how to encrypt and decrypt a file on the JVM without any extra libraries. Let me know in case of any questions.