Tuesday, September 4, 2018

3 ways to convert String to byte array in Java - Example

In this article, I am going to discuss a common task for programmers: converting a String to a byte array. You may need to do this for several reasons, e.g. to save content to a file or to send it over a network. Suppose you have a String "abcd" and you want to convert it into a byte array; how will you do that in a Java program? Remember, a String is backed by a sequence of chars, so the conversion goes from characters to bytes and is subject to the intricacies of character encoding. Thankfully, Java provides a convenient getBytes() method to convert a String to a byte array, but unfortunately many developers don't use it correctly. Almost 70% of the code I have reviewed calls getBytes() without a character encoding, leaving it to chance that the platform's default character encoding will match the encoding expected by whoever reads those bytes.


The right way to use getBytes() is always with an explicit character encoding, as shown in this article. Java also ships with a standard set of character encodings that are supported out of the box and exposed as constants in the StandardCharsets class; we will review them as well.

It's also good practice to use the predefined constants to specify the character encoding in your code instead of free text in a String, to avoid typos and other silly mistakes.
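To make the point about constants concrete, here is a minimal sketch. The class name CharsetNameDemo, its helper methods, and the deliberately misspelled name "UTF-88" are made up for illustration; only getBytes(String), getBytes(Charset) and StandardCharsets come from the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetNameDemo {

    // Encodes with a charset given as free text; a typo is only found at runtime.
    static boolean encodesWithName(String text, String charsetName) {
        try {
            text.getBytes(charsetName);
            return true;
        } catch (UnsupportedEncodingException e) {
            return false; // unknown or misspelled charset name
        }
    }

    // Encodes with a predefined constant; a misspelled constant would not compile.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encodesWithName("abcd", "UTF-8"));    // true
        System.out.println(encodesWithName("abcd", "UTF-88"));   // false
        System.out.println(Arrays.toString(encodeUtf8("abcd"))); // [97, 98, 99, 100]
    }
}
```

With free text the mistake surfaces only when that line actually runs, whereas a mistyped constant such as StandardCharsets.UTF_88 is rejected by the compiler.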

String to byte array using getBytes()


This is the most common way to convert a String into a byte array. It works most of the time, but it is error-prone and can produce an erroneous result if the platform's default character encoding doesn't match the expected encoding.

Here is an example of converting String to byte[] in Java :

// converts String to bytes using platform's default character encoding,
// in Eclipse it's Cp1252
// in Linux it could be something else
byte[] ascii = "abcdefgh".getBytes();

System.out.println("platform's default character encoding : "
                     + System.getProperty("file.encoding"));
System.out.println("length of byte array in default encoding : "
                     + ascii.length);
System.out.println("contents of byte array in default encoding: "
                     + Arrays.toString(ascii));

Output :
platform's default character encoding : Cp1252
length of byte array in default encoding : 8
contents of byte array in default encoding: [97, 98, 99, 100,
                                               101, 102, 103, 104]

Remark :

1) Platform's default encoding is used for converting a character to bytes if you don't specify any character encoding.

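2) You can see the platform's default character encoding by using System.getProperty("file.encoding"); this returns the default character encoding of the machine your JVM is running on. Note that since Java 18 the JVM defaults to UTF-8 unless it is configured otherwise, while older versions take the default from the operating system.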

3) Beware, your code may work in one environment e.g. QA but not work in production because of different default character encoding. That's why you should not rely on default character encoding.

4) The length of the byte array may not be the same as the length of the String; it depends on the character encoding. Some character encodings are multi-byte but usually take 1 byte to encode ASCII characters.
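The remarks above can be illustrated with a small sketch of what happens when the side that writes the bytes and the side that reads them use different encodings, which is exactly the situation a differing platform default creates. The class name DefaultCharsetPitfall, the roundTrip helper and the sample text "café" are assumptions made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetPitfall {

    // Encodes with one charset and decodes with another, as happens when the
    // writer and the reader run with different platform defaults.
    static String roundTrip(String text, Charset writer, Charset reader) {
        byte[] bytes = text.getBytes(writer);
        return new String(bytes, reader);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"

        // The default of the JVM this code happens to run on.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Same charset on both sides: the text survives.
        System.out.println(text.equals(
                roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));      // true

        // Different charsets: the non-ASCII character is corrupted.
        System.out.println(text.equals(
                roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1))); // false
    }
}
```

Pure ASCII text such as "abcd" survives this particular mismatch, which is why the problem often goes unnoticed until the first accented character shows up.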

String to byte array using getBytes("encoding")


Here is another way to convert a String to a byte array, this time specifying the encoding explicitly so that nothing is left to guesswork or the platform default.

// Converts the String to bytes in the specified character encoding.
// Because the encoding is passed by name, this overload throws the checked
// UnsupportedEncodingException, which clutters the code.
try {
    byte[] utf16 = "abcdefgh".getBytes("UTF-16");
    System.out.println("length of byte array in UTF-16 character encoding : "
                         + utf16.length);
    System.out.println("contents of byte array in UTF-16 encoding: "
                         + Arrays.toString(utf16));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

Output :
length of byte array in UTF-16 character encoding : 18
contents of byte array in UTF-16 encoding: [-2, -1, 0, 97,
0, 98, 0, 99, 0, 100, 0, 101, 0, 102, 0, 103, 0, 104]

Remark :

1) It's better than the previous approach, but it throws the checked exception java.io.UnsupportedEncodingException if the character encoding String has a typo or names a character encoding not supported by Java.

2) The returned byte array is in the specified character encoding.

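3) You can see that the length of the byte array is not the same as the number of characters in the String, as it was in the previous example. UTF-16 takes at least 2 bytes to encode a character, and Java's UTF-16 encoder also writes a 2-byte byte order mark first (the leading -2, -1), so 8 characters become 2 + 8 x 2 = 18 bytes.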
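The length of 18 can be broken down with a short sketch that compares UTF-16, which includes a byte order mark, with UTF-16BE, which does not. The class name Utf16LengthDemo and its helper methods are hypothetical names used only here.

```java
import java.nio.charset.StandardCharsets;

class Utf16LengthDemo {

    // Byte count when encoding with "UTF-16" (big-endian with a byte order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16).length;
    }

    // Byte count when encoding with "UTF-16BE" (big-endian, no byte order mark).
    static int utf16BeLength(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String text = "abcdefgh";
        System.out.println(utf16Length(text));   // 18: 2-byte BOM + 8 * 2 bytes
        System.out.println(utf16BeLength(text)); // 16: 8 * 2 bytes, no BOM
    }
}
```

If the receiver of the bytes does not expect a byte order mark, UTF-16BE or UTF-16LE may be the safer choice.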

String to byte array using getBytes(Charset)

This is the third and probably the best way to convert a String to byte[] in Java. In this example, I have used java.nio.charset.StandardCharsets to specify the character encoding. This class contains constants for some widely used character encodings, e.g. UTF-8, UTF-16, etc.

A good thing about this approach is that it doesn't throw the checked java.io.UnsupportedEncodingException. Unfortunately, the StandardCharsets class is only available from JDK 7 onward, so it might not be an option for Java applications running on Java 6 or lower.

// return bytes in UTF-8 character encoding
// pros - no need to handle UnsupportedEncodingException
// pros - bytes in specified encoding scheme
byte[] utf8 = "abcdefgh".getBytes(StandardCharsets.UTF_8);
System.out.println("length of byte array in UTF-8 : " + utf8.length);
System.out.println("contents of byte array in UTF-8: " + Arrays.toString(utf8));

Output:

length of byte array in UTF-8 : 8
contents of byte array in UTF-8: [97, 98, 99, 100, 101, 102, 103, 104]

Remarks :

1) This is the best way to convert String to a byte array in Java.

2) This doesn't throw java.io.UnsupportedEncodingException, which means no boilerplate code for handling a checked exception.

3) However, you must keep in mind that the StandardCharsets class is only available from Java 7 onward.
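For code that cannot use StandardCharsets, one possible workaround (not part of the example above) is to build the Charset yourself with Charset.forName() and keep it in a constant. The getBytes(Charset) overload itself has been available since Java 6. The class name CharsetForNameDemo, the toUtf8 helper and the constant name are assumptions for this sketch.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class CharsetForNameDemo {

    // Looked up once and reused. Charset.forName throws an unchecked exception
    // if the name is unknown, so no try/catch is required at the call sites.
    static final Charset UTF_8 = Charset.forName("UTF-8");

    static byte[] toUtf8(String text) {
        return text.getBytes(UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toUtf8("abcd"))); // [97, 98, 99, 100]
    }
}
```

The charset name is still free text here, but it is written in exactly one place, so a typo fails fast when the class is loaded rather than somewhere deep in the application.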

That's all about how to convert a String to a byte array in Java. Remember that the size of the byte array can be more than the length of the String, because one character is not necessarily encoded as one byte; it all depends on the character encoding. For example, UTF-8 is a multi-byte character encoding scheme and uses between 1 and 4 bytes per character. In UTF-8, characters in the ASCII range take 1 byte, while characters from the ISO-8859-1 range beyond ASCII take 2 bytes.
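As a closing illustration of the 1 to 4 byte range, the following sketch counts the UTF-8 bytes of four sample characters. The class name Utf8WidthDemo, the utf8Length helper and the chosen characters are picked only for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf8WidthDemo {

    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // 1: ASCII
        System.out.println(utf8Length("\u00e9"));       // 2: e with acute accent (ISO-8859-1 range)
        System.out.println(utf8Length("\u20ac"));       // 3: euro sign
        System.out.println(utf8Length("\uD83D\uDE00")); // 4: emoji U+1F600, a surrogate pair
    }
}
```

Note that the emoji has a String.length() of 2 because it is stored as two chars, yet it still encodes to 4 bytes, so neither length can be derived from the other without knowing the encoding.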