UTF-8 Serialization and Byte Arrays in C#

Posted in software by Christopher R. Wirz on Wed Mar 08 2017



A common approach to XML serialization in C# is to use XmlSerializer. While extremely convenient, it has a few small drawbacks.

The first drawback is that XmlSerializer does not produce UTF-8 by default when serializing to a string: a plain StringWriter reports UTF-16, so the XML declaration reads encoding="utf-16". This can cause issues with legacy systems. However, the encoding can be specified.


using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public static partial class ExtensionMethods
{
	/// <summary>
	///     Serializes a generic object
	/// </summary>
	/// <typeparam name="T">The object type</typeparam>
	/// <param name="obj">The object</param>
	/// <returns>A string giving the serialized object, or null if serialization fails</returns>
	public static string Serialize<T>(this T obj)
	{
		string returnString = null;
		try
		{
			var xmlSerializer = new XmlSerializer(typeof(T));

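			// Pretty-print the output. When writing to a TextWriter, the declared encoding
			// comes from the writer itself, which is why Utf8StringWriter is used below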
			var settings = new XmlWriterSettings
			{
				Indent = true,
				NewLineOnAttributes = true,
				Encoding = Encoding.UTF8
			};

			using (StringWriter sw = new Utf8StringWriter())
			{
				using (var textWriter = XmlWriter.Create(sw, settings))
				{
					xmlSerializer.Serialize(textWriter, obj);
				}
				sw.Flush();
				returnString = sw.ToString();
			}
		}
		catch { }
		return returnString;
	}

	/// <summary>
	///     Takes a string and returns an object
	/// </summary>
	/// <typeparam name="T">The object type</typeparam>
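	/// <param name="obj">An instance of T, used only so the method can be called as an extension; its state is not read</param>
	/// <param name="str">The XML string to deserialize</param>
	/// <returns>An object of type T, or default(T) if deserialization fails</returns>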
	public static T Deserialize<T>(this T obj, string str)
	{
		T returnValue = default(T);
		try
		{
			var xmlSerializer = new XmlSerializer(typeof(T));
			using (var xmlReader = new StringReader(str))
			{
				returnValue = (T)xmlSerializer.Deserialize(xmlReader);
			}
		}
		catch { }
		return returnValue;
	}
}

/// <summary>
///     A class only to override encoding with UTF8.
/// </summary>
public class Utf8StringWriter : StringWriter
{
	public override Encoding Encoding => Encoding.UTF8;
}

The above code targets UTF-8 as the encoding for objects serialized to XML (eXtensible Markup Language). Setting Encoding on XmlWriterSettings is not enough here: when an XmlWriter is created over a TextWriter, the encoding in the XML declaration is taken from the writer, and a plain StringWriter reports UTF-16. That is why the Encoding property of the StringWriter had to be overridden.

Now that UTF-8 serialization is in place, it is time to serialize objects. A preferred approach is to take objects from a defined schema. For example, the following class was generated using the XSD tool (and cleaned up a little for this example).
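To see the effect of the writer's Encoding property, the following sketch serializes the same object through a plain StringWriter and through a UTF-8 subclass, then prints the XML declaration of each. The Sample type, the Utf8Writer class (a stand-in for Utf8StringWriter so the sketch is self-contained), and the DeclarationDemo helper are hypothetical names introduced only for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical type used only for this illustration
public class Sample
{
	public string Name;
}

// Stand-in for Utf8StringWriter, redeclared so the sketch is self-contained
public class Utf8Writer : StringWriter
{
	public override Encoding Encoding => Encoding.UTF8;
}

public static class DeclarationDemo
{
	// Serializes obj through the given StringWriter and returns the resulting XML
	public static string SerializeWith(StringWriter sw, object obj)
	{
		// Encoding is set here on purpose, to show that the writer's own Encoding wins
		var settings = new XmlWriterSettings { Indent = true, Encoding = Encoding.UTF8 };
		var xmlSerializer = new XmlSerializer(obj.GetType());
		using (sw)
		{
			using (var xmlWriter = XmlWriter.Create(sw, settings))
			{
				xmlSerializer.Serialize(xmlWriter, obj);
			}
			return sw.ToString();
		}
	}

	// Returns only the first line (the XML declaration) of a document
	public static string FirstLine(string xml)
	{
		return new StringReader(xml).ReadLine();
	}

	public static void Main()
	{
		var sample = new Sample { Name = "example" };
		// Expected to declare encoding="utf-16"
		Console.WriteLine(FirstLine(SerializeWith(new StringWriter(), sample)));
		// Expected to declare encoding="utf-8"
		Console.WriteLine(FirstLine(SerializeWith(new Utf8Writer(), sample)));
	}
}
```

Note that the string held in memory is still a .NET string in both cases; only the declaration differs. The bytes become UTF-8 when the string is written out, for example with File.WriteAllText, which uses UTF-8 by default.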


[System.SerializableAttribute()]
public partial class IdType
{
	[System.Xml.Serialization.XmlElementAttribute(DataType = "hexBinary")]
	public byte[] UUID;
	public string Description;
}

Serialization can be tested in a simple program.


using System;
using System.IO;
using System.Text;

class Program
{
	static int Main(string[] args)
	{
		// Serialize
		IdType id = new IdType
		{
			UUID = Encoding.UTF8.GetBytes("thisisad-emon-stra-tive-uuidfortests"),
			Description = "This is a plain string as an example"
		};
		File.WriteAllText("IdType.xml", id.Serialize());

		// Deserialize
		string serialized = File.ReadAllText("IdType.xml");
		IdType deserialized = id.Deserialize(serialized);

		Console.WriteLine(Encoding.UTF8.GetString(deserialized.UUID));
		// should show "thisisad-emon-stra-tive-uuidfortests"

		Console.ReadKey();
		return 0;
	}
}

Everything looks good in the console, but when the IdType.xml file is opened, the contents look like this:


<?xml version="1.0" encoding="utf-8"?>
<IdType xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <UUID>74686973697361642D656D6F6E2D737472612D746976652D75756964666F727465737473</UUID>
  <Description>This is a plain string as an example</Description>
</IdType>

The line <UUID>74686973697361642D656D6F6E2D737472612D746976652D75756964666F727465737473</UUID> contains the UUID, but it is hex encoded. Although the hex string is printable text, a reader cannot tell what the value is without decoding it. Sometimes this is a problem.

In order to control the serialization of IdType, we have to make it implement IXmlSerializable. Because the XSD tool generates partial classes from the XSD schema, the implementation can be added in a separate file without editing the generated code.
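To confirm what the hex string holds, it can be decoded by hand. The sketch below is only an illustration: HexDemo and FromHex are hypothetical names, and the loop assumes a well-formed hex string with two digits per byte.

```csharp
using System;
using System.Text;

public static class HexDemo
{
	// Converts a hex string (two digits per byte) back into bytes
	public static byte[] FromHex(string hex)
	{
		if (hex.Length % 2 != 0)
		{
			throw new ArgumentException("Hex string must have an even length", nameof(hex));
		}
		var bytes = new byte[hex.Length / 2];
		for (int i = 0; i < bytes.Length; i++)
		{
			// Parse each pair of hex digits as one byte
			bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
		}
		return bytes;
	}

	public static void Main()
	{
		string hex = "74686973697361642D656D6F6E2D737472612D746976652D75756964666F727465737473";
		Console.WriteLine(Encoding.UTF8.GetString(FromHex(hex)));
		// prints "thisisad-emon-stra-tive-uuidfortests"
	}
}
```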

IXmlSerializable can be implemented as follows:


using System.Text;
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;

partial class IdType : IXmlSerializable
{

	#region Implementation of IXmlSerializable
	public XmlSchema GetSchema()
	{
		return null; // this method is reserved; implementations should return null
	}

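	public void ReadXml(XmlReader reader)
	{
		// The reader starts on the wrapper element (<IdType>); consume its start tag
		reader.MoveToContent();
		bool isEmpty = reader.IsEmptyElement;
		reader.ReadStartElement();
		if (isEmpty) return;

		// Visit each child element, stopping at the wrapper's end tag
		while (reader.MoveToContent() == XmlNodeType.Element)
		{
			if (reader.LocalName == nameof(UUID))
			{
				string value = reader.ReadElementContentAsString().Trim();
				if (value.Length > 0) UUID = Encoding.UTF8.GetBytes(value);
			}
			else if (reader.LocalName == nameof(Description))
			{
				string value = reader.ReadElementContentAsString().Trim();
				if (value.Length > 0) Description = value;
			}
			else
			{
				reader.Skip(); // ignore unknown elements
			}
		}

		// Consume the end tag so the reader is left just past this object
		reader.ReadEndElement();
	}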

	public void WriteXml(XmlWriter writer)
	{
		if (UUID != null)
		{
			writer.WriteElementString(nameof(UUID), Encoding.UTF8.GetString(UUID));
		}
		if (!string.IsNullOrEmpty(Description))
		{
			writer.WriteElementString(nameof(Description), Description);
		}
	}
	#endregion
}

After the program is recompiled and run, the console should again show "thisisad-emon-stra-tive-uuidfortests". When the IdType.xml file is opened again, the contents are truly human-readable:


<?xml version="1.0" encoding="utf-8"?>
<IdType>
  <UUID>thisisad-emon-stra-tive-uuidfortests</UUID>
  <Description>This is a plain string as an example</Description>
</IdType>

Now our IdType has successfully been serialized to UTF-8 XML, with human-readable byte array values.
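One caveat with writing a byte array as text is that only bytes forming valid UTF-8 survive the trip back. The following sketch checks this for the demonstrative UUID and for a few arbitrary bytes. RoundTripDemo and SurvivesUtf8RoundTrip are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
	// True when the bytes survive a UTF-8 decode followed by a UTF-8 encode
	public static bool SurvivesUtf8RoundTrip(byte[] original)
	{
		string text = Encoding.UTF8.GetString(original);
		byte[] restored = Encoding.UTF8.GetBytes(text);
		return original.SequenceEqual(restored);
	}

	public static void Main()
	{
		// Bytes that started life as text decode and re-encode unchanged
		byte[] textBytes = Encoding.UTF8.GetBytes("thisisad-emon-stra-tive-uuidfortests");
		// These bytes are not valid UTF-8, so decoding replaces them with U+FFFD
		byte[] rawBytes = { 0xFF, 0xFE, 0x80 };

		Console.WriteLine(SurvivesUtf8RoundTrip(textBytes)); // True
		Console.WriteLine(SurvivesUtf8RoundTrip(rawBytes));  // False
	}
}
```

For byte arrays that may hold arbitrary binary data, keeping the hexBinary form (or Base64) is the safer choice. The text form shown here suits values that are known to be text.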