What Everyone Should Know About Character Encoding

Thank goodness Joel wrote this article; that means I can cross it off my list
of potential future blog entries. Thanks, Joel!

Fortunately the script engines are entirely Unicode inside. Making sure that the script source code passed in to the engine is valid UTF-16 is the responsibility of the host, and as Joel mentions, IE certainly jumps through some hoops to try to deduce the encoding. WSH also has heuristics that try to determine whether the file is UTF-8 or UTF-16, but nothing nearly so complex as IE's.
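To give a flavour of what the simplest such heuristic looks like, here is a minimal sketch of byte-order-mark sniffing. It is an illustration only, not the logic WSH or IE actually uses; the function name sniffBom and the array-of-bytes input are assumptions made for the example.

```javascript
// Hypothetical helper: guess an encoding from the first few bytes of a file.
// "bytes" is assumed to be an array of numbers in the range 0..255.
// This only recognises byte order marks; real hosts do considerably more.
function sniffBom(bytes) {
  if (bytes.length >= 3 &&
      bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return "UTF-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return "UTF-16LE";
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return "UTF-16BE";
  }
  return null; // no BOM: the host has to fall back on some other guess
}

var guess = sniffBom([0xFF, 0xFE, 0x76, 0x00]); // "UTF-16LE"
```

A file with no byte order mark comes back as null, and that is exactly the case where the more elaborate guessing starts.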

I should mention that in JScript you can use the \uXXXX escape syntax (four hexadecimal digits) to put Unicode code points into string literals. In VBScript it is a little trickier: you need to use the ChrW function.
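A small sketch of the JScript side, written as plain JavaScript so it runs in any engine; the variable names are just for illustration. The rough VBScript equivalent of the last line would be a call such as ChrW(&H00E9).

```javascript
// "\u00E9" is U+00E9, LATIN SMALL LETTER E WITH ACUTE.
var cafe = "caf\u00E9";

// The escape produces a single UTF-16 code unit, so the string has length 4.
var length = cafe.length;            // 4
var lastUnit = cafe.charCodeAt(3);   // 233, that is 0xE9

// String.fromCharCode builds the same character from its numeric value.
var sameChar = String.fromCharCode(0x00E9) === "\u00E9"; // true
```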


Commentary from 2019:

I did not mention in this article that the original implementations of VBScript and JScript were for IE running on 16-bit Windows, and 16-bit Windows did not support Unicode. We had non-Unicode versions of the scripting toolchain for quite some time. I wrote a lot of string library code for dealing with DBCS and other odd character encoding problems when I was first at Microsoft as a full-time employee.

There were a couple of good reader questions:

What do we do with code points above U+FFFF?

VBScript and JScript use UTF-16 as their string encoding, so you can represent higher codepoints with surrogate pairs.
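To make that concrete, here is a sketch of the standard UTF-16 surrogate arithmetic, written in old-style JavaScript with no newer helpers such as codePointAt. The function names are made up for this example, and neither function validates its input.

```javascript
// Hypothetical helper: encode a code point in the range U+10000..U+10FFFF
// as a two-code-unit string (high surrogate followed by low surrogate).
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;        // 20 bits remain
  var high = 0xD800 + (offset >> 10);      // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits
  return String.fromCharCode(high, low);
}

// Hypothetical helper: decode the surrogate pair at the start of a string.
function fromSurrogatePair(s) {
  var high = s.charCodeAt(0);
  var low = s.charCodeAt(1);
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

// U+1D11E is MUSICAL SYMBOL G CLEF.
var clef = toSurrogatePair(0x1D11E);
var matchesEscapes = clef === "\uD834\uDD1E";   // true
var units = clef.length;                        // 2: one character, two code units
var roundTrip = fromSurrogatePair(clef);        // 0x1D11E
```

Note that the string's length is 2 even though it holds one character; length counts UTF-16 code units, not code points.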

Is that also true of JScript.NET?

Yes, JS.NET, and all the .NET languages, use UTF-16 internally also.
