X#, Strings and Codepages - Part 2

In the previous article I described how the xBase world evolved from DOS/Dbase and DOS/Clipper to Windows/Visual Objects, and how that affected the character sets: OEM character sets for DOS/Clipper and Ansi character sets for Windows/VO.

In this article I would like to discuss what all of this means for X#.

Introduction

To start: X# is a .Net development language, as you all know, and .Net is a Unicode environment. Each character in a Unicode environment is represented by one (or sometimes more than one) 16 bit number. The Unicode character set can encode all known characters, so your program can display any combination of West European, East European, Asian, Arabic etc. characters, at least when you choose the right font. Some fonts only support a subset of the Unicode characters, for example only Latin characters, while other fonts only support Chinese, Korean or Japanese characters.
The Unicode character set also contains the line draw characters that we know from the DOS world, and various symbols and emoticons have a place in the Unicode character set as well.

You would probably expect that 64K characters should be enough to represent all known languages, and that is indeed true. However, because emoticons and other symbols (scientific symbols, for example) have also found their way into Unicode, even 16 bits is not enough. To make things even more complicated, we now have many different skin color variations of emoticons. Being politically correct sometimes creates technical challenges!
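
A small example: characters outside the original 16 bit range are stored as two 16 bit code units (a so-called surrogate pair), so a .Net string containing a single emoji reports a length of 2.

    FUNCTION Start() AS VOID
        LOCAL cEmoji AS STRING
        cEmoji := "😀"          // one visible character, outside the 16 bit range
        ? cEmoji:Length         // prints 2: the emoji occupies two UTF-16 code units
        RETURN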

The good thing is that the .Net runtime knows how to handle Unicode, and the X# runtime is created on top of the .Net runtime, so it is fully Unicode for free!

The challenge comes when we want to read or write data created by applications that are or were not Unicode, such as Ansi or OEM DBF files and Ansi or OEM text files. Another challenge is code that talks to the Ansi Windows API, such as the VO GUI Classes and the VO SQL Classes.

DBFs and Unicode

When you open a DBF inside X#, the RDD system inspects the header of the file to detect with which codepage the file was created. This codepage is stored in a single byte in the header (using yet another list of numbers). We detect this codepage, map it to the right Windows codepage and use an instance of the Encoding class to convert the bytes in the DBF to the appropriate Unicode characters. When we write to a DBF, the RDD system converts the Unicode characters back to the appropriate bytes and writes these bytes to the DBF. This works transparently and should only cause problems when you try to write characters that do not exist in the codepage for which the DBF was created. In those situations the conversion either writes an accented character as the same character without the accent, or, when no proper translation can be done, writes a question mark byte to the DBF. The nice thing about the way things are handled now is that you can read from and write to DBF files created for different Ansi codepages at the same time. However, that may cause a problem with regard to indexes.
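
The conversion the RDD performs is essentially the one sketched below (a simplified illustration, not the actual RDD source; codepage 1253, Greek Ansi, is just an example):

    FUNCTION DbfBytesToString(aBytes AS BYTE[]) AS STRING
        LOCAL oEnc AS System.Text.Encoding
        // map the codepage detected in the DBF header to a .Net Encoding
        oEnc := System.Text.Encoding.GetEncoding(1253)
        // decode the Ansi bytes to a Unicode string
        RETURN oEnc:GetString(aBytes)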

For the sorting and indexing of DBF files the runtime uses the same mechanism as VO. You can use the SetCollation() function and specify SetCollation(#Windows) to tell the RDD system to compare using the Windows comparison routines, or SetCollation(#Clipper) to compare using the weight tables built into the runtime. Important to realize here is that these comparison routines will first convert the Unicode strings to Ansi (for SetCollation(#Windows)) or OEM (for SetCollation(#Clipper)) and will then apply the same sorting rules that VO uses. So the Windows collation will also use the AppLocale (which you can set with SetAppLocale()), so you can distinguish between, for example, German sorting and German phonebook sorting. The Clipper collation will compare the strings using the weight table selected with SetNatDLL().
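
In startup code this typically looks like the sketch below (the nation module name passed to SetNatDLL() is only an example):

    FUNCTION Start() AS VOID
        // VO compatible Windows collation, honoring the application locale
        SetCollation(#Windows)
        // ... or the Clipper compatible collation with a specific weight table:
        // SetCollation(#Clipper)
        // SetNatDLL("GERMAN")   // example nation module name
        RETURN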

VO SDK and VO SQL and Unicode (and x86 & AnyCPU)

The VO SDK and SQL Classes are written against the ANSI Windows API calls in Kernel32, User32, ODBC32 etc. This means 2 things:

  • These class libraries are x86 only. They are full of code where pointers and 32 bit integers are mixed, and they can therefore not run in x64 mode.
  • These class libraries use String2Psz(), Cast2Psz() and Psz2String(). In VO these functions do very little: they simply allocate static memory and copy the bytes from one location to the other.
    Inside X# these functions do the same, but they also have to convert the Unicode strings to Ansi (String2Psz() and Cast2Psz()) and back to Unicode (Psz2String()). For these conversions the runtime uses the Windows codepage configured on the current workstation, as sketched after this list.
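
A minimal sketch of that extra conversion step, assuming the workstation's default Ansi codepage (an illustration, not the actual runtime code):

    FUNCTION UnicodeToAnsiBytes(cText AS STRING) AS BYTE[]
        // Encoding.Default represents the Ansi codepage of the current workstation
        RETURN System.Text.Encoding.Default:GetBytes(cText)

    FUNCTION AnsiBytesToUnicode(aBytes AS BYTE[]) AS STRING
        RETURN System.Text.Encoding.Default:GetString(aBytes)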

Important to realize is that the Ansi codepage of a DBF file can, and sometimes will, be different from the codepage of Windows itself. So it is possible to read a Greek DBF file and translate the characters to the correct Unicode strings. But if you want to display the contents of this file with the traditional Standard MDI application on a Western European Windows machine, then you will still see 'garbage', because many of these Greek characters are not available in the Western European codepage.

 

We recognize that the Ansi UI layer and Ansi SQL layer could be a problem for many people, and we have already worked on GUI classes and SQL classes that are fully Unicode aware. As soon as the X# Runtime is stable enough we will add these class libraries as a bonus option to the FOX subscribers version of X#. These class libraries have the same class names, property names and method names as the VO GUI and SQL Classes. Existing code (even painted windows) should work with little or no changes. We have also emulated the VO specific event model, where events are handled at the form level with events such as EditChange, ListBoxSelect etc.
The GUI classes in this special library are created on top of Windows Forms and the SQL classes are created on top of Ado.NET. So these classes are not only fully Unicode but they are also AnyCPU.
Both of these class libraries will be "open". The GUI library will use a factory model, where a factory creates the underlying controls, panels and forms used by the GUI library. The base factory creates standard System.Windows.Forms objects. However, you can also inherit from this factory and use 3rd party controls, such as the controls from Infragistics or DevExpress. All you have to do is tell the GUI library that you want to use another factory class and, of course, implement the changes in the factory. And you can mix things if you want: for example, replace the pushbuttons and textboxes and leave the rest standard System.Windows.Forms.
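
To make the idea concrete, here is a hypothetical sketch of such a factory (the class and method names are invented for this illustration; the actual names in the X# GUI classes may differ):

    CLASS ControlFactory
        VIRTUAL METHOD CreateButton() AS System.Windows.Forms.Control
            // the base factory returns a standard Windows Forms control
            RETURN System.Windows.Forms.Button{}
    END CLASS

    CLASS MyFactory INHERIT ControlFactory
        OVERRIDE METHOD CreateButton() AS System.Windows.Forms.Control
            // a subclassed factory could return a 3rd party control here;
            // a restyled standard button serves as a stand-in
            LOCAL oButton AS System.Windows.Forms.Button
            oButton := System.Windows.Forms.Button{}
            oButton:FlatStyle := System.Windows.Forms.FlatStyle.Flat
            RETURN oButton
    END CLASS
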
You may wonder why we have chosen to build this on Windows.Forms and not WPF, even though Microsoft has told us that Windows.Forms is "dead". The biggest reason is that it is much easier to create a compatible UI layer with Windows Forms. And fortunately, Microsoft has recently announced that Windows.Forms will come to .Net Core 3. The first demos show that this works very well and is much faster. So the future for Windows.Forms looks promising again!

In a similar way we have created a factory for the SQL classes. We will deliver a version with 3 different SQL factories, based on the three standard Ado.Net data providers: ODBC, OLEDB and SqlClient.
You should be able to subclass the factory and use any other Ado.Net data provider, such as a provider for Oracle or MySql. Maybe we will find a sponsor and include these factories as well. The SQL factories will also allow you to inspect and modify readers, commands, connections etc. before and after they are opened.

 

Sorting and comparing strings in your code, the compiler option /vo13 and SetExact()

Creating indexes and sorting files is not the only place in your code where string sorting algorithms are used. They are also used when you compare two strings to determine which is greater (string1 <= string2), when you compare 2 strings for equality (string1 = string2, string1 == string2) and in code such as ASort(). One factor that makes this extra complicated is that quite often the compiler cannot recognize that you are comparing strings, because the strings are "hidden" inside a USUAL. For code like usual1 <= usual2 the compiler has no idea what the values will be at runtime. You could be comparing 2 dates, 2 integers, 2 strings, 2 symbols, 2 floats or even a mixture (such as an integer with a float). In these cases the compiler will simply produce code that compares 2 usuals (by calling an operator method on the USUAL type) and "hopes" that you are not comparing apples and pears. When you do, you will receive a runtime error.

But let's start simple and assume you are comparing 2 strings. To achieve the same results in X# as in your Ansi VO app, we have added a compiler option /vo13 which tells the compiler to use "compatible string comparisons". When you use this compiler option, the compiler will call a function in the X# runtime (called __StringCompare) to do the actual string comparison. This function uses the same comparison algorithm that is used for DBF files. So it follows the SetCollation() setting, converts the strings to Ansi or OEM and uses the same routines that VO does. This should produce exactly the same results as your VO app. Of course it will not know how to compare characters that are not available in the Ansi codepage. So if you have 2 strings with, for example, different Chinese characters and use compatible comparisons to compare them, it is very well possible that they are seen as "the same", because the differences are lost when converting the Unicode characters to Ansi: they might both be translated to a string of one or more question marks.

If you compile your code without /vo13, the compiler will not use the X# runtime function to compare strings, but will instead insert a call to System.String.Compare(). Important to realize is that this function does NOT use the SetExact() logic. So where with SetExact(FALSE) the comparison "Aaaa" = "A" would return TRUE with compatible string comparisons, it will return FALSE when the non-compatible comparison is used.
On top of that, when you use the < or > operator to compare strings, you have to know how .Net sorts characters, and that sorting algorithm differs per "culture".
If you look at the .Net documentation for String.Compare() you will see that there are various overloads, and that for some of these you can specify the kind of comparison with a StringComparison enum, with values like CurrentCulture, Ordinal and also combinations that ignore the case of the string. Simply said: there is no easy way to compare strings. That is one of the reasons that a language like C# does not allow you to compare 2 strings with a < or > operator. There is too much that can go wrong.
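
The difference between culture-sensitive and ordinal comparisons is easy to demonstrate (a small sketch; only the sign of the returned values matters):

    FUNCTION Start() AS VOID
        // culture-sensitive: "a" sorts before "B" in most cultures
        ? System.String.Compare("a", "B")          // negative value
        // ordinal: 'a' (97) is greater than 'B' (66)
        ? System.String.CompareOrdinal("a", "B")   // positive value
        RETURN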

If you compare a STRING with a USUAL, or 2 USUALs, then the compiler cannot know how to compare the values. In that case the code produced by the compiler will convert the string to a usual and call the comparison routine for USUALs.
Unfortunately, when we wrote the runtime we could not know what your compiler option of choice would be, so the runtime itself cannot simply decide which comparison to use. To solve this problem we use the following solution:
In the startup code of your application, the compiler adds a (compiler generated) line of code that saves the setting of the /vo13 option in the so-called "runtime state". The code inside the runtime that compares 2 usuals checks this setting and will either call the "plain" String.Compare() or the compatible __StringCompare().
Important to realize is that when your app consists of more than one assembly, it becomes important to use the setting of /vo13 consistently over all of your assemblies and the main app. Mixing this option can easily lead to problems that are difficult to find.
It can become even more complicated if you are using 3rd party components, especially if you don't have the source. That is why we recommend that 3rd party developers always compile with /vo13 ON. You may think that this would produce problems if the rest of the components are compiled with /vo13 OFF. But rest assured, this is a scenario that we have foreseen: inside the __StringCompare() code that is called when /vo13 is enabled, there is a check for the same /vo13 setting in the runtime state. So if the function detects that the main app is running with /vo13 OFF, then it will call the same String.Compare() code that the compiler would have used otherwise.
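
In pseudo form the described behavior looks roughly like this (a sketch only, not the actual runtime source; the name of the runtime state flag is an assumption, and the signature of __StringCompare() is simplified):

    // How a 3rd party library compiled with /vo13 ON still honors the
    // main app's setting.
    FUNCTION LibCompare(cLhs AS STRING, cRhs AS STRING) AS INT
        IF ! XSharp.RuntimeState.CompilerOptionVO13   // assumed flag name
            // the main app was compiled with /vo13 OFF: use the same plain
            // comparison the compiler would have emitted
            RETURN System.String.Compare(cLhs, cRhs)
        ENDIF
        // the main app runs with /vo13 ON: use the compatible comparison
        RETURN __StringCompare(cLhs, cRhs)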

SetCollation

We have already mentioned SetCollation() above. In the X# runtime we have added 2 new possible values for SetCollation(): #Unicode and #Ordinal.
These 2 new collations can be used in code that wants to keep the "old" Clipper behavior with regard to exact and non-exact string comparisons (in other words, that wants to compare only up to the length of the string on the right-hand side of the comparison), but still wants to use Unicode or ordinal string comparisons.
To use this, compile your code with /vo13 ON and set the collation at startup. If you choose SetCollation(#Unicode) then String.Compare() will be used to compare the strings. If you use SetCollation(#Ordinal) then String.CompareOrdinal() will be used. Both of these will only be used when the string on the right side has the same length as, or is longer than, the string on the left side. If the string on the right side is the shorter one, then an ordinal compare will be done on the number of characters in that string.
The ordinal comparison is the fastest of these options and is NOT culture dependent. The #Unicode comparison is culture dependent and faster than the #Windows and #Clipper collations, mostly because these last 2 have to convert from Unicode to 8 bit values.
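
For example (assuming the app is compiled with /vo13 ON):

    FUNCTION Start() AS VOID
        SetCollation(#Unicode)    // Unicode comparison with Clipper-style partial matching
        SetExact(FALSE)
        ? "Aaaa" = "A"            // TRUE: only the length of the right operand is compared
        ? "Aaaa" == "A"           // FALSE: == always compares the full strings
        RETURN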

Handling Binary data and Unicode

Many developers have stored binary data (such as passwords) in DBF files or have read or written binary files in their VO applications using functions such as MemoRead() and MemoWrit(). That will NOT work reliably in X# (or any other Unicode application).

Let's start with binary data (like passwords) in DBF files. Encryption algorithms in VO usually produce a list of fairly random 8 bit values, with all possible combinations of bits. The Crypt() function in the runtime does that. To produce the same result as the VO version, it converts the Unicode string to 8 bits, runs the encryption algorithm and converts the result back to Unicode. When you then write that string to the DBF, it will have to become 8 bits again. Especially if the DBF is OEM, this will produce a problem. To work around this we will add special versions of the Crypt() function to the runtime that accept or return an array of bytes, and we will also add functions to the runtime that allow you to write these bytes directly to the DBF without any conversions. This should help you avoid conversions that may corrupt your encrypted passwords. Of course you can also use other techniques, such as Base64 encoding the passwords, so the result will always be a string whose values are never interpreted as special characters.

Something similar may happen if you use MemoRead()/MemoWrit().

MemoRead() in the X# runtime uses a framework function that is "smart" enough to detect whether the source file is an Ansi text file or a Unicode text file (by looking at the possible Byte Order Mark at the beginning of the text file). It can also detect if a text file uses a UTF encoding. The result is that MemoRead() can read all kinds of text files and return their contents as a .Net Unicode string.

If you use a function like MemoRead() to read the contents of, for example, a Word or Excel file, the runtime will have no idea how to translate the bytes into Unicode. It will produce a Unicode string, but most likely not a correct representation of the bytes in the file. When you then use MemoWrit() to write the string back to a file, we have to do the reverse conversion. Our experience is that this quite often corrupts the contents. Incidentally, MemoRead() uses System.IO.File.ReadAllText() and MemoWrit() uses System.IO.File.WriteAllText().

If you really want to deal with binary data, then we recommend that you have a look at the .Net methods System.IO.File.ReadAllBytes() and System.IO.File.WriteAllBytes(). These methods return or accept (as their names imply) an array of bytes; no conversion will be done.
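
A minimal sketch of copying a binary file this way, without any codepage conversion:

    FUNCTION CopyBinaryFile(cSource AS STRING, cTarget AS STRING) AS VOID
        LOCAL aBytes AS BYTE[]
        aBytes := System.IO.File.ReadAllBytes(cSource)    // raw bytes, no decoding
        System.IO.File.WriteAllBytes(cTarget, aBytes)     // written back unchanged
        RETURN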

Summary

Adding Unicode support to your application makes your application much more powerful. You are able to mix characters from different cultures.
If you want to keep on using "old" Ansi and/or OEM files, you will have to make sure that you configure your application correctly.

Unfortunately there is no universal best solution. You will have to evaluate your application and decide yourself what is good for you. Of course we are available on our forums to help you make that decision.


Comments

  • Even if the database/GUI remains Ansi for a while longer, Unicode has a big advantage for the ever-growing need for imports/exports/connections to other systems:
    Converting a file from, for example, Ansi or UTF8 format to the internal Unicode representation is much more universal than converting from one limited codepage to another.

    Arne