A technical piece about adventure, frustration and inconsistencies
Context
As part of an assignment, I was tasked with writing an application that takes a file as a parameter, reads its content, and prints all the substrings that can be generated from that content.
For example, let’s say I have a text file named “data.txt” containing the word “intrinsic”. Executing the
following statement in the terminal/console:
./SubstringTree data.txt
where the “data.txt” parameter is the text file to import, prints out the following:
┌──i
│  ├──in
│  │  ├──int
│  │  │  └──intr
│  │  │     └──intri
│  │  │        └──intrin
│  │  │           └──intrins
│  │  │              └──intrinsi
│  │  │                 └──intrinsic
│  │  └──ins
│  │     └──insi
│  │        └──insic
│  └──ic
├──n
│  ├──nt
│  │  └──ntr
│  │     └──ntri
│  │        └──ntrin
│  │           └──ntrins
│  │              └──ntrinsi
│  │                 └──ntrinsic
│  └──ns
│     └──nsi
│        └──nsic
├──t
│  └──tr
│     └──tri
│        └──trin
│           └──trins
│              └──trinsi
│                 └──trinsic
├──r
│  └──ri
│     └──rin
│        └──rins
│           └──rinsi
│              └──rinsic
├──s
│  └──si
│     └──sic
└──c
I wrote the code for this task in C++, and it compiles on both Windows and Linux. I wanted to investigate Unicode support in C++, and, to my surprise, this is where I found a multitude of discrepancies between MSVC (Visual Studio) and G++; frankly, MSVC was the one that shone in this category. G++ has, to my severe disappointment, a terribly lackluster and crippled implementation of Unicode. Its gaping flaws in this respect were a major obstacle and a fruitful source of unwarranted headaches.
Warning: The following is an entertaining read.
Mainly, G++ does not support Unicode main() arguments, does not support file names with Unicode characters, and has a crippling issue that causes wcout (a version of cout that *should* support Unicode characters) to randomly decide not to print parts of the specified output, as if it were a sentient being, and a terribly annoying one at that, despite my adding setlocale(LC_ALL, "C.UTF-8") as the first statement to configure the character set to UTF-8. A half-assed workaround I had to implement was to call:
cout << '\n';
before anything is printed out, in order to set the console stream to ASCII so that it could successfully print some Unicode-specific characters (huh?!). This also causes some characters to fail to print, so it's a double-edged sword. I actually had to reach deep down and answer the question:
“Do I want my application to show 30% of the output, or do I want to support a small subset of UTF-8 in the console and show invalid output if the input data contains Unicode characters?”
My answer, hesitantly, was the latter. Hence I added the cout << '\n' statement right after the call to setlocale().
What followed was a beautiful blend of sheer confusion and frustration. Calling wcout << L"bla" (note the L prefix, which denotes a wide string literal, the wchar_t-based counterpart of a narrow string) no longer worked, and printing a plain string through wcout does not compile (a string can't be implicitly converted to a wstring; wcout only accepts parameters of types such as wstring and wchar_t (and char too, because it can be trivially converted to wchar_t)). So how could I print any output? I managed to devise the most counterintuitive workaround, which actually worked in *most* cases. I wrote a function called towString() that converts a string to a wstring by populating a new wstring from the string character by character.
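Based on that description, towString() would look something like this (the author's exact implementation may differ):

```cpp
#include <string>

// Widen a narrow string byte by byte. Iterating as unsigned char avoids
// sign-extending bytes >= 0x80 (which multi-byte UTF-8 sequences contain).
std::wstring towString(const std::string& s) {
    std::wstring w;
    w.reserve(s.size());
    for (unsigned char c : s)
        w.push_back(static_cast<wchar_t>(c));
    return w;
}
```

Note that this is not a real UTF-8 decode: each *byte* becomes one wchar_t, so a three-byte UTF-8 character becomes three wide characters. That detail turns out to matter later.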
Here is the workaround:
Instead of wcout << L"bla";
I wrote wcout << towString("bla");
AND IT WORKED (somehow, for some reason). Then I tried to print the tree visualisation requested by the task. The tree displays the Unicode character '├' to join the branches of the nodes. So I tried the following:
wcout << L'├'; // wchar_t
wcout << L"├"; // wide string
wcout << '├'; // Unicode character in char
None of these showed anything on screen... then I tried something which I would never have believed could work:
wcout << towString("├");
Let’s dissect this statement bit by bit. We have an ASCII string with a Unicode character in it (?!), and we
are converting it to wstring and printing the result to the terminal buffer, which is configured as an ASCII
buffer due to the previous print to cout. Not only does this statement somehow compile on G++, but it
is the only alternative that worked. And why does it work? I have no damn clue.
I’ll mention another adventure I had.
I compiled my code both in Visual Studio and in G++ from bash, and for some reason the Windows version
worked perfectly, but the Linux version could not find the file specified. I was completely baffled by this,
so I tried placing some “cout”s here and there to verify that the command-line argument was being parsed
correctly. The debugging output was actually correct every time as expected, so why was the file not
found? What was going on?
Approximately 6 hours had passed, and I was banging my head against the wall. This made no sense; why was it not finding the file if the parameters in argv[] were shown to be parsed correctly? I almost gave up, and I decided to print all the character codes of the parsed arguments one by one. Everything looked fine,
except I noticed a suspicious 13 at the end. I googled for the ASCII table and discovered that 13 is the
‘carriage return’ (‘\r’) character.
Somehow, the parameter had a carriage return character at the end, and it was messing up the file path.
Never have I sworn so profanely in my entire life. When I was printing the parameters before, I had not
taken *invisible* characters into account, so I was completely misled. Sure enough, after filtering any
‘\r’s from the string stored in argv[], the Linux version was properly finding the file and reading its
content.
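The fix itself amounts to filtering the argument before using it as a path; assuming the argument is copied into a std::string first, the standard erase-remove idiom does it in one line:

```cpp
#include <algorithm>
#include <string>

// Remove every carriage return from a string (e.g. a command-line
// argument polluted by Windows-style line endings).
std::string stripCarriageReturns(std::string s) {
    s.erase(std::remove(s.begin(), s.end(), '\r'), s.end());
    return s;
}
```

For example, stripCarriageReturns("data.txt\r") yields "data.txt", which the file APIs can then open without complaint.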
F u n !