Convert in Another Format With Character Encoding - c#

I am using an Oracle 10g database for a C# application. The issue is that the NVARCHAR column in the database doesn't save text in any language other than English. Since NVARCHAR supports Unicode, this should work. In the meantime I've tried a simple method from a tutorial, as follows:
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;
//Convert the string into a byte[].
byte[] unicodeBytes = ascii.GetBytes("আমার সোনার বাংলা!"); //Text to show
//Perform the conversion from one encoding to the other.
byte[] asciiBytes = Encoding.Convert(ascii, unicode, unicodeBytes);
char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
string asciiString = new string(asciiChars);
Console.WriteLine(asciiString);
Console.ReadKey();
It may seem silly, but I was expecting to be able to show the text in the console app with the code above. Instead it shows question marks (??????). Is there any way I can show the text, or at least save it in some other format, so that I can retrieve it and show it appropriately in the front end?

If you can use Unicode (and you should, hey, it's 2018), then it would be best to avoid Bijoy altogether. Process and store everything that is a string as System.String in .NET and as NVARCHAR in Oracle.
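On the database side, the important part is to bind the value as an NVARCHAR2 parameter rather than concatenating it into the SQL text. A minimal sketch, assuming the managed ODP.NET provider; the table and column names and the connection string are placeholders:

using Oracle.ManagedDataAccess.Client;

// Sketch only: "products"/"name" and the connection string are hypothetical.
static void SaveName(string connectionString, string name)
{
    using (var conn = new OracleConnection(connectionString))
    using (var cmd = new OracleCommand("INSERT INTO products (name) VALUES (:name)", conn))
    {
        conn.Open();
        // Binding as NVarchar2 keeps the value in Unicode end to end.
        cmd.Parameters.Add("name", OracleDbType.NVarchar2).Value = name;
        cmd.ExecuteNonQuery();
    }
}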
The Windows console can handle Unicode without any problems, provided we observe two important prerequisites that the documentation clearly states:
Support for Unicode [...] requires a font that has the glyphs needed to render that character. To successfully display Unicode characters to the console, the console font must be set to a [...] font such as Consolas or Lucida Console
This is something you must ensure in the Windows settings, independently of your .NET application.
The second prerequisite, emphasis mine:
[...] Console class supports UTF-8 encoding [...] Beginning with the .NET Framework 4.5, the Console class also supports UTF-16 encoding [...] To display Unicode characters to the console, you set the OutputEncoding property to either UTF8Encoding or UnicodeEncoding.
What the documentation does not say is that none of the fonts that can be selected from the properties menu of the console window will normally contain glyphs for all alphabets in the world. If you need right-to-left capability, as for example with Hebrew or Arabic, you're out of luck.
If the program is running on a Windows version without the East Asian fonts preinstalled, follow this tutorial to install the Bangla Language Interface Pack (KB3180030).
Then apply this answer to our problem as follows:
open the Windows Registry Editor
navigate to HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont
create a new string value, give it an available name such as "000", and set its value to "Bangla Medium"
reboot the PC
Now set the console font to "Bangla", using the window menu of the console, last menu item "Properties", second tab "Font".
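If you would rather script the registry step than edit it by hand, a rough sketch (assuming the value name "000" is still free and the process runs elevated):

using Microsoft.Win32;

// Adds the console font entry described above; requires administrator rights.
using (var key = Registry.LocalMachine.OpenSubKey(
    @"SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont", writable: true))
{
    key.SetValue("000", "Bangla Medium", RegistryValueKind.String);
}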
Finally get rid of all that encoding back and forth, and simply write:
using System;
using System.Text;

namespace so49851713
{
    class Program
    {
        public static void Main()
        {
            var mbb = "\u263Aআমার সোনার বাংলা!";

            /* prepare console (once per process) */
            Console.OutputEncoding = UTF8Encoding.UTF8;

            Console.WriteLine(mbb);
            Console.ReadLine();
        }
    }
}

Related

Arabic presentation forms B support in c#

I was trying to convert a file from UTF-8 to Arabic-1265 encoding using the Encoding APIs in C#, but I faced a strange problem: some characters are not converted correctly, such as "لا" in the following statement "ﻣﺣﻣد ﺻﻼ ح عادل", which appears as "ﻣﺣﻣد ﺻ? ح عادل". Some of my friends told me that this is because these characters are from the Arabic Presentation Forms B block. I created the file using Notepad++ and saved it as UTF-8.
Here is the code I use:
StreamReader sr = new StreamReader(@"C:\utf-8.txt", Encoding.UTF8);
string str = sr.ReadLine();
StreamWriter sw = new StreamWriter(@"C:\windows-1256.txt", false, Encoding.GetEncoding("windows-1256"));
sw.Write(str);
sw.Flush();
sw.Close();
But I don't know how to convert the file correctly with these presentation forms in C#.
Yes, your string contains lots of ligatures that cannot be represented in the 1256 code page. You'll have to decompose the string before writing it. Like this:
str = str.Normalize(NormalizationForm.FormKD);
sw.Write(str);
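Putting it together with the question's file paths, a minimal sketch of the whole read, normalize, write round trip (same APIs, just wrapped up):

using System.IO;
using System.Text;

// Read the UTF-8 source, decompose presentation-form ligatures into normal
// Arabic letters, then write it out in the legacy code page.
string str = File.ReadAllText(@"C:\utf-8.txt", Encoding.UTF8);
str = str.Normalize(NormalizationForm.FormKD);
File.WriteAllText(@"C:\windows-1256.txt", str, Encoding.GetEncoding("windows-1256"));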
To give a more general answer:
The Windows-1256 encoding is an obsolete 8-bit character encoding. It has only 256 characters, of which only 60 are Arabic letters.
Unicode has a much wider range of characters. In particular, it contains:
the “normal” Arabic characters, U+0600 to U+06FF. These are supposed to be used for normal Arabic text, including text written in other languages that use the Arabic script, such as Farsi. For example, “لا” is U+0644 (ل) followed by U+0627 (ا).
the “Presentation Form” characters, U+FB50 to U+FDFF (“Presentation Forms-A”) and U+FE70 to U+FEFF (“Presentation Forms-B”). These are not intended to be used for representing Arabic text. They are primarily intended for compatibility, especially with font-file formats that require separate code points for every different ligated form of every character and ligated character combination. The “لا” ligature is represented by a single codepoint (U+FEFB) despite being two characters.
When encoding into Windows-1256, the .NET encoding for Windows-1256 will automatically convert characters from the Presentation Forms block to “normal text” because it has no other choice (except of course to turn it all into question marks). For obvious reasons, it can only do that with characters that actually have an “equivalent”.
When decoding from Windows-1256, the .NET encoding for Windows-1256 will always generate characters from the “normal text” block.
As we’ve discovered, your input file contains characters that are not representable in Windows-1256. Such characters will turn into question marks (?). Furthermore, those Presentation-Form characters which do have a normal-text equivalent, will change their ligation behaviour, because that is what normal Arabic text does.
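To make that concrete, a small sketch of what compatibility normalization does to one Presentation-Forms code point (U+FEFB, the isolated lam-alef ligature, has a compatibility decomposition to the two normal-block letters):

string ligature = "\uFEFB";                                   // Presentation Forms-B lam-alef
string normal = ligature.Normalize(NormalizationForm.FormKD);
// normal is now "\u0644\u0627" (ل followed by ا), which Windows-1256 can encode.
foreach (char c in normal)
    Console.Write(((int)c).ToString("X4") + " ");             // prints: 0644 0627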
First of all, the two characters you quoted are not from the Arabic Presentation Forms block. They are \x0644 and \x0627, which are from the standard Arabic block. However, just to be sure I tried the character \xFEFB, which is the “equivalent” (not equivalent, but you know) character for لا from the Presentation Forms block, and it works fine even for that.
Secondly, I will assume you mean the encoding Windows-1256, which is for legacy 8-bit Arabic text.
So I tried the following:
var input = "لا";
var encoding = Encoding.GetEncoding("windows-1256");
var result = encoding.GetBytes(input);
Console.WriteLine(string.Join(", ", result));
The output I get is 225, 199. So let’s try to turn it back:
var bytes = new byte[] { 225, 199 };
var result2 = encoding.GetString(bytes);
Console.WriteLine(result2);
Fair enough, the Console does not display the result correctly — but the Watch window in the debugger tells me that the answer is correct (it says “لا”). I can also copy the output from the Console and it is correct in the clipboard.
Therefore, the Windows-1256 encoding is working just fine and it is not clear what your problem is.
My recommendation:
Write a short piece of code that exhibits the problem.
Post a new question with that piece of code.
In that question, describe exactly what result you get, and what result you expected instead.

Why is the console not printing the characters I am expecting

I'm currently trying to educate myself about the different encoding types. I tried to make a simple console app to tell me the difference between the types.
byte[] byteArray = new byte[] { 125, 126, 127, 128, 129, 130, 250, 254, 255 };
string s = Encoding.Default.GetString(byteArray);
Console.OutputEncoding = Encoding.Default;
Console.WriteLine("Default: " + s);
s = Encoding.ASCII.GetString(byteArray);
Console.OutputEncoding = Encoding.ASCII;
Console.WriteLine("ASCII: " + s);
s = Encoding.UTF8.GetString(byteArray);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine("UTF8: " + s);
The output however is nothing like I expected it to be.
Default: }~€‚úûüýþÿ
ASCII: }~?????????
UTF8: }~���������
Hmm... the characters do not copy well from the console output to here either so here's a print screen.
What I do expect is to see the extended ASCII characters. The default encoding is almost correct, but it cannot display 251, 252 and 253; that might be a shortcoming of Console.WriteLine(), though I'd not expect that.
The representation of the variable when debugging is as follows:
Default encoded string = "}~€‚úûüýþÿ"
ASCII encoded string = "}~?????????"
UTF8 encoded string = "}~���������"
Can someone tell me what I'm doing wrong? I expect one of the encoding types to properly display the extended ASCII table but apparently none can...
A bit of context:
I am trying to determine which encoding would be the best standard for our company. I personally think UTF-8 will do, but my supervisor would like to see some examples before we decide.
Obviously we know we will need to use other encoding types every now and then (serial communication, for example, uses 7 bits, so we can't use UTF-8 there), but in general we would like to stick with one encoding type. Currently we are using Default, ASCII and UTF-8 at random, which is not a good thing.
EDIT
The output according to:
Console.WriteLine("Default: {0} for {1}", s, Console.OutputEncoding.CodePage);
Edit 2:
Since I thought there might not be an encoding in which the extended ASCII characters correspond to the decimal numbers in the table I linked to, I turned it around, and this:
char specialChar = '√';
int charNumber = (int)specialChar;
gives me the number 8730, which in the table is 251.
The output encoding in your case should be mostly irrelevant since you're not even working with Unicode. Furthermore, you need to change your console window settings from Raster fonts to a TrueType font, like Lucida Console or Consolas. When the console is set to raster fonts, you can only have the OEM encoding (CP850 in your case), which means Unicode doesn't really work at all.
However, all that is moot as well, since your code is ... weird, at best. First, as to what is happening here: You have a byte array, interpret that in various encodings and get a (Unicode) string back. When writing that string to the console, the Unicode characters are converted to their closest equivalent in the codepage of the console (850 here). If there is no equivalent, not even close, then you'll get a question mark ?. This happens most prominently with ASCII and characters above 127, because they simply don't exist in ASCII.
If you want to see the characters you expect, then either use correct encodings throughout instead of trying to meddle around until it somewhat works, or just use the right characters to begin with.
Console.WriteLine("√ⁿ²");
should actually work because it runs through the encoding translation processes described above.
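For completeness, a minimal, self-contained sketch of that (assuming the console font has already been switched to a TrueType font as described). It also shows why Edit 2 above prints 8730: that is the Unicode code point of '√' (U+221A), whereas 251 is its position in the OEM code page 437 that the linked "extended ASCII" table describes:

using System;
using System.Text;

class Program
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine("√ⁿ²");                  // renders once the font has the glyphs

        var cp437 = Encoding.GetEncoding(437);     // on .NET Core/5+ register CodePagesEncodingProvider first
        Console.WriteLine(cp437.GetBytes("√")[0]); // 251: position in the old OEM code page
        Console.WriteLine((int)'√');               // 8730: the Unicode code point U+221A
    }
}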
Strange, with this code
Console.OutputEncoding = Encoding.Default;
Console.WriteLine("Default: {0} for {1}", s, Console.OutputEncoding.HeaderName);
s = Encoding.ASCII.GetString(byteArray);
Console.OutputEncoding = Encoding.ASCII;
Console.WriteLine("ASCII: {0} for {1}", s, Console.OutputEncoding.HeaderName);
s = Encoding.UTF8.GetString(byteArray);
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine("UTF8: {0} for {1}", s, Console.OutputEncoding.HeaderName);
I get this one:
Default: }~€‚úþÿ for Windows-1252
ASCII: }~?????? for us-ascii
UTF8: }~ ������ for utf-8
This is what I would expect. The default code page is CP1252, not the CP850 that your table shows.
Try another default font for your console, e.g. "Consolas" or "Lucida Console", and check the output.

C# console + Unicode

First of all, Console.OutputEncoding doesn't help. So I went to a Unicode table site, chose some symbols and tried to output them, but only ?? is produced. Where am I wrong?
using System;
using System.Text;

namespace Test
{
    class Program
    {
        static void Main()
        {
            Console.OutputEncoding = Console.InputEncoding = Encoding.Unicode;
            string s = "\u2654\u2657";
            Console.WriteLine(s);
            Console.ReadKey();
        }
    }
}
The default font in the console window is not a Unicode font. You have to change it to a Unicode font like Lucida Console or Consolas so that it can display characters outside the regular 8-bit codepage.
Ref: http://msdn.microsoft.com/en-us/library/system.console.outputencoding%28v=vs.110%29.aspx
Under Remarks:
Note that successfully displaying Unicode characters to the console requires the following:
The console must use a TrueType font, such as Lucida Console or Consolas, to display characters.
A font used by the console must define the particular glyph or glyphs to be displayed. The console can take advantage of font linking to display glyphs from linked fonts if the base font does not contain a definition for that glyph.
Tried your code in http://www.compileonline.com/compile_csharp_online.php and it worked fine. As @Guffa said, make sure your console encoding is capable of decoding the mentioned characters.

C# UNICODE to ANSI conversion

I need your help concerning something which disturbs me when working with UNICODE encoding in .NET Framework ...
I have to interface with some customer data systems which are non-Unicode applications, and those customers are worldwide companies (Chinese, Korean, Russian, ...). So they have to provide me an 8-bit ASCII file, which will be encoded with their Windows code page.
So, if a Greek customer sends me a text file containing 'Σ' (the sigma letter, '\u03A3') in a product name, I will get the letter corresponding to ANSI code point 211, represented in my own code page. My computer is a French Windows, which means the code page is Windows-1252, so in its place I will have 'Ó' in this text file... OK.
I know this customer is a Greek one, so I can read his file by forcing the windows-1253 code page in my import parameters.
/// <summary>
/// Convert a string ASCII value using code page encoding to Unicode encoding
/// </summary>
/// <param name="value"></param>
/// <returns></returns>
public static string ToUnicode(string value, int codePage)
{
    Encoding windows = Encoding.Default;
    Encoding unicode = Encoding.Unicode;
    Encoding sp = Encoding.GetEncoding(codePage);

    if (sp != null && !String.IsNullOrEmpty(value))
    {
        // First get bytes in windows encoding
        byte[] wbytes = windows.GetBytes(value);

        // Check if CodePage to use is different from current Windows one
        if (windows.CodePage != sp.CodePage)
        {
            // Convert to Unicode using SP code page
            byte[] ubytes = Encoding.Convert(sp, unicode, wbytes);
            return unicode.GetString(ubytes);
        }
        else
        {
            // Directly convert to Unicode using windows code page
            byte[] ubytes = Encoding.Convert(windows, unicode, wbytes);
            return unicode.GetString(ubytes);
        }
    }
    else
    {
        return value;
    }
}
Well in the end I got 'Σ' in my application and I am able to save this into my SQL Server database. Now my application has to perform some complex computations, and then I have to give back this file to the customer with an automatic export...
So my problem is that I have to perform a UNICODE => ANSI conversion?! But this is not as simple as I thought at the beginning...
I don't want to save the code page used during import, so my first idea was to convert UNICODE to windows-1252, and then automatically send the file to the customers. They will read the exported text file with their own code page so this idea was interesting for me.
But the problem is that the conversion in this way has a strange behaviour... Here are two different examples:
1st example (я)
char ya = '\u042F';
string strYa = Char.ConvertFromUtf32(ya);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1251 = System.Text.Encoding.GetEncoding(1251);
string strYa1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strYa)));
string strYa1251 = ansi1251.GetString(System.Text.Encoding.Convert(unicode, ansi1251, unicode.GetBytes(strYa)));
So strYa1252 contains '?', whereas strYa1251 contains the valid char 'я'. It seems it is impossible to convert to ANSI if the valid code page is not indicated to the Convert() function... So nothing in the Unicode Encoding class helps the user get equivalences between ANSI and Unicode code points? :\
2nd example (Σ)
char sigma = '\u03A3';
string strSigma = Char.ConvertFromUtf32(sigma);
System.Text.Encoding unicode = System.Text.Encoding.Unicode;
System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
System.Text.Encoding ansi1253 = System.Text.Encoding.GetEncoding(1253);
string strSigma1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strSigma)));
string strSigma1253 = ansi1253.GetString(System.Text.Encoding.Convert(unicode, ansi1253, unicode.GetBytes(strSigma)));
At this time, I have the correct 'Σ' in the strSigma1253 string, but I also have 'S' for strSigma1252. As indicated at the beginning, I should have 'Ó' if ANSI code has been found, or '?' if the character has not been found, but not 'S'. Why?
Yes, of course, a linguist could say that 'S' is equivalent to the Greek sigma character because they sound the same in both alphabets, but they don't have the same ANSI code!
So how can the Convert() function in the .NET framework manage this kind of equivalence?
And does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?
I should have ...'?' if the character has not been found, but not 'S'. Why?
This is known as 'best-fit' encoding, and in most cases it's a bad thing. When Windows can't encode a character to the target code page (because Σ does not exist in code page 1252), it makes best efforts to map the character to something a bit like it. This can mean losing the diacritical marks (ë→e), or mapping to a cognate (Σ→S), a character that's related (≤→=), a character that's unrelated but looks a bit similar (∞→8), or whatever other madcap replacement seemed like a good idea at the time but turns out to be culturally or mathematically offensive in practice.
You can see the tables for cp1252, including that Sigma mapping, here.
Apart from being a silent mangling of dubious usefulness, it also has some quite bad security implications. You should be able to stop it happening by setting EncoderFallback to ReplacementFallback or ExceptionFallback.
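A minimal sketch of both fallback options, assuming code page 1252 as in the question:

// Throw instead of silently best-fitting Σ to 'S'.
var strict = Encoding.GetEncoding(1252,
    EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
try
{
    strict.GetBytes("Σ");
}
catch (EncoderFallbackException ex)
{
    Console.WriteLine("Cannot encode U+" + ((int)ex.CharUnknown).ToString("X4"));
}

// Or degrade to a plain '?' instead of a misleading cognate.
var lossy = Encoding.GetEncoding(1252,
    new EncoderReplacementFallback("?"), DecoderFallback.ReplacementFallback);
Console.WriteLine(lossy.GetString(lossy.GetBytes("Σ")));   // prints "?"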
does someone have an idea to write back ANSI characters from UNICODE in text files I have to send to customers?
You'll have to keep a table of encodings for each customer. Read their input files using that encoding to decode; write their output files using the same encoding.
(For sanity, set new customers to UTF-8 and document that this is the preferred encoding.)
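A rough sketch of that bookkeeping, with hypothetical customer names and code pages:

using System.Collections.Generic;
using System.IO;
using System.Text;

static class CustomerFiles
{
    // Hypothetical mapping; in practice this would come from configuration.
    static readonly Dictionary<string, Encoding> Encodings = new Dictionary<string, Encoding>
    {
        ["GreekCustomer"]   = Encoding.GetEncoding(1253),
        ["RussianCustomer"] = Encoding.GetEncoding(1251),
        ["NewCustomer"]     = Encoding.UTF8   // preferred default for new customers
    };

    public static string Import(string customer, string path)
        => File.ReadAllText(path, Encodings[customer]);

    public static void Export(string customer, string path, string content)
        => File.WriteAllText(path, content, Encodings[customer]);
}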

How to make console be able to print any of 65535 UNICODE characters

I am experimenting with Unicode characters and taking Unicode values from the Wikipedia page.
The problem is my console displays all of the C0 Controls and Basic Latin Unicode characters, i.e. from U+0000 to U+00FF, but for all other categories, like Latin Extended-B, Cyrillic, other languages, etc., the console prints the question mark character (?).
My C# code is
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace DataTypes
{
    class Program
    {
        static void Main(string[] args)
        {
            char ch = '\u0181';
            Console.WriteLine("the unicode character is value" + ch);
        }
    }
}
I am working on Windows 7, Visual Studio 2010. What should I do to increase Unicode support?
There's a lot of history behind that question, so I'll noodle about it for a while first. Console mode apps can only operate with an 8-bit text encoding. This goes back to a design decision made 42 years ago by Ken Thompson et al. when they designed Unix. A core feature of Unix was that terminal I/O was done through pipes, and you could chain pipes together to feed the output of one program to the input of another. This feature was also implemented in Windows and is supported by .NET as well with the ProcessStartInfo.RedirectStandardXxxx properties.
Nice feature, but it became a problem when operating systems started to adopt Unicode. Windows NT was the first one that was fully Unicode at its core. Unicode characters must always be encoded; a common choice back then was UCS-2, which later morphed into UTF-16. Now there's a problem with I/O redirection: a program that spits out 16-bit encoded characters is not going to operate well when it is redirected to a program that still uses 8-bit encoded characters.
Credit Ken Thompson as well with finding a solution for this problem: he invented the UTF-8 encoding.
That works in Windows as well. It is easy to do in a console mode app; you have to re-assign the Console.OutputEncoding property:
using System;
using System.Text;

class Program {
    static void Main(string[] args) {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine("Ĥėļŀō ŵŏŗłđ");
        Console.ReadLine();
    }
}
You'll now, however, encounter another problem: the font selected for the console window is likely to be unable to render the text. Press Alt+Space to invoke the system menu, then Properties, Font tab. You'll need to pick a non-raster font. Pickings are very slim; on Vista and up you can choose Consolas. Re-run your program and the accented characters should render properly. Unfortunately, forcing the console font programmatically is a problem; you'll need to document this configuration step. In addition, a font like Consolas doesn't have the full set of possible Unicode glyphs. You are likely to see rectangles appear for Unicode code points for which it has no glyphs. All an unsubtle reminder that creating a GUI program is really your best bet.
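If you want to see how far a given console font's coverage actually goes, a small sketch that dumps a whole Unicode block (Latin Extended-A here) so the rectangles become obvious:

using System;
using System.Text;

class Program
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // Code points the selected font cannot render appear as empty rectangles.
        for (int cp = 0x0100; cp <= 0x017F; cp++)
        {
            Console.Write(char.ConvertFromUtf32(cp));
        }
        Console.WriteLine();
    }
}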
