The following may not qualify as an SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C standard correctly, and is this the right way to go about things?"
I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:
Portability and serialization are orthogonal concepts.
Portable things are things like C, unsigned int, wchar_t. Serializable things are things like uint32_t or UTF-8. "Portable" means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things on the other hand always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning. </preamble>
When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:
wchar_t, setlocale(), mbsrtowcs()/wcsrtombs(): The C standard says nothing about "encodings"; in fact, it is entirely agnostic to any text or encoding properties. It only says "your entry point is main(int, char**); you get a type wchar_t which can hold all your system's characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa".
iconv() and UTF-8,16,32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception.
The bridge between the portable, encoding-agnostic world of C with its wchar_t portable character type and the deterministic outside world is iconv conversion between WCHAR_T and UTF.
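To make that bridge concrete, here is a minimal sketch of the "wide string to UTF-8" direction. The helper name wide_to_utf8 is hypothetical (it does not appear in the post), and the sketch assumes an iconv implementation that recognizes the encoding name "WCHAR_T", as glibc and GNU libiconv do; some iconv headers declare the input buffer as const char**, in which case the cast needs adjusting.

```cpp
#include <cassert>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: serialize a wide string to UTF-8 through iconv.
// Assumption: the platform's iconv understands the name "WCHAR_T".
std::string wide_to_utf8(const std::wstring& in)
{
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");   // arguments are (to, from)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // One code point needs at most 4 UTF-8 bytes, and every wchar_t unit
    // contributes to at most one code point, so this buffer is large enough.
    std::string out(in.size() * 4, '\0');

    // iconv takes non-const pointers but does not modify the input.
    char* src = const_cast<char*>(reinterpret_cast<const char*>(in.data()));
    std::size_t src_left = in.size() * sizeof(wchar_t);
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    const std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("iconv failed (invalid input?)");

    out.resize(out.size() - dst_left);             // drop the unused tail
    return out;
}
```

The reverse direction would swap the two names passed to iconv_open and size the output buffer in wchar_t units instead.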
So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via wcsrtombs(), and use iconv() for serialization? Conceptually:
                          my program
     <-- wcstombs ---  /==============\   --- iconv(UTF8, WCHAR_T) -->
CRT                    |   wchar_t[]  |                                 <Disk>
     --- mbstowcs -->  \==============/   <-- iconv(WCHAR_T, UTF8) ---
                              |
                              +-- iconv(WCHAR_T, UCS-4) --+
                                                          |
         ... <--- (adv. Unicode malarkey) ----- libicu ---+
Practically, that means that I'd write two boiler-plate wrappers for my program entry point, e.g. for C++:
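// Portable wmain()-wrapper
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

std::vector<std::wstring> parse(int argc, char * argv[]); // use mbsrtowcs etc
int wmain(const std::vector<std::wstring> & args);        // user starts here

#if defined(_WIN32) || defined(WIN32)
#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW

extern "C" int main()
{
    std::setlocale(LC_CTYPE, "");
    int argc;
    // Windows hands us the command line as wide strings directly.
    wchar_t ** const argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argv == NULL) return 1;
    const std::vector<std::wstring> args(argv, argv + argc);
    LocalFree(argv);   // the array returned by CommandLineToArgvW must be freed
    return wmain(args);
}
#else
extern "C" int main(int argc, char * argv[])
{
    // Adopt the user's locale so the multibyte-to-wide conversion
    // in parse() follows the environment's encoding.
    std::setlocale(LC_CTYPE, "");
    return wmain(parse(argc, argv));
}
#endif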
// Serialization utilities
#include <cstdint>   // std::uint16_t, std::uint32_t
#include <string>
#include <iconv.h>

typedef std::basic_string<std::uint16_t> U16String;
typedef std::basic_string<std::uint32_t> U32String;

U16String toUTF16(const std::wstring & s);
U32String toUTF32(const std::wstring & s);

/* ... */
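The parse() function is only declared above, with the comment "use mbsrtowcs etc". One way it could be filled in is sketched here; the widen() helper is hypothetical, and the result depends on the LC_CTYPE locale in effect, so setlocale() has to be called first as in the wrapper.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert one multibyte string to a wide string
// according to the current LC_CTYPE locale.
std::wstring widen(const char* s)
{
    // First pass: with a null destination, mbsrtowcs only counts the
    // wide characters needed (excluding the terminator).
    std::mbstate_t state = std::mbstate_t();
    const char* src = s;
    const std::size_t len = std::mbsrtowcs(NULL, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");

    // Second pass: convert into a buffer with room for the terminator.
    std::vector<wchar_t> buf(len + 1);
    state = std::mbstate_t();
    src = s;
    std::mbsrtowcs(&buf[0], &src, buf.size(), &state);
    return std::wstring(&buf[0], len);
}

// Sketch of the parse() declared in the wrapper: widen every argument.
std::vector<std::wstring> parse(int argc, char* argv[])
{
    std::vector<std::wstring> args;
    for (int i = 0; i < argc; ++i)
        args.push_back(widen(argv[i]));
    return args;
}
```

Throwing on an invalid sequence is a design choice for this sketch; a real program might prefer to substitute a replacement character or report which argument failed.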
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)
Update
Following many very nice comments I'd like to add a few observations:
If your application explicitly wants to deal with Unicode text, you should make the iconv-conversion part of the core and use uint32_t/char32_t-strings internally with UCS-4.
Windows: While using wide strings is generally fine, interaction with the console (any console, for that matter) appears to be limited: there does not seem to be support for any sensible multi-byte console encoding, and mbstowcs is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer drop works with GetCommandLineW + CommandLineToArgvW (perhaps there should be a separate wrapper for Windows).
File systems: File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. char16_t sequences that do not constitute valid UTF-16, such as naked surrogates, are valid NTFS filenames). The Standard C fopen is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific _wfopen may be required. As a corollary, there is in general no well-defined notion of "how many characters" comprise a given file name, as there is no notion of "character" in the first place. Caveat emptor.
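To illustrate the point about naked surrogates, the following sketch checks whether a sequence of 16-bit units is well-formed UTF-16. The function name is hypothetical, and the sketch only looks at surrogate pairing, which is exactly the property that an NTFS file name is not guaranteed to satisfy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if every surrogate in s is part of a proper
// high/low pair, i.e. the sequence is well-formed UTF-16.
bool is_well_formed_utf16(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 >= s.size())
                return false;
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF)
                return false;
            ++i;  // skip the low surrogate we just validated
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            // A low surrogate with no preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

A file name that fails this check cannot be converted losslessly to UTF-8 or UTF-32, so code that enumerates directories has to keep the raw 16-bit string around if it wants to reopen the file.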
Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++
No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with
int main(int argc, char** argv)
you have already lost Unicode support for command line arguments. You have to write
int wmain(int argc, wchar_t** argv)
instead, or use the GetCommandLineW function, neither of which is specified in the C standard.
More specifically, [...] #ifdefs. [...] wchar_t is a UTF-16 code unit on Windows and char is often (but not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know which encoding you are working with, or use a wrapper library that abstracts the encodings away.
I think I have to conclude that it's completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort in it. Unfortunately, most applications already fail at comparatively simple tasks such as "writing Greek characters to the console" or "supporting any filename allowed by the system in a correct manner", and such tasks are only the first tiny steps towards true Unicode support.