[Tutorial] Is codepage 65001 and utf-8 the same thing?

Last edited by Laputa on 2023-8-2 11:30

Source : https://stackoverflow.com/questions/1629437/is-codepage-65001-and-utf-8-the-same-thing
Blog version: https://laputa.eu.org/archives/85
Telegraph : https://telegra.ph/Is-codepage-65001-and-utf-8-the-same-thing-08-02

UTF-8 is CP65001 in Windows (which is just a way of specifying UTF-8 in the legacy codepage stuff). As far as I have read, ASP can handle UTF-8 when specified that way.

Historically, texts had a code page, which simply specified which character set to use. Code pages were identified by numbers that differed from vendor to vendor; Windows seems to use a 16-bit unsigned integer for that purpose. Nowadays most encodings and character sets have names instead of numbers. I consider the fact that UTF-8 has a code page number (which is neither specified nor used anywhere outside Microsoft) a way to ensure that it still works with the old 16-bit integer code page number system, even though UTF-8 is nothing like a code page in the first place. –
Joey

Oct 27, 2009 at 9:17
@Johannes: The codepage number is still an important feature of how Windows handles character encoding. For example, in .NET the Encoding class can only be instantiated using the codepage number. I don't think Codepage is "legacy" yet. –
AnthonyWJones

Oct 27, 2009 at 13:28
It's only there for correct interoperability with previous and existing systems. Nowadays I guess such mechanisms would use names instead of arbitrary numbers simply because the encoding landscape has changed a bit since ye olde days of 1980. –
Joey
Oct 27, 2009 at 13:46
Code pages are still used in Windows DOS screens. For example, to change the code page used by a DOS screen to UTF-8: chcp 65001 –
Sabuncu
May 5, 2012 at 21:47
Sabuncu, (a) DOS is a misnomer for the Windows console, don't use it. (b) Switch the console window to a TrueType font and you'll get Unicode support without all the craziness. Whatever you set with chcp then doesn't affect the output of text. Besides, this question wasn't at all about the Windows console but rather about ASP. –
Joey
May 5, 2012 at 22:39
@AnthonyWJones msdn.microsoft.com/ru-ru/library/windows/desktop/dd317756.aspx - see the 2nd comment, made by a Microsoft employee. While I agree that codepages will live as long as Windows lives, they were still named legacy. Like 8.3 names, 260-letter paths and so on. –
Arioch 'The
Nov 26, 2012 at 17:00
@AnthonyWJones: System.Text.Encoding.GetEncoding(string) accepts a name like ISO-8859-1 or UTF-32BE. –
Joey
Feb 22, 2013 at 10:45
CP 65001 support is buggy in cmd.exe and MS VC Runtime but as just an encoding to read files with bare winapi, it seems to be okay. –
ivan_pozdeev
Jan 30, 2018 at 20:29

Your code is correct although I prefer to set the CharSet in code rather than use the meta tag:-

<% Response.CharSet = "UTF-8" %>
The codepage 65001 does refer to the UTF-8 character set. You would need to make sure that your ASP page (and any includes) are saved as UTF-8 if they contain any characters outside of the standard ASCII character set.

By specifying the CODEPAGE attribute in the <%@ block you are indicating that anything written using Response.Write should be encoded to the codepage specified, in this case 65001 (UTF-8). It's worth bearing in mind that this does not affect any static content, which is sent verbatim, byte for byte, to the response. That is why the file needs to actually be saved using the codepage that is specified.
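
The difference between static and dynamic content can be sketched outside ASP. The following Python snippet is only an illustration, not how ASP is implemented: `render_page` and `is_valid_utf8` are hypothetical helpers, and the sample strings are made up. It assumes static markup is copied byte for byte while dynamic output is encoded with the declared codepage.

```python
def render_page(static_file_bytes: bytes, dynamic_text: str, codepage: str) -> bytes:
    """Hypothetical model: static bytes pass through untouched,
    only the dynamic text is encoded with the declared codepage."""
    return static_file_bytes + dynamic_text.encode(codepage)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same static markup, saved to disk in two different encodings.
static_utf8 = "<p>café</p>".encode("utf-8")
static_1252 = "<p>café</p>".encode("cp1252")

# File saved as UTF-8 and codepage UTF-8: the response is consistent.
consistent = render_page(static_utf8, "<p>naïve</p>", "utf-8")

# File saved as Windows-1252 but codepage UTF-8: two encodings in one body.
mixed = render_page(static_1252, "<p>naïve</p>", "utf-8")

print(is_valid_utf8(consistent))  # True
print(is_valid_utf8(mixed))       # False
```

In the second case no single charset label can describe the response correctly, which is the practical reason the saved file and the declared codepage have to agree.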

The CharSet property of the response sets the CharSet value of the Content-Type header. This has no impact on how the content may be encoded; it merely tells the client what encoding is being received. Again, it is important that this value match the actual encoding sent.
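
A rough Python sketch of why the label and the bytes must match. This is an assumption-laden model rather than ASP itself: `build_response` and `client_decode` are hypothetical names, and the header parsing is deliberately naive.

```python
def build_response(text: str, codepage: str, charset: str):
    """Hypothetical model: `codepage` decides the bytes that are sent,
    `charset` only decides the label in the Content-Type header."""
    headers = {"Content-Type": f"text/html; charset={charset}"}
    body = text.encode(codepage)
    return headers, body


def client_decode(headers: dict, body: bytes) -> str:
    """Decode the body the way a client trusting the header would."""
    charset = headers["Content-Type"].split("charset=")[1]
    return body.decode(charset, errors="replace")


text = "Grüße"

# Label matches the real encoding: the client recovers the text.
matching = client_decode(*build_response(text, "utf-8", "utf-8"))

# Same bytes, different label: the client decodes them incorrectly.
mismatched = client_decode(*build_response(text, "utf-8", "windows-1252"))

print(matching == text)    # True
print(mismatched == text)  # False
```

Changing only the charset argument leaves the body bytes unchanged, so a wrong label shows up on the client as garbled characters rather than as a server-side error.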

The primary meaning and effect of <%@LANGUAGE="VBSCRIPT" CODEPAGE="65001"%> is for the source file encoding to be UTF-8 (or whatever codepage is specified). It only cascades through to the Response.CharSet property. You may save your file as UTF-8 and put the matching CODEPAGE declaration in and then still use another encoding for Response.CharSet. Like source in 65001 and output in 1251 or 1252. - You probably know that, I just didn't think it was completely clear from your text, which begins by implying that they might be simple alternatives. –
Lumi
Apr 14, 2012 at 8:30
@Lumi: I find no such implication, I quote "The CharSet property of the response sets the CharSet value of the Content-Type header. This has no impact on how the content may be encoded". Seems fairly clear to me. BTW the only actual effect of the CODEPAGE directive is to set the Response.CodePage; it's the responsibility of the developer to ensure the file is saved using the matching codepage. –
AnthonyWJones
Apr 14, 2012 at 14:50
You're right. I confused Response.CharSet and Response.CodePage. Setting the CODEPAGE directive cascades to the latter, not to the former; it has no bearing at all on the Content-Type header. I believe the CODEPAGE directive is best understood as "source file encoding". Here's an example of where it matters. The critical expression is domXml.createElement("Französisch"). The file was encoded in UTF-8 (had to be Unicode for all of Greek, Russian, etc to work) and so codepage=65001 was critical. –
Lumi
Apr 14, 2012 at 16:04
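
The "source file encoding" reading in the comment above can be illustrated with a small Python sketch. It is not how the ASP engine actually parses scripts: `literal_seen_by_engine` is a hypothetical helper that assumes the source bytes are decoded with the declared codepage before string literals are read.

```python
# The source line from the comment, as it would sit on disk in a UTF-8 file.
source_bytes = 'domXml.createElement("Französisch")'.encode("utf-8")


def literal_seen_by_engine(src: bytes, codepage: str) -> str:
    """Hypothetical model: decode the source with the declared codepage,
    then return the first double-quoted string literal."""
    text = src.decode(codepage)
    return text.split('"')[1]


# Declared codepage matches the file: the literal is read as written.
right = literal_seen_by_engine(source_bytes, "utf-8")

# Declared codepage is Windows-1252: each byte of the two-byte "ö"
# is read as a separate character, so the element name is wrong.
wrong = literal_seen_by_engine(source_bytes, "cp1252")

print(right == "Französisch")  # True
print(len(right), len(wrong))  # 11 12
```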

Yes, 65001 is the Windows code page identifier for UTF-8, as documented on the Microsoft website. Wikipedia suggests that IBM code page 1208 and SAP code page 4110 are also indicators for UTF-8.
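
As a quick cross-check outside Windows, Python's codec registry can be asked whether two names resolve to the same codec. This is a minimal sketch assuming Python 3.8 or later, where `cp65001` is registered as an alias of UTF-8; `same_codec` is a hypothetical helper name.

```python
import codecs


def same_codec(name_a: str, name_b: str) -> bool:
    """Return True if both names resolve to the same Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


# Assumes Python 3.8+, where cp65001 is an alias of utf-8.
print(same_codec("cp65001", "utf-8"))  # True
print(same_codec("cp1252", "utf-8"))   # False

# Encoding with either name gives identical bytes.
print("é".encode("cp65001") == "é".encode("utf-8"))  # True
```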

seems to give bad results when the physical file is saved as UTF-8.

Otherwise, it works as it is supposed to.
