Problem description
I have a Classic ASP page that uses the SHIFT_JIS charset. The meta tag in the page's head section looks like this:
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
My page has a text box (txtName) that should allow at most 200 characters. A JavaScript function validates the length; it is called from the onclick event of my Submit button.
if(document.frmPage.txtName.value.length > 200) {
alert("You have exceeded the maximum length of 200.");
return false;
}
The problem is that JavaScript does not report the SHIFT_JIS length of Japanese characters. For example, the character 测 has a SHIFT_JIS length of 8 characters, but JavaScript counts it as one character, probably because JavaScript strings are Unicode. Some characters, such as ケ, take 2 or 3 characters in SHIFT_JIS.
If I rely only on the length that JavaScript reports, long Japanese text passes the page validation and the page tries to save it to the database, which then fails because the DB column has a maximum length of 200.
The browser I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of Japanese characters using JavaScript? Is it possible to convert from Unicode to SHIFT_JIS in JavaScript? If so, how?
Thanks for your help!
Recommended answer
For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding
Let's be clear: 测, U+6D4B (Han character 'measure, estimate, conjecture'), is a single character. When you encode it in a particular encoding such as Shift-JIS, it may well become multiple bytes.
In general, JavaScript doesn't expose encoding tables, so you can't ask how many bytes a character will take up. If you really need to know, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, the following function works out how many bytes are needed by listing every character that encodes to a single byte and assuming every other character takes two bytes:
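function getShiftJISByteLength(s) {
    // Single-byte characters: \x00-\x80 and the halfwidth katakana block
    // U+FF61-U+FF9F. Every other character is replaced by two placeholder
    // characters, so the resulting string length equals the byte count.
    return s.replace(/[^\x00-\x80\uFF61-\uFF9F]/g, 'xx').length;
}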
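As a rough sketch of how such a byte count could stand in for the `.length` check in the question: the version below makes the same assumption (the input contains only characters that exist in Shift-JIS) and treats ASCII plus halfwidth katakana (U+FF61 to U+FF9F) as one byte and everything else as two. The names `countShiftJISBytes` and `fitsInColumn` are made up for illustration and are not part of any library.

```javascript
// Hypothetical helper: estimate the Shift-JIS byte length of a string.
// Assumption: every character in s is encodable in Shift-JIS.
function countShiftJISBytes(s) {
    var bytes = 0;
    for (var i = 0; i < s.length; i++) {
        var code = s.charCodeAt(i);
        var isAscii = code <= 0x7F;
        var isHalfwidthKatakana = code >= 0xFF61 && code <= 0xFF9F;
        // One byte for ASCII and halfwidth katakana, two for the rest.
        bytes += (isAscii || isHalfwidthKatakana) ? 1 : 2;
    }
    return bytes;
}

// Hypothetical validation helper for a column limited to maxBytes bytes.
function fitsInColumn(value, maxBytes) {
    return countShiftJISBytes(value) <= maxBytes;
}
```

In the page from the question, the onclick handler could then test `!fitsInColumn(document.frmPage.txtName.value, 200)` instead of comparing `.length` with 200. The server should still enforce the limit, since client-side checks can be bypassed.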
However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)
The reason you might think it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form because the character does not exist in the target charset, it replaces it with an HTML character reference, in this case &#27979; (eight characters). This is a lossy mangling: you can't tell whether the user typed the literal text &#27979; or the character 测. And if you are displaying the submitted content &#27979; as 测, that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
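To make the "8 characters" concrete, here is a small sketch. `toNumericCharRef` reproduces the decimal character reference a browser substitutes for U+6D4B, and `escapeHtml` is a hypothetical output-escaping helper of the kind the page should apply before writing submitted text back into HTML. Both names are illustrative and not taken from the original page.

```javascript
// Build the decimal numeric character reference for a single
// character in the Basic Multilingual Plane.
function toNumericCharRef(ch) {
    return "&#" + ch.charCodeAt(0) + ";";
}

// Hypothetical helper: escape text before inserting it into HTML, so a
// submitted character reference is shown literally rather than decoded.
function escapeHtml(text) {
    return text
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

var ref = toNumericCharRef("\u6D4B"); // the reference for U+6D4B
var safe = escapeHtml(ref);           // ampersand escaped for output
```

The reference string is eight characters long, which matches the length reported in the question. Escaping it on output keeps the literal text visible instead of letting the browser turn it back into a character.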
The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without resorting to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:
function getUTF8ByteLength(s) {
return unescape(encodeURIComponent(s)).length;
}
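The hack works because encodeURIComponent percent-encodes the UTF-8 bytes of the string and unescape turns each %XX escape back into a single character, so the resulting length equals the byte count. As an alternative sketch that avoids unescape, the byte count can be derived from the code unit ranges directly. The name `countUTF8Bytes` is made up for illustration, and counting a lone surrogate as three bytes is an assumption of this sketch (encodeURIComponent would throw a URIError on such input instead).

```javascript
// Hypothetical helper: count the UTF-8 bytes needed for a JavaScript string.
function countUTF8Bytes(s) {
    var bytes = 0;
    for (var i = 0; i < s.length; i++) {
        var code = s.charCodeAt(i);
        if (code < 0x80) {
            bytes += 1;                 // ASCII
        } else if (code < 0x800) {
            bytes += 2;                 // U+0080 to U+07FF
        } else if (code >= 0xD800 && code <= 0xDBFF && i + 1 < s.length &&
                   s.charCodeAt(i + 1) >= 0xDC00 && s.charCodeAt(i + 1) <= 0xDFFF) {
            bytes += 4;                 // surrogate pair: one code point above U+FFFF
            i++;                        // skip the low surrogate
        } else {
            bytes += 3;                 // rest of the BMP (lone surrogates included)
        }
    }
    return bytes;
}
```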
That said, it would probably be better to store native Unicode strings in the database, so that the length limit refers to actual characters rather than bytes in some encoding.