drmacro
Registered user
Global user
Registered: 01-2004
Posts: 7
Karma: 0 (+0/-0)
Help with Japanese HTML
[I tried to mail this directly to eviloverlord at sanriotown.com but it came back with "554 Error: no valid recipients" so I'm posting here.]
Dude,
Your site rocks, by the way--it's about the most entertaining thing I've come across in months. I'm not normally a blog reader but my wife showed me your Japanese porn dictionary, which is pretty cool, and then I went to your blog and was hooked.
Anyway, my current job is doing internationalized publishing using HTML and XML and stuff, so I know a little about this.
To render Japanese you have some challenges, but it shouldn't be that hard. To get non-Western languages to work, you need to have all the following things lined up:
1. Your Web pages have to accurately declare their encoding/character set
2. You have to use the right characters
3. Browser clients have to correctly detect and support your page's character set
4. Browser clients have to have the appropriate fonts installed.
For point 1, you have essentially two choices: a Japanese-specific character set or Unicode/UTF-8. I would recommend Unicode since it's more general and probably will have better support in most modern browsers. There are decent Unicode fonts for Windows and Linux. Certainly support for Unicode and non-Western characters is pretty good in IE5/6 and Netscape/Mozilla. I don't know what HTML editor you're using, but it should either be the default setting or should be something you can set without too much effort.
Unicode is an all-encompassing character set that aims to provide characters for all modern human languages. UTF-8 is one way of encoding those characters as sequences of bytes (UTF-16 being the other main Unicode encoding). Most people use "Unicode" and "UTF-8" interchangeably, but technically they are different. Most simply: UTF-8 implies Unicode but Unicode does not imply UTF-8.
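To make the character-set versus encoding distinction concrete, here is a minimal sketch in Python (my choice of language for illustration only; nothing about your pages depends on it). It takes one Japanese character and shows its Unicode number and the two different byte sequences that UTF-8 and UTF-16 produce for it:

```python
# One Unicode character, two different byte encodings of it.
ch = "あ"  # HIRAGANA LETTER A

code_point = ord(ch)                  # the character's Unicode number
utf8_bytes = ch.encode("utf-8")       # that number written as UTF-8 bytes
utf16_bytes = ch.encode("utf-16-be")  # the same number as big-endian UTF-16

print(hex(code_point))    # 0x3042
print(utf8_bytes.hex())   # e38182
print(utf16_bytes.hex())  # 3042
```

Same character, same Unicode number, different bytes on disk depending on the encoding.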
For point 2, once you have your HTML set up correctly, it should just work.
For point 3, the key is to make sure your HTML has an encoding specification, like this:
<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
...
Without this declaration, browsers must guess at the character set and are likely to guess wrong; even worse, users may have forced their browser to a specific encoding that doesn't match your page.
Your other option for character set would be something like shift-jis, which is a Japanese-specific character set and encoding. However, it is less likely that non-Asian browsers would recognize it or support it properly. So use Unicode/UTF-8.
For point 4, fonts, while both IE and Netscape/Mozilla support Unicode/UTF-8 completely, unless a user has installed the appropriate fonts, they will still see either garbage or square boxes (or blanks) where the Japanese characters should be. This usually involves going to the Regional Settings in Windows and selecting whatever languages you need (e.g., Japanese). For Macs and Linux I don't know, but I know it can be done. Given the right fonts, IE and Netscape should just work under Windows. Worst case, you may need to specify fonts in the HTML, but you really shouldn't--the browser should know what to do. The danger with specifying fonts is of course that if a user doesn't have the font installed then you're right back where you were. The only reason to use a specific font would be to get a particular typographic effect and in that case you should probably make the font available for download (something I know less about).
Here are some debugging clues that you can provide to your readers:
1. If you see ASCII garbage instead of Japanese, it means the browser's encoding setting isn't correct. For IE, this means going to View->Encoding and setting the encoding to either Auto select (if you've provided the charset markup above) or selecting UTF-8 (assuming that's what you're storing your pages as).
2. If you see square boxes or nothing instead of Japanese, it most likely means you don't have the appropriate Japanese fonts installed.
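The first clue (garbage instead of Japanese) can be reproduced in a few lines. This is only a sketch in Python, assuming the page was saved as UTF-8 and the browser wrongly falls back to a Western encoding (latin-1 stands in for that here):

```python
text = "日本語"
stored = text.encode("utf-8")       # the bytes actually saved in the page

garbage = stored.decode("latin-1")  # reader assumes a Western encoding
correct = stored.decode("utf-8")    # reader uses the declared encoding

print(len(stored))      # 9 (three bytes per character)
print(garbage == text)  # False
print(correct == text)  # True
```

The bytes never change; only the reader's assumption about them does, which is why fixing the encoding setting fixes the display.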
Let me know if that helps or if you still can't get it going.
Cheers,
Eliot
1/13/2004, 3:47 pm
SquidLover
Registered user
Global user
Registered: 01-2004
Posts: 8
Karma: 0 (+0/-0)
Re: Help with Japanese HTML
I also had many problems mailing to sanriotown.com (from yahoo.com). I didn't get a mail bounce when I mailed eviloverlord@helloykitty.com, but then I wasn't sure the mail actually got through
1/14/2004, 12:41 am
damagedmoderator
Head Administrator
Global user
Registered: 09-2003
Posts: 93
Karma: 0 (+0/-0)
Re: Help with Japanese HTML
hi!
as for the hellokitty email, i'm sorry that it isn't working for you guys.
it isn't working for me either. some days i am totally unable to log in at all.
i'm basically going to say, '**** it'.
you can reach me at
somewhatcurious@hotmail.com
from now on.
and, drmacro, thanks so much for some really substantial tech advice!!
i can't wait to try out your advice, but first let me say . .
. . . right now i'm running windows XP (home version)
and dreamweaver 2.0 is my HTML editor.
but
i have no idea what i am using to input japanese text. i guess i'm using whatever utility comes automatically with XP.
so, my question right now is:
how do i get Unicode?
can i download it from somewhere?
or, if i'm using XP, am i already using Unicode?
i know that's kind of a dumb question: how could i not know what program i'm using??
but my friend set the Japanese language utility up for me, and he's canadian.
also, all the japanese webpages i've seen
(which display correctly) contain this HTML in their head:
<meta HTTP-EQUIV="Content-type" CONTENT="text/html; charset=Shift_JIS">
naturally, i changed my page's header to also read Shift_JIS. it made no difference!
so, i also want to ask, is Shift_JIS different than unicode?
wait, don't answer that.
what i mean is, if i get the Unicode thingy, and i use that to enter my Japanese text. . .
should i change my header so it reads
<meta HTTP-EQUIV="Content-type" CONTENT="text/html; charset=Unicode/f-8">
instead of
<meta HTTP-EQUIV="Content-type" CONTENT="text/html; charset=Shift_JIS">
?
also thanks for not making fun of me for using HTML editors in the first place. this is irritating and snobby, particularly when the person clowning me for 'not knowing how to code' is giving useless advice.
1/14/2004, 11:53 am
drmacro
Re: Help with Japanese HTML (part 1 of 2)
[Sorry for the long delay in responding--for some reason I expected responses to be sent to me by email so I didn't check back to this site until now.]
> i have no idea what i am using to input japanese text.
> i guess i'm using whatever utility comes automatically with XP.
Windows lets you set up locale-specific input methods (IMEs), which might be what you're using, or maybe you're using something else--I'm a little out of my depth when it comes to Japanese-localized computers. I know that the Chinese IME I'm using produces Unicode characters, but I'm not sure why--probably a function of how your operating system is configured. But in any case...
> so, my question right now is:
> how do i get Unicode?
> can i download it from somewhere?
> or, if i'm using XP, am i already using Unicode?
> i know that's kind of a dumb question: how could i not know what program i'm using??
Not a dumb question at all--I see that I didn't really provide enough context or background on what Unicode is.
Unicode is a character set, which is nothing more than a set of conventions for how programs interpret the bytes in a file as characters. For example, if a program is reading a file as text and sees the byte 0x41 (that is, the sequence of bits whose numeric value is 41 hex) it will interpret that byte as a character--the question is, which one? There's no single standard for how bytes are interpreted as characters--there are many that have been used over the years: ASCII, EBCDIC ("eb-si-dik"), and any number of language- and country-specific character sets. And of course the file might not be text at all--it might be some binary format. But if you tell your editor to read it as text it will do the best it can. This is why when you open something like a GIF file in a text editor it looks like garbage--the editor is blindly interpreting the bytes as characters. It's also why you may sometimes see understandable text in such a file--some of the bytes are in fact text and so will be interpreted correctly by the editor.
This means that the program reading the file needs to know what character set the data uses. In the case of 0x41, if the character set is ASCII, the resulting character is "A" (latin capital letter A). If the character set is EBCDIC it's not a letter at all, because basic EBCDIC doesn't assign a printable character to 0x41; "A" in EBCDIC is at 0xC1. The mapping from bytes to characters is completely arbitrary. In addition, there's no well-defined way for a file to declare what character set it uses. This is because in the past a given computer only used one character set, so it was never an issue: PCs used ASCII, IBM mainframes used EBCDIC. Before the Web or anything like it there was little opportunity for data to move from one computer to another, much less from one platform to another, much less from a computer localized to one language to a computer localized to a different language.
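Here is a small Python sketch of the same point. It assumes Python's "cp037" codec as a stand-in for EBCDIC (cp037 is one of several EBCDIC code pages, so treat it as an example rather than "the" EBCDIC):

```python
byte = bytes([0x41])

as_ascii = byte.decode("ascii")   # what 0x41 means under ASCII
a_in_ascii = "A".encode("ascii")  # where "A" lives in ASCII
a_in_ebcdic = "A".encode("cp037") # where "A" lives in one EBCDIC code page

print(as_ascii)           # A
print(a_in_ascii.hex())   # 41
print(a_in_ebcdic.hex())  # c1
```

The letter is the same in both cases; only the number assigned to it differs.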
Modern computer networks changed that, of course, and the Web really changed it, leading to the problem you now face: how to create documents in a national language that has limited scope of common use such that anyone in the world can reliably view them?
Unicode is an attempt to address this issue by defining a single character set that accommodates all modern human writing systems (and by implication, all modern human languages with a written form).
However, languages like Japanese needed a solution long before Unicode was developed, so we have character sets like shift-JIS, which is an ASCII-based character set that provides a way to encode all the non-Latin characters used in Japanese.
The technical challenge with character sets for Asian languages is that the original character sets like ASCII and EBCDIC, which were developed in the 60's, were only designed to accommodate Latin-based languages (because computers were developed in the West and early computers had a hard enough time handling upper and lower case, much less 40,000 ideographic characters, and computer programmers are, at least historically, culturally insensitive bastards who just didn't care. It probably didn't help that China, the single largest potential market for computers, was solidly Communist and more or less at war with the West during the formative days of the computer age and Japan and Korea were still recovering from their respective wars. If IBM could have sold business computers to China in 1960 they might have addressed the issue of ideographic character sets much earlier. But I digress).
ASCII and EBCDIC are both single-byte character sets, meaning that they use exactly one byte for each character, which means they can represent at most 256 characters (ASCII proper defines only 128). More than enough for latin alphabets, punctuation, and control codes, but obviously not going to work for Japanese, Chinese, Korean, and similar languages.
The solution back in the day was to extend ASCII by using some sort of magic byte sequence to indicate either a multi-byte character or a shift from single-byte data to multi-byte data (thus shift-JIS). However, each of these character sets is still language specific: there are several for Japanese, several for Chinese, etc. It's a mess.
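A quick Python sketch of what that looks like for shift-JIS, using Python's built-in "shift_jis" codec. The Latin letter stays a single ASCII-compatible byte, while the Japanese character becomes two bytes, the first of which signals that a second byte follows:

```python
text = "Aあ"
encoded = text.encode("shift_jis")

print(len(text))      # 2 characters
print(len(encoded))   # 3 bytes
print(encoded.hex())  # 4182a0  (41 = "A", 82 a0 = "あ")
```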
Unicode cleans up the mess (or at least attempts to) by defining a multi-byte character set that is big enough (64K characters in the core character set) to accommodate everybody. The downside is that software has to be able to interpret and manage Unicode-based text, which was a challenge a few years ago, but no longer is, as modern operating systems and programming languages, including the Windows NT line (NT, 2000, XP), Linux, Java, Python, VB, and so on, have built-in Unicode support, meaning that text handling can be done using Unicode, in many cases by default.
But for Japanese computers, I suspect that the use of shift-JIS and other Japanese-specific character sets is so entrenched that most computers in Japan, and therefore most Japanese-specific software, use shift-JIS as their primary or default character set. Doh!
To sum up so far:
- A character set is a specification that defines an arbitrary mapping from numbers ("codes") to characters ("the concept of the latin capital letter A")
- There are many different character sets in use today, many of which are locale-specific
- The character set of a text file is a function of how the file was created and saved.
- There's no 100% reliable way to know for any random text file what its character encoding is, although there are ways to guess. HTML and XML both define ways, within the file, to declare the encoding, but this still requires you to guess at the encoding until you stumble on one that lets you recognize the encoding declaration as an encoding declaration.
- For Japanese there are three main character sets you'll have to work with: ASCII, shift-JIS, and Unicode.
- For everyone *except* Japanese people with Japanese-localized computers, Unicode is the encoding most likely to work with things like Web browsers and text editors. For Japanese people shift-JIS is the most likely to work (because it's what they've been using for decades).
Finally, there is the subject of fonts:
A computer font file is nothing more than a table that maps numeric codes to pictures, which we normally think of as characters but which are technically "glyphs".
Note that a font *does not* map characters to glyphs, it maps numbers to glyphs. But since a given font file is a one-to-one mapping of codes to glyphs and a character set is a one-to-one mapping of codes to characters, we can get a mapping of characters to glyphs by coordinating the codes in the font with the codes in the character set you want the font to support.
For example, in a normal ASCII font like Arial, the glyph at code 0x41 is a picture that we will recognize as a visual representation of the character "A". A given character may have any number of glyph representations, as provided by different fonts.
This means that fonts are specific to character sets. Thus you have "ASCII fonts", "Unicode fonts", and so on, meaning fonts whose glyphs correspond to the appropriate characters in the character set. Fonts may also be language-specific, both for ASCII fonts and for Unicode fonts.
For ASCII-based character sets you need multiple fonts in order to render all the characters in a non-Latin language, by taking advantage of the fact that a font maps codes to glyphs, not characters to glyphs. For example (and I'm just making the details up here), an Arabic-specific ASCII encoding might use code 0xD4 for the Arabic character Alef. In the Arabic ASCII font, the glyph at code 0xD4 would be an Alef, not whatever it is in the base ASCII character set (O with a hat, I think).
For Unicode characters, a full Unicode font, that is a font with a glyph for every one of the 64K characters in the base character set, would be huge. It's easier to have language-specific fonts that only have glyphs for the characters a given language needs. Thus you have fonts like MS Mincho and MS Gothic, which are Japanese-specific Unicode fonts. Japanese-specific because they contain glyphs for hiragana, katakana, and kanji, where the kanji characters reflect Japanese conventions, not Chinese conventions, for drawing the characters.
In both cases, locale-specific ASCII and Unicode, you have to know which font to use for text in a given national language, but with Unicode the problem is much easier because you just have to pick the right font--you don't have to also know which of nine or ten font files have to be used for a given language in order to produce the right glyphs for each character.
[Next: solving the actual problem at hand]
3/20/2004, 6:26 pm
drmacro
Re: Help with Japanese HTML (part 2 of 2)
This means that it's much easier for software like Web browsers and editors to support the display of Unicode characters than the display of old-style locale-specific character sets because it's just a matter of choosing the right font, which can be as simple as having a language-to-font mapping built-in, or lacking that, looking through the fonts at hand until you find one with a glyph for the character--because each character can only have one meaning in Unicode, you know that if you find a glyph for that character code in a Unicode font it's probably the right glyph.
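The "look through the fonts at hand" idea can be sketched as a toy lookup in Python. Everything here is hypothetical: the font names and their glyph tables are made up for illustration, and real browsers use far more elaborate font matching.

```python
# Hypothetical toy "fonts": each maps Unicode code points to a glyph name.
FONTS = {
    "Latin font": {0x41: "glyph for A"},
    "Japanese font": {0x41: "glyph for A", 0x3042: "glyph for hiragana A"},
}

def find_font(ch, fonts=FONTS):
    """Return the name of the first font with a glyph for ch, or None."""
    code = ord(ch)
    for name, glyphs in fonts.items():
        if code in glyphs:
            return name
    return None  # a browser would draw the "missing glyph" box here

print(find_font("A"))   # Latin font
print(find_font("あ"))  # Japanese font
print(find_font("한"))  # None
```

Because a Unicode code has only one meaning, any font that has a glyph at that code is a reasonable candidate, which is what makes this simple search workable.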
Thus my admonition to use Unicode for your Web pages--it's about the only way you can hope that most viewers will have a prayer of seeing the content.
> also, all the japanese webpages i've seen
> (which display correctly) contain this HTML in their head:
> <meta HTTP-EQUIV="Content-type" CONTENT="text/html; charset=Shift_JIS">
This is almost certainly because native Japanese computers use shift-JIS by default, including yours, and your browser is probably set up correctly to handle shift-JIS, including knowing how to map shift-JIS characters to fonts.
> naturally, i changed my page's header to also read Shift_JIS. it made no difference!
Probably because the data is not in shift-JIS or because your browser was not set to display the content as Shift-JIS.
In IE, if you go to the View menu, you'll see the "Encoding" menu. This lets you set the encoding that the browser uses to interpret the Web page you're viewing. On my computer it's set to Western-European (Windows), which means the Windows variant of ASCII used for European languages. But I have other options, including Unicode and Shift-JIS. Sometimes the browser fails to detect the encoding correctly or the Web page lies about its encoding.
So one thing to do is to try different encodings until you find one that works--whichever one works is almost certainly the encoding the file is in.
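The same trial-and-error can be automated. This is a rough Python sketch, assuming only three candidate encodings and treating "decodes without error" as a hint, not proof (some byte sequences decode cleanly under more than one encoding, so the order of the candidates matters):

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "shift_jis")):
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot be right; try the next one
        return name
    return None  # none of the candidates fit

print(guess_encoding("hello".encode("ascii")))      # ascii
print(guess_encoding("日本語".encode("utf-8")))      # utf-8
print(guess_encoding("日本語".encode("shift_jis")))  # shift_jis
```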
Another useful tool is Unipad, www.unipad.org, which is a Unicode text editor. It has full support for converting to and from different encodings and has a nice Unicode character set browser.
> so, i also want to ask, is Shift_JIS different than unicode?
> wait, don't answer that.
Too late.
> what i mean is, if i get the Unicode thingy, and i use that to enter my Japanese text. . .
> should i change my header so it reads
> <meta HTTP-EQUIV="Content-type" CONTENT="text/html; charset=Unicode/f-8">
> instead of
> <meta HTTP-EQUIV="Content-type" CONTENT="text/html; charset=Shift_JIS">
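Yes, with one fix: the charset value needs to be exactly "utf-8" (as in the meta element in my first post), not "Unicode/f-8". And now you know why the answer is "yes". Whether you wanted to or not (and I'm guessing not).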
>also thanks for not making fun of me for using HTML editors in the first place. this is irritating and snobby, particularly when the
> person clowning me for 'not knowing how to code' is giving useless advice
I actually grew up typing tags by hand long before HTML was even invented and I depend on graphical editors for doing day-to-day work. While I might have technical quibbles with the quality of the markup generated by some HTML authoring tools, I can't fault people for using them. Just because you know the code doesn't mean you should create your HTML entirely in edlin. I mean really.
Cheers,
Eliot
3/20/2004, 6:27 pm
drmacro
Re: Help with Japanese HTML
Here's another data point.
Reading the latest page on the TDR I noticed that there was a character that wasn't rendering correctly--in my browser I was seeing a box (the "missing glyph" box) and "@".
So I checked the encoding setting in IE and it was set to Western European. When I set it to shift-JIS then I saw the character (left single quotation mark). I then checked the HTML source and sure enough it sets the charset to "shift-jis" in the meta element, but for whatever reason my browser didn't respect that.
So that could be a big part of the problem: browsers not correctly detecting the declared encoding for the pages.
I did an experiment by saving the page as shift-jis on my computer and then seeing what IE did and it correctly detected the encoding and forced the browser to use shift-JIS, which indicates that the base HTML is correctly declaring the encoding. So it appears that maybe something else the page is using is interfering with the browser-side encoding selection.
Also, the fact that the page renders correctly on my machine is not indicative of how it will display for the average North American Windows user, as I install the regional support for all languages as a matter of practice, so it's likely that I have, for example, the necessary Shift-JIS fonts installed that most people probably wouldn't.
Cheers,
E.
3/20/2004, 8:25 pm
damagedmoderator
Re: Help with Japanese HTML
you raise a most excellent point!
the browsers DON'T parse my pages' 'shift-jis' encoding command.
this is ALSO the reason why i can't write in japanese on my page. i've been trying to write in japanese since december.
but get this: when my japanese friend uploaded the same exact test page to HIS (JAPANESE) server, the browsers DID parse the 'shift-jis' command.
the server is in germany.
my friend who runs the server insists he's pasted the right commands into the server's brain (commands telling it to use shift-jis encoding). but still, no luck.
so, could it be that there is something on my dreamweaver code messing things up?
and do you know what the difference between a japanese server and a german server is? like, why does one mess me up and the other, not?
3/23/2004, 2:43 pm
drmacro
Re: Help with Japanese HTML
I suspect it's something dynamic that the page calls in like a script or something that is hosing stuff up. I think this because when I saved a page as "html only" and then opened that saved version, the shift-jis encoding was correctly handled, and I hadn't changed anything in the markup, nor could I see any obvious problem with it.
When I'm faced with this sort of problem I usually start taking stuff away until the failure stops--the last thing you changed is usually the culprit (but not always--sometimes it's some freaky interaction between stuff that makes no sense).
The fact that you're using DreamWeaver makes it harder simply because you have less control over the details of the markup and who knows what twisted things it might do in order to carry out your wishes.
And there's always the possibility that it's some really stupid IE bug.
I just went to your site with Mozilla 1.5 and it correctly set the encoding to shift-jis, so it's starting to feel like an IE bug and/or interaction with something else your page is doing. (Did I mention that I hate Microsoft with the fiery passion of a thousand burning suns?)
Cheers,
E.
3/26/2004, 2:24 am
ccwf
Registered user
Global user
Registered: 02-2004
Location: Malibu
Posts: 1
Karma: 0 (+0/-0)

Re: Help with Japanese HTML
quote: drmacro wrote:
I then checked the HTML source and sure enough it sets the charset to "shift-jis" in the meta element, but for whatever reason my browser didn't respect that.
Browsers are not required to respect the meta element. The proper way to set the encoding is in the Content-Type HTTP header, which browsers are required to obey.
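A rough Python sketch of that precedence, for illustration only. The function names are made up, the meta scan is a crude regular expression rather than a real HTML parser, and actual browsers apply more rules than this:

```python
import re
from email.message import Message

def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent

def charset_from_meta(html):
    """Very rough scan for charset=... inside the page itself."""
    match = re.search(r'charset=["\']?([\w-]+)', html, re.IGNORECASE)
    return match.group(1).lower() if match else None

def effective_charset(content_type, html):
    # The HTTP header wins; the meta element is only a fallback.
    return charset_from_header(content_type) or charset_from_meta(html)

page = '<meta HTTP-EQUIV="Content-type" CONTENT="text/html; charset=Shift_JIS">'
print(effective_charset("text/html; charset=ISO-8859-1", page))  # iso-8859-1
print(effective_charset("text/html", page))                      # shift_jis
```

If a server were sending a header like the first one, a Shift_JIS meta element in the page would be overridden. That would be consistent with the same page behaving differently on two servers, though it is not proof that this is what happened here.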
11/20/2004, 8:59 am
Jaguarstrike Resurrection
Registered user
Global user
Registered: 12-2004
Posts: 1
Karma: 0 (+0/-0)
Re: Help with Japanese HTML
too bad i speak no jap.
Last edited by Jaguarstrike Resurrection, 12/20/2004, 6:27 pm
12/20/2004, 6:26 pm