Unicode UTF-8 encoding


Unicode has been developed to describe all possible characters of all languages plus a lot of symbols with one unique number for each character/symbol. Unicode as defined by the Unicode organization has become a universal standard: ISO/IEC 10646, describing the 'Universal Multiple-Octet Coded Character Set' (UCS).
It is not always possible to transfer a Unicode character to another computer reliably. For that reason a special encoding scheme has been developed, UTF-8, which stands for UCS Transformation Format 8.
On this page you will find an overview of the UTF-8 encoding scheme.

This page is encoded as x-mac-roman. It is possible that your browser does not support this and falls back to iso-8859-1 or your default charset. In that case the literal characters of the table will be displayed incorrectly. That's no problem if you are only interested in the conversion algorithms for UTF-8 on this page.

Explanation of the table
ch | dec | hx | U-hex | U-dec | UTF-dec | UTF-hx | lit | Unicode name | PostScript name
™ | 170 | AA | 2122 | 8482 | 226.132.162 | E284A2 | â„¢ | TRADE MARK SIGN | trademark

The meaning of the columns is as follows:
ch: the specified character as literal, which is probably not displayed correctly
dec: the decimal value (code position) of the character in the 1-byte platform encoding
hx: the hexadecimal value of the character
U-hex: the Unicode value in hexadecimal
U-dec: the Unicode value in decimal
UTF-dec: the UTF8-encoded bytes as decimal numbers
UTF-hx: the UTF8-encoded bytes as hexadecimal numbers
lit: the UTF8-encoded characters as literals, which are probably not displayed correctly in the absolute sense, but are displayed 'as seen' by your browser
Unicode name: the full Unicode name of the character
PostScript name: the PostScript name of the character if this name exists
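
As an illustration, most columns of a row can be derived in Python from a single Mac Roman byte value. This is only a sketch: the function name `table_row` is made up for this example, it relies on Python's built-in `mac_roman` and `cp1252` codecs, and it leaves out the PostScript name, which cannot be computed from the code position.

```python
import unicodedata

def table_row(byte_value, encoding="mac_roman"):
    """Build the table columns for one 8-bit character code (hypothetical helper)."""
    ch = bytes([byte_value]).decode(encoding)   # ch: the literal character
    utf8 = ch.encode("utf-8")                   # the UTF-8 encoded bytes
    return {
        "ch": ch,
        "dec": byte_value,
        "hx": format(byte_value, "02X"),
        "U-hex": format(ord(ch), "04X"),
        "U-dec": ord(ch),
        "UTF-dec": ".".join(str(b) for b in utf8),
        "UTF-hx": utf8.hex().upper(),
        # lit: the UTF-8 bytes as a Windows (cp1252) viewer would see them;
        # bytes that cp1252 leaves undefined become a replacement character
        "lit": utf8.decode("cp1252", errors="replace"),
        "Unicode name": unicodedata.name(ch, ""),
    }

row = table_row(170)
for column, value in row.items():
    print(column, "=", value)
```

Passing `encoding="cp1252"` gives the corresponding columns for a Windows code position.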

Let us take for example the trademark sign, which looks something like a higher positioned TM.
On a Macintosh you can produce this sign by taking character position number 170. On a Windows computer this is position 153. Unicode is the same for all users and in this scheme the trademark sign can be found at position 2122 hexadecimal, which is the same as 8482 decimal.
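
The two platform positions can be checked with a few lines of Python. This is a small sketch that assumes Python's built-in `mac_roman` and `cp1252` codecs stand in for the Macintosh and Windows encodings.

```python
mac = bytes([170]).decode("mac_roman")   # Macintosh position 170
win = bytes([153]).decode("cp1252")      # Windows position 153

print(mac == win)                        # True: both are the trademark sign
print(format(ord(mac), "X"), ord(mac))   # Unicode position: 2122 8482
```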
On a webpage you could try to encode this character like &trade; but the Macintosh browsers are not able to reproduce many of those 'entities'. If your reader has a version 4 browser or better, the best thing you can do is encode the trademark sign with a Unicode entity like &#8482;. In a suitable META-tag your page has to be defined as a UTF-8 page. This is explained in detail on the page with entity tips.
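
To see what such entities stand for, they can be resolved with Python's standard `html` module. This is only an illustration of the named and numeric forms, not a statement about what any particular browser does; the variable names are chosen for this example.

```python
import html

named = html.unescape("&trade;")          # named entity
numeric = html.unescape("&#8482;")        # decimal numeric entity
hexadecimal = html.unescape("&#x2122;")   # hexadecimal numeric entity

print(named == numeric == hexadecimal)    # True
print(ord(named))                         # 8482
```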
If you write an email with, for instance, Microsoft Outlook Express and let the mailer encode your letter as UTF-8, then Outlook Express converts the trademark sign to a UTF-8 code. In this case the result is a combination of three bytes with numerical values 226, 132 and 162. Which characters you see on your screen without UTF-8 decoding depends on your platform. A Macintosh user would see a comma-like low quotation mark, followed by a capital N with a tilde and a cent sign. A Windows viewer would see a letter a with a circumflex, followed by a double low comma-like quotation mark and a cent sign.
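
The effect can be reproduced in Python by encoding the trademark sign as UTF-8 and then decoding the bytes with the wrong charset. The sketch assumes the built-in `mac_roman` and `cp1252` codecs for the two platforms.

```python
utf8 = "\u2122".encode("utf-8")          # the trademark sign as UTF-8

print(list(utf8))                        # [226, 132, 162]
as_mac = utf8.decode("mac_roman")        # what a Macintosh viewer sees
as_win = utf8.decode("cp1252")           # what a Windows viewer sees
print(as_mac)                            # ‚Ñ¢
print(as_win)                            # â„¢
```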
How did the encoding program get these numbers?

UTF-8 encoding
The proper way to convert between UCS-4 and UTF-8 is to use bitmask (and, or) and bitshift operations. But if you would like to convert only a couple of characters by hand or if your program development environment (scripting language) does not support bit operations, then integer division and multiplication can be used as follows.
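
For comparison, the bitmask and bitshift approach could look like the following Python sketch. The function name `utf8_encode_bits` is made up for this example, and the sketch stops at 4-byte sequences (code positions up to 10FFFF hex), which is an assumption of the example rather than something the tables on this page require.

```python
def utf8_encode_bits(cp):
    """Encode one code position with shifts and masks (sketch, at most 4 bytes)."""
    if cp < 0:
        raise ValueError("code position must not be negative")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code position too large for this sketch")

print(list(utf8_encode_bits(0x2122)))    # [226, 132, 162]
```

Shifting right by 6 bits is the same as integer division by 64, and masking with 3F hex is the same as taking the remainder after division by 64.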

From Unicode UCS-4 to UTF-8:
Start with the Unicode number expressed as a decimal number and call this ud.

If ud <128 (7F hex) then UTF-8 is 1 byte long, the value of ud.

If ud >=128 and <=2047 (7FF hex) then UTF-8 is 2 bytes long.
   byte 1 = 192 + (ud div 64)
   byte 2 = 128 + (ud mod 64)

If ud >=2048 and <=65535 (FFFF hex) then UTF-8 is 3 bytes long.
   byte 1 = 224 + (ud div 4096)
   byte 2 = 128 + ((ud div 64) mod 64)
   byte 3 = 128 + (ud mod 64)

If ud >=65536 and <=2097151 (1FFFFF hex) then UTF-8 is 4 bytes long.
   byte 1 = 240 + (ud div 262144)
   byte 2 = 128 + ((ud div 4096) mod 64)
   byte 3 = 128 + ((ud div 64) mod 64)
   byte 4 = 128 + (ud mod 64)

If ud >=2097152 and <=67108863 (3FFFFFF hex) then UTF-8 is 5 bytes long.
   byte 1 = 248 + (ud div 16777216)
   byte 2 = 128 + ((ud div 262144) mod 64)
   byte 3 = 128 + ((ud div 4096) mod 64)
   byte 4 = 128 + ((ud div 64) mod 64)
   byte 5 = 128 + (ud mod 64)

If ud >=67108864 and <=2147483647 (7FFFFFFF hex) then UTF-8 is 6 bytes long.
   byte 1 = 252 + (ud div 1073741824)
   byte 2 = 128 + ((ud div 16777216) mod 64)
   byte 3 = 128 + ((ud div 262144) mod 64)
   byte 4 = 128 + ((ud div 4096) mod 64)
   byte 5 = 128 + ((ud div 64) mod 64)
   byte 6 = 128 + (ud mod 64)

The operation div means integer division and mod means the rest after integer division.
For positive numbers a div b = int(a/b) and a mod b = (a/b-int(a/b))*b.
UTF-8 sequences of 4 bytes and longer are at the moment not supported by the regular browsers.
The highest character position currently (Unicode 3.2) defined is number 10FFFF hex (1114111 dec) in a 'private use' area. The highest character with an actual glyph is number E007F hex (917631 dec), the CANCEL TAG character.
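
The div/mod rules above can be sketched in Python, where `//` is integer division and `%` is the remainder. The names `RANGES` and `utf8_encode_divmod` are invented for this example. Instead of writing out every formula, the loop peels off the remainders after division by 64 from the right, which gives the same bytes.

```python
# Ranges from the text: (highest ud, number of bytes, offset of byte 1).
RANGES = [
    (0x7F, 1, 0),
    (0x7FF, 2, 192),
    (0xFFFF, 3, 224),
    (0x1FFFFF, 4, 240),
    (0x3FFFFFF, 5, 248),
    (0x7FFFFFFF, 6, 252),
]

def utf8_encode_divmod(ud):
    """Return the UTF-8 byte values for the decimal Unicode number ud."""
    if ud < 0:
        raise ValueError("ud must not be negative")
    for limit, length, offset in RANGES:
        if ud <= limit:
            break
    else:
        raise ValueError("ud is larger than 7FFFFFFF hex")
    result = []
    for _ in range(length - 1):
        result.append(128 + ud % 64)   # last remaining byte: 128 + (ud mod 64)
        ud //= 64                      # continue with ud div 64
    result.append(offset + ud)         # byte 1: offset + what is left
    result.reverse()
    return result

print(utf8_encode_divmod(8482))        # [226, 132, 162]
```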

From UTF-8 to Unicode UCS-4:
Let's take a UTF-8 byte sequence. The first byte in a new sequence will tell us how long the sequence is. Let's call the subsequent decimal bytes z y x w v u.

If z is between and including 0 - 127, then there is 1 byte z. The decimal Unicode value ud = the value of z.

If z is between and including 192 - 223, then there are 2 bytes z y; ud = (z-192)*64 + (y-128)

If z is between and including 224 - 239, then there are 3 bytes z y x; ud = (z-224)*4096 + (y-128)*64 + (x-128)

If z is between and including 240 - 247, then there are 4 bytes z y x w; ud = (z-240)*262144 + (y-128)*4096 + (x-128)*64 + (w-128)

If z is between and including 248 - 251, then there are 5 bytes z y x w v; ud = (z-248)*16777216 + (y-128)*262144 + (x-128)*4096 + (w-128)*64 + (v-128)

If z is 252 or 253, then there are 6 bytes z y x w v u; ud = (z-252)*1073741824 + (y-128)*16777216 + (x-128)*262144 + (w-128)*4096 + (v-128)*64 + (u-128)

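If z is between and including 128 - 191, then z is a continuation byte and cannot be the first byte of a sequence.

If z = 254 or 255 then there is something wrong!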
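
The decoding rules can be sketched in Python in the same arithmetic style. The names `FIRST_BYTE` and `utf8_decode_first` are invented for this example. The sketch also rejects following bytes outside 128 - 191, but it does not check for needlessly long encodings of small numbers.

```python
# First-byte ranges from the text: (lowest z, highest z, sequence length).
FIRST_BYTE = [
    (0, 127, 1),
    (192, 223, 2),
    (224, 239, 3),
    (240, 247, 4),
    (248, 251, 5),
    (252, 253, 6),
]

def utf8_decode_first(data):
    """Decode the first sequence in data; return (ud, number of bytes used)."""
    z = data[0]
    for low, high, length in FIRST_BYTE:
        if low <= z <= high:
            break
    else:
        raise ValueError(f"{z} cannot start a UTF-8 sequence")
    if len(data) < length:
        raise ValueError("the sequence is cut short")
    ud = z - low                       # e.g. z-224 for a 3-byte sequence
    for b in data[1:length]:
        if not 128 <= b <= 191:
            raise ValueError(f"{b} is not a continuation byte")
        ud = ud * 64 + (b - 128)       # builds up the sum of powers of 64
    return ud, length

print(utf8_decode_first([226, 132, 162]))   # (8482, 3)
```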

Example: take the decimal Unicode value 8482, which is the trademark sign. This number lies between 2048 and 65535, so we get three numbers.
The first number is 224 + (8482 div 4096) = 224 + 2 = 226.
The second number is 128 + ((8482 div 64) mod 64) = 128 + (132 mod 64) = 128 + 4 = 132.
The third number is 128 + (8482 mod 64) = 128 + 34 = 162.
Now the other way round. We see the numbers 226, 132 and 162. What is the decimal Unicode value?
In this case: (226-224)*4096+(132-128)*64+(162-128) = 8482.
And the conversion between hexadecimal and decimal? Come on, this is not a math tutorial! In case you don't know, use a calculator.
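
If a scripting language is at hand, it can serve as that calculator. In Python, for example, the built-in functions `int`, `format` and `hex` convert between the two notations:

```python
from_hex = int("2122", 16)     # hexadecimal text to a number
to_hex = format(8482, "X")     # number to hexadecimal text

print(from_hex)                # 8482
print(to_hex)                  # 2122
print(hex(8482))               # 0x2122
```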

References
More information about the UTF-8 encoding can be found at:
ISO-10646/UTF-8 encoding.
The table on the page you're looking at now follows the Apple Roman Unicode encoding.
This encoding can be found at the Unicode organization:
Apple Roman Unicode encoding.
This document describes the latest Apple character set, as used by the Apple Mac OS Text Encoding Converter software version 1.5.
Code position 0xDB is now used for the EURO SIGN, but a couple of years ago this position was used for the CURRENCY SIGN, as originally defined.
The standard Windows Roman encoding is 'code page 1252'. The Unicode definition can be found at:
Windows code page 1252, Unicode encodings

And now, what you have been waiting for: the complete UTF-8 table of the 1-byte characters.

ch | dec | hx | U-hex | U-dec | UTF-dec | UTF-hx | lit | Unicode name | PostScript name
Ä | 128 | 80 | 00C4 | 196 | 195.132 | C384 | Ä | LATIN CAPITAL LETTER A WITH DIAERESIS | Adieresis
Å | 129 | 81 | 00C5 | 197 | 195.133 | C385 | Ã… | LATIN CAPITAL LETTER A WITH RING ABOVE | Aring
Ç | 130 | 82 | 00C7 | 199 | 195.135 | C387 | Ç | LATIN CAPITAL LETTER C WITH CEDILLA | Ccedilla
É | 131 | 83 | 00C9 | 201 | 195.137 | C389 | É | LATIN CAPITAL LETTER E WITH ACUTE | Eacute
Ñ | 132 | 84 | 00D1 | 209 | 195.145 | C391 | Ñ | LATIN CAPITAL LETTER N WITH TILDE | Ntilde
Ö | 133 | 85 | 00D6 | 214 | 195.150 | C396 | Ö | LATIN CAPITAL LETTER O WITH DIAERESIS | Odieresis
Ü | 134 | 86 | 00DC | 220 | 195.156 | C39C | Ü | LATIN CAPITAL LETTER U WITH DIAERESIS | Udieresis
á | 135 | 87 | 00E1 | 225 | 195.161 | C3A1 | á | LATIN SMALL LETTER A WITH ACUTE | aacute
à | 136 | 88 | 00E0 | 224 | 195.160 | C3A0 | à | LATIN SMALL LETTER A WITH GRAVE | agrave
â | 137 | 89 | 00E2 | 226 | 195.162 | C3A2 | â | LATIN SMALL LETTER A WITH CIRCUMFLEX | acircumflex
ä | 138 | 8A | 00E4 | 228 | 195.164 | C3A4 | ä | LATIN SMALL LETTER A WITH DIAERESIS | adieresis
ã | 139 | 8B | 00E3 | 227 | 195.163 | C3A3 | ã | LATIN SMALL LETTER A WITH TILDE | atilde
å | 140 | 8C | 00E5 | 229 | 195.165 | C3A5 | Ã¥ | LATIN SMALL LETTER A WITH RING ABOVE | aring
ç | 141 | 8D | 00E7 | 231 | 195.167 | C3A7 | ç | LATIN SMALL LETTER C WITH CEDILLA | ccedilla
é | 142 | 8E | 00E9 | 233 | 195.169 | C3A9 | é | LATIN SMALL LETTER E WITH ACUTE | eacute
è | 143 | 8F | 00E8 | 232 | 195.168 | C3A8 | è | LATIN SMALL LETTER E WITH GRAVE | egrave
ê | 144 | 90 | 00EA | 234 | 195.170 | C3AA | ê | LATIN SMALL LETTER E WITH CIRCUMFLEX | ecircumflex
ë | 145 | 91 | 00EB | 235 | 195.171 | C3AB | ë | LATIN SMALL LETTER E WITH DIAERESIS | edieresis
í | 146 | 92 | 00ED | 237 | 195.173 | C3AD | í | LATIN SMALL LETTER I WITH ACUTE | iacute
ì | 147 | 93 | 00EC | 236 | 195.172 | C3AC | ì | LATIN SMALL LETTER I WITH GRAVE | igrave
î | 148 | 94 | 00EE | 238 | 195.174 | C3AE | î | LATIN SMALL LETTER I WITH CIRCUMFLEX | icircumflex
ï | 149 | 95 | 00EF | 239 | 195.175 | C3AF | ï | LATIN SMALL LETTER I WITH DIAERESIS | idieresis
ñ | 150 | 96 | 00F1 | 241 | 195.177 | C3B1 | ñ | LATIN SMALL LETTER N WITH TILDE | ntilde
ó | 151 | 97 | 00F3 | 243 | 195.179 | C3B3 | ó | LATIN SMALL LETTER O WITH ACUTE | oacute
ò | 152 | 98 | 00F2 | 242 | 195.178 | C3B2 | ò | LATIN SMALL LETTER O WITH GRAVE | ograve
ô | 153 | 99 | 00F4 | 244 | 195.180 | C3B4 | ô | LATIN SMALL LETTER O WITH CIRCUMFLEX | ocircumflex
ö | 154 | 9A | 00F6 | 246 | 195.182 | C3B6 | ö | LATIN SMALL LETTER O WITH DIAERESIS | odieresis
õ | 155 | 9B | 00F5 | 245 | 195.181 | C3B5 | õ | LATIN SMALL LETTER O WITH TILDE | otilde
ú | 156 | 9C | 00FA | 250 | 195.186 | C3BA | ú | LATIN SMALL LETTER U WITH ACUTE | uacute
ù | 157 | 9D | 00F9 | 249 | 195.185 | C3B9 | ù | LATIN SMALL LETTER U WITH GRAVE | ugrave
û | 158 | 9E | 00FB | 251 | 195.187 | C3BB | û | LATIN SMALL LETTER U WITH CIRCUMFLEX | ucircumflex
ü | 159 | 9F | 00FC | 252 | 195.188 | C3BC | ü | LATIN SMALL LETTER U WITH DIAERESIS | udieresis
† | 160 | A0 | 2020 | 8224 | 226.128.160 | E280A0 | † | DAGGER | dagger
° | 161 | A1 | 00B0 | 176 | 194.176 | C2B0 | ° | DEGREE SIGN | degree
¢ | 162 | A2 | 00A2 | 162 | 194.162 | C2A2 | ¢ | CENT SIGN | cent
£ | 163 | A3 | 00A3 | 163 | 194.163 | C2A3 | £ | POUND SIGN | sterling
§ | 164 | A4 | 00A7 | 167 | 194.167 | C2A7 | § | SECTION SIGN | section
• | 165 | A5 | 2022 | 8226 | 226.128.162 | E280A2 | • | BULLET | bullet
¶ | 166 | A6 | 00B6 | 182 | 194.182 | C2B6 | ¶ | PILCROW SIGN | paragraph
ß | 167 | A7 | 00DF | 223 | 195.159 | C39F | ß | LATIN SMALL LETTER SHARP S | germandbls
® | 168 | A8 | 00AE | 174 | 194.174 | C2AE | ® | REGISTERED SIGN | registered
© | 169 | A9 | 00A9 | 169 | 194.169 | C2A9 | © | COPYRIGHT SIGN | copyright
™ | 170 | AA | 2122 | 8482 | 226.132.162 | E284A2 | â„¢ | TRADE MARK SIGN | trademark
´ | 171 | AB | 00B4 | 180 | 194.180 | C2B4 | ´ | ACUTE ACCENT | acute
¨ | 172 | AC | 00A8 | 168 | 194.168 | C2A8 | ¨ | DIAERESIS | dieresis
≠ | 173 | AD | 2260 | 8800 | 226.137.160 | E289A0 | ≠ | NOT EQUAL TO | notequal
Æ | 174 | AE | 00C6 | 198 | 195.134 | C386 | Æ | LATIN CAPITAL LETTER AE | AE
Ø | 175 | AF | 00D8 | 216 | 195.152 | C398 | Ø | LATIN CAPITAL LETTER O WITH STROKE | Oslash
∞ | 176 | B0 | 221E | 8734 | 226.136.158 | E2889E | ∞ | INFINITY | infinity
± | 177 | B1 | 00B1 | 177 | 194.177 | C2B1 | ± | PLUS-MINUS SIGN | plusminus
≤ | 178 | B2 | 2264 | 8804 | 226.137.164 | E289A4 | ≤ | LESS-THAN OR EQUAL TO | lessequal
≥ | 179 | B3 | 2265 | 8805 | 226.137.165 | E289A5 | ≥ | GREATER-THAN OR EQUAL TO | greaterequal
¥ | 180 | B4 | 00A5 | 165 | 194.165 | C2A5 | Â¥ | YEN SIGN | yen
µ | 181 | B5 | 00B5 | 181 | 194.181 | C2B5 | µ | MICRO SIGN | mu
∂ | 182 | B6 | 2202 | 8706 | 226.136.130 | E28882 | ∂ | PARTIAL DIFFERENTIAL | partialdiff
∑ | 183 | B7 | 2211 | 8721 | 226.136.145 | E28891 | ∑ | N-ARY SUMMATION | summation
∏ | 184 | B8 | 220F | 8719 | 226.136.143 | E2888F | âˆ | N-ARY PRODUCT | product
π | 185 | B9 | 03C0 | 960 | 207.128 | CF80 | Ï€ | GREEK SMALL LETTER PI | pi
∫ | 186 | BA | 222B | 8747 | 226.136.171 | E288AB | ∫ | INTEGRAL | integral
ª | 187 | BB | 00AA | 170 | 194.170 | C2AA | ª | FEMININE ORDINAL INDICATOR | ordfeminine
º | 188 | BC | 00BA | 186 | 194.186 | C2BA | º | MASCULINE ORDINAL INDICATOR | ordmasculine
Ω | 189 | BD | 03A9 | 937 | 206.169 | CEA9 | Ω | GREEK CAPITAL LETTER OMEGA | Omega
æ | 190 | BE | 00E6 | 230 | 195.166 | C3A6 | æ | LATIN SMALL LETTER AE | ae
ø | 191 | BF | 00F8 | 248 | 195.184 | C3B8 | ø | LATIN SMALL LETTER O WITH STROKE | oslash
¿ | 192 | C0 | 00BF | 191 | 194.191 | C2BF | ¿ | INVERTED QUESTION MARK | questiondown
¡ | 193 | C1 | 00A1 | 161 | 194.161 | C2A1 | ¡ | INVERTED EXCLAMATION MARK | exclamdown
¬ | 194 | C2 | 00AC | 172 | 194.172 | C2AC | ¬ | NOT SIGN | logicalnot
√ | 195 | C3 | 221A | 8730 | 226.136.154 | E2889A | √ | SQUARE ROOT | radical
ƒ | 196 | C4 | 0192 | 402 | 198.146 | C692 | Æ’ | LATIN SMALL LETTER F WITH HOOK | florin
≈ | 197 | C5 | 2248 | 8776 | 226.137.136 | E28988 | ≈ | ALMOST EQUAL TO | approxequal
∆ | 198 | C6 | 2206 | 8710 | 226.136.134 | E28886 | ∆ | INCREMENT | Delta
« | 199 | C7 | 00AB | 171 | 194.171 | C2AB | « | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK | guillemotleft
» | 200 | C8 | 00BB | 187 | 194.187 | C2BB | » | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK | guillemotright
… | 201 | C9 | 2026 | 8230 | 226.128.166 | E280A6 | … | HORIZONTAL ELLIPSIS | ellipsis
  | 202 | CA | 00A0 | 160 | 194.160 | C2A0 |   | NO-BREAK SPACE | nobreakspace
À | 203 | CB | 00C0 | 192 | 195.128 | C380 | À | LATIN CAPITAL LETTER A WITH GRAVE | Agrave
Ã | 204 | CC | 00C3 | 195 | 195.131 | C383 | Ã | LATIN CAPITAL LETTER A WITH TILDE | Atilde
Õ | 205 | CD | 00D5 | 213 | 195.149 | C395 | Õ | LATIN CAPITAL LETTER O WITH TILDE | Otilde
Œ | 206 | CE | 0152 | 338 | 197.146 | C592 | Å’ | LATIN CAPITAL LIGATURE OE | OE
œ | 207 | CF | 0153 | 339 | 197.147 | C593 | Å“ | LATIN SMALL LIGATURE OE | oe
– | 208 | D0 | 2013 | 8211 | 226.128.147 | E28093 | – | EN DASH | endash
— | 209 | D1 | 2014 | 8212 | 226.128.148 | E28094 | — | EM DASH | emdash
“ | 210 | D2 | 201C | 8220 | 226.128.156 | E2809C | “ | LEFT DOUBLE QUOTATION MARK | quotedblleft
” | 211 | D3 | 201D | 8221 | 226.128.157 | E2809D | â€ | RIGHT DOUBLE QUOTATION MARK | quotedblright
‘ | 212 | D4 | 2018 | 8216 | 226.128.152 | E28098 | ‘ | LEFT SINGLE QUOTATION MARK | quoteleft
’ | 213 | D5 | 2019 | 8217 | 226.128.153 | E28099 | ’ | RIGHT SINGLE QUOTATION MARK | quoteright
÷ | 214 | D6 | 00F7 | 247 | 195.183 | C3B7 | ÷ | DIVISION SIGN | divide
◊ | 215 | D7 | 25CA | 9674 | 226.151.138 | E2978A | â—Š | LOZENGE | lozenge
ÿ | 216 | D8 | 00FF | 255 | 195.191 | C3BF | ÿ | LATIN SMALL LETTER Y WITH DIAERESIS | ydieresis
Ÿ | 217 | D9 | 0178 | 376 | 197.184 | C5B8 | Ÿ | LATIN CAPITAL LETTER Y WITH DIAERESIS | Ydieresis
⁄ | 218 | DA | 2044 | 8260 | 226.129.132 | E28184 | â„ | FRACTION SLASH | fraction
€ | 219 | DB | 20AC | 8364 | 226.130.172 | E282AC | € | EURO SIGN | currency
‹ | 220 | DC | 2039 | 8249 | 226.128.185 | E280B9 | ‹ | SINGLE LEFT-POINTING ANGLE QUOTATION MARK | guilsinglleft
› | 221 | DD | 203A | 8250 | 226.128.186 | E280BA | › | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK | guilsinglright
ﬁ | 222 | DE | FB01 | 64257 | 239.172.129 | EFAC81 | ï¬ | LATIN SMALL LIGATURE FI | fi
ﬂ | 223 | DF | FB02 | 64258 | 239.172.130 | EFAC82 | fl | LATIN SMALL LIGATURE FL | fl
‡ | 224 | E0 | 2021 | 8225 | 226.128.161 | E280A1 | ‡ | DOUBLE DAGGER | daggerdbl
· | 225 | E1 | 00B7 | 183 | 194.183 | C2B7 | · | MIDDLE DOT | periodcentered
‚ | 226 | E2 | 201A | 8218 | 226.128.154 | E2809A | ‚ | SINGLE LOW-9 QUOTATION MARK | quotesinglbase
„ | 227 | E3 | 201E | 8222 | 226.128.158 | E2809E | „ | DOUBLE LOW-9 QUOTATION MARK | quotedblbase
‰ | 228 | E4 | 2030 | 8240 | 226.128.176 | E280B0 | ‰ | PER MILLE SIGN | perthousand
Â | 229 | E5 | 00C2 | 194 | 195.130 | C382 | Â | LATIN CAPITAL LETTER A WITH CIRCUMFLEX | Acircumflex
Ê | 230 | E6 | 00CA | 202 | 195.138 | C38A | Ê | LATIN CAPITAL LETTER E WITH CIRCUMFLEX | Ecircumflex
Á | 231 | E7 | 00C1 | 193 | 195.129 | C381 | Ã | LATIN CAPITAL LETTER A WITH ACUTE | Aacute
Ë | 232 | E8 | 00CB | 203 | 195.139 | C38B | Ë | LATIN CAPITAL LETTER E WITH DIAERESIS | Edieresis
È | 233 | E9 | 00C8 | 200 | 195.136 | C388 | È | LATIN CAPITAL LETTER E WITH GRAVE | Egrave
Í | 234 | EA | 00CD | 205 | 195.141 | C38D | Ã | LATIN CAPITAL LETTER I WITH ACUTE | Iacute
Î | 235 | EB | 00CE | 206 | 195.142 | C38E | ÃŽ | LATIN CAPITAL LETTER I WITH CIRCUMFLEX | Icircumflex
Ï | 236 | EC | 00CF | 207 | 195.143 | C38F | Ã | LATIN CAPITAL LETTER I WITH DIAERESIS | Idieresis
Ì | 237 | ED | 00CC | 204 | 195.140 | C38C | ÃŒ | LATIN CAPITAL LETTER I WITH GRAVE | Igrave
Ó | 238 | EE | 00D3 | 211 | 195.147 | C393 | Ó | LATIN CAPITAL LETTER O WITH ACUTE | Oacute
Ô | 239 | EF | 00D4 | 212 | 195.148 | C394 | Ô | LATIN CAPITAL LETTER O WITH CIRCUMFLEX | Ocircumflex
 | 240 | F0 | F8FF | 63743 | 239.163.191 | EFA3BF |  | Apple logo | apple
Ò | 241 | F1 | 00D2 | 210 | 195.146 | C392 | Ã’ | LATIN CAPITAL LETTER O WITH GRAVE | Ograve
Ú | 242 | F2 | 00DA | 218 | 195.154 | C39A | Ú | LATIN CAPITAL LETTER U WITH ACUTE | Uacute
Û | 243 | F3 | 00DB | 219 | 195.155 | C39B | Û | LATIN CAPITAL LETTER U WITH CIRCUMFLEX | Ucircumflex
Ù | 244 | F4 | 00D9 | 217 | 195.153 | C399 | Ù | LATIN CAPITAL LETTER U WITH GRAVE | Ugrave
ı | 245 | F5 | 0131 | 305 | 196.177 | C4B1 | ı | LATIN SMALL LETTER DOTLESS I | dotlessi
ˆ | 246 | F6 | 02C6 | 710 | 203.134 | CB86 | ˆ | MODIFIER LETTER CIRCUMFLEX ACCENT | circumflex
˜ | 247 | F7 | 02DC | 732 | 203.156 | CB9C | Ëœ | SMALL TILDE | tilde
¯ | 248 | F8 | 00AF | 175 | 194.175 | C2AF | ¯ | MACRON | macron
˘ | 249 | F9 | 02D8 | 728 | 203.152 | CB98 | ˘ | BREVE | breve
˙ | 250 | FA | 02D9 | 729 | 203.153 | CB99 | Ë™ | DOT ABOVE | dotaccent
˚ | 251 | FB | 02DA | 730 | 203.154 | CB9A | Ëš | RING ABOVE | ring
¸ | 252 | FC | 00B8 | 184 | 194.184 | C2B8 | ¸ | CEDILLA | cedilla
˝ | 253 | FD | 02DD | 733 | 203.157 | CB9D | Ë | DOUBLE ACUTE ACCENT | hungarumlaut
˛ | 254 | FE | 02DB | 731 | 203.155 | CB9B | Ë› | OGONEK | ogonek
ˇ | 255 | FF | 02C7 | 711 | 203.135 | CB87 | ˇ | CARON | caron

The following characters normally cannot be displayed on a Mac. These are the Windows-specific characters and they follow the Windows code page cp1252.
Modern Macintosh browsers, however, are able to display them, especially when the user has installed the full Apple Language Kits, found on the Mac OS 9+ CD.
ch | dec | hx | U-hex | U-dec | UTF-dec | UTF-hx | lit | Unicode name | PostScript name
Š | 138 | 8A | 0160 | 352 | 197.160 | C5A0 | Å  | LATIN CAPITAL LETTER S WITH CARON | Scaron
š | 154 | 9A | 0161 | 353 | 197.161 | C5A1 | Å¡ | LATIN SMALL LETTER S WITH CARON | scaron
¦ | 166 | A6 | A6 | 166 | 194.166 | C2A6 | ¦ | BROKEN BAR | brokenbar
² | 178 | B2 | B2 | 178 | 194.178 | C2B2 | ² | SUPERSCRIPT TWO | twosuperior
³ | 179 | B3 | B3 | 179 | 194.179 | C2B3 | ³ | SUPERSCRIPT THREE | threesuperior
¹ | 185 | B9 | B9 | 185 | 194.185 | C2B9 | ¹ | SUPERSCRIPT ONE | onesuperior
¼ | 188 | BC | BC | 188 | 194.188 | C2BC | ¼ | VULGAR FRACTION ONE QUARTER | onequarter
½ | 189 | BD | BD | 189 | 194.189 | C2BD | ½ | VULGAR FRACTION ONE HALF | onehalf
¾ | 190 | BE | BE | 190 | 194.190 | C2BE | ¾ | VULGAR FRACTION THREE QUARTERS | threequarters
Ð | 208 | D0 | D0 | 208 | 195.144 | C390 | Ã | LATIN CAPITAL LETTER ETH | Eth
× | 215 | D7 | D7 | 215 | 195.151 | C397 | × | MULTIPLICATION SIGN | multiply
Ý | 221 | DD | DD | 221 | 195.157 | C39D | Ã | LATIN CAPITAL LETTER Y WITH ACUTE | Yacute
Þ | 222 | DE | DE | 222 | 195.158 | C39E | Þ | LATIN CAPITAL LETTER THORN | Thorn
ð | 240 | F0 | F0 | 240 | 195.176 | C3B0 | ð | LATIN SMALL LETTER ETH | eth
ý | 253 | FD | FD | 253 | 195.189 | C3BD | ý | LATIN SMALL LETTER Y WITH ACUTE | yacute
þ | 254 | FE | FE | 254 | 195.190 | C3BE | þ | LATIN SMALL LETTER THORN | thorn

© Oscar van Vlijmen, June 2000
URL-alias of this page: http://ovv.club.tip.nl/utf8tbl.html
Last update: 2002-12-28