Unicode Explained

Book description

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one. There are hundreds of different encoding systems for mapping characters to numbers, but Unicode promises a single mapping. Unicode enables a single software product or website to be targeted across multiple platforms, languages and countries without re-engineering. It's no wonder that industry giants like Apple, Hewlett-Packard, IBM and Microsoft have all adopted Unicode.

Containing everything you need to understand Unicode, this comprehensive reference from O'Reilly takes you on a detailed tour of the complex world of characters. For starters, it explains how to identify and classify characters, whether they're common, uncommon, or exotic. It then shows you how to type them, use their properties, and process character data robustly.

The book is divided into three parts. The first few chapters offer a tutorial presentation of Unicode and character data, giving you a firm grasp of the terminology you need to refer to the various components, including character sets, fonts and encodings, glyphs and character repertoires.

The middle part offers more detailed information about using Unicode and other character codes. It explains the principles and methods of defining character codes, describes some of the widely used codes, and presents code conversion techniques. It also discusses properties of characters, collation and sorting, line-breaking rules and Unicode encodings. The final part covers more advanced material, such as programming to support Unicode.

You simply can't afford to be without the nuggets of valuable information detailed in Unicode Explained.

Table of contents

  1. Table of Contents
  2. Preface
    1. Audience
    2. Assumptions and Approach
    3. Contents of This Book
    4. Self-Assessment Test
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Enabled
    8. How to Contact Us
    9. Acknowledgments
  3. Part I. Working with Characters
    1. Chapter 1. Characters as Data
      1. Introduction to Characters and Unicode
        1. Why Unicode?
        2. Unicode Can Be Easy
      2. What’s in a Character?
        1. Why Do We Need to Know About Characters?
        2. Characters as Units of Text
          1. Characters as abstractions
          2. Variation of appearance or different characters?
          3. Variation in shape turned into a character difference
          4. Characters and “abstract characters”
          5. Characters and other units of text
        3. Characters Versus Images
        4. Processing of Characters
        5. Giving Identity to Characters
          1. Definitions of characters in standards
          2. Annotations used to emphasize differences
          3. The representative glyphs
          4. The number and the Unicode name as identifiers
          5. Unicode is more explicit
          6. Spelling of names and the U+nnnn convention
        6. Unicode Definitions of Characters
        7. Definitions of Characters Elsewhere
        8. What’s in a Name?
        9. Should We Be Strict About the Meanings of Characters?
        10. Ambiguity Among Characters
        11. How Do I Find My Character?
        12. Which Characters Does Each Language Use?
      3. Variation of Writing Systems
      4. Glyphs and Fonts
        1. Allowed Variation of Glyphs
        2. Fonts and Their Properties
        3. Font Variation Versus Characters
        4. Fonts in Implementations
        5. Failures to Display a Character
        6. Font Embedding
      5. Definitions of Character Repertoires
        1. Formally Defined Repertoires
        2. Practical Repertoires
      6. Numbering Characters
        1. Hexadecimal Notation
        2. Numbers as Indexes
        3. Making Use of Character Numbers
      7. Encoding Characters as Octet Sequences
        1. Plain Text and Other Formats for Text
        2. Bytes and Octets
        3. Character Encodings
        4. Single-Octet Encodings
        5. Multi-Octet Encodings
        6. The “Character Set” Confusion
      8. Working with Encodings
        1. Selecting the Encoding When Saving
        2. How Encodings Should Be Detected
        3. Setting the Encoding Manually
        4. Sending Unicode Email
        5. Viewing Web Pages in Different Encodings
        6. Common Confusion: Encoding Versus Language
      9. Working with Fonts
        1. Installing Additional Support
        2. Font Support in Web Browsers
        3. Font Substitution: a Solution and a Problem
        4. Printer Fonts
        5. Finding Fonts
        6. Fonts in Web Authoring
          1. The fallback problem
          2. Effects of browser settings
      10. Summaries
        1. Summary of Definitions
        2. Summary of Concept Levels
    2. Chapter 2. Writing Characters
      1. Method Varieties
        1. A Simple Way or a Universal Way?
        2. An Overview of Methods
        3. Choosing Fonts
      2. Keyboard Variation and Settings
        1. Typing Characters—Just Pressing a Key?
        2. Keyboard Limitations and Variation
        3. Auxiliary Keys
        4. Dead Keys
      3. Virtual Keyboards
        1. A Keyboard on Screen
        2. Virtual Keys for Character Input in Forms
      4. Program Commands
        1. Copying via the Clipboard
        2. Menu Commands
          1. Insertion menu in Thunderbird
          2. Symbol (character) insertion menu in MS Word
          3. The Show Formatting (Show ¶) tool
        3. Methods Using the Alt Key on Windows
          1. The Alt-0n method
          2. The code page–specific Alt-n method
          3. The Unicode-based Alt-n method
          4. The Alt-X method
          5. The Alt-+n method
        4. Ctrl-Q and Other Methods in Emacs
      5. Character Maps
        1. Character Map in MS Word
        2. Windows Character Map
      6. Replacements on the Fly
        1. Default Replacements in MS Word
          1. Viewing and changing the rules
          2. Language dependency
          3. Autoformatting in MS Word
          4. Example: quotation marks
        2. Defining Your Own Shortcuts
      7. Special Techniques
        1. Combining Diacritic Marks
        2. Spacing Between Characters
        3. Inputting East Asian Characters
      8. Escape Sequences
        1. Examples of Escape Notations
          1. CSS
          2. PostScript
          3. RTF
          4. TeX
        2. Notations for Human Readers
        3. Explanations to Human Readers
        4. HTML, SGML, and XML Notations for Characters
          1. Character and entity references in web authoring
          2. The role and use of character and entity references
          3. Definition: character reference
          4. Definition: entity reference
          5. Entity references in HTML
          6. Character entities in XML
      9. Specialized Editors
        1. BabelPad
        2. UniPad
      10. Exercise
    3. Chapter 3. Character Sets and Encodings
      1. Good Old ASCII
        1. American Origin
        2. The ASCII Repertoire
        3. The ASCII Encoding
        4. ISO 646 and National Variants of ASCII
        5. Subsets of ASCII for Safety
        6. The Misnomer “8-bit ASCII”
      2. ISO 8859 Codes
        1. ISO 8859-1 (ISO Latin 1)
        2. Names of Encodings
        3. Other ISO 8859 Codes
      3. Windows Latin 1 and Other Windows Codes
        1. Windows Latin 1
        2. Other Windows Character Codes
      4. Other 8-bit Codes
        1. DOS Code Pages
        2. Mac Encodings
        3. EBCDIC
        4. The Cyrillic KOI8 Encodings
        5. Ad Hoc “8-bit Codes” Defined by Fonts
      5. Unicode and UTF-8
        1. The Conceptual Model: Levels of Coding
          1. The Internet (IAB) model
          2. The four-level Unicode model
          3. Transfer Encoding Syntax
        2. Encodings for Unicode
        3. Saving as Unicode
      6. Encodings for East Asian Language
        1. Vietnamese 8-bit Codes
        2. Encodings for Chinese
        3. Encodings for Japanese
        4. Encodings for Korean
      7. Converters and Transcoding
        1. Transcoding Tools
        2. Free Recode
        3. The iconv Converter
      8. Using Character Codes
        1. Repertoire Requirements
        2. Encodings and the Internet
        3. Encoding in Offline Data
        4. Common Choices of Encoding
        5. Sources of Information
        6. Exercises
          1. Testing encodings
          2. “Deciphering” text
  4. Part II. A Systematic Look at Unicode
    1. Chapter 4. The Structure of Unicode
      1. Design Principles
        1. Goals: Universality, Efficiency, Unambiguity
        2. The 10 Design Principles
        3. Unification
        4. Conformance Requirements
        5. Unicode and ISO 10646
        6. Why Go Beyond 16 Bits?
        7. Does Unicode Contain All Characters in the World?
        8. Identity of Characters
          1. Characters as elementary units of text
          2. Unicode numbers
          3. Unicode names of characters
          4. Using the names
          5. Characters used in character names
          6. Case of letters in names
          7. Notational issues
          8. UCS Sequence Identifiers (USI) and named character sequences
      2. Versions of Unicode
      3. Coding Space
        1. Planes
        2. Allocation Areas
        3. Rows and Blocks
        4. Unicode as Extension of ISO-8859-1
        5. Internal Structure of Blocks
        6. Noncharacter Code Points
        7. Classification of Code Points
        8. Surrogates
        9. Unassigned Code Points and Private Use
      4. Unicode Terms
        1. Deprecated and Obsolete Characters
        2. Digraphs
        3. Text Elements
        4. Unicode Strings
      5. Guide to the Unicode Standard
        1. Accessing the Unicode Versions
        2. What Material Constitutes the Unicode Standard?
        3. Viewing the Standard Online
        4. The Chapters of the Standard
        5. How Do I Find All the Information About a Character?
          1. The Zvon database
          2. Using Unibook
          3. Using the Unicode standard
        6. Additional Reference Material
      6. Unicode and Fonts
        1. Unicode as Plain Text
        2. Font Variants as Characters
        3. Variation Selectors
        4. Affecting Font Usage
        5. Ligatures
        6. Vowels as Marks
        7. Operations on Glyphs
        8. Unicode Versus Font Tricks
      7. Criticism of Unicode
        1. Overall Complexity
        2. Inefficiency?
        3. Is It Reasonable to Require Support for 100,000 Characters?
        4. Cultural Bias
          1. Lack of precomposed characters
          2. East Asian languages
          3. Favoring UTF-8
        5. Excessive Unification
        6. Semantic Disambiguation Frowned Upon
        7. Misleading Names of Characters
        8. Concepts and Definitions
        9. Illogical Division into Blocks
      8. Questions and Answers
        1. Where Can I Find Tools for Using Unicode?
        2. Why Do People Call Unicode a 16-Bit Code?
        3. How Can I Have a Character Added to Unicode?
        4. How Can I Check That I’ve Understood the Principles?
    2. Chapter 5. Properties of Characters
      1. Character Classification
        1. The Purposes of Classification
        2. General Category Values
        3. Use of General Category in Programming
      2. An Overview of Properties
        1. Summary of Properties
        2. Normative and Informative Properties
        3. Structure of Database Files
      3. Compositions and Decompositions
        1. The Impact of Diacritic Marks
          1. Precomposed and decomposed form
          2. Combining marks: powerful, but still poorly supported
          3. Features that are not diacritic marks
        2. Compatibility Mappings and Canonical Mappings
          1. Difference between canonical and compatibility mappings
          2. Canonical and compatibility equivalence
          3. The meaning of canonical mapping
          4. Differences in glyphs for equivalent characters
          5. How the mappings are defined
        3. Canonical Decomposition and Compatibility Decomposition
          1. Canonical decomposition
          2. Canonical Ordering Behavior
          3. Canonical equivalence
          4. Compatibility decomposition and equivalence
          5. Canonical and compatibility decomposable characters
        4. Compatibility Characters
        5. Compatibility Decomposable Characters
        6. Avoiding Compatibility Characters
        7. Compatibility Characters for Ligatures
      4. Normalization
        1. Normalization Versus Folding
        2. Overview of Normalization Forms
          1. Use of normalization forms
          2. Invariance of Basic Latin characters
        3. Normalization Form C
        4. Normalization Form KC
        5. Composition Exclusions
        6. Definition of Compatibility Decomposable Character
        7. W3C Normalization
      5. Case Properties
        1. Recognizing Uppercase, Lowercase, and Titlecase
        2. Case Mappings
        3. Case Folding in Unicode
        4. Viewing the Mappings
        5. Character Case Mappings Versus Visual Mappings
      6. Collation and Sorting
        1. Sorting Characters Versus Sorting Strings
        2. Collation and Unicode
        3. Layered Model of Collation
        4. Code Point Order Versus Collating Order
          1. Code point order is unnatural
          2. Using code point order as a fallback in definitions
          3. Code point order sorting for technical reasons
          4. Problems of legacy software
        5. Unicode Collation Algorithm
      7. Text Boundaries
      8. Directionality
        1. Writing Direction of Text
        2. Bidirectionality
        3. Directionality and Character Codes
        4. Directionality of Characters
        5. Control Characters for Directionality
        6. Bidi Mirroring
        7. Directionality in HTML and CSS
        8. Directionality of Formatting
      9. Line-Breaking Properties
        1. Conformance Criteria
        2. Characters for Special Control over Line Breaking
          1. Preventing line breaks
          2. Suggesting line break opportunities
          3. Limited support
        3. Principles of Line Breaking
        4. Emergency Breaks
        5. Unicode Line-Breaking Rules
          1. Values of the LineBreak property
          2. The format of LineBreak.txt
          3. The formal rules
          4. Applying the rules
          5. Pair table implementation
          6. Tailoring
          7. Some background and criticism
      10. Unicode Conformance Requirements
        1. An Informal Summary
        2. Notations and Terms Used in the Requirements
        3. Unassigned Code Points
        4. Interpretation
        5. Modification
        6. Character Encoding Forms
        7. Character Encoding Schemes
        8. Bidirectional Text
        9. Normalization Forms
        10. Normative References
        11. Unicode Algorithms
        12. Default Casing Operations
        13. Unicode Standard Annexes
      11. Effects on Choosing Characters
        1. Example: Some Mathematical Operators
    3. Chapter 6. Unicode Encodings
      1. Unicode Encodings in General
      2. UTF-32 and UCS-4
      3. UTF-16 and UCS-2
        1. UCS-2 Is BMP Only
        2. Surrogate Pairs in UTF-16
        3. Some Properties of UTF-16
      4. UTF-8
        1. UTF-8 Encoding Algorithm
        2. UTF-8 Versus ISO-8859-1
        3. Some Properties of UTF-8
      5. Byte Order
      6. Conversions Between Unicode Encodings
      7. Other Encodings
        1. SCSU Compression
        2. BOCU-1 Compression
        3. CESU-8
        4. Modified UTF-8
        5. Base64 Encoding of Data
        6. Quoted Printable Encoding
        7. Uuencode
        8. UTF-7
        9. UTF-1
        10. UTF-EBCDIC
        11. GB 18030, “Chinese Unicode”
        12. Punycode, Encoding for Domain Names
        13. URL Encoding
          1. Introduction: URL Encoding for form data
          2. The original URL Encoding
          3. To encode or not to encode?
          4. Generalized URL Encoding
          5. Modern, UTF-8-based URL Encoding
      8. Auto-Detecting the Encoding
      9. Choosing an Encoding
        1. Storage Requirements
        2. Efficiency of Processing
        3. Specific Limitations
        4. Favoring UTF-8 on the Internet
  5. Part III. Advanced Unicode Topics
    1. Chapter 7. Characters and Languages
      1. Writing Systems and IT
        1. Internationalization (i18n) and Related Issues
        2. Aspects of Writing and Their IT Impact
          1. Writing direction
          2. What does a language setting really set?
        3. Setting the Language in Word Processing
          1. Automatic operations on punctuation
          2. Spelling and grammar checks
          3. Determining the language of text
          4. Exercise
        4. Setting Language Preferences in Browsers
        5. Script = Writing System
          1. Categories of Scripts
          2. Need for script information
          3. Scripts and spoofing
          4. Codes and names for scripts
          5. The Script property: the script of a character
      2. Character Requirements of Languages
        1. The Impact of Character Repertoire
        2. Languages and Characters
          1. What constitutes a character?
          2. Does Unicode support all languages?
          3. Attempts at technical definitions of character requirements
          4. Which characters does a language need?
        3. Language Coverage of ISO Latin Alphabets
        4. Example: Spanish
        5. Example: French
      3. Transliteration and Transcription
        1. Solutions to Readers, Problems to Implementers
        2. Transliteration Converts Letters
        3. Transcription Converts Sounds
        4. Phonetic Transcription in IPA
        5. Transcription Inside a Script?
      4. Language Metadata
        1. Need for Language Information
        2. Methods of Determining Language
        3. Language Markup
          1. Attributes for language in HTML and XML
          2. The impact of language markup
          3. Granularity of markup
        4. Language Codes
          1. The confusion of codes
          2. ISO 639
          3. Language codes on the Internet
          4. Language codes and user interfaces
        5. Language Tags in Unicode
      5. Languages and Fonts
        1. Example: Shape of the Acute Accent
        2. Chinese Characters and Language Information
    2. Chapter 8. Character Usage
      1. Basics of Character Usage
        1. Orthography Sets Rules for Writing
        2. Typography Is About Appearance
        3. Liberal in What You Accept
        4. Conservative in What You Send
      2. ASCII (Basic Latin)
        1. Names of ASCII Characters
        2. Alphanumeric Characters
        3. Parentheses
        4. Other Graphic Characters
          1. Ampersand & (U⁠+⁠0026)
          2. Apostrophe ' (U⁠+⁠0027)
          3. Asterisk * (U⁠+⁠002A)
          4. Circumflex accent ^ (U⁠+⁠005E)
          5. Colon : (U⁠+⁠003A)
          6. Comma , (U⁠+⁠002C)
          7. Dollar sign $ (U⁠+⁠0024)
          8. Commercial at @ (U⁠+⁠0040)
          9. Equals sign = (U⁠+⁠003D)
          10. Exclamation mark ! (U⁠+⁠0021)
          11. Full stop “.” (U⁠+⁠002E)
          12. Grave accent ` (U⁠+⁠0060)
          13. Greater-than sign > (U⁠+⁠003E)
          14. Hyphen-minus “-” (U⁠+⁠002D)
          15. Less-than sign < (U⁠+⁠003C)
          16. Low line _ (U⁠+⁠005F)
          17. Number sign # (U⁠+⁠0023)
          18. Percent sign % (U⁠+⁠0025)
          19. Plus sign + (U⁠+⁠002B)
          20. Question mark ? (U⁠+⁠003F)
          21. Quotation mark " (U⁠+⁠0022)
          22. Reverse solidus \ (U⁠+⁠005C)
          23. Semicolon ; (U⁠+⁠003B)
          24. Solidus / (U⁠+⁠002F)
          25. Space “ ” (U⁠+⁠0020)
          26. Tilde ~ (U⁠+⁠007E)
          27. Vertical line | (U⁠+⁠007C)
        5. ASCII Control Characters (C0 Controls)
          1. Control characters or control codes?
          2. Types of control characters
          3. Visible symbols for control characters
          4. Summary of C0 Controls
      3. Latin-1 Supplement (ISO 8859-1)
        1. Diacritic Marks and Letters with Them
        2. Other Letters
        3. Superscript Digits (¹ ² ³) and Vulgar Fractions (¼ ½ ¾)
        4. Punctuation
        5. Currency Symbols
        6. Mathematical, Logical, and Physical Symbols
        7. Specialized Characters
      4. Other Latin Letters
      5. Other European Alphabetic Scripts
        1. Greek Script
        2. Cyrillic Script
        3. Armenian and Georgian Scripts
      6. Diacritic Marks
        1. Why Diacritic Marks?
        2. Early Approaches
        3. Coded Combinations
        4. Combining Diacritic Marks
        5. Variation in Appearance
        6. Spacing Diacritic Marks
      7. Letterlike Symbols
      8. General Punctuation
        1. Space Characters
          1. Space
          2. No-break space: use it!
          3. Fixed-width spaces: rarely used
          4. Adjusting spacing in other ways
          5. Additional no-break space characters
          6. A practical approach to thin spaces
          7. Disallowing and allowing line breaks
        2. Quotation Marks
          1. Language-specific quotation marks
          2. The apostrophe versus the single quotation mark
        3. Hyphens and Dashes
          1. Use of hyphens and dashes
          2. The soft hyphen
          3. MS Word specialties
        4. Ellipsis
        5. Angular brackets
      9. Line Structure Control
        1. Different Approaches to Line Structuring
        2. Lines and Records
        3. Methods of Coding Line Structure
        4. Editors, Word Processors, and Data Transfer
      10. Mathematical and Technical Symbols
        1. Superscripts and Subscripts
        2. The Number Forms Block
          1. Roman numerals
          2. Fractions
        3. Characters in SI Notations
          1. Conceptual levels of SI notations
          2. Notes on individual characters
          3. Letterlike symbols and the SI
      11. Other Blocks
        1. Spacing Modifier Letters
        2. Currency Symbols
        3. Phonetic Characters
        4. Specials
        5. Dingbats
        6. Summary of Blocks
    3. Chapter 9. The Character Level and Above
      1. Levels of Text Representation and Processing
        1. Plain Text, Rich Text, and Markup
          1. Plain text
          2. Rich text formats
          3. Text with markup
          4. Quasi-markup
          5. Conversion to plain text
        2. Example: Nonbreaking Hyphen
        3. Example: Formatting in Word Processing
        4. Example: HTML Markup and CSS
        5. Linear Text Versus Mathematical Notations
        6. Unicode and Mathematics
        7. Characters Outside the Repertoire
          1. Different workarounds
          2. Using a character versus using a small image
          3. Button-like symbols
          4. Using an image for esthetic reasons
        8. Selecting the Appropriate Level of Expression
        9. Subscripts and Superscripts
          1. Visual appearance of subscripts and superscripts
          2. Replacement notations for superscripts and subscripts
          3. Suggested policy on subscripting and superscripting
        10. Characters and Accessibility
          1. Characters in non-visual presentation
          2. Understandability of characters
          3. Explaining characters
      2. Characters and Markup
        1. Markup and Styling
        2. Document-wide Versus Local Decisions
        3. Unicode Versus Markup
          1. Differences between markup and plain text
          2. Characters that should not be used in marked-up text
          3. Formatting characters that may be used in marked-up text
          4. Characters with compatibility mappings
        4. Preventing Line Breaks
        5. Breaking the Flow of Text
        6. Why Not Markup in Unicode?
      3. Media Types for Text
        1. The Type text
        2. The Character Encoding
        3. The text Type Versus the application Type
        4. Subtypes of text
    4. Chapter 10. Characters in Internet Protocols
      1. Information About Encoding
        1. What Happens Without Information About Encoding
        2. Approaches to Specifying the Encoding
        3. Practical Recommendations
        4. Looking at the Headers
      2. Characters in MIME
        1. Media Types
        2. Character Encoding (“charset”) Information
        3. MIME Headers
          1. Internet message format and MIME
          2. Headers related to characters
          3. Headers for transfer encoding
          4. The Quoted-Printable (QP) transfer encoding
          5. How MIME should work
        4. Troubleshooting Examples
        5. Character Encoding on the Web
          1. Headers in HTTP
          2. Specifying the encoding in HTTP headers
          3. Which encodings can be used?
          4. HTTP versus HTML
          5. Checking the HTTP headers
          6. Server configuration
          7. Using a meta tag
          8. Resolution of conflicts
          9. The effect of XHTML
          10. Heuristics of detecting encoding
          11. Which encoding should I use?
          12. Avoiding the encoding problem
          13. The “Unicode Encoded” logo
      3. Content Negotiation and Multilingual Sites
        1. Introduction to Multilingual Web Sites
          1. Parallel versions in different languages
          2. Pages with a mix of languages
          3. Language negotiation: automatic selection of version
          4. Language versus country
        2. Links to Language Versions
        3. Writing Link Texts
        4. Language Negotiation in the HTTP Protocol
        5. Language Negotiation: the Server Side
          1. Using Multiviews
          2. Using type-map
          3. When negotiation fails
        6. Language Negotiation: the Browser Side
        7. Notes on Multilingual Sites
          1. Producing the translations
          2. Translation or different content?
          3. Indicating what is available in each language
          4. Naming the versions
          5. Language preferences and JavaScript
          6. Making use of language preferences in CGI scripts
        8. Types of Negotiation
      4. Characters in Protocol Headers
        1. The Signature Convention May Help
        2. The Q Encoding
        3. The B Encoding
        4. Summary: Dealing with Non-ASCII Characters in Headers
      5. Characters in Domain Names and URLs
        1. Internationalized Domain Names (IDN)
          1. The IDNA implementation
          2. Security threats
        2. Characters in URLs
    5. Chapter 11. Characters in Programming
      1. Characters in Computer Languages
        1. Common Escape Notations
        2. Characters in Markup Languages and CSS
          1. Characters in HTML and XML
          2. Problems in generating markup programmatically
          3. Problems in using scripts inside HTML
          4. Characters in CSS
          5. Identifiers in CSS
      2. Character and String Data
        1. Constructs and Principles of Processing Characters
        2. The FORTRAN Model: Hollerith Data
        3. The C model
          1. The character data type
          2. Strings as arrays
          3. 8-bit characters and sign extension
          4. The EOF indicator
          5. The zero byte (NUL byte) convention
          6. The null pointer
          7. Confusion around NUL, NULL, and relatives
          8. C and Unicode
        4. Unicode with 8-bit Quantities?
        5. Wide Characters
        6. Win32 APIs
        7. Multibyte Character Sets (MBCS) Versus Unicode
        8. The Perl Model
          1. Strings and characters in Perl
          2. The catenation operator “.”
          3. In Perl, double quotes mean evaluation
          4. Notations for Unicode characters
          5. Using properties of characters
        9. ECMAScript (JavaScript)
          1. String oriented
          2. The ECMAScript standard
          3. UTF-16 implied
          4. The \u escape notation
        10. PHP: Mostly Just 8 Bits
        11. Java: Rich Support to Unicode
          1. Characters, strings, objects, and methods
          2. Encodings and escape notations
          3. 16-bit characters
          4. Java identifiers
          5. Library routines
      3. The Preparedness Principle
        1. Being Prepared for Amount of Data
        2. Being Prepared for Content of Data
          1. Methods of handling unexpected characters
          2. Displaying unrecognized or undisplayable code points
          3. Default ignorable code points
        3. Table-Driven Versus Property-Driven Processing
        4. Naïve Processing
      4. Character Input and Output
        1. Character-Oriented and Line-Oriented Processing
        2. Perl I/O
        3. Java File I/O
        4. Buttons for Character Input
      5. Processing Form Data
        1. Decoding Form Data
        2. Recognizing the Encoding
        3. Avoid Oddities by Using UTF-8
        4. Using UTF-8
        5. Submitting a File
      6. Identifiers, Patterns, and Regular Expressions
        1. Identifiers
          1. Identifiers: internal or external?
          2. Traditional format of identifiers
          3. Case sensitivity
          4. The Unicode approach to identifiers
        2. Patterns
        3. Identifier and Pattern Characters
        4. Identifier Syntax
          1. Normalization
          2. Case folding
          3. Identifiers (names) in XML
        5. Alternative Identifier Syntax
        6. Pattern Syntax
        7. Regular Expressions
          1. Regexp use in programming
          2. Regexp use by end users
          3. Unicode regular expressions
          4. Basic Unicode support
          5. Examples
      7. International Components for Unicode (ICU)
      8. Using Locales
        1. The Locale Concept
        2. CLDR
          1. CLDR versus Unix/Linux/POSIX locale concept
        3. Using CLDR
        4. Internationalization and Localization
        5. CLDR Description and Data
        6. Problems with Aspects of Localization
  6. Appendix. Tables for Writing Characters
    1. Additional Notes
      1. Coverage
      2. Ordering
      3. Specific Notes
      4. Mapping from Symbol Font to Unicode
  7. Index

Product information

  • Title: Unicode Explained
  • Author(s): Jukka K. Korpela
  • Release date: June 2006
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9780596101213