CJKV Information Processing, 2nd Edition

Book description

First published a decade ago, CJKV Information Processing quickly became the unsurpassed source of information on processing text in Chinese, Japanese, Korean, and Vietnamese. It has now been thoroughly updated to provide web and application developers with the latest techniques and tools for disseminating information directly to audiences in East Asia. This second edition reflects the considerable impact that Unicode, XML, OpenType, and newer operating systems such as Windows XP, Windows Vista, Mac OS X, and Linux have had on East Asian text processing in recent years.

Written by its original author, Ken Lunde, a Senior Computer Scientist in CJKV Type Development at Adobe Systems, this book will help you:

  • Learn about CJKV writing systems and scripts, and their transliteration methods
  • Explore trends and developments in character sets and encodings, particularly Unicode
  • Examine the world of typography, specifically how CJKV text is laid out on a page
  • Learn information-processing techniques, such as code conversion algorithms and how to apply them using different programming languages
  • Process CJKV text using different platforms, text editors, and word processors
  • Become more informed about CJKV dictionaries, dictionary software, and machine translation software and services
  • Manage CJKV content and presentation when publishing in print or for the Web

Internationalizing and localizing applications is paramount in today's global market, especially for audiences in East Asia, the fastest-growing segment of the computing world. CJKV Information Processing will help you understand how to develop web and other applications effectively in a field that many find difficult to master.

Table of contents

  1. Foreword
  2. Preface (1/2)
  3. Preface (2/2)
  4. Chapter 1: CJKV Information Processing Overview
    1. Writing Systems and Scripts
    2. Character Set Standards
    3. Encoding Methods
      1. Data Storage Basics
    4. Input Methods
    5. Typography
    6. Basic Concepts and Terminology FAQ
      1. What Are All These Abbreviations and Acronyms?
      2. What Are Internationalization, Globalization, and Localization?
      3. What Are the Multilingual and Locale Models?
      4. What Is a Locale?
      5. What Is Unicode?
      6. How Are Unicode and ISO 10646 Related?
      7. What Are Row-Cell and Plane-Row-Cell?
      8. What Is a Unicode Scalar Value?
      9. Characters Versus Glyphs: What Is the Difference?
      10. What Is the Difference Between Typeface and Font?
      11. What Are Half- and Full-Width Characters?
      12. Latin Versus Roman Characters
      13. What Is a Diacritic Mark?
      14. What Is Notation?
      15. What Is an Octet?
      16. What Are Little- and Big-Endian?
      17. What Are Multiple-Byte and Wide Characters?
    7. Advice to Readers
  5. Chapter 2: Writing Systems and Scripts
    1. Latin Characters, Transliteration, and Romanization
      1. Chinese Transliteration Methods
      2. Japanese Transliteration Methods (1/2)
      3. Japanese Transliteration Methods (2/2)
      4. Korean Transliteration Methods
      5. Vietnamese Romanization Methods
    2. Zhuyin/Bopomofo
    3. Kana
      1. Hiragana
      2. Katakana
      3. The Development of Kana
    4. Hangul
    5. Ideographs
      1. Ideograph Readings
      2. The Structure of Ideographs
      3. The History of Ideographs
      4. Ideograph Simplification
    6. Non-Chinese Ideographs
      1. Japanese-Made Ideographs—Kokuji
      2. Korean-Made Ideographs—Hanguksik Hanja
      3. Vietnamese-Made Ideographs—Chữ Nôm
  6. Chapter 3: Character Set Standards
    1. NCS Standards
      1. Hanzi in China
      2. Hanzi in Taiwan
      3. Kanji in Japan
      4. Hanja in Korea
    2. CCS Standards
      1. National Coded Character Set Standards Overview
      2. ASCII
      3. ASCII Variations
      4. CJKV-Roman
      5. Chinese Character Set Standards—China (1/4)
      6. Chinese Character Set Standards—China (2/4)
      7. Chinese Character Set Standards—China (3/4)
      8. Chinese Character Set Standards—China (4/4)
      9. Chinese Character Set Standards—Taiwan (1/3)
      10. Chinese Character Set Standards—Taiwan (2/3)
      11. Chinese Character Set Standards—Taiwan (3/3)
      12. Chinese Character Set Standards—Hong Kong (1/2)
      13. Chinese Character Set Standards—Hong Kong (2/2)
      14. Chinese Character Set Standards—Singapore
      15. Japanese Character Set Standards
      16. Korean Character Set Standards (1/2)
      17. Korean Character Set Standards (2/2)
      18. Vietnamese Character Set Standards
    3. International Character Set Standards
      1. Unicode and ISO 10646 (1/5)
      2. Unicode and ISO 10646 (2/5)
      3. Unicode and ISO 10646 (3/5)
      4. Unicode and ISO 10646 (4/5)
      5. Unicode and ISO 10646 (5/5)
      6. GB 13000.1-93
      7. CNS 14649-1:2002 and CNS 14649-2:2003
      8. JIS X 0221:2007
      9. KS X 1005-1:1995
    4. Character Set Standard Oddities
      1. Duplicate Characters
      2. Phantom Ideographs
      3. Incomplete Ideograph Pairs
      4. Simplified Ideographs Without a Traditional Form
      5. Fictitious Character Set Extensions
      6. Seemingly Missing Characters
      7. CJK Unified Ideographs with No Source
      8. Vertical Variants
    5. Noncoded Versus Coded Character Sets
      1. China
      2. Taiwan
      3. Japan
      4. Korea
    6. Information Interchange and Professional Publishing
      1. Character Sets for Information Interchange
      2. Character Sets for Professional and Commercial Publishing
    7. Future Trends and Predictions
      1. Emoji
      2. Genuine Ideograph Unification
    8. Advice to Developers
      1. The Importance of Unicode
  7. Chapter 4: Encoding Methods
    1. Unicode Encoding Methods
      1. Special Unicode Characters
      2. Unicode Scalar Values
      3. Byte Order Issues
      4. BMP Versus Non-BMP
      5. Unicode Encoding Forms
      6. Obsolete and Deprecated Unicode Encoding Forms (1/2)
      7. Obsolete and Deprecated Unicode Encoding Forms (2/2)
      8. Comparing UTF Encoding Forms with Legacy Encodings
    2. Legacy Encoding Methods
      1. Locale-Independent Legacy Encoding Methods
      2. Locale-Specific Legacy Encoding Methods (1/4)
      3. Locale-Specific Legacy Encoding Methods (2/4)
      4. Locale-Specific Legacy Encoding Methods (3/4)
      5. Locale-Specific Legacy Encoding Methods (4/4)
    3. Comparing CJKV Encoding Methods
    4. Charset Designations
      1. Character Sets Versus Encodings
      2. Charset Registries
    5. Code Pages
      1. IBM Code Pages
      2. Microsoft Code Pages
    6. Code Conversion
      1. Chinese Code Conversion
      2. Japanese Code Conversion
      3. Korean Code Conversion
      4. Code Conversion Across CJKV Locales
      5. Code Conversion Tips, Tricks, and Pitfalls
    7. Repairing Damaged or Unreadable CJKV Text
      1. Quoted-Printable Transformation
      2. Base64 Transformation
      3. Other Types of Encoding Repair
    8. Advice to Developers
      1. Embrace Unicode
      2. Legacy Encodings Cannot Be Forgotten
      3. Testing
  8. Chapter 5: Input Methods
    1. Transliteration Techniques
      1. Zhuyin Versus Pinyin Input
      2. Kana Versus Transliterated Input
      3. Hangul Versus Transliterated Input
    2. Input Techniques
      1. The Input Method
      2. The Conversion Dictionary
      3. Input by Reading
      4. Input by Structure
      5. Input by Multiple Criteria
      6. Input by Encoding
      7. Input by Other Codes
      8. Input by Postal Code
      9. Input by Association
    3. User Interface Concerns
      1. Inline Conversion
    4. Keyboard Arrays
      1. Western Keyboard Arrays
      2. Ideograph Keyboard Arrays
      3. Chinese Input Method Keyboard Arrays
      4. Zhuyin Keyboard Arrays
      5. Kana Keyboard Arrays (1/2)
      6. Kana Keyboard Arrays (2/2)
      7. Hangul Keyboard Arrays
      8. Latin Keyboard Arrays for CJKV Input
      9. Mobile Keyboard Arrays (1/2)
      10. Mobile Keyboard Arrays (2/2)
    5. Other Input Hardware
      1. Pen Input
      2. Optical Character Recognition
      3. Voice Input
    6. Input Method Software
      1. CJKV Input Method Software
      2. Chinese Input Method Software
      3. Japanese Input Method Software
      4. Korean Input Method Software
  9. Chapter 6: Font Formats, Glyph Sets, and Font Tools
    1. Typeface Design
    2. How Many Glyphs Can a Font Include?
      1. Composite Fonts Versus Fallback Fonts
      2. Breaking the 64K Glyph Barrier
    3. Bitmapped Font Formats
      1. BDF Font Format
      2. HBF Font Format
    4. Outline Font Formats
      1. PostScript Font Formats (1/4)
      2. PostScript Font Formats (2/4)
      3. PostScript Font Formats (3/4)
      4. PostScript Font Formats (4/4)
      5. TrueType Font Formats
      6. OpenType—PostScript and TrueType in Harmony (1/2)
      7. OpenType—PostScript and TrueType in Harmony (2/2)
    5. Glyph Sets
      1. Static Versus Dynamic Glyph Sets
      2. CID Versus GID
      3. Std Versus Pro Designators
      4. Glyph Sets for Transliteration and Romanization
      5. Character Collections for CID-Keyed Fonts (1/3)
      6. Character Collections for CID-Keyed Fonts (2/3)
      7. Character Collections for CID-Keyed Fonts (3/3)
    6. Ruby Glyphs
      1. Generic Versus Typeface-Specific Ruby Glyphs
    7. Host-Installed, Printer-Resident, and Embedded Fonts
      1. Installing and Downloading Fonts
      2. The PostScript Filesystem
      3. Mac OS X
      4. Mac OS 9 and Earlier
      5. Microsoft Windows—2000, XP, and Vista
      6. Microsoft Windows—Versions 3.1, 95, 98, ME, and NT4
      7. Unix and Linux
      8. X Window System
      9. Font and Glyph Embedding
      10. Cross-Platform Issues
    8. Font Development Tools
      1. Bitmapped Font Editors
      2. Outline Font Editors
      3. Outline Font Editors for Larger Fonts
      4. AFDKO—Adobe Font Development Kit for OpenType
      5. TTX/FontTools
      6. Font Format Conversion
    9. Gaiji Handling
      1. The Gaiji Problem
      2. SING—Smart INdependent Glyphlets
      3. Ideographic Variation Sequences
      4. XKP, A Gaiji Handling Initiative—Obsolete
      5. Adobe Type Composer (ATC)—Obsolete
      6. Composite Font Functionality Within Applications
      7. Gaiji Handling Techniques and Tricks
      8. Creating Your Own Rearranged Fonts
      9. Acquiring Gaiji Glyphs and Gaiji Fonts
    10. Advice to Developers
  10. Chapter 7: Typography
    1. Rules, Principles, and Techniques
      1. JIS X 4051:2004 Compliance
      2. GB/T 15834-1995 and GB/T 15835-1995
    2. Typographic Units and Measurements
      1. Two Important Points—Literally
      2. Other Typographic Units
    3. Horizontal and Vertical Layout
      1. Nonsquare Design Space
      2. The Character Grid
      3. Vertical Character Variants (1/2)
      4. Vertical Character Variants (2/2)
      5. Dedicated Vertical Characters
      6. Vertical Latin Text
    4. Line Breaking and Word Wrapping
    5. Character Spanning
    6. Alternate Metrics
      1. Half-Width Symbols and Punctuation
      2. Proportional Symbols and Punctuation
      3. Proportional Kana
      4. Proportional Ideographs
      5. Kerning
    7. Line-Length Issues
      1. Manipulating Symbol and Punctuation Metrics
      2. Manipulating Inter-Glyph Spacing
      3. JIS X 4051:2004 Character Classes
    8. Multilingual Typography
      1. Latin Baseline Adjustment
      2. Proper Spacing of Latin and CJKV Characters
      3. Mixing Latin and CJKV Typeface Designs
    9. Glyph Substitution
      1. Character and Glyph Variants
      2. Ligatures
    10. Annotations
      1. Ruby Glyphs
      2. Inline Notes—Warichu
      3. Other Annotations
    11. Typographic Applications
      1. Page-Layout Applications (1/2)
      2. Page-Layout Applications (2/2)
      3. Graphics Applications
    12. Advice to Developers
  11. Chapter 8: Output Methods
    1. Where Can Fonts Live?
    2. Output via Printing
      1. PostScript CJKV Printers
      2. Genuine PostScript
      3. Clone PostScript
      4. Passing Characters to PostScript
    3. Output via Display
      1. Adobe Type Manager—ATM
      2. SuperATM
      3. Adobe Acrobat and PDF
      4. Ghostscript
      5. OpenType and TrueType
    4. Other Printing Methods
    5. The Role of Printer Drivers
      1. Microsoft Windows Printer Drivers
      2. Mac OS X Printer Drivers
    6. Output Tips and Tricks
      1. Creating CJKV Documents for Non-CJKV Systems
    7. Advice to Developers
      1. CJKV-Capable Publishing Systems
      2. Some Practical Advice
  12. Chapter 9: Information Processing Techniques
    1. Language, Country, and Script Codes
    2. CLDR—Common Locale Data Repository
    3. Programming Languages
      1. C/C++
      2. Java
      3. Perl
      4. Python
      5. Ruby
      6. Tcl
      7. Other Programming Environments
    4. Code Conversion Algorithms
      1. Conversion Between UTF-8, UTF-16, and UTF-32
      2. Conversion Between ISO-2022 and EUC
      3. Conversion Between ISO-2022 and Row-Cell
      4. Conversion Between ISO-2022-JP and Shift-JIS
      5. Conversion Between EUC-JP and Shift-JIS
      6. Other Code Conversion Types
    5. Java Programming Examples
      1. Java Code Conversion
      2. Java Text Stream Handling
      3. Java Charset Designators
    6. Miscellaneous Algorithms
      1. Japanese Code Detection
      2. Half- to Full-Width Katakana Conversion—in Java
      3. Encoding Repair
    7. Byte Versus Character Handling
      1. Character Deletion
      2. Character Insertion
      3. Character Searching
      4. Line Breaking
      5. Character Attribute Detection Using C Macros
    8. Character Sorting
    9. Natural Language Processing
      1. Word Parsing and Morphological Analysis
      2. Spelling and Grammar Checking
      3. Chinese-Chinese Conversion
      4. Special Transliteration Considerations
    10. Regular Expressions
    11. Search Engines
    12. Code-Processing Tools
      1. JConv—Code Conversion Tool
      2. JChar—Character Set Generation Tool
      3. CJKV Character Set Server
      4. JCode—Text File Examination Tool
      5. Other Useful Tools and Resources
  13. Chapter 10: OSes, Text Editors, and Word Processors
    1. Viewing CJKV Text Using Non-CJKV OSes
      1. AsianSuite X2—Microsoft Windows
      2. NJStar CJK Viewer—Microsoft Windows
      3. TwinBridge Language Partner—Microsoft Windows
    2. Operating Systems
      1. FreeBSD
      2. Linux
      3. Mac OS X
      4. Microsoft Windows Vista
      5. MS-DOS
      6. Plan 9
      7. Solaris and OpenSolaris
      8. TRON and Chokanji
      9. Unix
    3. Hybrid Environments
      1. Boot Camp—Run Windows on Apple Hardware
      2. CrossOver Mac—Run Windows Applications on Mac OS X
      3. GNOME—Linux and Unix
      4. KDE—Linux and Unix
      5. VMware Fusion—Run Windows on Mac OS X
      6. Wine—Run Windows on Unix, Linux, and Other OSes
      7. X Window System—Unix
    4. Text Editors
      1. Mac OS X Text Editors
      2. Windows Text Editors
      3. Vietnamese Text Editing
      4. Emacs and GNU Emacs
      5. vi and Vim
    5. Word Processors
      1. AbiWord
      2. Haansoft Hangul—Microsoft Windows
      3. Ichitaro—Microsoft Windows
      4. KWord
      5. Microsoft Word—Microsoft Windows and Mac OS X
      6. Nisus Writer—Mac OS X
      7. NJStar Chinese/Japanese WP—Microsoft Windows
      8. Pages—Mac OS X
    6. Online Word Processors
      1. Adobe Buzzword
      2. Google Docs
    7. Advice to Developers
  14. Chapter 11: Dictionaries and Dictionary Software
    1. Ideograph Dictionary Indexes
      1. Reading Index
      2. Radical Index
      3. Stroke Count Index
      4. Other Indexes
    2. Ideograph Dictionaries
      1. Character Set Standards As Ideograph Dictionaries
      2. Locale-Specific Ideograph Dictionaries
      3. Vendor Ideograph Dictionaries and Ideograph Tables
      4. CJKV Ideograph Dictionaries
    3. Other Useful Dictionaries
      1. Conventional Dictionaries
      2. Variant Ideograph Dictionaries
    4. Dictionary Hardware
    5. Dictionary Software
      1. Dictionary CD-ROMs
      2. Frontend Software for Dictionary CD-ROMs
      3. Dictionary Files (1/2)
      4. Dictionary Files (2/2)
      5. Frontend Software for Dictionary Files
      6. Web-Based Dictionaries
    6. Machine Translation Applications
    7. Machine Translation Services
      1. Free Machine Translation Services
      2. Commercial Machine Translation Services
    8. Language-Learning Aids
  15. Chapter 12: Web and Print Publishing
    1. Line-Termination Concerns
    2. Email
      1. Sending Email
      2. Receiving Email
      3. Email Troubles and Tricks
      4. Email Clients
    3. Network Domains
      1. Internationalized Domain Names
      2. The CN Domain
      3. The HK Domain
      4. The JP Domain
      5. The KR Domain
      6. The TW Domain
      7. The VN Domain
    4. Content Versus Presentation
    5. Web Publishing
      1. Web Browsers
      2. Displaying Web Pages
    6. HTML—HyperText Markup Language
      1. Authoring HTML Documents (1/2)
      2. Authoring HTML Documents (2/2)
      3. Web-Authoring Tools
      4. Embedding CJKV Text As Graphics
    7. XML—Extensible Markup Language
      1. Authoring XML Documents
    8. CGI Programming Examples
    9. Print Publishing
      1. PDF—Portable Document Format
      2. Authoring PDF Documents
      3. PDF Eases Publishing Pains
    10. Where to Go Next?
  16. Appendix A: Code Conversion Tables
  17. Appendix B: Notation Conversion Table
  18. Appendix C: Perl Code Examples (1/4)
  19. Appendix C: Perl Code Examples (2/4)
  20. Appendix C: Perl Code Examples (3/4)
  21. Appendix C: Perl Code Examples (4/4)
  22. Appendix D: Glossary (1/8)
  23. Appendix D: Glossary (2/8)
  24. Appendix D: Glossary (3/8)
  25. Appendix D: Glossary (4/8)
  26. Appendix D: Glossary (5/8)
  27. Appendix D: Glossary (6/8)
  28. Appendix D: Glossary (7/8)
  29. Appendix D: Glossary (8/8)
  30. Appendix E: Vendor Character Set Standards
  31. Appendix F: Vendor Encoding Methods
  32. Appendix G: Chinese Character Sets—China
  33. Appendix H: Chinese Character Sets—Taiwan
  34. Appendix I: Chinese Character Sets—Hong Kong
  35. Appendix J: Japanese Character Sets
  36. Appendix K: Korean Character Sets
  37. Appendix L: Vietnamese Character Sets
  38. Appendix M: Miscellaneous Character Sets
  39. Bibliography (1/6)
  40. Bibliography (2/6)
  41. Bibliography (3/6)
  42. Bibliography (4/6)
  43. Bibliography (5/6)
  44. Bibliography (6/6)
  45. Index (1/6)
  46. Index (2/6)
  47. Index (3/6)
  48. Index (4/6)
  49. Index (5/6)
  50. Index (6/6)

Product information

  • Title: CJKV Information Processing, 2nd Edition
  • Author(s): Ken Lunde
  • Release date: December 2008
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9780596514471