Common Character Database of 2003

What is it?

This public domain database contains over 90000 characters covering the major languages of the world. It is intended to be compatible with the ISO/IEC 10646:2003 standard and most web browsers. The downloadable file common-character-database-of-2003-(created-2019-05-09).zip contains this web page and two versions of the database.

Database version: common-character-database-of-2003.tsv

This tab-separated value file, with MD5 checksum bbd4e5cc26d446e765639ed5295d1340, has the following fields:

glyph: visual representation
character_code: 0-10ffff hexadecimal
character_byte_encoding: a list of hexadecimal bytes delimited by ':' which represent the UTF-8 encoding
script_type: alphabet, abjad, abugida, syllabary, ideographs, symbols, internal
script_name: Aegean, Arabic, Armenian, Arrow, Bengali, Block, Bopomofo, Box_Segment, Braille, Buhid, Byzantine_Musical, Canadian_Aboriginal, Cherokee, Combining_Mark, Control, Coptic, Currency, Cypriot, Cyrillic, Deseret, Devanagari, Dingbat, Ethiopic, Georgian, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Kanbun, Kangxi, Kannada, Katakana, Khmer, Lao, Latin, Letterlike, Limbu, Linear_B, Malayalam, Mathematical, Miscellaneous, Mongolian, Musical, Myanmar, Number, Ogham, Old_Italic, Optical_Character_Recognition, Oriya, Osmanya, Phonetic, Punctuation, Runic, Shape, Shavian, Sinhala, Subscript, Superscript, Syriac, Tag, Tagalog, Tagbanwa, Tai_Le, Tai_Xuan_Jing, Tamil, Technical, Telugu, Thaana, Thai, Tibetan, Ugaritic, Variation_Selector, Yi, Yiyang_Hexagram
character_use: public, private, future, invalid-surrogate, invalid-non-character
character_class: letter, combining-mark, punctuation, number, symbol, invisible
titlecase_character_code: 0-10ffff hexadecimal
uppercase_character_code: list of 0-10ffff hexadecimal delimited by '+' (only character 00df maps to more than one character code)
lowercase_character_code: 0-10ffff hexadecimal
punctuation_subclass: newline, space, stop, exclamation, question, colon, semicolon, comma, dash, repetition, quotation, group, other
punctuation_group_open_character_code: 0-10ffff hexadecimal
punctuation_group_close_character_code: 0-10ffff hexadecimal
combining_mark_position: one of the labels listed under combining_mark_sequence_number below
combining_mark_sequence_number: one of the following numbers, each shown with its combining_mark_position label:

5 = overlay, 6 = nukta, 7 = kana-voice, 8 = virama
12 = connected-below, 14 = connected-left, 16 = connected-right, 18 = connected-above
21 = below-left, 22 = below, 23 = below-right, 24 = left, 26 = right, 27 = above-left, 28 = above, 29 = above-right
32 = farther-below, 38 = farther-above
42 = farthest-below
50 = surround

numeric_value: decimal or fractional value of this character, e.g. 100 or 2/3 or -1/2
canonical_decomposition_character_codes: list of 0-10ffff hexadecimal delimited by '+'
description_of_invisible_character: short description of any character without a corresponding glyph in most font character sets
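
The field descriptions above can be turned into a small parser. The following is only a sketch in Python: it assumes the columns appear in the order listed above and that empty fields are left blank, neither of which is stated here. The names FIELDS, parse_codes and parse_row are made up for this example, and the sample row is invented for illustration rather than copied from the database.

```python
from fractions import Fraction

# Field order as listed above; ASSUMED to match the column order in the file.
FIELDS = [
    "glyph", "character_code", "character_byte_encoding", "script_type",
    "script_name", "character_use", "character_class",
    "titlecase_character_code", "uppercase_character_code",
    "lowercase_character_code", "punctuation_subclass",
    "punctuation_group_open_character_code",
    "punctuation_group_close_character_code",
    "combining_mark_position", "combining_mark_sequence_number",
    "numeric_value", "canonical_decomposition_character_codes",
    "description_of_invisible_character",
]


def parse_codes(text, sep="+"):
    """Turn '0053+0053' into [0x53, 0x53]; an empty field gives []."""
    return [int(part, 16) for part in text.split(sep) if part]


def parse_row(line):
    """Split one tab-separated line into a dict with a few decoded values."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    row = dict(zip(FIELDS, values))
    row["character_code"] = int(row["character_code"], 16)
    # The byte list uses ':' as its delimiter, e.g. 'c2:bd'.
    row["character_byte_encoding"] = bytes(
        parse_codes(row["character_byte_encoding"], sep=":"))
    row["uppercase_character_code"] = parse_codes(
        row["uppercase_character_code"])
    row["canonical_decomposition_character_codes"] = parse_codes(
        row["canonical_decomposition_character_codes"])
    # Fraction accepts '100', '2/3' and '-1/2' directly.
    row["numeric_value"] = (
        Fraction(row["numeric_value"]) if row["numeric_value"] else None)
    return row


# Invented row for illustration only; the real row for 00bd may differ.
sample = "\t".join(["½", "00bd", "c2:bd", "symbols", "Number", "public",
                    "number", "", "", "", "", "", "", "", "", "1/2", "", ""])
row = parse_row(sample)
```

For the version of the file without embedded glyphs, the same sketch should work after removing "glyph" from FIELDS. If the file turns out to begin with a header row, that line would need to be skipped before calling parse_row.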

Database version: common-character-database-of-2003-without-embedded-glyphs.tsv

This file, with MD5 checksum 2a829abe4734d4687dbae4f9aa57ac29, is the same as the version above without the first field.
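
The two MD5 checksums given on this page can be checked after unpacking the zip file. Here is a minimal sketch using Python's standard hashlib module; the function names md5_of_file and verify are made up for this example, and the path passed in is wherever the file was unpacked.

```python
import hashlib

# Checksums as stated on this page.
EXPECTED_MD5 = {
    "common-character-database-of-2003.tsv":
        "bbd4e5cc26d446e765639ed5295d1340",
    "common-character-database-of-2003-without-embedded-glyphs.tsv":
        "2a829abe4734d4687dbae4f9aa57ac29",
}


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so the whole file need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Compare the file at `path` with the checksum published for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

A call such as verify("common-character-database-of-2003.tsv", "common-character-database-of-2003.tsv") should return True for an unmodified copy. MD5 is enough to detect accidental corruption, but it does not protect against deliberate tampering.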

Notes and Exceptions

uppercase of character 0069 is character 0130 when writing Turkish
lowercase of character 0049 is character 0131 when writing Turkish
lowercase of character 0197 may sometimes be character 0268 (instead of the value in the database specified by ISO 6438)
characters 2018, 2019, 201b, 201c, 201d and 201f may be opening or closing quotation marks depending on country of use
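
The two Turkish exceptions above could be applied on top of the database's case mappings roughly as follows. This is a sketch in Python, not part of the database: the function names and the turkish flag are made up, and default_upper and default_lower are small stand-ins for mappings that would be loaded from the uppercase_character_code and lowercase_character_code fields.

```python
# Overrides taken from the notes above.
TURKISH_UPPER = {0x0069: 0x0130}  # i -> capital I with dot above
TURKISH_LOWER = {0x0049: 0x0131}  # I -> small dotless i


def uppercase_codes(code, default_upper, turkish=False):
    """Return the list of uppercase character codes for `code`."""
    if turkish and code in TURKISH_UPPER:
        return [TURKISH_UPPER[code]]
    # Characters without a mapping are returned unchanged.
    return default_upper.get(code, [code])


def lowercase_code(code, default_lower, turkish=False):
    """Return the lowercase character code for `code`."""
    if turkish and code in TURKISH_LOWER:
        return TURKISH_LOWER[code]
    return default_lower.get(code, code)


# Stand-in mappings for illustration; real ones would come from the file.
default_upper = {0x0069: [0x0049]}
default_lower = {0x0049: 0x0069}

plain_upper = uppercase_codes(0x0069, default_upper)
turkish_upper = uppercase_codes(0x0069, default_upper, turkish=True)
turkish_lower = lowercase_code(0x0049, default_lower, turkish=True)
```

The uppercase function returns a list because the uppercase_character_code field may hold more than one character code.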

Why, when, and how was it made?

In 2018 I was unable to locate a character database that covered the major writing systems of the world, was compatible with most web browsers and had no legal restrictions. Since the facts in books are not usually subject to copyright or other laws, a book seemed to be a good source of data for a new database. In case the European Database Directive applied, the book needed to be at least fifteen years old. So beginning in 2019 I created this database using a book published in 2003 (ISBN 0321185781). Part of the data was generated with custom programs while the rest was entered manually. None of it came from the CD-ROM included with the book.

How has it changed over time?

2019-05-09: Characters 0xfe20-0xfe23 and 0xfff9-0xfffd given a script name and type
2019-05-02: Characters 0x20 and 0xa0 reclassified as punctuation, 0x303f reclassified as a symbol and 0x3000 given a description
2019-03-05: Initial publication

Questions or Comments

Contact information

Thanks

Numerous scribes developed the writing systems of the world. I am grateful to them and to the various national, international and commercial groups that have organized and published this information, and to the authors of ISBN 0321185781 (not named to avoid using a trademark) for a very clear and helpful reference resource. Thanks most of all to my father in heaven, the creator of everything, for providing me the ability, resources and motivation to undertake this work.

Public Domain Dedication

I dedicate this version of the "Common Character Database of 2003", created May 9, 2019, to the public domain. --Scot Doyle