Common Character Database of 2003
What is it?
This public domain database contains over 90000 characters covering the
major languages of the world. It is intended to be compatible with the
standard and most web browsers. The downloadable file
contains this web page and two versions of the database.
Database version: common-character-database-of-2003.tsv
This tab-separated value file, with MD5 checksum
bbd4e5cc26d446e765639ed5295d1340, has the following fields:
- glyph: visual representation
- character_code: 0-10ffff hexadecimal
- character_byte_encoding: a list of hexadecimal bytes delimited by ':' which represent the UTF-8 encoding
- script_type: alphabet, abjad, abugida, syllabary, ideographs, symbols, internal
- script_name: Aegean, Arabic, Armenian, Arrow, Bengali, Block, Bopomofo, Box_Segment, Braille, Buhid, Byzantine_Musical, Canadian_Aboriginal, Cherokee, Combining_Mark, Control, Coptic, Currency, Cypriot, Cyrillic, Deseret, Devanagari, Dingbat, Ethiopic, Georgian, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Kanbun, Kangxi, Kannada, Katakana, Khmer, Lao, Latin, Letterlike, Limbu, Linear_B, Malayalam, Mathematical, Miscellaneous, Mongolian, Musical, Myanmar, Number, Ogham, Old_Italic, Optical_Character_Recognition, Oriya, Osmanya, Phonetic, Punctuation, Runic, Shape, Shavian, Sinhala, Subscript, Superscript, Syriac, Tag, Tagalog, Tagbanwa, Tai_Le, Tai_Xuan_Jing, Tamil, Technical, Telugu, Thaana, Thai, Tibetan, Ugaritic, Variation_Selector, Yi, Yiyang_Hexagram
- character_use: public, private, future, invalid-surrogate, invalid-non-character
- character_class: letter, combining-mark, punctuation, number, symbol, invisible
- titlecase_character_code: 0-10ffff hexadecimal
- uppercase_character_code: list of 0-10ffff hexadecimal delimited by '+' (only character 00df maps to more than one character code)
- lowercase_character_code: 0-10ffff hexadecimal
- punctuation_subclass: newline, space, stop, exclamation, question, colon, semicolon, comma, dash, repetition, quotation, group, other
- punctuation_group_open_character_code: 0-10ffff hexadecimal
- punctuation_group_close_character_code: 0-10ffff hexadecimal
- combining_mark_position: see the combining_mark_sequence_number labels below:
  - 5 = overlay, 6 = nukta, 7 = kana-voice, 8 = virama
  - 12 = connected-below, 14 = connected-left, 16 = connected-right, 18 = connected-above
  - 21 = below-left, 22 = below, 23 = below-right, 24 = left, 26 = right, 27 = above-left, 28 = above, 29 = above-right
  - 32 = farther-below, 38 = farther-above
  - 42 = farthest-below
  - 50 = surround
- numeric_value: decimal or fractional value of this character, e.g. 100 or 2/3 or -1/2
- canonical_decomposition_character_codes: list of 0-10ffff hexadecimal delimited by '+'
- description_of_invisible_character: short description of any character without a corresponding glyph in most font character sets
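As an illustration of how the fields above might be read, here is a minimal Python sketch. It rests on assumptions not stated on this page: that the columns appear in the order listed, that there is no header row, and that an empty field means "no value". The names FIELDS, read_records, parse_code, parse_code_list, parse_utf8_bytes and parse_numeric_value are hypothetical helpers, not part of the database.

```python
from fractions import Fraction

# Field names in the order listed above (assumed to match the column order).
FIELDS = [
    "glyph", "character_code", "character_byte_encoding", "script_type",
    "script_name", "character_use", "character_class",
    "titlecase_character_code", "uppercase_character_code",
    "lowercase_character_code", "punctuation_subclass",
    "punctuation_group_open_character_code",
    "punctuation_group_close_character_code",
    "combining_mark_position", "numeric_value",
    "canonical_decomposition_character_codes",
    "description_of_invisible_character",
]

def read_records(lines, with_glyph=True):
    """Yield one dict per non-empty line of tab-separated text.

    Pass with_glyph=False for the version without the first field.
    """
    fields = FIELDS if with_glyph else FIELDS[1:]
    for line in lines:
        line = line.rstrip("\r\n")  # keep trailing tabs (empty fields)
        if line:
            yield dict(zip(fields, line.split("\t")))

def parse_code(field):
    """'00df' -> 223; an empty field is assumed to mean no value."""
    return int(field, 16) if field else None

def parse_code_list(field):
    """'0053+0053' -> [83, 83]; an empty field gives an empty list."""
    return [int(part, 16) for part in field.split("+")] if field else []

def parse_utf8_bytes(field):
    """'c3:9f' -> b'\\xc3\\x9f' (the UTF-8 bytes of the character)."""
    return bytes(int(part, 16) for part in field.split(":")) if field else b""

def parse_numeric_value(field):
    """'100', '2/3' or '-1/2' -> an exact Fraction; empty -> None."""
    return Fraction(field) if field else None
```

For character 00df, chr(parse_code("00df")) and parse_utf8_bytes("c3:9f").decode("utf-8") should give the same one-character string. The value "0053+0053" in the docstring is only an illustration of the '+' format, not a quotation from the file.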
Database version: common-character-database-of-2003-without-embedded-glyphs.tsv
This file, with MD5 checksum 2a829abe4734d4687dbae4f9aa57ac29, is
the same as the version above but without the first field (glyph).
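A downloaded copy can be checked against the two MD5 checksums given above. The following Python sketch shows one way to do that with the standard hashlib module; the function names md5_of_file and matches_published_checksum are hypothetical, and it assumes the files are stored under their original names.

```python
import hashlib
import os

# Checksums as published on this page, keyed by file name.
PUBLISHED_MD5 = {
    "common-character-database-of-2003.tsv":
        "bbd4e5cc26d446e765639ed5295d1340",
    "common-character-database-of-2003-without-embedded-glyphs.tsv":
        "2a829abe4734d4687dbae4f9aa57ac29",
}

def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so the whole file need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_checksum(path):
    """True if the file's MD5 equals the checksum published for its name."""
    expected = PUBLISHED_MD5.get(os.path.basename(path))
    return expected is not None and md5_of_file(path) == expected
```

An MD5 match only indicates that the copy is byte-for-byte identical to the published file; it is not a security guarantee.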
Notes and Exceptions
- uppercase of character 0069 is character 0130 when writing Turkish
- lowercase of character 0049 is character 0131 when writing Turkish
- lowercase of character 0197 may sometimes be character 0268 (instead of the value in the database, which is the one specified by ISO 6438)
- characters 2018, 2019, 201b, 201c, 201d and 201f may be opening or closing quotation marks depending on country of use
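The Turkish exceptions above could be layered over the database's case fields roughly as in the Python sketch below. The dictionaries database_uppercase and database_lowercase are hypothetical lookups that a program would build from the uppercase_character_code and lowercase_character_code fields; the function names and the turkish flag are likewise assumptions made for illustration.

```python
# Exceptions from the notes above, as integer character codes.
TURKISH_UPPERCASE = {0x0069: [0x0130]}
TURKISH_LOWERCASE = {0x0049: 0x0131}

def uppercase_codes(code, database_uppercase, turkish=False):
    """Return the list of uppercase character codes for `code`.

    `database_uppercase` maps a code to a list of codes, because the
    uppercase field may hold more than one code (delimited by '+').
    Codes without an entry are returned unchanged.
    """
    if turkish and code in TURKISH_UPPERCASE:
        return TURKISH_UPPERCASE[code]
    return database_uppercase.get(code, [code])

def lowercase_code(code, database_lowercase, turkish=False):
    """Return the lowercase character code for `code`."""
    if turkish and code in TURKISH_LOWERCASE:
        return TURKISH_LOWERCASE[code]
    return database_lowercase.get(code, code)
```

With database_uppercase = {0x0069: [0x0049]}, calling uppercase_codes(0x0069, database_uppercase) returns [0x0049], while passing turkish=True returns [0x0130] instead.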
Why, when, and how was it made?
In 2018 I was unable to locate a character database that covered the major
writing systems of the world, was compatible with most web browsers and had
no legal restrictions. Since the facts in books are not usually subject to
copyright or other laws, a book seemed to be a good source of data for a new
database. In case the European Database Directive applied, the book needed
to be at least fifteen years old. So beginning in 2019 I created this
database using a book published in 2003 (ISBN 0321185781). Part of the data
was generated with custom programs while the rest was manually entered. None
came from the CD-ROM included with the book.
How has it changed over time?
- 2019-05-09: Characters 0xfe20-0xfe23 and 0xfff9-0xfffd given a script name and type
- 2019-05-02: Characters 0x20 and 0xa0 reclassified as punctuation, 0x303f reclassified as a symbol and 0x3000 given a description
- 2019-03-05: Initial publication
Questions or Comments
Numerous scribes developed the writing systems of the world. I am grateful
to them and to the various national, international and commercial groups that
have organized and published this information, and to the authors of ISBN
0321185781 (not named to avoid using a trademark) for a very clear and helpful
reference resource. Thanks most of all to my father in heaven, the creator of
everything, for providing me the ability, resources and motivation to
undertake this work.
Public Domain Dedication
I dedicate this version of the "Common Character Database of 2003", created
May 9, 2019, to the public domain. --Scot Doyle