Common Character Database of 2003
What is it?
This public domain database contains over 90000 characters
covering the major languages of the world.
It is intended to be compatible with the ISO 10646:2003 standard
and most web browsers.
The downloadable file
contains this web page and two versions of the database.
Database version: common-character-database-of-2003.tsv
This tab-separated values file
contains one character entry per line.
Its MD5 checksum is 8c972bf302e95f0f46f598071765d0d2.
Each entry has these fields and values:
- character: visual representation
- character_code: 0-10ffff hexadecimal
- character_byte_encoding: a list of hexadecimal bytes delimited by ':' which represent the UTF-8 encoding
- script_type: alphabet, abjad, abugida, syllabary, ideographs, symbols, internal
- script_name: Aegean, Arabic, Armenian, Arrow, Bengali, Block, Bopomofo, Box_Segment, Braille, Buhid, Byzantine_Musical, Canadian_Aboriginal, Cherokee, Combining_Mark, Control, Coptic, Currency, Cypriot, Cyrillic, Deseret, Devanagari, Dingbat, Ethiopic, Georgian, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Kanbun, Kangxi, Kannada, Katakana, Khmer, Lao, Latin, Letterlike, Limbu, Linear_B, Malayalam, Mathematical, Miscellaneous, Mongolian, Musical, Myanmar, Number, Ogham, Old_Italic, Optical_Character_Recognition, Oriya, Osmanya, Phonetic, Punctuation, Runic, Shape, Shavian, Sinhala, Subscript, Superscript, Syriac, Tag, Tagalog, Tagbanwa, Tai_Le, Tai_Xuan_Jing, Tamil, Technical, Telugu, Thaana, Thai, Tibetan, Ugaritic, Variation_Selector, Yi, Yiyang_Hexagram
- character_use: public, private, future, invalid-surrogate, invalid-non-character
- character_class: letter, combining-mark, punctuation, number, symbol, invisible
- titlecase_character_code: 0-10ffff hexadecimal
- uppercase_character_code: list of 0-10ffff hexadecimal delimited by '+' (only character 00df maps to more than one character code)
- lowercase_character_code: 0-10ffff hexadecimal
- punctuation_subclass: newline, space, stop, exclamation, question, colon, semicolon, comma, dash, repetition, quotation, group, other
- punctuation_group_open_character_code: 0-10ffff hexadecimal
- punctuation_group_close_character_code: 0-10ffff hexadecimal
- combining_mark_position: <see combining_mark_sequence_number labels below>
- 5 = overlay, 6 = nukta, 7 = kana-voice, 8 = virama
- 12 = connected-below, 14 = connected-left, 16 = connected-right, 18 = connected-above
- 21 = below-left, 22 = below, 23 = below-right, 24 = left, 26 = right, 27 = above-left, 28 = above, 29 = above-right
- 32 = farther-below, 38 = farther-above
- 42 = farthest-below
- 50 = surround
- numeric_value: decimal or fractional value of this character, e.g. 100 or 2/3 or -1/2
- canonical_decomposition_character_codes: list of 0-10ffff hexadecimal delimited by '+'
- description_of_invisible_character: short description of any character without a corresponding glyph in most font character sets
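The field list above can be read with a few lines of Python. This is a minimal sketch, not part of the database: it assumes the columns appear in the same order as the list, that empty fields are empty strings, and that each line splits cleanly on tabs. The function names (parse_entry, parse_code_list, parse_utf8_bytes, bytes_match_code, parse_numeric_value) are made up for illustration.

```python
from fractions import Fraction

# ASSUMPTION: column order follows the order of the field list above.
FIELDS = [
    "character",
    "character_code",
    "character_byte_encoding",
    "script_type",
    "script_name",
    "character_use",
    "character_class",
    "titlecase_character_code",
    "uppercase_character_code",
    "lowercase_character_code",
    "punctuation_subclass",
    "punctuation_group_open_character_code",
    "punctuation_group_close_character_code",
    "combining_mark_position",
    "numeric_value",
    "canonical_decomposition_character_codes",
    "description_of_invisible_character",
]


def parse_entry(line):
    """Split one tab-separated line into a field-name -> value dict."""
    values = line.rstrip("\r\n").split("\t")
    return dict(zip(FIELDS, values))


def parse_code_list(value):
    """Parse a '+'-delimited list of hexadecimal character codes."""
    return [int(code, 16) for code in value.split("+")] if value else []


def parse_utf8_bytes(value):
    """Parse a ':'-delimited list of hexadecimal bytes."""
    return bytes(int(byte, 16) for byte in value.split(":"))


def bytes_match_code(entry):
    """Check that character_byte_encoding is the UTF-8 form of character_code."""
    code_point = int(entry["character_code"], 16)
    # "surrogatepass" lets surrogate code points be encoded for comparison.
    expected = chr(code_point).encode("utf-8", "surrogatepass")
    return parse_utf8_bytes(entry["character_byte_encoding"]) == expected


def parse_numeric_value(value):
    """Parse values such as '100', '2/3' or '-1/2'; empty means no value."""
    return Fraction(value) if value else None
```

Fraction is used for numeric_value so that entries such as 2/3 stay exact instead of being rounded to a float.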
Database version: common-character-database-of-2003-without-embedded-characters.tsv
This file is the same as the version above without the first field.
The MD5 checksum for this version is 2584b64867bbbc5fbc969f796073c061.
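The checksums listed for the two files can be checked after download. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_hex and verify are made up for illustration, and the file paths are assumed to point at the extracted files.

```python
import hashlib

# Checksums as published on this page.
EXPECTED_MD5 = {
    "common-character-database-of-2003.tsv":
        "8c972bf302e95f0f46f598071765d0d2",
    "common-character-database-of-2003-without-embedded-characters.tsv":
        "2584b64867bbbc5fbc969f796073c061",
}


def md5_hex(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so the whole file need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's MD5 digest equals expected_hex."""
    return md5_hex(path) == expected_hex.lower()
```

For example, verify(name, EXPECTED_MD5[name]) should return True for an unmodified copy of either file in the current directory.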
Notes and Exceptions
- uppercase of character 0069 is character 0130 when writing Turkish
- lowercase of character 0049 is character 0131 when writing Turkish
- lowercase of character 0197 may sometimes be character 0268 (instead of the value in the database specified by ISO 6438)
- characters 2018, 2019, 201b, 201c, 201d and 201f may be opening or closing quotation marks depending on country of use
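The two Turkish exceptions above could be applied on top of the database values as a small override table. This is a sketch under assumptions: the "tr" language tag, the table names and the function names are made up for illustration, and the database value is passed in by the caller rather than looked up.

```python
# Overrides taken from the notes above, keyed by hexadecimal character code.
# ASSUMPTION: "tr" is used here as the tag for Turkish.
UPPERCASE_OVERRIDES = {"tr": {"0069": "0130"}}
LOWERCASE_OVERRIDES = {"tr": {"0049": "0131"}}


def uppercase_code(code, database_value, language=None):
    """Return the override for this language if one exists, else the database value."""
    return UPPERCASE_OVERRIDES.get(language, {}).get(code.lower(), database_value)


def lowercase_code(code, database_value, language=None):
    """Return the override for this language if one exists, else the database value."""
    return LOWERCASE_OVERRIDES.get(language, {}).get(code.lower(), database_value)
```

Keeping the exceptions in a separate table leaves the database values untouched for every other language.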
Why, when, and how was it made?
In 2018 I was unable to locate a character database for use in another project
that covered the major writing systems of the world,
was compatible with most web browsers and had no legal restrictions.
Since the facts in books are not usually subject to copyright or other laws, a book seemed to be a good
source of data for a new database. In case the European Database Directive applied, the book needed to be at
least fifteen years old. So beginning in 2019 I
created this database from scratch using a book published in 2003 (ISBN 0321185781).
Part of the data was generated with custom programs while the rest was hand entered.
None came from the CDROM included with the book.
Questions or Comments
Numerous scribes developed the writing systems of the world. I am grateful to them and to the various national,
international and commercial groups that have organized and published this information, and
to the authors of ISBN 0321185781 (not named to avoid using a trademark) for a very clear and helpful reference resource.
Thanks most of all to my father in heaven, the creator of everything,
for providing me the ability, resources and motivation to undertake this work.
Public Domain Dedication
I dedicate this version of the "Common Character Database of 2003" to the public domain on March 5, 2019. --Scot Doyle