The libUnihan project provides a normalized SQLite Unihan database and a corresponding C library. All tables in this database are in fifth normal form.

The database and its corresponding library are useful in many areas, such as standard Chinese character (Hanzi) queries, variant character research, and input method development. A Hanzi can be searched by its Unicode code point (decimal or hexadecimal), pronunciation, radical-stroke index, major standard, and so on.

Many similar projects convert Unihan.txt to SQLite format, and some even claim to be "normalized". To test whether such a database is normalized, query the kSemanticVariant tag of character U+5275 (創). If it returns:

U+5205<kMatthews U+5231<kMeyerWempe,kHanYu U+6227<kMatthews

then it is not normalized, because it violates the "no multiple values in one cell" requirement of first normal form (1NF).
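
To make the 1NF requirement concrete, the following C sketch splits such a multi-valued cell into one (variant, source) pair per row. It is only an illustration: the VariantRow type and the split_semantic_variant() function are hypothetical names, not part of the libUnihan API, and the actual libUnihan tables may decompose the field differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SOURCE_MAX 32

/* One normalized row: a single variant code point with a single source tag.
 * Hypothetical type for illustration; not a libUnihan structure. */
typedef struct {
    unsigned long variant;      /* code point of the variant character */
    char source[SOURCE_MAX];    /* e.g. "kMatthews"; empty if no source given */
} VariantRow;

/* Split a raw kSemanticVariant value such as
 *   "U+5231<kMeyerWempe,kHanYu U+6227<kMatthews"
 * into one row per (variant, source) pair.
 * Returns the number of rows written, or -1 on malformed input or overflow. */
int split_semantic_variant(const char *value, VariantRow *rows, int max_rows)
{
    const char *p = value;
    int count = 0;

    assert(value != NULL && rows != NULL);
    while (*p != '\0') {
        unsigned long cp;
        int consumed = 0;

        while (*p == ' ') p++;                 /* skip separators between entries */
        if (*p == '\0') break;
        if (sscanf(p, "U+%lX%n", &cp, &consumed) != 1) return -1;
        p += consumed;

        if (*p != '<') {                       /* entry without any source tag */
            if (*p != ' ' && *p != '\0') return -1;
            if (count >= max_rows) return -1;
            rows[count].variant = cp;
            rows[count].source[0] = '\0';
            count++;
            continue;
        }
        do {                                   /* one row per comma-separated source */
            size_t len;
            p++;                               /* skip '<' or ',' */
            len = strcspn(p, ", ");
            if (len == 0 || len >= SOURCE_MAX) return -1;
            if (count >= max_rows) return -1;
            rows[count].variant = cp;
            memcpy(rows[count].source, p, len);
            rows[count].source[len] = '\0';
            count++;
            p += len;
        } while (*p == ',');
    }
    return count;
}
```

For the value shown above, the function yields four rows: U+5205 with kMatthews, U+5231 with kMeyerWempe, U+5231 with kHanYu, and U+6227 with kMatthews. Each row holds exactly one value per field, which is what 1NF requires.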

Announcement (2009/01/13)

Since we have rewritten nearly all of libUnihan, the upcoming release is named 1.0.0. The major changes are:
  1. Database schema changes: The database has been restructured to be more consistent. Fields with the same name have the same semantic meaning and yield sensible results under union, intersection, and except operations. Values of the same field also share the same format, including number padding, letter case, and tone format. Thus, the following changes were made:
    • Some Unihan tags, such as kXHC1983 and the dictionary references, are further split into sub-fields.
    • Romanized pronunciations (e.g. Mandarin) are always stored in lowercase.
    • Major standards are now stored as integers and displayed as hexadecimal.
  2. API changes:
    • Functions in Unihan.h now handle not only the built-in DBs but custom DBs as well.
    • Formatted string-combining functions. Like printf() in C, these functions combine a list of string arguments into a new string according to the directives in a format string. So far they support conditional substitution, case changing, padding, substrings, and counters. Moreover, the directives are nestable, which enables finer output control. They can also be combined with the regex functions to build versatile parsers.
    The API of the older version (0.5.3) is kept here for reference.

Note: For compatibility with the original Unihan, the output of original Unihan fields (the fields that appear in Unicode's Unihan.txt) is identical to that in Unihan.txt.
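
The integer storage and lowercase conventions listed under the database schema changes can be illustrated with a minimal C sketch. The function names, the sample code "A4AD", and the four-digit padding are assumptions made for this example; none of them is taken from the libUnihan source.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse a hexadecimal standard code such as "A4AD" into an integer for storage.
 * Accepts 1 to 8 hex digits only. Returns 0 on success, -1 on invalid input. */
int standard_code_parse(const char *hex, unsigned long *out)
{
    size_t len;

    assert(hex != NULL && out != NULL);
    len = strlen(hex);
    if (len == 0 || len > 8 || strspn(hex, "0123456789ABCDEFabcdef") != len)
        return -1;
    *out = strtoul(hex, NULL, 16);
    return 0;
}

/* Render a stored integer as upper-case hexadecimal, zero-padded to `width`
 * digits, so every value of the field is displayed in the same format.
 * Returns 0 on success, -1 if the buffer is too small. */
int standard_code_format(unsigned long value, int width, char *out, size_t out_size)
{
    int n;

    assert(out != NULL && out_size > 0);
    n = snprintf(out, out_size, "%0*lX", width, value);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Lowercase a romanized reading in place (ASCII letters only). */
void pronunciation_to_lower(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        *s = (char)tolower((unsigned char)*s);
}
```

Under these assumptions, "A4AD" is stored as the integer 42157 and formatted back as "A4AD", while a reading such as "ZHONG1" becomes "zhong1". Storing one canonical form per field is what lets set operations across tables compare values directly.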

Features

The project has two parts: the Unihan character database, and the C library that produces and operates on the database.

Database

C library/API

Frequently asked questions

  1. Why do you develop libUnihan?
    To address the U+8AAA (說) / U+8AAC (説) bug. The information provided by Unihan.txt is sufficient to resolve it; however, a C API over an SQL-based Unihan character database is more convenient for C developers.
  2. What is the database schema of libUnihan?
    See the table and field descriptions for details.
  3. Can libUnihan tell which region/country a character belongs to?
    Not exactly, because a character usually appears in many regions. For example, U+4E94 (五) appears in China, Japan, Korea, and Vietnam. However, three functions provide region-sensitive information:
    1. unihanChar_is_in_source(): tests whether the character is in the given Ideographic Rapporteur Group (IRG) source.
    2. unihanChar_is_in_sources(): a convenience wrapper around 1), but it returns the first matched IRG source.
    3. unihanChar_is_common_in_locale(): a convenience wrapper around 2) that tells whether the character is frequently used in the locale.
    Normally, 3) is preferable because it avoids confusion such as U+8AAA (說)/U+8AAC (説) and U+52FB (勻)/U+5300 (匀), unless input of rare characters is needed.
  4. What is the license of libUnihan?
    libUnihan itself is released under LGPLv2, while its database, UnihanDb, is released under the MIT license.
  5. How many characters are covered by libUnihan?
    libUnihan is based on the Unihan character database, so it has every character that Unihan has. The code point ranges covered by the current Unihan (5.1.0) are:
    • U+3400..U+4DB5: CJK Unified Ideographs Extension A
    • U+4E00..U+9FA5: CJK Unified Ideographs
    • U+9FA6..U+9FBB: CJK Unified Ideographs (4.1)
    • U+9FBC..U+9FC3: CJK Unified Ideographs (5.1)
    • U+F900..U+FA2D: CJK Compatibility Ideographs (a)
    • U+FA30..U+FA6A: CJK Compatibility Ideographs (b)
    • U+FA70..U+FAD9: CJK Compatibility Ideographs (4.1)
    • U+20000..U+2A6D6: CJK Unified Ideographs Extension B
    • U+2F800..U+2FA1D: CJK Compatibility Supplement
    71234 characters in total.
  6. Can we use Python or another language to access the database?
    Yes. The libUnihan database is based on SQLite, so it can be accessed from any language that provides an SQLite binding. See the table and field descriptions for details. Nevertheless, libUnihan provides various C functions for convenience.
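
The total given in question 5 can be checked with a few lines of C that sum the sizes of the listed ranges. The CodeRange type, the range table, and the function names below are hypothetical and exist only for this sketch; they are not part of libUnihan.

```c
#include <assert.h>
#include <stddef.h>

/* An inclusive range of code points. Hypothetical helper type. */
typedef struct {
    unsigned long first;
    unsigned long last;
} CodeRange;

/* The ranges listed in question 5 for Unihan 5.1.0. */
static const CodeRange UNIHAN_5_1_RANGES[] = {
    { 0x3400UL,  0x4DB5UL  },  /* CJK Unified Ideographs Extension A */
    { 0x4E00UL,  0x9FA5UL  },  /* CJK Unified Ideographs */
    { 0x9FA6UL,  0x9FBBUL  },  /* CJK Unified Ideographs (4.1) */
    { 0x9FBCUL,  0x9FC3UL  },  /* CJK Unified Ideographs (5.1) */
    { 0xF900UL,  0xFA2DUL  },  /* CJK Compatibility Ideographs (a) */
    { 0xFA30UL,  0xFA6AUL  },  /* CJK Compatibility Ideographs (b) */
    { 0xFA70UL,  0xFAD9UL  },  /* CJK Compatibility Ideographs (4.1) */
    { 0x20000UL, 0x2A6D6UL },  /* CJK Unified Ideographs Extension B */
    { 0x2F800UL, 0x2FA1DUL }   /* CJK Compatibility Supplement */
};

#define UNIHAN_5_1_RANGE_COUNT \
    (sizeof UNIHAN_5_1_RANGES / sizeof UNIHAN_5_1_RANGES[0])

/* Sum the number of code points in n inclusive ranges. */
unsigned long count_code_points(const CodeRange *ranges, size_t n)
{
    unsigned long total = 0;
    size_t i;

    assert(ranges != NULL);
    for (i = 0; i < n; i++) {
        assert(ranges[i].first <= ranges[i].last);
        total += ranges[i].last - ranges[i].first + 1;
    }
    return total;
}

/* Return 1 if cp falls inside any of the n ranges, 0 otherwise. */
int is_in_ranges(unsigned long cp, const CodeRange *ranges, size_t n)
{
    size_t i;

    assert(ranges != NULL);
    for (i = 0; i < n; i++) {
        if (cp >= ranges[i].first && cp <= ranges[i].last)
            return 1;
    }
    return 0;
}
```

Calling count_code_points(UNIHAN_5_1_RANGES, UNIHAN_5_1_RANGE_COUNT) gives 71234, matching the total stated above. The is_in_ranges() helper can be used in the same way to test whether a code point such as U+5275 falls inside the covered ranges.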

News

Version 0.5.3 Released: 2008-10-20

This release fixes the missing API documentation and corrects some functions in collection.[ch] and file_functions.[ch] in preparation for libUnihan 0.6.

Version 0.5.2 Released: 2008-10-06

This release provides further support for ZhuYin and PinYin, such as the ZhuYin pseudo field and new unihan_query options: -Z, -z, -P, -p.

unihan_query can now show not only the result fields but also the given fields, with -Oflags. This makes result checking more convenient, especially for SQL-like queries.

A test suite has now been introduced into libUnihan. Many bugs have been found with it. :-)

Version 0.5.1 Released: 2008-09-23

Version 0.4.1 Released: 2008-08-08

Version 0.3.1 Released: 2008-07-04

Added kMandarin frequency rank support.

Version 0.3.0 Released: 2008-07-01

Initial public release