| * Copyright (C) 2016 and later: Unicode, Inc. and others. |
| * License & terms of use: http://www.unicode.org/copyright.html |
| * Copyright (C) 2004-2016, International Business Machines |
| * Corporation and others. All Rights Reserved. |
| * |
| * file name: changes.txt |
| * encoding: US-ASCII |
| * tab size: 8 (not used) |
| * indentation: 4 |
| * |
| * created on: 2004may06 |
| * created by: Markus W. Scherer |
| * |
| * change log for Unicode updates |
| * |
| * For each new Unicode version, during the beta period, |
| * I copy the change log for the previous version to the top of this file. |
| * I adjust the versions, tickets, URLs, and paths. |
| * I work my way through the steps listed in the log, top to bottom, |
| * adjusting the log as necessary. |
| * I report problems to the UTC and/or CLDR and/or ICU. |
| * Before the data is final, I "turn the crank" several more times, |
| * using appropriate subsets of the steps. |
| |
| ---------------------------------------------------------------------------- *** |
| |
| * New ISO 15924 script codes |
| |
| Starting with ICU 55, we no longer add UScriptCode constants for new scripts |
| until they are encoded in Unicode, |
| or can be assumed to be encoded in the next Unicode version. |
| Script enum constant names should follow the Unicode script property value aliases, |
| which are assigned only when the scripts are encoded. |
| When we add constants early and guess the names wrong, we end up with confusing enum constants |
| and have sometimes had to add aliases. |
| |
| Variant script codes like Latf and Aran that are not subject to separate encoding |
| can be added at any time. |
| (For example, Aran could be added as USCRIPT_ARABIC_NASTALIQ.) |
| |
| We add script codes used in CLDR or in the spoof checker. |
| This includes combination/alias codes like Hanb and Jamo. |
| See http://unicode.org/reports/tr35/#unicode_script_subtag_validity |
| and look for "alias" on http://unicode.org/iso15924/iso15924-codes.html |
| |
| We add special Z* script codes like Zsye. |
| |
| For new script codes see http://www.unicode.org/iso15924/codechanges.html |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 13.0 update for ICU 66 |
| |
| https://www.unicode.org/versions/Unicode13.0.0/ |
| https://www.unicode.org/versions/beta-13.0.0.html |
| https://www.unicode.org/Public/13.0.0/ucd/ |
| https://www.unicode.org/reports/uax-proposed-updates.html |
| https://www.unicode.org/reports/tr44/tr44-25.html |
| |
| https://unicode-org.atlassian.net/browse/CLDR-13387 |
| https://unicode-org.atlassian.net/browse/ICU-20893 |
| |
| * Command-line environment setup |
| |
| UNICODE_DATA=~/unidata/uni13/20200212 |
| CLDR_SRC=~/cldr/uni/src |
| ICU_ROOT=~/icu/uni |
| ICU_SRC=$ICU_ROOT/src |
| ICUDT=icudt66b |
| ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in |
| ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata |
| export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - com.ibm.icu.util.VersionInfo |
| - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ |
| |
| - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h |
| so that the makefiles see the new version number. |
| cd $ICU_ROOT/dbg/icu4c |
| ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh |
| |
| *** data files & enums & parser code |
| |
| * download files |
| - mkdir -p $UNICODE_DATA |
| - download Unicode files into $UNICODE_DATA |
| + subfolders: emoji, idna, security, ucd, uca |
| + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip |
| + split Unihan into single-property files |
| ~/unitools/trunk/src$ py/splitunihan.py $UNICODE_DATA/ucd/Unihan |
| + get GraphemeBreakTest-cldr.txt from $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt |
| or from the ucd/cldr/ output folder of the Unicode Tools: |
| Since Unicode 12/CLDR 35/ICU 64, CLDR uses modified break rules. |
| cp $CLDR_SRC/common/properties/segments/GraphemeBreakTest.txt icu4c/source/test/testdata |
| |
| * for manual diffs and for Unicode Tools input data updates: |
| remove version suffixes from the file names |
| ~$ unidata/desuffixucd.py $UNICODE_DATA |
| (see https://sites.google.com/site/unicodetools/inputdata) |
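The renaming step can be pictured with a small Python sketch. This is not the actual desuffixucd.py. The suffix shape (a hyphen, a dotted version number, and an optional draft marker such as d5, directly before the file extension) and the function name are assumptions made for illustration.

```python
import re

# Assumed suffix shape: "-13.0.0" or "-13.0.0d5", immediately before the extension.
_VERSION_SUFFIX = re.compile(r"-\d+(?:\.\d+)+(?:d\d+)?(?=\.[A-Za-z0-9]+$)")


def strip_version_suffix(file_name):
    """Return file_name without a version suffix; names without one are unchanged."""
    return _VERSION_SUFFIX.sub("", file_name)
```

Names that merely contain a hyphen, such as emoji-data.txt, are left alone because the hyphen is not followed by digits.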
| |
| * process and/or copy files |
| - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC |
| + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| + For debugging, and tweaking how ppucd.txt is written, |
| the tool has an --only_ppucd option: |
| py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile |
| |
| - cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA |
| |
| * new constants for new property values |
| - preparseucd.py error: |
| ValueError: missing uchar.h enum constants for some property values: |
| [(u'blk', set([u'Symbols_For_Legacy_Computing', u'Dives_Akuru', u'Yezidi', |
| u'Tangut_Sup', u'CJK_Ext_G', u'Khitan_Small_Script', u'Chorasmian', u'Lisu_Sup'])), |
| (u'sc', set([u'Chrs', u'Diak', u'Kits', u'Yezi'])), |
| (u'InPC', set([u'Top_And_Bottom_And_Left']))] |
| = PropertyValueAliases.txt new property values (diff old & new .txt files) |
| blk; Chorasmian ; Chorasmian |
| blk; CJK_Ext_G ; CJK_Unified_Ideographs_Extension_G |
| blk; Dives_Akuru ; Dives_Akuru |
| blk; Khitan_Small_Script ; Khitan_Small_Script |
| blk; Lisu_Sup ; Lisu_Supplement |
| blk; Symbols_For_Legacy_Computing ; Symbols_For_Legacy_Computing |
| blk; Tangut_Sup ; Tangut_Supplement |
| blk; Yezidi ; Yezidi |
| -> add to uchar.h before UBLOCK_COUNT |
| use long property names for enum constants; |
| for the trailing comment, get the block start code point (diff old & new Blocks.txt) |
| -> add to UCharacter.UnicodeBlock IDs |
| Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) |
| replace public static final int \1_ID = \2; \3 |
| -> add to UCharacter.UnicodeBlock objects |
| Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) |
| replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 |
| |
| sc ; Chrs ; Chorasmian |
| sc ; Diak ; Dives_Akuru |
| sc ; Kits ; Khitan_Small_Script |
| sc ; Yezi ; Yezidi |
| -> uscript.h & com.ibm.icu.lang.UScript |
| -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
| |
| InPC; Top_And_Bottom_And_Left ; Top_And_Bottom_And_Left |
| -> uchar.h enum UIndicPositionalCategory & UCharacter.java IndicPositionalCategory |
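The two Eclipse find/replace pairs for UCharacter.UnicodeBlock carry over to Python's re.sub unchanged, which may help when Eclipse is not at hand. The sketch below is only an illustration. The function names are made up, and the sample enum line imitates the uchar.h layout described above, with a value and comment that should be treated as placeholders.

```python
import re

# Same patterns as the Eclipse find/replace pairs above.
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \1_ID = \2; \3"

OBJECT_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJECT_REPLACE = (
    r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'
)


def to_java_id(c_enum_line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock ID constant."""
    return ID_FIND.sub(ID_REPLACE, c_enum_line)


def to_java_object(c_enum_line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock object constant."""
    return OBJECT_FIND.sub(OBJECT_REPLACE, c_enum_line)


# Placeholder line in the assumed uchar.h layout (not copied from the header).
SAMPLE = "UBLOCK_YEZIDI = 308, /*[10E80]*/"
```

Lines that do not match either pattern pass through unchanged, so the functions can be applied to a whole pasted enum block line by line.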
| |
| * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata |
| (not strictly necessary for NOT_ENCODED scripts) |
| $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt |
| |
| * build ICU (make install) |
| to make sure that there are no syntax errors, and |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date |
| |
| * update spoof checker UnicodeSet initializers: |
| inclusionPat & recommendedPat in i18n/uspoof.cpp |
| INCLUSION & RECOMMENDED in SpoofChecker.java |
| - make sure that the Unicode Tools tree contains the latest security data files |
| - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator |
| - update the hardcoded version number there in the DIRECTORY path |
| - run the tool (no special environment variables needed) |
| - copy & paste from the Console output into the .cpp & .java files |
| |
| * generate normalization data files |
| cd $ICU_ROOT/dbg/icu4c |
| bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt |
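Each .nrm file is built from a cumulative list of input files (nfkc builds on nfc, nfkc_cf builds on both). The following Python sketch only assembles the argument lists for the four .nrm commands shown above; it does not run gennorm2, and it leaves out the --csource variant. The function and table names are hypothetical.

```python
import posixpath

# Output file name -> input files, as listed in the commands above.
NRM_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(data_in, unidata):
    """Return one argv list per .nrm file, mirroring the command lines above."""
    norm2_dir = posixpath.join(unidata, "norm2")
    commands = []
    for output_name, inputs in NRM_INPUTS.items():
        output_path = posixpath.join(data_in, output_name)
        commands.append(["bin/gennorm2", "-o", output_path, "-s", norm2_dir] + inputs)
    return commands
```

The returned lists could be passed to subprocess.run from $ICU_ROOT/dbg/icu4c, with data_in and unidata set from $ICU4C_DATA_IN and $ICU4C_UNIDATA.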
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date |
| |
| * build Unicode tools using CMake+make |
| |
| $ICU_SRC/tools/unicode/c/icudefs.txt: |
| |
| # Location (--prefix) of where ICU was installed. |
| set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) |
| # Location of the ICU4C source tree. |
| set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) |
| |
| $ICU_ROOT/dbg$ |
| mkdir -p tools/unicode/c |
| cd tools/unicode/c |
| |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| cmake ../../../../src/tools/unicode/c |
| make |
| |
| * generate core properties data files |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| genprops/genprops $ICU_SRC/icu4c |
| - tool failure: |
| genprops: Script_Extensions indexes overflow bit field |
| genprops: error parsing or setting values from ppucd.txt line 32696 - U_BUFFER_OVERFLOW_ERROR |
| -> uprops.icu data file format: |
| add two more bits to store a script code or Script_Extensions index |
| -> generator code, C++ & Java runtime, uprops.icu format version 7.7 |
| - rebuild ICU (make install) & tools |
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0..13.0: U+2260, U+226E, U+226F |
| - nothing new in this Unicode version, no test file to update |
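The grep for "disallowed_STD3_valid" on non-ASCII characters could also be done with a few lines of Python. This is a sketch that assumes the usual "code point or range ; status ; ... # comment" layout of IdnaMappingTable.txt. The sample lines are written by hand in that layout, not copied from the data file.

```python
def non_ascii_std3_valid(lines):
    """Return (start, end) code point pairs above U+007F with status disallowed_STD3_valid."""
    result = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        start_cp = int(start, 16)
        end_cp = int(end, 16) if end else start_cp
        if end_cp > 0x7F:
            result.append((start_cp, end_cp))
    return result


# Hand-written sample lines in the assumed file layout.
SAMPLE = [
    "0000..002C    ; disallowed_STD3_valid    # <control-0000>..COMMA",
    "0041          ; mapped ; 0061            # LATIN CAPITAL LETTER A",
    "2260          ; disallowed_STD3_valid    # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid    # NOT LESS-THAN..NOT GREATER-THAN",
]
```

Any range in the result that is not already covered by U+2260, U+226E, U+226F would be a candidate for the test file updates.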
| |
| * run & fix ICU4C tests |
| - fix Unicode Tools class Segmenter to generate correct *BreakTest.txt files |
| - Andy helps with RBBI & spoof check test failures |
| |
| * collation: CLDR collation root, UCA DUCET |
| |
| - UCA DUCET goes into Mark's Unicode tools, see |
| https://sites.google.com/site/unicodetools/home#TOC-UCA |
| diff the main mapping file, look for bad changes |
| (for example, more bytes per weight for common characters) |
| ~/svn.unitools/trunk$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/13.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-13.0.txt |
| ~/svn.unitools/trunk$ meld ../frac-12.1.txt ../frac-13.0.txt |
| |
| - CLDR root data files are checked into $CLDR_SRC/common/uca/ |
| cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ |
| |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt |
| (note removing the underscore before "Rules") |
| cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt |
| - restore TODO diffs in UCARules.txt |
| meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt |
| - update (ICU4C)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| from the CLDR root files (..._CLDR_..._SHORT.txt) |
| cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt |
| cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt |
| cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data |
| - if CLDR common/uca/unihan-index.txt changes, then update |
| CLDR common/collation/root.xml <collation type="private-unihan"> |
| and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt |
| |
| - run genuca |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ |
| genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c |
| - rebuild ICU4C |
| |
| * Unihan collators |
| https://sites.google.com/site/unicodetools/unihan |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollators |
| with VM arguments |
| -ea |
| -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk |
| -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools |
| -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data |
| -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src |
| -DUVERSION=13.0.0 |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollatorFiles |
| with the same arguments |
| - check CLDR diffs |
| cd $CLDR_SRC |
| meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml |
| meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml |
| - copy to CLDR |
| cd $CLDR_SRC |
| cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml |
| cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml |
| - run CLDR unit tests, commit to CLDR |
| - generate ICU zh collation data: run CLDR |
| org.unicode.cldr.icu.NewLdml2IcuConverter |
| with program arguments |
| -t collation |
| -s /usr/local/google/home/mscherer/cldr/uni/src/common/collation |
| -m /usr/local/google/home/mscherer/cldr/uni/src/common/supplemental |
| -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll |
| -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation |
| zh |
| and VM arguments |
| -ea |
| -DCLDR_DIR=/usr/local/google/home/mscherer/cldr/uni/src |
| - rebuild ICU4C |
| |
| * run & fix ICU4C tests, now with new CLDR collation root data |
| - run all tests with the collation test data *_SHORT.txt or the full files |
| (the full ones have comments, useful for debugging) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| |
| * update Java data files |
| - refresh just the UCD/UCA-related/derived files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt66b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt66l.dat ./out/icu4j/icudt66b.dat -s ./out/build/icudt66l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt66b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt66b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt66b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt66b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt66b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files, |
| and then refresh ICU4J |
| cd $ICU_ROOT/dbg/icu4c/data/out/icu4j |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu |
| cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
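The file selection made by the cp and rm commands above can be summarized as a predicate. The Python sketch below is an illustration with a hypothetical function name. It takes paths relative to com/ibm/icu/impl/data/$ICUDT with "/" separators, and it assumes that the coll and brkitr folders contain only files directly inside them.

```python
import fnmatch


def is_unicode_data_file(rel_path):
    """Return True if the file is among those copied for the Unicode data refresh."""
    if rel_path.startswith(("coll/", "brkitr/")):
        return True  # cp coll/* and cp brkitr/*
    if "/" in rel_path:
        return False  # other subfolders are not copied
    if rel_path == "cnvalias.icu":
        return False  # copied by *.icu, then removed again with rm
    return (
        rel_path == "confusables.cfu"
        or fnmatch.fnmatchcase(rel_path, "*.icu")
        or fnmatch.fnmatchcase(rel_path, "*.nrm")
    )
```

Everything for which the predicate returns False stays as it is in the existing icudata.jar, because jar uvf only updates the entries found under /tmp/icu4j.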
| |
| * When refreshing all of ICU4J data from ICU4C |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data |
| or |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install |
| |
| * update CollationFCD.java |
| + copy & paste the initializers of lcccIndex[] etc. from |
| ICU4C/source/i18n/collationfcd.cpp to |
| ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd $ICU_SRC/icu4c/source/data/unidata |
| cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd ../../test/testdata |
| cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * run & fix ICU4J tests |
| |
| *** API additions |
| - send notice to icu-design about new born-@stable API (enum constants etc.) |
| |
| *** CLDR numbering systems |
| - look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR |
| for example, look for |
| ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt |
| in new blocks (Blocks.txt) |
| Unicode 13: |
| diak 11950..11959 Dives_Akuru |
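Searching for the digit four and deriving the zero..nine range from it can be sketched in Python with the same pattern as the egrep command. The sketch assumes that a matching ppucd.txt line starts with "cp;" followed by the hexadecimal code point. The sample lines are invented in that style and are much shorter than real ppucd.txt lines.

```python
import re

# Same pattern as the egrep command above.
DIGIT_FOUR = re.compile(r";gc=Nd.+;nv=4")


def digit_ranges(ppucd_lines):
    """Return (zero, nine) code point pairs, one per decimal digit four that is found."""
    ranges = []
    for line in ppucd_lines:
        if not DIGIT_FOUR.search(line):
            continue
        fields = line.split(";")
        four = int(fields[1], 16)  # assumed layout: cp;<code point>;...
        ranges.append((four - 4, four + 5))
    return ranges


# Invented sample lines in the assumed layout.
SAMPLE = [
    "cp;11953;gc=Nd;lb=NU;nv=3",
    "cp;11954;gc=Nd;lb=NU;nv=4",
    "cp;11900;gc=Lo;lb=AL",
]
```

Comparing the resulting ranges with the new blocks in Blocks.txt then shows which sets of digits are new in this Unicode version.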
| |
| *** merge the Unicode update branches back onto the trunk |
| - do not merge the icudata.jar and testdata.jar, |
| instead rebuild them from merged & tested ICU4C |
| - make sure that changes to Unicode tools are checked in: |
| http://www.unicode.org/utility/trac/log/trunk/unicodetools |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 12.1 update for ICU 64.2 |
| |
| ** This is an abbreviated update with one new character for the new |
| ** Japanese era expected to start on 2019-May-01: U+32FF SQUARE ERA NAME REIWA |
| https://en.wikipedia.org/wiki/Reiwa_period |
| |
| http://www.unicode.org/versions/Unicode12.1.0/ |
| |
| ICU-20497 Unicode 12.1 |
| |
| cldrbug 11978: Unicode 12.1 |
| |
| * Command-line environment setup |
| |
| UNICODE_DATA=~/unidata/uni121/20190403 |
| CLDR_SRC=~/svn.cldr/uni |
| ICU_ROOT=~/icu/uni |
| ICU_SRC=$ICU_ROOT/src |
| ICUDT=icudt64b |
| ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in |
| ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata |
| export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - com.ibm.icu.util.VersionInfo |
| - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ |
| |
| - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h |
| so that the makefiles see the new version number. |
| cd $ICU_ROOT/dbg/icu4c |
| ICU_DATA_BUILDTOOL_OPTS=--include_uni_core_data ../../../doconfig-clang-dbg.sh |
| |
| *** data files & enums & parser code |
| |
| * download files |
| - mkdir -p $UNICODE_DATA |
| - download Unicode files into $UNICODE_DATA |
| + subfolders: emoji, idna, security, ucd, uca |
| + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip |
| |
| * for manual diffs and for Unicode Tools input data updates: |
| remove version suffixes from the file names |
| ~$ unidata/desuffixucd.py $UNICODE_DATA |
| (see https://sites.google.com/site/unicodetools/inputdata) |
| |
| * process and/or copy files |
| - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC |
| + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| + For debugging, and tweaking how ppucd.txt is written, |
| the tool has an --only_ppucd option: |
| py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile |
| |
| - cp -v $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date |
| |
| * update spoof checker UnicodeSet initializers: |
| inclusionPat & recommendedPat in uspoof.cpp |
| INCLUSION & RECOMMENDED in SpoofChecker.java |
| - make sure that the Unicode Tools tree contains the latest security data files |
| - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator |
| - update the hardcoded version number there in the DIRECTORY path |
| - run the tool (no special environment variables needed) |
| - copy & paste from the Console output into the .cpp & .java files |
| |
| * generate normalization data files |
| cd $ICU_ROOT/dbg/icu4c |
| bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date |
| |
| * build Unicode tools using CMake+make |
| |
| $ICU_SRC/tools/unicode/c/icudefs.txt: |
| |
| # Location (--prefix) of where ICU was installed. |
| set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) |
| # Location of the ICU4C source tree. |
| set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) |
| |
| $ICU_ROOT/dbg$ |
| mkdir -p tools/unicode/c |
| cd tools/unicode/c |
| |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| cmake ../../../../src/tools/unicode/c |
| make |
| |
| * generate core properties data files |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| genprops/genprops $ICU_SRC/icu4c |
| genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ |
| genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c |
| - rebuild ICU (make install) & tools |
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0..12.1: U+2260, U+226E, U+226F |
| - nothing new in this Unicode version, no test file to update |
| |
| * run & fix ICU4C tests |
| - Andy handles RBBI & spoof check test failures |
| |
| * collation: CLDR collation root, UCA DUCET |
| |
| - UCA DUCET goes into Mark's Unicode tools, see |
| https://sites.google.com/site/unicodetools/home#TOC-UCA |
| diff the main mapping file, look for bad changes |
| (for example, more bytes per weight for common characters) |
| ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.1.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.1.txt |
| ~/svn.unitools/trunk$ meld ../frac-12.txt ../frac-12.1.txt |
| |
| - CLDR root data files are checked into $CLDR_SRC/common/uca/ |
| cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ |
| |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| cp -v $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| cp -v $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt |
| (note removing the underscore before "Rules") |
| cp -v $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt |
| - restore TODO diffs in UCARules.txt |
| meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt |
| - update (ICU4C)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| from the CLDR root files (..._CLDR_..._SHORT.txt) |
| cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt |
| cp -v $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt |
| cp -v $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data |
| - if CLDR common/uca/unihan-index.txt changes, then update |
| CLDR common/collation/root.xml <collation type="private-unihan"> |
| and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt |
| |
| - run genuca, see command line above |
| - rebuild ICU4C |
| |
| * Unihan collators |
| https://sites.google.com/site/unicodetools/unihan |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollators |
| with VM arguments |
| -ea |
| -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk |
| -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools |
| -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data |
| -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni |
| -DUVERSION=12.1.0 |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollatorFiles |
| with the same arguments |
| - check CLDR diffs |
| cd $CLDR_SRC |
| meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml |
| meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml |
| - copy to CLDR |
| cd $CLDR_SRC |
| cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml |
| cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml |
| - run CLDR unit tests, commit to CLDR |
| - generate ICU zh collation data: run CLDR |
| org.unicode.cldr.icu.NewLdml2IcuConverter |
| with program arguments |
| -t collation |
| -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation |
| -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental |
| -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll |
| -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation |
| zh |
| and VM arguments |
| -ea |
| -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni |
| - rebuild ICU4C |
| |
| * run & fix ICU4C tests, now with new CLDR collation root data |
| - run all tests with the collation test data *_SHORT.txt or the full files |
| (the full ones have comments, useful for debugging) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| |
| * update Java data files |
| - refresh just the UCD/UCA-related/derived files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| make[1]: Entering directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt64b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt64l.dat ./out/icu4j/icudt64b.dat -s ./out/build/icudt64l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt64b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt64b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt64b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt64b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt64b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files, |
| and then refresh ICU4J |
| cd $ICU_ROOT/dbg/icu4c/data/out/icu4j |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu |
| cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
| |
| * When refreshing all of ICU4J data from ICU4C |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data |
| or |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install |
| |
| * update CollationFCD.java |
| + copy & paste the initializers of lcccIndex[] etc. from |
| ICU4C/source/i18n/collationfcd.cpp to |
| ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd $ICU_SRC/icu4c/source/data/unidata |
| cp -v confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd ../../test/testdata |
| cp -v BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cp -v $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * run & fix ICU4J tests |
| |
| *** API additions |
| - send notice to icu-design about new born-@stable API (enum constants etc.) |
| |
| *** CLDR numbering systems |
| - look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR |
| for example, look for |
| ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt |
| in new blocks (Blocks.txt) |
| Unicode 12: using Unicode 12 CLDR ticket #11478 |
| hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong |
| wcho 1E2F0..1E2F9 Wancho |
| Unicode 11: using Unicode 11 CLDR ticket #10978 |
| rohg 10D30..10D39 Hanifi_Rohingya |
| gong 11DA0..11DA9 Gunjala_Gondi |
| Earlier: CLDR tickets specific to adding new numbering systems. |
| Unicode 10: http://unicode.org/cldr/trac/ticket/10219 |
| Unicode 9: http://unicode.org/cldr/trac/ticket/9692 |
| |
| *** merge the Unicode update branches back onto the trunk |
| - do not merge the icudata.jar and testdata.jar, |
| instead rebuild them from merged & tested ICU4C |
| - make sure that changes to Unicode tools are checked in: |
| http://www.unicode.org/utility/trac/log/trunk/unicodetools |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 12.0 update for ICU 64 |
| |
| http://www.unicode.org/versions/Unicode12.0.0/ |
| http://unicode.org/versions/beta-12.0.0.html |
| https://www.unicode.org/review/pri389/ |
| http://www.unicode.org/reports/uax-proposed-updates.html |
| http://www.unicode.org/reports/tr44/tr44-23.html |
| |
| ICU-20203 Unicode 12 |
| |
| ICU-20111 move text layout properties data into a data file |
| |
| cldrbug 11478: Unicode 12 |
| Accidentally used ^/trunk instead of ^/branches/markus/uni12 |
| |
| * Command-line environment setup |
| |
| UNICODE_DATA=~/unidata/uni12/20190309 |
| CLDR_SRC=~/svn.cldr/uni |
| ICU_ROOT=~/icu/uni |
| ICU_SRC=$ICU_ROOT/src |
| ICUDT=icudt63b |
| ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in |
| ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata |
| export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - com.ibm.icu.util.VersionInfo |
| - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ |
| |
| - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h |
| so that the makefiles see the new version number. |
| |
| *** data files & enums & parser code |
| |
| * download files |
| - mkdir -p $UNICODE_DATA |
| - download Unicode files into $UNICODE_DATA |
| + subfolders: emoji, idna, security, ucd, uca |
| + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip |
| |
| * for manual diffs and for Unicode Tools input data updates: |
| remove version suffixes from the file names |
| ~$ unidata/desuffixucd.py $UNICODE_DATA |
| (see https://sites.google.com/site/unicodetools/inputdata) |
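| |
| What "remove version suffixes" means can be sketched as follows. This is not the code of |
| desuffixucd.py; the regular expression and the function name are assumptions for |
| illustration, covering suffixes such as -12.0.0 and beta suffixes such as -12.0.0d5. |
```python
import re

# A hyphen, a dotted version number, an optional draft marker like "d5",
# all directly before the final file extension.
_SUFFIX_RE = re.compile(r"-\d+(?:\.\d+)*(?:d\d+)?(?=\.[A-Za-z]+$)")


def strip_version_suffix(name):
    """Return the file name without its version suffix, if it has one."""
    return _SUFFIX_RE.sub("", name)


for name in ["UnicodeData-12.0.0d5.txt", "Blocks-12.0.0.txt", "emoji-data.txt"]:
    print(name, "->", strip_version_suffix(name))
```
| Names without a numeric suffix, such as emoji-data.txt, are left unchanged. |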
| |
| * process and/or copy files |
| - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC |
| + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| + For debugging, and tweaking how ppucd.txt is written, |
| the tool has an --only_ppucd option: |
| py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile |
| |
| - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; date; make -j7 install &> out.txt ; tail -n 30 out.txt ; date |
| |
| * new constants for new property values |
| - preparseucd.py error: |
| ValueError: missing uchar.h enum constants for some property values: |
| [(u'blk', set([u'Symbols_And_Pictographs_Ext_A', u'Elymaic', |
| u'Ottoman_Siyaq_Numbers', u'Nandinagari', u'Nyiakeng_Puachue_Hmong', |
| u'Small_Kana_Ext', u'Egyptian_Hieroglyph_Format_Controls', u'Wancho', u'Tamil_Sup'])), |
| (u'sc', set([u'Nand', u'Wcho', u'Elym', u'Hmnp']))] |
| = PropertyValueAliases.txt new property values (diff old & new .txt files) |
| blk; Egyptian_Hieroglyph_Format_Controls; Egyptian_Hieroglyph_Format_Controls |
| blk; Elymaic ; Elymaic |
| blk; Nandinagari ; Nandinagari |
| blk; Nyiakeng_Puachue_Hmong ; Nyiakeng_Puachue_Hmong |
| blk; Ottoman_Siyaq_Numbers ; Ottoman_Siyaq_Numbers |
| blk; Small_Kana_Ext ; Small_Kana_Extension |
| blk; Symbols_And_Pictographs_Ext_A ; Symbols_And_Pictographs_Extended_A |
| blk; Tamil_Sup ; Tamil_Supplement |
| blk; Wancho ; Wancho |
| -> add to uchar.h |
| use long property names for enum constants, |
| for the trailing comment get the block start code point: diff old & new Blocks.txt |
| -> add to UCharacter.UnicodeBlock IDs |
| Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) |
| replace public static final int \1_ID = \2; \3 |
| -> add to UCharacter.UnicodeBlock objects |
| Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) |
| replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 |
| |
| sc ; Elym ; Elymaic |
| sc ; Hmnp ; Nyiakeng_Puachue_Hmong |
| sc ; Nand ; Nandinagari |
| sc ; Wcho ; Wancho |
| -> uscript.h & com.ibm.icu.lang.UScript |
| -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
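| |
| The two Eclipse find/replace patterns for the UBLOCK_ constants can be tried out with |
| Python's re module, which uses the same \1 group syntax. A minimal sketch: the numeric |
| value 999 in the sample line is a placeholder, not the real enum value, and the helper |
| names are made up. In the second pattern the trailing comment is capture group 2. |
```python
import re

ID_FIND = r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)"
ID_REPLACE = r"public static final int \1_ID = \2; \3"

OBJ_FIND = r"UBLOCK_([^ ]+) = [0-9]+, (/.+)"
OBJ_REPLACE = r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'


def to_java_id(uchar_h_line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock ID constant."""
    return re.sub(ID_FIND, ID_REPLACE, uchar_h_line)


def to_java_object(uchar_h_line):
    """Turn a uchar.h UBLOCK_ enum line into a UnicodeBlock object constant."""
    return re.sub(OBJ_FIND, OBJ_REPLACE, uchar_h_line)


# Placeholder enum value; the comment holds the block start code point.
SAMPLE = "UBLOCK_WANCHO = 999, /*[1E2C0]*/"
print(to_java_id(SAMPLE))
print(to_java_object(SAMPLE))
```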
| |
| * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata |
| (not strictly necessary for NOT_ENCODED scripts) |
| $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt |
| |
| * update spoof checker UnicodeSet initializers: |
| inclusionPat & recommendedPat in uspoof.cpp |
| INCLUSION & RECOMMENDED in SpoofChecker.java |
| - make sure that the Unicode Tools tree contains the latest security data files |
| - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator |
| - update the hardcoded version number there in the DIRECTORY path |
| - run the tool (no special environment variables needed) |
| - copy & paste from the Console output into the .cpp & .java files |
| |
| * generate normalization data files |
| cd $ICU_ROOT/dbg/icu4c |
| bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt |
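| |
| The five gennorm2 invocations differ only in output file, input list, and --csource. |
| As a sketch, they can be derived from a small table; the JOBS table and the helper are |
| hypothetical, and the script only prints the command lines for pasting into the shell. |
```python
SOURCE_DIR = "$ICU4C_UNIDATA/norm2"

# (output file, input files in order, write C source instead of binary data)
JOBS = [
    ("$ICU_SRC/icu4c/source/common/norm2_nfc_data.h", ["nfc.txt"], True),
    ("$ICU4C_DATA_IN/nfc.nrm", ["nfc.txt"], False),
    ("$ICU4C_DATA_IN/nfkc.nrm", ["nfc.txt", "nfkc.txt"], False),
    ("$ICU4C_DATA_IN/nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"], False),
    ("$ICU4C_DATA_IN/uts46.nrm", ["nfc.txt", "uts46.txt"], False),
]


def gennorm2_command(output, inputs, csource=False):
    """Build one gennorm2 command line as a string (shell variables left unexpanded)."""
    argv = ["bin/gennorm2", "-o", output, "-s", SOURCE_DIR] + list(inputs)
    if csource:
        argv.append("--csource")
    return " ".join(argv)


for output, inputs, csource in JOBS:
    print(gennorm2_command(output, inputs, csource))
```
| Every data file starts from nfc.txt and layers further input files on top of it. |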
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install &> out.txt ; tail -n 30 out.txt ; date |
| |
| * build Unicode tools using CMake+make |
| |
| $ICU_SRC/tools/unicode/c/icudefs.txt: |
| |
| # Location (--prefix) of where ICU was installed. |
| set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) |
| # Location of the ICU4C source tree. |
| set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/uni/src/icu4c) |
| |
| $ICU_ROOT/dbg$ |
| mkdir -p tools/unicode/c |
| cd tools/unicode/c |
| |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| cmake ../../../../src/tools/unicode/c |
| make |
| |
| * generate core properties data files |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| genprops/genprops $ICU_SRC/icu4c |
| genuca/genuca --hanOrder implicit $ICU_SRC/icu4c && \ |
| genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c |
| - rebuild ICU (make install) & tools |
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0..12.0: U+2260, U+226E, U+226F |
| - nothing new in this Unicode version, no test file to update |
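| |
| The grep for "disallowed_STD3_valid" on non-ASCII characters can be sketched like this. |
| The sample lines are shortened illustrations in the style of IdnaMappingTable.txt |
| (code point or range ; status ; ... # comment), and the function name is an assumption. |
```python
def non_ascii_std3_valid(lines):
    """Return non-ASCII code points whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        found.extend(cp for cp in range(start, end + 1) if cp > 0x7F)
    return found


# Illustrative lines only; the real file has many more entries.
SAMPLE = [
    "0000..002C    ; disallowed_STD3_valid   # 1.1  <control-0000>..COMMA",
    "2260          ; disallowed_STD3_valid   # 1.1  NOT EQUAL TO",
    "2261..2262    ; valid ; ; NV8           # 1.1  IDENTICAL TO..NOT IDENTICAL TO",
    "226E..226F    ; disallowed_STD3_valid   # 1.1  NOT LESS-THAN..NOT GREATER-THAN",
]

print(["U+%04X" % cp for cp in non_ascii_std3_valid(SAMPLE)])
```
| Any code point beyond the three already listed would call for new test cases. |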
| |
| * run & fix ICU4C tests |
| - update test of default bidi classes: |
| Bidi range \U0001ED00-\U0001ED4F changes default from R to AL, |
| see diffs in DerivedBidiClass.txt |
| + /tsutil/cucdtst/TestUnicodeData enumDefaultsRange() defaultBidi[] |
| + UCharacterTest.java TestIteration() defaultBidi[] |
| - Andy handles RBBI & spoof check test failures |
| |
| * collation: CLDR collation root, UCA DUCET |
| |
| - UCA DUCET goes into Mark's Unicode tools, see |
| https://sites.google.com/site/unicodetools/home#TOC-UCA |
| diff the main mapping file, look for bad changes |
| (for example, more bytes per weight for common characters) |
| ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/UCA/12.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-12.txt |
| ~/svn.unitools/trunk$ meld ../frac-11.txt ../frac-12.txt |
| |
| - CLDR root data files are checked into $CLDR_SRC/common/uca/ |
| cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ |
| |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt |
| (note removing the underscore before "Rules") |
| cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt |
| - restore TODO diffs in UCARules.txt |
| meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt |
| - update (ICU4C)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| from the CLDR root files (..._CLDR_..._SHORT.txt) |
| cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt |
| cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt |
| cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data |
| - if CLDR common/uca/unihan-index.txt changes, then update |
| CLDR common/collation/root.xml <collation type="private-unihan"> |
| and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt |
| |
| - run genuca, see command line above; |
| deal with |
| Error: Unknown script for first-primary sample character U+119CE on line 29233 of /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt: |
| FDD1 119CE; [71 CD 02, 05, 05] # Nandinagari first primary (compressible) |
| (add the character to genuca.cpp sampleCharsToScripts[]) |
| + This time, I added code to genuca.cpp to use uscript_getSampleUnicodeString(script) |
| and cache its values. |
| Works as long as the script metadata is updated before the collation data. |
| (That is, run parsescriptmetadata.py and rebuild before running genuca.) |
| - rebuild ICU4C |
| |
| * Unihan collators |
| https://sites.google.com/site/unicodetools/unihan |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollators |
| with VM arguments |
| -ea |
| -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk |
| -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools |
| -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data |
| -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni |
| -DUVERSION=12.0.0 |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollatorFiles |
| with the same arguments |
| - check CLDR diffs |
| cd $CLDR_SRC |
| meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml |
| meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml |
| - copy to CLDR |
| cd $CLDR_SRC |
| cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml |
| cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml |
| - run CLDR unit tests, commit to CLDR |
| - generate ICU zh collation data: run CLDR |
| org.unicode.cldr.icu.NewLdml2IcuConverter |
| with program arguments |
| -t collation |
| -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation |
| -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental |
| -d /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/coll |
| -p /usr/local/google/home/mscherer/icu/uni/src/icu4c/source/data/xml/collation |
| zh |
| and VM arguments |
| -ea |
| -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni |
| - rebuild ICU4C |
| |
| * run & fix ICU4C tests, now with new CLDR collation root data |
| - run all tests with the collation test data *_SHORT.txt or the full files |
| (the full ones have comments, useful for debugging) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| |
| * update Java data files |
| - refresh just the UCD/UCA-related/derived files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| Unicode .icu files built to ./out/build/icudt63l |
| echo timestamp > uni-core-data |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt63b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b |
| echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt63l.dat ./out/icu4j/icudt63b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt63l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt63b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt63b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt63b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt63b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt63b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory '/usr/local/google/home/mscherer/icu/uni/dbg/icu4c/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files, |
| and then refresh ICU4J |
| cd $ICU_ROOT/dbg/icu4c/data/out/icu4j |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| cp -v com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp -v com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu |
| cp -v com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp -v com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| cp -v com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
| |
| * When refreshing all of ICU4J data from ICU4C |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data |
| or |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install |
| |
| * update CollationFCD.java |
| + copy & paste the initializers of lcccIndex[] etc. from |
| ICU4C/source/i18n/collationfcd.cpp to |
| ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd $ICU_SRC/icu4c/source/data/unidata |
| cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd ../../test/testdata |
| cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * run & fix ICU4J tests |
| |
| *** API additions |
| - send notice to icu-design about new born-@stable API (enum constants etc.) |
| |
| *** CLDR numbering systems |
| - look for new sets of decimal digits (gc=Nd & nv=4) and add to CLDR |
| for example, look for |
| ~/icu/uni/src$ egrep ';gc=Nd.+;nv=4' icu4c/source/data/unidata/ppucd.txt |
| in new blocks (Blocks.txt) |
| Unicode 12: using Unicode 12 CLDR ticket #11478 |
| hmnp 1E140..1E149 Nyiakeng_Puachue_Hmong |
| wcho 1E2F0..1E2F9 Wancho |
| Unicode 11: using Unicode 11 CLDR ticket #10978 |
| rohg 10D30..10D39 Hanifi_Rohingya |
| gong 11DA0..11DA9 Gunjala_Gondi |
| Earlier: CLDR tickets specific to adding new numbering systems. |
| Unicode 10: http://unicode.org/cldr/trac/ticket/10219 |
| Unicode 9: http://unicode.org/cldr/trac/ticket/9692 |
| |
| *** merge the Unicode update branches back onto the trunk |
| - do not merge the icudata.jar and testdata.jar, |
| instead rebuild them from merged & tested ICU4C |
| - make sure that changes to Unicode tools are checked in: |
| http://www.unicode.org/utility/trac/log/trunk/unicodetools |
| |
| ---------------------------------------------------------------------------- *** |
| |
| ICU 63 addition of ICU support of text layout properties InPC, InSC, vo |
| |
| * Command-line environment setup |
| |
| UNICODE_DATA=~/unidata/uni11/20180609 |
| CLDR_SRC=~/svn.cldr/uni |
| ICU_ROOT=~/icu/mine |
| ICU_SRC=$ICU_ROOT/src |
| ICUDT=icudt62b |
| ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in |
| ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata |
| export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib |
| |
| *** Links |
| |
| https://unicode-org.atlassian.net/browse/ICU-8966 InPC & InSC |
| https://unicode-org.atlassian.net/browse/ICU-12850 vo |
| |
| *** data files & enums & parser code |
| |
| * API additions |
| - for each of the three new enumerated properties |
| + uchar.h: add the enum UProperty constant UCHAR_<long prop name> |
| + uchar.h: update UCHAR_INT_LIMIT |
| + uchar.h: add the enum U<long prop name> |
| with constants U_<short prop name>_<long value name> |
| + UProperty.java: add the constant <long prop name> |
| + UProperty.java: update INT_LIMIT |
| + UCharacter.java: add the interface <long prop name> |
| with constants <long value name> |
| |
| * process and/or copy files |
| - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC |
| + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| + It also writes tools/unicode/c/genprops/pnames_data.h with property and value |
| names and aliases. |
| + For debugging, and tweaking how ppucd.txt is written, |
| the tool has an --only_ppucd option: |
| py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile |
| |
| * preparseucd.py changes |
| - add new property short names (uppercase) to _prop_and_value_re |
| so that ParseUCharHeader() parses the new enum constants |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date |
| |
| * build Unicode tools using CMake+make |
| |
| $ICU_SRC/tools/unicode/c/icudefs.txt: |
| |
| # Location (--prefix) of where ICU was installed. |
| set(ICU_INST_DIR /usr/local/google/home/mscherer/icu/mine/inst/icu4c) |
| # Location of the ICU4C source tree. |
| set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/icu/mine/src/icu4c) |
| |
| $ICU_ROOT/dbg$ |
| mkdir -p tools/unicode/c |
| cd tools/unicode/c |
| |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| cmake ../../../../src/tools/unicode/c |
| make |
| |
| * generate core properties data files |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| genprops/genprops $ICU_SRC/icu4c |
| - rebuild ICU (make install) & tools |
| |
| * write data for runtime, hardcoded for now |
| - add genprops/layoutpropsbuilder.cpp with pieces from sibling files |
| - generate new icu4c/source/common/ulayout_props_data.h |
| - for each of the three new enumerated properties |
| + int property max value |
| + small, 8-bit UCPTrie |
| (A small 16-bit trie with bit fields for these three properties |
| is very nearly the same size as the sum of the three.) |
| |
| * wire into C++ |
| - uprops.cpp: #include ulayout_props_data.h |
| - uprops.cpp: add getInPC() etc. functions |
| - uprops.cpp: add lines to intProps[], include max values |
| - uprops.h: add UPropertySource constants |
| - uprops.cpp: add uprops_addPropertyStarts(src) |
| - uniset_props.cpp: add to UnicodeSet_initInclusion() |
| - intltest/ucdtest.cpp: write unit tests |
| |
| * update Java data files |
| - refresh just the pnames.icu file with the new property [value] names, just to be safe |
| - see $ICU_SRC/icu4c/source/data/icu4j-readme.txt |
| - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files, |
| and then refresh ICU4J |
| cd $ICU_ROOT/dbg/icu4c/data/out/icu4j |
| cp com/ibm/icu/impl/data/$ICUDT/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
| |
| * wire into Java |
| - UCharacterProperty.java: add new SRC_INPC etc. constants as in C++ |
| - UCharacterProperty.java: for each new property |
| + create a nested class to hold its CodePointTrie |
| + initialize it from a string literal |
| + paste in the initializer printed by genprops |
| + add a new IntProperty object to the intProps[] array |
| + use the correct max int value for each property, also printed by genprops |
| - UCharacterProperty.java: add ulayout_addPropertyStarts(src, set) |
| - UnicodeSet.java: add to getInclusions() |
| - UCharacterTest.java: write unit tests |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 11.0 update for ICU 62 |
| |
| http://www.unicode.org/versions/Unicode11.0.0/ |
| http://unicode.org/versions/beta-11.0.0.html |
| https://www.unicode.org/review/pri372/ |
| http://www.unicode.org/reports/uax-proposed-updates.html |
| http://www.unicode.org/reports/tr44/tr44-21.html |
| |
| * Command-line environment setup |
| |
| UNICODE_DATA=~/unidata/uni11/20180521 |
| CLDR_SRC=~/svn.cldr/uni |
| ICU_ROOT=~/svn.icu/uni |
| ICU_SRC=$ICU_ROOT/src |
| ICUDT=icudt61b |
| ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in |
| ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata |
| export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib |
| |
| *** ICU Trac |
| |
| - ticket:13630: Unicode 11 |
| - ^/branches/markus/uni11 |
| |
| *** CLDR Trac |
| |
| - cldrbug 10978: Unicode 11 |
| - ^/branches/markus/uni11 |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - com.ibm.icu.util.VersionInfo |
| - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ |
| |
| - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h |
| so that the makefiles see the new version number. |
| |
| *** data files & enums & parser code |
| |
| * download files |
| - mkdir -p $UNICODE_DATA |
| - download Unicode files into $UNICODE_DATA |
| + subfolders: emoji, idna, security, ucd, uca |
| + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip |
| |
| * for manual diffs and for Unicode Tools input data updates: |
| remove version suffixes from the file names |
| ~$ unidata/desuffixucd.py $UNICODE_DATA |
| (see https://sites.google.com/site/unicodetools/inputdata) |
| |
| * process and/or copy files |
| - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC |
| + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| + For debugging, and tweaking how ppucd.txt is written, |
| the tool has an --only_ppucd option: |
| py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile |
| |
| - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date |
| |
| * preparseucd.py changes |
| - fix other errors |
| NameError: unknown property Extended_Pictographic |
| -> add Extended_Pictographic binary property |
| -> add new short names for all Emoji properties |
| |
| * new constants for new property values |
| - preparseucd.py error: |
| ValueError: missing uchar.h enum constants for some property values: |
| [(u'blk', set([u'Georgian_Ext', u'Hanifi_Rohingya', u'Medefaidrin', u'Sogdian', u'Makasar', |
| u'Old_Sogdian', u'Dogra', u'Gunjala_Gondi', u'Chess_Symbols', u'Mayan_Numerals', |
| u'Indic_Siyaq_Numbers'])), |
| (u'jg', set([u'Hanifi_Rohingya_Kinna_Ya', u'Hanifi_Rohingya_Pa'])), |
| (u'sc', set([u'Medf', u'Sogd', u'Dogr', u'Rohg', u'Maka', u'Sogo', u'Gong'])), |
| (u'GCB', set([u'LinkC', u'Virama'])), |
| (u'WB', set([u'WSegSpace']))] |
| = PropertyValueAliases.txt new property values (diff old & new .txt files) |
| blk; Chess_Symbols ; Chess_Symbols |
| blk; Dogra ; Dogra |
| blk; Georgian_Ext ; Georgian_Extended |
| blk; Gunjala_Gondi ; Gunjala_Gondi |
| blk; Hanifi_Rohingya ; Hanifi_Rohingya |
| blk; Indic_Siyaq_Numbers ; Indic_Siyaq_Numbers |
| blk; Makasar ; Makasar |
| blk; Mayan_Numerals ; Mayan_Numerals |
| blk; Medefaidrin ; Medefaidrin |
| blk; Old_Sogdian ; Old_Sogdian |
| blk; Sogdian ; Sogdian |
| -> add to uchar.h |
| use long property names for enum constants, |
| for the trailing comment get the block start code point: diff old & new Blocks.txt |
| -> add to UCharacter.UnicodeBlock IDs |
| Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) |
| replace public static final int \1_ID = \2; \3 |
| -> add to UCharacter.UnicodeBlock objects |
| Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) |
| replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 |
| |
| GCB; LinkC ; LinkingConsonant |
| GCB; Virama ; Virama |
| -> uchar.h & UCharacter.GraphemeClusterBreak |
| -> these two were later removed again: http://www.unicode.org/L2/L2018/18115.htm#155-A76 |
| |
| InSC; Consonant_Initial_Postfixed ; Consonant_Initial_Postfixed |
| -> ignore: ICU does not yet support this property |
| |
| jg ; Hanifi_Rohingya_Kinna_Ya ; Hanifi_Rohingya_Kinna_Ya |
| jg ; Hanifi_Rohingya_Pa ; Hanifi_Rohingya_Pa |
| -> uchar.h & UCharacter.JoiningGroup |
| |
| sc ; Dogr ; Dogra |
| sc ; Gong ; Gunjala_Gondi |
| sc ; Maka ; Makasar |
| sc ; Medf ; Medefaidrin |
| sc ; Rohg ; Hanifi_Rohingya |
| sc ; Sogd ; Sogdian |
| sc ; Sogo ; Old_Sogdian |
| -> uscript.h & com.ibm.icu.lang.UScript |
| -> Nushu had been added already |
| -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
| |
| WB ; WSegSpace ; WSegSpace |
| -> uchar.h & UCharacter.WordBreak |
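| |
| For blocks and scripts, the enum constant names follow the long property value names. |
| A minimal sketch of deriving candidate names from PropertyValueAliases.txt lines; the |
| prefix table and the function are assumptions for illustration and cover only blk and sc. |
```python
# Property short name -> C enum constant prefix (only the two handled here).
PREFIXES = {"blk": "UBLOCK_", "sc": "USCRIPT_"}


def enum_constant(pva_line):
    """Return a candidate C enum constant name, or None for other properties."""
    fields = [f.strip() for f in pva_line.split(";")]
    if len(fields) < 3:
        return None
    prefix = PREFIXES.get(fields[0])
    if prefix is None:
        return None
    # fields[1] is the short value name, fields[2] the long one.
    return prefix + fields[2].upper()


for line in [
    "blk; Georgian_Ext ; Georgian_Extended",
    "sc ; Sogo ; Old_Sogdian",
    "WB ; WSegSpace ; WSegSpace",
]:
    print(line, "->", enum_constant(line))
```
| The derived names are only suggestions; existing constants and aliases still need a manual check. |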
| |
| * New short names for emoji properties |
| - see UTS #51 |
| - short names set in preparseucd.py |
| |
| * New properties |
| - boolean emoji property Extended_Pictographic |
| -> added in preparseucd.py |
| -> uchar.h & UProperty.java |
| - misc. property Equivalent_Unified_Ideograph (EqUIdeo) |
| as shown in PropertyValueAliases.txt |
| -> ignore for now |
| |
| * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata |
| (not strictly necessary for NOT_ENCODED scripts) |
| $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt |
| |
| * update spoof checker UnicodeSet initializers: |
| inclusionPat & recommendedPat in uspoof.cpp |
| INCLUSION & RECOMMENDED in SpoofChecker.java |
| - make sure that the Unicode Tools tree contains the latest security data files |
| - go to Unicode Tools org.unicode.text.tools.RecommendedSetGenerator |
| - update the hardcoded version number there in the DIRECTORY path |
| - run the tool (no special environment variables needed) |
| - copy & paste from the Console output into the .cpp & .java files |
| |
| * generate normalization data files |
| cd $ICU_ROOT/dbg/icu4c |
| bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date |
| |
| * build Unicode tools using CMake+make |
| |
| $ICU_SRC/tools/unicode/c/icudefs.txt: |
| |
| # Location (--prefix) of where ICU was installed. |
| set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) |
| # Location of the ICU4C source tree. |
| set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c) |
| |
| $ICU_ROOT/dbg$ |
| mkdir -p tools/unicode/c |
| cd tools/unicode/c |
| |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| cmake ../../../../src/tools/unicode/c |
| make |
| |
| * generate core properties data files |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| genprops/genprops $ICU_SRC/icu4c |
| genuca/genuca --hanOrder implicit $ICU_SRC/icu4c |
| genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c |
| - rebuild ICU (make install) & tools |
| |
| * Fix case props |
| genprops error: casepropsbuilder: too many exceptions words |
| genprops error: failure finalizing the data - U_BUFFER_OVERFLOW_ERROR |
| - With the addition of Georgian Mtavruli capital letters, |
| there are now too many simple case mappings with big mapping deltas |
| that yield uncompressible exceptions. |
| - Changing the data structure (now formatVersion 4), |
| adding one bit for no-simple-case-folding (for Cherokee), and |
| one optional slot for a big delta (for most faraway mappings), |
| together with another bit for whether that is negative. |
| This makes most Cherokee & Georgian etc. case mappings compressible, |
| reducing the number of exceptions words. |
| - Further changes to gain one more bit for the exceptions index, |
| for future growth. For details, see casepropsbuilder.cpp. |
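| |
| Why the Georgian and Cherokee mappings count as "big deltas" can be seen from simple |
| arithmetic. In this sketch the 9-bit signed inline delta width is an assumption made |
| for illustration only; casepropsbuilder.cpp has the actual bit layout. |
```python
INLINE_DELTA_BITS = 9  # assumed width of the inline signed delta, for illustration


def fits_inline(delta, bits=INLINE_DELTA_BITS):
    """True if delta fits into a signed field with the given number of bits."""
    limit = 1 << (bits - 1)
    return -limit <= delta < limit


# (lowercase or base code point, its simple uppercase mapping)
PAIRS = {
    "Latin a -> A": (0x0061, 0x0041),
    "Georgian an -> Mtavruli an": (0x10D0, 0x1C90),
    "Cherokee small a -> A": (0xAB70, 0x13A0),
}

for label, (src, dst) in PAIRS.items():
    delta = dst - src
    print("%-28s delta=%6d inline=%s" % (label, delta, fits_inline(delta)))
```
| Mappings whose delta does not fit inline need an exceptions entry, which the optional |
| big-delta slot keeps small. |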
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0..11.0: U+2260, U+226E, U+226F |
| - nothing new in this Unicode version, no test file to update |
| |
| * run & fix ICU4C tests |
| - Andy handles RBBI & spoof check test failures |
| |
| - Errors in char.txt, word.txt, word_POSIX.txt like |
| createRuleBasedBreakIterator: ICU Error "U_BRK_RULE_EMPTY_SET" at line 46, column 16 |
| because \p{Grapheme_Cluster_Break = EBG} and \p{Word_Break = EBG} are empty. |
| -> Temporary(!) workaround: Add an arbitrary code point to these sets to make them |
| not empty, just to get ICU building. |
| -> Intermediate workaround: Remove $E_Base_GAZ and other now-unused variables |
| and properties together with the rules that used them (GB 10, WB 14). |
| -> Andy adjusts the rule sets further to sync with |
| Unicode 11 grapheme, word, and line break spec changes. |
| |
| * collation: CLDR collation root, UCA DUCET |
| |
| - UCA DUCET goes into Mark's Unicode tools, see |
| https://sites.google.com/site/unicodetools/home#TOC-UCA |
| diff the main mapping file, look for bad changes |
| (for example, more bytes per weight for common characters) |
| ~/svn.unitools/trunk$ sed -r -f ~/svn.cldr/uni/tools/scripts/uca/blankweights.sed ../Generated/uca/11.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-11.txt |
| ~/svn.unitools/trunk$ meld ../frac-10.txt ../frac-11.txt |
| |
| - CLDR root data files are checked into $CLDR_SRC/common/uca/ |
| cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ |
| |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt |
| (note removing the underscore before "Rules") |
| cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt |
| - restore TODO diffs in UCARules.txt |
| meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt |
| - update (ICU4C)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| from the CLDR root files (..._CLDR_..._SHORT.txt) |
| cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt |
| cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt |
| cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data |
| - if CLDR common/uca/unihan-index.txt changes, then update |
| CLDR common/collation/root.xml <collation type="private-unihan"> |
| and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt |
| |
| - run genuca, see command line above; |
| deal with |
| Error: Unknown script for first-primary sample character U+1180B on line 28649 of /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/unidata/FractionalUCA.txt: |
| FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible) |
| (add the character to genuca.cpp sampleCharsToScripts[]) |
| + look up the USCRIPT_ code for the new sample characters |
| (should be obvious from the comment in the error output) |
| + *add* mappings to sampleCharsToScripts[], do not replace them |
| (in case the script sample characters flip-flop) |
| + insert new scripts in DUCET script order, see the top_byte table |
| at the beginning of FractionalUCA.txt |
| - rebuild ICU4C |
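When genuca reports an unknown first-primary sample character, the FractionalUCA.txt line quoted in the error already carries everything needed for the new sampleCharsToScripts[] entry. A minimal sketch of pulling those pieces apart, assuming the line layout shown in the error output above (the regular expression and function name are hypothetical, not part of genuca):

```python
import re

# "FDD1 <sample>; [<primary bytes>, <secondary>, <tertiary>] # <Script> first primary ..."
_FIRST_PRIMARY = re.compile(
    r"^FDD1 ([0-9A-F]{4,6}); \[([0-9A-F ]+),[^\]]*\] # (\S+) first primary")


def parse_first_primary(line):
    """Return (sample code point, primary weight bytes, script name) or None."""
    m = _FIRST_PRIMARY.match(line)
    if m is None:
        return None
    sample = int(m.group(1), 16)
    primary = [int(b, 16) for b in m.group(2).split()]
    return sample, primary, m.group(3)


line = "FDD1 1180B; [71 CC 02, 05, 05] # Dogra first primary (compressible)"
sample, primary, script = parse_first_primary(line)
print("U+%04X" % sample, ["%02X" % b for b in primary], script)
```

The script name from the comment is only a hint for looking up the USCRIPT_ constant by hand; the sketch does not derive the constant itself. The primary lead byte shows where the script falls in the top_byte table order.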
| |
| * Unihan collators |
| https://sites.google.com/site/unicodetools/unihan |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollators |
| with VM arguments |
| -ea |
| -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk |
| -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools |
| -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data |
| -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni |
| -DUVERSION=11.0.0 |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollatorFiles |
| with the same arguments |
| - check CLDR diffs |
| cd $CLDR_SRC |
| meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml |
| meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml |
| - copy to CLDR |
| cd $CLDR_SRC |
| cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml |
| cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml |
| - run CLDR unit tests, commit to CLDR |
| - generate ICU zh collation data: run CLDR |
| org.unicode.cldr.icu.NewLdml2IcuConverter |
| with program arguments |
| -t collation |
| -s /usr/local/google/home/mscherer/svn.cldr/uni/common/collation |
| -m /usr/local/google/home/mscherer/svn.cldr/uni/common/supplemental |
| -d /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/coll |
| -p /usr/local/google/home/mscherer/svn.icu/uni/src/icu4c/source/data/xml/collation |
| zh |
| and VM arguments |
| -ea |
| -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni |
| - rebuild ICU4C |
| |
| * run & fix ICU4C tests, now with new CLDR collation root data |
| - run all tests with the collation test data *_SHORT.txt or the full files |
| (the full ones have comments, useful for debugging) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| |
| * update Java data files |
| - refresh just the UCD/UCA-related/derived files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| Unicode .icu files built to ./out/build/icudt61l |
| echo timestamp > uni-core-data |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt61b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b |
| echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt61l.dat ./out/icu4j/icudt61b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt61l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt61b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt61b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt61b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt61b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt61b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory '/usr/local/google/home/mscherer/svn.icu/uni/dbg/icu4c/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files, |
| and then refresh ICU4J |
| cd $ICU_ROOT/dbg/icu4c/data/out/icu4j |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu |
| cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
| |
| * When refreshing all of ICU4J data from ICU4C |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data |
| or |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install |
| |
| * update CollationFCD.java |
| + copy & paste the initializers of lcccIndex[] etc. from |
| ICU4C/source/i18n/collationfcd.cpp to |
| ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd $ICU_SRC/icu4c/source/data/unidata |
| cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd ../../test/testdata |
| cp BidiCharacterTest.txt BidiTest.txt IdnaTestV2.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * run & fix ICU4J tests |
| |
| *** API additions |
| - send notice to icu-design about new born-@stable API (enum constants etc.) |
| |
| *** CLDR numbering systems |
| - look for new sets of decimal digits (gc=ND & nv=4) and add to CLDR |
| Unicode 11: using Unicode 11 CLDR ticket #10978 |
| rohg 10D30..10D39 Hanifi_Rohingya |
| gong 11DA0..11DA9 Gunjala_Gondi |
| Earlier: CLDR tickets specific to adding new numbering systems. |
| Unicode 10: http://unicode.org/cldr/trac/ticket/10219 |
| Unicode 9: http://unicode.org/cldr/trac/ticket/9692 |
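The search for new decimal digit sets can be sketched as a scan of UnicodeData.txt for characters with gc=Nd and nv=4, from which the zero digit and the ten-digit range follow. This is a minimal sketch under the assumption of the standard semicolon-separated UnicodeData.txt field order (field 2 = General_Category, field 8 = Numeric_Value); the function name is hypothetical, and fields other than code point, gc and numeric value in the sample lines should be treated as illustrative.

```python
def decimal_digit_ranges(lines):
    """Return (zero, nine) code point pairs for every gc=Nd character with nv=4."""
    ranges = []
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 9:
            continue
        if fields[2] == "Nd" and fields[8] == "4":
            # Decimal digits are encoded contiguously, so digit four
            # sits four code points after digit zero.
            zero = int(fields[0], 16) - 4
            ranges.append((zero, zero + 9))
    return ranges


# Sample lines in UnicodeData.txt layout (illustrative).
sample = [
    "0034;DIGIT FOUR;Nd;0;EN;;4;4;4;N;;;;;",
    "2463;CIRCLED DIGIT FOUR;No;0;ON;<circle> 0034;;4;4;N;;;;;",
    "10D34;HANIFI ROHINGYA DIGIT FOUR;Nd;0;AN;;4;4;4;N;;;;;",
    "11DA4;GUNJALA GONDI DIGIT FOUR;Nd;0;L;;4;4;4;N;;;;;",
]
for zero, nine in decimal_digit_ranges(sample):
    print("%04X..%04X" % (zero, nine))
```

Comparing this list against the numbering systems already in CLDR shows which ranges are new, such as the two Unicode 11 ranges listed above.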
| |
| *** merge the Unicode update branches back onto the trunk |
| - do not merge the icudata.jar and testdata.jar, |
| instead rebuild them from merged & tested ICU4C |
| - make sure that changes to Unicode tools are checked in: |
| http://www.unicode.org/utility/trac/log/trunk/unicodetools |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 10.0 update for ICU 60 |
| |
| http://www.unicode.org/versions/Unicode10.0.0/ |
| http://www.unicode.org/versions/beta-10.0.0.html |
| http://blog.unicode.org/2017/03/unicode-100-beta-review.html |
| http://www.unicode.org/review/pri350/ |
| http://www.unicode.org/reports/uax-proposed-updates.html |
| http://www.unicode.org/reports/tr44/tr44-19.html |
| |
| * Command-line environment setup |
| |
| UNICODE_DATA=~/unidata/uni10/20170605 |
| CLDR_SRC=~/svn.cldr/uni10 |
| ICU_ROOT=~/svn.icu/uni10 |
| ICU_SRC=$ICU_ROOT/src |
| ICUDT=icudt60b |
| ICU4C_DATA_IN=$ICU_SRC/icu4c/source/data/in |
| ICU4C_UNIDATA=$ICU_SRC/icu4c/source/data/unidata |
| export LD_LIBRARY_PATH=$ICU_ROOT/dbg/icu4c/lib |
| |
| *** ICU Trac |
| |
| - ticket:12985: Unicode 10 |
| - ticket:13061: undo hacks from emoji 5.0 update |
| - ticket:13062: add Emoji_Component property |
| - ^/branches/markus/uni10 |
| |
| *** CLDR Trac |
| |
| - cldrbug 10055: Unicode 10 |
| - cldrbug 9882: Unicode 10 script metadata |
| - cldrbug 10219: numbering systems for Unicode 10 |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - com.ibm.icu.util.VersionInfo |
| - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ |
| |
| - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h |
| so that the makefiles see the new version number. |
| |
| *** data files & enums & parser code |
| |
| * download files |
| - mkdir -p $UNICODE_DATA |
| - download Unicode 10.0 files into $UNICODE_DATA |
| + subfolders: ucd, uca, idna, security |
| + inside ucd: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip |
| - download emoji 5.0 files into $UNICODE_DATA/emoji |
| |
| * for manual diffs: remove version suffixes from the file names |
| ~$ unidata/desuffixucd.py $UNICODE_DATA |
| (see https://sites.google.com/site/unicodetools/inputdata) |
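For orientation, removing version suffixes amounts to a file name rewrite like the one sketched below. This is not the actual desuffixucd.py: the suffix shape ("-<major>.<minor>.<update>", optionally followed by a draft marker such as "d3") is an assumption, and the function name and sample file names are hypothetical.

```python
import re

# Assumed suffix shape: "-10.0.0" or "-10.0.0d3", directly before the file extension.
_SUFFIX = re.compile(r"-\d+\.\d+\.\d+(?:d\d+)?(?=\.[A-Za-z]+$)")


def strip_version_suffix(name):
    """Return the file name without its version suffix, if it has one."""
    return _SUFFIX.sub("", name)


for name in ["UnicodeData-10.0.0d3.txt", "Blocks-10.0.0.txt", "emoji-data.txt"]:
    print(name, "->", strip_version_suffix(name))
```

File names without a version suffix pass through unchanged, so the rewrite is safe to apply to a whole folder.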
| |
| * process and/or copy files |
| - $ICU_SRC/tools/unicode$ py/preparseucd.py $UNICODE_DATA $ICU_SRC |
| + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| + For debugging, and tweaking how ppucd.txt is written, |
| the tool has an --only_ppucd option: |
| py/preparseucd.py $UNICODE_DATA --only_ppucd path/to/ppucd/outputfile |
| |
| - cp $UNICODE_DATA/security/confusables.txt $ICU4C_UNIDATA |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date |
| |
| * preparseucd.py changes |
| - remove or add new Unicode scripts from/to the |
| only-in-ISO-15924 list according to the error messages: |
| ValueError: remove ['Nshu'] from _scripts_only_in_iso15924 |
| -> adjust _scripts_only_in_iso15924 as indicated |
| - fix other errors |
| Exception: no default values (@missing lines) for some Catalog or Enumerated properties: [u'vo'] |
| -> add vo=Vertical_Orientation to _ignored_properties |
| -> later removed again: the tool now parses the file, even though we do not yet store the data for runtime use |
| |
| * new constants for new property values |
| - preparseucd.py error: |
| ValueError: missing uchar.h enum constants for some property values: |
| [(u'blk', set([u'Zanabazar_Square', u'Nushu', u'CJK_Ext_F', |
| u'Kana_Ext_A', u'Syriac_Sup', u'Masaram_Gondi', u'Soyombo'])), |
| (u'jg', set([u'Malayalam_Bha', u'Malayalam_Llla', u'Malayalam_Nya', u'Malayalam_Lla', |
| u'Malayalam_Nga', u'Malayalam_Ssa', u'Malayalam_Tta', u'Malayalam_Ra', |
| u'Malayalam_Nna', u'Malayalam_Ja', u'Malayalam_Nnna'])), |
| (u'sc', set([u'Soyo', u'Gonm', u'Zanb']))] |
| = PropertyValueAliases.txt new property values (diff old & new .txt files) |
| blk; CJK_Ext_F ; CJK_Unified_Ideographs_Extension_F |
| blk; Kana_Ext_A ; Kana_Extended_A |
| blk; Masaram_Gondi ; Masaram_Gondi |
| blk; Nushu ; Nushu |
| blk; Soyombo ; Soyombo |
| blk; Syriac_Sup ; Syriac_Supplement |
| blk; Zanabazar_Square ; Zanabazar_Square |
| -> add to uchar.h |
| use long property names for enum constants, |
| for the trailing comment get the block start code point: diff old & new Blocks.txt |
| -> add to UCharacter.UnicodeBlock IDs |
| Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) |
| replace public static final int \1_ID = \2; \3 |
| -> add to UCharacter.UnicodeBlock objects |
| Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) |
| replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 |
| |
| jg ; Malayalam_Bha ; Malayalam_Bha |
| jg ; Malayalam_Ja ; Malayalam_Ja |
| jg ; Malayalam_Lla ; Malayalam_Lla |
| jg ; Malayalam_Llla ; Malayalam_Llla |
| jg ; Malayalam_Nga ; Malayalam_Nga |
| jg ; Malayalam_Nna ; Malayalam_Nna |
| jg ; Malayalam_Nnna ; Malayalam_Nnna |
| jg ; Malayalam_Nya ; Malayalam_Nya |
| jg ; Malayalam_Ra ; Malayalam_Ra |
| jg ; Malayalam_Ssa ; Malayalam_Ssa |
| jg ; Malayalam_Tta ; Malayalam_Tta |
| -> uchar.h & UCharacter.JoiningGroup |
| |
| sc ; Gonm ; Masaram_Gondi |
| sc ; Nshu ; Nushu |
| sc ; Soyo ; Soyombo |
| sc ; Zanb ; Zanabazar_Square |
| -> uscript.h & com.ibm.icu.lang.UScript |
| -> Nushu had been added already |
| -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
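The two Eclipse find/replace patterns for UCharacter.UnicodeBlock listed above can be tried outside the IDE. A minimal sketch using the same regular expressions with Python's re.sub; the function names are hypothetical, and the sample uchar.h line (including its numeric value) is illustrative only.

```python
import re


def block_id_line(c_line):
    """uchar.h enum line -> Java "_ID" constant (first Eclipse pattern)."""
    return re.sub(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
                  r"public static final int \1_ID = \2; \3",
                  c_line)


def block_object_line(c_line):
    """uchar.h enum line -> Java UnicodeBlock object (second Eclipse pattern)."""
    return re.sub(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                  r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2',
                  c_line)


# Illustrative input line; the value 277 is a placeholder.
sample = "UBLOCK_NUSHU = 277, /*[1B170]*/"
print(block_id_line(sample))
print(block_object_line(sample))
```

Lines that do not match the pattern are returned unchanged, which makes it easy to run the functions over a whole pasted enum block.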
| |
| * New properties as shown in PropertyValueAliases.txt changes |
| - boolean Emoji_Component from emoji 5 |
| -> uchar.h & UProperty.java |
| - boolean |
| # Regional_Indicator (RI) |
| |
| RI ; N ; No ; F ; False |
| RI ; Y ; Yes ; T ; True |
| -> uchar.h & UProperty.java |
| -> single immutable range, to be hardcoded |
| - boolean |
| # Prepended_Concatenation_Mark (PCM) |
| |
| PCM; N ; No ; F ; False |
| PCM; Y ; Yes ; T ; True |
| -> was new in Unicode 9 |
| -> uchar.h & UProperty.java |
| - enumerated |
| # Vertical_Orientation (vo) |
| |
| vo ; R ; Rotated |
| vo ; Tr ; Transformed_Rotated |
| vo ; Tu ; Transformed_Upright |
| vo ; U ; Upright |
| -> only pre-parsed for now, but not yet stored for runtime use |
| |
| * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata |
| (not strictly necessary for NOT_ENCODED scripts) |
| $ICU_SRC/tools/unicode$ py/parsescriptmetadata.py $ICU_SRC/icu4c/source/common/unicode/uscript.h $CLDR_SRC/common/properties/scriptMetadata.txt |
| |
| * generate normalization data files |
| cd $ICU_ROOT/dbg/icu4c |
| bin/gennorm2 -o $ICU_SRC/icu4c/source/common/norm2_nfc_data.h -s $ICU4C_UNIDATA/norm2 nfc.txt --csource |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfkc.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/nfkc_cf.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
| bin/gennorm2 -o $ICU4C_DATA_IN/uts46.nrm -s $ICU4C_UNIDATA/norm2 nfc.txt uts46.txt |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date |
| |
| * build Unicode tools using CMake+make |
| |
| $ICU_SRC/tools/unicode/c/icudefs.txt: |
| |
| # Location (--prefix) of where ICU was installed. |
| set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) |
| # Location of the ICU4C source tree. |
| set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c) |
| |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| cmake ../../../../src/tools/unicode/c |
| make |
| |
| * generate core properties data files |
| $ICU_ROOT/dbg/tools/unicode/c$ |
| genprops/genprops $ICU_SRC/icu4c |
| genuca/genuca --hanOrder implicit $ICU_SRC/icu4c |
| genuca/genuca --hanOrder radical-stroke $ICU_SRC/icu4c |
| - rebuild ICU (make install) & tools |
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0..10.0: U+2260, U+226E, U+226F |
| - nothing new in this Unicode version, no test file to update |
| |
| * run & fix ICU4C tests |
| - Andy handles RBBI & spoof check test failures |
| |
| * collation: CLDR collation root, UCA DUCET |
| |
| - UCA DUCET goes into Mark's Unicode tools, see |
| https://sites.google.com/site/unicodetools/home#TOC-UCA |
| - CLDR root data files are checked into $CLDR_SRC/common/uca/ |
| cp (Unicode Tools UCA generated)/CollationAuxiliary/* $CLDR_SRC/common/uca/ |
| |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| cp $CLDR_SRC/common/uca/FractionalUCA_SHORT.txt $ICU4C_UNIDATA/FractionalUCA.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| cp $ICU4C_UNIDATA/UCARules.txt /tmp/UCARules-old.txt |
| (note removing the underscore before "Rules") |
| cp $CLDR_SRC/common/uca/UCA_Rules_SHORT.txt $ICU4C_UNIDATA/UCARules.txt |
| - restore TODO diffs in UCARules.txt |
| meld /tmp/UCARules-old.txt $ICU4C_UNIDATA/UCARules.txt |
| - update (ICU4C)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| from the CLDR root files (..._CLDR_..._SHORT.txt) |
| cp $CLDR_SRC/common/uca/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt |
| cp $CLDR_SRC/common/uca/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC/icu4c/source/test/testdata/CollationTest_SHIFTED_SHORT.txt |
| cp $ICU_SRC/icu4c/source/test/testdata/CollationTest_*.txt $ICU_SRC/icu4j/main/tests/collate/src/com/ibm/icu/dev/data |
| - if CLDR common/uca/unihan-index.txt changes, then update |
| CLDR common/collation/root.xml <collation type="private-unihan"> |
| and regenerate (or update in parallel) $ICU_SRC/icu4c/source/data/coll/root.txt |
| |
| - run genuca, see command line above; |
| deal with |
| Error: Unknown script for first-primary sample character U+11D10 on line 28117 of /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/unidata/FractionalUCA.txt: |
| FDD1 11D10; [70 D5 02, 05, 05] # Masaram_Gondi first primary (compressible) |
| (add the character to genuca.cpp sampleCharsToScripts[]) |
| + look up the USCRIPT_ code for the new sample characters |
| (should be obvious from the comment in the error output) |
| + *add* mappings to sampleCharsToScripts[], do not replace them |
| (in case the script sample characters flip-flop) |
| + insert new scripts in DUCET script order, see the top_byte table |
| at the beginning of FractionalUCA.txt |
| - rebuild ICU4C |
| |
| * Unihan collators |
| https://sites.google.com/site/unicodetools/unihan |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollators |
| with VM arguments |
| -ea |
| -DSVN_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools/trunk |
| -DOTHER_WORKSPACE=/usr/local/google/home/mscherer/svn.unitools |
| -DUCD_DIR=/usr/local/google/home/mscherer/svn.unitools/trunk/data |
| -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10 |
| -DUVERSION=10.0.0 |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollatorFiles |
| with the same arguments |
| - check CLDR diffs |
| cd $CLDR_SRC |
| meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml |
| meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml |
| - copy to CLDR |
| cd $CLDR_SRC |
| cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml |
| cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml |
| - run CLDR unit tests, commit to CLDR |
| - generate ICU zh collation data: run CLDR |
| org.unicode.cldr.icu.NewLdml2IcuConverter |
| with program arguments |
| -t collation |
| -s /usr/local/google/home/mscherer/svn.cldr/uni10/common/collation |
| -m /usr/local/google/home/mscherer/svn.cldr/uni10/common/supplemental |
| -d /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/coll |
| -p /usr/local/google/home/mscherer/svn.icu/uni10/src/icu4c/source/data/xml/collation |
| zh |
| and VM arguments |
| -ea |
| -DCLDR_DIR=/usr/local/google/home/mscherer/svn.cldr/uni10 |
| - rebuild ICU4C |
| |
| * run & fix ICU4C tests, now with new CLDR collation root data |
| - run all tests with the collation test data *_SHORT.txt or the full files |
| (the full ones have comments, useful for debugging) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| |
| * update Java data files |
| - refresh just the UCD/UCA-related/derived files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| Unicode .icu files built to ./out/build/icudt60l |
| echo timestamp > uni-core-data |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt60b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b |
| echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt60l.dat ./out/icu4j/icudt60b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt60l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt60b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt60b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt60b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt60b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt60b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/uni10/dbg/icu4c/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files, |
| and then refresh ICU4J |
| cd $ICU_ROOT/dbg/icu4c/data/out/icu4j |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu |
| cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| jar uvf $ICU_SRC/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
| |
| * When refreshing all of ICU4J data from ICU4C |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar $ICU_SRC/icu4j/main/shared/data |
| or |
| - $ICU_ROOT/dbg/icu4c$ make ICU4J_ROOT=$ICU_SRC/icu4j icu4j-data-install |
| |
| * update CollationFCD.java |
| + copy & paste the initializers of lcccIndex[] etc. from |
| ICU4C/source/i18n/collationfcd.cpp to |
| ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd $ICU_SRC/icu4c/source/data/unidata |
| cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd ../../test/testdata |
| cp BidiCharacterTest.txt BidiTest.txt IdnaTest.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cp $UNICODE_DATA/ucd/CompositionExclusions.txt $ICU_SRC/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * run & fix ICU4J tests |
| |
| *** API additions |
| - send notice to icu-design about new born-@stable API (enum constants etc.) |
| |
| *** CLDR numbering systems |
| - look for new sets of decimal digits (gc=ND & nv=4) and submit a CLDR ticket |
| Unicode 10: http://unicode.org/cldr/trac/ticket/10219 |
| Unicode 9: http://unicode.org/cldr/trac/ticket/9692 |
| |
| *** merge the Unicode update branches back onto the trunk |
| - do not merge the icudata.jar and testdata.jar, |
| instead rebuild them from merged & tested ICU4C |
| - make sure that changes to Unicode tools are checked in: |
| http://www.unicode.org/utility/trac/log/trunk/unicodetools |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Emoji 5.0 update for ICU 59 |
| - ICU 59 mostly remains on Unicode 9.0 |
| - except that it updates bidi and segmentation data to Unicode 10 beta |
| |
| First run of tools on combined icu4c/icu4j/tools trunk after svn repository reorg. |
| |
| * Command-line environment setup |
| |
| ICU_ROOT=~/svn.icu/trunk |
| ICU_SRC_DIR=$ICU_ROOT/src |
| ICU4C_SRC_DIR=$ICU_SRC_DIR/icu4c |
| ICUDT=icudt59b |
| export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib |
| SRC_DATA_IN=$ICU4C_SRC_DIR/source/data/in |
| UNIDATA=$ICU4C_SRC_DIR/source/data/unidata |
| |
| *** ICU Trac |
| |
| - ticket:12900: take Emoji 5.0 properties data into ICU 59 once it's released |
| - changes directly on trunk |
| |
| *** data files & enums & parser code |
| |
| * download files |
| |
| - download Unicode 9.0 files into a uni90e50 folder: ucd, idna, security (skip uca) |
| - download emoji 5.0 beta files into the same uni90e50 folder |
| - download Unicode 10.0 beta files: ucd |
| + copy Unicode 10 bidi files to the uni90e50/ucd folder: |
| BidiBrackets.txt |
| BidiCharacterTest.txt |
| BidiMirroring.txt |
| BidiTest.txt |
| extracted/DerivedBidiClass.txt |
| + copy Unicode 10 segmentation files to the uni90e50/ucd folder: |
| LineBreak.txt |
| auxiliary/* |
| |
| * preparseucd.py changes |
| - adjust for combined trunks |
| - write new copyright lines |
| - ignore new Emoji_Component property for now |
| |
| * process and/or copy files |
| - ~/svn.icu/trunk/src/tools/unicode$ py/preparseucd.py ~/unidata/uni90e50/20170322 $ICU_SRC_DIR |
| + This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| |
| - cp ~/unidata/uni90e50/20170322/security/confusables.txt $UNIDATA |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg/icu4c$ echo;echo; make -j7 install > out.txt 2>&1 ; tail -n 30 out.txt ; date |
| |
| * build Unicode tools using CMake+make |
| |
| ~/svn.icu/trunk/src/tools/unicode/c/icudefs.txt: |
| |
| # Location (--prefix) of where ICU was installed. |
| set(ICU_INST_DIR /usr/local/google/home/mscherer/svn.icu/trunk/inst/icu4c) |
| # Location of the ICU4C source tree. |
| set(ICU4C_SRC_DIR /usr/local/google/home/mscherer/svn.icu/trunk/src/icu4c) |
| |
| ~/svn.icu/trunk/dbg/tools/unicode/c$ |
| cmake ../../../../src/tools/unicode/c |
| make |
| |
| * generate core properties data files |
| ~/svn.icu/trunk/dbg/tools/unicode/c$ |
| genprops/genprops $ICU4C_SRC_DIR |
| - rebuild ICU (make install) & tools |
| |
| * run & fix ICU4C tests |
| - Andy handles RBBI & spoof check test failures |
| |
| * update Java data files |
| - refresh just the UCD/UCA-related/derived files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir /tmp/icu4j |
| - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| Unicode .icu files built to ./out/build/icudt59l |
| echo timestamp > uni-core-data |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt59b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b |
| echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt59l.dat ./out/icu4j/icudt59b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt59l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt59b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt59b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt59b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt59b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt59b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory `/usr/local/google/home/mscherer/svn.icu/trunk/dbg/icu4c/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files, |
| and then refresh ICU4J |
| cd ~/svn.icu/trunk/dbg/icu4c/data/out/icu4j |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu |
| cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| jar uvf ~/svn.icu/trunk/src/icu4j/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
| |
| * When refreshing all of ICU4J data from ICU4C |
| - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu/trunk/src/icu4j/main/shared/data |
| or |
| - ~/svn.icu/trunk/dbg/icu4c$ make ICU4J_ROOT=~/svn.icu/trunk/src/icu4j icu4j-data-install |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd $ICU4C_SRC_DIR/source/data/unidata |
| cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd ../../test/testdata |
| cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cp ~/unidata/uni90e50/20170322/ucd/CompositionExclusions.txt ~/svn.icu/trunk/src/icu4j/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * run & fix ICU4J tests |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 9.0 update for ICU 58 |
| |
| * Command-line environment setup |
| |
| ICU_ROOT=~/svn.icu/trunk |
| ICU_SRC_DIR=$ICU_ROOT/src |
| ICUDT=icudt58b |
| export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib |
| SRC_DATA_IN=$ICU_SRC_DIR/source/data/in |
| UNIDATA=$ICU_SRC_DIR/source/data/unidata |
| |
| http://www.unicode.org/review/pri323/ -- beta review |
| http://www.unicode.org/reports/uax-proposed-updates.html |
| http://www.unicode.org/versions/beta-9.0.0.html |
| http://www.unicode.org/versions/Unicode9.0.0/ |
| http://www.unicode.org/reports/tr44/tr44-17.html |
| |
| *** ICU Trac |
| |
| - ticket:12526: integrate Unicode 9 |
| - C++ ^/icu/branches/markus/uni90, ^/icu/branches/markus/uni90b |
| - Java ^/icu4j/branches/markus/uni90, ^/icu4j/branches/markus/uni90b |
| |
| *** CLDR Trac |
| |
| - cldrbug 9414: UCA 9 |
| - ^/branches/markus/uni90 at r11518 from trunk at r11517 |
| |
| - cldrbug 8745: Unicode 9.0 script metadata |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - com.ibm.icu.util.VersionInfo |
| - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ |
| |
| - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h |
| so that the makefiles see the new version number. |
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| |
| - download UCD & IDNA files |
| - make sure that the Unicode data folder passed into preparseucd.py |
| includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) |
| - only for manual diffs: remove version suffixes from the file names |
| ~/unidata/uni70/20140403$ ../../desuffixucd.py . |
| (see https://sites.google.com/site/unicodetools/inputdata) |
| - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip |
| - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni90/20160603 $ICU_SRC_DIR ~/svn.icutools/trunk/src |
| - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| |
| - also: from http://unicode.org/Public/security/9.0.0/ download new confusables.txt |
| and copy to $UNIDATA |
| cp ~/unidata/uni90/20160603/security/confusables.txt $UNIDATA |
| |
| * preparseucd.py changes |
| - remove or add new Unicode scripts from/to the |
| only-in-ISO-15924 list according to the error messages: |
| ValueError: remove ['Tang'] from _scripts_only_in_iso15924 |
| ValueError: sc = Hanb (uchar.h USCRIPT_HAN_WITH_BOPOMOFO) not in the UCD |
| ValueError: sc = Jamo (uchar.h USCRIPT_JAMO) not in the UCD |
| ValueError: sc = Zsye (uchar.h USCRIPT_SYMBOLS_EMOJI) not in the UCD |
| -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
| - DerivedNumericValues.txt new numeric values |
| 0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH |
| 0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH |
| 0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS |
| 0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH |
| 0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS |
| -> change uprops.h, corepropsbuilder.cpp/encodeNumericValue(), |
| uchar.c, UCharacterProperty.java |
| to support a new series of values |
| - adjust preparseucd.py for Tangut algorithmic names |
| in ppucd.txt: |
| algnamesrange;17000..187EC;han;CJK UNIFIED IDEOGRAPH- |
| -> |
| algnamesrange;17000..187EC;han;TANGUT IDEOGRAPH- |
| - avoid block-compressing most String/Miscellaneous property values, |
| triggered by genprops not coping with a multi-code point Case_Folding on |
| block;1C80..1C8F;...;Cased;cf=0442;CWCF;... |
| keep block-compressing empty-string mappings NFKC_CF="" for tags and variation selectors |
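| |
| Sketch only, not part of preparseucd.py or genprops: a small Python check of the five new Malayalam numeric values quoted above. |
| It verifies that the decimal and rational fields agree and rewrites each value as numerator / (20 * 2^k). |
| The function names are hypothetical, and this is an observation about these five lines, not a description of how uprops.h stores them. |
| |
```python
from fractions import Fraction

# The five DerivedNumericValues.txt lines quoted in the log above.
LINES = """\
0D58 ; 0.00625 ; ; 1/160 # No MALAYALAM FRACTION ONE ONE-HUNDRED-AND-SIXTIETH
0D59 ; 0.025 ; ; 1/40 # No MALAYALAM FRACTION ONE FORTIETH
0D5A ; 0.0375 ; ; 3/80 # No MALAYALAM FRACTION THREE EIGHTIETHS
0D5B ; 0.05 ; ; 1/20 # No MALAYALAM FRACTION ONE TWENTIETH
0D5D ; 0.15 ; ; 3/20 # No MALAYALAM FRACTION THREE TWENTIETHS
""".splitlines()


def parse_line(line):
    """Return (code point, decimal value, rational value) for one data line."""
    data = line.split("#", 1)[0]
    fields = [f.strip() for f in data.split(";")]
    return int(fields[0], 16), Fraction(fields[1]), Fraction(fields[3])


def split_denominator(value):
    """Return (numerator, k) with value == numerator / (20 * 2**k), or None."""
    for k in range(8):
        scaled = value * 20 * 2 ** k
        if scaled.denominator == 1:
            return scaled.numerator, k
    return None


def summarize(lines=LINES):
    """Map each code point to its (numerator, k) pair; fail on inconsistent fields."""
    result = {}
    for line in lines:
        cp, decimal, rational = parse_line(line)
        if decimal != rational:
            raise ValueError("U+%04X: %s != %s" % (cp, decimal, rational))
        result[cp] = split_denominator(rational)
    return result
```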
| |
| * PropertyAliases.txt changes |
| - 1 new property PCM=Prepended_Concatenation_Mark |
| Ignore: Only useful for layout engines. |
| Ok to list in ppucd.txt. |
| |
| * PropertyValueAliases.txt new property values |
| blk; Adlam ; Adlam |
| blk; Bhaiksuki ; Bhaiksuki |
| blk; Cyrillic_Ext_C ; Cyrillic_Extended_C |
| blk; Glagolitic_Sup ; Glagolitic_Supplement |
| blk; Ideographic_Symbols ; Ideographic_Symbols_And_Punctuation |
| blk; Marchen ; Marchen |
| blk; Mongolian_Sup ; Mongolian_Supplement |
| blk; Newa ; Newa |
| blk; Osage ; Osage |
| blk; Tangut ; Tangut |
| blk; Tangut_Components ; Tangut_Components |
| -> add to uchar.h |
| use long property names for enum constants |
| -> add to UCharacter.UnicodeBlock IDs |
| Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) |
| replace public static final int \1_ID = \2; \3 |
| -> add to UCharacter.UnicodeBlock objects |
| Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) |
| replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 |
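| |
| Sketch only: the two Eclipse find/replace pairs above, applied with Python's re module instead of the Eclipse dialog. |
| The patterns are copied from the log; the helper names and the sample input line (name and number) are made up for illustration and are not taken from uchar.h. |
| |
```python
import re

# Same find/replace pairs as in the log, in Python regex syntax.
ID_FIND = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")
ID_REPLACE = r"public static final int \1_ID = \2; \3"
OBJ_FIND = re.compile(r"UBLOCK_([^ ]+) = [0-9]+, (/.+)")
OBJ_REPLACE = r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2'


def to_java_id(c_line):
    """Turn a C enum line into the Java int constant declaration."""
    return ID_FIND.sub(ID_REPLACE, c_line)


def to_java_object(c_line):
    """Turn a C enum line into the Java UnicodeBlock object declaration."""
    return OBJ_FIND.sub(OBJ_REPLACE, c_line)


# Hypothetical input line; the block name and value are invented.
SAMPLE = "UBLOCK_EXAMPLE_BLOCK = 999, /*[12340]*/"
```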
| |
| GCB; EB ; E_Base |
| GCB; EBG ; E_Base_GAZ |
| GCB; EM ; E_Modifier |
| GCB; GAZ ; Glue_After_Zwj |
| GCB; ZWJ ; ZWJ |
| -> uchar.h & UCharacter.GraphemeClusterBreak |
| |
| jg ; African_Feh ; African_Feh |
| jg ; African_Noon ; African_Noon |
| jg ; African_Qaf ; African_Qaf |
| -> uchar.h & UCharacter.JoiningGroup |
| |
| lb ; EB ; E_Base |
| lb ; EM ; E_Modifier |
| lb ; ZWJ ; ZWJ |
| -> uchar.h & UCharacter.LineBreak |
| |
| sc ; Adlm ; Adlam |
| sc ; Bhks ; Bhaiksuki |
| sc ; Marc ; Marchen |
| sc ; Newa ; Newa |
| sc ; Osge ; Osage |
| sc ; Tang ; Tangut |
| -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript |
| |
| WB ; EB ; E_Base |
| WB ; EBG ; E_Base_GAZ |
| WB ; EM ; E_Modifier |
| WB ; GAZ ; Glue_After_Zwj |
| WB ; ZWJ ; ZWJ |
| -> uchar.h & UCharacter.WordBreak |
| |
| * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata |
| (not strictly necessary for NOT_ENCODED scripts) |
| ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt |
| |
| * generate normalization data files |
| cd $ICU_ROOT/dbg |
| bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource |
| bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt |
| bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt |
| bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
| bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt |
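| |
| Sketch only: the five gennorm2 invocations above, rebuilt from a small table so that the input-file layering is visible in one place. |
| The options and file names are the ones in the log; the function name and the table are hypothetical, and nothing here runs gennorm2. |
| |
```python
import posixpath

# (output .nrm file, input .txt files), in the order used in the log.
NRM_JOBS = [
    ("nfc.nrm", ["nfc.txt"]),
    ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
    ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
    ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
]


def gennorm2_commands(icu_src_dir):
    """Return the argument lists for the five commands, relative to $ICU_ROOT/dbg."""
    src_data_in = posixpath.join(icu_src_dir, "source", "data", "in")
    norm2_dir = posixpath.join(icu_src_dir, "source", "data", "unidata", "norm2")
    header = posixpath.join(icu_src_dir, "source", "common", "norm2_nfc_data.h")
    # The first command writes the C header instead of a binary .nrm file.
    commands = [["bin/gennorm2", "-o", header, "-s", norm2_dir,
                 "nfc.txt", "--csource"]]
    for output, inputs in NRM_JOBS:
        commands.append(["bin/gennorm2", "-o", posixpath.join(src_data_in, output),
                         "-s", norm2_dir] + inputs)
    return commands
```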
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 30 out.txt |
| |
| * build Unicode tools using CMake+make |
| |
| ~/svn.icutools/trunk/src/unicode/c/icudefs.txt: |
| |
| # Location (--prefix) of where ICU was installed. |
| set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst) |
| # Location of the ICU source tree. |
| set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src) |
| |
| ~/svn.icutools/trunk/dbg/unicode/c$ |
| cmake ../../../src/unicode/c |
| make |
| |
| * generate core properties data files |
| ~/svn.icutools/trunk/dbg/unicode/c$ |
| genprops/genprops $ICU_SRC_DIR |
| genuca/genuca --hanOrder implicit $ICU_SRC_DIR |
| genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR |
| - rebuild ICU (make install) & tools |
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0..9.0: U+2260, U+226E, U+226F |
| - nothing new in 9.0, no test file to update |
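| |
| Sketch only: a Python alternative to the grep above, listing non-ASCII code points or ranges with status disallowed_STD3_valid. |
| It assumes the semicolon-separated, #-commented layout of IdnaMappingTable.txt; the function name and the sample lines are illustrative, not copied from the data file. |
| |
```python
def non_ascii_std3_valid(lines):
    """Return (first, last) code point ranges above U+007F with status disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        if last > 0x7F:
            # Clip a range that starts in ASCII to its non-ASCII part.
            found.append((max(first, 0x80), last))
    return found


# Illustrative lines in the assumed file layout.
SAMPLE = [
    "0000..002C    ; disallowed_STD3_valid  # illustrative ASCII range",
    "0061..007A    ; valid                  # illustrative",
    "2260          ; disallowed_STD3_valid  # NOT EQUAL TO",
    "226E..226F    ; disallowed_STD3_valid  # NOT LESS-THAN..NOT GREATER-THAN",
]
```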
| |
| * run & fix ICU4C tests |
| - Andy handles RBBI & spoof check test failures |
| |
| * collation: CLDR collation root, UCA DUCET |
| |
| - UCA DUCET goes into Mark's Unicode tools, see |
| https://sites.google.com/site/unicodetools/home#TOC-UCA |
| - CLDR root data files are checked into (CLDR UCA branch)/common/uca/ |
| cp (UCA generated)/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/ |
| |
| - cd (CLDR UCA branch)/common/uca/ |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt |
| (note: the target file name has no underscore before "Rules") |
| cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt |
| - restore TODO diffs in UCARules.txt |
| meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt |
| - update (ICU4C)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| from the CLDR root files (..._CLDR_..._SHORT.txt) |
| cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt |
| cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt |
| cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data |
| - if CLDR common/uca/unihan-index.txt changes, then update |
| CLDR common/collation/root.xml <collation type="private-unihan"> |
| and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt |
| |
| - run genuca, see command line above; |
| deal with |
| Error: Unknown script for first-primary sample character U+104B5 on line 32599 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt: |
| FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible) |
| (add the character to genuca.cpp sampleCharsToScripts[]) |
| + look up the USCRIPT_ code for the new sample characters |
| (should be obvious from the comment in the error output) |
| + *add* mappings to sampleCharsToScripts[], do not replace them |
| (in case the script sample characters flip-flop) |
| + insert new scripts in DUCET script order, see the top_byte table |
| at the beginning of FractionalUCA.txt |
| - rebuild ICU4C |
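| |
| Sketch only: pulling the sample character and the script name out of a FractionalUCA.txt first-primary line like the one in the error output above. |
| The regular expression is an assumption based on that single quoted line; the function name is hypothetical, and the USCRIPT_ constant still has to be looked up by hand. |
| |
```python
import re

# Pattern inferred from the one line quoted in the log; other lines may differ.
FIRST_PRIMARY = re.compile(
    r"^FDD1 ([0-9A-F]{4,6});\s*\[[^\]]*\]\s*#\s*(.+?) first primary\b")


def sample_char_and_script(line):
    """Return (code point, script name from the comment), or None if there is no match."""
    m = FIRST_PRIMARY.match(line)
    if m is None:
        return None
    return int(m.group(1), 16), m.group(2)


# The line from the genuca error output in the log.
LOG_LINE = "FDD1 104B5; [75 B8 02, 05, 05] # Osage first primary (compressible)"
```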
| |
| * Unihan collators |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollators |
| with VM arguments |
| -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk |
| -DOTHER_WORKSPACE=/home/mscherer/svn.unitools |
| -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data |
| -DCLDR_DIR=/home/mscherer/svn.cldr/trunk |
| -DUVERSION=9.0.0 |
| -ea |
| - run Unicode Tools |
| org.unicode.draft.GenerateUnihanCollatorFiles |
| with the same arguments |
| - check CLDR diffs |
| cd ~/svn.cldr/trunk |
| meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml |
| meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml |
| - copy to CLDR |
| cd ~/svn.cldr/trunk |
| cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml |
| cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml |
| - commit to CLDR |
| - generate ICU zh collation data: run CLDR |
| org.unicode.cldr.icu.NewLdml2IcuConverter |
| with program arguments |
| -t collation |
| -s /home/mscherer/svn.cldr/trunk/common/collation |
| -m /home/mscherer/svn.cldr/trunk/common/supplemental |
| -d /home/mscherer/svn.icu/trunk/src/source/data/coll |
| -p /home/mscherer/svn.icu/trunk/src/source/data/xml/collation |
| zh |
| and VM arguments |
| -DCLDR_DIR=/home/mscherer/svn.cldr/trunk |
| - rebuild ICU4C |
| |
| * run & fix ICU4C tests, now with new CLDR collation root data |
| - run all tests with the collation test data *_SHORT.txt or the full files |
| (the full ones have comments, useful for debugging) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| |
| * update Java data files |
| - to be safe, refresh only the UCD/UCA-related/derived files |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir /tmp/icu4j |
| - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| Unicode .icu files built to ./out/build/icudt58l |
| echo timestamp > uni-core-data |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt58b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b |
| echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt58l.dat ./out/icu4j/icudt58b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt58l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt58b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt58b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt58b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt58b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt58b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files, |
| and then refresh ICU4J |
| cd ~/svn.icu/trunk/dbg/data/out/icu4j |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu |
| cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| jar uvf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
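| |
| Sketch only: the file selection performed by the cp/rm sequence above, written as a pure Python function over paths relative to com/ibm/icu/impl/data/$ICUDT. |
| The function name is hypothetical, and it only decides which files would end up in /tmp/icu4j; it does not copy anything or call jar. |
| |
```python
import posixpath

KEEP_DIRS = ("coll", "brkitr")     # cp .../coll/* and .../brkitr/*
KEEP_SUFFIXES = (".icu", ".nrm")   # cp .../*.icu and .../*.nrm


def select_unicode_files(relative_paths):
    """Return the paths that the manual cp/rm steps would leave in the staging folder."""
    selected = []
    for path in relative_paths:
        directory, name = posixpath.split(path)
        if directory in KEEP_DIRS:
            keep = True
        elif directory == "":
            # Top level: confusables.cfu, plus *.icu and *.nrm minus cnvalias.icu.
            keep = (name == "confusables.cfu"
                    or (name.endswith(KEEP_SUFFIXES) and name != "cnvalias.icu"))
        else:
            keep = False
        if keep:
            selected.append(path)
    return selected
```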
| |
| * When refreshing all of ICU4J data from ICU4C |
| - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data |
| or |
| - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install |
| |
| * update CollationFCD.java |
| + copy & paste the initializers of lcccIndex[] etc. from |
| ICU4C/source/i18n/collationfcd.cpp to |
| ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd $ICU_SRC_DIR/source/data/unidata |
| cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd ../../test/testdata |
| cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cp ~/unidata/uni90/20160603/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * run & fix ICU4J tests |
| |
| *** LayoutEngine script information |
| |
| * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. |
| This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp |
| in the working directory. |
| |
| (It also generates ScriptRunData.cpp, which is no longer needed.) |
| |
| It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages |
| (a plain text file) |
| which maps ICU versions to the numbers of script/language constants |
| that were added then. |
| (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) |
| |
| The generated files have a current copyright date and "@deprecated" statement. |
| |
| * Review changes, fix Java tool if necessary, and copy to ICU4C |
| cd ~/svn.icu4j/trunk/src |
| meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout |
| cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout |
| cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout |
| |
| *** API additions |
| - send notice to icu-design about new born-@stable API (enum constants etc.) |
| |
| *** merge the Unicode update branches back onto the trunk |
| - do not merge the icudata.jar and testdata.jar, |
| instead rebuild them from merged & tested ICU4C |
| - make sure that changes to Unicode tools & ICU tools are checked in |
| http://www.unicode.org/utility/trac/log/trunk/unicodetools |
| http://bugs.icu-project.org/trac/log/tools/trunk |
| |
| ---------------------------------------------------------------------------- *** |
| |
| New script codes early in ICU 58: http://bugs.icu-project.org/trac/ticket/11764 |
| |
| Adding |
| - new scripts in Unicode 9: Adlm, Bhks, Marc, Newa, Osge |
| - new combination/alias codes: Hanb, Jamo |
| - used in CLDR 29 and in spoof checker |
| - new Z* code: Zsye |
| |
| Add new codes to uscript.h & UScript.java, see Unicode update logs. |
| -> com.ibm.icu.lang.UScript |
| find USCRIPT_([^ ]+) *= ([0-9]+),(.+) |
| replace public static final int \1 = \2; \3 |
| |
| Manually edit ppucd.txt and icutools:unicode/c/genprops/pnames_data.h, |
| add new script codes. |
| "Long" script names only where established in Unicode 9 PropertyValueAliases.txt. |
| |
| Note: If we have to run preparseucd.py again before the Unicode 9 update, |
| then we need to manually keep/restore the new script codes. |
| |
| ICU_ROOT=~/svn.icu/trunk |
| ICU_SRC_DIR=$ICU_ROOT/src |
| ICUDT=icudt57b |
| export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib |
| SRC_DATA_IN=$ICU_SRC_DIR/source/data/in |
| UNIDATA=$ICU_SRC_DIR/source/data/unidata |
| |
| Adjust unicode/c/genprops/*builder.cpp for #ifndef/#ifdef changes in _data.h files, |
| see http://bugs.icu-project.org/trac/ticket/12141 |
| |
| make install, then icutools cmake & make, then |
| ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR |
| |
| Generate Java data as usual, only update pnames.icu & uprops.icu. |
| |
| *** LayoutEngine script information |
| |
| * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. |
| This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp |
| in the working directory. |
| |
| (It also generates ScriptRunData.cpp, which is no longer needed.) |
| |
| It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages |
| (a plain text file) |
| which maps ICU versions to the numbers of script/language constants |
| that were added then. |
| (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) |
| |
| The generated files have a current copyright date and "@deprecated" statement. |
| |
| * Review changes, fix Java tool if necessary, and copy to ICU4C |
| cd ~/svn.icu4j/trunk/src |
| meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout |
| cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout |
| cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Emoji properties added in ICU 57: http://bugs.icu-project.org/trac/ticket/11802 |
| |
| Edit preparseucd.py to add & parse new properties. |
| They share the UCD property namespace but are not listed in PropertyAliases.txt. |
| |
| Add emoji-data.txt to the input files, from http://www.unicode.org/Public/emoji/ |
| Initial data from emoji/2.0/ |
| |
| ICU_ROOT=~/svn.icu/trunk |
| ICU_SRC_DIR=$ICU_ROOT/src |
| ICUDT=icudt56b |
| export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib |
| SRC_DATA_IN=$ICU_SRC_DIR/source/data/in |
| UNIDATA=$ICU_SRC_DIR/source/data/unidata |
| |
| Add binary-property constants to uchar.h enum UProperty & UProperty.java. |
| |
| ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20151217 $ICU_SRC_DIR ~/svn.icutools/trunk/src |
| (Needs to be run after uchar.h additions, so that the new properties can be picked up by genprops.) |
| |
| Data structure: uprops.h/.cpp, corepropsbuilder.cpp, UCharacterProperty.java |
| |
| make install, then icutools cmake & make, then |
| ~/svn.icutools/trunk/dbg/unicode/c$ make && genprops/genprops $ICU_SRC_DIR |
| |
| Generate Java data as usual, only update pnames.icu & uprops.icu. |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 8.0 update for ICU 56 |
| |
| * Command-line environment setup |
| |
| ICU_ROOT=~/svn.icu/trunk |
| ICU_SRC_DIR=$ICU_ROOT/src |
| ICUDT=icudt56b |
| export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib |
| SRC_DATA_IN=$ICU_SRC_DIR/source/data/in |
| UNIDATA=$ICU_SRC_DIR/source/data/unidata |
| |
| http://www.unicode.org/review/pri297/ -- beta review |
| http://www.unicode.org/reports/uax-proposed-updates.html |
| http://unicode.org/versions/beta-8.0.0.html |
| http://www.unicode.org/versions/Unicode8.0.0/ |
| http://www.unicode.org/reports/tr44/tr44-15.html |
| |
| *** ICU Trac |
| |
| - ticket:11574: Unicode 8 |
| - C++ branches/markus/uni80 at r37351 from trunk at r37343 |
| - Java branches/markus/uni80 at r37352 from trunk at r37338 |
| |
| *** CLDR Trac |
| |
| - cldrbug 8311: UCA 8 |
| - branches/markus/uni80 at r11518 from trunk at r11517 |
| |
| - cldrbug 8109: Unicode 8.0 script metadata |
| - cldrbug 8418: Updated segmentation for Unicode 8.0 |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - com.ibm.icu.util.VersionInfo |
| - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ |
| |
| - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h |
| so that the makefiles see the new version number. |
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| |
| - download UCD & IDNA files |
| - make sure that the Unicode data folder passed into preparseucd.py |
| includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) |
| - only for manual diffs: remove version suffixes from the file names |
| ~/unidata/uni70/20140403$ ../../desuffixucd.py . |
| (see https://sites.google.com/site/unicodetools/inputdata) |
| - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip |
| - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni80/20150415 $ICU_SRC_DIR ~/svn.icutools/trunk/src |
| - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| |
| - also: from http://unicode.org/Public/security/8.0.0/ download new |
| confusables.txt & confusablesWholeScript.txt |
| and copy to $UNIDATA |
| ~/unidata$ cp uni80/20150415/security/confusables.txt $UNIDATA |
| ~/unidata$ cp uni80/20150415/security/confusablesWholeScript.txt $UNIDATA |
| |
| * initial preparseucd.py changes |
| - remove new Unicode scripts from the |
| only-in-ISO-15924 list according to the error message: |
| ValueError: remove ['Ahom', 'Hatr', 'Hluw', 'Hung', 'Mult', 'Sgnw'] |
| from _scripts_only_in_iso15924 |
| -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
| - property and file name change: |
| IndicMatraCategory -> IndicPositionalCategory |
| - UnicodeData.txt unusual numeric values (fractions not in lowest terms) |
| 109F6;MEROITIC CURSIVE FRACTION ONE TWELFTH;No;0;R;;;;1/12;N;;;;; |
| 109F7;MEROITIC CURSIVE FRACTION TWO TWELFTHS;No;0;R;;;;2/12;N;;;;; |
| 109F8;MEROITIC CURSIVE FRACTION THREE TWELFTHS;No;0;R;;;;3/12;N;;;;; |
| 109F9;MEROITIC CURSIVE FRACTION FOUR TWELFTHS;No;0;R;;;;4/12;N;;;;; |
| 109FA;MEROITIC CURSIVE FRACTION FIVE TWELFTHS;No;0;R;;;;5/12;N;;;;; |
| 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;; |
| 109FC;MEROITIC CURSIVE FRACTION SEVEN TWELFTHS;No;0;R;;;;7/12;N;;;;; |
| 109FD;MEROITIC CURSIVE FRACTION EIGHT TWELFTHS;No;0;R;;;;8/12;N;;;;; |
| 109FE;MEROITIC CURSIVE FRACTION NINE TWELFTHS;No;0;R;;;;9/12;N;;;;; |
| 109FF;MEROITIC CURSIVE FRACTION TEN TWELFTHS;No;0;R;;;;10/12;N;;;;; |
| -> change preparseucd.py to reduce them to lowest terms (e.g., 2/12 -> 1/6), |
| which is how they are listed in DerivedNumericValues.txt; |
| this keeps storage in the data file simple |
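| |
| Sketch only: reducing a UnicodeData.txt numeric value such as the Meroitic twelfths above to lowest terms with Python's fractions module. |
| This is not the code in preparseucd.py; the function name is hypothetical. |
| |
```python
from fractions import Fraction


def reduce_numeric_value(nv):
    """Return a numeric-value string in lowest terms, e.g. '2/12' becomes '1/6'."""
    value = Fraction(nv)  # Fraction normalizes to lowest terms
    if value.denominator == 1:
        return str(value.numerator)
    return "%d/%d" % (value.numerator, value.denominator)
```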
| |
| * PropertyValueAliases.txt changes |
| - 10 new Block (blk) values: |
| blk; Ahom ; Ahom |
| blk; Anatolian_Hieroglyphs ; Anatolian_Hieroglyphs |
| blk; Cherokee_Sup ; Cherokee_Supplement |
| blk; CJK_Ext_E ; CJK_Unified_Ideographs_Extension_E |
| blk; Early_Dynastic_Cuneiform ; Early_Dynastic_Cuneiform |
| blk; Hatran ; Hatran |
| blk; Multani ; Multani |
| blk; Old_Hungarian ; Old_Hungarian |
| blk; Sup_Symbols_And_Pictographs ; Supplemental_Symbols_And_Pictographs |
| blk; Sutton_SignWriting ; Sutton_SignWriting |
| -> add to uchar.h |
| use long property names for enum constants |
| -> add to UCharacter.UnicodeBlock IDs |
| Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) |
| replace public static final int \1_ID = \2; \3 |
| -> add to UCharacter.UnicodeBlock objects |
| Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) |
| replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 |
| - 6 new Script (sc) values: |
| sc ; Ahom ; Ahom |
| sc ; Hatr ; Hatran |
| sc ; Hluw ; Anatolian_Hieroglyphs |
| sc ; Hung ; Old_Hungarian |
| sc ; Mult ; Multani |
| sc ; Sgnw ; SignWriting |
| -> all of them had been added already to uscript.h & com.ibm.icu.lang.UScript |
| |
| * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata |
| (not strictly necessary for NOT_ENCODED scripts) |
| ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt |
| |
| * generate normalization data files |
| cd $ICU_ROOT/dbg |
| bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource |
| bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt |
| bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt |
| bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
| bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| $ICU_ROOT/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt |
| |
| * build Unicode tools using CMake+make |
| |
| ~/svn.icutools/trunk/src/unicode/c/icudefs.txt: |
| |
| # Location (--prefix) of where ICU was installed. |
| set(ICU_INST_DIR /home/mscherer/svn.icu/trunk/inst) |
| # Location of the ICU source tree. |
| set(ICU_SRC_DIR /home/mscherer/svn.icu/trunk/src) |
| |
| ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c |
| ~/svn.icutools/trunk/dbg/unicode/c$ make |
| |
| * generate core properties data files |
| - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR |
| - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder implicit $ICU_SRC_DIR |
| - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca --hanOrder radical-stroke $ICU_SRC_DIR |
| - rebuild ICU (make install) & tools |
| - run genuca again (see step above) so that it picks up the new nfc.nrm |
| - rebuild ICU (make install) & tools |
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0..8.0: U+2260, U+226E, U+226F |
| - nothing new in 8.0, no test file to update |
| |
| * run & fix ICU4C tests |
| - bad Cherokee case folding due to difference in fallbacks: |
| UCD case folding falls back to no mapping, |
| ICU runtime case folding falls back to lowercasing; |
| fixed casepropsbuilder.cpp to generate scf mappings to self |
| when there is an slc mapping but no scf |
| - Andy handles RBBI & spoof check test failures |
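| |
| Sketch only: the fallback mismatch behind the Cherokee case folding failure, shown with one letter pair in Python. |
| In Unicode 8.0, U+13A0 lowercases to U+AB70 while U+AB70 case-folds to U+13A0. The dictionaries and function names are hypothetical stand-ins, not the data structures in casepropsbuilder.cpp or the ICU runtime. |
| |
```python
# Simple case mappings for one Cherokee pair (Unicode 8.0).
SLC = {0x13A0: 0xAB70}   # simple lowercase
SCF = {0xAB70: 0x13A0}   # simple case folding


def ucd_fold(c, scf):
    """UCD semantics: a code point without an scf mapping folds to itself."""
    return scf.get(c, c)


def runtime_fold(c, scf, slc):
    """Fallback described in the log: without an scf mapping, use the lowercase mapping."""
    if c in scf:
        return scf[c]
    return slc.get(c, c)


def add_self_mappings(scf, slc):
    """Builder-side fix: add scf-to-self wherever slc exists but scf does not."""
    fixed = dict(scf)
    for c in slc:
        fixed.setdefault(c, c)
    return fixed
```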
| |
| * collation: CLDR collation root, UCA DUCET |
| |
| - UCA DUCET goes into Mark's Unicode tools, see |
| https://sites.google.com/site/unicodetools/home#TOC-UCA |
| - CLDR root data files are checked into (CLDR UCA branch)/common/uca/ |
| - cd (CLDR UCA branch)/common/uca/ |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| cp FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| cp $ICU_SRC_DIR/source/data/unidata/UCARules.txt /tmp/UCARules-old.txt |
| (note: the target file name has no underscore before "Rules") |
| cp UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt |
| - restore TODO diffs in UCARules.txt |
| meld /tmp/UCARules-old.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt |
| - update (ICU4C)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| from the CLDR root files (..._CLDR_..._SHORT.txt) |
| cp CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt |
| cp CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt |
| cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data |
| - if CLDR common/uca/unihan-index.txt changes, then update |
| CLDR common/collation/root.xml <collation type="private-unihan"> |
| and regenerate (or update in parallel) $ICU_SRC_DIR/source/data/coll/root.txt |
| - run genuca, see command line above; |
| deal with |
| Error: Unknown script for first-primary sample character U+07d8 on line 23005 of /home/mscherer/svn.icu/trunk/src/source/data/unidata/FractionalUCA.txt |
| (add the character to genuca.cpp sampleCharsToScripts[]) |
| + look up the script for the new sample characters |
| (e.g., in FractionalUCA.txt) |
| + *add* mappings to sampleCharsToScripts[], do not replace them |
| (in case the script sample characters flip-flop) |
| + insert new scripts in DUCET script order, see the top_byte table |
| at the beginning of FractionalUCA.txt |
| - rebuild ICU4C |
| |
| * run & fix ICU4C tests, now with new CLDR collation root data |
| - run all tests with the collation test data *_SHORT.txt or the full files |
| (the full ones have comments, useful for debugging) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| - fixed bug in CollationWeights::getWeightRanges() |
| exposed by new data and CollationTest::TestRootElements |
| |
| * update Java data files |
| - to be safe, refresh only the UCD/UCA-related/derived files |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir /tmp/icu4j |
| - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| Unicode .icu files built to ./out/build/icudt56l |
| echo timestamp > uni-core-data |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt56b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b |
| echo pnames.icu uprops.icu ucase.icu ubidi.icu nfc.nrm > ./out/icu4j/add.txt |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt56l.dat ./out/icu4j/icudt56b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt56l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt56b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt56b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt56b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt56b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt56b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/dbg/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files, |
| and then refresh ICU4J |
| cd ~/svn.icu/trunk/dbg/data/out/icu4j |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu |
| cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
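The file selection in the copy steps above can be sketched in Python. This only illustrates which files end up in the staging folder; it is not a replacement for the shell commands. The helper name stage_unicode_data and its parameters are hypothetical; src and dst stand for the two .../com/ibm/icu/impl/data/$ICUDT folders.

```python
import shutil
from pathlib import Path

# Glob patterns taken from the cp commands above, relative to the
# .../com/ibm/icu/impl/data/$ICUDT folder.
PATTERNS = ["confusables.cfu", "*.icu", "*.nrm", "coll/*", "brkitr/*"]


def stage_unicode_data(src, dst):
    """Copy the Unicode-related data files from src to dst (hypothetical helper).

    Returns the sorted relative paths that were copied.
    """
    src, dst = Path(src), Path(dst)
    copied = []
    for pattern in PATTERNS:
        for path in src.glob(pattern):
            rel = path.relative_to(src)
            # The log copies *.icu and then removes cnvalias.icu;
            # this sketch skips that one file instead.
            if not path.is_file() or rel.as_posix() == "cnvalias.icu":
                continue
            (dst / rel).parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, dst / rel)
            copied.append(rel.as_posix())
    return sorted(copied)
```

The returned list makes it easy to eyeball what would go into the jar before running "jar uf".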
| |
| * When refreshing all of ICU4J data from ICU4C |
| - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data |
| or |
| - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install |
| |
| * update CollationFCD.java |
| + copy & paste the initializers of lcccIndex[] etc. from |
| ICU4C/source/i18n/collationfcd.cpp to |
| ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd $ICU_SRC_DIR/source/data/unidata |
| cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd ../../test/testdata |
| cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cp ~/unidata/uni80/20150415/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * run & fix ICU4J tests |
| |
| *** LayoutEngine script information |
| |
| * ICU 56: Modify ScriptIDModuleWriter.java to not output @stable tags any more, |
| because the layout engine was deprecated in ICU 54. |
| Modify ScriptIDModuleWriter.java and ScriptTagModuleWriter.java |
| to write lines that we used to add manually. |
| |
| * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. |
| This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp |
| in the working directory. |
| |
| (It also generates ScriptRunData.cpp, which is no longer needed.) |
| |
| It also reads and regenerates tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguages |
| (a plain text file) |
| which maps ICU versions to the numbers of script/language constants |
| that were added then. |
| (This mapping is probably obsolete since we do not print "@stable ICU xy" any more.) |
| |
| The generated files have a current copyright date and "@deprecated" statement. |
| |
| * Review changes, fix Java tool if necessary, and copy to ICU4C |
| cd ~/svn.icu4j/trunk/src |
| meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout |
| cp tools/misc/src/com/ibm/icu/dev/tool/layout/*.h $ICU_SRC_DIR/source/layout |
| cp tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptAndLanguageTags.cpp $ICU_SRC_DIR/source/layout |
| |
| *** API additions |
| - send notice to icu-design about new born-@stable API (enum constants etc.) |
| |
| *** merge the Unicode update branches back onto the trunk |
| - do not merge the icudata.jar and testdata.jar, |
| instead rebuild them from merged & tested ICU4C |
| - make sure that changes to Unicode tools & ICU tools are checked in |
| http://www.unicode.org/utility/trac/log/trunk/unicodetools |
| http://bugs.icu-project.org/trac/log/tools/trunk |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 7.0 update for ICU 54 |
| |
| http://www.unicode.org/review/pri271/ -- beta review |
| http://www.unicode.org/reports/uax-proposed-updates.html |
| http://www.unicode.org/versions/beta-7.0.0.html#notable_issues |
| http://www.unicode.org/reports/tr44/tr44-13.html |
| |
| *** ICU Trac |
| |
| - ticket 10821: Unicode 7.0, UCA 7.0 |
| - C++ branches/markus/uni70 at r35584 from trunk at r35580 |
| - Java branches/markus/uni70 at r35587 from trunk at r35545 |
| |
| *** CLDR Trac |
| |
| - ticket 7195: UCA 7.0 CLDR root collation |
| - branches/markus/uni70 at r10062 from trunk at r10061 |
| |
| - ticket 6762: script metadata for Unicode 7.0 new scripts |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - com.ibm.icu.util.VersionInfo |
| - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ |
| |
| - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h |
| so that the makefiles see the new version number. |
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| |
| - download UCD & IDNA files |
| - make sure that the Unicode data folder passed into preparseucd.py |
| includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) |
| - only for manual diffs: remove version suffixes from the file names |
| ~/unidata/uni70/20140403$ ../../desuffixucd.py . |
| (see https://sites.google.com/site/unicodetools/inputdata) |
| - only for manual diffs: extract Unihan.zip to "here" (.../ucd/Unihan/*.txt), delete Unihan.zip |
| - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni70/20140403 $ICU_SRC_DIR ~/svn.icutools/trunk/src |
| - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| - Restore TODO diffs in source/data/unidata/UCARules.txt |
| cd $ICU_SRC_DIR |
| meld ../../trunk/src/source/data/unidata/UCARules.txt source/data/unidata/UCARules.txt |
| - Restore ICU patches for ticket #10176 in source/test/testdata/LineBreakTest.txt |
| |
| - also: from http://unicode.org/Public/security/7.0.0/ download new |
| confusables.txt & confusablesWholeScript.txt |
| and copy to $ICU_ROOT/src/source/data/unidata/ |
| |
| * initial preparseucd.py changes |
| - remove new Unicode scripts from the |
| only-in-ISO-15924 list according to the error message: |
| ValueError: remove ['Hmng', 'Lina', 'Perm', 'Mani', 'Phlp', 'Bass', |
| 'Dupl', 'Elba', 'Gran', 'Mend', 'Narb', 'Nbat', 'Palm', |
| 'Sind', 'Wara', 'Mroo', 'Khoj', 'Tirh', 'Aghb', 'Mahj'] |
| from _scripts_only_in_iso15924 |
| -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
| - NamesList.txt now has a heading with a non-ASCII character |
| + keep ppucd.txt in platform charset, rather than changing tool/test parsers |
| + escape non-ASCII characters in heading comments |
| - preparseucd.py gets the Unicode copyright line from PropertyAliases.txt, which is currently still at 2013 |
| + get the copyright from the first file whose copyright line contains the current year |
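A minimal sketch of the two fixes noted above: picking the first copyright line that mentions the current year, and escaping non-ASCII characters in heading comments. This is not the preparseucd.py code. Both function names, the (file name, line) data layout and the \uXXXX escape style are assumptions made for illustration.

```python
def pick_copyright_line(lines_by_file, year):
    """Return the first copyright line that mentions the given year, or None.

    lines_by_file is a list of (file name, copyright line) pairs in the
    order in which the files are read (hypothetical data layout).
    """
    for _name, line in lines_by_file:
        if "Copyright" in line and str(year) in line:
            return line
    return None


def escape_non_ascii(text):
    """Replace each non-ASCII character with a backslash-u escape (assumed style)."""
    return "".join(c if ord(c) < 0x80 else "\\u%04X" % ord(c) for c in text)
```

With this approach a file whose copyright line still says 2013 is simply skipped, and the output stays in ASCII without changing the parsers that read it.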
| |
| * PropertyValueAliases.txt changes |
| - 32 new Block (blk) values: |
| blk; Bassa_Vah ; Bassa_Vah |
| blk; Caucasian_Albanian ; Caucasian_Albanian |
| blk; Coptic_Epact_Numbers ; Coptic_Epact_Numbers |
| blk; Diacriticals_Ext ; Combining_Diacritical_Marks_Extended |
| blk; Duployan ; Duployan |
| blk; Elbasan ; Elbasan |
| blk; Geometric_Shapes_Ext ; Geometric_Shapes_Extended |
| blk; Grantha ; Grantha |
| blk; Khojki ; Khojki |
| blk; Khudawadi ; Khudawadi |
| blk; Latin_Ext_E ; Latin_Extended_E |
| blk; Linear_A ; Linear_A |
| blk; Mahajani ; Mahajani |
| blk; Manichaean ; Manichaean |
| blk; Mende_Kikakui ; Mende_Kikakui |
| blk; Modi ; Modi |
| blk; Mro ; Mro |
| blk; Myanmar_Ext_B ; Myanmar_Extended_B |
| blk; Nabataean ; Nabataean |
| blk; Old_North_Arabian ; Old_North_Arabian |
| blk; Old_Permic ; Old_Permic |
| blk; Ornamental_Dingbats ; Ornamental_Dingbats |
| blk; Pahawh_Hmong ; Pahawh_Hmong |
| blk; Palmyrene ; Palmyrene |
| blk; Pau_Cin_Hau ; Pau_Cin_Hau |
| blk; Psalter_Pahlavi ; Psalter_Pahlavi |
| blk; Shorthand_Format_Controls ; Shorthand_Format_Controls |
| blk; Siddham ; Siddham |
| blk; Sinhala_Archaic_Numbers ; Sinhala_Archaic_Numbers |
| blk; Sup_Arrows_C ; Supplemental_Arrows_C |
| blk; Tirhuta ; Tirhuta |
| blk; Warang_Citi ; Warang_Citi |
| -> add to uchar.h |
| use long property value names for enum constants |
| -> add to UCharacter.UnicodeBlock IDs |
| Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) |
| replace public static final int \1_ID = \2; \3 |
| -> add to UCharacter.UnicodeBlock objects |
| Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) |
| replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 |
| - 28 new Joining_Group (jg) values: |
| jg ; Manichaean_Aleph ; Manichaean_Aleph |
| jg ; Manichaean_Ayin ; Manichaean_Ayin |
| jg ; Manichaean_Beth ; Manichaean_Beth |
| jg ; Manichaean_Daleth ; Manichaean_Daleth |
| jg ; Manichaean_Dhamedh ; Manichaean_Dhamedh |
| jg ; Manichaean_Five ; Manichaean_Five |
| jg ; Manichaean_Gimel ; Manichaean_Gimel |
| jg ; Manichaean_Heth ; Manichaean_Heth |
| jg ; Manichaean_Hundred ; Manichaean_Hundred |
| jg ; Manichaean_Kaph ; Manichaean_Kaph |
| jg ; Manichaean_Lamedh ; Manichaean_Lamedh |
| jg ; Manichaean_Mem ; Manichaean_Mem |
| jg ; Manichaean_Nun ; Manichaean_Nun |
| jg ; Manichaean_One ; Manichaean_One |
| jg ; Manichaean_Pe ; Manichaean_Pe |
| jg ; Manichaean_Qoph ; Manichaean_Qoph |
| jg ; Manichaean_Resh ; Manichaean_Resh |
| jg ; Manichaean_Sadhe ; Manichaean_Sadhe |
| jg ; Manichaean_Samekh ; Manichaean_Samekh |
| jg ; Manichaean_Taw ; Manichaean_Taw |
| jg ; Manichaean_Ten ; Manichaean_Ten |
| jg ; Manichaean_Teth ; Manichaean_Teth |
| jg ; Manichaean_Thamedh ; Manichaean_Thamedh |
| jg ; Manichaean_Twenty ; Manichaean_Twenty |
| jg ; Manichaean_Waw ; Manichaean_Waw |
| jg ; Manichaean_Yodh ; Manichaean_Yodh |
| jg ; Manichaean_Zayin ; Manichaean_Zayin |
| jg ; Straight_Waw ; Straight_Waw |
| -> uchar.h & UCharacter.JoiningGroup |
| - 23 new Script (sc) values: |
| sc ; Aghb ; Caucasian_Albanian |
| sc ; Bass ; Bassa_Vah |
| sc ; Dupl ; Duployan |
| sc ; Elba ; Elbasan |
| sc ; Gran ; Grantha |
| sc ; Hmng ; Pahawh_Hmong |
| sc ; Khoj ; Khojki |
| sc ; Lina ; Linear_A |
| sc ; Mahj ; Mahajani |
| sc ; Mani ; Manichaean |
| sc ; Mend ; Mende_Kikakui |
| sc ; Modi ; Modi |
| sc ; Mroo ; Mro |
| sc ; Narb ; Old_North_Arabian |
| sc ; Nbat ; Nabataean |
| sc ; Palm ; Palmyrene |
| sc ; Pauc ; Pau_Cin_Hau |
| sc ; Perm ; Old_Permic |
| sc ; Phlp ; Psalter_Pahlavi |
| sc ; Sidd ; Siddham |
| sc ; Sind ; Khudawadi |
| sc ; Tirh ; Tirhuta |
| sc ; Wara ; Warang_Citi |
| -> uscript.h (many were added before) |
| comment "Mende Kikakui" for USCRIPT_MENDE |
| add USCRIPT_KHUDAWADI, make USCRIPT_SINDHI an alias |
| -> com.ibm.icu.lang.UScript |
| find USCRIPT_([^ ]+) *= ([0-9]+),(.+) |
| replace public static final int \1 = \2; \3 |
| - 6 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html |
| (added 2012-11-01) |
| Ahom 338 Ahom |
| Hatr 127 Hatran |
| Mult 323 Multani |
| (added 2013-10-12) |
| Modi 324 Modi |
| Pauc 263 Pau Cin Hau |
| Sidd 302 Siddham |
| -> uscript.h (some overlap with additions from Unicode) |
| -> com.ibm.icu.lang.UScript |
| find USCRIPT_([^ ]+) *= ([0-9]+),(.+) |
| replace public static final int \1 = \2; \3 |
| -> add Ahom, Hatr, Mult to preparseucd.py _scripts_only_in_iso15924 |
| -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
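The Eclipse find/replace rules in the list above (UBLOCK_ to UnicodeBlock IDs and objects, USCRIPT_ to UScript constants) can be tried outside the IDE. The sketch below transcribes the three pattern/replacement pairs into Python's re module, which uses the same \1 back-reference syntax. The convert helper and the rule names are hypothetical.

```python
import re

# (pattern, replacement) pairs transcribed from the find/replace notes above.
BLOCK_ID = (r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)",
            r"public static final int \1_ID = \2; \3")
BLOCK_OBJECT = (r"UBLOCK_([^ ]+) = [0-9]+, (/.+)",
                r'public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2')
SCRIPT = (r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)",
          r"public static final int \1 = \2; \3")


def convert(rule, c_line):
    """Apply one find/replace rule to a single C enum line (hypothetical helper)."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, c_line)
```

For a made-up input line such as "UBLOCK_EXAMPLE_BLOCK = 999, /*[12345]*/", convert(BLOCK_ID, ...) gives "public static final int EXAMPLE_BLOCK_ID = 999; /*[12345]*/". Lines that do not match a pattern pass through unchanged, so the output still needs a manual review.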
| |
| * update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata |
| (not strictly necessary for NOT_ENCODED scripts) |
| ~/svn.icutools/trunk/src/unicode$ py/parsescriptmetadata.py $ICU_SRC_DIR/source/common/unicode/uscript.h ~/svn.cldr/trunk/common/properties/scriptMetadata.txt |
| |
| * generate normalization data files |
| - cd $ICU_ROOT/dbg |
| - export LD_LIBRARY_PATH=$ICU_ROOT/dbg/lib |
| - SRC_DATA_IN=$ICU_SRC_DIR/source/data/in |
| - UNIDATA=$ICU_SRC_DIR/source/data/unidata |
| - bin/gennorm2 -o $ICU_SRC_DIR/source/common/norm2_nfc_data.h -s $UNIDATA/norm2 nfc.txt --csource |
| - bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt |
| - bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt |
| - bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
| - bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt |
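The gennorm2 commands above follow one pattern: every output file starts from nfc.txt and adds its own input files. A small sketch that rebuilds the same argument lists from a table, using only the options that appear above (-o, -s, --csource). The function name and the NRM_INPUTS table are hypothetical, and nothing is executed here.

```python
# Output file -> input files, as listed in the commands above.
NRM_INPUTS = {
    "nfc.nrm": ["nfc.txt"],
    "nfkc.nrm": ["nfc.txt", "nfkc.txt"],
    "nfkc_cf.nrm": ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"],
    "uts46.nrm": ["nfc.txt", "uts46.txt"],
}


def gennorm2_commands(icu_src_dir):
    """Return the gennorm2 argument lists in the order used above."""
    src_data_in = icu_src_dir + "/source/data/in"
    norm2_dir = icu_src_dir + "/source/data/unidata/norm2"
    # First the C source form of the NFC data, then the binary .nrm files.
    commands = [["bin/gennorm2", "-o",
                 icu_src_dir + "/source/common/norm2_nfc_data.h",
                 "-s", norm2_dir, "nfc.txt", "--csource"]]
    for output, inputs in NRM_INPUTS.items():
        commands.append(["bin/gennorm2", "-o", src_data_in + "/" + output,
                         "-s", norm2_dir] + inputs)
    return commands
```

Each list could be passed to subprocess.run from the $ICU_ROOT/dbg folder with LD_LIBRARY_PATH set as above.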
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| ~/svn.icu/uni70/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt |
| |
| * build Unicode tools using CMake+make |
| |
| ~/svn.icutools/trunk/src/unicode/c/icudefs.txt: |
| |
| # Location (--prefix) of where ICU was installed. |
| set(ICU_INST_DIR /home/mscherer/svn.icu/uni70/inst) |
| # Location of the ICU source tree. |
| set(ICU_SRC_DIR /home/mscherer/svn.icu/uni70/src) |
| |
| ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c |
| ~/svn.icutools/trunk/dbg/unicode/c$ make |
| |
| * genprops work |
| - new code point range for Joining_Group values: 10AC0..10AFF Manichaean |
| + add second array of Joining_Group values for at most 10800..10FFF |
| icutools: unicode/c/genprops/bidipropsbuilder.cpp |
| icu: source/common/ubidi_props.h/.c/_data.h |
| icu4j: main/classes/core/src/com/ibm/icu/impl/UBiDiProps.java |
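The idea of a second Joining_Group array for a second code point range can be sketched as follows. This is not the ubidi_props code. The class name, the use of 0 for "no joining group" and any start values passed in are assumptions; only the Manichaean range 10AC0..10AFF comes from the note above.

```python
class JoiningGroups:
    """Joining_Group lookup over two code point ranges, each with its own array.

    Start values and array contents are supplied by the caller;
    0 stands for No_Joining_Group in this sketch.
    """

    def __init__(self, start, values, start2, values2):
        self.start, self.values = start, values
        self.start2, self.values2 = start2, values2

    def get(self, c):
        # First range, then the second (supplementary) range.
        if self.start <= c < self.start + len(self.values):
            return self.values[c - self.start]
        if self.start2 <= c < self.start2 + len(self.values2):
            return self.values2[c - self.start2]
        return 0
```

Two small arrays avoid one large, mostly empty array that would have to span everything between the first range and 10AFF.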
| |
| * generate core properties data files |
| - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops $ICU_SRC_DIR |
| - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca $ICU_SRC_DIR |
| - rebuild ICU (make install) & tools |
| - run genuca again (see step above) so that it picks up the new nfc.nrm |
| - rebuild ICU (make install) & tools |
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0..7.0: U+2260, U+226E, U+226F |
| - nothing new in 7.0, no test file to update |
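The grep step above can also be done with a short script that filters out the ASCII ranges automatically. This is a sketch, assuming the usual "code point or range ; status ; ..." field layout of IdnaMappingTable.txt with "#" comments; the function name is hypothetical.

```python
def non_ascii_std3_valid(lines):
    """Yield (start, end) code point ranges with status disallowed_STD3_valid
    that reach above U+007F."""
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        if end > 0x7F:
            yield (start, end)
```

Calling list(non_ascii_std3_valid(open(path))) on the new file and comparing the result with the code points already listed above shows whether the UTS #46 tests need new characters.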
| |
| * run & fix ICU4C tests |
| |
| * update Java data files |
| - refresh just the UCD-related files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir /tmp/icu4j |
| - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| Unicode .icu files built to ./out/build/icudt53l |
| echo timestamp > uni-core-data |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt53b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b |
| echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt53l.dat ./out/icu4j/icudt53b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt53l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt53b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt53b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt53b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt53b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt53b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory `/home/mscherer/svn.icu/uni70/dbg/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files |
| ICUDT=icudt54b |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| cd ~/svn.icu/uni70/dbg/data/out/icu4j |
| cp com/ibm/icu/impl/data/$ICUDT/confusables.cfu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp com/ibm/icu/impl/data/$ICUDT/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| rm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/cnvalias.icu |
| cp com/ibm/icu/impl/data/$ICUDT/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT |
| cp com/ibm/icu/impl/data/$ICUDT/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| cp com/ibm/icu/impl/data/$ICUDT/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/brkitr |
| - refresh ICU4J |
| ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
| |
| * update CollationFCD.java |
| + copy & paste the initializers of lcccIndex[] etc. from |
| ICU4C/source/i18n/collationfcd.cpp to |
| ICU4J/main/classes/collate/src/com/ibm/icu/impl/coll/CollationFCD.java |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd $ICU_SRC_DIR/source/data/unidata |
| cp confusables.txt confusablesWholeScript.txt NormalizationCorrections.txt NormalizationTest.txt SpecialCasing.txt UnicodeData.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cd ../../test/testdata |
| cp BidiCharacterTest.txt BidiTest.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| cp ~/unidata/uni70/20140409/ucd/CompositionExclusions.txt ~/svn.icu4j/trunk/src/main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * UCA |
| |
| - download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/ |
| - run desuffixucd.py (see https://sites.google.com/site/unicodetools/inputdata) |
| - update the input files for Mark's UCA tools, in ~/svn.unitools/trunk/data/uca/7.0.0/ |
| - run Mark's UCA Main: https://sites.google.com/site/unicodetools/home#TOC-UCA |
| - output files are in ~/svn.unitools/Generated/uca/7.0.0/ |
| - review data; compare files, use blankweights.sed or similar |
| ~/svn.unitools$ sed -r -f blankweights.sed Generated/uca/7.0.0/CollationAuxiliary/FractionalUCA.txt > frac-7.0.txt |
| - cd ~/svn.unitools/Generated/uca/7.0.0/ |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| cp CollationAuxiliary/FractionalUCA_SHORT.txt $ICU_SRC_DIR/source/data/unidata/FractionalUCA.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| (note: this removes the underscore before "Rules") |
| cp CollationAuxiliary/UCA_Rules_SHORT.txt $ICU_SRC_DIR/source/data/unidata/UCARules.txt |
| - update (ICU4C)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) |
| cp CollationAuxiliary/CollationTest_CLDR_NON_IGNORABLE_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt |
| cp CollationAuxiliary/CollationTest_CLDR_SHIFTED_SHORT.txt $ICU_SRC_DIR/source/test/testdata/CollationTest_SHIFTED_SHORT.txt |
| cp $ICU_SRC_DIR/source/test/testdata/CollationTest_*.txt ~/svn.icu4j/trunk/src/main/tests/collate/src/com/ibm/icu/dev/data |
| - run genuca, see command line above |
| - rebuild ICU4C |
| - refresh ICU4J collation data: |
| (subset of instructions above for properties data refresh, except copies all coll/*) |
| ICUDT=icudt54b |
| ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| ~/svn.icu/uni70/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| ~/svn.icu/uni70/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/$ICUDT/coll/* /tmp/icu4j/com/ibm/icu/impl/data/$ICUDT/coll |
| ~/svn.icu/uni70/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/$ICUDT |
| - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| - copy all output from Mark's UCA tool to unicode.org for review & staging by Ken & editors |
| - copy most of ~/svn.unitools/Generated/uca/7.0.0/CollationAuxiliary/* to CLDR branch |
| ~/svn.unitools$ cp Generated/uca/7.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/ |
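The renames in the UCA cp commands above are easy to get wrong (dropped "_SHORT", dropped "_CLDR", removed underscore). A small sketch that keeps them in one table; the UCA_COPIES name and the helper are hypothetical, while the file names are the ones used above.

```python
# Tool output name (under CollationAuxiliary/) -> path below the ICU4C source root.
UCA_COPIES = {
    "FractionalUCA_SHORT.txt": "source/data/unidata/FractionalUCA.txt",
    "UCA_Rules_SHORT.txt": "source/data/unidata/UCARules.txt",
    "CollationTest_CLDR_NON_IGNORABLE_SHORT.txt":
        "source/test/testdata/CollationTest_NON_IGNORABLE_SHORT.txt",
    "CollationTest_CLDR_SHIFTED_SHORT.txt":
        "source/test/testdata/CollationTest_SHIFTED_SHORT.txt",
}


def uca_copy_plan(generated_dir, icu_src_dir):
    """Return (source, destination) path pairs for the cp commands above."""
    return [(generated_dir + "/CollationAuxiliary/" + name,
             icu_src_dir + "/" + dest)
            for name, dest in UCA_COPIES.items()]
```

The pairs can be printed for review or fed to shutil.copy one by one.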
| |
| * When refreshing all of ICU4J data from ICU4C |
| - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data |
| or |
| - ~/svn.icu/uni70/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install |
| |
| * run & fix ICU4J tests |
| |
| *** LayoutEngine script information |
| |
| (For details see the Unicode 5.2 change log below.) |
| |
| * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. |
| This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp |
| in the working directory. |
| (It also generates ScriptRunData.cpp, which is no longer needed.) |
| |
| The generated files have a current copyright date and "@stable" statement. |
| ICU 54: Fixed tools/misc/src/com/ibm/icu/dev/tool/layout/ScriptIDModuleWriter.java |
| for "born stable" Unicode API constants, and to stop parsing ICU version numbers |
| which may not contain dots any more. |
| |
| - diff current <icu>/source/layout files vs. generated ones |
| ~/svn.icu4j/trunk/src$ meld $ICU_SRC_DIR/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout |
| review and manually merge desired changes; |
| fix gratuitous changes, incorrect @draft/@stable and missing aliases; |
| Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. |
| - if you just copy the above files, then |
| fix mixed line endings, review the diffs as above and restore changes to API tags etc.; |
| manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h |
| |
| *** API additions |
| - send notice to icu-design about new born-@stable API (enum constants etc.) |
| |
| *** merge the Unicode update branches back onto the trunk |
| - do not merge the icudata.jar and testdata.jar, |
| instead rebuild them from merged & tested ICU4C |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 6.3 update |
| |
| http://www.unicode.org/review/pri249/ -- beta review |
| http://www.unicode.org/reports/uax-proposed-updates.html |
| http://www.unicode.org/versions/beta-6.3.0.html#notable_issues |
| http://www.unicode.org/reports/tr44/tr44-11.html |
| |
| *** ICU Trac |
| |
| - ticket 10128: update ICU to Unicode 6.3 beta |
| - ticket 10168: update ICU to Unicode 6.3 final |
| - C++ branches/markus/uni63 at r33552 from trunk at r33551 |
| - Java branches/markus/uni63 at r33550 from trunk at r33553 |
| |
| - ticket 10142: implement Unicode 6.3 bidi algorithm additions |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| (configure.in & configure have been modified to extract the version from uchar.h) |
| - com.ibm.icu.util.VersionInfo |
| - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ |
| |
| - Run ICU4C "configure" _after_ updating the Unicode version number in uchar.h |
| so that the makefiles see the new version number. |
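A quick way to check which Unicode version the build will see is to read it out of uchar.h directly. This is a sketch, assuming the header defines the version as a quoted string macro named U_UNICODE_VERSION; it does not reproduce how configure extracts the value, and the function name is hypothetical.

```python
import re


def unicode_version_from_uchar_h(header_text):
    """Return the string from '#define U_UNICODE_VERSION "x.y"', or None."""
    match = re.search(r'^\s*#\s*define\s+U_UNICODE_VERSION\s+"([^"]+)"',
                      header_text, re.MULTILINE)
    return match.group(1) if match else None
```

If the result still shows the old version, uchar.h has not been updated yet and running "configure" would bake the old number into the makefiles.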
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| |
| - download UCD, UCA & IDNA files |
| - make sure that the Unicode data folder passed into preparseucd.py |
| includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) |
| - modify preparseucd.py: |
| parse new file BidiBrackets.txt |
| with new properties bpb=Bidi_Paired_Bracket and bpt=Bidi_Paired_Bracket_Type |
| - ~/svn.icutools/trunk/src/unicode$ py/preparseucd.py ~/unidata/uni63/20130425 ~/svn.icu/uni63/src ~/svn.icutools/trunk/src |
| - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| - Check test file diffs for previously commented-out, known-failing data lines; |
| probably need to keep those commented out. |
| |
| * PropertyAliases.txt changes |
| - 1 new Enumerated Property |
| bpt ; Bidi_Paired_Bracket_Type |
| -> uchar.h & UProperty.java & UCharacter.BidiPairedBracketType |
| -> ubidi_props.h & .c & UBiDiProps.java |
| -> remember to write the max value at UBIDI_MAX_VALUES_INDEX |
| -> uprops.cpp |
| -> change ubidi.icu format version from 2.0 to 2.1 |
| - 1 new Miscellaneous Property |
| bpb ; Bidi_Paired_Bracket |
| -> uchar.h & UProperty.java |
| -> ppucd.h & .cpp |
| |
| * PropertyValueAliases.txt changes |
| - 3 Bidi_Paired_Bracket_Type (bpt) values: |
| bpt; c ; Close |
| bpt; n ; None |
| bpt; o ; Open |
| -> uchar.h & UCharacter.BidiPairedBracketType |
| -> ubidi_props.h & .c & UBiDiProps.java |
| -> change ubidi.icu format version from 2.0 to 2.1 |
| - 4 new Bidi_Class (bc) values: |
| bc ; FSI ; First_Strong_Isolate |
| bc ; LRI ; Left_To_Right_Isolate |
| bc ; RLI ; Right_To_Left_Isolate |
| bc ; PDI ; Pop_Directional_Isolate |
| -> uchar.h & UCharacterEnums.ECharacterDirection |
| -> until the bidi code gets updated, |
| Roozbeh suggests mapping the new bc values to ON (Other_Neutral) |
| - 3 new Word_Break (WB) values: |
| WB ; HL ; Hebrew_Letter |
| WB ; SQ ; Single_Quote |
| WB ; DQ ; Double_Quote |
| -> uchar.h & UCharacter.WordBreak |
| -> first time Word_Break numeric constants exceed 4 bits (now 17 values) |
| - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html |
| (added 2012-10-16) |
| Aghb 239 Caucasian Albanian |
| Mahj 314 Mahajani |
| -> uscript.h |
| -> com.ibm.icu.lang.UScript |
| find USCRIPT_([^ ]+) *= ([0-9]+),(.+) |
| replace public static final int \1 = \2;\3 |
| -> preparseucd.py _scripts_only_in_iso15924 |
| -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
| -> update Script metadata: SCRIPT_PROPS[] in uscript_props.cpp & UScript.ScriptMetadata |
| (not strictly necessary for NOT_ENCODED scripts) |
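The Word_Break note in the list above (17 values no longer fit into 4 bits) is simple arithmetic: n values numbered from 0 need enough bits to hold n-1. A one-function sketch; the function name is hypothetical.

```python
def bits_needed(value_count):
    """Number of bits needed to store the values 0..value_count-1."""
    return max(1, (value_count - 1).bit_length())
```

bits_needed(14) is 4 (the count before the 3 new values) and bits_needed(17) is 5, so any data structure that packs Word_Break into a 4-bit field has to grow.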
| |
| * generate normalization data files |
| - ~/svn.icu/uni63/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni63/dbg/lib |
| - ~/svn.icu/uni63/dbg$ SRC_DATA_IN=~/svn.icu/uni63/src/source/data/in |
| - ~/svn.icu/uni63/dbg$ UNIDATA=~/svn.icu/uni63/src/source/data/unidata |
| - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt |
| - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt |
| - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
| - ~/svn.icu/uni63/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| |
| ~/svn.icu/uni63/dbg$ echo;echo;make -j5 install > out.txt 2>&1 ; tail -n 20 out.txt |
| |
| * build Unicode tools using CMake+make |
| |
| ~/svn.icutools/trunk/src/unicode/c/icudefs.txt: |
| |
| # Location (--prefix) of where ICU was installed. |
| set(ICU_INST_DIR /home/mscherer/svn.icu/uni63/inst) |
| # Location of the ICU source tree. |
| set(ICU_SRC_DIR /home/mscherer/svn.icu/uni63/src) |
| |
| ~/svn.icutools/trunk/dbg/unicode/c$ cmake ../../../src/unicode/c |
| ~/svn.icutools/trunk/dbg/unicode/c$ make |
| |
| * generate core properties data files |
| - ~/svn.icutools/trunk/dbg/unicode/c$ genprops/genprops ~/svn.icu/uni63/src |
| - ~/svn.icutools/trunk/dbg/unicode/c$ genuca/genuca -i ~/svn.icu/uni63/dbg/data/out/build/icudt52l ~/svn.icu/uni63/src |
| - rebuild ICU (make install) & tools |
| - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm |
| - rebuild ICU (make install) & tools |
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0..6.3: U+2260, U+226E, U+226F |
| - nothing new in 6.3, no test file to update |
| |
| * update Java data files |
| - refresh just the UCD-related files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir /tmp/icu4j |
| - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| Unicode .icu files built to ./out/build/icudt52l |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt52b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b |
| echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt52l.dat ./out/icu4j/icudt52b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt52l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt52b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt52b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt52b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt52b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt52b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory `/home/mscherer/svn.icu/uni63/dbg/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr |
| ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b |
| ~/svn.icu/uni63/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/cnvalias.icu |
| ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt52b |
| ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll |
| ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/brkitr |
| - refresh ICU4J |
| ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * UCA -- mostly skipped for ICU 52 / Unicode 6.3, except update coll/* files |
| |
| - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/ |
| - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| (note: this removes the underscore before "Rules") |
| - update (ICU4C)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) |
| - check test file diffs for previously commented-out, known-failing data lines; |
| probably need to keep those commented out |
| - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani |
| - run genuca, see command line above |
| - rebuild ICU4C |
| - refresh ICU4J collation data: |
| (subset of instructions above for properties data refresh, except copies all coll/*) |
| ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| ~/svn.icu/uni63/dbg$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll |
| ~/svn.icu/uni63/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt52b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt52b/coll |
| ~/svn.icu/uni63/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt52b |
| - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| |
| * test ICU, fix test code where necessary |
| |
| * When refreshing all of ICU4J data from ICU4C |
| - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data |
| or |
| - ~/svn.icu/uni63/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install |
| |
| *** LayoutEngine script information |
| - skipped for Unicode 6.3: no new scripts |
| |
| *** merge the Unicode update branches back onto the trunk |
| - do not merge the icudata.jar and testdata.jar, |
| instead rebuild them from merged & tested ICU4C |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 6.2 update |
| |
| http://www.unicode.org/review/pri230/ |
| http://www.unicode.org/versions/beta-6.2.0.html |
| http://www.unicode.org/reports/tr44/tr44-9.html#Unicode_6.2.0 |
| http://www.unicode.org/review/pri227/ Changes to Script Extensions Property Values |
| http://www.unicode.org/review/pri228/ Changing some common characters from Punctuation to Symbol |
| http://www.unicode.org/review/pri229/ Linebreaking Changes for Pictographic Symbols |
| http://www.unicode.org/reports/tr46/tr46-8.html IDNA |
| http://unicode.org/Public/idna/6.2.0/ |
| |
| *** ICU Trac |
| |
| - ticket 9515: Unicode 6.2: final ICU update |
| |
| - ticket 9514: UCA 6.2: fix UCARules.txt |
| |
| - ticket 9437: update ICU to Unicode 6.2 |
| - C++ branches/markus/uni62 at r32050 from trunk at r32041 |
| - Java branches/markus/uni62 at r32068 from trunk at r32066 |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| (configure.in & configure: have been modified to extract the version from uchar.h) |
| - com.ibm.icu.util.VersionInfo |
| - com.ibm.icu.dev.test.lang.UCharacterTest.VERSION_ |
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| |
| - download UCD, UCA & IDNA files |
| - make sure that the Unicode data folder passed into preparseucd.py |
| includes a copy of the latest IdnaMappingTable.txt (can be in some subfolder) |
| - modify preparseucd.py: NamesList.txt is now in UTF-8 |
| - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni62/20120816 ~/svn.icu/uni62/src ~/svn.icu/tools/trunk/src |
| - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| - Check test file diffs for previously commented-out, known-failing data lines; |
| probably need to keep those commented out. |
| |
| * PropertyValueAliases.txt changes |
| - 1 new Line_Break (lb) value: |
| lb ; RI ; Regional_Indicator |
| -> uchar.h & UCharacter.LineBreak |
| - 1 new Word_Break (WB) value: |
| WB ; RI ; Regional_Indicator |
| -> uchar.h & UCharacter.WordBreak |
| - 1 new Grapheme_Cluster_Break (GCB) value: |
| GCB; RI ; Regional_Indicator |
| -> uchar.h & UCharacter.GraphemeClusterBreak |
| |
| * 3 new numeric values |
| The new value -1 was really supposed to be NaN, but that would have required
| new UnicodeData.txt syntax. It can already be represented as a "fraction" of -1/1,
| but encodeNumericValue() in corepropsbuilder.cpp had to be fixed.
| cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1 |
| cp;12457;na=CUNEIFORM NUMERIC SIGN NIGIDAESH;nv=-1 |
| The two new values 216000 and 432000 require an addition to the encoding of numeric values. |
| cp;12432;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH;nv=216000 |
| cp;12433;na=CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN;nv=432000 |
| -> uprops.h, uchar.c & UCharacterProperty.java |
| -> cucdtst.c & UCharacterTest.java |
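The numeric-value notes above can be illustrated with a small sketch. It parses the `cp;...;nv=...` lines quoted above, shows that -1 is an ordinary fraction -1/1, and shows that 216000 and 432000 factor as small mantissas times a power of 60. The helper names are hypothetical, and the mantissa/exponent split is only an illustration of why such values fit a compact form; it is not the bit layout used in uprops.h.

```python
from fractions import Fraction


def parse_ppucd_line(line):
    """Split a 'cp;<hex>;key=value;...' line into (code point, properties dict)."""
    fields = line.strip().split(";")
    if fields[0] != "cp":
        raise ValueError("not a code point line: " + line)
    code_point = int(fields[1], 16)
    props = dict(field.split("=", 1) for field in fields[2:])
    return code_point, props


def base60_parts(value):
    """Return (mantissa, exponent) such that value == mantissa * 60**exponent.

    Illustration only (an assumption of this sketch, not ICU's encoding).
    """
    exponent = 0
    while value != 0 and value % 60 == 0:
        value //= 60
        exponent += 1
    return value, exponent


cp, props = parse_ppucd_line("cp;12456;na=CUNEIFORM NUMERIC SIGN NIGIDAMIN;nv=-1")
nv = Fraction(props["nv"])  # -1 is simply the fraction -1/1
```

With this sketch, `base60_parts(216000)` and `base60_parts(432000)` give a mantissa of 1 and 2 respectively, each with exponent 3.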
| |
| * generate normalization data files |
| - ~/svn.icu/uni62/dbg$ export LD_LIBRARY_PATH=~/svn.icu/uni62/dbg/lib |
| - ~/svn.icu/uni62/dbg$ SRC_DATA_IN=~/svn.icu/uni62/src/source/data/in |
| - ~/svn.icu/uni62/dbg$ UNIDATA=~/svn.icu/uni62/src/source/data/unidata |
| - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt |
| - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt |
| - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
| - ~/svn.icu/uni62/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt |
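The four gennorm2 invocations above follow one pattern: each output .nrm file is built from nfc.txt plus zero or more additional input files. A minimal sketch that only assembles the argument lists (it does not run anything) might look like this. The function name is hypothetical; the `-o` and `-s` options and the file names are taken from the commands above.

```python
import posixpath


def gennorm2_commands(src_data_in, unidata):
    """Build the argv lists for the four gennorm2 runs listed in the log."""
    norm2_dir = posixpath.join(unidata, "norm2")
    outputs = [
        ("nfc.nrm", ["nfc.txt"]),
        ("nfkc.nrm", ["nfc.txt", "nfkc.txt"]),
        ("nfkc_cf.nrm", ["nfc.txt", "nfkc.txt", "nfkc_cf.txt"]),
        ("uts46.nrm", ["nfc.txt", "uts46.txt"]),
    ]
    return [
        ["bin/gennorm2", "-o", posixpath.join(src_data_in, name), "-s", norm2_dir] + inputs
        for name, inputs in outputs
    ]
```

Each list could then be passed to `subprocess.run(cmd, check=True, cwd=build_dir)` from the build folder, with LD_LIBRARY_PATH set as in the first step above.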
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| * build Unicode tools using CMake+make |
| |
| * generate core properties data files |
| - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/uni62/src |
| - in initial bootstrapping, change the UCA version |
| in source/data/unidata/FractionalUCA.txt to match the new Unicode version |
| - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/uni62/dbg/data/out/build/icudt50l ~/svn.icu/uni62/src |
| - rebuild ICU (make install) & tools |
| + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
| check if the UCA version in FractionalUCA.txt matches the new Unicode version |
| (see step above) |
| - run genuca again (see step above) so that it picks up the new case mappings and nfc.nrm |
| - rebuild ICU (make install) & tools |
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0..6.2: U+2260, U+226E, U+226F |
| - nothing new in 6.2, no test file to update |
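The grep step above can be sketched as a small filter. It assumes that IdnaMappingTable.txt lines are semicolon-separated (code point or range, then status) with `#` comments, which should be checked against the actual file. The function name is hypothetical.

```python
def non_ascii_std3_valid(lines):
    """Return the code point fields of non-ASCII 'disallowed_STD3_valid' entries."""
    hits = []
    for line in lines:
        data = line.split("#", 1)[0]  # drop the trailing comment
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != "disallowed_STD3_valid":
            continue
        first = int(fields[0].split("..")[0], 16)  # start of a range, or the single code point
        if first > 0x7F:
            hits.append(fields[0])
    return hits
```

Any entry reported here that is not already in the list above (U+2260, U+226E, U+226F) would be a candidate for new test cases.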
| |
| * update Java data files |
| - refresh just the UCD-related files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir /tmp/icu4j |
| - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| Unicode .icu files built to ./out/build/icudt50l |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt50b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b |
| echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt50l.dat ./out/icu4j/icudt50b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt50l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt50b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt50b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt50b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt50b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt50b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory `/home/mscherer/svn.icu/uni62/dbg/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr |
| ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b |
| ~/svn.icu/uni62/dbg/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/cnvalias.icu |
| ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt50b |
| ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll |
| ~/svn.icu/uni62/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/brkitr |
| - refresh ICU4J |
| ~/svn.icu/uni62/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b |
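The copy steps above (all top-level *.icu except cnvalias.icu, all *.nrm, coll/*.icu, and everything in brkitr) can be sketched as one helper. This is a hypothetical convenience, not part of the ICU build. It skips cnvalias.icu rather than copying and then removing it, which leaves the same result. The final `jar uf` step is not covered.

```python
import glob
import os
import shutil


def copy_unicode_data(src_root, dst_root, data_dir="icudt50b"):
    """Copy the UCD-related big-endian files into a separate tree; return relative names."""
    rel = os.path.join("com", "ibm", "icu", "impl", "data", data_dir)
    src = os.path.join(src_root, rel)
    dst = os.path.join(dst_root, rel)
    copied = []
    # (subfolder, glob pattern) pairs mirroring the cp commands in the log
    for subdir, pattern in [("", "*.icu"), ("", "*.nrm"), ("coll", "*.icu"), ("brkitr", "*")]:
        target = os.path.join(dst, subdir)
        os.makedirs(target, exist_ok=True)
        for path in sorted(glob.glob(os.path.join(src, subdir, pattern))):
            name = os.path.basename(path)
            if subdir == "" and name == "cnvalias.icu":
                continue  # the log removes this file after copying
            if os.path.isfile(path):
                shutil.copy(path, target)
                copied.append(os.path.join(subdir, name))
    return copied
```

For the steps above, `src_root` would be the data/out/icu4j folder of the ICU4C build and `dst_root` would be /tmp/icu4j.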
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * UCA |
| |
| - get output from Mark's tools; look in http://www.unicode.org/Public/UCA/<beta version>/ |
| - CLDR root files for ICU are in CollationAuxiliary.zip; unpack that |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| (note removing the underscore before "Rules") |
| - update (ICU4C)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) |
| - check test file diffs for previously commented-out, known-failing data lines; |
| probably need to keep those commented out |
| - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani |
| - run genuca, see command line above |
| - rebuild ICU4C |
| - refresh ICU4J collation data: |
| (subset of instructions above for properties data refresh, except copies all coll/*) |
| ~/svn.icu/uni62/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| ~/svn.icu/uni62/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll |
| ~/svn.icu/uni62/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt50b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt50b/coll |
| ~/svn.icu/uni62/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt50b |
| - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| |
| * test ICU, fix test code where necessary |
| |
| * When refreshing all of ICU4J data from ICU4C |
| - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data |
| or |
| - ~/svn.icu/uni62/dbg$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install |
| |
| *** LayoutEngine script information |
| - skipped for Unicode 6.2: no new scripts |
| |
| *** merge the Unicode update branches back onto the trunk |
| - do not merge the icudata.jar and testdata.jar, |
| instead rebuild them from merged & tested ICU4C |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Future Unicode update |
| |
| Tools simplified since the Unicode 6.1 update. See |
| - http://site.icu-project.org/design/props/ppucd |
| - http://bugs.icu-project.org/trac/wiki/Markus/ReviewTicket8972 |
| |
| * Unicode version numbers |
| - icutools/unicode/makedefs.sh was deleted, so one fewer place for version & path updates |
| |
| * file preparation |
| - ucdcopy.py, idna2nrm.py and genpname/preparse.pl replaced by preparseucd.py: |
| - ~/svn.icu/tools/trunk/src/unicode$ py/preparseucd.py ~/uni61/20120118 ~/svn.icu/trunk/src ~/svn.icu/tools/trunk/src |
| - This writes files (especially ppucd.txt) to the ICU4C unidata and testdata subfolders. |
| - Check test file diffs for previously commented-out, known-failing data lines; |
| probably need to keep those commented out. |
| |
| * PropertyValueAliases.txt changes |
| - Script codes that are in ISO 15924 but not in Unicode are now listed in |
| preparseucd.py, in the _scripts_only_in_iso15924 variable. |
| If there are new ISO codes, then add them. |
| If Unicode adds some of them, then remove them from the .py variable. |
| |
| * UnicodeData.txt changes |
| - No more manual changes for CJK ranges for algorithmic names; |
| those are now written to ppucd.txt and genprops reads them from there. |
| |
| * generate core properties data files (makeprops.sh was deleted) |
| - ~/svn.icu/tools/trunk/dbg/unicode$ c/genprops/genprops ~/svn.icu/trunk/src |
| |
| * no more manual updates of source/data/unidata/norm2/nfkc_cf.txt |
| - it is now generated by preparseucd.py |
| |
| * no more separate idna2nrm.py run and manual copying to generate source/data/unidata/norm2/uts46.txt |
| - it is now generated by preparseucd.py |
| - make sure that the Unicode data folder passed into preparseucd.py |
| includes a copy of http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt |
| (can be in some subfolder) |
| |
| * generate normalization data files |
| - ~/svn.icu/trunk/dbg$ export LD_LIBRARY_PATH=~/svn.icu/trunk/dbg/lib |
| - ~/svn.icu/trunk/dbg$ SRC_DATA_IN=~/svn.icu/trunk/src/source/data/in |
| - ~/svn.icu/trunk/dbg$ UNIDATA=~/svn.icu/trunk/src/source/data/unidata |
| - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfc.nrm -s $UNIDATA/norm2 nfc.txt |
| - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt |
| - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/nfkc_cf.nrm -s $UNIDATA/norm2 nfc.txt nfkc.txt nfkc_cf.txt |
| - ~/svn.icu/trunk/dbg$ bin/gennorm2 -o $SRC_DATA_IN/uts46.nrm -s $UNIDATA/norm2 nfc.txt uts46.txt |
| |
| * build ICU (make install) |
| * build Unicode tools using CMake+make |
| |
| * new way to call genuca (makeuca.sh was deleted) |
| - ~/svn.icu/tools/trunk/dbg/unicode$ c/genuca/genuca -i ~/svn.icu/trunk/dbg/data/out/build/icudt49l ~/svn.icu/trunk/src |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 6.1 update |
| |
| *** ICU Trac |
| |
| - ticket 8995 final update to Unicode 6.1 |
| - ticket 8994 regenerate source/layout/CanonData.cpp |
| |
| - ticket 8961 support Unicode "Age" value *names* |
| - ticket 8963 support multiple character name aliases & types |
| |
| - ticket 8827 "update ICU to Unicode 6.1" |
| - C++ branches/markus/uni61 at r30864 from trunk at r30843 |
| - Java branches/markus/uni61 at r30865 from trunk at r30863 |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| (configure.in & configure: have been modified to extract the version from uchar.h) |
| - com.ibm.icu.util.VersionInfo |
| - icutools/unicode/makedefs.sh |
| + also review & update other definitions in that file, |
| e.g. the ICU version in this path: BLD_DATA_FILES=$ICU_BLD/data/out/build/icudt49l |
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| |
| ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni61/20111205/ucd ~/uni61/processed |
| - This prepares both unidata and testdata files in respective output subfolders. |
| - Check test file diffs for previously commented-out, known-failing data lines; |
| probably need to keep those commented out. |
| |
| * PropertyValueAliases.txt changes |
| - 11 new block names: |
| Arabic_Extended_A |
| Arabic_Mathematical_Alphabetic_Symbols |
| Chakma |
| Meetei_Mayek_Extensions |
| Meroitic_Cursive |
| Meroitic_Hieroglyphs |
| Miao |
| Sharada |
| Sora_Sompeng |
| Sundanese_Supplement |
| Takri |
| -> add to uchar.h |
| -> add to UCharacter.UnicodeBlock IDs |
| Eclipse find UBLOCK_([^ ]+) = ([0-9]+), (/.+) |
| replace public static final int \1_ID = \2; \3 |
| -> add to UCharacter.UnicodeBlock objects |
| Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) |
| replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 |
| - 1 new Joining_Group (jg) value: |
| Rohingya_Yeh |
| -> uchar.h & UCharacter.JoiningGroup |
| - 2 new Line_Break (lb) values: |
| CJ=Conditional_Japanese_Starter |
| HL=Hebrew_Letter |
| -> uchar.h & UCharacter.LineBreak |
| - 7 new scripts: |
| sc ; Cakm ; Chakma |
| sc ; Merc ; Meroitic_Cursive |
| sc ; Mero ; Meroitic_Hieroglyphs |
| sc ; Plrd ; Miao |
| sc ; Shrd ; Sharada |
| sc ; Sora ; Sora_Sompeng |
| sc ; Takr ; Takri |
| -> remove these from SyntheticPropertyValueAliases.txt |
| -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
| - 2 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html |
| (added 2011-06-21) |
| Khoj 322 Khojki |
| Tirh 326 Tirhuta |
| and another one added 2011-12-09 |
| Hluw 080 Anatolian Hieroglyphs (Luwian Hieroglyphs, Hittite Hieroglyphs) |
| -> uscript.h |
| -> com.ibm.icu.lang.UScript |
| find USCRIPT_([^ ]+) *= ([0-9]+),(.+) |
| replace public static final int \1 = \2;\3 |
| -> SyntheticPropertyValueAliases.txt |
| -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
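The Eclipse find/replace patterns above also work outside Eclipse. A minimal sketch with Python's `re` module, using the same patterns and replacement strings, follows. The function names are hypothetical, and the sample enum lines in the usage note (numeric values and comments) are illustrative and not copied from uscript.h or uchar.h.

```python
import re

# Same patterns as the Eclipse find strings in the log.
USCRIPT_RE = re.compile(r"USCRIPT_([^ ]+) *= ([0-9]+),(.+)")
UBLOCK_ID_RE = re.compile(r"UBLOCK_([^ ]+) = ([0-9]+), (/.+)")


def uscript_to_java(c_line):
    """Turn a C 'USCRIPT_X = n,...' enum line into a Java int constant line."""
    return USCRIPT_RE.sub(r"public static final int \1 = \2;\3", c_line)


def ublock_id_to_java(c_line):
    """Turn a C 'UBLOCK_X = n, /...' enum line into a Java X_ID constant line."""
    return UBLOCK_ID_RE.sub(r"public static final int \1_ID = \2; \3", c_line)
```

For example, `uscript_to_java("USCRIPT_KHOJKI = 157,/* Khoj */")` yields `public static final int KHOJKI = 157;/* Khoj */`. Lines that do not match either pattern are returned unchanged.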
| |
| * UnicodeData.txt changes |
| - the last Unihan code point changes from U+9FCB to U+9FCC |
| search for both 9FCB (end) and 9FCC (limit) (regex 9FC[BC], case-insensitive) |
| + do change gennames.c |
| + do change swapCJK() in ucol.cpp & ImplicitCEGenerator.java |
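The search described above (regex 9FC[BC], case-insensitive) can be sketched as a small line scanner for source files. The function name is hypothetical, and matches still need manual review since the pattern can also hit unrelated hex strings.

```python
import re

UNIHAN_END_RE = re.compile(r"9FC[BC]", re.IGNORECASE)


def find_unihan_limits(lines):
    """Return (line number, text) for every line mentioning 9FCB or 9FCC."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, 1)
        if UNIHAN_END_RE.search(line)
    ]
```

It can be applied to an open file object, for example `find_unihan_limits(open(path, encoding="utf-8"))`.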
| |
| * DerivedBidiClass.txt changes |
| - 2 new default-AL blocks: |
| # Arabic Extended-A: U+08A0 - U+08FF (was default-R) |
| # Arabic Mathematical Alphabetic Symbols: |
| # U+1EE00 - U+1EEFF (was default-R) |
| - 2 new default-R blocks: |
| # Meroitic Hieroglyphs: |
| # U+10980 - U+1099F |
| # Meroitic Cursive: U+109A0 - U+109FF |
| -> should be picked up by the explicit data in the file |
| |
| * NameAliases.txt changes |
| - from |
| # Each line has two fields |
| # First field: Code point |
| # Second field: Alias |
| - to |
| # Each line has three fields, as described here: |
| # |
| # First field: Code point |
| # Second field: Alias |
| # Third field: Type |
| - Also, the file format previously allowed multiple aliases per code point,
| but only now does the file actually provide more than one, even several of the same type. For example,
| FEFF;BYTE ORDER MARK;alternate |
| FEFF;BOM;abbreviation |
| FEFF;ZWNBSP;abbreviation |
| - This breaks our gennames parser, unames.icu data structure, and API. |
| Fix gennames to only pick up "correction" aliases. |
| New ticket #8963 for further changes. |
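The interim fix described above (pick up only "correction" aliases from the three-field format) can be sketched as follows. This is an illustration in Python, not the gennames C code, and the function name is hypothetical.

```python
def correction_aliases(lines):
    """Map code point -> list of aliases whose type field is 'correction'."""
    result = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not data:
            continue
        code, alias, alias_type = [field.strip() for field in data.split(";")]
        if alias_type == "correction":
            result.setdefault(int(code, 16), []).append(alias)
    return result
```

With the three FEFF lines quoted above as input, the result is empty, because none of them has the type "correction".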
| |
| * run genpname/preparse.pl (on Linux) |
| + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname |
| + make sure that data.h is writable |
| + perl preparse.pl ~/svn.icu/trunk/src > out.txt |
| + preparse.pl shows no errors, out.txt Info and Warning lines look ok |
| |
| * build ICU (make install) |
| so that the tools build can pick up the new definitions from the installed header files. |
| * build Unicode tools (at least genpname) using CMake+make |
| |
| * run genpname |
| (builds both pnames.icu and propname_data.h) |
| - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in |
| - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource |
| |
| * build ICU (make install) |
| * build Unicode tools using CMake+make |
| |
| * update source/data/unidata/norm2/nfkc_cf.txt |
| - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt |
| |
| * update source/data/unidata/norm2/uts46.txt |
| - download http://www.unicode.org/Public/idna/6.1.0/IdnaMappingTable.txt |
| to ~/svn.icu/tools/trunk/src/unicode/py |
| - adjust idna2nrm.py to remove "; NV8": For UTS #46, we do not care about "not valid in IDNA2008". |
| - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py |
| - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2 |
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0..6.1: U+2260, U+226E, U+226F |
| - nothing new in 6.1, no test file to update |
| |
| * generate core properties data files |
| - in initial bootstrapping, change the UCA version |
| in source/data/unidata/FractionalUCA.txt to match the new Unicode version |
| - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld |
| - rebuild ICU & tools |
| + if genrb fails to build coll/root.res with a U_INVALID_FORMAT_ERROR,
| check if the UCA version in FractionalUCA.txt matches the new Unicode version |
| (see step above) |
| - run makeuca.sh so that genuca picks up the new case mappings and nfc.nrm: |
| ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld |
| - rebuild ICU & tools |
| |
| * update Java data files |
| - refresh just the UCD-related files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir /tmp/icu4j |
| - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| Unicode .icu files built to ./out/build/icudt49l |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt49b |
| mkdir -p ./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b |
| echo pnames.icu ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt49l.dat ./out/icu4j/icudt49b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt49l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt49b |
| mv ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/zoneinfo64.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/metaZones.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/timezoneTypes.res" ./out/icu4j/"com/ibm/icu/impl/data/icudt49b/windowsZones.res" "./out/icu4j/tzdata/com/ibm/icu/impl/data/icudt49b" |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt49b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| jar cf ./out/icu4j/icutzdata.jar -C ./out/icu4j/tzdata com/ibm/icu/impl/data/icudt49b/ |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icutzdata.jar /tmp/icu4j/main/shared/data |
| make[1]: Leaving directory `/home/mscherer/svn.icu/trunk/bld/data' |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr |
| ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b |
| ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/cnvalias.icu |
| ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt49b |
| ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll |
| ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/brkitr |
| - refresh ICU4J |
| ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * test ICU so far, fix test code where necessary |
| - temporarily ignore collation issues that look like UCA/UCD mismatches, |
| until UCA data is updated |
| |
| * UCA |
| |
| - get output from Mark's tools; look in |
| http://www.unicode.org/Public/UCA/6.1.0/CollationAuxiliary-<dev. version>.txt |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| (note removing the underscore before "Rules") |
| - update (ICU)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| with output from Mark's Unicode tools (..._CLDR_..._SHORT.txt) |
| - check test file diffs for previously commented-out, known-failing data lines; |
| probably need to keep those commented out |
| - check FractionalUCA.txt for manual changes of lead bytes from IMPLICIT to Hani |
| - run makeuca.sh: |
| ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld |
| - rebuild ICU4C |
| - refresh ICU4J collation data: |
| (subset of instructions above for properties data refresh, except copies all coll/*) |
| ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| ~/svn.icu/trunk/bld$ mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll |
| ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt49b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt49b/coll |
| ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt49b |
| - run all tests with the *_SHORT.txt or the full files (the full ones have comments, useful for debugging) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| |
| * When refreshing all of ICU4J data from ICU4C |
| - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data |
| or |
| - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install |
| |
| *** LayoutEngine script information |
| |
| (For details see the Unicode 5.2 change log below.) |
| |
| * Run icu4j-tools: com.ibm.icu.dev.tool.layout.ScriptNameBuilder. |
| This generates LEScripts.h, LELanguages.h, ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp |
| in the working directory. |
| (It also generates ScriptRunData.cpp, which is no longer needed.) |
| |
| The generated files have a current copyright date and "@draft" statement. |
| |
| - diff current <icu>/source/layout files vs. generated ones |
| ~/svn.icu4j/trunk/src$ kdiff3 ~/svn.icu/trunk/src/source/layout tools/misc/src/com/ibm/icu/dev/tool/layout |
| review and manually merge desired changes; |
| fix gratuitous changes, incorrect @draft and missing aliases; |
| Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. |
| - if you just copy the above files, then |
| fix mixed line endings, review the diffs as above and restore changes to API tags etc.; |
| manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h |
| |
| *** merge the Unicode update branches back onto the trunk |
| - do not merge the icudata.jar and testdata.jar, |
| instead rebuild them from merged & tested ICU4C |
| |
| ---------------------------------------------------------------------------- *** |
| |
| ICU 4.8 (no Unicode update, just new script codes) |
| |
| * 9 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html |
| (added 2010-12-21) |
| Afak 439 Afaka |
| Jurc 510 Jurchen |
| Mroo 199 Mro, Mru |
| Nshu 499 Nüshu |
| Shrd 319 Sharada, Śāradā |
| Sora 398 Sora Sompeng |
| Takr 321 Takri, Ṭākrī, Ṭāṅkrī |
| Tang 520 Tangut |
| Wole 480 Woleai |
| -> uscript.h |
| -> com.ibm.icu.lang.UScript |
| find USCRIPT_([^ ]+) *= ([0-9]+),(.+) |
| replace public static final int \1 = \2;\3 |
| -> genpname/SyntheticPropertyValueAliases.txt |
| -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
| |
| * run genpname/preparse.pl (on Linux) |
| + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname |
| + make sure that data.h is writable |
| + perl preparse.pl ~/svn.icu/trunk/src > out.txt |
| + preparse.pl shows no errors, out.txt Info and Warning lines look ok |
| |
| * rebuild Unicode tools (at least genpname) using make |
| - You might first need to "make install" ICU so that the tools build can pick |
| up the new definitions from the installed header files. |
| |
| * run genpname |
| (builds both pnames.icu and propname_data.h) |
| - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in |
| - ~/svn.icu/tools/trunk/bld/unicode/c$ genpname/genpname -v -d ~/svn.icu/trunk/src/source/common --csource |
| - rebuild ICU & tools |
| |
| * run genprops |
| - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/data/in -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 |
| - ~/svn.icu/tools/trunk/bld/unicode/c$ genprops/genprops -d ~/svn.icu/trunk/src/source/common --csource -s ~/svn.icu/trunk/src/source/data/unidata -i ~/svn.icu/trunk/dbg/data/out/build/icudt48l -u 6.0 |
| - rebuild ICU & tools |
| |
| * update Java data files |
| - refresh just the UCD-related files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir /tmp/icu4j |
| - ~/svn.icu/trunk/dbg$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt48b |
| ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/pnames.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b |
| ~/svn.icu/trunk/dbg/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt48b/uprops.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt48b |
| - refresh ICU4J |
| ~/svn.icu/trunk/dbg/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt48b |
| |
| * should have updated the layout engine script codes but forgot |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 6.0 update |
| |
| *** related ICU Trac tickets |
| |
| 7264 Unicode 6.0 Update |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| (configure.in & configure: have been modified to extract the version from uchar.h) |
| - com.ibm.icu.util.VersionInfo |
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| |
| ~/svn.icu/tools/trunk/src/unicode/c/genprops/misc$ ./ucdcopy.py ~/uni60/20100720/ucd ~/uni60/processed |
| - This now prepares both unidata and testdata files in respective output subfolders. |
| |
| * PropertyAliases.txt changes |
| - new Script_Extensions property defined in the new ScriptExtensions.txt file |
| but not listed in PropertyAliases.txt; reported to unicode.org; |
| -> added to tools/trunk/src/unicode/c/genpname/SyntheticPropertyAliases.txt |
| scx; Script_Extensions |
| -> uchar.h with new UProperty section |
| -> com.ibm.icu.lang.UProperty, parallel with uchar.h |
| |
| * PropertyValueAliases.txt changes |
| - 12 new block names: |
| Alchemical_Symbols |
| Bamum_Supplement |
| Batak |
| Brahmi |
| CJK_Unified_Ideographs_Extension_D |
| Emoticons |
| Ethiopic_Extended_A |
| Kana_Supplement |
| Mandaic |
| Miscellaneous_Symbols_And_Pictographs |
| Playing_Cards |
| Transport_And_Map_Symbols |
| -> add to uchar.h |
| -> add to UCharacter.UnicodeBlock |
| Eclipse find UBLOCK_([^ ]+) = [0-9]+, (/.+) |
| replace public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 |
| - Joining_Group (jg) values: |
| Teh_Marbuta_Goal becomes the new canonical value for the old Hamza_On_Heh_Goal, which becomes an alias
| -> uchar.h & UCharacter.JoiningGroup |
| - 3 new scripts: |
| sc ; Batk ; Batak |
| sc ; Brah ; Brahmi |
| sc ; Mand ; Mandaic |
| -> remove these from SyntheticPropertyValueAliases.txt |
| -> add alias USCRIPT_MANDAIC to USCRIPT_MANDAEAN |
| -> fix expectedLong names in cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
| - 13 new script codes from ISO 15924 http://www.unicode.org/iso15924/codechanges.html |
| (added 2009-11-11..2010-07-18) |
| Bass 259 Bassa Vah |
| Dupl 755 Duployan shorthand |
| Elba 226 Elbasan |
| Gran 343 Grantha |
| Kpel 436 Kpelle |
| Loma 437 Loma |
| Mend 438 Mende |
| Merc 101 Meroitic Cursive |
| Narb 106 Old North Arabian |
| Nbat 159 Nabataean |
| Palm 126 Palmyrene |
| Sind 318 Sindhi |
| Wara 262 Warang Citi |
| -> uscript.h |
| -> com.ibm.icu.lang.UScript |
| find USCRIPT_([^ ]+) *= ([0-9]+),(.+) |
| replace public static final int \1 = \2;\3 |
| -> SyntheticPropertyValueAliases.txt |
| -> add to expectedLong and expectedShort names in cintltst/cucdapi.c/TestUScriptCodeAPI() |
| and in com.ibm.icu.dev.test.lang.TestUScript.java |
| - ISO 15924 name change |
| Mero 100 Meroitic Hieroglyphs (was Meroitic) |
| -> add new alias USCRIPT_MEROITIC_HIEROGLYPHS to USCRIPT_MEROITIC |
| - property value alias added for Cham, was already moved out of SyntheticPropertyValueAliases.txt |
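| |
| A minimal Python sketch of the two find/replace patterns above, for use outside Eclipse. |
| The helper names are made up for illustration, and the sample input lines and their numeric values |
| are illustrative only, not copied from uchar.h or uscript.h. |

```python
import re

# Patterns transcribed from the Eclipse find/replace notes above.
UBLOCK_RE = re.compile(r'UBLOCK_([^ ]+) = [0-9]+, (/.+)')
USCRIPT_RE = re.compile(r'USCRIPT_([^ ]+) *= ([0-9]+),(.+)')


def block_to_java(line):
    """Turn a C UBLOCK_ enum line into a Java UnicodeBlock declaration."""
    return UBLOCK_RE.sub(
        r'public static final UnicodeBlock \g<1> = '
        r'new UnicodeBlock("\g<1>", \g<1>_ID); \g<2>', line)


def script_to_java(line):
    """Turn a C USCRIPT_ enum line into a Java int constant."""
    return USCRIPT_RE.sub(r'public static final int \g<1> = \g<2>;\g<3>', line)


if __name__ == '__main__':
    # Illustrative input lines (values are placeholders).
    print(block_to_java('    UBLOCK_BATAK = 199, /*[1BC0]*/'))
    print(script_to_java('      USCRIPT_BASSA_VAH = 134,/* Bass */'))
```

| Lines that do not match either pattern pass through unchanged, so the functions can be run over a whole enum. |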
| |
| * UnicodeData.txt changes |
| - new CJK block: |
| 2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;; |
| 2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;; |
| -> add to tools/trunk/src/unicode/c/gennames/gennames.c, with new ucdVersion |
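| |
| For spotting new algorithmic ranges like the one above, a small Python sketch that pairs up |
| the <..., First> and <..., Last> lines of UnicodeData.txt. The function name is hypothetical; |
| the sample lines are the two Extension D lines quoted above. |

```python
def algorithmic_ranges(lines):
    """Collect (start, end, name) from <..., First>/<..., Last> line pairs."""
    ranges = []
    pending = None  # (start code point, range name) of an open First line
    for line in lines:
        fields = line.split(';')
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        code, name = int(fields[0], 16), fields[1]
        if name.startswith('<') and name.endswith(', First>'):
            pending = (code, name[1:-len(', First>')])
        elif name.startswith('<') and name.endswith(', Last>'):
            if pending is None or pending[1] != name[1:-len(', Last>')]:
                raise ValueError('Last without matching First: ' + line)
            ranges.append((pending[0], code, pending[1]))
            pending = None
    return ranges


SAMPLE = [
    '2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;',
    '2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;',
]

if __name__ == '__main__':
    for start, end, name in algorithmic_ranges(SAMPLE):
        print('%04X..%04X %s (%d code points)'
              % (start, end, name, end - start + 1))
```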
| |
| * build Unicode tools using CMake+make |
| |
| * run genpname/preparse.pl (on Linux) |
| + cd ~/svn.icu/tools/trunk/src/unicode/c/genpname |
| + make sure that data.h is writable |
| + perl preparse.pl ~/svn.icu/trunk/src > out.txt |
| + preparse.pl shows no errors, out.txt Info and Warning lines look ok |
| |
| * rebuild Unicode tools (at least genpname) using make |
| - You might first need to "make install" ICU so that the tools build can pick |
| up the new definitions from the installed header files. |
| |
| * run genpname |
| - ~/svn.icu/tools/trunk/bld/unicode$ c/genpname/genpname -v -d ~/svn.icu/trunk/src/source/data/in |
| - rebuild ICU & tools |
| |
| * update source/data/unidata/norm2/nfkc_cf.txt |
| - follow the instructions in nfkc_cf.txt for updating it from DerivedNormalizationProps.txt |
| |
| * update source/data/unidata/norm2/uts46.txt |
| - download http://www.unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt |
| to ~/svn.icu/tools/trunk/src/unicode/py |
| - adjust idna2nrm.py to handle new disallowed_STD3_valid and disallowed_STD3_mapped values |
| - ~/svn.icu/tools/trunk/src/unicode/py$ ./idna2nrm.py |
| - ~/svn.icu/tools/trunk/src/unicode/py$ cp uts46.txt ~/svn.icu/trunk/src/source/data/unidata/norm2 |
| |
| * update uts46test.cpp and UTS46Test.java if there are new characters that are equivalent to |
| sequences with non-LDH ASCII (that is, their decompositions contain '=' or similar) |
| - grep IdnaMappingTable.txt or uts46.txt for "disallowed_STD3_valid" on non-ASCII characters |
| - Unicode 6.0: U+2260, U+226E, U+226F |
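| |
| The grep step can be sketched in Python as follows. This assumes the usual |
| "code point or range ; status ; ... # comment" line layout; the sample lines are illustrative, |
| not copied from IdnaMappingTable.txt, and the function names are hypothetical. |
| The second helper shows why these characters matter: their canonical decompositions start with non-LDH ASCII. |

```python
import unicodedata

# Illustrative lines in the assumed IdnaMappingTable.txt layout.
SAMPLE = [
    '003D          ; disallowed_STD3_valid  # EQUALS SIGN',
    '00C0          ; mapped ; 00E0          # LATIN CAPITAL LETTER A WITH GRAVE',
    '2260          ; disallowed_STD3_valid  # NOT EQUAL TO',
    '226E..226F    ; disallowed_STD3_valid  # NOT LESS-THAN..NOT GREATER-THAN',
]


def non_ascii_std3_valid(lines):
    """Return non-ASCII code points whose status is disallowed_STD3_valid."""
    found = []
    for line in lines:
        data = line.split('#', 1)[0].strip()
        fields = [f.strip() for f in data.split(';')]
        if len(fields) < 2 or fields[1] != 'disallowed_STD3_valid':
            continue
        first, _, last = fields[0].partition('..')
        start = int(first, 16)
        end = int(last, 16) if last else start
        found.extend(cp for cp in range(start, end + 1) if cp > 0x7F)
    return found


def leading_ascii_of_nfd(cp):
    """Return the first character of the NFD form if it is ASCII, else None."""
    first = unicodedata.normalize('NFD', chr(cp))[0]
    return first if ord(first) < 0x80 else None


if __name__ == '__main__':
    for cp in non_ascii_std3_valid(SAMPLE):
        print('U+%04X starts with %r after NFD' % (cp, leading_ascii_of_nfd(cp)))
```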
| |
| * generate core properties data files |
| - ~/svn.icu/tools/trunk/src/unicode$ ./makeprops.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld |
| - rebuild ICU & tools |
| - run makeuca.sh so that genuca picks up the new nfc.nrm: |
| ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld |
| - rebuild ICU & tools |
| |
| * implement new Script_Extensions property (provisional) |
| - parser & generator: genprops & uprops.icu |
| - uscript.h, uprops.h, uchar.c, uniset_props.cpp and others, plus cintltst/cucdapi.c & intltest/usettest.cpp |
| - UScript.java, UCharacterProperty.java, UnicodeSet.java, TestUScript.java, UnicodeSetTest.java |
| |
| * switch ubidi.icu, ucase.icu and uprops.icu from UTrie to UTrie2 |
| - (one-time change) |
| - genbidi/gencase/genprops tools changes |
| - re-run makeprops.sh (see above) |
| - UCharacterProperty.java, UCharacterTypeIterator.java, |
| UBiDiProps.java, UCaseProps.java, and several others with minor changes; |
| UCharacterPropertyReader.java deleted and its code folded into UCharacterProperty.java |
| |
| * update Java data files |
| - refresh just the UCD-related files, just to be safe |
| - see (ICU4C)/source/data/icu4j-readme.txt |
| - mkdir /tmp/icu4j |
| - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| output: |
| ... |
| Unicode .icu files built to ./out/build/icudt45l |
| mkdir -p ./out/icu4j/com/ibm/icu/impl/data/icudt45b |
| echo ubidi.icu ucase.icu uprops.icu > ./out/icu4j/add.txt |
| LD_LIBRARY_PATH=../lib:../stubdata:../tools/ctestfw:$LD_LIBRARY_PATH ../bin/icupkg ./out/tmp/icudt45l.dat ./out/icu4j/icudt45b.dat -a ./out/icu4j/add.txt -s ./out/build/icudt45l -x '*' -tb -d ./out/icu4j/com/ibm/icu/impl/data/icudt45b |
| jar cf ./out/icu4j/icudata.jar -C ./out/icu4j com/ibm/icu/impl/data/icudt45b |
| mkdir -p /tmp/icu4j/main/shared/data |
| cp ./out/icu4j/icudata.jar /tmp/icu4j/main/shared/data |
| - copy the big-endian Unicode data files to another location, |
| separate from the other data files |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr |
| ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b |
| ~/svn.icu/trunk/bld/data/out/icu4j$ rm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/cnvalias.icu |
| ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/*.nrm /tmp/icu4j/com/ibm/icu/impl/data/icudt45b |
| ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/*.icu /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll |
| ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/brkitr/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/brkitr |
| - refresh ICU4J |
| ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b |
| |
| * refresh Java test .txt files |
| - copy new .txt files into ICU4J's main/tests/core/src/com/ibm/icu/dev/data/unicode |
| |
| * un-hardcode normalization skippable (NF*_Inert) test data |
| - removes one manual step from the Unicode upgrade, and removes dependency on one of Mark's tools |
| |
| * copy updated break iterator test files |
| - now handled by early ucdcopy.py and |
| copying the uni60/processed/testdata files to ~/svn.icu/trunk/src/source/test/testdata |
| (old instructions: |
| copy from (Unicode 6.0)/ucd/auxiliary/*BreakTest-6....txt |
| to ~/svn.icu/trunk/src/source/test/testdata) |
| - they are not used in ICU4J |
| |
| * UCA |
| |
| - get output from Mark's tools; look in |
| http://www.unicode.org/~book/incoming/mark/uca6.0.0/ |
| http://www.macchiato.com/unicode/utc/additional-uca-files |
| http://www.unicode.org/Public/UCA/6.0.0/ |
| http://www.unicode.org/~mdavis/uca/ |
| - update source/data/unidata/FractionalUCA.txt with FractionalUCA_SHORT.txt |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt |
| - update Han-implicit ranges for new CJK extensions: |
| swapCJK() in ucol.cpp & ImplicitCEGenerator.java |
| - genuca: allow bytes 02 for U+FFFE, new merge-sort character; |
| do not add it into invuca so that tailoring primary-after an ignorable works |
| - genuca: permit space between [variable top] bytes |
| - ucol.cpp: treat noncharacters like unassigned rather than ignorable |
| - run makeuca.sh: |
| ~/svn.icu/tools/trunk/src/unicode$ ./makeuca.sh ~/svn.icu/trunk/src ~/svn.icu/trunk/bld |
| - rebuild ICU4C |
| - refresh ICU4J collation data: |
| (subset of instructions above for properties data refresh, except copies all coll/*) |
| ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| mkdir -p /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll |
| ~/svn.icu/trunk/bld/data/out/icu4j$ cp com/ibm/icu/impl/data/icudt45b/coll/* /tmp/icu4j/com/ibm/icu/impl/data/icudt45b/coll |
| ~/svn.icu/trunk/bld/data/out/icu4j$ jar uf ~/svn.icu4j/trunk/src/main/shared/data/icudata.jar -C /tmp/icu4j com/ibm/icu/impl/data/icudt45b |
| - update (ICU)/source/test/testdata/CollationTest_*.txt |
| and (ICU4J)/main/tests/collate/src/com/ibm/icu/dev/data/CollationTest_*.txt |
| with output from Mark's Unicode tools |
| - run all tests with the *_SHORT.txt or the full files (the full ones have comments) |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| |
| * When refreshing all of ICU4J data from ICU4C |
| - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=/tmp/icu4j icu4j-data-install |
| - cp /tmp/icu4j/main/shared/data/icudata.jar ~/svn.icu4j/trunk/src/main/shared/data |
| or |
| - ~/svn.icu/trunk/bld$ make ICU4J_ROOT=~/svn.icu4j/trunk/src icu4j-data-install |
| |
| *** LayoutEngine script information |
| |
| (For details see the Unicode 5.2 change log below.) |
| |
| * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, |
| ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates |
| ScriptRunData.cpp, which is no longer needed.) |
| |
| The generated files have a current copyright date and "@draft" statement. |
| |
| * copy the above files into <icu>/source/layout, replacing the old files. |
| * fix mixed line endings |
| * review the diffs and fix incorrect @draft and missing aliases; |
| Unicode-derived script codes should be "born stable" like constants in uchar.h, uscript.h etc. |
| * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 5.2 update |
| |
| *** related ICU Trac tickets |
| |
| 7084 Unicode 5.2 |
| |
| 7167 verify collation bytes |
| 7235 Java test NAME_ALIAS |
| 7236 Java DerivedCoreProperties.txt test |
| 7237 Java BidiTest.txt |
| 7238 UTrie2 in core unidata |
| 7239 test for tailoring gaps |
| 7240 Java fix CollationMiscTest |
| 7243 update layout engine for Unicode 5.2 |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - configure.in & configure |
| - update ucdVersion in gennames.c if an algorithmic range changes |
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| |
| python source\tools\genprops\misc\ucdcopy.py "C:\Documents and Settings\mscherer\My Documents\unicode\ucd\5.2.0" C:\svn\icuproj\icu\trunk\source\data\unidata |
| - includes finding files regardless of version numbers, |
| copying them, and performing the equivalent processing of the |
| ucdstrip and ucdmerge tools on the desired set of files |
| |
| * notes on changes |
| - PropertyAliases.txt |
| moved from numeric to enumerated: |
| ccc ; Canonical_Combining_Class |
| new string properties: |
| NFKC_CF ; NFKC_Casefold |
| Name_Alias; Name_Alias |
| new binary properties: |
| Cased ; Cased |
| CI ; Case_Ignorable |
| CWCF ; Changes_When_Casefolded |
| CWCM ; Changes_When_Casemapped |
| CWKCF ; Changes_When_NFKC_Casefolded |
| CWL ; Changes_When_Lowercased |
| CWT ; Changes_When_Titlecased |
| CWU ; Changes_When_Uppercased |
| new CJK Unihan properties (not supported by ICU) |
| - PropertyValueAliases.txt |
| new block names |
| new scripts |
| one script code change: |
| sc ; Qaai ; Inherited |
| -> |
| sc ; Zinh ; Inherited ; Qaai |
| new Line_Break (lb) value: |
| lb ; CP ; Close_Parenthesis |
| new Joining_Group (jg) values: Farsi_Yeh, Nya |
| other new values: |
| ccc; 214; ATA ; Attached_Above |
| - DerivedBidiClass.txt |
| new default-R range: U+1E800 - U+1EFFF |
| - UnicodeData.txt |
| all of the ISO comments are gone |
| new CJK block end: |
| 9FC3;<CJK Ideograph, Last> -> 9FCB;<CJK Ideograph, Last> |
| new CJK block: |
| 2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;; |
| 2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;; |
| |
| * genpname |
| - run preparse.pl |
| + cd \svn\icuproj\icu\trunk\source\tools\genpname |
| + make sure that data.h is writable |
| + perl preparse.pl \svn\icuproj\icu\trunk > out.txt |
| + preparse.pl complains with errors like the following: |
| Error: sc:Egyp already set to Egyptian_Hieroglyphs, cannot set to Egyp at preparse.pl line 1322, <GEN6> line 34. |
| This is because ICU 4.0 had scripts from ISO 15924 which are now |
| added to Unicode 5.2, and the Perl script shows a conflict between SyntheticPropertyValueAliases.txt |
| and PropertyValueAliases.txt. |
| -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt: |
| Egyp, Java, Lana, Mtei, Orkh, Armi, Avst, Kthi, Phli, Prti, Samr, Tavt |
| + preparse.pl complains with errors about block names missing from uchar.h; add them |
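| |
| To find such conflicts before running preparse.pl, a rough Python sketch that lists script codes |
| present in both files. It assumes both files use the "sc ; short ; long" layout; the function names |
| and the sample lines are illustrative, not taken from the real files. |

```python
def script_codes(lines):
    """Return {short code: long name} for 'sc' lines in alias-file syntax."""
    codes = {}
    for line in lines:
        data = line.split('#', 1)[0]  # drop trailing comments
        fields = [f.strip() for f in data.split(';')]
        if len(fields) >= 3 and fields[0] == 'sc':
            codes[fields[1]] = fields[2]
    return codes


def duplicated_scripts(official_lines, synthetic_lines):
    """Script codes defined in the synthetic file and also in the official one."""
    official = script_codes(official_lines)
    synthetic = script_codes(synthetic_lines)
    return sorted(code for code in synthetic if code in official)


# Illustrative stand-ins for PropertyValueAliases.txt and
# SyntheticPropertyValueAliases.txt content.
OFFICIAL = [
    'sc ; Egyp ; Egyptian_Hieroglyphs',
    'sc ; Java ; Javanese',
    'lb ; CP ; Close_Parenthesis',
]
SYNTHETIC = [
    '# scripts from ISO 15924 that are not yet in Unicode',
    'sc ; Egyp ; Egyptian_Hieroglyphs',
    'sc ; Blis ; Blissymbols',
]

if __name__ == '__main__':
    print(duplicated_scripts(OFFICIAL, SYNTHETIC))
```

| Each reported code is a candidate for removal from SyntheticPropertyValueAliases.txt. |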
| |
| * uchar.h & uscript.h & uprops.h & uprops.c & genprops |
| - new block & script values |
| + 26 new blocks |
| copy new blocks from Blocks.txt |
| MS VC++ 2008 regular expression: |
| find "^{[0-9A-F]+}\.\.{[0-9A-F]+}; {[A-Z].+}$" |
| replace with " UBLOCK_\3 = 172, /*[\1]*/" |
| + several new script values already added in ICU 4.0 for ISO 15924 coverage |
| (removed from SyntheticPropertyValueAliases.txt, see genpname notes above) |
| + 3 new script values added for ISO 15924 and Unicode 5.2 coverage |
| + 1 new script value added for ISO 15924 coverage (not in Unicode 5.2) |
| (added to SyntheticPropertyValueAliases.txt) |
| - new Joining Group (JG) values: Farsi_Yeh, Nya |
| - new Line_Break (lb) value: |
| lb ; CP ; Close_Parenthesis |
| |
| * hardcoded Unihan range end/limit |
| - Unihan range end moves from 9FC3 to 9FCB |
| search for both 9FC3 (end) and 9FC4 (limit) (regex 9FC[34], case-insensitive) |
| + do change gennames.c |
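| |
| The search above, as a small Python sketch. The sample source lines are hypothetical; |
| in practice the function would be fed the lines of each candidate source file. |

```python
import re

# Old Unihan range end (9FC3) and limit (9FC4), case-insensitive.
UNIHAN_END_OR_LIMIT = re.compile(r'9FC[34]', re.IGNORECASE)


def find_hardcoded(lines):
    """Return (line number, text) for lines mentioning the old end or limit."""
    return [(number, line)
            for number, line in enumerate(lines, 1)
            if UNIHAN_END_OR_LIMIT.search(line)]


# Hypothetical source lines for illustration.
SAMPLE = [
    'if (c <= 0x9fc3) {',
    '#define CJK_LIMIT 0x9FC4',
    'x = 0x9FA5;',
]

if __name__ == '__main__':
    for number, line in find_hardcoded(SAMPLE):
        print('%d: %s' % (number, line))
```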
| |
| * Compare definitions of new binary properties with what we used to use |
| in algorithms, to see if the definitions changed. |
| - Verified that definitions for Cased and Case_Ignorable are unchanged. |
| The gencase tool now parses the newly public Case_Ignorable values |
| in case the definition changes in the future. |
| |
| * uchar.c & uprops.h & uprops.c & genprops |
| - new numeric values that didn't exist in Unicode data before: |
| 1/7, 1/9, 1/10, 3/10, 1/16, 3/16 |
| the ones with denominators >9 cannot be supported by uprops.icu formatVersion 5, |
| therefore redesign the encoding of numeric types and values for formatVersion 6; |
| design for simple numbers up to at least 144 ("one gross"), |
| large values up to at least 10^20, |
| and fractions with numerators -1..17 and denominators 1..16 |
| to cover current and expected future values |
| (e.g., more Han numeric values, Meroitic twelfths) |
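| |
| To illustrate the fraction part of that design, here is one hypothetical way to pack a numerator |
| in -1..17 and a denominator in 1..16 into a small integer. This is a sketch only and is not |
| claimed to be the actual uprops.icu formatVersion 6 layout; the function names are made up. |

```python
MIN_NUMERATOR, MAX_NUMERATOR = -1, 17
MIN_DENOMINATOR, MAX_DENOMINATOR = 1, 16


def pack_fraction(numerator, denominator):
    """Pack a fraction into an int: biased numerator above a 4-bit denominator."""
    if not MIN_NUMERATOR <= numerator <= MAX_NUMERATOR:
        raise ValueError('numerator out of range: %d' % numerator)
    if not MIN_DENOMINATOR <= denominator <= MAX_DENOMINATOR:
        raise ValueError('denominator out of range: %d' % denominator)
    return ((numerator - MIN_NUMERATOR) << 4) | (denominator - MIN_DENOMINATOR)


def unpack_fraction(packed):
    """Inverse of pack_fraction(); returns (numerator, denominator)."""
    return (packed >> 4) + MIN_NUMERATOR, (packed & 0xF) + MIN_DENOMINATOR


if __name__ == '__main__':
    # The new Unicode 5.2 fractions listed above.
    for num, den in [(1, 7), (1, 9), (1, 10), (3, 10), (1, 16), (3, 16)]:
        packed = pack_fraction(num, den)
        print('%d/%d -> 0x%03X -> %r' % (num, den, packed, unpack_fraction(packed)))
```

| With these ranges the largest packed value still fits in 9 bits. |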
| |
| * reimplement Hangul_Syllable_Type for new Jamo characters |
| - the old code assumed that all Jamo characters are in the 11xx block |
| - Unicode 5.2 fills holes there and adds new Jamo characters in |
| A960..A97F; Hangul Jamo Extended-A |
| and in |
| D7B0..D7FF; Hangul Jamo Extended-B |
| - Hangul_Syllable_Type can be trivially derived from a subset of |
| Grapheme_Cluster_Break values |
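| |
| The derivation can be sketched in Python like this. Short property value names are used, |
| and the function names are hypothetical. Python's standard library has no Grapheme_Cluster_Break lookup, |
| so the sketch takes the GCB value as input and only computes it for precomposed Hangul syllables. |

```python
# The five GCB values that carry over unchanged to Hangul_Syllable_Type.
HANGUL_GCB_VALUES = frozenset(['L', 'V', 'T', 'LV', 'LVT'])


def hangul_syllable_type(gcb):
    """Derive hst from a GCB short value name; everything else is NA."""
    return gcb if gcb in HANGUL_GCB_VALUES else 'NA'


def gcb_of_hangul_syllable(cp):
    """GCB of a precomposed Hangul syllable (U+AC00..U+D7A3), else None."""
    if 0xAC00 <= cp <= 0xD7A3:
        # Each leading-consonant/vowel pair has 28 syllables; the first has no trailing consonant.
        return 'LV' if (cp - 0xAC00) % 28 == 0 else 'LVT'
    return None


if __name__ == '__main__':
    for cp in (0xAC00, 0xAC01):
        print('U+%04X hst=%s' % (cp, hangul_syllable_type(gcb_of_hangul_syllable(cp))))
    print('CR ->', hangul_syllable_type('CR'))
```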
| |
| * build Unicode data source code for hardcoding core data |
| C:\svn\icuproj\icu\trunk\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\trunk\source\data\ CFG=x86\release uni-core-data |
| |
| ICU data make path is \svn\icuproj\icu\trunk\source\data\ |
| ICU root path is \svn\icuproj\icu\trunk |
| Information: cannot find "ucmlocal.mk". Not building user-additional converter files. |
| Information: cannot find "brklocal.mk". Not building user-additional break iterator files. |
| Information: cannot find "reslocal.mk". Not building user-additional resource bundle files. |
| Information: cannot find "collocal.mk". Not building user-additional resource bundle files. |
| Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files. |
| Information: cannot find "trnslocal.mk". Not building user-additional transliterator files. |
| Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files. |
| Information: cannot find "spreplocal.mk". Not building user-additional stringprep files. |
| Creating data file for Unicode Property Names |
| Creating data file for Unicode Character Properties |
| Creating data file for Unicode Case Mapping Properties |
| Creating data file for Unicode BiDi/Shaping Properties |
| Creating data file for Unicode Normalization |
| Unicode .icu files built to "\svn\icuproj\icu\trunk\source\data\out\build\icudt43l" |
| Unicode .c source files built to "\svn\icuproj\icu\trunk\source\data\out\tmp" |
| |
| - copy the .c source files to C:\svn\icuproj\icu\trunk\source\common |
| and rebuild the common library |
| |
| *** UCA |
| |
| - update FractionalUCA.txt with new canonical closure (output from Mark's Unicode tools) |
| - update source/data/unidata/UCARules.txt with UCA_Rules_SHORT.txt from Mark's Unicode tools |
| - update source/test/testdata/CollationTest_*.txt with output from Mark's Unicode tools |
| [ Begin obsolete instructions: |
| Starting with UCA 5.2, we use the CollationTest_*_SHORT.txt files not the *_STUB.txt files. |
| - generate the source/test/testdata/CollationTest_*_STUB.txt files via source/tools/genuca/genteststub.py |
| on Windows: |
| python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_NON_IGNORABLE_SHORT.txt CollationTest_NON_IGNORABLE_STUB.txt |
| python C:\svn\icuproj\icu\trunk\source\tools\genuca\genteststub.py CollationTest_SHIFTED_SHORT.txt CollationTest_SHIFTED_STUB.txt |
| End obsolete instructions] |
| - run all tests with the *_SHORT.txt or the full files (the full ones have comments) |
| not just the *_STUB.txt files |
| - note on intltest: if collate/UCAConformanceTest fails, then |
| utility/MultithreadTest/TestCollators will fail as well; |
| fix the conformance test before looking into the multi-thread test |
| |
| *** Implement Cased & Case_Ignorable properties |
| - via UProperty; call ucase.h functions ucase_getType() and ucase_getTypeOrIgnorable() |
| - Problem: These properties should be disjoint, but aren't |
| - UTC 2009nov decision: skip all Case_Ignorable regardless of whether they are Cased or not |
| - change ucase.icu to be able to store any combination of Cased and Case_Ignorable |
| |
| *** Implement Changes_When_Xyz properties |
| - without stored data |
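| |
| "Without stored data" works because each property can be computed from the case mappings: |
| a character has Changes_When_Xyz if applying the mapping to its NFD form changes it. |
| A Python sketch, assuming Python's bundled Unicode data (normally newer than Unicode 5.2); |
| the function names are made up and this is not the ICU implementation. |

```python
import unicodedata


def _changes(ch, mapping):
    """True if the case mapping changes the NFD form of ch."""
    nfd = unicodedata.normalize('NFD', ch)
    return mapping(nfd) != nfd


def changes_when_lowercased(ch):
    return _changes(ch, str.lower)


def changes_when_uppercased(ch):
    return _changes(ch, str.upper)


def changes_when_casefolded(ch):
    return _changes(ch, str.casefold)


if __name__ == '__main__':
    for ch in ('A', 'a', '\u00DF', '1'):
        print('U+%04X CWL=%s CWU=%s CWCF=%s' % (
            ord(ch), changes_when_lowercased(ch),
            changes_when_uppercased(ch), changes_when_casefolded(ch)))
```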
| |
| *** Implement Name_Alias property |
| - add it as another name field in unames.icu |
| - make it available via u_charName() and UCharNameChoice |
| - consider it in u_charFromName() |
| |
| *** Break iterators |
| |
| * Update break iterator rules to new UAX versions and new property values |
| * Update source/test/testdata/<boundary>Test.txt files from <unicode.org ucd>/ucd/auxiliary |
| |
| *** new BidiTest file |
| - review format and data |
| - copy BidiTest.txt to source/test/testdata |
| - write test code using this data |
| - fix ICU code where it fails the conformance test |
| |
| *** Java |
| - generally, find and update code corresponding to C/C++ |
| - UCharacter.UnicodeBlock constants: |
| a) add an _ID integer per new block, update COUNT |
| b) add a class instance per new block |
| Visual Studio regex: |
| find UBLOCK_{[^ ]+} = [0-9]+, {/.+} |
| replace with public static final UnicodeBlock \1 = new UnicodeBlock("\1", \1_ID); \2 |
| - CHAR_NAME_ALIAS -> UCharacter.getNameAlias() and getCharFromNameAlias() |
| |
| - port test changes to Java |
| |
| *** LayoutEngine script information |
| |
| (For comparison, see the Unicode 5.1 update: http://bugs.icu-project.org/trac/changeset/23833) |
| |
| * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, |
| ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates |
| ScriptRunData.cpp, which is no longer needed.) |
| |
| The generated files have a current copyright date and "@draft" statement. |
| |
| -> Eric Mader wrote in email on 20090930: |
| "I think the tool has been modified to update @draft to @stable for |
| older scripts and to add @draft for new scripts. |
| (I worked with an intern on this last year.) |
| You should check the output after you run it." |
| |
| * copy the above files into <icu>/source/layout, replacing the old files. |
| * fix mixed line endings |
| * review the diffs and fix incorrect @draft and missing aliases |
| * manually re-add the "Indic script xyz v.2" tags in ScriptAndLanguageTags.h |
| |
| Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp |
| and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...) |
| |
| -> Eric Mader wrote in email on 20090930: |
| "This is just a matter of making sure that all the per-script tables have |
| entries for any new scripts that were added. |
| If any new Indic characters were added, then the class tables in |
| IndicClassTables.cpp should be updated to reflect this. |
| John Emmons should know how to do this if it's required." |
| |
| * rebuild the layout and layoutex libraries. |
| |
| *** Documentation |
| - Update User Guide |
| + Jamo_Short_Name, sfc->scf, binary property value aliases |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 5.1 update |
| |
| *** related ICU Trac tickets |
| |
| 5696 Update to Unicode 5.1 |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - configure.in & configure |
| - update ucdVersion in gennames.c if an algorithmic range changes |
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| - ucdstrip: |
| DerivedCoreProperties.txt |
| DerivedNormalizationProps.txt |
| NormalizationTest.txt |
| PropList.txt |
| Scripts.txt |
| GraphemeBreakProperty.txt |
| SentenceBreakProperty.txt |
| WordBreakProperty.txt |
| - ucdstrip and ucdmerge: |
| EastAsianWidth.txt |
| LineBreak.txt |
| |
| * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers) |
| copy 5.1.0\ucd\BidiMirroring.txt ..\unidata\ |
| copy 5.1.0\ucd\Blocks.txt ..\unidata\ |
| copy 5.1.0\ucd\CaseFolding.txt ..\unidata\ |
| copy 5.1.0\ucd\DerivedAge.txt ..\unidata\ |
| copy 5.1.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\ |
| copy 5.1.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\ |
| copy 5.1.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\ |
| copy 5.1.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\ |
| copy 5.1.0\ucd\NormalizationCorrections.txt ..\unidata\ |
| copy 5.1.0\ucd\PropertyAliases.txt ..\unidata\ |
| copy 5.1.0\ucd\PropertyValueAliases.txt ..\unidata\ |
| copy 5.1.0\ucd\SpecialCasing.txt ..\unidata\ |
| copy 5.1.0\ucd\UnicodeData.txt ..\unidata\ |
| |
| ucdstrip < 5.1.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt |
| ucdstrip < 5.1.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt |
| ucdstrip < 5.1.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt |
| ucdstrip < 5.1.0\ucd\PropList.txt > ..\unidata\PropList.txt |
| ucdstrip < 5.1.0\ucd\Scripts.txt > ..\unidata\Scripts.txt |
| ucdstrip < 5.1.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt |
| ucdstrip < 5.1.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt |
| ucdstrip < 5.1.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt |
| ucdstrip < 5.1.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt |
| ucdstrip < 5.1.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt |
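| |
| This log does not spell out what ucdmerge does. Assuming it joins adjacent ranges that carry |
| the same property value, the idea can be sketched in Python as follows (hypothetical function name, |
| illustrative sample values; not the tool's actual code). |

```python
def merge_adjacent(entries):
    """Merge sorted (start, end, value) ranges that touch and share a value."""
    merged = []
    for start, end, value in entries:
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == start:
            # Extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], end, value)
        else:
            merged.append((start, end, value))
    return merged


# Illustrative input: three touching ranges with the same value, then a gap.
SAMPLE = [
    (0x20, 0x20, 'Na'),
    (0x21, 0x23, 'Na'),
    (0x24, 0x24, 'Na'),
    (0xA1, 0xA1, 'A'),
]

if __name__ == '__main__':
    for start, end, value in merge_adjacent(SAMPLE):
        print('%04X..%04X;%s' % (start, end, value))
```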
| |
| * genpname |
| - run preparse.pl |
| + cd \svn\icuproj\icu\uni51\source\tools\genpname |
| + make sure that data.h is writable |
| + perl preparse.pl \svn\icuproj\icu\uni51 > out.txt |
| + preparse.pl complains with errors like the following: |
| Error: sc:Cari already set to Carian, cannot set to Cari at preparse.pl line 1308, <GEN6> line 30. |
| This is because ICU 3.8 had scripts from ISO 15924 which are now |
| added to Unicode 5.1, and the script shows a conflict between SyntheticPropertyValueAliases.txt |
| and PropertyValueAliases.txt. |
| -> Removed duplicate script entries from SyntheticPropertyValueAliases.txt: |
| Cari, Cham, Kali, Lepc, Lyci, Lydi, Olck, Rjng, Saur, Sund, Vaii |
| + PropertyValueAliases.txt now explicitly contains values for boolean properties: |
| N/Y, No/Yes, F/T, False/True |
| -> Added N/No and Y/Yes to preparse.pl function read_PropertyValueAliases. |
| It will use further values from the file if present. |
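| |
| A sketch of accepting all eight boolean value aliases, written in Python rather than Perl. |
| The table and function name are hypothetical; matching here is simply case-insensitive, |
| which is an assumption and looser than an exact comparison. |

```python
# All boolean property value aliases listed in PropertyValueAliases.txt.
BINARY_VALUE_ALIASES = {
    'n': False, 'no': False, 'f': False, 'false': False,
    'y': True, 'yes': True, 't': True, 'true': True,
}


def parse_binary_value(alias):
    """Map a boolean property value alias to True/False."""
    key = alias.strip().lower()
    if key not in BINARY_VALUE_ALIASES:
        raise ValueError('not a binary property value alias: %r' % alias)
    return BINARY_VALUE_ALIASES[key]


if __name__ == '__main__':
    for alias in ('N', 'No', 'F', 'False', 'Y', 'Yes', 'T', 'True'):
        print(alias, '->', parse_binary_value(alias))
```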
| |
| * uchar.h & uscript.h & uprops.h & uprops.c & genprops |
| - new block & script values |
| + 17 new blocks |
| + 11 new script values already added in ICU 3.8 for ISO 15924 coverage |
| (removed from SyntheticPropertyValueAliases.txt) |
| + 14 new script values added for ISO 15924 coverage (not in Unicode 5.1) |
| (added to SyntheticPropertyValueAliases.txt) |
| - uprops.icu (uprops.h) only provides 7 bits for script codes. |
| In ICU 4.0 there are USCRIPT_CODE_LIMIT=130 script codes now. |
| No script code above 127 is assigned to any Unicode character yet, |
| so ICU 4.0 uprops.icu does not store any |
| script code values greater than 127. |
| However, it does need to store the maximum script value=USCRIPT_CODE_LIMIT-1=129 |
| in a parallel bit field, and that overflows now. |
| Also, future values >=128 would be incompatible anyway. |
| uprops.h is modified to move around several of the bit fields |
| in the properties vector words, and now uses 8 bits for the script code. |
| Two other bit fields also grow to accommodate future growth: |
| Block (current count: 172) grows from 8 to 9 bits, |
| and Word_Break grows from 4 to 5 bits. |
| - renamed property Simple_Case_Folding (sfc->scf) |
| + nothing to be done: handled as normal alias |
| - new property JSN Jamo_Short_Name |
| + no new API: only contributes to the Name property |
| - new Grapheme_Cluster_Break (GCB) value: SM=SpacingMark |
| - new Joining Group (JG) value: Burushashki_Yeh_Barree |
| - new Sentence_Break (SB) values: |
| SB ; CR ; CR |
| SB ; EX ; Extend |
| SB ; LF ; LF |
| SB ; SC ; SContinue |
| - new Word_Break (WB) values: |
| WB ; CR ; CR |
| WB ; Extend ; Extend |
| WB ; LF ; LF |
| WB ; MB ; MidNumLet |
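| |
| The bit-field overflow described in the uprops.icu note above comes down to simple arithmetic. |
| A Python sketch with a hypothetical helper; the constants are the ones quoted in that note. |

```python
def bits_needed(max_value):
    """Number of bits needed to store values 0..max_value."""
    return max(1, max_value.bit_length())


USCRIPT_CODE_LIMIT = 130   # ICU 4.0, from the note above
BLOCK_COUNT = 172          # current block count, from the note above

if __name__ == '__main__':
    # The maximum script value 129 no longer fits into 7 bits.
    print('script max %d needs %d bits'
          % (USCRIPT_CODE_LIMIT - 1, bits_needed(USCRIPT_CODE_LIMIT - 1)))
    # Block values still fit into 8 bits; the field grows to 9 for future growth.
    print('block max %d needs %d bits'
          % (BLOCK_COUNT - 1, bits_needed(BLOCK_COUNT - 1)))
```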
| |
| * Further changes in the 2008-02-29 update: |
| - Default_Ignorable_Code_Point: The new file removes Cc, Cs, noncharacters from DICP |
| because they should not normally be invisible. |
| - new Joining Group (JG) value Burushashki_Yeh_Barree was renamed to Burushaski_Yeh_Barree (one 'h' removed) |
| - new Grapheme_Cluster_Break (GCB) value: PP=Prepend |
| - new Word_Break (WB) value: NL=Newline |
| |
| * hardcoded Unihan range end/limit (see Unicode 4.1 update for comparison) |
| - Unihan range end moves from 9FBB to 9FC3 |
| search for both 9FBB (end) and 9FBC (limit) (regex 9FB[BC], case-insensitive) |
| + do change gennames.c |
| |
| * build Unicode data source code for hardcoding core data |
| C:\svn\icuproj\icu\uni51\source\data>NMAKE /f makedata.mak ICUMAKE=\svn\icuproj\icu\uni51\source\data\ CFG=debug uni-core-data |
| |
| ICU data make path is \svn\icuproj\icu\uni51\source\data\ |
| ICU root path is \svn\icuproj\icu\uni51 |
| Information: cannot find "ucmlocal.mk". Not building user-additional converter files. |
| Information: cannot find "brklocal.mk". Not building user-additional break iterator files. |
| Information: cannot find "reslocal.mk". Not building user-additional resource bundle files. |
| Information: cannot find "collocal.mk". Not building user-additional resource bundle files. |
| Information: cannot find "rbnflocal.mk". Not building user-additional resource bundle files. |
| Information: cannot find "trnslocal.mk". Not building user-additional transliterator files. |
| Information: cannot find "misclocal.mk". Not building user-additional miscellaenous files. |
| Creating data file for Unicode Character Properties |
| Creating data file for Unicode Case Mapping Properties |
| Creating data file for Unicode BiDi/Shaping Properties |
| Creating data file for Unicode Normalization |
| Unicode .icu files built to "\svn\icuproj\icu\uni51\source\data\out\build\icudt39l" |
| Unicode .c source files built to "\svn\icuproj\icu\uni51\source\data\out\tmp" |
| |
| - copy the .c source files to C:\svn\icuproj\icu\uni51\source\common |
| and rebuild the common library |
| |
| *** Break iterators |
| |
| * Update break iterator rules to new UAX versions and new property values |
| |
| *** UCA |
| |
| * update FractionalUCA.txt and UCARules.txt with new canonical closure |
| |
| *** Test suites |
| - Test that APIs using Unicode property value aliases (like UnicodeSet) |
| support all of the boolean values N/Y, No/Yes, F/T, False/True |
| -> TestBinaryValues() tests in both cintltst and intltest |
| |
| *** LayoutEngine script information |
| * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguages.h, |
| ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates |
| ScriptRunData.cpp, which is no longer needed.) |
| |
| The generated files have a current copyright date and "@draft" statement. |
| |
| * copy the above files into <icu>/source/layout, replacing the old files. |
| |
| Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp |
| and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...) |
| |
| * rebuild the layout and layoutex libraries. |
| |
| *** Documentation |
| - Update User Guide |
| + Jamo_Short_Name, sfc->scf, binary property value aliases |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 5.0 update |
| |
| *** related Jitterbugs |
| |
| 5084 RFE: Update to Unicode 5.0 |
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| - ucdstrip: |
| DerivedCoreProperties.txt |
| DerivedNormalizationProps.txt |
| NormalizationTest.txt |
| PropList.txt |
| Scripts.txt |
| GraphemeBreakProperty.txt |
| SentenceBreakProperty.txt |
| WordBreakProperty.txt |
| - ucdstrip and ucdmerge: |
| EastAsianWidth.txt |
| LineBreak.txt |
| |
| * my ucd2unidata.bat (needs to be updated each time with UCD and file version numbers) |
| copy 5.0.0\ucd\BidiMirroring.txt ..\unidata\ |
| copy 5.0.0\ucd\Blocks.txt ..\unidata\ |
| copy 5.0.0\ucd\CaseFolding.txt ..\unidata\ |
| copy 5.0.0\ucd\DerivedAge.txt ..\unidata\ |
| copy 5.0.0\ucd\extracted\DerivedBidiClass.txt ..\unidata\ |
| copy 5.0.0\ucd\extracted\DerivedJoiningGroup.txt ..\unidata\ |
| copy 5.0.0\ucd\extracted\DerivedJoiningType.txt ..\unidata\ |
| copy 5.0.0\ucd\extracted\DerivedNumericValues.txt ..\unidata\ |
| copy 5.0.0\ucd\NormalizationCorrections.txt ..\unidata\ |
| copy 5.0.0\ucd\PropertyAliases.txt ..\unidata\ |
| copy 5.0.0\ucd\PropertyValueAliases.txt ..\unidata\ |
| copy 5.0.0\ucd\SpecialCasing.txt ..\unidata\ |
| copy 5.0.0\ucd\UnicodeData.txt ..\unidata\ |
| |
| ucdstrip < 5.0.0\ucd\DerivedCoreProperties.txt > ..\unidata\DerivedCoreProperties.txt |
| ucdstrip < 5.0.0\ucd\DerivedNormalizationProps.txt > ..\unidata\DerivedNormalizationProps.txt |
| ucdstrip < 5.0.0\ucd\NormalizationTest.txt > ..\unidata\NormalizationTest.txt |
| ucdstrip < 5.0.0\ucd\PropList.txt > ..\unidata\PropList.txt |
| ucdstrip < 5.0.0\ucd\Scripts.txt > ..\unidata\Scripts.txt |
| ucdstrip < 5.0.0\ucd\auxiliary\GraphemeBreakProperty.txt > ..\unidata\GraphemeBreakProperty.txt |
| ucdstrip < 5.0.0\ucd\auxiliary\SentenceBreakProperty.txt > ..\unidata\SentenceBreakProperty.txt |
| ucdstrip < 5.0.0\ucd\auxiliary\WordBreakProperty.txt > ..\unidata\WordBreakProperty.txt |
| ucdstrip < 5.0.0\ucd\EastAsianWidth.txt | ucdmerge > ..\unidata\EastAsianWidth.txt |
| ucdstrip < 5.0.0\ucd\LineBreak.txt | ucdmerge > ..\unidata\LineBreak.txt |
| |
| * update FractionalUCA.txt and UCARules.txt with new canonical closure |
| |
| * genpname |
| - run preparse.pl |
| + make sure that data.h is writable |
| + perl preparse.pl \cvs\oss\icu > out.txt |
| |
| * uchar.h & uscript.h & uprops.h & uprops.c & genprops |
| - new block & script values |
| + script values already added in ICU 3.6 because all of ISO 15924 is now covered |
| |
| * build Unicode data source code for hardcoding core data |
| C:\cvs\oss\icu\source\data>NMAKE /f makedata.mak ICUMAKE=\cvs\oss\icu\source\data\ CFG=debug uni-core-data |
| |
| ICU data make path is \cvs\oss\icu\source\data\ |
| ICU root path is \cvs\oss\icu |
| Information: cannot find "ucmlocal.mk". Not building user-additional converter files. |
| [etc.] |
| Creating data file for Unicode Character Properties |
| Creating data file for Unicode Case Mapping Properties |
| Creating data file for Unicode BiDi/Shaping Properties |
| Creating data file for Unicode Normalization |
| Unicode .icu files built to "\cvs\oss\icu\source\data\out\build\icudt35l" |
| Unicode .c source files built to "\cvs\oss\icu\source\data\out\tmp" |
| |
| - copy the .c source files to C:\cvs\oss\icu\source\common |
| and rebuild the common library |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - configure.in |
| |
| *** LayoutEngine script information |
| * Run ICU4J com.ibm.icu.dev.tool.layout.ScriptNameBuilder. This generates LEScripts.h, LELanguage.h, |
| ScriptAndLanguageTags.h and ScriptAndLanguageTags.cpp in the working directory. (It also generates |
| ScriptRunData.cpp, which is no longer needed.) |
| |
| The generated files have a current copyright date and "@draft" statement. |
| |
| * copy the above files into <icu>/source/layout, replacing the old files. |
| |
| Add new default entries to the indicClassTables array in <icu>/source/layout/IndicClassTables.cpp |
| and the complexTable array in <icu>/source/layoutex/ParagraphLayout.cpp. (This step should be automated...) |
| |
| * rebuild the layout and layoutex libraries. |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 4.1 update |
| |
| *** related Jitterbugs |
| |
| 4332 RFE: Update to Unicode 4.1 |
| 4157 RBBI, TR29 4.1 updates |
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| - ucdstrip: |
| DerivedCoreProperties.txt |
| DerivedNormalizationProps.txt |
| NormalizationTest.txt |
| GraphemeBreakProperty.txt |
| SentenceBreakProperty.txt |
| WordBreakProperty.txt |
| - ucdstrip and ucdmerge: |
| EastAsianWidth.txt |
| LineBreak.txt |
| |
| * add new files to the repository |
| GraphemeBreakProperty.txt |
| SentenceBreakProperty.txt |
| WordBreakProperty.txt |
| |
| * update FractionalUCA.txt and UCARules.txt with new canonical closure |
| |
| * genpname |
| - handle new enumerated properties in sub read_uchar |
| - run preparse.pl |
| |
| * uchar.h & uscript.h & uprops.h & uprops.c & genprops |
| - new binary properties |
| + Pattern_Syntax |
| + Pattern_White_Space |
| - new enumerated properties |
| + Grapheme_Cluster_Break |
| + Sentence_Break |
| + Word_Break |
| - new block & script & line break values |
| |
| * gencase |
| - case-ignorable changes |
| see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods |
| now: (D47a) Word_Break=MidLetter or Mn, Me, Cf, Lm, Sk |
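The D47a condition above can be written as a small predicate. This is a sketch, not the gencase implementation: Python's unicodedata module does not expose Word_Break, so the MidLetter set is passed in by the caller, and SAMPLE_MID_LETTER is only an illustrative subset that should really be read from WordBreakProperty.txt. The general categories come from whatever Unicode version the Python build ships, not from 4.1.

```python
import unicodedata

# General categories named in the D47a condition.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}

# Illustrative subset only (assumption); load the real set from
# WordBreakProperty.txt for the Unicode version being processed.
SAMPLE_MID_LETTER = {"\u0027", "\u2019"}


def is_case_ignorable(ch, mid_letter=SAMPLE_MID_LETTER):
    """True if ch has Word_Break=MidLetter or gc in Mn, Me, Cf, Lm, Sk."""
    return ch in mid_letter or unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES
```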
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - configure.in |
| |
| *** tests |
| - verify that u_charMirror() round-trips |
| - test all new properties and some new values of old properties |
| |
| *** other code |
| |
| * hardcoded Unihan range end/limit |
| - Unihan range end moves from 9FA5 to 9FBB |
| search for both 9FA5 (end) and 9FA6 (limit) (regex 9FA[56], case-insensitive) |
| + do not modify BOCU/BOCSU code because that would change the encoding |
| and break binary compatibility! |
| + similarly, do not change the GB 18030 range data (ucnvmbcs.c) |
| or NamePrepProfile.txt |
| + ignore trietest.c: test data is arbitrary |
| + ignore tstnorm.cpp: test optimization, not important |
| + ignore collation: 9FA[56] only appears in comments; swapCJK() uses the whole block up to 9FFF |
| + do change line_th.txt and word_th.txt |
| by replacing hardcoded ranges with the new property values |
| + do change gennames.c |
| |
| source\data\brkitr\line_th.txt(229): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6 |
| source\data\brkitr\word_th.txt(23): \u33E0-\u33FE \u3400-\u4DB5 \u4E00-\u9FA5 \uA000-\uA48C \uA490-\uA4C6 |
| source\tools\gennames\gennames.c(971): 0x4e00, 0x9fa5, |
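The search for the hardcoded Unihan end/limit described above can be done with any grep-like tool. As a minimal sketch in Python, using the regex 9FA[56] case-insensitively: the function name and the default list of file suffixes are assumptions for illustration, not part of the ICU tooling.

```python
import re
from pathlib import Path

# Matches both the old range end (9FA5) and the old limit (9FA6).
UNIHAN_END_OR_LIMIT = re.compile(r"9FA[56]", re.IGNORECASE)


def find_hardcoded_ranges(root, suffixes=(".c", ".cpp", ".h", ".txt")):
    """Return (path, line number, line) for every line mentioning 9FA5 or 9FA6."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if UNIHAN_END_OR_LIMIT.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Each hit still needs a manual decision, since several of the matches listed above (BOCU/BOCSU, GB 18030, test data) must be left alone.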
| |
| * case mappings |
| - compare new special casing context conditions with previous ones |
| see http://www.unicode.org/versions/Unicode4.1.0/#CaseMods |
| |
| * genpname |
| - consider storing only the short name if it is the same as the long name |
| |
| *** other reviews |
| - UAX #29 changes (grapheme/word/sentence breaks) |
| - UAX #14 changes (line breaks) |
| - Pattern_Syntax & Pattern_White_Space |
| |
| ---------------------------------------------------------------------------- *** |
| |
| Unicode 4.0.1 update |
| |
| *** related Jitterbugs |
| |
| 3170 RFE: Update to Unicode 4.0.1 |
| 3171 Add new Unicode 4.0.1 properties |
| 3520 use Unicode 4.0.1 updates for break iteration |
| |
| *** data files & enums & parser code |
| |
| * file preparation |
| - ucdstrip: DerivedNormalizationProps.txt, NormalizationTest.txt, DerivedCoreProperties.txt |
| - ucdstrip and ucdmerge: EastAsianWidth.txt, LineBreak.txt |
| |
| * file fixes |
| - fix UnicodeData.txt general categories of Ethiopic digits Nd->No |
| according to PRI #26 |
| http://www.unicode.org/review/resolved-pri.html#pri26 |
| - undone again because no corrigendum in sight; |
| instead modified tests to not check consistency on this for Unicode 4.0.1 |
| |
| * ucdterms.txt |
| - update from http://www.unicode.org/copyright.html |
| formatted for plain text |
| |
| * uchar.h & uprops.h & uprops.c & genprops |
| - add UBLOCK_CYRILLIC_SUPPLEMENT because the block is renamed |
| - add U_LB_INSEPARABLE due to a spelling fix |
| + put short name comment only on line with new constant |
| for genpname perl script parser |
| - new binary properties |
| + STerm |
| + Variation_Selector |
| |
| * genpname |
| - fix genpname perl script so that it doesn't choke on more than 2 names per property value |
| - perl script: correctly calculate the maximum number of fields per row |
| |
| * uscript.h |
| - new script code Hrkt=Katakana_Or_Hiragana |
| |
| * gennorm.c: track changes in DerivedNormalizationProps.txt |
| - "FNC" -> "FC_NFKC" |
| - single field "NFD_NO" -> two fields "NFD_QC; N" etc. |
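To illustrate the field change above, here is a sketch of a parser that accepts both the old single-field form ("NFD_NO") and the new two-field form ("NFD_QC; N"). It is not the gennorm.c code. The function name is hypothetical, and the mapping of an old "_MAYBE" suffix to "M" is an assumption that goes beyond the one example given above.

```python
# Old-style suffix -> new quick check value.
# "NO" -> "N" follows the example above; "MAYBE" -> "M" is an assumption.
OLD_SUFFIXES = {"NO": "N", "MAYBE": "M"}


def parse_quick_check(line):
    """Return (code point range, property, value), or None for other lines."""
    data = line.split("#", 1)[0]
    fields = [f.strip() for f in data.split(";")]
    if len(fields) == 3 and fields[1].endswith("_QC"):
        # New format: "range ; NFD_QC ; N"
        return fields[0], fields[1], fields[2]
    if len(fields) == 2:
        # Old format: "range ; NFD_NO"
        form, _, suffix = fields[1].rpartition("_")
        if suffix in OLD_SUFFIXES:
            return fields[0], form + "_QC", OLD_SUFFIXES[suffix]
    return None
```

Normalizing both forms to the same tuple lets the rest of a tool stay independent of which UCD version produced the file.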
| |
| * genprops/props2.c: track changes in DerivedNumericValues.txt |
| - changed from 3 columns to 2, dropping the numeric type |
| + assume that the type is always numeric for Han characters, |
| and that only those are added in addition to what UnicodeData.txt lists |
| |
| *** Unicode version numbers |
| - makedata.mak |
| - uchar.h |
| - configure.in |
| |
| *** tests |
| - update test of default bidi classes according to PRI #28 |
| /tsutil/cucdtst/TestUnicodeData |
| http://www.unicode.org/review/resolved-pri.html#pri28 |
| - bidi tests: change exemplar character for ES depending on Unicode version |
| - change hardcoded expected property values where they change |
| |
| *** other code |
| |
| * name matching |
| - read UCD.html |
| |
| * scripts |
| - use new Hrkt=Katakana_Or_Hiragana |
| |
| * ZWJ & ZWNJ |
| - are now part of combining character sequences |
| - break iteration used to assume that LB classes did not overlap; now they do for ZWJ & ZWNJ |