Annals of Wu

a sinotibetoburman linguistics blog

Cantonese Coming To Google Translate tools

Google has been sending out emails about its push to add Cantonese to Google Translate. The email reads as follows:
Help improve Google Translate
We've heard your feedback about adding Cantonese to Google Translate. You can now improve Cantonese translations and help us get closer to our goal of adding it.
To get started, simply visit Google Translate Community and invite others to participate.
Thank you in advance,
Google Translate Community team
Internationally, it makes sense to do Cantonese before Shanghainese. It makes sense logistically as well, since there's a much greater corpus of standardised Cantonese data. It still makes me sad not to see Hakka or Wu anywhere on their list.

Still, it's progress.

To contribute, go to this link and start translating.

    Developing A Cooperative Dictionary Creation Tool tools - dictionary

    I was frustrated that TLex, despite being a great piece of software, is so darn expensive. I don't fault the developers for that. They have every right to try to make back some of the money spent in creating it. Still, I can't afford it. So instead, I'm working on my own such software, with the added intent of a more cooperative focus that supports crowd-sourced dictionaries, much the way we've been running Phonemica from the start. The idea is to have the data exist in the cloud as much as possible, with locally stored revisions and logs for each contributor, since you can't always be online. That way multiple people can work on a single language's dictionary as they go, with edits and revisions checked and confirmed against the time they were made rather than the time they were uploaded. Here's a preview of one of the screens, still very much in progress:

    An additional hope for the future is that much of what's being done on the back end of this will have later implications for Phonemica and how data is handled there, especially in terms of being able to work on entries offline and then uploading the data later without fears of someone immediately wiping out another person's hard work without even realising it. Or as I call it, the "n00b effect" as is often seen on OpenStreetMap.

    I'm working on a few things still. The screenshot is just the minimum functionality to get it up to where the older PHP based version is at.

    The most recent addition, not pictured, is offline revision logging. You and a colleague/friend/neighbour/classmate are both working on a dictionary for Waxiang. You're up in the mountains with spotty internet access and knockoff Erguotou while they're chilling in their flat in Lujiazui streaming Netflix and drinking single malt. Not to worry, because your edits are still being saved, even if they're not immediately applied to the database. Then when you have internet access again, your logged edits are checked against theirs to see if you both edited a single entry. If so, you have the option to choose one of the two most recent edits to be applied to the database. This way you can both be working even if you're not both online.
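To make the idea concrete, here is a minimal sketch of how that kind of check could work. It is not the tool's actual code: the `Edit` fields, the function names and the sample entries are all made up for illustration, and it assumes the other person's edits are already in the database. Each log is reduced to the most recent edit per entry, judged by when the edit was made rather than when it was uploaded. Entries that only one person touched can be applied directly, and entries both people touched are set aside for a manual choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    entry_id: str     # dictionary entry being changed (hypothetical field)
    author: str       # contributor who made the edit
    edited_at: float  # when the edit was made, not when it was uploaded
    text: str         # new content of the entry


def latest_per_entry(log):
    """Keep only each entry's most recent edit, judged by edit time."""
    latest = {}
    for edit in log:
        current = latest.get(edit.entry_id)
        if current is None or edit.edited_at > current.edited_at:
            latest[edit.entry_id] = edit
    return latest


def reconcile(local_log, remote_log):
    """Split the local log into edits safe to apply and conflicts to resolve."""
    local = latest_per_entry(local_log)
    remote = latest_per_entry(remote_log)
    to_apply, conflicts = [], []
    for entry_id, edit in local.items():
        other = remote.get(entry_id)
        if other is None:
            to_apply.append(edit)            # only one person touched this entry
        else:
            conflicts.append((edit, other))  # both touched it: the user chooses
    return to_apply, conflicts


# Made-up logs: edits saved offline, and edits a colleague made while online.
offline = [
    Edit("entry-1", "you", 100.0, "first draft"),
    Edit("entry-1", "you", 250.0, "second draft"),
    Edit("entry-2", "you", 300.0, "new entry"),
]
online = [Edit("entry-1", "colleague", 200.0, "their version")]

to_apply, conflicts = reconcile(offline, online)
print([e.entry_id for e in to_apply])
print([(mine.text, theirs.text) for mine, theirs in conflicts])
```

With these sample logs, `entry-2` goes straight through, while `entry-1` comes back as a pair of candidate edits for someone to pick between.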

      Update Coming For Shanghainese Phonetic Corpus tools - corpus

      In a couple of weeks I'll have a huge update done to the phonetic corpus. Previously I put together a rough tool covering a few thousand characters, based on the Guangyun tables, but obviously this takes a big hit in accuracy.

      The latest update will cover over 8400 characters, plus a pretty large set of mono- and multi-syllabic words, over 30,000 in all.

      In addition to the IPA data, the new set also includes uniform romanisation and tentative definitions pulled from a number of open source dictionaries and open forums covering this sort of thing.

      Also as part of this update I'll be updating the version of the data used on Tatoeba and similar sites.

      If you'd like to take the romanisation for a test run, you can do so at this page:

      Note that it currently only supports traditional characters.

      Very busy week coming up, but after June 20 I'll have a lot more free time, and I'll be trying to update here regularly.

        Shanghainese Corpus Is Back Up tools - corpus

        The Shanghainese IPA tool is back up, but with a caveat: The data is not guaranteed to be accurate. The current data set is taken from various resources, and then applied to and extrapolated from the 广韵 rime tables. As such some characters may not return accurate readings. The data will be updated, but likely not until this summer.

          Phonetic Corpora Updates tools - corpus

          As part of my efforts to improve the accuracy and usability of the Shanghainese phonetic data set, I'm going through and running parallel collections for Suzhou and Changzhou. This is all being done by filling out a copy of 方言调查字表 for each, putting that data into an online database and then applying it to a second database, itself based roughly on the 廣韻, but with revisions. If you look for more than 2 seconds on Google you can find a pdf of 方言调查字表. However, the retail price for a nice clean published version is 16 RMB, or anywhere from NT$40 to NT$70 in Taiwan. That store to which I linked just now is a great place to buy Mainland-published books in Taiwan, by the way.

          This approach has yielded some interesting discoveries. For starters, there's a lot of evidence for plain old borrowing from Mandarin, tone class and all, for a number of words that otherwise should be entering tones, about which I wrote fairly recently. Beyond that, it's also good to see, side by side, pronunciations for common words in Suzhou, Changzhou and Shanghai. That brings me to another excellent book for fangyan research/comparison, 汉语方言字汇, edited by 王福堂. It's basically 方言调查字表, filled out for twenty different dialects including Suzhou and Wenzhou. While I'm relying more on 汪平's Suzhou dialect dictionary, it'll surely be useful to fill in some inevitable holes. I'm hesitant to include Wenzhou too much, as in general it's far removed from Taihu dialects like Changzhou, Suzhou and Shanghai, and at this stage it'll be too likely to influence my judgements of what's been borrowed versus what's a natural phonetic divergence. Still, a great book for a general overview.

          Beware shopping on Kongfz.com. While I've had great luck with them in the past for buying antique books, shipped from Jiangsu to Jiangsu, I'm a little wary about whether or not my latest purchase will actually make it to Ilha Formosa as it should.

            Phonetic Corpus Re-write tools - corpus

            Just a quick update to say that the phonetic corpus, as is, has a number of errors that need to be corrected, for reasons mentioned in the previous post. More than that, I'm trying to get a much more stable, accurate and comprehensive version put together so that it can be made available for public consumption.

            I'm also in the process of building a parallel Suzhou phonetic corpus. This is all quite time-consuming, and it's being done in addition to my other grad work. I'll have it up as soon as I can, with progress reports in the meantime.

              HUGE Phonetic Corpus Update tools - corpus

              It took a few hours of solid work, but I've just upgraded the Shanghainese phonetic corpus. It previously supported about 5,300 characters. It now supports over 25,000. It's not complete, but it's much much better than it was. There have also been some corrections made, and more to come.

              Feel free to try it out. Keep in mind this is being actively worked on. Consider it a beta version that will likely need some more corrections and bug fixes.

                Big Changes To The Phonetic Corpus tools - corpus

                Aaaand we're back. I admit, I lost my password. Got it back now. Also some images are still gone from the move to the new server. I'll get those fixed up as soon as I can.

                The big reason for posting tonight: Huge changes are coming to the Shanghainese phonetic corpus. What kind of changes? Programming friendly changes for one. Tonal encoding changes for another. Oh and a larger data set. Now instead of a plain text file with nothing more than hanzi and poorly tone-marked IPA, I'm migrating the site to a proper SQL file format with much more data organised for easier sorting by web-based programs. But that's just the beginning.
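For anyone curious what moving from a flat text file to SQL can look like, here is a rough sketch using Python's built-in `sqlite3` module. This is not the site's actual schema: the table name, column names and index are assumptions, the input is assumed to be one tab-separated hanzi/IPA pair per line, and the `placeholder-N` strings stand in for real readings.

```python
import sqlite3

# Placeholder rows: "placeholder-N" stands in for real IPA readings.
SAMPLE = "一\tplaceholder-1\n二\tplaceholder-2\n"


def load_rows(text):
    """Yield (hanzi, ipa) tuples from tab-separated lines, skipping blanks."""
    for line in text.splitlines():
        if not line.strip():
            continue
        hanzi, ipa = line.split("\t")
        yield hanzi, ipa


# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (hanzi TEXT NOT NULL, ipa TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_readings_hanzi ON readings (hanzi)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", load_rows(SAMPLE))
conn.commit()

# Look up every reading stored for one character.
rows = conn.execute(
    "SELECT ipa FROM readings WHERE hanzi = ?", ("二",)
).fetchall()
print(rows)
```

Once the readings are in a table, sorting and filtering become ordinary queries rather than text processing, and a character with several readings can simply have several rows.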

                What's prompted the return to the blog? Mostly it's that I'm studying in a proper linguistics department so I can justify all the time spent on language related projects. I was in a philosophy program when the blog fell apart and it was hard to find the time to work on this. Now it's sort of a requirement to do it.

                Also, be on the lookout for some small changes at Phonemica related to this blog being reactivated. I'm quietly plotting to introduce a Shanghainese localisation of the site without Syz catching on.

                Are you a native speaker? Care to help translate the Mandarin pages to Shanghainese? Or maybe you're a native Cantonese speaker and would like to help create Cantonese translations. Both would be appreciated. Shoot me an email or leave a comment if you'd like to help.

                  Wu IPA Keyboard Layout Update tools - IME

                  Despite the name, this layout lets you quickly type the IPA glyphs most commonly used not just in Wu but in all major Sinitic languages and dialects. At the moment this is only available for computers running the OS X operating system from Apple.

                  You can read the original post from October 2009, which includes images of the layout.

                  Recent changes
                  30 April 2010

                  swapped ŋ and ɲ between shift and option keys. This was after months of constantly hitting the wrong one

                  moved ɱ from option to shift key to match ŋ

                  added ᴴ ᴹ and ᴸ for marking Shanghainese tone sandhi

                  added superscript glottal stop ˀ on shift+?, standard version ʔ moved to option+?

                  Installation (Mac OS X)

                  1. Extract the .zip file’s contents (Wu.icns and Wu.keylayout) to ~/Library/Keyboard Layouts.
                  2. Under the International (Leopard & earlier) or Language & Text (Snow Leopard) preference pane in System Preferences, go to Input Sources.
                  3. Scroll down in the list all the way to the bottom. Check “Wu- IPA”.
                  4. Log out of OS X and then log back in.

                  The keyboard layout should then be available in the input method menu in the menu bar.

                    Corpora! tools - corpus

                    It's fun to say. Go ahead. Give it a shot.

                    I've spent the last week, probably averaging about 6+ hours a day in between school and my (uncoincidentally) limited social life crunching text, getting paper cuts and carpal tunnel syndrome and ending up with blackened fingertips and blacker keys. And now, thousands of lines of text later, I'd like to officially announce what may be the first Shanghainese phonetic data set of this size fully in IPA.

                    It's a collection of widely used characters (7000ish) with their pronunciation as would be heard in the Shanghai dialect of Wu, all done up in the International Phonetic Alphabet, complete with tones.

                    The reason behind it was primarily that a number of Mandarin dictionaries offer Cantonese pronunciation as an option. I have yet to see one that really covers Wu in any systematic way. The best thing I've seen that does handle Wu isn't a dictionary. Now, because of this data, some are starting to and others will hopefully follow suit.

                    Before I repeat much more of what can be found on the project page, why not head over and take a look. Further developments will be reflected there.

                    Thanks to Allan Simon and Christoph Burgmer for their contributions and help.

                      Wu Phonetics Corpus tools - corpus

                      One of the hurdles in learning Shanghainese, or for that matter any dialect of Wu, is the lack of easily accessible data. There are services that provide phonetic transcription based on character input, but often the results given are some proprietary form of pinyin. For more phonetically accurate results, i.e. IPA, there are books that provide that information, though often for only a limited number of characters.
                      In an effort to fix that, I’ve compiled a list of characters with their corresponding IPA pronunciation. It’s a tabbed text file, UTF-8 encoding. The original file is loosely based on a similar list of about 450 entries provided by Tatoeba.org.
                      The data set covers the most commonly used characters for writing Wu, as well as a number of other characters to cover things like family names and Wu-specific 语气词. It started as a list of just over 450, quickly expanding to 1400 entries, then to just over 5300, and now to over 7520. More entries are continually being added.
                      Who uses this? For starters, this data set has been integrated into Tatoeba.org, both in its Shanghainese-to-IPA tool and in its general Shanghainese sentences. Sentences entered on the site using characters will be converted as below.
                      ɦi⁵³ ɦɑ̃⁵³ ʦɤ lɛ⁵³ gəˀ¹²
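If you want to do the same kind of conversion with a tab-separated hanzi/IPA file, here is a minimal sketch. It is not Tatoeba's code: the function names are made up, it assumes one hanzi and one reading per line, and the `<ipa-N>` strings are placeholders rather than real Shanghainese readings.

```python
import io

# Placeholder data: "<ipa-N>" stands in for real readings from the data set.
SAMPLE_TSV = "一\t<ipa-1>\n二\t<ipa-2>\n"


def load_readings(handle):
    """Read tab-separated hanzi/IPA lines into a dict."""
    readings = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            # If a character is listed more than once, keep its first reading.
            readings.setdefault(fields[0], fields[1])
    return readings


def to_ipa(sentence, readings):
    """Replace each character with its reading; unknown characters pass through."""
    return " ".join(readings.get(char, char) for char in sentence)


readings = load_readings(io.StringIO(SAMPLE_TSV))
converted = to_ipa("一二三", readings)
print(converted)
```

For a real file, swap the `io.StringIO` object for `open(path, encoding="utf-8")`. A per-character lookup like this ignores tone sandhi and characters with more than one reading, so treat the output as a starting point.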
                      It will also be included as part of the upcoming release of the Eclectus dictionary created by Christoph Burgmer and the related cjklib project.
                      Expect to see the data appear elsewhere in the near future.
                      If you’re interested in using the data for your project, send me an email at kellenparker在sinoglot.com explaining what the project is and how you plan to use the data.
                      The only thing I ask is that you credit me in some way for the many many hours I’ve put into collecting the data. I’m releasing this under the Creative Commons CC-BY license.
                      I’m looking in a few different directions as to what else to do to improve the data. I don’t want to get too much into it just yet, but keep an eye on this space for updates.
                      Thanks to Allan Simon of Tatoeba.org for providing me with an initial 450+ word set and for allowing me to contribute to Tatoeba’s data set. Also thanks to Christoph Burgmer for helping work out some kinks and for being willing to include the data in Eclectus.

                        Tools / Resources tools

                        Welcome to the Annals of Wu tools section. A number of tools are in the works. The list is small

                          Wu IPA Keyboard Layout tools - IME

                          I've been messing around with the built-in OS X character palette whenever I've needed IPA for the transcription here, which ends up being pretty freaking often. It was getting somewhat cumbersome to drag and drop every single character, tone number, diacritic etc. So much so that I put together a text file full of most of the characters plus a few that are used frequently in transcribing Chinese languages.

                          Then when that became cumbersome, I looked for other options. I found an input palette but that required me to jump from keyboard to trackpad and back, which is also a pain, so that was a no-go. In the end I gave up and wrote my own keyboard layout.

                          The image here should be explanation enough. The first set is without modifiers. Second is under the option key, third the shift key. The fourth one is both shift and option together. So shift+option+i will print ɿ.

                          It may be worth noting that this keyboard layout cannot be used to type normally. It's meant to be toggled along with other input methods for quickly typing IPA.

                          I'm working out a few other keys but am mostly finished with this. If you're interested in trying it out, let me know and I can send you a link or an email with the files.

                          Currently this only works on Macs running OS X. I'm working on a .kmap version as well, compatible with BeOS/Haiku and OS 9, and maybe a Windows version as well, though all more for the experience than any perceived need.


                            A semi-academic linguistics blog about Sinotibetan, previously focused primarily on Wú, a Sinitic language spoken in the Yangtze Delta region. Topics now include historical linguistics, documentation, language rights, sociolinguistics and learning materials, as well as acting as the dev blog for Phonemica from time to time.

                            I'm a linguist based in Asia, working on documentation and historical development of Sinotibetan. In addition to academic research, I'm heavily involved in Phonemica, an organisation that promotes crowd-sourced preservation of local languages.

                            I'm currently in the field, so getting in touch isn't easy. However you can try to email me at the following address and I'll respond as soon as I'm able:

                            © 2009-2017