Annals of Wu

a sinotibetoburman linguistics blog

Update Coming For Shanghainese Phonetic Corpus tools - corpus

In a couple of weeks I'll have a huge update done to the phonetic corpus. Previously I put together a rough tool covering a few thousand characters, based on the Guangyun tables, but obviously that approach takes a big hit in accuracy.

The latest update will cover over 8400 characters, plus a pretty large set of mono- and multi-syllabic words, over 30,000 in all.

In addition to the IPA data, the new set also includes uniform romanisation and tentative definitions pulled from a number of open source dictionaries and open forums covering this sort of thing.

Also as part of this update I'll be updating the version of the data used on Tatoeba and similar sites.

If you'd like to take the romanisation for a test run, you can do so at this page:

Note that it currently only supports traditional characters.

Very busy week coming up, but after June 20 I'll have a lot more free time, and I'll be trying to update here regularly.

    Shanghainese Corpus Is Back Up tools - corpus

    The Shanghainese IPA tool is back up, but with a caveat: The data is not guaranteed to be accurate. The current data set is taken from various resources, and then applied to and extrapolated from the 广韵 rime tables. As such some characters may not return accurate readings. The data will be updated, but likely not until this summer.

      Phonetic Corpora Updates tools - corpus

      As part of my efforts to improve the accuracy and usability of the Shanghainese phonetic data set, I'm going through and running parallel collections for Suzhou and Changzhou. This is all being done by filling out a copy of 方言调查字表 for each, putting that data into an online database and then applying it to a second database, itself based roughly on the 廣韻, but with revisions. If you look for more than 2 seconds on Google you can find a PDF of 方言调查字表; however, the retail price for a nice clean published version is 16RMB, or anywhere from NT$40 to NT$70 in Taiwan. That store to which I linked just now is a great place to buy Mainland-published books in Taiwan, by the way.

      This approach has yielded some interesting discoveries. For starters, there's a lot of evidence for plain old borrowing from Mandarin, tone class and all, for a number of words that otherwise should be entering tones, about which I wrote fairly recently. Beyond that, it's also good to see, side by side, pronunciations for common words in Suzhou, Changzhou and Shanghai. That brings me to another excellent book for fangyan research/comparison, 汉语方言字汇, edited by 王福堂. It's basically 方言调查字表, filled out for twenty different dialects including Suzhou and Wenzhou. While I'm relying more on 汪平's Suzhou dialect dictionary, it'll surely be useful to fill in some inevitable holes. I'm hesitant to include Wenzhou too much, as in general it's far removed from Taihu dialects like Changzhou, Suzhou and Shanghai, and at this stage it'd be too likely to influence my judgements of what's been borrowed versus what's a natural phonetic divergence. Still, a great book for a general overview.

      Beware shopping on Kongfz.com. While I've had great luck with them in the past for buying antique books, shipped from Jiangsu to Jiangsu, I'm a little wary about whether or not my latest purchase will actually make it to Ilha Formosa as it should.

        Phonetic Corpus Re-write tools - corpus

        Just a quick update to say that the phonetic corpus, as is, has a number of errors that need to be corrected, for reasons mentioned in the previous post. More than that, I'm trying to get a much more stable, accurate and comprehensive version put together so that it can be made available for public consumption.

        I'm also in the process of building a parallel Suzhou phonetic corpus. This is all quite time-consuming, and it's being done in addition to my other grad work. I'll have it up as soon as I can, with progress reports in the meantime.

          HUGE Phonetic Corpus Update tools - corpus

          It took a few hours of solid work, but I've just upgraded the Shanghainese phonetic corpus. It previously supported about 5,300 characters. It now supports over 25,000. It's not complete, but it's much much better than it was. There have also been some corrections made, and more to come.

          Feel free to try it out. Keep in mind this is being actively worked on. Consider it a beta version that will likely need some more corrections and bug fixes.

            Big Changes To The Phonetic Corpus tools - corpus

            Aaaand we're back. I admit, I lost my password. Got it back now. Also some images are still gone from the move to the new server. I'll get those fixed up as soon as I can.

            The big reason for posting tonight: Huge changes are coming to the Shanghainese phonetic corpus. What kind of changes? Programming friendly changes for one. Tonal encoding changes for another. Oh and a larger data set. Now instead of a plain text file with nothing more than hanzi and poorly tone-marked IPA, I'm migrating the site to a proper SQL file format with much more data organised for easier sorting by web-based programs. But that's just the beginning.
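For anyone wondering what that sort of move looks like in practice, here's a rough sketch using Python's built-in sqlite3 module. To be clear, the table name, the column names and the rows are all made up for illustration. They're placeholders, not the actual schema or data, and storing the tone in its own column is just one guess at a more sortable layout.

```python
import sqlite3

# Placeholder rows, NOT real corpus entries:
# (character, IPA reading without tone, tone contour as text).
ROWS = [("甲", "IPA1", "53"), ("乙", "IPA2", "12"), ("丙", "IPA3", "53")]


def build_db(rows):
    """Create an in-memory database with a hypothetical 'readings' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings ("
        "hanzi TEXT NOT NULL, ipa TEXT NOT NULL, tone TEXT)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def by_tone(conn, tone):
    """Return every character stored with the given tone contour."""
    cur = conn.execute(
        "SELECT hanzi FROM readings WHERE tone = ? ORDER BY hanzi", (tone,)
    )
    return [row[0] for row in cur]


conn = build_db(ROWS)
print(by_tone(conn, "53"))
```

With the tone split out like this, a web program can filter or sort by tone with a single query instead of picking superscripts out of an IPA string.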

            What's prompted the return to the blog? Mostly it's that I'm studying in a proper linguistics department so I can justify all the time spent on language related projects. I was in a philosophy program when the blog fell apart and it was hard to find the time to work on this. Now it's sort of a requirement to do it.

            Also, be on the lookout for some small changes at Phonemica related to this blog being reactivated. I'm quietly plotting to introduce a Shanghainese localisation of the site without Syz catching on.

            Are you a native speaker? Care to help translate the Mandarin pages to Shanghainese? Or maybe you're a native Cantonese speaker and would like to help create Cantonese translations. Both would be appreciated. Shoot me an email or leave a comment if you'd like to help.

              Corpora!    tools - corpus

              It's fun to say. Go ahead. Give it a shot.

              I've spent the last week, probably averaging about 6+ hours a day in between school and my (uncoincidentally) limited social life crunching text, getting paper cuts and carpal tunnel syndrome and ending up with blackened fingertips and blacker keys. And now, thousands of lines of text later, I'd like to officially announce what may be the first Shanghainese phonetic data set of this size fully in IPA.

              It's a collection of widely used characters (7000ish) with their pronunciation as would be heard in the Shanghai dialect of Wu, all done up in the International Phonetic Alphabet, complete with tones.

              The reason behind it was primarily that a number of Mandarin dictionaries offer Cantonese pronunciation as an option. I have yet to see one that really covers Wu in any systematic way. The best thing I've seen that does handle Wu isn't a dictionary. Now, because of this data, some are starting to and others will hopefully follow suit.

              Before I repeat much more of what can be found on the project page, why not head over and take a look. Further developments will be reflected there.

              Thanks to Allan Simon and Christoph Burgmer for their contributions and help.

                Wu Phonetics Corpus tools - corpus

                One of the hurdles in learning Shanghainese, or for that matter any dialect of Wu, is the lack of easily accessible data. There are services that provide phonetic transcription based on character input, but often the results given are some proprietary form of pinyin. For more phonetically accurate results, i.e. IPA, there are books that provide that information, though often for only a limited number of characters.
                In an effort to fix that, I’ve compiled a list of characters with their corresponding IPA pronunciation. It’s a tabbed text file, UTF-8 encoding. The original file is loosely based on a similar list of about 450 entries provided by Tatoeba.org.
                The data set covers the most commonly used characters for writing Wu, as well as a number of other characters to cover things like family names and Wu-specific 语气词. It started as a list of just over 450, quickly expanded to 1400 entries, then to just over 5300, and now stands at over 7520. More entries are continually being added.
                Who uses this? For starters, this data set has been integrated into Tatoeba.org, both in its Shanghainese-to-IPA tool and in its general Shanghainese sentences. Sentences entered on the site using characters will be converted as below.
                ɦi⁵³ ɦɑ̃⁵³ ʦɤ lɛ⁵³ gəˀ¹²
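If you want a feel for how a character-by-character conversion like this can work off a tabbed UTF-8 file, here's a minimal sketch in Python. It assumes a two-column layout (character, tab, IPA reading), which is my simplification for the example. The sample entries are placeholders rather than real readings, and the function names are made up. It also isn't how Tatoeba does it, just the simplest lookup that fits the description.

```python
import csv
import io

# Hypothetical layout: character <TAB> IPA reading, one entry per line.
# These readings are placeholders, not entries from the actual data set.
SAMPLE = "甲\tIPA1\n乙\tIPA2\n丙\tIPA3\n"


def load_readings(handle):
    """Read tab-separated character/IPA pairs into a dict."""
    readings = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        char, ipa = row[0], row[1]
        readings.setdefault(char, ipa)  # keep the first reading seen
    return readings


def to_ipa(text, readings, unknown="?"):
    """Convert text one character at a time, marking unknown characters."""
    return " ".join(readings.get(ch, unknown) for ch in text)


readings = load_readings(io.StringIO(SAMPLE))
print(to_ipa("甲丙丁", readings))  # prints: IPA1 IPA3 ?
```

For a real file you'd pass `open(path, encoding="utf-8", newline="")` in place of the `io.StringIO` object. A plain lookup like this keeps only one reading per character, so characters with more than one pronunciation would need something smarter.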
                It will also be included as part of the upcoming release of the Eclectus dictionary created by Christoph Burgmer and the related cjklib project.
                Expect to see the data appear elsewhere in the near future.
                If you’re interested in using the data for your project, send me an email at kellenparker在sinoglot.com explaining what the project is and how you plan to use the data.
                The only thing I ask is that you credit me in some way for the many many hours I’ve put into collecting the data. I’m releasing this under the Creative Commons CC-BY license.
                I’m looking in a few different directions as to what else to do to improve the data. I don’t want to get too much into it just yet, but keep an eye on this space for updates.
                Thanks to Allan Simon of Tatoeba.org for providing me with an initial 450+ word set and for allowing me to contribute to Tatoeba’s data set. Also thanks to Christoph Burgmer for helping work out some kinks and for being willing to include the data into Eclectus.


                  A semi-academic linguistics blog about Sinotibetan, previously focused primarily on Wú, a Sinitic language spoken in the Yangtze Delta region. Topics now include historical linguistics, documentation, language rights, sociolinguistics and learning materials, as well as acting as the dev blog for Phonemica from time to time.

                  I'm a linguist based in Asia, working on documentation and historical development of Sinotibetan. In addition to academic research, I'm heavily involved in Phonemica, an organisation that promotes crowd-sourced preservation of local languages.

                  I'm currently in the field, so getting in touch isn't easy. However you can try to email me at the following address and I'll respond as soon as I'm able:

                  © 2009-2017