吳實錄

Annals of Wu

漢藏緬語々言研究ㄟ博客
a sinotibetoburman linguistics blog
2014-09-01

Improved Segmenting On Phonemica

phonemica - development

Segmenting is a pain. It's the most time-consuming part of processing recordings for the site.

In general, each submitted file takes about 30 minutes to prepare for transcription: about 10 minutes on file preparation, 5 on correcting the database information, and another 15 on cutting the sound up into the segments you see in the transcript.

It's just not sustainable. I'm the only one doing it, and it's incredibly time-consuming when the submissions start building up. To cut down the processing time before segmenting, I've written a script that runs on the main Phonemica computer and automates all the file conversion work that needs to be done. There's a lot of it.

The other major improvement is with the segmenter. The site administrators have access to a tool that allows them to segment recordings. This involves working with a waveform that goes along with the audio, and manually marking the start and stop of each segment. It works, and it's what we've used for the past year, but it's pretty awful given how much time the work still takes.

Starting this coming weekend, that iteration of the tool will be deleted forever. In its place, we now have something much more useful. Below is a shot of the new system.



This is the new version. Note that this isn't just a replacement for the segmenter tool that you've probably never seen or heard of before now; it's also a replacement for the old transcript system on the pages of individual entries.

For users, this will be something immediately available for transcription and cleanup. For me, the bigger part is replacing the segmenter, since that was the major time suck on our end. Previously I had to go through and segment each recording, and only other admins could edit the result. Now anyone who's signed in can not only edit the transcription as before, but also fine-tune the timing of each phrase. If you don't like how an entry is segmented, and there are plenty that need cleaning up, you now have the power to do something about it. Want to add another segment? Go for it. Are there too many segments in a small space? Now you can delete them. Don't delete all of them though, because that would suck.
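For the technically curious, here is a rough sketch of the kind of operations described above: adjusting the timing of a segment, adding one, and deleting one. This is not Phonemica's actual code. The `Segment` and `Transcript` names, the use of seconds as the time unit, and the rule that the last remaining segment can't be deleted are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One time-aligned phrase (hypothetical structure, times in seconds)."""
    start: float
    end: float
    text: str = ""


class Transcript:
    """Hypothetical container for the segments of a single recording."""

    def __init__(self, duration: float, segments: List[Segment]):
        self.duration = duration
        self.segments = sorted(segments, key=lambda s: s.start)

    def _check_bounds(self, start: float, end: float) -> None:
        # A segment must have positive length and fit inside the recording.
        if not (0 <= start < end <= self.duration):
            raise ValueError("segment must satisfy 0 <= start < end <= duration")

    def retime(self, index: int, start: float, end: float) -> None:
        """Fine-tune the start and stop of an existing segment."""
        self._check_bounds(start, end)
        seg = self.segments[index]
        seg.start, seg.end = start, end
        self.segments.sort(key=lambda s: s.start)

    def add(self, start: float, end: float, text: str = "") -> None:
        """Add a new segment and keep the list ordered by start time."""
        self._check_bounds(start, end)
        self.segments.append(Segment(start, end, text))
        self.segments.sort(key=lambda s: s.start)

    def delete(self, index: int) -> None:
        """Delete a segment, but never the last one left (assumed rule)."""
        if len(self.segments) <= 1:
            raise ValueError("refusing to delete the last remaining segment")
        del self.segments[index]
```

A signed-in user's edits would then boil down to calls like `transcript.retime(3, 12.4, 15.1)` or `transcript.delete(7)`, with the transcript text itself left untouched.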

We really think this is going to be a huge improvement in usability, and we hope you'll agree. This is part of a larger push to give people more control over their stories, as well as getting the data in better shape for future uses by researchers or language learners.

If you have any questions, feel free to shoot me an email. Otherwise, stay tuned: the update should be in place by the weekend.





    About

    A semi-academic linguistics blog about Sinotibetan, previously focused primarily on Wú, a Sinitic language spoken in the Yangtze Delta region. Topics now include historical linguistics, documentation, language rights, sociolinguistics and learning materials, as well as acting as the dev blog for Phonemica from time to time.

    I'm a linguist based in Asia, working on documentation and historical development of Sinotibetan. In addition to academic research, I'm heavily involved in Phonemica, an organisation that promotes crowd-sourced preservation of local languages.

    I'm currently in the field, so getting in touch isn't easy. However, you can try to email me at the following address and I'll respond as soon as I'm able:

    yhilan.ko@gmail.com
    © 2009-2017