Proposed new doc hierarchy for closed-captions / transcripts from conferences

Murray Stokely murray at stokely.org
Mon Jan 18 07:57:18 UTC 2010


As some of you might be aware, I have been working on getting closed
captions for the videos of FreeBSD-related talks at conferences.  In
the last month I've started using YouTube's machine-learning captioning
to produce a first automatic transcript and then paying human editors
through Amazon Mechanical Turk to fix the technical vocabulary and do
general editing of the transcripts.

There are now four videos in the BSD Conferences YouTube channel with
relatively good quality, human-edited English-language transcripts.
(See the pointers at
http://freebsd.stokely.org/2010/01/improved-conference-captions-from.html)

The caption files themselves are simple ASCII text files.  Each record
has one line for the start/end time of the text to be displayed, one
or two lines for the text itself, and a blank line to separate it from
the next record.
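
To make the record layout concrete, here is a minimal Python sketch of
a reader for such a file.  It assumes the timing line has the form
"H:MM:SS.mmm,H:MM:SS.mmm" (start, comma, end), which is my
understanding of YouTube's .sbv layout and is not spelled out above.
The names Caption, parse_captions and SAMPLE are hypothetical, and the
sample text is made up for illustration.

```python
import re
from typing import List, NamedTuple


class Caption(NamedTuple):
    start: str        # start time, kept as the original string
    end: str          # end time, kept as the original string
    text: List[str]   # the one or two lines of caption text


# Assumed timing-line layout: "H:MM:SS.mmm,H:MM:SS.mmm"
TIMING = re.compile(r"^(\d+:\d{2}:\d{2}\.\d{3}),(\d+:\d{2}:\d{2}\.\d{3})$")


def parse_captions(data: str) -> List[Caption]:
    """Split the file into blank-line-separated records and parse each."""
    captions = []
    for block in re.split(r"\n\s*\n", data.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue  # empty input or stray blank lines
        match = TIMING.match(lines[0])
        if match is None:
            raise ValueError(f"bad timing line: {lines[0]!r}")
        captions.append(Caption(match.group(1), match.group(2), lines[1:]))
    return captions


# Made-up sample data, not taken from a real transcript.
SAMPLE = """0:00:01.000,0:00:04.500
Welcome to the talk on
kernel internals.

0:00:04.500,0:00:07.250
Let's get started.
"""

if __name__ == "__main__":
    for caption in parse_captions(SAMPLE):
        print(caption.start, caption.end, " ".join(caption.text))
```

A reader like this would let anyone check a corrected or translated
file for malformed records before it is committed.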

I would like to start checking in these text files under
doc/en_US.ISO8859-1/captions/ for a number of reasons.

1. I want to make it easier for others to correct any mistakes in the captions.
2. I want to make it easier for translators to produce localized
captions for the most popular videos.
3. I want to keep a centralized repository of the captions outside of
YouTube, so that other hosting sites or systems are able to use them.
4. I want to increase the discoverability of the technical content
discussed in the conference talks by making indexable transcripts
available to search engines.

The blog post above has some example text files that I'd like to check
in.  It then becomes a matter of choosing the hierarchy.

I might suggest:

doc/${LANG}/captions/${YEAR}/${CONFERENCE}/${TALK}

e.g.

doc/en_US.ISO8859-1/captions/2009/asiabsdcon/mckusick-kernelinternals.sbv
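
As a rough illustration of how the proposed layout composes, the
following Python sketch builds a path from its parts.  The function
name caption_path and the ext parameter are hypothetical; treating
${TALK} as a file name stem with a separate .sbv extension is an
assumption based only on the example above.

```python
from pathlib import PurePosixPath


def caption_path(lang: str, year: int, conference: str, talk: str,
                 ext: str = "sbv") -> PurePosixPath:
    """Build doc/${LANG}/captions/${YEAR}/${CONFERENCE}/${TALK}.

    The talk argument is the file name without extension; ext is an
    assumed, separate extension (default "sbv").
    """
    return (PurePosixPath("doc") / lang / "captions" / str(year)
            / conference / f"{talk}.{ext}")


if __name__ == "__main__":
    print(caption_path("en_US.ISO8859-1", 2009, "asiabsdcon",
                       "mckusick-kernelinternals"))
```

Under this layout a translation differs only in the ${LANG}
component, so a localized file sits at the same relative location as
the English original.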

Thoughts?

    - Murray



More information about the freebsd-doc mailing list