Sunday, December 19, 2004
Self-Documenting Life (Transcription)
I have an iPod
Last month I bought an iPod, but I didn't buy it for the usual reasons.
Yes, I am an avid music listener, and the ability to carry around my favorite tunes was definitely a plus. As was the ability to use the iPod as a hard drive so that I could tote around files that I need both at the office and at home.
But I didn't buy it for these reasons. I bought it to record what I say. All the time. And have it transcribed into text.
I've found that in describing the purpose of this project, people are either intuitively in favor of it, or don't understand it at all.
For those immediately interested, we talk animatedly about how interesting it is to do this kind of thing, and when I explain the things I think would be interesting outcomes of such a project, they are often completing my sentences for me.
For those for whom the project holds but perplexity, no amount of explanation convinces them otherwise, and, indeed, I'm often at a loss to explain why it seems so interesting.
If you fall into the former category of people, below is a bit more detail on what I'm actually doing, and then some idea of where it could go from here. For those in the latter, thanks for dropping by, but I'm not sure it'll get any more interesting from here.
The basic idea of this project is to record only my part of any conversation on something with enough storage to contain a full day's worth of conversation or more, so that I wouldn't have to juggle media just when things were getting interesting.
After some digging, I found that the third and fourth generation iPods have the ability to record audio suitable for voice (and not much more, no doubt due to piracy concerns, but perhaps owing to the processing power internal to the iPod as well).
To enable this, you need to buy a third party product that allows you to plug a microphone into it. There are two products, that I'm aware of, that support this: Griffin's iTalk, and Belkin's Universal Microphone Adapter.
I was initially intrigued by Griffin's product, as it has a built-in microphone/speaker that lets you record ambient audio without an external microphone and play back through the speaker so that others could hear without needing headphones (Griffin has a similar product, the Voice Recorder, but it lacks the ability to plug in an external microphone).
As I don't have a large gadget budget, I decided that it would be best to try to borrow an iPod rather than buy one as I wasn't really interested in having a portable music device. It turns out this was harder than I had thought given the amount of press the iPod has gotten. I found a few people with iPods, but mostly older ones that don't support the recording of audio. What compatible ones I did find were formatted for Macs, and apparently you have to reformat them (thus wiping them clean of whatever was on them) to use them on a PC (seems darned inconvenient). I finally found someone who was willing to part with their Mac iPod for a week and allow me to reformat the drive, but at the last minute his broke.
Since, after making the arrangements to borrow an iPod, I had already ordered Griffin's iTalk, and since no other iPod appeared to be forthcoming, I sucked it up and purchased one new (the 20G version) from Best Buy.
I had some trouble buying it, as Best Buy keeps them behind the counter, and I couldn't get a salesperson to help me get at them. I also ended up in an argument with a gal at the register about why I didn't want the extended warranty (I had just said that I wasn't interested, and she pushed to ask why, and then tried to counter everything I said). This seemed like a bad idea as: 1) I don't like to be browbeaten (and I'm guessing most other customers don't either); 2) her arguments were based on anecdotal evidence that I was supposed to trust despite a large body of research on the subject to the contrary; and 3) part of her argument was that she saw a lot of iPods come back for repair, which doesn't exactly make Apple look good and, if I were a less technically inclined customer, would have made me think twice about spending my hard-earned dollars on something that seems to break a lot.
Anyway, the iTalk arrived in short order and I plugged it in immediately. The audio it took from its built-in microphone was fine, but in uncontrolled situations I didn't expect it would work well enough to be transcribed. The speaker was also fine (despite much I had read about it being underpowered, but then I had low expectations and no real need of it for this project).
The iTalk has a single jack on it that can be used for either a microphone, or for headphones, but not both simultaneously (again, this didn't matter much for me for this particular project; though for future projects of this type it would have).
I plugged in one of the several PC mics I have lying around and started talking. The result? Nothing. It would say it was recording, but on playback, nothing but silence. I tried a couple of other mics with the same result. Strangely, if I plugged in an earphone and recorded, it would pick up my speech (though very poorly), but no microphone would work.
I went around and around with Griffin's technical support via email (their live support hours being inconveniently short each day, and the fact that Thanksgiving occurred in the middle of this not helping either) where they claimed I was using the wrong kind of mic (which sent me on a two day wild goose chase) before I finally sent it back and bought Belkin's product at a nearby Mac store (the only place in town that I could find that carried it).
The Universal Microphone Adapter worked immediately and well, and I don't have any complaints about it. It was nice to be able to plug in both the microphone and earphone portion of one of the dictation headsets I have around, at the very least because that means I don't have to have a wire dangling around, and also because my plan was to get a stereo dictation headset (like this one from Koss) which would allow me to go back and forth between listening to music and recording conversations without having to plug things in and unplug others.
I have owned both Dragon's Naturally Speaking and IBM's ViaVoice (both now owned or licensed by ScanSoft), but I couldn't find the install CD for one, and the version of the other didn't support file transcription, so I picked up IBM's ViaVoice 10 Advanced Edition, predominantly on the merits of its being about $100 cheaper than the equivalent Dragon product.
I recorded one of the training readings you have to do so that speech-to-text software gets used to how you talk. I tried to read it as much as I could in the way I might talk to someone else, rather than the way I might read something aloud, as that was how I expected most of my recordings would sound.
Getting the audio file to my PC with the transcription software involved minor annoyances because the iPod can only be synced to one iTunes at a time, so I had to take the file from the iPod (it copies down automatically when the iPod syncs) and copy it to another machine for transcribing.
ViaVoice complained about the low bitrate of my file, but dutifully accepted it anyway. I recorded two more training files to try and get its accuracy up. It did just fine with one, and appeared to do fine with the other until it reached the end and then decided all of the lines it said it had accepted were faulty.
That evening I recorded my first regular conversation and got about three hours worth of audio of just my side of the conversation.
I had expected that the transcription would be off more than normal since I wasn't in the best conditions, wasn't speaking particularly clearly, and wasn't dictating punctuation or line breaks. I had guessed I'd see accuracy in the 70%+ range.
Nope. The first transcription was about 30-40% accurate. In fairness, when I'm having an animated conversation, the way I talk certainly isn't easy for software to transcribe. Also, ViaVoice steadfastly attempted to transcribe everything that was audible, so if I stammered, coughed, or corrected myself mid-word, it would try to assign a word to the sounds.
The nice thing is that if you're transcribing an audio file, you can watch the words pour out on the screen, which is thrilling in its own way, and it's faster than the conversation: my three-hour recording took about 20 minutes to transcribe.
Here's a snippet of the conversation transcription for your amusement:
for the most hard those things don't necessarily add a whole lot of burden to those folks read before was the French ban the French ban I read Dryden freight your Ios ago from a year ago i.e. as the latter but that it is difficult to get a never-ending but it's not just eating
Since the software doesn't pay any heed to conversational breaks, and since I wasn't dictating punctuation, it quickly becomes difficult to follow the conversation, as disparate ideas, prompted for example by non sequiturs from other participants in the conversation, get jammed together as an apparent train of thought. Further complicating matters is ViaVoice's attempt to bring in context to help figure out what the words are. It turns out (not surprisingly) that the kinds of context you have while doing a direct dictation are rather different from the kinds of context you have while conversing. This led it to make the wrong choice of word based on context even when it apparently matched the individual word correctly based on speech-matching.
So I've been doing some training by making corrections to the text and introducing new words to the software's vocabulary (like "y'know"). This does appear to be improving the transcription accuracy, but at a maddeningly slow pace, made more frustrating by the fact that the application crashes or loses its place from time-to-time, including the only time I've seen it say that it was ready to update my voice model.
In any case, tests have only been conducted indoors and in fairly well controlled environments, and will probably continue as such until I can get an accuracy rate high enough to actually follow the conversation.
One final note on the setup: I was jonesing for a Jawbone headset as I think the technology they are using for filtering out background noise is pretty interesting (they sense the vibration in your jaw to determine when you are talking). Alas, they don't have a version that plugs into any old audio jack just yet (only special phone jacks). I've sent them an email asking when they might have a more general product, but haven't heard back from them (and, frankly, I don't expect to). This kind of technology will be absolutely critical for my project to work in the majority of live situations.
Where this takes us
So, why bother doing this at all?
I think we're on the cusp of some very interesting capabilities that will be brought about by having portable computing with relatively fast processing, large storage repositories, access to fast broad-area networking, and intuitive near-area networking. Here's where the iPod experiment fits in.
Probably the most obvious use is indicated by the title of this entry. If you can record everything you say, you have, in no trivial sense, provided some part of your story for others to see, either now or in the future. I understand that this sentiment is probably shared by only a minority of people, but I would like my descendants to have some view of who I was and how I went about being me. It is a stab at a certain kind of immortality, I suppose, allowing a portion of my being to exist beyond my lifetime. Some people do it with written or photo journals. I'm far too lazy for that, so technology can lend a hand.
Of perhaps broader interest is the ability to TiVo your life. For example, if the iPod were able to record and play at the same time, and it always knew when you were talking, you could skip backwards some amount of time and review something you had said, potentially putting an end to arguments that go something like this: "Well, you said I could go bar hopping with the boys." "I most certainly did not." "Remember, last week when I mentioned it?" "There's the couch, my friend, dream up another one." If you were going to build this kind of functionality, however, you might want to record more than just yourself, but recording yourself is a good first step.
I have a certain fascination with building knowledge structures to expose the right ideas to the right people who can take the idea and build upon it (I'm starting to believe that humans' primary purpose is to create and maintain information, and not even on that abstract a level, but that's fodder for another entry). The Internet is an excellent example of how having a large group of people's information on pretty much everything allows us to spread knowledge at a very fast rate, and build upon that knowledge faster than we have ever built knowledge before (even normalized for the size of the global population). People who are interested in the Semantic Web are looking to make this system even more efficient and potentially bring another revolution in knowledge sharing (though I have quiet doubts at this point).
The first step in building on knowledge, however, is capturing it. I have a pretty poor memory, as do several of my friends. This means that we are often rediscovering our own theories years later, much to everyone's amusement. This stems in part from the fact that we don't take notes when we are having interesting conversations. Often it is not possible to take notes as we're driving around together, or talking on mobile phones. Having a transcription of everything we say may not prevent us from re-creating ideas, but it certainly can reduce the occurrence of it, and allows us to look back at things we've talked about as it was captured and build upon those ideas.
Perhaps it is ironic that I am interested in contributing to the very glut of information that I believe will increasingly make the Internet hard to search through for quite a while yet, but I already have this blog, so why not everything I say as well?
Ok, on to the less philosophical reasons this is interesting.
Imagine that I was able to get this process to work very efficiently, so that the transcription knew my voice model well enough to have an accuracy rate of more than 99%.
Suppose I was able to carry this complete system with me, and that it operated in real time (there is no reason the speech-to-text software can't do that as that's what it was originally built for, and it was built to run on much slower computers than I currently have).
Suddenly, you can perform searches on everything you said in real time, playing back the actual audio, or displaying the transcription, depending on what you need. Add in some other metadata like date and time, GPS coordinates, even which direction you are facing, and you can do searches like: "What was I saying last Tuesday when I was sitting at Starbucks?".
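The kind of query I have in mind can be sketched very simply. The following is a toy illustration, not anything an iPod could run; the `Utterance` record and `search` function are names I've made up, and I'm assuming each transcribed utterance gets stored with a timestamp and a location tag:

```python
from dataclasses import dataclass
from datetime import datetime, date

@dataclass
class Utterance:
    text: str        # the transcribed words
    start: datetime  # when the utterance began
    place: str       # location tag (e.g. resolved from GPS coordinates)

def search(log, day=None, place=None):
    """Return utterances matching an optional date and an optional place."""
    return [u for u in log
            if (day is None or u.start.date() == day)
            and (place is None or u.place == place)]

log = [
    Utterance("the French ban, I read it a year ago",
              datetime(2004, 12, 14, 10, 5), "Starbucks"),
    Utterance("y'know, it's not just eating",
              datetime(2004, 12, 15, 9, 0), "office"),
]

# "What was I saying last Tuesday when I was sitting at Starbucks?"
hits = search(log, day=date(2004, 12, 14), place="Starbucks")
```

The point being that once the metadata is attached, the search itself is trivial filtering; all the hard work is in the capture and transcription.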
Now imagine that you attach a timestamp to every single transcribed word (I have to believe this is trivial now, but no one had a good use for it). You can then integrate other information, like pictures, documents, and the like into a single stream of information. You might reference this via your transcription stream with other information sources included right in the interface.
Now, if I had a conversation with you, and we were both recording our side of the conversation, I could send you my transcription, and you could send me your transcription, and we could integrate them to have the entire conversation as it was originally spoken.
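Mechanically, reassembling the conversation is just a merge by timestamp. A minimal sketch, assuming (my invention, purely for illustration) that each side keeps its transcript as a time-sorted list of (timestamp, speaker, text) entries:

```python
import heapq

def merge_conversation(mine, yours):
    """Interleave two one-sided transcripts, each already sorted by time,
    into a single conversation ordered by timestamp.
    Entries are (seconds_since_start, speaker, text) tuples."""
    return list(heapq.merge(mine, yours, key=lambda entry: entry[0]))

mine = [(0.0, "me", "Did you see that article about the ban?"),
        (7.5, "me", "The one from a year ago.")]
yours = [(3.2, "you", "Which ban?"),
         (9.1, "you", "Oh, right, I remember.")]

conversation = merge_conversation(mine, yours)
# the speakers now alternate in the order the words were actually spoken
```

This is also why per-word timestamps matter: the finer the timing, the more faithfully the interleaving reflects who said what when, including interruptions.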
If my system knew who you were, as the sender of the other half of the conversation, I now have new data I can search by.
If you and I were connected via some form of network, I could broadcast my transcription to you in real-time.
And if you didn't speak my language, you could automatically route my transcription through a translation service that fed you back a translated document pretty much in real time.
From this, you could have a text-to-speech application read the translation into your headphones in real time as I'm talking.
Perhaps you could even use my own voice model that I might choose to make available to you, so that the translation you are hearing of my words actually sounds like me as well. If my software is able to discern that I am yelling, or whispering, that data might also get passed along as part of the meta data stream to you, allowing for nuance.
Given this, there's no reason we have to be in the same location, or even connected via any voice application. I could just send you my transcriptions in real time and let your computer speak them to you, and vice versa, greatly reducing the amount of bandwidth required for carrying the conversation electronically.
What becomes interesting here is that we end up building an infrastructure from which new applications can be created to provide capabilities we never even thought about, just by transforming a type of information we constantly put out (in fact, THE information we constantly put out) into something that can be manipulated, transmitted, and combined with other things, in just the same way that the telephone, the highway, and the Internet have done. It probably wouldn't have quite the same transformative effect as the other things I just mentioned, but you have to admit, it's interesting.