A Conversation with the Founder of TalktoApps: Voice AI 5x'd My Productivity, and Voice Input Is the Future of Human-Computer Interaction

Published by the RTE Developer Community on 2025-02-07

What practical problems have builders of Voice AI Agent products actually run into? And how do they think about and solve them?

Today's recommended article is adapted from a newly recorded episode of Vela's podcast: a conversation with Ebaad, founder of the voice app TalktoApps. Ebaad shares many of the challenges he has met, and the thinking behind them, while building a voice-first product: How should voice interfaces and graphical interfaces be combined? What kind of human-computer interaction fits which situations? How should the technical architecture behind the product be designed and evolved? We hope their conversation inspires you.

If you already know me, you may know that I'm an AI product and voice AI enthusiast (and working hard to become a creator too). As an audio and music lover, I find every kind of sound beautiful: not just music, but also the sounds of valleys and conversations in the city.

I believe voice AI lets products interact with users in a more natural way, as natural as a conversation between people. It can also be applied in many different scenarios, bringing value to users who truly care about this. So I decided to start this podcast series, "Voice Talk", to connect with more voice AI builders and open-source our early voice AI experience.

In each episode, I invite a founder in Silicon Valley who is building a cool voice product. We talk about their personal story with voice, their journey building voice products, and the future of voice products.

For the first episode, I invited a good friend of mine from Founders, Inc. and a fellow voice product enthusiast, Ebaad, who is building the voice app TalktoApps. Besides his journey through voice technology and voice products, we also talked about language, sound, and the future of voice-based interaction.

It's a conversation between two voice-product founders on the same wavelength.

The conversation was conducted in English; the transcript follows.

The full video is at the end of this article.

1. My Story with Voice

Vela: Hello, everyone. I'm Vela.

Ebaad: I'm Ebaad.

Vela: Welcome, Ebaad.

Ebaad: Thank you for having me.

Vela: You are the first person I met at Founders, Inc. who is doing something with voice as well.

Ebaad: Likewise.

Vela: Amazing. And I think you are also an audio person, like you'd prefer collecting information through audio.

Ebaad: Yeah, I prefer that in some cases.

Vela: Cool. Let's talk more about that. Can you tell us your story with voice, and what role voice has played in your life?

Ebaad: Oh, voice has been really, really amazing. It's 5x'd my productivity, maybe 10x'd it. In the past year or two, I've listened to 5 million words. I use this app called Speechify.

And I recently came across this other app called WhisperFlow, with which I've dictated a hundred thousand words in two months. I used to use the dictation function on Apple, but it was not that good. WhisperFlow makes it very easy.

So yeah, it's a big part of my workflows. Starting with coding: I just talk to Cursor as well.

I also basically talk to an LLM. I just blab on for 30 or 40 seconds about different things and give it as much context as I can. It does its thing, and then I use Speechify to listen to the answer while I'm walking around or doing something else.

I like to walk around, and you can't read while walking, so I'll just run the LLM, walk around the room, and hear the answer through my headphones.

Vela: So you like talking to apps?

Ebaad: I like to talk to apps, yes.

Vela: And that's where your product comes in.

Voice has 5x'd my productivity, maybe 10x'd it. I basically talk to an LLM: I blab on for 30 or 40 seconds about different things and give it as much context as I can. It does its thing, and then I use Speechify to listen to the answer while I'm walking around or doing something else.

2. The Product: TalktoApps

Ebaad: That's a nice transition. I'm building this thing called talktoapps.com. It's kind of what it sounds like.

Basically, you can interact with your favorite apps using natural language. It could be text, or it could be voice as well. Both.

Instead of clicking a hundred times, you can just say something very abstract, like "remove all my meetings on Wednesday", instead of going and clicking through everything.

And it will understand that. Or it can create a meeting and assign it to Vela, or invite Vela, or assign it to some person, instead of you going and finding them with clicks.

So yeah, currently it has Todoist, and I'm going to be integrating Google Calendar today. And then Google Sheets, which I was testing out yesterday.

It looks pretty exciting: you can do things just from an extension and talk to the extension. I'm considering WhatsApp as well; it could be cross-platform through WhatsApp, so you can basically just speak to a WhatsApp bot.

And then it will just do those things, so you don't have to worry about it being on your computer.

Yeah, I think the barrier to entry, in terms of getting a task or an idea from your head into the computer or your app, is mostly a design problem.

And I think that barrier should be very low. So in the future, I would like to have it as a bracelet, where you can just tap it and it does the thing. But that's a long way from now.

3. Voice AI Stories from Users

Vela: The basic point is, you want to make our interaction with machines as natural as our interaction with humans.

Ebaad: Possibly, yes. Cool.

Vela: Can you share with us your favorite user story so far?

Ebaad: I do have a favorite user story. I think it's this guy, Hadza. He uses WhisperFlow as well. And he built a nuclear reactor by talking to Claude.

That's what he was doing. He was playing with these tools and stuff, and he would just be talking to Claude as is. Because you can do Space and Fn, and then you can keep talking to Whisper while you're doing things.

So yeah, I think that's pretty good. You can just keep talking and give it the context of what you're seeing. Maybe you have something in front of you, and you can just talk about it, take the output back, and then use it for whatever you want to do.

4. Technical Challenges in Building a Voice AI Product

Vela: Let's dig into the technical side. What challenges have you faced when building TalktoApps?

Ebaad: That's an interesting question. I think there are two main problems, and I'll go over them chronologically.

The first one, as we talked about, is the design of how you communicate with it. It's going to be natural language, but the first question is: can you type as well?

Right? "Cancel all my meetings on Wednesday." Sometimes that's necessary, if you're outside and things like that. From what I've heard, people don't want to talk as much.

You're more careful talking out loud. Yeah, exactly. So typing could also be an interesting option. But it's still voice first.

The second is how you interact with the app: how does it give you feedback? There's a principle in UI design that the interface has to give feedback.

Normally Alexa and Siri just say things, which I think is very limited sometimes, because with voice you can only hear what's being said right now, but with graphics you can see the whole thing.

Vela: Yeah, like one dimension versus multiple dimensions.

Ebaad: Yeah, multiple dimensions. There's that quote: a picture is worth a thousand words.

If you can see how things are happening, that's very interesting as well, and then you want continuous feedback. Basically, if you're doing multi-step workflows, an example would be: "Okay, could you take this Twitter link, research this person, and put it into my investor doc in Sheets?"

So you basically need to see that text convert into something visual: okay, it's fetching Twitter, there's an icon, and then it's feeding it into an LLM, maybe with a spinner.

And then you can also drag and drop things if it screws up, and it changes the text based on what you do. There's this product called Zapier where you can tie multiple functions together.

So it could be like that: it's text, but you can also drag and drop if something messes up, or if it selects the wrong function.

And I think the last part of the design side is what happens when you mess up. Maybe you're trying to add an investor to a doc that's basically for your friends or something, and there's no investor column.

How does it give you feedback? Does it say, "oh, you gave the wrong command," or should it just give you the right commands to run?

That comes back to the technical side. You need the context of their Excel or Google Doc, or their Notion page: what the structure is.

So I'm working on that on the technical side: creating separate functions for everyone, for everyone's Google Sheets, so it can know exactly what the natural language command is missing.

And it can give feedback, maybe as a sentence with blanks in it: "Oh, you added the investor, but there's also a Twitter handle. Do you want to add the Twitter too? I can research that for you." Something like that.

So that could be interesting. That's the interaction and product side. And I'll be happy to talk more about the technical side: the infrastructure, the LLM, and how state management will work.
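The schema-aware feedback Ebaad describes, checking a command against the structure of a user's own sheet and asking about what's missing, can be sketched roughly like this. This is a minimal illustration, not the product's implementation; all function and field names are invented.

```python
# Sketch: validate a parsed natural-language command against the columns
# that actually exist in a user's sheet, and surface what can't be applied
# and what could be asked about in a follow-up. Illustrative names only.

def check_command_against_schema(fields: dict, sheet_columns: list[str]) -> dict:
    """Split a command's fields into ones the sheet accepts and ones it can't."""
    known = {k: v for k, v in fields.items() if k in sheet_columns}
    unknown = {k: v for k, v in fields.items() if k not in sheet_columns}
    # Columns the sheet has but the command left blank: candidates for
    # "do you want me to research that for you?" follow-ups.
    blanks = [c for c in sheet_columns if c not in fields]
    return {"apply": known, "rejected": unknown, "suggest": blanks}

result = check_command_against_schema(
    {"name": "Vela", "twitter": "@vela"},  # parsed from "add investor Vela, twitter @vela"
    ["name", "firm", "stage"],             # this user's actual investor-doc columns
)
# result["rejected"] shows "twitter" has no matching column;
# result["suggest"] lists "firm" and "stage" as blanks worth asking about.
```

The point of the split is that "rejected" fields drive the error-feedback sentence, while "suggest" fields drive the proactive follow-up Ebaad mentions.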

Vela: Yeah, sure. But before that, I want to dig into the interaction, the design side. How would you like to handle this problem? How do you trade off, or balance, the interaction design between voice and graphics?

Ebaad: It depends on what problem you're solving. If you have a screen in front of you, I think graphics are better.

Because they have more information bandwidth. And I think you can do a lot of interesting things, like highlighting what you're showing at that moment in a specific way.

Things like that. But in terms of feedback, graphics are very, very good, because you can see your whole calendar at a glance. You can't do that if voice is telling you, "this is your calendar, you have a meeting at 3, you have a meeting at 5." It takes maybe 10 or 20 seconds, and then you forget as well.

So I think voice is good for one-way interaction, and then the feedback should be graphical.

Does that make sense?

Vela: Interesting.

Ebaad: Yeah, so that's where my focus is. I think interaction right now, if you're talking to Siri or something like that, is very computationally heavy. Then you're waiting for the answer. It's thinking.

And if you cut it off... you remember with Agenta? Cutting it off, and... yeah. So I think for talking and giving instructions, voice is hands down the best, but for giving feedback, I think graphical things are interesting.

But there are trade-offs, of course, depending on where you are.

Vela: Yeah, and it reminds me, that's probably your design principle for TalktoApps: your input is just voice, but your output is graphical, like a command.

Ebaad: Yes, and there are benefits to that. You can talk to it on a phone as well. It could be interesting. But I think I'll figure the design out more as I go.

Vela: Cool. Let's move on to the technical challenges.

Ebaad: I think the technical challenge is that currently it's a very, very simple infrastructure, but it has to get really complicated as the complexity of the tasks increases.

Currently it's basic. If you know how agents work, they are not that crazy: you basically give a command, and they convert it into objects. Those objects are normally JSON objects that you can run functions with.

So if you're creating a to-do task, "create a task at three," it creates a JSON object with these parameters: it could be content and due date, that's it, right?

And then there's an API, and you basically run that API call on the backend. But complexity arises. Creating tasks is easy because there's no previous context you're dealing with, but updating is a little more challenging, because to update, you first need to find what you're updating.

So if you say "update 'get groceries' from three to five," it first has to search for the groceries task.

And these APIs require you to have a primary key: there's a task ID, and you have to map the task ID to the update and then run it.
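The create-versus-update asymmetry described above can be sketched as follows. This is a toy illustration against an in-memory fake backend, not any real to-do API; the field names and the JSON shape the LLM emits are assumptions.

```python
# Sketch: "create" runs directly, while "update" must first search by
# content to resolve the task's primary key before the call can run.
# TASKS stands in for the app's backend state; everything here is invented.

TASKS = [{"id": 17, "content": "get groceries", "due": "15:00"}]

def handle(command: dict) -> dict:
    if command["action"] == "create":
        # Create needs no prior context: build the object and store it.
        task = {"id": len(TASKS) + 100, "content": command["content"], "due": command["due"]}
        TASKS.append(task)
        return task
    if command["action"] == "update":
        # Update must first map fuzzy spoken content -> primary key.
        matches = [t for t in TASKS if command["content"] in t["content"]]
        task = matches[0]          # a real system would disambiguate or ask
        task["due"] = command["due"]  # then run the update against that ID
        return task
    raise ValueError(f"unknown action: {command['action']}")

# "update get groceries at three to five" -> the LLM might emit:
updated = handle({"action": "update", "content": "groceries", "due": "17:00"})
```

The extra search step (and the disambiguation it implies when several tasks match) is exactly where the complexity Ebaad mentions comes from.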

And then there's also the challenge of storing state. If you say "update it" and it updates it, and then you change your mind and want to undo it...

You have to store what it was before.

Vela: A lot of function calls.

Ebaad: A lot of function calls, and a lot of state management as well. You have to be able to go back. Maybe you're changing things on the fly: "Oh, change it to five. Oh no, change it to six. Oh, you know what? Leave it where it was."
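The "leave it where it was" scenario is essentially an undo stack: snapshot the task before every mutation so any number of changes can be rolled back. A minimal sketch, assuming an in-memory task rather than a real app's API:

```python
# Sketch: store the prior state before each update so a spoken
# "undo" or "leave it where it was" can be honored. Illustrative only;
# a real integration would snapshot via the target app's API.

import copy

history: list[dict] = []                  # stack of previous states
task = {"content": "groceries", "due": "15:00"}

def update_due(task: dict, due: str) -> None:
    history.append(copy.deepcopy(task))   # remember what it was before
    task["due"] = due

def undo(task: dict) -> None:
    task.update(history.pop())            # restore the most recent snapshot

update_due(task, "17:00")   # "change it to five"
update_due(task, "18:00")   # "oh no, change it to six"
undo(task)                  # back to five
undo(task)                  # "leave it where it was": back to three
```

Each function call that mutates state pushes onto the stack, which is why "a lot of function calls" and "a lot of state management" go hand in hand.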

Vela: Ah, it reminds me: if I say "change it to seven," or "change it to six," that could partly be a design question. The point behind it is: how can you identify what users are really saying, and what's implied behind it? Maybe you need more time, say five seconds, and only do the conversion after that.

Ebaad: Could be. It could be like that: maybe after you've said the sentence and completed it, then run it. That could be handled on the front end: when the sentence is complete, run it.

But it could also be done so that it displays the result, and then you can change it again. So I think both; it's kind of a question of where you pre-process it.

Yeah. Because if you say "change my groceries to five," or "do groceries at five," that's a complete sentence. So it can basically wait for you to change your mind, maybe one or two seconds.

But then there are cases where you want it to just do it immediately, instead of waiting. So I think storing it in state makes more sense, but you can do some pre-processing on the front end before you make the function calls. It's a very interesting problem.
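The "wait a moment in case the user revises the sentence" idea is essentially a debounce: only fire a command if no newer one arrives within a short window. A minimal sketch, with timings and names that are illustrative rather than taken from the product:

```python
# Sketch: debounce spoken commands so a quick correction ("no, at six")
# supersedes the previous sentence instead of both executing.

import threading
import time

class Debouncer:
    def __init__(self, delay: float, action) -> None:
        self.delay = delay      # how long to wait for a possible revision
        self.action = action    # what to run once the command settles
        self._timer = None

    def submit(self, command: str) -> None:
        if self._timer is not None:
            self._timer.cancel()  # a newer sentence supersedes the old one
        self._timer = threading.Timer(self.delay, self.action, args=(command,))
        self._timer.start()

executed = []
d = Debouncer(0.2, executed.append)
d.submit("do groceries at five")
d.submit("do groceries at six")   # arrives before the window closes

time.sleep(0.5)
# Only the final, settled command runs: executed == ["do groceries at six"]
```

This is the front-end pre-processing half; the undo stack covers the case where the correction arrives after the command has already run.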

Vela: Very interesting. Can I say it's not only a technical issue, but something that can probably also be handled by product design?

Ebaad: Yes, it could be handled by product design as well. Initially, yeah. It will be interesting to see, but the thing is, it's a very new field, a new field of human-computer interaction, so a lot of people are going to spend time developing it, and the best design will rise to the top. It will bubble up.

Vela: Yeah. Also, for the evaluation of your product and the voice interface, there's a brand new kind of evaluation.

Ebaad: What does evaluation mean?

Vela: Like product evaluation.

Ebaad: Okay.

Vela: For me, there are two parts. First is the product eval: how can you evaluate whether your product solved the user's problem?

That's from the user side. And second, if you think from the technical side: how are the quality, the latency, and so on?

Yeah. You do have some different metrics for your conversations.

Ebaad: One hundred percent.

Vela: How do you evaluate them?

Ebaad: Yeah, that's interesting. I think right now I don't have a very good framework. I basically look at what makes a good workflow.

I could do more user testing, but I recently implemented this thing called Groq, and their latency is very low. On transcription it can do under 300 milliseconds for a sentence, meaning about 0.3 seconds.

So it's kind of real time. And then creating the JSON objects and running the API is about 0.8 seconds, so it kind of is real time.

So you basically say "add groceries at three" and it does it, and you can see the visual, and then: "oh, you know what, change that to five."

So I think the latency with Groq has changed a lot of things, because previously, with OpenAI, it took two seconds, and then it's not that intuitive.

So I think you can basically watch the workflow and be intuitive enough to say, "oh, this is good technology." And it seeps into the design as well.

Because I think latency is very, very important in this case: you want to be able to change things on the fly.

There's this company called AquaVoice, in YC, that's doing a really good job with editing on the fly. I'll show you afterwards. It's amazing, the way they do voice editing.

Basically, you can just speak. You say, "Hey, can you create three items? GPUs, computers, and things." And then you can say, "can you convert it to a list?"

So it would take a sentence and convert it to a list. And then you'd say, "oh, can you add GPUs to the top instead of computers?" and it would do that. "And can you make it 200 GPUs?" instead of just GPUs, and it would add the 200 there.

As you can tell, it's very interesting. We'll take a look after this.

Yeah, and I think it could be done with coding as well. Maybe you're working on a function, and: "oh, could you take lines 150 to 160 and add two if statements in there," or "add one argument," or "move this to the function below."

So you can basically edit code and run code, with your voice.

I don't remember where our conversation started, but I think it was the evaluation side. I think you can see intuitively whether it's a good product or not. I don't know if that's taste, or if everyone can just tell: if it's a bad product, you can basically see it.

It's like a movie. You see a bad movie and you're like, "ah, it's a bad movie."

Vela: Mm-hmm. Yeah, there are two parts. Okay. Yeah, we've wandered quite far. Let's move to the last question.

5. The Future of Voice AI Products

Language is very flexible in its structure. So I think that's the future, and it's more intuitive for humans. A natural language browser is going to be a big thing. A natural language IDE. There's a company at South Park Commons (a well-known Silicon Valley incubator) working on a natural language browser.

Vela: How do you see the future of voice AI and voice-based products?

Ebaad: Oh, that's interesting. I think it will depend on two technologies that are coming. I think it's taking two paths.

One is the computer vision side, where you can just tell your phone to send a WhatsApp message to Vela about this meeting or something, and it can just do that.

And then the interaction design will be very different, because you're going to see the AI do it on the screen or something like that, just the way you would do it.

The second one is like this product, where you're adding a layer of natural language on top of these functions. I'm more bullish on the second one. The latter.

Because I think it's just faster: you're using a CPU instead of a GPU to make these API calls, and you're just using natural language to infer what those API calls should be.

And it's much better in terms of cost. I know your question was about what the interaction will look like, but I want to talk a little bit about the technology as well.

Scanning images and finding where to click is very expensive. And then you have to do something like one image per second.

But with this approach, it's very, very quick. And you're letting the human scan the page and things like that.

In terms of the future, I think the future is going to be natural language.

Because I think technology tries to optimize toward how efficiently you can do things. Not always; QWERTY keyboards are not very efficient, but that's another matter.

But it basically optimizes for whatever takes the least effort. And if you're clicking, and you have to go through ten touchpoints to do one task, but you can do it in a sentence, efficiency will demand that you move toward voice and more natural interfaces.

You can do way more with voice, in terms of communicating what you want to do, than with clicking. And it carries more information; voice, natural language, just carries more information.

You don't have to specify a lot of things. You can get a lot of context from the surrounding words as well.

Like, we're recording an interview in a podcast room. This is not a job interview; it's a podcast. So it's probably a different type of interview.

So language is very flexible in its structure. I think that's the future, and it's more intuitive for humans.

So I think a natural language browser is going to be a big thing. A natural language IDE. There's a company at South Park Commons, I think, that's working on a natural language browser.

So basically you type "open this page and go somewhere," and it will do that. "Download this." And with code it's "run this," "deploy this," things like that.

So I think natural language is going to be huge, especially with LLMs and all the new tools everyone is building.

Vela: Here comes the end. In the future, I would love to say that what we are building now leverages voice AI to make the interaction between humans and machines as natural as the interaction between humans.

That way, we can spend more time talking with humans.

Ebaad: Yeah, and talking to machines as well.

Vela: And talking to machines as well. And letting agents talk to other agents.

Ebaad: Yeah, that would be hilarious. Two Hinge agents talking to each other. Yeah, agents talking to each other.

Then other problems are going to come up. I think there are two sides, the technical side and the design side, and both of them interest me.

So we'll see. I think the future looks... I don't know if it's bright, but it looks different. It's definitely bright, I think. I don't want to be clicking.

Vela: Yeah, very good vision, very good point. Thank you for digging into the technical side and walking us through the future of voice and interaction with machines.

Ebaad: Thank you for having me, Vela.

Vela: Thank you, Ebaad.

Referenced Voice Products:

• Speechify: https://speechify.com/

• WhisperFlow: https://wisprflow.ai/

• AquaVoice: https://withaqua.com/

Where to find TalkToApps:

• Website: https://www.talktoapps.com/

Full video:

About Vela: voice product explorer
Builder & Creator
WeChat: la_vela (please include a note when adding)
