
Pablos: Here’s one of the things I think is a critical area of invention that remains unsolved, but it’s definitely part of the future. If you’re using an iPhone anywhere in the world, cultures vary. I’ve been working with a guy in Venezuela on a project. I text him on WhatsApp and he replies with a voice memo every time, so his culture and worldview is just talking to the phone, probably because Venezuelans do a lot more talking or something.

Whereas I never use voice memos. I’m texting, but a lot of that is because I’m in public around other people and I don’t want to disturb them; disturbing people is considered uncool where I come from. But in Venezuela, everybody’s chattering all the time, probably because they’re all Latinos.
Talking to your computer will become more and more common, and you can see that some people are more comfortable with it than others. I see it a lot more in people from other countries than I do in Americans. Right now, talking to Siri or Alexa kind of sucks. These things are stunted because they’re very one-shot oriented. If you take your iPhone and start using the voice interface for ChatGPT, wow, it gets pretty exciting, because now you’re having a two-way, audible conversation that builds on itself over time.
And if you haven’t done that, I think everybody should try it, because it will give you a sense of where these things are going. Once you get that going and realize, oh, I can just do this while I’m driving or walking and I don’t have to be staring at my phone, it starts to get compelling.
And so it’s not hard to imagine being a few years down the road where ChatGPT is just listening all the time and piping in when it has answers for you. So that’s just laying the groundwork; hopefully all that makes sense.
But where I think this goes is that we need to solve one really big problem that remains, which is subvocal input.
Ash: Okay.
Pablos: What that means is: right now, I don’t want to talk to my phone. I don’t even want to dictate text messages or record voice memos, because there are people around listening and I don’t want them hearing my business. We’re in this situation where there’s eavesdropping potential; even if you’re not talking about something super secret, it could be private. And I don’t want to play a message from you out loud and have other people hear things I haven’t screened yet; who knows what you’re talking about.
So what subvocal input would do is give you the ability to essentially whisper and have your phone pick it up. People around you wouldn’t hear you or understand you, but you’d still be using the same machinery you already have. We all have the ability to whisper quietly. If you’re trying to whisper for someone else to hear, maybe it gets kind of loud, but if you’re just whispering to yourself, it can be super quiet.
We know this should be possible, because deaf people are able to train themselves to lip-read pretty well. A deaf person with no audio to rely on can sometimes apply enough focus to the task of learning to read lips that they get really good at it.
So there’s enough of a signal in what your phone could see. With Face ID there’s a tiny depth sensor (the TrueDepth camera) that can see your face. It can see the minute details of your face. That’s why it can tell the difference between your face and a photo of you, or your twin brother or sister, whatever.
So it might be possible right now, with the hardware that’s in an iPhone. Even though you probably don’t have access to the right APIs for this to work, maybe on an equivalent Android phone this could be prototyped: you could use that machinery and train a big machine learning model on lip reading.
Ash: Yeah.
Pablos: And so you would be able to just look at your phone and whisper, and it would transcribe.
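A minimal sketch of the kind of lip-reading model being described here, assuming a LipNet-style pipeline over a sequence of mouth-region depth frames from the front-facing sensor. The shapes, vocabulary size, and dummy input below are assumptions for illustration, not anything shipping in iOS:

```python
# Sketch only: a character-level lip-reading model over depth-frame sequences,
# in the spirit of LipNet (3D conv front end + recurrent back end), intended
# to be trained with CTC loss on (clip, transcript) pairs.
import torch
import torch.nn as nn

class SilentSpeechTranscriber(nn.Module):
    def __init__(self, vocab_size: int = 29):  # 26 letters + space + apostrophe + CTC blank (assumed)
        super().__init__()
        # 3D convolutions over (time, height, width) of single-channel depth frames
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep the time axis, pool away space
        )
        self.rnn = nn.GRU(64 * 4 * 4, 256, num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, vocab_size)

    def forward(self, depth_frames: torch.Tensor) -> torch.Tensor:
        # depth_frames: (batch, 1, time, height, width)
        feats = self.frontend(depth_frames)              # (batch, 64, time, 4, 4)
        feats = feats.permute(0, 2, 1, 3, 4).flatten(2)  # (batch, time, 1024)
        out, _ = self.rnn(feats)
        return self.classifier(out)                      # per-frame character logits for CTC decoding

model = SilentSpeechTranscriber()
dummy_clip = torch.randn(1, 1, 75, 64, 64)  # ~2.5 s of 30 fps depth frames (made-up size)
print(model(dummy_clip).shape)              # torch.Size([1, 75, 29])
```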
Ash: There’s a couple of things on this. Go back to the GSM world, before 3G, around 2000. So we’ll go back in time. One of the big conversations we would have: I was a proponent saying we just don’t have enough bandwidth, and people were like, “Yeah, but we’re going to have 3G and 4G and 5G and 6G.”
And I said, “No, no, you’re missing the point.” The bandwidth to your device is not the issue; it’s between the device and the human. It’s your conversation; this is where we’re stuck. We’re stuck because we type. We could try Dvorak, we could try QWERTY, we can pick the keyboard, we can have sideways keyboards, we can speak to it, but I still think all of these are terrible.
Whispering could be very interesting. There was an MIT headset, AlterEgo. If you look this thing up, it’s a mind-reading device: subvocalization signals through EEG, brain activity.
And they can actually make it work.
Pablos: Well, I’ve played with some of these things. I have a NeuroSky headset and an Emotiv, but I think what you have to do with them…
Ash: This one you wear. It’s bone conducting. It’s wild. You just put it on and say…
Pablos: Oh, it’s bone conducting. So it’s picking up speech, it’s not EEG.
Ash: No, no, no. The bone conducting is how it tells you things back. So it even whispers it back. Like, into your head.
Pablos: Oh, but you could just do that with headphones.
Ash: No, that’s how it whispers back. You think it and then it tells you things. Anyway, it’s called AlterEgo; we’ll link to it. To me, it goes back to what you’re saying: is there a way? Otherwise, we just look like we’re murmuring to ourselves, right?
We’ll just look completely crazy. Sometimes I get a little annoyed with people having conversations on AirPods. You have no idea what’s going on, right? There’s a little hairdryer sticking out of their head, they’re just walking around, and we’re already isolating ourselves, and now we’re conversing. I think what you’re saying, though, is that the subvocalization stuff needs to be almost so discreet that it’s a relationship between you and a listening device, right? It’s almost like the pixie on your shoulder.
Pablos: Yes.
Ash: It’s like the little angel and devil, whatever the animated version was.
Pablos: Yeah, and I think there could be other technologies. I don’t know if you could fit it in something like an AirPod. Maybe a Compton backscatter detector or one of these terahertz imagers, like the scanner at the airport where you hold your hands over your head. Without a lot of radiation; those things are low impact. You could do something like that to see the tongue through the side of the mouth.
Ash: My belief is closer to the way that you were trying to tackle this problem, which is, hey, it listens in and jumps in. But what if I could prompt it to jump in, right? So for example, let’s assume that instead of having to build anything new, it’s now just listening to me.
Constantly, in real time. Imagine a natural language parsing system with an engine underneath. We used to call this While Aware; that was actually the name of our company from years ago. While Aware intercepted SMS messages in real time on the SMSC, and the idea was that it would detect what the conversation was about, but because it knows who you are, it would evoke different things at different moments, right?
So let’s pick, for example, the Bitcoin price: Bitcoin is falling, and that message or that data is somehow coming to you. It might say, do you want to open up your trading account so you can go sell?
And for me, it might immediately ask, do you want to book tickets to Belize, a non-extradition country, because my margin call is too high? Whatever it is, if I have a margin call, it knows what’s happening. It’s contextual understanding. And I think the big thing that’s missing in all of these little support things you allude to, that ChatGPT brings to the table, is context.
We fail because it doesn’t understand us. Siri doesn’t know us.
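A toy sketch of the contextual-trigger idea described above: the same inbound signal produces different prompts for different users because the assistant knows each person’s situation. All of the data and rules here are invented for illustration; a real version would sit behind an LLM with access to your messages and accounts.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class UserContext:
    name: str
    holds_bitcoin: bool = False
    has_margin_loan: bool = False
    extras: dict = field(default_factory=dict)  # whatever other signals the assistant has gathered

def suggest_action(event: str, user: UserContext) -> Optional[str]:
    """Map one market event to a user-specific nudge; stay quiet if it's irrelevant."""
    if event == "btc_price_drop":
        if user.has_margin_loan:
            return f"{user.name}: heads up, this drop could trigger your margin call."
        if user.holds_bitcoin:
            return f"{user.name}: Bitcoin is falling. Open your trading account to sell?"
    return None  # nothing contextual to say to this user

for user in (UserContext("Pablos", holds_bitcoin=True),
             UserContext("Ash", holds_bitcoin=True, has_margin_loan=True)):
    print(suggest_action("btc_price_drop", user))
```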
Pablos: This is a separate conversation, but fundamentally you’re right. The whole future of AI requires that it know you. It needs to know every conversation you’ve had, every SMS, text message, and email; it needs 100 percent of that so it understands you. It knows what you know, it knows what you care about, it sees what you do, it sees what you say. It has to have all of that, and I want the AI to have all of that. We need to architect for that, and right now we’re not doing that, because we’re building giant centralized AIs.
Ash: That’s when your different technologies, whether it’s the backscatter or the lip reader or the whisper detector, all become a lot easier, when you have context. I don’t know if you remember Google’s evolution around 2009, 2010. Google was suddenly, not as creepy as Facebook, but its searches were just better.
Why were they better? Oh, you’re standing in New York City, so obviously maybe it’s contextual to what’s around you. Maybe the weather is cold. Google’s original cookie, which they’re now getting rid of, was so laden with data that if you could mine that sucker, you won.
It knew all of the signals. I used to call it signal gathering: the more signal you had, the more accurate you became, and the more you looked like a savant. So our AI, like you said, isn’t really smart, and Siri’s terrible because it doesn’t know much. It doesn’t even know intent.
So as humans, why is it that we can sometimes understand somebody with a very heavy accent?
Because we know the context of what’s happening and how we got there.
It’s not just lip reading; when we’re with them, we do our own interpretive dance. I think if you tie the two together, with these other little signal things you mentioned, you could pull it off.
Pablos: I assume we’re gonna get the latter for free. That’s gonna happen. AIs will be stunted until they have access to everything and know everything about me and my context in real time. That’s all gonna happen anyway; there’s so much momentum around it. So I think we get that for free, and even if you didn’t, having a conversation with ChatGPT right now will probably convince you it’s good enough that we’re going in this direction one way or another.
Ash: The reason I bring all this up is: can you imagine if, instead of having to whisper, all I have to do is have my phone out and just say yes or no, or say more? Go back to my Starship Troopers obsession: “Would you like to know more?” What’s interesting is, imagine in your scenario you’re having this subvocal conversation, but instead of you having to have any conversation, ChatGPT has heard you and it’s like, oh, alter ego…
Pablos: No, no, I get it. One of my friends figured out that you could get through life with only four words: fuck, man, dude, and totally. If you just have those four words, you can get through life, because you can express a multitude of things with them.
Totally.
Ash: Totally.
Your response: totally. Funny enough, though, right? That may solve some of your problems, because you could whisper a little…
Pablos: Yeah, yeah.
Ash: And not have to do long things.
Pablos: Yeah, right, exactly. You’re totally right. That’s what you do with your friends. The closer you are to your friends, like if you’re hanging out with somebody you’ve known for a long time, the more communication you can have with very little actual content. If I watch my daughter and her best friend hanging out, they’re incomprehensible, because they have shortcodes for memes; everything they see or talk about is related to some other thing I wasn’t part of, and they’re like foreign objects to me. I think that’s kind of what you’re describing. Like at some point…
Ash: So go back to your Venezuelan friend, right? Go back to that conversation where they’re sending you a voice note. Now let’s say that voice note is processed, parsed, and read by our GPT friend, and it comes back and gives you a five-sentence summary. You don’t even have to look; it just whispers it in your head. Like, he wants to know, should he edit the podcast? Whatever it is. And you could just go back and hit the yes button, right? I mean, you could go back and say, totally. You could do one of your four words.
Pablos: Yeah, totally. No, you’ve got to try it. I tried it; you can go for days without using any other words. But yeah, I think that gets more possible. With a human, the more shared experience you have, the more shared context and shared vocabulary, the more concise you can be in your interactions.
And so it stands to reason that an AI that knows you really well could get to the point where all you’ve gotta do is nod or wink and you’re done, on a lot of things, because it knows how to set you up to make a quick decision.
Ash: If it can formulate the outbound response in long form, and all you have to say is totally…
Pablos: Mm hmm, yep.
Ash: Then you’re good, right? That’s usually the problem with these voice notes. I get those too; it’s the Latin America thing. They just love it; I don’t know what’s going on. It was Brazil too, people just go off and leave a recording. I’m like, you do understand, if I could listen to this, I wouldn’t be texting you. I would pick up the phone and call you if I could have a dialogue. When I saw that, I was like, well, can you just tell me what’s in the voice recording?
That’s what we’re looking for. The other thing to think of, and I thought this is where you were going when you were talking about the subvocal thing, is that it’s almost like the Babel fish, for all the fans of Hitchhiker’s Guide. I just had this crazy problem happen: I’d ordered an Uber, and I’m sending information to the driver in English, and the driver is replying in Spanish. I have a little translate button, but I don’t think they had one. At some point they just said, “no hablo inglés.” I tried to give directions to my house; finally, I had to run into the street. I sent my daughter out into the street, and we’re trying to tell them, go to the yellow house.
And I’m like, does anyone remember the word for yellow? I realized I was getting translation and they just didn’t speak English. I think maybe there’s this universal input concept: if someone sends you a voice message, it doesn’t just transcribe it, maybe it automatically dumps it into a concise format, or, for the other person, it reads it to them. You pick your poison of consumption, the way you like to consume it, and you build a proxy in the sky that just takes care of all this.
There’s a universal proxy, a little babel bot that sits in the world. I think you could get pretty far with that. Then you use that to feed ChatGPT, and then you go with the totally, man, dude, fuck, right? That’s your sequence. And then you add your exotic input mechanisms for your subvocal and everything else.
So I could, you know, whisper.
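A rough sketch of that “universal proxy” / babel-bot pipeline: normalize every inbound message (transcribe, translate, summarize) and deliver it however the recipient prefers. The transcribe/translate/summarize functions below are stubs standing in for real speech-to-text, translation, and LLM services; the message and preferences are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    kind: str        # "voice" or "text"
    language: str    # e.g. "es", "en"
    payload: str     # audio file path for voice, raw text otherwise

def transcribe(audio_path: str) -> str:
    return f"<transcript of {audio_path}>"                   # stub for a speech-to-text service

def translate(text: str, source: str, target: str) -> str:
    return text if source == target else f"<{text} translated {source}->{target}>"  # stub

def summarize(text: str, max_sentences: int = 5) -> str:
    return f"<{max_sentences}-sentence summary of {text}>"   # stub for an LLM summarizer

def deliver(msg: Message, recipient_lang: str = "en", prefers: str = "summary") -> str:
    """Pick your poison of consumption: the proxy reshapes the message for the recipient."""
    text = transcribe(msg.payload) if msg.kind == "voice" else msg.payload
    text = translate(text, msg.language, recipient_lang)
    return summarize(text) if prefers == "summary" else text

note = Message(sender="friend in Caracas", kind="voice", language="es",
               payload="whatsapp_voice_note.ogg")
print(deliver(note))   # the recipient gets a short English summary instead of a long voice memo
```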
Pablos: So job one is all the people making AIs need to figure out how to make them mine so that I have my own that I can love and trust and have for life.
Job two is they need to make that thing know everything about me. I’m not just a lowest common denominator, I’m me, and I need my AI to really know me.
Job three is we’ve got to come up with some clever hardware for doing subvocal input. It could be something you wear, like a headset that sees through the side of your face and picks up what’s going on with your mouth, your tongue, and your embouchure.
Ash: Well, it could be like a body cam, just clip it on.
Pablos: It could be something like that, something that looks up at you. I don’t know; it’s hard to mount something that sees the front of your face very well. A phone does, though. Even if you had to aim the phone at your face for it to work, that would be a good start. And I think you could do that today without making any hardware.
Ash: Yeah, well, you could put it into your Apple Watch. Just hold it up; it’s like Dick Tracy.
Pablos: There’s no camera yet, but the next Apple Watch will have one.
Ash: Yeah, the next Apple Watch will have a little camera, so you just hold that up. You don’t even have to hold it up, because if you’re using your little radar or LiDAR thing, you just have to have your hand out a little bit. Gesture control on steroids.
Pablos: Did you see they put gesture control in the new Apple Watch? It only knows one gesture: you pinch your fingers together and it can detect that. I haven’t tried it yet.
Ash: The other thing I wanted to add, about your daughter, is that if the AI becomes your buddy, then the total bandwidth needed between you and your AI will start to decrease.
The requirement will decrease because you’ll just be able to speak in your own code. You’ll be able to be like, yeah, that thing that we worked on last week, dude.
Pablos: Mm hmm.
Ash: And then it’ll just know…
Pablos: Exactly. Right.
Ash: That’s the other way it’s going to help. But it all starts with that first step.
It’s got to twin you a little bit. A little scary on the privacy side.
Pablos: That’s where some of these folks working on OpenAI competitors have certainly gotten onto that notion. Allegedly, Apple is trying to figure out how to make the LLMs local, so they run on your device. Presumably that’s part of the rationale, beyond just justifying you having to buy a faster device, and it also makes it low latency.
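For the on-device angle, here is one way a local, personal model could look today, using the open-source llama-cpp-python bindings as a stand-in for whatever Apple might ship; the model file path and the “personal context” string are assumptions for illustration.

```python
# Sketch only: a local LLM that never sends your personal context off the device.
# Assumes a quantized GGUF model file has already been downloaded to this (made-up) path.
from llama_cpp import Llama

llm = Llama(model_path="models/local-7b.Q4_K_M.gguf")  # hypothetical on-device model

personal_context = (
    "User prefers short answers, is traveling this week, "
    "and has a podcast edit due Friday."
)
result = llm(
    f"{personal_context}\n"
    "Incoming voice note summary: 'He wants to know if he should edit the podcast tonight.'\n"
    "Suggested one-word reply:",
    max_tokens=16,
)
print(result["choices"][0]["text"].strip())  # ideally something like: Totally.
```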
Recorded on January 8, 2024.