Building My Own Siri / Jarvis
Most of the magic behind Siri happens remotely.
I want to create my OWN version of Siri…. except I don’t care for having it on my phone. I want my entire house to be talking to me… more like Jarvis (from Ironman).
I believe I have access to all the right resources to create this AI.
It breaks down into three major parts:
1) convert speech to text
2) query database populated with q&a
3) convert text to speech
Speech to Text
Most speech-to-text engines suck. Siri’s works exceptionally well because the engine isn’t on your phone… it’s remote. I suppose we could hack Siri by running a MITM attack on an iPhone, faking the SSL cert, and intercepting the Apple ID…. OR we can do something much simpler. Google’s Chrome 11 browser includes a voice input function (which isn’t yet part of the HTML5 standard) that converts your speech into text. This guy discovered that the conversion happens remotely through an undocumented API call to Google. All we have to do is access this same API and we’ve got ourselves a free Speech-to-Text engine!
In case you don’t understand Perl, this is how you use the API:
POST to: https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US
POST params:
Content (the contents of a .flac encoding of your voice, recorded in mono at 16000 Hz or 8000 Hz)
Content_Type (“audio/x-flac; rate=16000”, or rate=8000 to match your recording; this should also be mirrored in the Content-Type section of your header)
Response: json text
I used ffmpeg to convert my audio into the desired format:
ffmpeg -i Memo.m4a -vn -ac 1 -ar 16000 -acodec flac test.flac
So I recorded my voice on my iPhone 3GS asking “what day is it today?”, converted it to the appropriate .flac format, posted it to Google’s API, and this is what I got in response:
{"status":0,"id":"008bd1a95c3c2b04bd754da5e82949f4-1","hypotheses":[{"utterance":"what day is it today","confidence":0.91573924}]}
Sweet.
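Pulling the recognized text out of that JSON could look something like the sketch below. The helper name best_utterance is made up for illustration, and the idea that the API may return several hypotheses (so we pick the most confident one) is an assumption based only on "hypotheses" being an array in the sample response.

```php
<?php
// Sample response copied from the post above.
$json = '{"status":0,"id":"008bd1a95c3c2b04bd754da5e82949f4-1","hypotheses":[{"utterance":"what day is it today","confidence":0.91573924}]}';

// Hypothetical helper: return the utterance with the highest confidence,
// or null when the response is not valid JSON or has no hypotheses.
function best_utterance($json) {
    $response = json_decode($json, true);
    if (!is_array($response) || empty($response['hypotheses'])) {
        return null;
    }
    $best = null;
    foreach ($response['hypotheses'] as $hypothesis) {
        if (!isset($hypothesis['utterance'])) {
            continue;
        }
        // Treat a missing confidence as 0 (assumption).
        $confidence = isset($hypothesis['confidence']) ? $hypothesis['confidence'] : 0.0;
        if ($best === null || $confidence > $best['confidence']) {
            $best = array('utterance' => $hypothesis['utterance'], 'confidence' => $confidence);
        }
    }
    return $best === null ? null : $best['utterance'];
}

echo best_utterance($json), "\n"; // what day is it today
```

Returning null instead of blowing up makes it easy to have the bot say "sorry, I didn’t catch that" when the recognizer comes back empty.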
Database populated with Q&A
This is probably the most difficult part to obtain. To build it from scratch would require tons of data and advanced algorithms to interpret sentences constructed in various ways. I read somewhere that Siri was using Wolfram Alpha’s database….. so…. I checked out Wolfram Alpha and they have an engine that answers your questions. Not only that, they also offer an API service. (If you query less than 2000 times a month, it’s free!). So I signed up for the API service and tested it out. I asked it some simple questions like “What day is it today?” and “Who is the president of the United States?”. It returns all answers in a well-formed XML format.
<?xml version='1.0' encoding='UTF-8'?>
<queryresult success='true'
error='false'
numpods='1'
datatypes='City,DateObject'
timedout=''
timing='1.728'
parsetiming='0.193'
parsetimedout='false'
recalculate=''
id='MSP77719ii856b9090fei40000543b8b9eibb14ida&s=21'
related='http://www4d.wolframalpha.com/api/v2/relatedQueries.jsp?id=MSP77819ii856b9090fei400001d3h9h126cgaeigc&s=21'
version='2.1'>
<pod title='Result'
scanner='Identity'
id='Result'
position='200'
error='false'
numsubpods='1'
primary='true'>
<subpod title=''
primary='true'>
<plaintext>Friday, January 13, 2012</plaintext>
</subpod>
</pod>
</queryresult>
Again…. sweet.
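Here is a rough sketch of reading the answer out of that XML with PHP’s SimpleXML. The function name wolfram_answer is made up, the sample is a trimmed copy of the response above, and looking for a pod titled “Result” is an assumption based on this one response; other queries may put the answer in a differently named pod.

```php
<?php
// Trimmed copy of the response above (most attributes left out for brevity).
$xml = <<<XML
<?xml version='1.0' encoding='UTF-8'?>
<queryresult success='true' error='false' numpods='1'>
 <pod title='Result' id='Result' primary='true'>
  <subpod title='' primary='true'>
   <plaintext>Friday, January 13, 2012</plaintext>
  </subpod>
 </pod>
</queryresult>
XML;

// Hypothetical helper: return the plaintext of the pod titled "Result",
// or null if the XML is invalid, the query failed, or no such pod exists.
function wolfram_answer($xml) {
    libxml_use_internal_errors(true); // collect parse errors instead of printing warnings
    $result = simplexml_load_string($xml);
    if ($result === false || (string) $result['success'] !== 'true') {
        return null;
    }
    foreach ($result->pod as $pod) {
        if ((string) $pod['title'] === 'Result') {
            return (string) $pod->subpod->plaintext;
        }
    }
    return null;
}

echo wolfram_answer($xml), "\n"; // Friday, January 13, 2012
```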
Text to Speech
This part is easy… and google makes it even easier with yet another undocumented API! It’s straight-forward. A simple GET request to:
http://translate.google.com/translate_tts?tl=en&q=speech+to+convert
Just replace the q parameter with any sentence and you can hear Google’s female robot voice say anything you want.
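Building that URL for an arbitrary sentence might look like this. It is only a sketch: the function name tts_url is made up, and it only uses the two parameters shown in the URL above (tl and q).

```php
<?php
// Hypothetical helper: build the text-to-speech URL for a sentence.
// http_build_query takes care of URL-encoding (spaces become "+").
function tts_url($sentence, $language = 'en') {
    $query = http_build_query(array('tl' => $language, 'q' => $sentence), '', '&');
    return 'http://translate.google.com/translate_tts?' . $query;
}

echo tts_url('speech to convert'), "\n";
// http://translate.google.com/translate_tts?tl=en&q=speech+to+convert
```

Encoding the sentence matters once the answers contain punctuation such as "?" or "&", which would otherwise break the query string.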
Voice Input
I can either make my program run over a web browser or as a stand-alone app. Running it over the web browser is cool because I would then be able to run it from just about any machine. Unfortunately, HTML 5 doesn’t have a means of recording voice. My options are a) only use google Chrome, b) make a flash app, c) make a Java applet.
Anywho… no big deal.
Putting It All Together
<?php
// API endpoints. Replace the bracketed placeholder with your own Wolfram Alpha app ID.
$stturl = "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US";
$wolframurl = "http://api.wolframalpha.com/v2/query?appid=[GET+YOUR+OWN+STINKIN+APP+ID]&format=plaintext&podtitle=Result&input=";
$ttsurl = "http://translate.google.com/translate_tts?tl=en&q=master+cranky,+";

// Run a cURL request and return the response body as a string.
function fetch($url, $options = array()) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt_array($ch, $options);
    $contents = curl_exec($ch);
    curl_close($ch);
    return $contents;
}

// Google Speech to Text
$filename = "./test1.flac";
$upload = file_get_contents($filename);
$data = array(
    "Content_Type" => "audio/x-flac; rate=16000",
    "Content" => $upload,
);
$contents = fetch($stturl, array(
    CURLOPT_HTTPHEADER => array("Content-Type: audio/x-flac; rate=16000"),
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => $data,
));
$textarray = json_decode($contents, true);
$text = $textarray['hypotheses'][0]['utterance'];

// Wolfram Alpha API: ask the recognized question, read the plaintext answer
$wolframurl .= urlencode($text);
$contents = fetch($wolframurl);
$obj = new SimpleXMLElement($contents);
$answer = (string) $obj->pod->subpod->plaintext;

// Google Text to Speech: fetch the spoken answer and stream it back as MP3
$ttsurl .= urlencode($answer);
$contents = fetch($ttsurl);
header('Content-Type: audio/mpeg');
header('Cache-Control: no-cache');
print $contents;
?>
It responds with this answer. Good girl.
It’s still missing the voice input portion of the code. Currently, it just accepts a .flac file. I wrote 3 chunks of code and put them together as one pipeline of an AI process. The advantage of this over Siri is that I can intervene at any time. I can have it listen for particular questions such as “who is your master?” and respond appropriately…. but more importantly, I can have it listen for “Turn on my lights” or “turn on the TV” or “open the garage door” or “turn to channel 618”. Certain questions will have my bot send a signal to the appropriate Arduino-controlled light switch or garage switch or IR blaster and respond with a “yes, master”. I’ll post videos when it’s done.
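The “intervene” step could be sketched as a lookup that runs on the recognized text before anything goes to Wolfram Alpha. Everything here is hypothetical: the command table, the device and command names, and the match_command helper are placeholders, and exact-phrase matching is just the simplest thing that could work.

```php
<?php
// Hypothetical command table: spoken phrase => array(device, command).
$commands = array(
    'turn on my lights'    => array('lights', 'on'),
    'turn on the tv'       => array('tv', 'on'),
    'open the garage door' => array('garage', 'open'),
);

// Return the matching array(device, command), or null if the text is not a known command.
function match_command($text, $commands) {
    $normalized = strtolower(trim($text));
    return isset($commands[$normalized]) ? $commands[$normalized] : null;
}

$text = 'Turn on the TV'; // would come from the speech-to-text step
$action = match_command($text, $commands);
if ($action !== null) {
    // Known command: signal the device, then answer "yes, master".
    echo "device={$action[0]} command={$action[1]}\n";
} else {
    // Anything else falls through to the Wolfram Alpha query.
    echo "ask Wolfram Alpha: {$text}\n";
}
```

Exact matching will miss variations like “turn the TV on”; a fuzzier match (keywords or regular expressions) would be the obvious next step.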
Here is a video of the prototype in action.
Updated to give you a link to a working demo. This version requires you to use the Chrome browser (thanks to Shiv Kokroo for generously providing hosting / wolfram app ID):
Working Demo
Click on the little microphone and try asking her a question like “how many legs does a spider have?” or “what is 15 + 11?” or “turn off the lights”. 🙂
Update: There is a follow-up to this post here.
Source codes can be found on github.
Buddy , you are damn cool ! I would like to collaborate on this !
Hey Shiv! Dude, that would be awesome. Just to keep you updated, I’m looking into the X10 devices so I can utilize Jarvis to automate the home. I have an Arduino + ethernet shield programmed to accept commands from Jarvis to trigger responses for TV, lights, garage, etc.
can you possibly integrate wolfram alpha queries into speech recognition through perl? if so how? can you plz help me. I am absolutely at the beginners level at programming,but I’m willing to learn. I’m absolutely interested in this project.
http://mikepultz.com/2011/03/accessing-google-speech-api-chrome-11/
On that page you can see how he used perl to access the google speech api.
The demo is awesome .
Any chance of the Demo being put back up?
Hey chefwear, I wasn’t hosting the demo and didn’t realize it was down. I’ll try to get a working prototype back up.
Looks awesome. Demo was nice. I’m very interested in the result.
Hi Cranklin! What’s the status of this? Did you get the arduino working?
Actually, I did. I didn’t actually connect the arduino to the appliances/lights/garage/doors yet, but I have an arduino with an ethernet shield that acts as a miniature intranet web server and waits for instructions via GET requests. When I speak to Jarvis, she can GET requests to the arduino and give it instructions to turn on/off lights with instructions like “http://[internal IP]:[port]/?dev=tv&cmd=on” to turn the tv on.
I’m looking into other protocols such as X10, xbee, etc before I finalize the project.
I’ll post source codes for the arduino webserver and an updated jarvis/siri in a future post.
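For anyone curious, the kind of request described in this reply could be built on the PHP side roughly like this. It is only a sketch: the function name arduino_url is made up, and the IP address and port are placeholder values, not the real ones. Only the dev and cmd parameters come from the example URL above.

```php
<?php
// Hypothetical helper: build the GET URL that tells the Arduino what to do.
function arduino_url($host, $port, $device, $command) {
    $query = http_build_query(array('dev' => $device, 'cmd' => $command), '', '&');
    return "http://{$host}:{$port}/?{$query}";
}

// Placeholder address; substitute your Arduino's internal IP and port.
echo arduino_url('192.168.1.50', 8080, 'tv', 'on'), "\n";
// http://192.168.1.50:8080/?dev=tv&cmd=on
```

The resulting URL can then be requested with cURL the same way the post requests the other APIs.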
Awesome, I’ve been really interested in home automation through this method, but I was going to use AppleScript, – automate computer, home, and hopefully I’ll figure out how to use the APIs with it! I would love to look at your arduino code too!
I had the same for automating the home. But, first I am remaking my computer. I was going to try and port Skyvie over to my computer.
hey man this is so cool , i always kinda wanted to do this can i put this on my tech blog ? cheers
Hey Subin, absolutely.
Could you make a 100% custom server and make it sound like zazu from the lion king ? 😛
Or even better allow “pst!” to toggle siri and have hyper sensitive sound for intimate conversations ?
lol David. That would be comical. I can’t get Zazu’s voice, but I did notice that when the google TTS api is triggered from a different country, the accent is different.
Is this project dead? Id like to pick up where you left off.
Hi Stefanoxr. It’s not dead, but I just haven’t had time to work on it lately because of my job. Feel free to develop it further. Everything is in github though I apologize for the lack of organization. I also added a trueknowledge API version.
I too am building my own Jarvis and am interested in using the wolfram software to do it. Are you ok if I use some of your software as a basis like stefanoxr2? How can I access it on github? Is there a link I’m not seeing?
Hi Michael. How is your software coming along? You can find my source code on http://github.com/cranklin/Jarvis
Slow, man. Got to admit I’m a complete newbie to electronics and programming. I’m a quick learner and eager to tackle this project, even if it’s way beyond my current abilities. I’m a fast learner. Any general tips or resources about what I should learn first? For instance the type of code you’re using and whatnot? I’m just looking for some open resource to learn.
Hey Michael, there’s nothing wrong with that. I’m pretty sure you’ll grasp everything you need soon enough. If you don’t mind me asking, can you tell me what technologies and/or programming languages you are currently comfortable with?
Well, when I said new to languages I meant, grandmotherish. As in I know how to open my email and search the web. Changing the background picture of my laptop would have been difficult for me two weeks ago. Like I said though, I have a steep learning curve and I’m out of college for the summer, so I’m already comfortable with Java, XML, and some Basic, although I’m having a tough time finding a good place to learn basic. What language are you using for your programming if you don’t mind me asking?
Hey Michael, I’ll use whatever language is the best fit. For example, programming the arduino microcontroller requires C (not true C as it does use objects… but similar enough)… I chose PHP for the backend of the web interface, but I could have easily used Python or another language of choice. Don’t let the language be your focus. If you’re a good programmer, you’ll be able to learn a new language on the fly.
If I was going to make a recommendation to a new programmer, I’d recommend Python. It’s widely supported, it has many different applications, it’s fairly easy to learn, and it’s just an overall great language.
Awesome! Python and C are my next focus’ then. I’ll let you know how it’s coming along in a while. 🙂 Thanks for all the help! 🙂
Time lag is 3 to 4 seconds. How one can make it faster?
Rahul, you can disable one of the 2 AI engines. I reckon the double query will slow things down significantly.
I haven’t learned any computer languages yet and am wondering how a complete newbie like me could figure out how to do all this stuff, thanks
Hi Tom, I am sorry about the late response. I have been so busy with work.
If you haven’t learned any computer languages yet, this may all seem very overwhelming. I recommend getting your feet wet first. There are tons of online resources. codecademy.com offers some great classes that will help get you on your feet. I recommend it.
thanks u so much dude !! really
Thank you for reading!
Hey I came across your jarvis project while I was looking into doing something similar. Would you possibly be able to email me I have some questions I would like to ask. Thanks in advance
Hi Metin, sure. What kind of questions did you have?
How did you get the male voice vs the google female voice?
Thanks
Actually, that’s up to google. I noticed depending on your region, the google voice changes.
Thanks for the fast response. So you just set your region to what?
I left my region default.
Which is? I’m in the USA and its a female.
I’m in the USA as well and it’s a female. Where did you hear the male voice? The link that I provided is actually an Indian server. That might be why. lol
Syn virtual assistant is coming this april i saw its video on youtube something named like madonna virtual assistant its free and made for developers. if it can be extend i will definitely be using it because they say its free
I had a bunch of trouble attempting to use the google api until somebody suggested to me that I try http instead of https. I don’t know why https was failing for me, but just in case somebody else is having problems, here’s something to try.
Hello, do you run the code on a WAMP server or similar? PHP is server-side, so I don’t know how you host it. For the Arduino to connect, don’t they both have to be on the same network? Or do you forward the requests to the Arduino from a website? Victor
Hey Cranklin,
I’m looking into getting my feet back into programming, i’ve had basic C++ experience so I have a very general idea of whats going on in your program. I’m interested in replicating your program here but I want to learn what is happening at the same time and not just copy the code line for line. Is there any way you can add a few comments to the file to further explain the implementation of the APIs in the code?
Regards
Hi Tiko, sorry for the late reply. I’ve been crazy busy. Yes. Actually, if you can wait, I’m re-releasing Jarvis with a lot of enhancements and it will be easier to follow.
Hi i have been wondering about your JARVIS and thought that this is really cool, but i do not know the programs you used so if you can please tell me them i will be most pleased to make my own JARVIS! P.S. Ive been looking for this for a while and this seems perfect!
I’m really new to coding. I’m kinda confused as to what you’re coding this on, and what language, and if you’d be interested in kinda making a more step by step kind of post.
Hi, I have been following your Jarvis project, and it’s ridiculously cool! I would like to recreate your program. I know Java and I wanted to ask if this could be recreated using Java or would I have to pick up some Python to recreate the project? Look forward to hearing from you. Awesome Project!
you can use java (or other language). Just pay attention to the requests being made to google as well as the AI API. With just a little bit of work, you can easily port this to Java.
Hi, currently I know some Arduino programming and Java. Is it possible to create an application that serves as the main control panel for the Jarvis project, so that the computer answers each voice command it receives with the correct response?
The application will most likely be made for PC
I’m not exactly sure what you’re asking, but yes. If you look at part 2 of this blog post, I think I’m doing what you’re asking about. I may be mistaken.
Okay will look into it thanks! Do you know if this whole project can be made using a beagle bone board?
I haven’t tried tampering with a beagle bone board, but I’m pretty certain it can.
http://www.codeproject.com/Articles/579471/How-to-Write-Your-Own-Siri-Application-Mobile-Assi
Hello cranklin
i am a newbie for programming. You have a code posted there how can i get it running?
the PHP code you have given in the github files..
how can we run the above code?? by using which software ?? pls tell me ..anybody
Hi I’m a UI design Developer so I’m always looking for some cool projects and i think this is very cool and would love to help you develop this to make it a desktop app that people can just down load and just have it running every where so like jarvis even if you are at work from your phone you can have your AI complete task at home, stuff like that.
Hey cranklin on what language is your project based? Please tell me.
hey brother i pretty much like your work and looking forward for it, but as you say about the Google’s API..
I was reading up on JARVIS and came across this website. I am weak in HTML, so can you please check the site? They are doing the same thing as the Google API. Please let me know whether they have used Google’s API.
I am very sorry, I didn’t mention the link here. The link is http://jarone2.jarviscorp.com/newdemo.html
Please help me figure out how the website works.
Hi, your site is amazing! Thanks to you I have finished my version of the program. The URL no longer works because a new version of the API was released. I solved the problem by reading this -> https://github.com/gillesdemey/google-speech-v2
For Windows users: it is not necessary to convert the audio to FLAC! You can use a .wav file!
How do you do the same in Java?
I LIKE THAT
What if I dont really want it to respond to me with a voice, but with text? But also understand what im saying. So, I speak in the microphone, and it responds on the screen with text
and this is Python right?