Building My Own Siri / Jarvis

January 13, 2012


Most of the magic behind Siri happens remotely.

I want to create my OWN version of Siri… except I don’t care about having it on my phone. I want my entire house to talk to me… more like Jarvis (from Iron Man).

I believe I have access to all the right resources to create this AI.
It breaks down into three major parts:
1) convert speech to text
2) query database populated with q&a
3) convert text to speech

Speech to Text

Most speech-to-text engines suck. Siri’s works exceptionally well because the engine isn’t on your phone… it’s remote. I suppose we could hack Siri by running a MITM attack on an iPhone, faking the SSL cert, and intercepting the Apple ID… OR we can do something much simpler. Google’s Chrome 11 browser includes a voice input function (which isn’t yet part of the HTML5 standard) that converts your speech into text. This guy discovered that the conversion happens remotely through an undocumented API call to Google. All we have to do is access this same API and we’ve got ourselves a free speech-to-text engine!

In case you don’t understand Perl, this is how you use the API:

POST to: https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US

POST params: Content (the contents of a .flac encoding of your voice, recorded in mono at 16000 Hz or 8000 Hz)
Content_Type (which should read “audio/x-flac; rate=16000” or “audio/x-flac; rate=8000”, depending on your recording. This should also be mirrored in the Content-Type section of your header.)

Response: json text

I used ffmpeg to convert my audio into the desired format:
ffmpeg -i Memo.m4a -vn -ac 1 -ar 16000 -acodec flac test.flac
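
If the conversion has to happen from inside the PHP script, the command can be assembled with a small helper. This is only a sketch: `flac_convert_command` is a made-up name, and the restriction to 16000 or 8000 Hz simply mirrors the two rates mentioned in the API description above.

```php
<?php
// Hypothetical helper: builds the same ffmpeg command as above for any input file.
function flac_convert_command($input, $output, $rate = 16000)
{
    // Assumption: only the two sample rates mentioned in the post are allowed.
    if ($rate !== 16000 && $rate !== 8000) {
        throw new InvalidArgumentException("rate must be 16000 or 8000");
    }
    // -vn drops any video stream, -ac 1 forces mono, -ar sets the sample rate.
    return sprintf(
        "ffmpeg -i %s -vn -ac 1 -ar %d -acodec flac %s",
        escapeshellarg($input),
        $rate,
        escapeshellarg($output)
    );
}

// Print the command; it could be handed to exec() to actually run it.
echo flac_convert_command("Memo.m4a", "test.flac"), "\n";
```

Quoting the file names with `escapeshellarg` keeps a recording called something like `my memo.m4a` from breaking the command.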

So I recorded my voice on my iPhone 3GS asking “what day is it today?”, converted it to the appropriate .flac format, posted it to Google’s API, and this is what I got in response:

{"status":0,"id":"008bd1a95c3c2b04bd754da5e82949f4-1","hypotheses":[{"utterance":"what day is it today","confidence":0.91573924}]}

Sweet.
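
Pulling the text out of that response is just a matter of decoding the JSON. Here is a rough sketch; `best_utterance` is a name I made up, and the idea that the API might return several hypotheses (or none) is an assumption based only on "hypotheses" being a list in the response above.

```php
<?php
// Hypothetical helper: returns the most confident utterance, or null if there is none.
function best_utterance($json)
{
    $response = json_decode($json, true);
    if (!is_array($response) || empty($response['hypotheses'])) {
        return null; // not JSON, or nothing was recognized
    }
    $best = null;
    foreach ($response['hypotheses'] as $hypothesis) {
        if (!isset($hypothesis['utterance'])) {
            continue;
        }
        // Assumption: a hypothesis without a confidence score counts as 0.
        $confidence = isset($hypothesis['confidence']) ? $hypothesis['confidence'] : 0.0;
        if ($best === null || $confidence > $best['confidence']) {
            $best = array('utterance' => $hypothesis['utterance'], 'confidence' => $confidence);
        }
    }
    return $best === null ? null : $best['utterance'];
}

// The exact response shown above.
$sample = '{"status":0,"id":"008bd1a95c3c2b04bd754da5e82949f4-1","hypotheses":[{"utterance":"what day is it today","confidence":0.91573924}]}';
echo best_utterance($sample), "\n"; // prints: what day is it today
```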

Database populated with Q&A

This is probably the most difficult part to obtain. Building it from scratch would require tons of data and advanced algorithms to interpret sentences constructed in various ways. I read somewhere that Siri uses Wolfram Alpha’s database, so I checked out Wolfram Alpha, and they have an engine that answers your questions. Not only that, they also offer an API service. (If you query fewer than 2,000 times a month, it’s free!) So I signed up for the API service and tested it out. I asked it some simple questions like “What day is it today?” and “Who is the president of the United States?” It returns all answers in a well-formed XML format.


<?xml version='1.0' encoding='UTF-8'?>
<queryresult success='true'
    error='false'
    numpods='1'
    datatypes='City,DateObject'
    timedout=''
    timing='1.728'
    parsetiming='0.193'
    parsetimedout='false'
    recalculate=''
    id='MSP77719ii856b9090fei40000543b8b9eibb14ida&amp;s=21'
    related='http://www4d.wolframalpha.com/api/v2/relatedQueries.jsp?id=MSP77819ii856b9090fei400001d3h9h126cgaeigc&amp;s=21'
    version='2.1'>
 <pod title='Result'
     scanner='Identity'
     id='Result'
     position='200'
     error='false'
     numsubpods='1'
     primary='true'>
  <subpod title=''
      primary='true'>
   <plaintext>Friday, January 13, 2012</plaintext>
  </subpod>
 </pod>
</queryresult>

Again…. sweet.
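
Getting the answer out of that XML takes only a few lines with SimpleXML. This is a sketch, not the definitive way to do it: `wolfram_answer` is a made-up name, and picking the pod marked primary='true' is my assumption about where the answer lives, based on the single response above.

```php
<?php
// Hypothetical helper: pulls the plaintext of the primary pod out of a response.
function wolfram_answer($xml)
{
    $previous = libxml_use_internal_errors(true); // collect parse errors quietly
    $doc = simplexml_load_string($xml);
    libxml_use_internal_errors($previous);
    if ($doc === false || (string) $doc['success'] !== 'true') {
        return null; // unparsable, or Wolfram Alpha could not answer
    }
    // Assumption: the answer sits in the pod flagged primary='true'.
    $matches = $doc->xpath("pod[@primary='true']/subpod/plaintext");
    if (empty($matches)) {
        return null;
    }
    return trim((string) $matches[0]);
}

// A trimmed copy of the response above.
$sample = "<queryresult success='true' error='false' numpods='1'>
 <pod title='Result' id='Result' primary='true'>
  <subpod title='' primary='true'>
   <plaintext>Friday, January 13, 2012</plaintext>
  </subpod>
 </pod>
</queryresult>";
echo wolfram_answer($sample), "\n"; // prints: Friday, January 13, 2012
```

Returning null when there is no answer makes it easy to have the bot say something like "I don't know" instead of reading out an empty string.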

Text to Speech

This part is easy… and Google makes it even easier with yet another undocumented API! It’s straightforward. A simple GET request to:

http://translate.google.com/translate_tts?tl=en&q=speech+to+convert
Just replace the q parameter with any sentence and you can hear Google’s female robot voice say anything you want.
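
Since the sentence goes into a URL, it has to be URL-encoded first. A minimal sketch of building the request URL (`tts_url` is a made-up helper name; the tl and q parameters are the ones from the URL above):

```php
<?php
// Hypothetical helper: builds the translate_tts URL for any sentence.
function tts_url($text, $lang = 'en')
{
    // http_build_query URL-encodes the values; spaces become "+".
    $query = http_build_query(array('tl' => $lang, 'q' => $text), '', '&');
    return 'http://translate.google.com/translate_tts?' . $query;
}

echo tts_url('speech to convert'), "\n";
// prints: http://translate.google.com/translate_tts?tl=en&q=speech+to+convert
```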

Voice Input

I can either make my program run in a web browser or as a stand-alone app. Running it in the browser is cool because I would then be able to run it from just about any machine. Unfortunately, HTML5 doesn’t have a means of recording voice. My options are a) only use Google Chrome, b) make a Flash app, c) make a Java applet.

Anywho… no big deal.

Putting It All Together


<?php
    $stturl = "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US";
    $wolframurl = "http://api.wolframalpha.com/v2/query?appid=[GET+YOUR+OWN+STINKIN+APP+ID]&format=plaintext&podtitle=Result&input=";
    $ttsurl = "http://translate.google.com/translate_tts?tl=en&q=master+cranky,+";

// Google Speech to Text

    $filename = "./test1.flac";
    $upload = file_get_contents($filename);
    $data = array(
        "Content_Type"  =>  "audio/x-flac; rate=16000",
        "Content"       =>  $upload,
    );
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $stturl);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: audio/x-flac; rate=16000"));
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
    $contents = curl_exec($ch);
    curl_close($ch);
    // Decode the JSON reply and take the first hypothesis
    $textarray = json_decode($contents, true);
    $text = $textarray['hypotheses'][0]['utterance'];

// Wolfram Alpha API

    $wolframurl .= urlencode($text);
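    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $wolframurl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
    $contents = curl_exec($ch);
    curl_close($ch);
    // Parse the XML reply and read the plaintext of the Result pod
    $obj = new SimpleXMLElement($contents);
    $answer = (string) $obj->pod->subpod->plaintext;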

// Google Text to Speech

    $ttsurl .= urlencode($answer);
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $ttsurl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the audio instead of printing it
    $contents = curl_exec($ch);
    curl_close($ch);
    // Send the spoken answer back to the client as MP3 audio
    header('Content-Type: audio/mpeg');
    header('Cache-Control: no-cache');
    print $contents;
?>

It responds with this answer. Good girl.
It’s still missing the voice input portion of the code; currently, it just accepts a .flac file. I wrote three chunks of code and put them together as one pipeline. The advantage of this over Siri is that I can intervene at any point. I can have it listen for particular questions such as “who is your master?” and respond appropriately… but more importantly, I can have it listen for “turn on my lights” or “turn on the TV” or “open the garage door” or “turn to channel 618”. Certain commands will have my bot send a signal to the appropriate Arduino-controlled light switch, garage switch, or IR blaster and respond with a “yes, master”. I’ll post videos when it’s done.
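
One way the "intervene" step could look is a lookup that runs on the recognized text before it ever goes to Wolfram Alpha. This is only a sketch of the idea: `match_command`, the patterns, and the action names are all placeholders I invented, and the part that actually talks to the Arduino is not shown.

```php
<?php
// Hypothetical command table: phrase pattern => action name.
// Returns array('action' => ..., 'argument' => ...) or null if it is not a command.
function match_command($utterance)
{
    $commands = array(
        '/^turn on (?:my|the) lights$/' => 'lights_on',
        '/^turn on the tv$/'            => 'tv_on',
        '/^open the garage door$/'      => 'garage_open',
        '/^turn to channel (\d+)$/'     => 'tv_channel',
    );
    // Lowercase and trim so "Turn on my lights" still matches.
    $normalized = strtolower(trim($utterance));
    foreach ($commands as $pattern => $action) {
        if (preg_match($pattern, $normalized, $m)) {
            // Only the channel pattern captures anything (the channel number).
            return array('action' => $action, 'argument' => isset($m[1]) ? $m[1] : null);
        }
    }
    return null; // not a command: fall through to Wolfram Alpha
}

print_r(match_command("Turn to channel 618"));
var_dump(match_command("what day is it today"));
```

When `match_command` returns null, the text would go on to Wolfram Alpha as before; otherwise the bot would fire the matching switch and answer “yes, master”.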

Here is a video of the prototype in action.
