Reading Lips In Software

SEWilco writes "The Register points out that Intel has released code for reading lips from a video image, Audio Visual Speech Recognition (AVSR). They do point out that better results would probably be achieved by combining video and audio recognition processing. I don't know if they have any patents, we all know some prior "art" from 2001, er.. 1968. HAL's accomplishment was also mentioned by CNN during 2001 in an article about this group's work."
This discussion has been archived. No new comments can be posted.

  • by burgburgburg ( 574866 ) <splisken06NO@SPAMemail.com> on Monday April 28, 2003 @06:40PM (#5829714)
    Thick mustaches.

    Men and women, boys and girls. All with really thick, dirty, obscuring mustaches.

    What is this world coming to?

    • No, this is the era of ventriloquism, as the ventriloquists will rise up against those who mocked them; their plans for unholy vengeance will go unnoticed as our only safety net collapses.
    • Well, on a positive spin, maybe some of us could then get new jobs. [ronjeremy-themovie.com]
  • Good or Evil? (Score:5, Insightful)

    by Blaine Hilton ( 626259 ) on Monday April 28, 2003 @06:41PM (#5829718) Homepage
    That's all we need, now everybody and his brother can easily create software applications to log everything. Security cameras record a lot of movements, but imagine hooking that up to lip readers and then being able to grep through all of that text output? Total Information Awareness here we come......

    Go calculate [webcalc.net] something

    • a reason to really hunker down and learn an obscure Chinese dialect.

      I've been putting it off for far too long.
    • Of course, it seems that the camera would need near head-on, uninterrupted vision of the person whose lips are being read, which is not an ideal surveillance situation and makes it pretty much impossible to track actual conversations (with people facing one another).

      I'd be scared of speaking to my computer now, tho - can you imagine a virus that uses your own webcam or whatever to see what you're saying when you're sitting in front of the screen?

      I guess no more webcam sex for me. =(

    • Sigh... (Score:3, Interesting)

      by ScoLgo ( 458010 )
      Sigh... the signal to noise ratio alone is enough to lend you reasonable anonymity. There's just way too much information that would need to be grepped through in order to listen in on your dinner conversation. No one, (or their Big Brother), is going to bother unless they have a really good reason to be investigating you in the first place.

      I'm thinking that the 'good' will outweigh the 'evil' here...
      • Re:Sigh... (Score:4, Interesting)

        by shaitand ( 626655 ) on Monday April 28, 2003 @07:37PM (#5830145) Journal
        How about having it record everything it picks up and time-coding it, so that you grep for the words "revolution", "bomb", or "nuts itch", and then cross-reference the hits to the time sequence in the video? This is then passed on to the FBI as routine policy for "the war on terror".
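        The grep-plus-timecode idea above amounts to a few lines of scripting. A toy sketch (the watch words, transcript data, and `flag_segments` helper are all hypothetical):

```python
# Toy sketch: scan a time-coded transcript for watch words and return
# the video timestamps to cross-reference. All data here is made up.

WATCH_WORDS = {"revolution", "bomb"}

def flag_segments(transcript):
    """transcript: list of (timestamp_seconds, text) pairs."""
    hits = []
    for ts, text in transcript:
        matched = set(text.lower().split()) & WATCH_WORDS
        if matched:
            hits.append((ts, sorted(matched)))
    return hits

transcript = [
    (12.0, "the quarterly revolution in chip design"),
    (47.5, "pass the salt please"),
    (93.2, "that movie really was a bomb"),
]
print(flag_segments(transcript))  # -> [(12.0, ['revolution']), (93.2, ['bomb'])]
```

        Note that both hits here are innocuous uses of the words, which is exactly the signal-to-noise problem raised in the parent comment.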
        • Re:Sigh... (Score:2, Interesting)

          by ScoLgo ( 458010 )
          Well, it's possible that my tinfoil hat is on crooked today...

          From the Reg article... "Intel's announcement implies that the system works better when coupled with facial recognition to identify 'known' speakers."

          Doesn't this imply that, at least for the foreseeable future, this technology won't be easily used as some general Orwellian tool? It sounds as though it needs to 'learn' each speaker - much like voice recognition software has to be trained to your voice before it can be used accurately.

          From the
      • Maybe now that's true, but as computing power continues its exponential growth, Big Brother will have the computing power necessary to analyze more and more data.

        Twenty years from now, why not hear everything?
    • If this is able to reach accurate reading in less than ideal circumstances, the only way to ensure the government does not act outside its mandated bounds would be by having a completely open society.
    • by bryanthompson ( 627923 ) <logansbro.gmail@com> on Monday April 28, 2003 @07:17PM (#5829999) Homepage Journal
      ersonally-pay, i-ay(?) erfer-pay o-tay use pig latin.

      geeze, that really wasn't worth the effort...

    • This reminds me of the Seinfeld episode where George wants to "borrow" Jerry's deaf girlfriend to read people's lips. George and Jerry try to hide their lips from her as they discuss her lip-reading abilities, then she can read their lips anyway.
  • Oh wait, that was a different lip reading session...
  • by KPU ( 118762 ) on Monday April 28, 2003 @06:46PM (#5829762) Homepage
    Anybody else reminded of the Read My Lips [www.atmo.se] videos that fit clips to songs?
  • by yeoua ( 86835 ) on Monday April 28, 2003 @06:46PM (#5829763)
    Maybe now, with a cluster at our fingertips and this sound-visual lip analyser thing, we may (finally) be able to understand what all those heavily accented foreign professors are actually mumbling about...

    And well, beats manual note taking if the computer can read the board and his mouth and his voice.
  • by skermit ( 451840 ) on Monday April 28, 2003 @06:50PM (#5829795) Homepage
    A couple months ago, a very fine article was posted to /. about work at MIT regarding speech-->video synthesis using pre-recorded syllables. This means in the near future we'll be able to have avatars which can communicate with other people by videophone and/or other computers, should we wish to do so. I'm reposting the old link because it got /.'ed for about 2 months (the professor took down the link) before the vids went back up. So check out the amazing work that's on the flip-side of this article.

    http://cerboli.mit.edu:8000/research/mary101/results/results.html [mit.edu]
  • by Smallpond ( 221300 ) on Monday April 28, 2003 @06:52PM (#5829815) Homepage Journal

    Body language should be even easier than lip reading. I want to know if I'm wasting my time or whether I should invite her back to my place.

    • by Suchetha ( 609968 ) <suchetha@@@gmail...com> on Monday April 28, 2003 @07:15PM (#5829988) Homepage Journal
      simple.. you're posting on /. .. face it.. you're wasting your time

      Suchetha
      • by Anonymous Coward
        LOL ROTFLMAO!!! You so funny witty slashdot poster!! You make old joke seem fresh and not like old stinky turd!!! Please share more surprising-delivery of classic-style joke!!

        You: Did you know chickenz cross road to not be on old side?
        Us: LOL!! You so funny!!

        You: Black peoples are funny with watermelon in their cadillac!
        Us: OMG - its funny cause its funny cause itr true!!!

        You: Please, take my wife!
        Us: Ha Ha ha ah ah aH

        You: Nerd get no sex!!!! Ha ha ha
        Us: Funny but true but sad but funny to

    • "Gesture recognition" is even easier. Want me to interpret that last one for you, Dave?
    • by graveyhead ( 210996 ) <fletch@@@fletchtronics...net> on Monday April 28, 2003 @09:28PM (#5830786)
      Lyndsey Nagle: Do I detect a note of sarcasm?
      Frink: (With sarcasm detector) Are you kidding? This baby is off the charts mm-hai.
      CBG: A sarcasm detector, that's a real useful invention.
      (Sarcasm detector explodes)
    • The interesting thing about this is that body language is an *important* part of lip reading. Facial expressions and gestures can add a lot of meaning to communication... I wonder what type of gesture recognition this system claims to have.
  • Wow, that must have taken a lot of hard work to do. First you'd have to recognize the location of the lips in the images (they might not stand out that much, especially in a crowd scene), then find the region in which the lips are moving, then finally use the positions of the lips to extrapolate the current shape of the inside of the person's mouth and make a haphazard guess at the sound being produced. And you'd need to be able to recognize the lips from any angle whatsoever. Sounds near impossible to me... and besides, by the point at which the person is beyond the range of a security camera's audio pickup (I'm assuming that's what this would be used for), the resolution would also be too poor to be useful (unless the target is in a crowd, in which case the lips would frequently be obscured by people moving around in front of the target).
    • Hey, and what about Chinese? Reading inflection would be near impossible, even if you looked at the person's voicebox (assuming it's visible).
      • by Nihilanth ( 470467 ) <chaoswave2&aol,com> on Monday April 28, 2003 @07:05PM (#5829910)
        yeah, a lot of asian languages rely on internal vowel sounds that make lip-reading nearly impossible. Maybe if they used lasers to measure the sound pressure waves, or vibrations of the voicebox in conjunction with the lipreading.
        • That second one could work, but can lasers measure pressure fluctuations? I would think that air wouldn't reflect a laser, and if one measures the pressure by the speed of light through the medium (high pressure will slow it down slightly), you'd need a reflector of some sort...
          • All I wanted was a lip-reading computer with a frickin pressure-sensing laser beam attached to its USB port. Was that too much to ask?
          • Why the hell would they need a laser?? I think if it's gone that far they might just consider a MICROPHONE! WTF guys?! I know it's news for nerds, but... I can only think of a few applications for this in which a microphone would be useless and a laser-with-mirror setup couldn't help with sound pickup. Spying from tall buildings, maybe, but I can't imagine how much zoom those cameras would have to have and how stable they'd have to be to lip-read anything from such a ridiculous distance. The manpow
            • relying on sound pressure waves has its limitations, especially in noisy or crowded areas. The number of microphones needed to reliably process everyone's conversations would have to grow exponentially with the amount of noise in the area. Relying on a visual information source might be trickier given the current state of computational power, but from a pure signal standpoint, it isn't affected by the ambient noise level or interference from other conversations. it basically lets you sidestep the problem of sor
          • laser microphones have been widely used in James Bond-esque espionage situations for years (those of us who've played Splinter Cell were forced to use one more than once); basically, a laser microphone tunes into a conversation by measuring the vibrations it induces in a plate of glass.

            Sound pressure waves cause the density of air to fluctuate, which would bend the path of a beam of light travelling through it.

            basically, you'd need more than one laser in this situation, i think, you'd need l
        • i'm beginning to wonder if a microphone might be the simpler solution. too bad mics don't have that Gee-Whiz factor we love so much
    • A lot of these issues were taken care of a long time ago. In 1996 several of my colleagues published a simple system for doing this in real time (including integrating sound and video together for speech recognition) at the European Conference on Computer Vision -- CiteSeer link to paper [nec.com] -- and there are several other papers from that same epoch describing similar systems. Clearly Intel has a more complete system than these papers (as you would expect given 7 years), but it's not as hard as you're making it s
      • Impressive... ;:o (I'm an amateur coder and am fazed easily by complex-sounding projects)
      • I'm trying to transcribe some tapes of lectures right now, and I'm looking for an easy way out. I know speech recognition programs are out there, but from what I know, they need significant training of the user with the program in order to work.

        Unfortunately, my voice is not the one giving the lectures, and there are actually two or three different lecturers. Since training is impossible (AFAIK, at least), I'm wondering how far speech-to-text technology has come, especially in the open source community. Ca
    • Comment removed based on user account deletion
  • by luzrek ( 570886 ) on Monday April 28, 2003 @06:53PM (#5829818) Journal
    Unlike HAL the Planet Express Delivery Ship cannot read lips.

    Fry, Leela, and Bender are hiding out in the shower discussing how to turn off the Planet Express Delivery Ship. The little red light is on, and the screen is scrolling back and forth between the lips as Leela gives orders and Bender objects. Then the ship says, "Oh, if only I could read lips!"

  • Cameras randomly zooming in on the lips of the crowd; if somebody says something from some "list" of words, they keep tracking that person and run face recognition as well.
  • by DeadScreenSky ( 666442 ) on Monday April 28, 2003 @06:55PM (#5829835)
    ... but I think it is interesting that Arthur C. Clarke thought HAL reading lips was the only implausible scene in the film. [boraski.com] You know, as opposed to the whole aliens thing. :P Just goes to show you the perils of trying to predict the future...
  • Call me cynical but has this been released as open source so it will be rapidly improved before being used in an Intel product?
  • Usage of IRC across the globe suddenly drops as users are dismayed by the number of people asking to sweep with them.
  • by raehl ( 609729 ) <(moc.oohay) (ta) (113lhear)> on Monday April 28, 2003 @07:06PM (#5829912) Homepage
    I may have done better in my AI class if I'd been able to read Lisp. All those damned parentheses made life very difficult.
  • by Metallic Matty ( 579124 ) on Monday April 28, 2003 @07:06PM (#5829915)


  • by ektor ( 113899 ) on Monday April 28, 2003 @07:08PM (#5829931)
    No... more... taxes.
  • I know that English is one of the easier languages to "lip read." It goes back to the Latin roots, and such. I'm sure that using slang will make it much harder, but I'd be curious how it works with other languages. I think that Japanese (when spoken clearly, and not using dialects) would be incredibly easy, whereas Chinese could be very difficult. If anybody has the time and a desire to hack on it, keep me posted if you do multi-lingual work. I'm really curious how it goes.
    • I would imagine any language that involves clicking sounds would be difficult. As well as bird calls.
    • If I've understood things correctly, Chinese uses both formants and pitch to signify meaning. A formant is a distinct sound, like A or O, which is recognizable at any pitch, and that part can be lip-read.

      But can pitch be lip-read? If not, would a system like this work at all for languages that use pitch as well as formants to distinguish between words?
  • I only use sign language!

    fools...
    ummm wait.
  • I've investigated Intel's vision library, OpenCV, before... and it does appear to be available for Linux if you look hard enough... but I couldn't find any Linux applications using it to actually *do* something.

    Has anyone had any success with OpenCV/Video4Linux?...
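    OpenCV aside, the very first step such a library has to do -- finding the patch of the frame where the lips are moving -- can be sketched in plain NumPy with frame differencing. This is a toy illustration with synthetic frames, not OpenCV's actual algorithm; `most_active_cell` and the 8x8 "video" are invented for the example:

```python
import numpy as np

# Toy sketch: locate the most active region between two frames by frame
# differencing -- the same first step a lip tracker needs. Frames are
# synthetic 8x8 grayscale images; only the "mouth" patch changes.

def most_active_cell(prev, curr, cell=4):
    """Return (row, col) of the cell-sized block with the largest change."""
    diff = np.abs(curr.astype(int) - prev.astype(int))
    h, w = diff.shape
    best, best_score = (0, 0), -1
    for r in range(0, h, cell):
        for c in range(0, w, cell):
            score = diff[r:r+cell, c:c+cell].sum()
            if score > best_score:
                best, best_score = (r, c), score
    return best

prev = np.zeros((8, 8), dtype=np.uint8)
curr = prev.copy()
curr[4:8, 4:8] = 200  # the moving "lips" in the lower-right block
print(most_active_cell(prev, curr))  # -> (4, 4)
```

    A real tracker would of course run a face detector first and track feature points rather than raw blocks, but the motion cue is the same.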

  • by djoham ( 93430 ) * on Monday April 28, 2003 @07:16PM (#5829993)
    ...someone recording to video a person *speaking* the source code of DeCSS and then using this tool in combination with gcc to generate libDVDCSS?

    Would this tool then be declared a "circumvention device" under the DMCA, or would the courts finally realize that code can be considered protected speech? The code was, after all, spoken in its original form in this case.

    This same question could also be applied to audio-to-text converters as well. Maybe there's hope the DMCA will be declared unconstitutional after all.

    Interesting food for thought...

    David
  • Prior Art (Score:4, Informative)

    by cperciva ( 102828 ) on Monday April 28, 2003 @07:24PM (#5830051) Homepage
    Software and business model patents have evidently affected comprehension of what a patent entails.

    "A computer, examining a set of video images, to perform lip reading" is not patentable. HAL would be prior art for this; but it doesn't matter because there isn't any inventive step here anyway.

    "A computer, processing a set of video images by locating what appears to be a set of lips, selecting recognizable points, using the movement of those points to track the deformation against a 3D model, comparing against a table of syllables to compute the probability of each particular syllable, and using knowledge about a language to determine which syllables are most likely to follow each other" could be patented. HAL would not be prior art for this, because there is no indication of how HAL performed the lip reading.
  • Fox News (Score:2, Funny)

    by Jru Hym ( 609379 )
    It probably wouldn't work for Greta "Lips" Van Susteren [foxnews.com]
  • Read my lips: "No new invasions of privacy... hey, wait a minute!"
  • by GoBears ( 86303 ) on Monday April 28, 2003 @07:36PM (#5830134)
    I don't know if they have any patents, we all know some prior "art" from 2001, er.. 1968.

    patents are supposed to be on inventions, not ideas. (very) generally speaking, you have to demonstrate you know how to do something for it to count as prior art. actually building something counts, as does a patent application (since the patent application has to explain how the invention works at a reasonable level of detail, for an admittedly arguable legal definition of reasonable).

    ianal, but the last i heard, a mention in a science fiction book or movie wouldn't typically be considered prior art. a person skilled in the art can't tell from 2001 how to make a computer read lips.

  • The evil trolls inside my head keep trying to make a joke about women, scanners and a lack of pants, but it's just not coming together.
  • Oh oh!! L-I-P-S!!

    First I thought, Jeeze... I can already read Lisp, emacs style...

    Then... ohhhh... they mean lisps... like a speech impediment... That would be cool, to read lisps.

    But reading lips makes much more sense.
  • by RhettLivingston ( 544140 ) on Monday April 28, 2003 @07:42PM (#5830178) Journal
    This would be a huge advance in speech recognition if it does no more than allow input from a camera to aid in separating out which sounds came from which speakers. Simply fixing the background noise problem would be a huge advance.
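    One simple way such audio-visual fusion is often framed: weight the log-probabilities from the two channels, shifting weight toward the lips as the audio gets noisier. The words, probabilities, and `fuse` helper below are all invented for illustration; note "f" is visible on the lips while "b" is easily masked by noise:

```python
import math

# Toy sketch of audio-visual fusion: combine per-word scores from an
# audio recogniser and a lip reader by weighting their log-probabilities.
# All numbers are made up.

def fuse(audio_probs, visual_probs, audio_weight):
    vw = 1.0 - audio_weight
    scores = {w: audio_weight * math.log(audio_probs[w]) +
                 vw * math.log(visual_probs[w])
              for w in audio_probs}
    return max(scores, key=scores.get)

audio = {"bat": 0.6, "fat": 0.4}    # audio slightly prefers "bat"
visual = {"bat": 0.3, "fat": 0.7}   # the visible "f" makes lips prefer "fat"

print(fuse(audio, visual, audio_weight=0.8))  # trust audio -> "bat"
print(fuse(audio, visual, audio_weight=0.2))  # heavy noise: trust lips -> "fat"
```

    Picking `audio_weight` from an estimate of the ambient noise level is what makes the fusion adaptive.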

  • This could solve my fundamental beef with speech as an interface - privacy! Dictating email and documents would be great, if I didn't have to broadcast to everyone around me. Not to mention the annoyance of hearing the guy in the next cube complain to his girlfriend over IM...

    Mouthing words silently takes some getting used to, but it has advantages. No more trying to type on a tiny PDA keyboard - etc. Obviously this is a ways off, but it seems doable.
  • but when can I get this on my desktop? it would be really neat to chat through IRC without making a sound. oh wait...
  • The article submitter says:
    "I don't know if they have any patents, we all know some prior "art" from 2001, er.. 1968. HAL's accomplishment was also mentioned by CNN during 2001 in an article about this group's work."

    Is there not a difference between the idea and the way the technical solution is implemented? Meaning they cannot patent the idea, but they can still patent the way the code itself works.

    Just curious. What does everyone think?
  • HAL, as in "2001" for one thing. You all know about that already.

    The REAL THREAT of this is "them" using cameras to look at people from afar (or by whatever means) and eavesdropping on people when they can't get a microphone in..

    You can be sure that H.L.S. will jump on this like white on rice...

  • It's been done at Carnegie Mellon as well [cmu.edu].
  • I can imagine the source video material quality may be quite critical to this. It would be much easier to process a signal from a DVD, for example, than a composite video camera.

    But then on a DVD you'd just hit the subtitle button and problem sorted :)
  • The human mind parses speech by using both senses of sight and sound. They demonstrated this on the news one time by repeating a word over and over. They instructed the viewers to look at the screen while listening, then at some random time, to close their eyes and then open them again after waiting an interval, all while continuing to listen. Sure enough, when I closed my eyes, the word I heard was a completely different word, even though when I looked at the screen, I wasn't necessarily looking at the per
  • Sports (Score:2, Funny)

    by Dynastar454 ( 174232 ) *
    I know what I want this for- I want to read the lips of all the coaches and players during basketball/baseball/whatever broadcasts. Maybe ESPN could offer this as a feature, censoring as needed. :-)
  • In Soviet Union .... computer reads you.
  • by nloop ( 665733 ) on Tuesday April 29, 2003 @03:22AM (#5832202)
    I have taken many years of ASL classes and am pretty involved with Deaf culture; one of the biggest myths about it is people's ability to read lips.

    The idea most people have of lipreaders, like in the movie See No Evil, Hear No Evil (the Richard Pryor/Gene Wilder comedy) or the Seinfeld lip-reader episode, just really isn't possible. Many sounds such as "t" and "d" look exactly the same, and many such as "k" and "g" are not visible at all. The best lipreaders really can only get 2/3 of what is being said (if they are entirely Deaf, which many Deaf people are not; if your hearing loss is not total it can be far more effective), and that is with the person speaking slowly and facing them, plus human intuition (context). Throw in facial contortions (like yelling... "they can't hear me, so if I yell it will help"), low light, bad angles, fast talking, etc., and the accuracy drops dramatically.

    Computers lack the ability to figure out what word is being said based on context when the lips don't provide adequate information. They are also historically terribly poor at things like complex image recognition. What is registration-script busting based on? Image recognition with noise in the image (i.e., type the word that appears in the next form box). And no one has even come close to a functional computer ASL interpreter, even though ASL is far easier to distinguish visually than speech.

    I don't see the 40% word error rate it currently has improving much at all, and I'm guessing the video feed it works from isn't anything like full-speed, non-exaggerated human speech.

    Your fears of the video cameras on the streets logging your conversations are pretty unfounded ;)
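    The t/d and p/b/m collapse described above is usually formalized as "visemes": groups of phonemes that look identical on the lips. A toy sketch (the grouping table is simplified and hypothetical, not a standard viseme inventory):

```python
# Toy illustration of why "t"/"d" (and "p"/"b"/"m") defeat lip readers:
# map letters to viseme classes and see which words become visually
# indistinguishable. This grouping is simplified and hypothetical.

VISEME = {
    "p": "PBM", "b": "PBM", "m": "PBM",   # lips pressed together
    "t": "TD",  "d": "TD",                # tongue behind teeth, unseen
    "f": "FV",  "v": "FV",                # lip on teeth -- clearly visible
    "a": "A",   "e": "E",
}

def viseme_string(word):
    """What the word 'looks like' on the lips."""
    return " ".join(VISEME[ch] for ch in word)

# "bat", "pat" and "mat" all collapse to the same visual sequence:
print(viseme_string("bat") == viseme_string("pat") == viseme_string("mat"))  # True
# ...while "fat" stays distinguishable:
print(viseme_string("fat") == viseme_string("bat"))  # False
```

    Whole word classes collapsing like this is why even the best human lipreaders top out around 2/3 accuracy, and why the software's error rate is unlikely to drop much below that floor.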
  • My computer hath been able to read lithpth for yearth.
  • Puts me more in mind of Coppola's The Conversation.

    The idea of combining it with speech recognition in an adaptive fashion, using one source to cross-check the other, could open up a whole new area of privacy invasion.

    Imagine this stuff running on all the CCTVs in the town where you live...

  • In French, the word "benjamin" and the French translation of "eat shit" have exactly the same lip-movement pattern. Just a thought.
