Volker Dellwo is a voice catcher. Recently he listened to John Bercow, Speaker of the British House of Commons, giving a lecture at the University of Zurich. Dellwo, professor of phonetics at the Institute of Computational Linguistics at the University of Zurich, is interested in the unique characteristics of different voices, such as Bercow’s booming sound. Why do we recognize certain voices immediately? When we think, for example, of Barack Obama, Angela Merkel or Steve Jobs, we immediately hear their voices in our heads. People use their voices to stand out from the crowd, and they even use easily recognizable voice features to shape their personas, says Dellwo. Speaker John Bercow uses his voice to great effect with his unmistakable calls of “Ooorder! Ooorder!” when things get too uproarious in the House of Commons. It is probably the aggressive rounding of the lips that gives Bercow’s voice its particular quality, guesses Dellwo.
A few years ago, phonetics had the reputation of being a slightly esoteric and arcane subject. Since the advent of digitalization, however, phonetics has been in demand everywhere: “smart” AI systems and robots need good speech recognition systems; banks, post offices and other companies are increasingly relying on biometric authentication using automatic voice recognition for telephone advice lines; and phonetics is also in demand in forensics when the police are looking for criminals and want to analyze the voices of suspects.
The jawline of a voice
Volker Dellwo is sitting at his narrow office desk. The pitch of his voice rises, he modulates the tonality, gets louder and then quieter again. Dellwo, in his mid-forties, is tall, with a long narrow head and deep-set eyes. The sound of a voice is determined by individual anatomy: the size of the larynx and jaw, the length and width of the vocal cords, the shape of the tongue, the length of the throat, the size of the skull. Exhaled air causes the vocal folds to vibrate; the voice is then formed and shaped into sound in the pharynx, mouth and nose – the vocal tract – with the help of the tongue, lips and palate. In principle, it should therefore be possible to infer the anatomy of a speaker from the sound of their voice. That possibility is still in the future – Dellwo is currently developing a system that can create an identikit image of the speaker’s jaw from their voice.
Volker Dellwo’s voice has a slight hissing sound when he pronounces ch and sh – probably as a result of his upbringing in Trier on the Moselle in Germany. It was at the university in Trier that he discovered phonetics, after a detour through English and German language and literature. It was above all the music and acoustics of language that interested him. He continued his studies in Bonn and then Jena. Dellwo has always trodden his own path, he says with an impish grin. He financed his studies by playing music, touring the whole of Germany with his folk band in which he played the flute and bagpipes. Yes, he laughs, he still plays the pipes, though nowadays more often only in his head.
On the whiteboard are sketches of room plans: plans for a professional recording studio. A quantum leap will be required for the development of the voice recognition systems of the future. Until now, says Dellwo, speaker-specific characteristics have been neglected in the field of phonetics. To make voice recognition systems fit for the digital age, neural networks will need to be used to capture and decode the variety and diversity of human voices, explains the phonetician.
Our AI robot friends still have a lot to learn – in certain situations, humans are still streets ahead of machines in recognizing voices, especially if someone deliberately alters their voice. “Siri, Alexa & Co have little sense of humor and irony. It’s easy to outsmart them by complaining in a whining tone about how great things are going,” says Dellwo. A voice that is muffled by a head cold, for example, can also cause problems for voice recognition.
Dellwo straightens up. As he recounts an astonishing experiment, his speech gets faster, more insistent and authoritative, with only brief pauses. For a voice recognition system to work, a computer first has to be trained and get to know a voice, he explains. Then the computer is supposed to recognize the voice – but that is challenging, as we sound very different in different situations.
When a mother speaks to her baby, for example, she uses an incredible variety of sounds, melodies and rhythms. Dellwo has now found that computers that are first trained with a voice directed at babies can also recognize voices directed at adults more easily. That is no coincidence, says Dellwo. Rather, according to his hypothesis, evolution has seen to it that “babies will always recognize the voice of their mother in every situation and with every variant. That’s why mothers use the whole range of their voices when speaking to their babies.”
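Dellwo's actual experiments are not described here in technical detail, but the intuition behind them can be sketched. A system enrolled on varied, infant-directed speech learns the speaker's whole range; one enrolled on monotone speech learns a needlessly narrow one, and then flags ordinary variation as a stranger. Everything below – the three features, the numbers, the scoring – is invented purely for illustration:

```python
import statistics

def enroll(samples):
    """Build a speaker profile: per-feature mean and spread of the training samples."""
    means  = [statistics.mean(f)  for f in zip(*samples)]
    spread = [statistics.stdev(f) for f in zip(*samples)]
    return means, spread

def mismatch(profile, sample):
    """Average z-score: how far the sample lies outside the learned range."""
    means, spread = profile
    return sum(abs(x - m) / s for x, m, s in zip(sample, means, spread)) / len(sample)

# Hypothetical per-utterance features: (mean pitch in Hz, pitch variability,
# syllables per second). Infant-directed speech covers a wide slice of the
# speaker's range; flat, monotone speech covers almost none of it.
infant_directed = [(260, 80, 3.0), (310, 95, 2.5), (220, 60, 3.5)]
monotone        = [(210, 20, 4.8), (212, 22, 4.6)]

broad  = enroll(infant_directed)
narrow = enroll(monotone)

adult_directed = (225, 35, 4.0)  # the same speaker, ordinary adult-directed style
print(mismatch(broad, adult_directed))   # low: inside the learned range
print(mismatch(narrow, adult_directed))  # high: outside the narrow range
```

The broad profile tolerates the adult-directed sample because it has already seen that much variation; the narrow profile treats the same sample as a mismatch.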
Voice comparison for police
Can voices be used to identify someone, like a fingerprint or DNA? Dellwo has been working with the Zurich police for a long time. One of the things he helps them with is providing expert opinions based on forensic voice comparison. Are the defendant and the suspect the same person? Identification beyond any trace of doubt is not possible – even fingerprints or DNA can’t give absolute certainty, says Dellwo. But if good-quality voice recordings are available, a relatively accurate assessment can be given. Voice analysis can also help the police track down a suspect in the first place. When profiling a suspect, characteristics filtered out of a recording – lisping, tongue-clicking, dialects, accents or sociolects (the speech of a social group or class) – can give valuable clues about the speaker’s background.
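Forensic voice comparison is usually reported not as a yes/no identification but as a likelihood ratio – how much more probable the recorded evidence is if the suspect is the speaker than if someone else is – which matches Dellwo's point that absolute certainty is impossible. A minimal sketch, with invented numbers and a single invented feature standing in for a real acoustic analysis:

```python
import math

def gauss(x, mean, sd):
    """Normal probability density."""
    return math.exp(-((x - mean) / sd) ** 2 / 2) / (sd * math.sqrt(2 * math.pi))

# Invented values: the measured feature (say, mean pitch of the disputed
# recording) is 118 Hz. Under the same-speaker hypothesis, values cluster
# near the suspect's 120 Hz; under the different-speaker hypothesis, they
# follow the wider population distribution around 130 Hz.
evidence = 118.0
lr = gauss(evidence, mean=120.0, sd=5.0) / gauss(evidence, mean=130.0, sd=15.0)
print(lr)  # > 1 supports "same speaker", < 1 supports "different speaker"
```

A ratio of a few means the evidence moderately favours the same-speaker hypothesis – support, not proof, which is exactly the kind of assessment an expert opinion can offer.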
In the case of the IS terrorist “Jihadi John”, who, with a masked face, repeatedly executed hostages on film, the decisive clue in the search came from forensic phoneticians, who were able to deduce the origin and immediate surroundings of the perpetrator from his voice, and thus identify him.
Voice fakes of the future
In other areas of life too, computer-assisted voice analysis is getting more and more important. Swisscom and Postfinance recently introduced automatic voice recognition to quickly identify customers on the phone. The system compares the voice it hears with the customer’s previously registered voice. But what if the customer is hoarse? Even if the system recognizes the customer, the advisor will still ask a few additional security questions.
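How Swisscom's and Postfinance's systems work internally is not public. As a rough illustration, though, such a verifier might map each call to a fixed-length voiceprint vector and compare it with the registered one, with a grey zone that triggers the advisor's additional security questions. The four-number vectors and the thresholds below are invented:

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

def decide(registered, heard, accept=0.95, reject=0.70):
    """Three-way decision: identify, fall back to questions, or reject."""
    score = cosine(registered, heard)
    if score >= accept:
        return "identified"
    if score >= reject:
        return "ask security questions"  # e.g. the caller sounds hoarse
    return "rejected"

registered    = [0.9, 0.1, 0.4, 0.7]      # stored voiceprint (invented)
same_caller   = [0.88, 0.12, 0.42, 0.68]  # the same voice on a good day
hoarse_caller = [0.5, 0.3, 0.5, 0.6]      # the same voice with a head cold
stranger      = [0.1, 0.9, 0.1, 0.2]      # a different speaker
```

The grey zone is the design point the article describes: a hoarse customer is neither confidently accepted nor rejected, so the advisor asks the extra questions.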
Developments are even further ahead in the area of voice cloning. Together with neuroscientists, Volker Dellwo is working on cloning a voice. In the age of social media, there is enough voice material freely available to build an artificial voice profile. Things will get interesting when the clone starts to speak. Dellwo and the researchers want to find out what the consequences will be if a synthetic voice is no longer perceived by the listener as artificial. One thing is for sure: voice fakes will soon be as much a part of life as photoshopped images.
Dellwo’s voice is now slightly lower, he is speaking more slowly and sounds a bit more relaxed. He is talking about his houseboat, which for the time being is on the river Saar near Saarbrücken. He has a picture of it on the wall next to his office door. The boat is painted white and blue, and the interior is fitted out with lots of wood. It always needs some work doing on it. Dellwo has not been able to find a home for the boat in Switzerland. Before he came to the Institute of Computational Linguistics at UZH, he worked for nearly 10 years as a lecturer at University College London. There, he was able to keep his boat on the river Lea, a tributary of the Thames. He plans to bring his floating holiday home a little nearer soon, to Strasbourg, where it already spent several years prior to its London adventure. But for his voice analyses Dellwo prefers to stick to the office – and give his ears a break on the houseboat, with only the quiet lapping of water to break the silence.