Does the manual go into how the pair of voice samples must be chosen so that the resulting probabilities are meaningful? Do they mention pairs where one is screams and the other is speech?
The manual provides very little detail about the underlying model; SpeechPro states that the specific algorithms are proprietary. Most importantly, they do not explain how the "matching percent" is calculated, even though it is the key output of the comparison. Unless STC gives Owen their "secret recipe," he cannot answer any specific questions about the key component of his analysis. STC also specifically states that the level-of-match indicator (no color, grey, yellow, or green) is based on thresholds that the user does not know and cannot change. To me as a scientist, this absolutely disqualifies any output of EVB as science: if you can't explain what your numbers mean, the results cannot be scrutinized or repeated. I do not know why the defense is not focusing on this issue.
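For context on what is being hidden: speaker-verification systems of this generation typically score a comparison as a log-likelihood ratio between a speaker model and a universal background model (UBM), then map that score onto some display scale. Here is a minimal sketch of that idea; the 1-D Gaussian scorer and the sigmoid "matching percent" mapping are my own assumptions for illustration, since EVB's actual algorithm and mapping are exactly what SpeechPro does not disclose:

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian; stands in for a full GMM."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def llr_score(features, spk_mean, spk_var, ubm_mean, ubm_var):
    """Average log-likelihood ratio: speaker model vs. background model."""
    return sum(gaussian_logpdf(f, spk_mean, spk_var) -
               gaussian_logpdf(f, ubm_mean, ubm_var)
               for f in features) / len(features)

def matching_percent(llr, scale=1.0):
    """Hypothetical mapping from raw score to a 0-100 'matching percent'.
    This mapping, and the thresholds that color the result grey/yellow/green,
    are the undisclosed pieces."""
    return 100.0 / (1.0 + math.exp(-scale * llr))

features = [0.9, 1.1, 1.0, 0.8]  # toy acoustic features
score = llr_score(features, 1.0, 0.25, 0.0, 1.0)
print(round(matching_percent(score), 1))
```

The point of the sketch is the last step: without knowing the mapping (and the color thresholds applied to it), a given "matching percent" has no interpretable meaning.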
If you go to SpeechPro's EVB site (not Owen's EVB site), the overview of the software states: "The product is designed to assist analysts but not to substitute for them. The court verdict cannot be based on the results of any automatic system regardless of what we compare: voice, face, fingerprint or footprints. Only the forensic expert who carried out the investigation using all scientifically available methods and tools can make a conclusion."
SpeechPro envisions their product as purely an investigatory tool, not something to be relied on in court.
In the manual they do state that 3 seconds is the minimum required speech content. Speech content is only the part of the sample that EVB recognizes as speech; it automatically discards silence, noise, etc. So Owen originally put in 7 seconds of audio by combining all the shouts/screams. EVB whittled this 7 seconds down to 2 seconds of speech, which was not enough. The manual says that the error message "Voice model cannot be created." is caused by: "The program cannot detect biometrical characteristics. The reason is, probably, that the speech duration in the file is less than 3 seconds and the program will not process such files."
If you look closely at the screen in the Soledad O'Brien interview of Owen in my post above, Owen has specifically marked the file as containing only 2 seconds of screams, yet the calculated speech time shown is 4 seconds. He has obviously doubled the 7-second file containing 2 seconds of speech to get 4 seconds of speech; otherwise he couldn't do his analysis. The 2 seconds of speech matches Nakasone's testimony.
I downloaded the EVB demo. The demo is limited to a set of audio samples that they provide. When comparing most same-speaker recordings, the matching percent is high. However, they do include a sample of a person whispering and another sample of the same person talking over a noisy phone line; that pair gives a 49% match. George had a 48% match. This certainly shows that the software's ability to identify speakers degrades substantially when the context and channel of the two samples and the UBM do not match.