We are happy to announce that the following distinguished scholars have agreed to deliver keynote lectures (in alphabetical order):

  • Prof. Dafydd Gibbon (Bielefeld University, Germany)
  • Prof. Daniel Hirst (Aix-Marseille Université, France)
  • Prof. Sonja Kotz (Maastricht University, the Netherlands)
  • Prof. Andrew Rosenberg (IBM Research AI, USA)
  • Prof. Jianhua Tao (Chinese Academy of Sciences, China)

Prof. Dafydd Gibbon

Dafydd Gibbon is emeritus professor of English and Linguistics at Bielefeld University and is currently visiting professor in linguistics and phonetics at Jinan University, Guangzhou, China. His publications on prosody started with Perspectives of Intonation Analysis (1976) and continued with the collection Intonation, Accent and Rhythm: Studies in Discourse Phonology (1984), the collection Rhythm, Melody and Harmony: Studies in Honour of Wiktor Jassem, and numerous articles and conference contributions on aspects of intonation, tone and speech timing. Specific contributions to the study of prosody include the three-way semiotic distinction between structure, form and function in prosody; the application of the rank-interpretation architecture to prosody; finite state models of tone; the concept of prosody as metalocution; and the computation of time trees from speech annotations. A further area of specialisation has been language documentation for heritage preservation, linguistics and speech technology, as lead editor and co-editor of three handbooks in these fields (1997, 2000, 2012). He has received awards from the Polish Phonetics Association, the Linguistic Association of Nigeria and the Ivory Coast government for contributions to linguistics, phonetics and speech technology, including aspects of the prosody of endangered languages in West Africa.

Title of the talk: "The Future of Prosody: It's About Time"

Prosody is usually defined in terms of the three distinct but interacting domains of pitch, intensity and duration patterning, or, more generally, as the phonological and phonetic properties of ‘suprasegmentals’, speech segments which are larger than consonants and vowels. Rather than taking this approach, the concept of multiple time domains for prosody processing is taken up, and methods of time domain analysis are discussed: annotation mining with timing dispersion measures, time tree induction, oscillator models in phonology and phonetics, and finally the use of the Amplitude Envelope Modulation Spectrum (AEMS). While frequency demodulation (in the form of pitch tracking) is a central issue in prosodic analysis, the focus in the present context is on long time domain spectra obtained by amplitude envelope demodulation. Using this method, multiple rhythms are described as multiple frequency zones in the AEMS, yielding a new Frequency Zone Hypothesis of rhythm, and pointers to research fields beyond the time domains of the foot, syllable and mora are outlined.
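The AEMS pipeline the abstract describes (amplitude demodulation of the speech signal, then a long-term spectrum of the envelope) can be sketched in a few lines of Python. This is an illustrative reconstruction, not Prof. Gibbon's own implementation: the Hilbert-transform demodulation, the fourth-order filter and the 32 Hz cutoff are all assumptions chosen to cover the slow modulation frequencies associated with feet, syllables and morae.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def amplitude_envelope_modulation_spectrum(signal, sr, env_cutoff_hz=32.0):
    """Sketch of an AEMS computation: demodulate the amplitude
    envelope, then take its long-term spectrum."""
    # 1. Amplitude demodulation: magnitude of the analytic signal.
    envelope = np.abs(hilbert(signal))
    # 2. Low-pass the envelope to keep only slow (prosodic) modulations.
    b, a = butter(4, env_cutoff_hz / (sr / 2), btype="low")
    envelope = filtfilt(b, a, envelope)
    # 3. Long-term spectrum of the mean-removed envelope; the bins below
    #    the cutoff cover the rhythm "frequency zones" of interest.
    spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / sr)
    keep = freqs <= env_cutoff_hz
    return freqs[keep], spectrum[keep]
```

On a signal whose amplitude is modulated at a syllable-like rate (e.g. 4 Hz), the spectrum returned by this sketch peaks near that modulation frequency, which is the sense in which rhythms appear as frequency zones in the AEMS.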

Prof. Daniel Hirst

Daniel Hirst is a British linguist and phonetician, who lives and works in the South of France. He has been working in the field of speech prosody and phonology for over forty years. After a PhD (1974) on intonation and prosody, he became a researcher for the CNRS (French National Scientific Research Centre) and completed a Habilitation thesis (1987). He is currently Emeritus Research Director for the CNRS and Aix-Marseille University in Aix-en-Provence.
He was also appointed Lecture Professor at Tongji University, Shanghai, China from April 2012 until 2015.
He has published numerous articles in major journals, has contributed chapters to numerous international volumes, and edited, with Albert Di Cristo, Intonation Systems: A Survey of Twenty Languages (Cambridge University Press, 1998).
In 2000 he founded the ISCA Special Interest Group on Speech Prosody (SProSIG), which in 2002 organised the first International Conference on Speech Prosody in Aix-en-Provence, France. Since then, Speech Prosody has become a regular international conference, held every two years (France, Japan, Germany, Brazil, USA, China, Ireland, USA, Poland).
He is also the chief editor of a collection of books published by Springer, entitled Prosody, Phonology and Phonetics, which publishes monographs and collections of papers on the subject of Speech Prosody.
He was elected a Fellow of ISCA in 2013 and a member of the Permanent Council for the Organisation of ICPhS in 2015.

Title of the talk: "On prosodic structure"

Our ideas about prosodic representation are heavily influenced by our knowledge of written language.
All writing systems represent utterances as a linear sequence of elements drawn from a finite set of characters. In many languages special characters such as spaces or punctuation marks are used as boundary symbols.
There is a general consensus today that utterances, although themselves produced and perceived as a linear stream of acoustic/physiological events, are mentally represented as a prosodic structure in which smaller chunks of speech are grouped into larger chunks following a hierarchy of phonological levels, and that this hierarchy is only partially related to the more abstract syntactic structure.
In this paper I present and discuss some ideas on the nature of these prosodic chunks and the ways in which prosodic structure differs both from written language and syntactic structure.
I suggest in particular that a less linear approach to prosodic structure may lead to significant and sometimes surprising insights into the nature of prosodic representations.

[The full text of this talk will be made available at the following address after the oral presentation on June 13]

Prof. Sonja Kotz

Sonja A. Kotz is a cognitive, affective, and translational neuroscientist who investigates the role of prediction in multimodal domains (perception, action, communication, music) in healthy and clinical populations using behavioural and modern neuroimaging techniques (E/MEG, s/fMRI).
She holds a Chair in Translational Cognitive Neuroscience at Maastricht University in the Netherlands, is a Research Associate at the Max Planck Institute for Human Cognitive and Brain Sciences in Leipzig, Germany, holds multiple honorary positions and professorships (Manchester and Glasgow, UK; Leipzig, Germany; Georgetown, USA), and is currently the President of the European Society for Cognitive and Affective Neuroscience. She also works for multiple funding agencies in Europe, including the ERC. She has published close to 200 papers in leading journals of cognitive and affective neuroscience, and her current h-index is 57 (Google Scholar).

Title of the talk: "Multimodal emotional speech perception"

Social interactions rely on multiple verbal and non-verbal information sources and their interaction. Crucially, in such communicative interactions we obtain information not only about the current emotional state of others (‘what’) but also about the timing of these information sources (‘when’). However, the perception and integration of multiple emotion expressions are prone to environmental noise and may be influenced by a specific situational context or learned knowledge. In our work on the temporal and neural correlates of multimodal emotion expressions, we address a number of questions by means of ERPs and fMRI within a predictive coding framework. In my talk I will focus on the following questions: (1) How do we integrate verbal and non-verbal emotion expressions? (2) How does noise affect the integration of multiple emotion expressions? (3) How do cognitive demands impact the processing of multimodal emotion expressions? (4) How do we resolve interference between verbal and non-verbal emotion expressions?

Prof. Andrew Rosenberg

Andrew Rosenberg is currently a Research Staff Member at IBM Research AI, where he has worked since 2016. He received his PhD from Columbia University in 2009. He then taught and researched at CUNY Queens College as Assistant and, later, Associate Professor until joining IBM. While at CUNY, from 2013 through 2016, he directed the CUNY Graduate Center Computational Linguistics Program. His research is primarily on automated analyses of prosody and their use in downstream spoken language processing tasks, including paralinguistic analysis, named entity recognition, segmentation, summarization and speech synthesis. He has written over 70 journal and conference papers, the vast majority on speech prosody and language production. He is the author and maintainer of AuToBI, an open-source tool for the automatic ToBI labeling of speech. He is an NSF CAREER award winner for a proposal titled "More than Words: Advancing Prosodic Analysis".

Title of the talk: "Speech, Prosody, and Machines: Nine Challenges for Prosody Research"

Speech technology is becoming commonplace. Traditional telephony-based interactive voice systems have been joined by virtual assistants and navigation systems to create a broad ecosystem of voice-enabled technologies. Prosody is an essential component of human communication, but machines still lag in their ability to understand information communicated prosodically and to produce human-like intonation.

This talk poses nine challenges designed to integrate prosody more effectively and more thoroughly into current speech technologies. These include long-standing and contemporary concerns surrounding the availability and utility of data, gaps in linguistic theory, and specific technological issues. Each of these challenges has received some attention, but additional work is necessary to bring the role of prosody in speech technology closer to its role in human communication.

Prof. Jianhua Tao

Professor Jianhua Tao is currently the deputy director of the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. He is a winner of the National Science Fund for Distinguished Young Scholars.
He received the B.S. and M.S. degrees from Nanjing University, Nanjing, China, in 1993 and 1996, respectively, and the Ph.D. degree from Tsinghua University, Beijing, China, in 2001. He is currently a Steering Committee Member of IEEE Transactions on Affective Computing; Vice-Chairperson of the ISCA Special Interest Group on Chinese Spoken Language Processing (SIG-CSLP); an Executive Committee member of HUMAINE, the China Computer Federation, the Chinese Association for Artificial Intelligence, the Chinese Character Information Society of China and the Acoustical Society of China; and secretary-general of the Linguistic Data Development and Management Committee of the Chinese Character Information Society of China. He has directed and participated in more than 20 national projects, including the "863" Program, the National Natural Science Foundation of China, the National Development and Reform Commission, and the International Cooperation Program of the Ministry of Science and Technology. He has repeatedly served as an evaluation expert for national projects such as those of the National Natural Science Foundation of China and the "863" Program. He has published more than 150 papers in SCI- or EI-indexed journals and proceedings, holds 15 domestic invention patents and 1 international patent, and has edited 2 books. Prof. Tao has received several awards from major conferences and has twice won the Scientific Technology Advance Award of Beijing City. He also serves as a committee member or program chair of well-known domestic and international conferences, including ICPR, ACII, ICMI, IUS, ISCSLP and NCMMSC, and is a member of the editorial boards of the Journal on Multimodal User Interfaces and the International Journal of Synthetic Emotions.

Title of the talk: "Speech emotion recognition"

Speech emotion recognition supports natural and efficient human-computer interaction, with wide applications in website customization, education and gaming. Typical methods are based on short-time frame-level feature extraction, followed by utterance-level information extraction and classification or regression as required. However, the selection of a common and global emotional feature subspace is challenging. We explore the influence of different emotional features (voice quality, spectral and prosodic features) on different types of corpora. A denoising auto-encoder is utilized to extract high-level discriminative representations. On the other hand, various machine learning algorithms are applied to speech emotion recognition, such as Gaussian Mixture Models, Deep Neural Networks and Support Vector Machines. Emotion is a temporally evolving event, so we favor methods that can model large amounts of contextual information well, such as Hidden Markov Models and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs).
In this talk, I present our multi-scale emotional dynamic temporal modeling using deep belief networks and LSTM-RNNs. We also propose temporal pooling to alleviate the problems of redundant information and label noise in dimensional emotion recognition. To resolve the ambiguity of emotion description, we combine dimensional and discrete emotion information to improve the performance of emotion recognition.
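The frame-level feature extraction and temporal pooling steps mentioned in the abstract can be illustrated with a toy sketch. The specific features here (log-energy and zero-crossing rate) and the mean/max pooling are illustrative stand-ins, not the speaker's actual feature set or pooling scheme; they only show the general shape of a frame-level-to-utterance-level pipeline.

```python
import numpy as np

def frame_signal(signal, sr, frame_ms=25, hop_ms=10):
    """Split a waveform into overlapping short-time frames."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

def frame_features(frames):
    """Toy frame-level descriptors: log-energy and zero-crossing rate,
    stand-ins for the spectral/prosodic features in the abstract."""
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([energy, zcr])

def temporal_pool(feats):
    """Utterance-level representation: mean and max pooling over time,
    one simple way to compress redundant frame-level information."""
    return np.concatenate([feats.mean(axis=0), feats.max(axis=0)])
```

The fixed-length pooled vector is what an utterance-level classifier or regressor would consume; sequence models such as the LSTM-RNNs discussed in the talk would instead operate on the full frame-level feature matrix.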
