Marko Robnik-Šikonja, Professor of Computer Science and Informatics and Head of the Artificial Intelligence Chair at the University of Ljubljana, Faculty of Computer and Information Science
Currently, the most successful machine learning methods are numeric, e.g., deep neural networks or SVMs. If we are to harness the power of these numeric deep learning approaches for symbolic data such as texts or graphs, the symbolic data has to be embedded into a vector space suitable for numeric algorithms. The embeddings should preserve the information contained in the original data, in the form of similarities and relations, by encoding it into distances and directions in the numeric space. Typically, these vector representations are obtained with neural networks trained for the task of language modelling. As it turns out, the resulting numeric spaces are similar across different languages and can be mapped onto each other with approaches called cross-lingual embeddings.
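To make the mapping between spaces concrete, here is a minimal sketch of one supervised approach: given a small bilingual dictionary pairing vectors from two monolingual embedding spaces, an orthogonal map between the spaces can be computed by solving the orthogonal Procrustes problem. The vectors below are synthetic stand-ins for real pre-trained embeddings.

import numpy as np

def procrustes_map(X, Y):
    # X: (n, d) source-language vectors, Y: (n, d) target-language vectors
    # for n dictionary pairs; returns the orthogonal W minimizing ||XW - Y||_F.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
d = 300
src = rng.normal(size=(5000, d))              # hypothetical source-language vectors
rotation = np.linalg.qr(rng.normal(size=(d, d)))[0]
tgt = src @ rotation                          # target space = rotated source space
W = procrustes_map(src[:1000], tgt[:1000])    # learn the map from 1000 dictionary pairs
print(np.allclose(src @ W, tgt, atol=1e-6))   # mapped vectors align with the target space

With real embeddings the dictionary pairs would come from a seed translation lexicon; unsupervised variants instead learn the map adversarially or from the distributional similarity of the two spaces alone.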
We will present the ideas behind supervised, unsupervised, and semi-supervised cross-lingual embeddings. We will focus on recent contextual embeddings, which ensure that the same word is mapped to different vectors depending on its context. We will describe how to build and fine-tune contextual embeddings, such as ELMo and BERT, and present examples of training a model in a well-resourced language such as English and transferring it to a less-resourced language such as Finnish. We will also describe applications of cross-lingual transfer in text classifiers and abstractive summarizers.
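As an illustration of the transfer step (a sketch, not the exact pipeline from the talk), the following fine-tunes multilingual BERT on English examples with the Hugging Face transformers library and then applies it, zero-shot, to Finnish input; the texts and labels are toy placeholders.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Fine-tune on English sentiment examples (placeholder data).
english_texts = ["The film was wonderful.", "A complete waste of time."]
labels = torch.tensor([1, 0])
batch = tokenizer(english_texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                      # a few gradient steps on the toy batch
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Zero-shot transfer: classify Finnish text with the English-trained model.
model.eval()
finnish = tokenizer(["Elokuva oli loistava."], return_tensors="pt")  # "The film was great."
with torch.no_grad():
    pred = model(**finnish).logits.argmax(dim=-1)
print(pred)

In practice one would train on a full labelled dataset rather than a toy batch; the point is that the shared multilingual representation lets a classification head trained only on English operate on Finnish input without any Finnish labels.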
slides:
https://haagahelia-my.sharepoint.com/:b:/g/personal/h01928_haaga-helia_fi/EcBGht8OxRBPlRucYLreip0BNt...