Learning Cross-lingual Word Embeddings via Matrix Co-factorization

Tianze Shi, Zhiyuan Liu, Yang Liu and Maosong Sun

In Proc. of ACL (short papers), 2015

Research Summary

A joint-space model for cross-lingual distributed representations generalizes language-invariant semantic features. In this project, we present a matrix co-factorization framework for learning cross-lingual word embeddings. We explicitly define monolingual training objectives in the form of matrix decomposition, and introduce cross-lingual constraints so that the monolingual matrices are factorized simultaneously. The cross-lingual constraints can be derived from parallel corpora, with or without word alignments. Empirical results on a cross-lingual document classification task show that our method effectively encodes cross-lingual knowledge as constraints on cross-lingual word embeddings.
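The idea of factorizing two monolingual matrices under a shared cross-lingual constraint can be illustrated with a toy sketch. This is not the released code: the function name `cofactorize`, the alignment-weight matrix `align`, the weight `lam`, and the plain squared-error loss are all assumptions made for illustration. The released implementation builds on GloVe, so its monolingual objective will differ from the one shown here.

```python
import numpy as np

def cofactorize(M1, M2, align, dim=4, lam=1.0, lr=0.01, steps=500, seed=0):
    """Toy co-factorization: M1 ~ W1 @ C1.T and M2 ~ W2 @ C2.T.

    align[i, j] >= 0 is a hypothetical weight saying how strongly word i
    of language 1 and word j of language 2 should have similar embeddings.
    Returns the word embeddings W1, W2 and the loss recorded at each step.
    """
    rng = np.random.default_rng(seed)
    W1 = 0.1 * rng.standard_normal((M1.shape[0], dim))
    C1 = 0.1 * rng.standard_normal((M1.shape[1], dim))
    W2 = 0.1 * rng.standard_normal((M2.shape[0], dim))
    C2 = 0.1 * rng.standard_normal((M2.shape[1], dim))
    losses = []
    for _ in range(steps):
        # Monolingual reconstruction residuals.
        R1 = W1 @ C1.T - M1
        R2 = W2 @ C2.T - M2
        # diff[i, j] = W1[i] - W2[j], used by the cross-lingual penalty.
        diff = W1[:, None, :] - W2[None, :, :]
        cross = (align * (diff ** 2).sum(axis=-1)).sum()
        losses.append((R1 ** 2).sum() + (R2 ** 2).sum() + lam * cross)
        # Gradients of the joint objective with respect to each factor.
        pull = align[:, :, None] * diff
        gW1 = 2 * R1 @ C1 + 2 * lam * pull.sum(axis=1)
        gW2 = 2 * R2 @ C2 - 2 * lam * pull.sum(axis=0)
        gC1 = 2 * R1.T @ W1
        gC2 = 2 * R2.T @ W2
        # Plain gradient descent on all four factors at once.
        W1 -= lr * gW1
        W2 -= lr * gW2
        C1 -= lr * gC1
        C2 -= lr * gC2
    return W1, W2, losses

# Random stand-ins for two monolingual matrices and a one-to-one alignment.
rng = np.random.default_rng(0)
M1 = rng.random((6, 5))
M2 = rng.random((6, 5))
align = np.eye(6)
W1, W2, losses = cofactorize(M1, M2, align)
print("first loss:", losses[0], "last loss:", losses[-1])
```

Setting `lam=0` decouples the two factorizations, so the two languages no longer share a space. Increasing `lam` pulls aligned words closer together at some cost in monolingual reconstruction quality.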

[paper (pdf)] [author's version (pdf)]

Code

Licensed under the Apache License, Version 2.0

This code is a simple implementation of the ACL 2015 paper, based on the GloVe implementation by Jeffrey Pennington.

[code (v0.01)]

Usage

Acknowledgements

This research is supported by the 973 Program (No. 2014CB340501) and the National Natural Science Foundation of China (NSFC No. 61133012, 61170196 and 61202140).