IndoLEM Resource Collection

IndoBERT

IndoBERT is the Indonesian version of the BERT model. We trained the model on over 220M words, aggregated from three main sources:

  • Indonesian Wikipedia (74M words)
  • news articles from Kompas, Tempo (Tala et al., 2003) and Liputan6 (55M words in total)
  • Indonesian Web Corpus (Medved and Suchomel, 2017) (90M words)

We trained the model for 2.4M steps (180 epochs), reaching a final perplexity of 3.97 on the development set (similar to English BERT-base).
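
For reference, perplexity in BERT pretraining is the exponential of the average masked-LM cross-entropy loss. A minimal sketch of that conversion (the loss value below is back-computed from the reported perplexity, not taken from our training logs):

import math

# Perplexity = exp(average cross-entropy loss, in nats per masked token).
# A dev-set perplexity of 3.97 corresponds to a loss of roughly ln(3.97) ≈ 1.38.
dev_loss = 1.38  # hypothetical average masked-LM loss on the dev set
print(round(math.exp(dev_loss), 2))  # ~3.97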

How to Use:
We use the Huggingface (PyTorch) framework. First install the library:
pip install transformers==3.5.1
Then download and load the model:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModel.from_pretrained("indolem/indobert-base-uncased")
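
As a quick sanity check, below is a minimal usage sketch (the Indonesian example sentence is our own) that encodes one sentence and inspects the resulting contextual embeddings:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModel.from_pretrained("indolem/indobert-base-uncased")

# Tokenize one sentence into input IDs and an attention mask (batch of 1)
inputs = tokenizer("Selamat pagi, apa kabar?", return_tensors="pt")

# Forward pass without gradient tracking; with transformers 3.5.1 the model
# returns a tuple whose first element is the last hidden state
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, sequence_length, hidden_size=768 for a BERT-base model)
print(outputs[0].shape)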
        
References for BERT and the Transformer:
Don't worry if you are new to pre-trained language models. You can check these references: