Data and Tasks

Part-of-speech tagging
NER
Dependency Parsing
Sentiment Analysis
Summarization
Next Tweet Prediction
Tweet Ordering

All data and code are available from the IndoLEM GitHub account 😊

♦ Morpho-syntax and Sequence Labelling

Part-of-speech tagging

We use the Indonesian POS tagging dataset of Dinakaramani et al. (2014), with the 5-fold partitioning of Kurniawan and Aji (2018). The train/dev/test distribution is 7,222/802/2,006.

NER

NER UI (Universitas Indonesia) by Fachri (2014), with train/dev/test distribution: 1,530/170/425.
NER UGM (Universitas Gadjah Mada) by Gultom and Wibowo (2017), with train/dev/test distribution: 1,687/187/469.

Dependency Parsing

UD-Indo-PUD: 1,000 sentences from UD-Indo-PUD (Zeman et al., 2018); we use the corrected version by Alfina et al. (2019).
UD-Indo-GSD: 5,593 sentences from UD-Indo-GSD (McDonald et al., 2013).

♦ Semantic Task

Sentiment Analysis

This is a binary classification dataset (positive and negative), with the following distribution:

Train: 3,638 sentences
Development: 399 sentences
Test: 1,011 sentences

The data is sourced from 1) Twitter (Koto and Rahmaningtyas, 2017) and 2) hotel reviews.

Summarization

IndoSum, by Kurniawan and Louvan (2018), is constructed from CNN Indonesia and Kumparan, with train/dev/test distribution: 14,262/750/3,762.
Liputan6, by Koto et al. (2020), is the first large-scale Indonesian corpus for abstractive and extractive summarization. The data covers the years 2000 to 2010, with train/dev/test distribution: 193,883/10,972/10,972.

♦ Discourse Coherence

Next Tweet Prediction

To evaluate model coherence, we design a next tweet prediction (NTP) task that is similar to the next sentence prediction (NSP) task used to train BERT (Devlin et al., 2019). In NTP, each instance consists of a Twitter thread (2–4 tweets) that we call the premise, and four possible options for the next tweet, one of which is the actual response from the original thread.

Train: 5,681 threads
Development: 811 threads
Test: 1,890 threads
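
To make the instance format concrete, here is a minimal Python sketch of one NTP example and how predictions could be scored. The names `NTPInstance`, `build_instance` and `accuracy` are hypothetical and are not taken from the IndoLEM code. The sketch also assumes that the three distractor tweets are supplied by the caller and that accuracy is the evaluation metric; neither detail is specified above.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class NTPInstance:
    premise: List[str]   # the 2-4 tweets that open the thread
    options: List[str]   # four candidate next tweets
    label: int           # index in `options` of the actual next tweet


def build_instance(premise: Sequence[str], true_next: str,
                   distractors: Sequence[str],
                   rng: random.Random) -> NTPInstance:
    """Assemble one instance, shuffling so the gold tweet has no fixed slot."""
    if not 2 <= len(premise) <= 4:
        raise ValueError("premise must contain 2 to 4 tweets")
    if len(distractors) != 3:
        raise ValueError("exactly three distractors are required")
    candidates = [true_next, *distractors]   # gold tweet sits at index 0
    order = list(range(len(candidates)))
    rng.shuffle(order)
    options = [candidates[i] for i in order]
    # The gold tweet ends up wherever index 0 landed after shuffling.
    return NTPInstance(list(premise), options, order.index(0))


def accuracy(predictions: Sequence[int],
             instances: Sequence[NTPInstance]) -> float:
    """Fraction of instances whose predicted option index equals the label."""
    if not instances or len(predictions) != len(instances):
        raise ValueError("need one prediction per instance")
    hits = sum(p == inst.label for p, inst in zip(predictions, instances))
    return hits / len(instances)
```

Tracking the shuffled index rather than searching for the gold text keeps the label correct even if a distractor happens to be identical to the true next tweet.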

Tweet Ordering

This task is based on the sentence ordering task of Barzilay and Lapata (2008) to assess text relatedness. We construct the data by shuffling Twitter threads (containing 3–5 tweets), and assess the predicted ordering in terms of rank correlation (ρ) with the original order.

Train: 4,327 threads
Development: 760 threads
Test: 1,521 threads
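
As an illustration of the metric, the sketch below computes a rank correlation between a predicted ordering and the original ordering of a thread. It assumes ρ is Spearman's rank correlation and uses the closed form for rankings without ties, ρ = 1 − 6 Σ d² / (n(n² − 1)); the function name `spearman_rho` is hypothetical and not part of the IndoLEM code.

```python
from typing import Hashable, Sequence


def spearman_rho(predicted: Sequence[Hashable],
                 original: Sequence[Hashable]) -> float:
    """Spearman rank correlation between two orderings of the same items.

    Both sequences must contain the same distinct items (e.g. tweet IDs),
    each exactly once, so there are no tied ranks.
    """
    n = len(original)
    if n < 2:
        raise ValueError("need at least two items to compute a correlation")
    if len(set(original)) != n:
        raise ValueError("items must be distinct")
    if len(predicted) != n or set(predicted) != set(original):
        raise ValueError("orderings must contain the same items")
    # Position of each item in the original thread.
    original_rank = {item: rank for rank, item in enumerate(original)}
    # Sum of squared rank differences over all items.
    d_squared = sum((rank - original_rank[item]) ** 2
                    for rank, item in enumerate(predicted))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# A 4-tweet thread where the model swaps the two middle tweets.
print(spearman_rho(["a", "c", "b", "d"], ["a", "b", "c", "d"]))  # 0.8
```

A perfect ordering gives ρ = 1 and a fully reversed thread gives ρ = −1, so the score rewards predictions that are nearly right rather than only exact matches.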
