IndoLEM Resource Collection

Data and Tasks


All data and code are available from the IndoLEM GitHub account 😊


♦ Morpho-syntax and Sequence Labelling

Part-of-speech tagging

We use the Indonesian POS tagging dataset of Dinakaramani et al. (2014) with the 5-fold partitioning of Kurniawan and Aji (2018). The train/dev/test distribution is 7,222/802/2,006.

NER
Dependency Parsing

♦ Semantic Task

Sentiment Analysis

This is a binary classification dataset (positive and negative), with distribution:

The data is sourced from 1) Twitter (Koto and Rahmaningtyas, 2017) and 2) hotel reviews.

Summarization

♦ Discourse Coherence

Next Tweet Prediction

To evaluate model coherence, we design a next tweet prediction (NTP) task that is similar to the next sentence prediction (NSP) task used to train BERT (Devlin et al., 2019). In NTP, each instance consists of a Twitter thread (2–4 tweets) that we call the premise, and four possible options for the next tweet, one of which is the actual response from the original thread.
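As a rough illustration of the instance format described above, the following Python sketch models one NTP example and scores a set of predictions. The class name `NTPInstance`, its field names, and the use of plain accuracy as the score are assumptions made for this example; they are not taken from the IndoLEM code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class NTPInstance:
    """One next tweet prediction example (hypothetical layout)."""
    premise: List[str]   # the thread so far: 2 to 4 tweets
    options: List[str]   # four candidate next tweets
    label: int           # index of the actual next tweet in `options`

    def __post_init__(self) -> None:
        # Enforce the constraints stated in the task description.
        if not 2 <= len(self.premise) <= 4:
            raise ValueError("premise must contain 2 to 4 tweets")
        if len(self.options) != 4:
            raise ValueError("exactly four options are required")
        if not 0 <= self.label < len(self.options):
            raise ValueError("label must index one of the options")


def accuracy(instances: Sequence[NTPInstance], predictions: Sequence[int]) -> float:
    """Fraction of instances whose predicted option index equals the label.

    Accuracy is an assumed metric for this sketch.
    """
    if len(instances) != len(predictions):
        raise ValueError("one prediction per instance is required")
    if not instances:
        raise ValueError("no instances to score")
    correct = sum(1 for inst, pred in zip(instances, predictions) if inst.label == pred)
    return correct / len(instances)
```

A model would receive `premise` and `options` and return an index into `options`; `accuracy` then compares those indices with the stored labels.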

Tweet Ordering

This task is based on the sentence ordering task of Barzilay and Lapata (2008) to assess text relatedness. We construct the data by shuffling Twitter threads (containing 3–5 tweets), and assess the predicted ordering in terms of rank correlation (ρ) with the original ordering.
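The construction and scoring described above can be sketched in a few lines of Python. This is a minimal sketch that assumes ρ is Spearman's rank correlation, computed with the closed-form formula for rankings without ties; the function names `shuffle_thread` and `spearman_rho` are hypothetical and do not come from the IndoLEM code.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_thread(thread: Sequence[str], seed: int = 0) -> Tuple[List[str], List[int]]:
    """Shuffle a thread and return (shuffled tweets, original positions).

    order[k] is the original position of the k-th shuffled tweet,
    so `order` is the gold answer a model has to recover.
    """
    rng = random.Random(seed)
    order = list(range(len(thread)))
    rng.shuffle(order)
    return [thread[i] for i in order], order


def spearman_rho(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Spearman's rho between two rankings of the same items (no ties).

    Uses rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the
    difference between predicted and gold position of item i.
    """
    n = len(gold)
    if len(predicted) != n:
        raise ValueError("rankings must have the same length")
    if n < 2:
        raise ValueError("need at least two items to correlate")
    d_squared = sum((p - g) ** 2 for p, g in zip(predicted, gold))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

Under this formula a perfectly recovered order scores 1 and a fully reversed order scores -1, so a higher ρ means the predicted ordering is closer to the original thread.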
