NER API - Technical Model Documentation

This document provides detailed technical information about the three main components of the NER Product Extraction API: the NER Model, Rule-Based Post-Processing, and Intent Classifier.

Field            Value
Author           Yuditya Insani
Project Version  V.1
Status           Active
Stakeholder      Product - Customer Apps

Definition and Objective

The WhatsApp Customer Intention System is an NLP pipeline designed to automatically parse unstructured conversation data from WhatsApp commerce groups. The system identifies key product entities (such as product names, variants, and fees) and interprets customer intent. This solution removes the need for admins to manually read and record every single order.

The primary objective is to automate the order recording workflow within WhatsApp channels. By transforming raw chat logs into structured order data in real-time, the system aims to:

  • Eliminate manual data entry.
  • Reduce the workload for group admins.
  • Enable scalable handling of high-volume transaction periods without increasing headcount.

Table of Contents

  1. NER Model
  2. Rule-Based Post-Processing
  3. Intent Classifier

NER Model

Base Model

The NER system uses BERT (Bidirectional Encoder Representations from Transformers) as its foundation.

  • Architecture: BertForWordClassification (custom implementation)
  • Base Model: BERT (likely IndoBERT or mBERT for Indonesian language)
  • Model File: model/best_model_f1.pth (~1.3GB)
  • Framework: PyTorch + Hugging Face Transformers

Model Architecture

Input Text → BERT Tokenizer → Subword Tokens
             ↓
BERT Encoder (768-dim)
             ↓
Subword → Word Aggregation
             ↓
Dropout Layer
             ↓
Linear Classification Layer
             ↓
Output Labels

Key Components

  1. BERT Encoder (BertModel)

    • Converts input tokens into contextualized embeddings
    • Hidden size: 768 dimensions
    • Pre-trained on large Indonesian corpus
  2. Word-Level Aggregation

    • Converts subword-level representations to word-level
    • Uses averaging of subword embeddings that belong to the same word
    • Handles tokenization mismatches (e.g., "ukuran" → ["uk", "##uran"])
  3. Dropout Layer

    • Regularization to prevent overfitting
    • Rate specified in model config
  4. Classification Head

    • Linear layer: 768 → num_labels
    • Maps contextualized word representations to label predictions
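
The subword-to-word aggregation described above can be sketched in plain Python (illustrative only; the actual implementation in app/utils/word_classification.py operates on PyTorch tensors inside the model's forward pass):

```python
def aggregate_subwords(subword_vecs, word_ids):
    """Average subword vectors that belong to the same word.

    subword_vecs: list of embedding vectors, one per subword
    word_ids: for each subword position, the index of the word it belongs to
              (e.g. "ukuran" -> ["uk", "##uran"] gives word_ids [0, 0])
    """
    num_words = max(word_ids) + 1
    dim = len(subword_vecs[0])
    sums = [[0.0] * dim for _ in range(num_words)]
    counts = [0] * num_words
    for vec, wid in zip(subword_vecs, word_ids):
        counts[wid] += 1
        for i, v in enumerate(vec):
            sums[wid][i] += v
    return [[s / counts[w] for s in sums[w]] for w in range(num_words)]

# Two subwords of one word are averaged into a single word vector:
print(aggregate_subwords([[1.0, 2.0], [3.0, 4.0]], [0, 0]))  # [[2.0, 3.0]]
```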

Finetuning

The model has been fine-tuned on a custom dataset containing:

  • Indonesian WhatsApp chat messages
  • Product descriptions from sellers
  • Customer inquiries and orders
  • E-commerce conversations

Training Objective: Token classification using CrossEntropyLoss

Output Labels

The model predicts five entity types plus the non-entity label O, using the BIO tagging scheme:

Label             Type         Description                     Example
PRODUCT           Main Entity  Product name/category           "Sendal", "Baju", "Sepatu"
VARIANT           Attribute    Product variant (color, type)   "Merah", "Lengan Panjang"
VARIANT_QUANTITY  Measurement  Size, weight, dimensions        "Size 39", "100ml", "XL"
PRODUCT_QUANTITY  Count        Number of items                 "2pcs", "5 buah", "1 box"
PRODUCT_FEE       Price        Product price/fee               "50k", "150rb", "200000"
O                 Non-Entity   Other words                     "mau", "kak", "ya"

BIO Tagging Scheme

  • B- prefix: Beginning of entity
  • I- prefix: Inside/continuation of entity
  • O: Outside any entity

Example:

Input:  "Baju kaos 100rb size M"
Labels: B-PRODUCT I-PRODUCT B-PRODUCT_FEE B-VARIANT_QUANTITY I-VARIANT_QUANTITY
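
Decoding BIO tags back into entity spans follows directly from the scheme: a B- tag opens a span, matching I- tags extend it, and anything else closes it. A minimal decoder (function name is illustrative, not the project's API):

```python
def decode_bio(tokens, labels):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity begins; flush any open span first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)  # continuation of the open span
        else:
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["Baju", "kaos", "100rb", "size", "M"]
labels = ["B-PRODUCT", "I-PRODUCT", "B-PRODUCT_FEE",
          "B-VARIANT_QUANTITY", "I-VARIANT_QUANTITY"]
print(decode_bio(tokens, labels))
# [('PRODUCT', 'Baju kaos'), ('PRODUCT_FEE', '100rb'), ('VARIANT_QUANTITY', 'size M')]
```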

Model Inference

  1. Normalization: Text normalized using IndonesianNormalizer
  2. Tokenization: Words → subwords using BERT tokenizer
  3. Prediction: Model outputs logits for each word
  4. Decoding: argmax to get most likely label
  5. Post-processing: Rule-based refinement

Rule-Based Post-Processing

The rule-based system (RuleBasedTagger) refines NER predictions using linguistic rules and domain knowledge.

Purpose

  • Fix common NER model mistakes
  • Apply domain-specific heuristics
  • Handle edge cases and patterns
  • Improve precision on specific entity types

Rule Categories

1. Number in Brackets → Product Quantity

Rule: Numbers inside parentheses are marked as PRODUCT_QUANTITY

Input:  "Mau (3) box"
Before: O O O
After: O B-PRODUCT_QUANTITY O

Rationale: Brackets often indicate quantity in Indonesian e-commerce context
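
This rule can be expressed as a single regular expression over the token (a sketch; the actual RuleBasedTagger implementation may differ):

```python
import re

# A number wrapped in parentheses, e.g. "(3)", signals a quantity.
BRACKET_QTY = re.compile(r"^\((\d+)\)$")

def tag_bracketed_number(token):
    """Return a PRODUCT_QUANTITY tag for tokens like "(3)", else None."""
    return "B-PRODUCT_QUANTITY" if BRACKET_QTY.match(token) else None

print(tag_bracketed_number("(3)"))  # B-PRODUCT_QUANTITY
print(tag_bracketed_number("box"))  # None
```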


2. Ordinal Numbers → O

Rule: Numbers followed by period (.) are marked as O (not entities)

Input:  "1. Baju 2. Celana"
Labels: O O O O

Rationale: Ordinal numbers in lists are not product information


3. Dash Between Numbers → Variant Quantity

Rule: Pattern number-number indicates size range

Input:  "Ukuran 30 - 40"
Labels: O B-VARIANT_QUANTITY B-VARIANT_QUANTITY B-VARIANT_QUANTITY

Rationale: Ranges like "30-40" indicate size/variant specifications


4. Variant Units Recognition

Total: 33 variant units

Category  Units
Weight    kg, gram, gr, mg, ons, lb, pound
Volume    liter, l, ml, cc, oz
Length    cm, mm, m, meter, km, inch, inchi, ft, feet, yard
Area      sqm, m2, cm2
Size      size, sz, ukuran, uk, isi, insole

Rules:

  • Number + variant unit → VARIANT_QUANTITY
  • Variant unit + number → VARIANT_QUANTITY
  • Number adjacent to variant unit → both marked as VARIANT_QUANTITY

Examples:

"50kg" → B-VARIANT_QUANTITY
"size 39" → B-VARIANT_QUANTITY B-VARIANT_QUANTITY
"100ml" → B-VARIANT_QUANTITY
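
A sketch of how these rules might be checked per token, covering both the attached form ("50kg") and the two-token form ("size 39"); the unit list here is a small subset of the 33 units in rule_based.json:

```python
import re

# Illustrative subset of the variant units from rule_based.json.
VARIANT_UNITS = {"kg", "gram", "ml", "cc", "cm", "size", "uk"}
NUM_UNIT = re.compile(r"^(\d+(?:\.\d+)?)([a-z]+)$")

def tag_variant(token, prev_token=None):
    """Tag number+unit tokens and numbers following a unit word."""
    m = NUM_UNIT.match(token)
    if m and m.group(2) in VARIANT_UNITS:
        return "B-VARIANT_QUANTITY"  # attached form, e.g. "50kg", "100ml"
    if token.isdigit() and prev_token in VARIANT_UNITS:
        return "B-VARIANT_QUANTITY"  # number after a unit, e.g. "size 39"
    return None

print(tag_variant("50kg"))        # B-VARIANT_QUANTITY
print(tag_variant("39", "size"))  # B-VARIANT_QUANTITY
print(tag_variant("39"))          # None
```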

5. Quantity Units Recognition

Total: 79 quantity units

Category    Units (Sample)
Basic       pcs, pc, buah, biji, bij, item, unit
Container   set, paket, pack, pak, bundle, lot
Box         box, boks, dus, kotak, tray
Bulk        karung, sak, peti, krat
Packaging   botol, kaleng, cup, sachet, saset, pouch, tube, jar
Pair        pasang, pair, helai, lembar
Dozen       lusin, kodi
Variations  pcsnya, boxnya, setnya, paketan, itemnya

Rules:

  • Number + quantity unit → PRODUCT_QUANTITY
  • Quantity unit after number → mark both as PRODUCT_QUANTITY

Examples:

"2pcs" → B-PRODUCT_QUANTITY
"5 box" → B-PRODUCT_QUANTITY B-PRODUCT_QUANTITY
"3 lusin" → B-PRODUCT_QUANTITY B-PRODUCT_QUANTITY

6. Size Units Recognition

Units: xxs, xs, s, m, l, xl, xxl, xxxl, 2xl, 3xl, 4xl

Rule: Standalone size letters → VARIANT

Examples:

"Size L" → O B-VARIANT
"XL" → B-VARIANT
"2XL" → B-VARIANT

7. Number Range Classification

Rules based on numeric value:

Range   Label             Rationale
25-49   VARIANT_QUANTITY  Likely shoe size or small measurement
> 1000  PRODUCT_FEE       Likely price in Indonesian Rupiah
≤ 1000  PRODUCT_QUANTITY  Likely quantity or small count

The rules are checked in this order, so a value in 25-49 is tagged VARIANT_QUANTITY even though it also satisfies ≤ 1000.

Examples:

"39" → B-VARIANT_QUANTITY (shoe size)
"5000" → B-PRODUCT_FEE (price: 5000 IDR)
"10" → B-PRODUCT_QUANTITY (quantity)
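
The range table above translates into a short classifier in which the check order matters, since 25-49 overlaps with ≤ 1000 (function name is illustrative):

```python
def classify_bare_number(value):
    """Heuristic label for a standalone number, per the range table."""
    if 25 <= value <= 49:
        return "B-VARIANT_QUANTITY"  # likely a shoe size / small measurement
    if value > 1000:
        return "B-PRODUCT_FEE"       # likely a price in IDR
    return "B-PRODUCT_QUANTITY"      # likely a quantity or small count

print(classify_bare_number(39))    # B-VARIANT_QUANTITY
print(classify_bare_number(5000))  # B-PRODUCT_FEE
print(classify_bare_number(10))    # B-PRODUCT_QUANTITY
```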

8. Stopwords Filtering

Total: 763 Indonesian stopwords

Categories:

  • Common words: ada, adalah, dengan, untuk, yang
  • Pronouns: saya, kamu, dia, mereka
  • Temporal: kapan, nanti, sudah, belum
  • Discourse markers: kak, kakak, mas, mbak

Rule: All stopwords → O

Rationale: Stopwords don't contain product information


Intent Classifier

The Intent Classifier determines whether a customer message shows intention to order or is just neutral conversation.

Output Labels

Intent   Description
Order    Customer intends to purchase/order products
Neutral  Customer is asking questions or casual chat

Scoring System

The classifier uses a scoring-based approach with two competing scores:

Score Calculation

order_score = 0    # accumulates signals for "Order" intent
neutral_score = 0  # accumulates signals for "Neutral" intent

# Final decision:
if order_score > neutral_score:
    intent = "Order"
    confidence = order_score / (order_score + neutral_score)
else:
    intent = "Neutral"
    confidence = neutral_score / (order_score + neutral_score)

Order Indicators (+points to order_score)

Signal                   Points  Condition
Order trigger word       +4      Keywords like "ambil", "pesan", "beli", "mau"
Multiple items listed    +4      Pattern like "1. item1 2. item2"
Variant quantity         +5      Customer mentions size/variant (from NER)
Product quantity         +4      Customer mentions quantity (from NER)
Size specification       +3      Words like XL, L, M, S detected
Product/variant mention  +3      Customer mentions product (from NER)
Demonstrative + trigger  +2      "ini" or "itu" + order trigger word
Quantity number          +2      Numbers detected (but not price)
Photo attached           +2      Photo field is not empty
Responding to seller     +1      Seller message contains price info

Order Trigger Words (71 total)

Purchase verbs: ambil, pesan, titip, booking, beli, reserve, keep, simpan

Intention: mau, ingin, pengen, niat, minat, tertarik

Action: gas, cus, sikat, hajar, lanjut, jalan, ikut

Selection: pilih, incar, lirik, mark, save, hold

Addition: plus, tambah, include, masuk, taruh


Neutral Indicators (+points to neutral_score)

Signal                 Points  Condition
Availability question  +4      Patterns like "ada ga?", "ready?", "stock?"
General question       +3      Question marks, "apa", "berapa", "ada"
Asking price           +2      "brp 2 pcs?" → asking price, not ordering

Questioning Patterns (13 total)

  1. Ends with ?: \?$
  2. Question words: ^(apa|apakah|berapa|brp|kapan|kenapa|gimana|ada|adakah|bisa|boleh)
  3. Price inquiry: \b(harga|price|brp|berapa)\b
  4. Recommendation: \b(recomend|rekomendasi|saran)\b
  5. Info request: \bminta\b.*\binfo\b
  6. Question verb: \btanya\b
  7. Check DM: \bcek\s*japri\b
  8. Availability patterns:
    • \bada\s+(ga|gaa|gak|tidak|nggak)
    • \bmasih\s+(ada|bisa|ready)
    • \bready\s+(ga|gaa|gak|tidak|nggak)
    • \b(stock|stok)\b
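
The patterns above are plain regular expressions, so question detection reduces to testing a message against each compiled pattern. A sketch using a subset of the 13 patterns (the full list lives in rule_based.json):

```python
import re

# Illustrative subset of the questioning patterns.
QUESTION_PATTERNS = [
    re.compile(r"\?$"),                                   # ends with "?"
    re.compile(r"^(apa|apakah|berapa|brp|kapan|kenapa|gimana|ada|adakah|bisa|boleh)\b"),
    re.compile(r"\b(harga|price|brp|berapa)\b"),          # price inquiry
    re.compile(r"\bada\s+(ga|gaa|gak|tidak|nggak)\b"),    # availability
    re.compile(r"\b(stock|stok)\b"),
]

def is_question(message):
    """True if any questioning pattern matches the normalized message."""
    msg = message.lower().strip()
    return any(p.search(msg) for p in QUESTION_PATTERNS)

print(is_question("Ada size L ga?"))  # True
print(is_question("Mau yang merah"))  # False
```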

Special Cases

1. Question Override

If a message matches neutral patterns but also contains NER entities:

  • Check whether it is a question first
  • Questions about products → Neutral (not ordering)

Examples:

  • "Ada size L?" → Neutral (+4 availability question)
  • "Size L" → Order (+3 size specification)

2. Price Question Negation

Numbers with price context → Neutral

"Brp 2 pcs?" → Neutral (+2 asking price)
"2 pcs" → Order (+2 quantity number)

3. Empty Message

if not cust_message:
    return "Neutral"  # confidence: 1.0

Confidence Score

confidence = max(order_score, neutral_score) / (order_score + neutral_score)

Range: 0.0 to 1.0

Interpretation:

  • 0.5: No clear signals (default to Neutral)
  • 0.6-0.7: Weak confidence
  • 0.8-0.9: Strong confidence
  • 1.0: Only one type of signal detected
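
The formula above, with the zero-signal default made explicit (treating no signals at all as 0.5, per the interpretation list; that fallback is an assumption):

```python
def confidence(order_score, neutral_score):
    """Winner's share of the total score; 0.5 when no signals fired."""
    total = order_score + neutral_score
    if total == 0:
        return 0.5  # no clear signals -> default to Neutral
    return max(order_score, neutral_score) / total

print(confidence(7, 0))  # 1.0
print(confidence(4, 6))  # 0.6
```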

Classification Examples

Example 1: Clear Order

Input: "Mau kak yang size 39"
Signals:
- Order trigger "mau" (+4)
- Size specification "39" (+3)
Total: order_score=7, neutral_score=0
Result: Order (confidence: 1.0)

Example 2: Question

Input: "Ada size 39 ga kak?"
Signals:
- Availability question "ada...ga" (+4)
- Question mark (+3)
Total: order_score=0, neutral_score=7
Result: Neutral (confidence: 1.0)

Example 3: Mixed Signals

Input: "Mau tanya harga size 39?"
Signals:
- Order trigger "mau" (+4)
- Question pattern "tanya" (+3)
- Price inquiry "harga" (context for neutral +3)
Total: order_score=4, neutral_score=6
Result: Neutral (confidence: 0.6)

Example 4: Multiple Products

Input: "1. Baju M 2. Celana L"
Signals:
- Multiple items listed (+4)
- Size specifications (+3 each)
Total: order_score=10, neutral_score=0
Result: Order (confidence: 1.0)

Component Integration

Processing Pipeline

1. Input: ref_message, cust_message, photo

2. Text Normalization (Indonesian slang → standard form; abbreviated numbers expanded, e.g. "50k" → "50000")

3. NER Model Prediction (BERT, Product + Variant Labelling)

4. Rule-Based Post-Processing (Variant Classification)

5. Intent Classification (scoring)

6. Entity Extraction & Grouping

7. Output: List of products with variants, fees, intent
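
The steps above can be wired together as a simple function composition. This is a structural sketch only: the stage functions are injected as parameters because their real names and signatures live in app/predict.py and are not reproduced here.

```python
def process_message(ref_message, cust_message, photo,
                    normalize, ner_predict, rule_based,
                    classify_intent, extract_products):
    """Run the pipeline stages in order (stage functions are injected)."""
    # Step 2: normalize both sides of the conversation
    ref = normalize(ref_message)
    cust = normalize(cust_message)
    # Steps 3-4: NER prediction refined by rule-based post-processing
    ref_tags = rule_based(ner_predict(ref))
    cust_tags = rule_based(ner_predict(cust))
    # Step 5: score-based intent classification
    intent = classify_intent(cust, cust_tags, ref_tags, photo)
    # Steps 6-7: group entities into the structured output
    return extract_products(ref_tags, cust_tags, intent, photo)

# Wiring demo with trivial stand-in stages:
result = process_message(
    "Sendal 50k", "Mau yang size 39", "",
    normalize=str.lower,
    ner_predict=lambda text: [(w, "O") for w in text.split()],
    rule_based=lambda tags: tags,
    classify_intent=lambda *a: "Order",
    extract_products=lambda rt, ct, intent, photo: {"intent": intent},
)
print(result)  # {'intent': 'Order'}
```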

Data Flow Example

Input:
ref_message: "Sendal 50k sz 30-40"
cust_message: "Mau yang size 39"
photo: ""

Step 1: Normalization
ref: "sendal 50000 sz 30 - 40"
cust: "mau yang size 39"

Step 2: NER Prediction
ref: [sendal/PRODUCT, 50000/O, sz/O, 30/O, -/O, 40/O]
cust: [mau/O, yang/O, size/O, 39/O]

Step 3: Rule-Based
ref: [sendal/PRODUCT, 50000/FEE, sz/VARIANT_QTY, 30/VARIANT_QTY, -/VARIANT_QTY, 40/VARIANT_QTY]
cust: [mau/O, yang/O, size/VARIANT_QTY, 39/VARIANT_QTY]

Step 4: Intent Classification
Signals: "mau" (+4), "size" (+3), "39" (+3)
Result: Order (confidence: 1.0)

Step 5: Extraction
Product: "Sendal" (from ref)
Variant: "Size 39" (from cust)
Fee: "50000" (from ref)
Intent: "Order"
Unclear: "false"

Output:
[{
  "product": "Sendal",
  "varian": "Size 39",
  "fee": "50K",
  "photo": "",
  "unclear": "false",
  "source_text": "Mau yang size 39",
  "intent": "Order"
}]

Model Files

Required Files

  1. best_model_f1.pth (~1.3GB)

    • Trained model weights
    • Contains: model state dict, config, label mapping (i2w)
  2. vocab.txt (~224KB)

    • BERT tokenizer vocabulary
    • ~30k tokens for Indonesian
  3. tokenizer_config.json

    • Tokenizer configuration
    • Special tokens mapping
  4. special_tokens_map.json

    • CLS, SEP, PAD, UNK token definitions
  5. rule_based.json (~16KB)

    • Variant units (33)
    • Quantity units (79)
    • Size units (12)
    • Positive keywords (71)
    • Negative keywords (46)
    • Stopwords (763)
    • Questioning patterns (13)

Extending the Model

Adding New Units

Edit app/utils/rule_based.json:

{
  "variant_units": [
    "existing_units",
    "new_unit_1",
    "new_unit_2"
  ]
}

Adding New Keywords

{
  "positive_keywords": [
    "existing_keywords",
    "new_order_keyword"
  ]
}
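
Rather than hand-editing the JSON, new entries can be merged programmatically. A sketch, assuming the key names shown above; "galon" is a hypothetical new unit used only for illustration:

```python
import json

# Stand-in for the loaded rule_based.json content (real file has 33 units).
config = {"variant_units": ["kg", "ml", "cm"]}

new_units = ["galon"]  # hypothetical addition
# Deduplicate and keep a stable order before writing the file back.
config["variant_units"] = sorted(set(config["variant_units"]) | set(new_units))

print(json.dumps(config, indent=2))
```

In the real workflow the dict would be loaded from and written back to app/utils/rule_based.json with json.load/json.dump.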

Retraining the Model

  1. Prepare labeled dataset (BIO format)
  2. Fine-tune BERT using token classification
  3. Save best checkpoint to best_model_f1.pth
  4. Update i2w mapping if labels changed

For implementation details, see:

  • app/utils/word_classification.py - BERT model architecture
  • app/utils/rule_based.py - Rule-based processing
  • app/utils/intent_classifier.py - Intent classification
  • app/predict.py - Integration and pipeline