NLU Survey
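Business scenario: task-oriented dialogue understanding with only a small amount of labeled data.
Three types of dialogue
- Question answering
- Task-oriented
- Chit-chat
1. Rule-based methods
1.1 Intent recognition
- Dictionary lookup
- CFG (context-free grammar)
- JSGF (JSpeech Grammar Format)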
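As a rough illustration of the dictionary approach listed above, the sketch below scores each intent by how many of its keywords occur in the utterance. The intent names, keywords, and the `match_intent` helper are all hypothetical and not taken from any tool in this survey.

```python
from collections import Counter

# Hypothetical keyword dictionary: intent name -> trigger keywords.
INTENT_KEYWORDS = {
    "book_flight": ["flight", "fly", "plane ticket"],
    "check_weather": ["weather", "rain", "temperature"],
}

def match_intent(utterance, keywords=INTENT_KEYWORDS):
    """Return the intent with the most keyword hits, or None if nothing matches."""
    text = utterance.lower()
    hits = Counter()
    for intent, words in keywords.items():
        # Plain substring test; a real system would tokenize first.
        hits[intent] = sum(1 for word in words if word in text)
    intent, count = hits.most_common(1)[0]
    return intent if count > 0 else None
```

Substring matching is the simplest possible choice here; it will also fire on words such as "butterfly", which is one reason grammar-based formats like CFG or JSGF are used when more precision is needed.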
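1.2 Named entity recognition
This requires building a dictionary of entity names.
- Aho-Corasick (AC) automaton
- Aho-Corasick automaton combined with DoubleArrayTrie for very fast multi-pattern matching
- Rule-based models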
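To make the dictionary-matching idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton that finds every dictionary entry in a text in a single pass. It uses plain dict-based trie nodes rather than a DoubleArrayTrie, and the class and method names are illustrative only.

```python
from collections import deque

class AhoCorasick:
    def __init__(self, words):
        # Node i is described by children[i], fail[i] and output[i]; node 0 is the root.
        self.children = [{}]
        self.fail = [0]
        self.output = [[]]
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.output.append([])
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.output[node].append(word)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure target.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                # Inherit the words that end at the failure target.
                self.output[child] = self.output[child] + self.output[self.fail[child]]
                queue.append(child)

    def find_all(self, text):
        """Return (start, end, word) for every dictionary hit; end is exclusive."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for word in self.output[node]:
                hits.append((i - len(word) + 1, i + 1, word))
        return hits
```

For example, `AhoCorasick(["北京", "南京", "北京大学"]).find_all("我想从南京去北京大学")` reports overlapping hits for both "北京" and "北京大学", so choosing between them (for instance, longest match wins) is left to the caller.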
2. Model-based methods
A dataset survey about task-oriented dialogue, including recent datasets and SoA results & papers.
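2.1 pipeline
The pipeline approach treats intent recognition and slot filling as two independent components that are trained separately.
2.1.1 Intent recognition
This is essentially short-text classification, so general-purpose text classification algorithms apply.
Traditional algorithms:
- LR
- SVM
- KNN
- RF
- GBDT
- …
Deep learning methods:
- fastText
- TextCNN
- GRU
- LSTM
- IDCNN
- TextRNN
Based on this survey, pretrained fastText word vectors combined with a single-layer TextCNN give a comparatively good trade-off between classification quality and speed, so this combination is the preferred choice.
Improvements on TextCNN:
- K-max pooling
- DPCNN
- …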
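The K-max pooling variant mentioned above can be illustrated on a single feature map. Instead of keeping only the largest activation (standard max-over-time pooling), it keeps the k largest activations in their original order. The function below is a plain-Python sketch for one list of values, not a framework layer.

```python
def k_max_pooling(values, k):
    """Keep the k largest activations, preserving their original order."""
    if k >= len(values):
        return list(values)
    # Indices of the k largest values, then restore sequence order.
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]
```

With `k=1` this reduces to ordinary max pooling; a larger `k` retains some information about how often and in what order a feature fires.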
2.1.2 Slot filling
- CRF
- RNN/LSTM/CNN+CRF
- BiLSTM+CRF
- BiLSTM+CNN+CRF
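What the CRF layer contributes in all of these models is decoding over whole tag sequences rather than per-token argmax. The sketch below shows Viterbi decoding for a linear-chain CRF in plain Python. The emission scores stand in for what a BiLSTM or CNN encoder would produce, and all numbers and tag names are made up for illustration.

```python
def viterbi(emissions, transitions, tags):
    """emissions[t][tag]: score of tag at step t; transitions[prev][cur]: score of prev -> cur."""
    # best[t][tag] is the best score of any path that ends in `tag` at step t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy example for the tokens "fly to new york" (all scores are invented).
TAGS = ["O", "B-city", "I-city"]
EMISSIONS = [
    {"O": 2.0, "B-city": 0.0, "I-city": 0.0},  # fly
    {"O": 2.0, "B-city": 0.0, "I-city": 0.0},  # to
    {"O": 1.0, "B-city": 0.8, "I-city": 0.0},  # new
    {"O": 0.0, "B-city": 0.5, "I-city": 2.0},  # york
]
TRANSITIONS = {
    "O": {"O": 0.0, "B-city": 0.0, "I-city": -100.0},  # I-city may not follow O
    "B-city": {"O": 0.0, "B-city": 0.0, "I-city": 1.0},
    "I-city": {"O": 0.0, "B-city": 0.0, "I-city": 0.5},
}
```

In this toy example a per-token argmax would label "new" as `O` and "york" as `I-city`, which is not a valid BIO sequence; the transition scores let the decoder prefer `B-city` for "new" instead.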
2.2 joint model
Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling
https://www.coursera.org/lecture/language-processing/intent-classifier-and-slot-tagger-nlu-RmVnE
The model mentioned in the third item there: Convolutional Sequence to Sequence Learning
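3. Industry practice
3.1 AliMe (Alibaba)
arXiv: AliMe Assist: An Intelligent Assistant for Creating an Innovative E-commerce Experience
Note: according to people inside the company, this framework is outdated and has been retired.
- Business rule parser: a trie-based matching structure built from a large number of patterns
- Intention classifier: scenario classification, using fastText for pre-training and a single-layer CNN for classification
  - requesting for assistance
  - asking for information or solution
  - chatting
- Semantic parser: trie-based; matches entities in the knowledge graph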
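3.2 Meituan
Reference: 美团对话理解技术及实践 (Meituan's dialogue understanding technology and practice)
It covers context-free grammars, tooling, and how the rules are written.
4. Data
- [Corpus] WebQA, Baidu's Chinese question-answering dataset
- SophonPlus/ChineseNlpCorpus
- candlewill/Dialog_Corpus: Datasets for Training Chatbot System (Chinese and English dialogue corpora)
- brightmart/nlp_chinese_corpus: Large Scale Chinese Corpus for NLP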
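5. Open-source tools
5.1 ChatterBot
ChatterBot has no NLU module. It is retrieval-based: the training input is a set of complete conversations, which are stored in a database.
Responses are produced by logic adapters:
- BestMatch
- TimeLogicAdapter
- MathematicalEvaluation
The framework mainly applies similarity matching to the question text to find a predefined answer in its store, which makes it a better fit for knowledge question answering.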
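The retrieval idea can be sketched without ChatterBot itself. The example below uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure and returns the stored answer of the most similar known question. The question/answer pairs, the `best_match` helper, and the 0.6 threshold are assumptions for illustration and are not ChatterBot's API.

```python
from difflib import SequenceMatcher

# Hypothetical stored question/answer pairs.
QA_PAIRS = {
    "what are your opening hours": "We are open from 9am to 6pm.",
    "how do i reset my password": "Use the 'Forgot password' link on the login page.",
}

def best_match(question, qa_pairs=QA_PAIRS, threshold=0.6):
    """Return the answer of the most similar stored question, or None below the threshold."""
    query = question.lower().strip()
    scored = [(SequenceMatcher(None, query, known).ratio(), answer)
              for known, answer in qa_pairs.items()]
    score, answer = max(scored)
    return answer if score >= threshold else None
```

Character-level similarity is cheap but purely surface-based, so paraphrases with different wording will fall below the threshold.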
5.2 rasa
Data
- Annotation tool: rasa-nlu-trainer
- Data generation tool: chatito
Intent recognition
- KeywordIntentClassifier: This classifier is mostly used as a placeholder. It is able to recognize hello and goodbye intents by searching for these keywords in the passed messages.
- MitieIntentClassifier: This classifier uses MITIE to perform intent classification. The underlying classifier is using a multi-class linear SVM with a sparse linear kernel.
- SklearnIntentClassifier: The sklearn intent classifier trains a linear SVM which gets optimized using a grid search. It requires a feature extractor earlier in the pipeline.
- EmbeddingIntentClassifier: The embedding intent classifier embeds user inputs and intent labels into the same space. Supervised embeddings are trained by maximizing similarity between them. This algorithm is based on StarSpace.
Entity recognition
- MitieEntityExtractor: The underlying classifier is using a multi-class linear SVM with a sparse linear kernel and custom features.
- SpacyEntityExtractor: Using spaCy this component predicts the entities of a message. spaCy uses a statistical BILOU transition model.
- EntitySynonymMapper: Maps synonymous entity values to the same value. The synonyms are supplied through the `value` field in the training data.
- CRFEntityExtractor: spaCy has to be installed. As far as I can tell, the CRF itself comes from sklearn-crfsuite, and spaCy is only needed for features such as part-of-speech tags.
- DucklingHTTPExtractor: Duckling lets you extract common entities like dates, amounts of money, distances, and others in a number of languages.
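Slot filling
References:
- GaoQ1/rasa-nlp-architect: Chinese intent extraction and slot filling for rasa-nlu, implemented with nlp-architect
- Building contextual assistants with Rasa Forms: original article and Chinese translation
All of these components can be replaced with custom ones; see Enhancing Rasa NLU models with Custom Components.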
5.3 DeepPavlov
deepmipt/DeepPavlov: 3.6k
An open source library for deep learning end-to-end dialog systems and chatbots. https://deeppavlov.ai
It supports English and Russian. It is feature-rich and worth studying as a reference.
Basic concepts
- Agent is a conversational agent communicating with users in natural language (text).
- Skill fulfills the user's goal in some domain. Typically, this is accomplished by presenting information or completing a transaction (e.g. answering a question by FAQ, booking tickets etc.). However, for some tasks the success of an interaction is defined as continuous engagement (e.g. chit-chat).
- Model is any NLP model that doesn't necessarily communicate with the user in natural language.
- Component is a reusable functional part of a Model or Skill.
- Rule-based Models cannot be trained.
- Machine Learning Models can be trained only stand alone.
- Deep Learning Models can be trained independently and in an end-to-end mode being joined in a chain.
- Skill Manager performs selection of the Skill to generate a response.
- Chainer builds an agent/model pipeline from heterogeneous components (Rule-based/ML/DL). It allows training and inference of models in a pipeline as a whole.
Models:
- NER model: BERT-based and Bi-LSTM+CRF
- Slot filling models
- Classification model
- Automatic spelling correction model
- Ranking model
- TF-IDF Ranker model
- Question Answering model
- Morphological tagging model
- Frequently Asked Questions (FAQ) model
Intent recognition
- BERT classifier: builds a BERT architecture for classification problems on TensorFlow.
- Keras classifier: builds a neural network on Keras with the TensorFlow backend.
- Sklearn classifier: builds most of the sklearn classifiers.
A wide range of models is available.
NER
- standard RNN based and BERT based.
- Multilingual BERT Zero-Shot Transfer
- Few-shot Language-Model based
Slot filling
Official documentation: Neural Named Entity Recognition and Slot Filling
This model solves Slot-Filling task using Levenshtein search and different neural network architectures for NER.
Slotfiller will perform fuzzy search through the all variations of all entity values of given entity type. The entity type is determined by the NER component.
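The fuzzy-search step described above can be sketched independently of DeepPavlov: given an entity type from NER and the raw mention, pick the canonical slot value whose known variation has the smallest Levenshtein distance. The slot vocabulary and the `fill_slot` helper are hypothetical, and this is not DeepPavlov's implementation.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

# Hypothetical slot vocabulary: entity type -> canonical value -> known variations.
SLOT_VALUES = {
    "cuisine": {
        "chinese": ["chinese", "china food"],
        "italian": ["italian", "pizza"],
    },
}

def fill_slot(entity_type, mention, slot_values=SLOT_VALUES):
    """Map a mention found by NER to the closest canonical value of its entity type."""
    candidates = [(levenshtein(mention.lower(), variation), canonical)
                  for canonical, variations in slot_values[entity_type].items()
                  for variation in variations]
    return min(candidates)[1]
```

This sketch always returns the nearest value; a production system would likely also reject mentions whose best distance is too large.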
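Usage blog posts: DeepPavlov articles with Python code
Writing rules
I only found support for dialogue rules: with PatternMatchingSkill, patterns and responses are written as regular expressions.
There is also a Rasa Skill that wraps rasa.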
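The regex-based pattern/response rules described above can be approximated in a few lines of standard-library Python. This is a generic sketch of the idea, not DeepPavlov's `PatternMatchingSkill` API; the rules, the `respond` function, and the fallback text are made up.

```python
import re

# Hypothetical rules: (compiled pattern, response template).
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.I), "Hello! How can I help?"),
    (re.compile(r"\bweather in (?P<city>\w+)", re.I), "Looking up the weather in {city}."),
]

def respond(utterance, rules=RULES, fallback="Sorry, I did not understand."):
    """Return the response of the first rule whose pattern matches the utterance."""
    for pattern, template in rules:
        match = pattern.search(utterance)
        if match:
            # Named groups in the pattern fill the placeholders in the template.
            return template.format(**match.groupdict())
    return fallback
```

Rules are tried in order, so more specific patterns should be listed before general ones.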
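Problems with DeepPavlov
- Environment dependencies
  - DeepPavlov is built on TensorFlow and Keras, so it cannot integrate model implementations from other frameworks such as PyTorch.
- Language support
  - The pretrained models and evaluation datasets are mainly English and Russian; Chinese is not supported.
- Production deployment
  - DeepPavlov depends on the entire framework source at runtime, so any change to the framework in development requires updating the whole framework in production.
  - A functional Component cannot be exported directly as a standalone service, which makes deployment and release in production awkward.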
5.4 Snips-nlu
Snips Python library to extract meaning from text https://snips-nlu.readthedocs.io
Chinese is not supported.
Tutorial: intents and slot values are both declared in the training data.
# turnLightOn intent
---
type: intent
name: turnLightOn
slots:
  - name: room
    entity: room
utterances:
  - Turn on the lights in the [room](kitchen)
  - give me some light in the [room](bathroom) please
  - Can you light up the [room](living room) ?
  - switch the [room](bedroom)'s lights on please
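To show how the inline `[slot](value)` annotations in these utterances carry both the text and the slot labels, here is a small sketch that splits an annotated utterance into plain text plus slot spans. It only handles the annotation form shown above and is not Snips' own dataset parser; `parse_utterance` is a hypothetical helper.

```python
import re

# Matches annotations of the form [slot_name](slot value).
SLOT_PATTERN = re.compile(r"\[(?P<slot>[^\]]+)\]\((?P<value>[^)]+)\)")

def parse_utterance(annotated):
    """Split an annotated utterance into plain text and a list of slot spans."""
    text, slots, cursor = "", [], 0
    for m in SLOT_PATTERN.finditer(annotated):
        text += annotated[cursor:m.start()]   # copy the text before the annotation
        start = len(text)
        text += m.group("value")              # keep only the slot value in the text
        slots.append({"slot": m.group("slot"), "value": m.group("value"),
                      "start": start, "end": len(text)})
        cursor = m.end()
    text += annotated[cursor:]
    return text, slots
```

The character offsets make it straightforward to derive token-level tags for training a sequence labeler such as a CRF.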
This parser parses text in two steps: first it classifies the intent using an IntentClassifier, and once the intent is known, it uses a SlotFiller to extract the slots.
IntentClassifier
- Logistic Regression
- Feature extractor for text classification relying on ngrams tfidf and optionally word cooccurrences features
  - scikit-learn TfidfVectorizer
  - Featurizer that takes utterances and extracts ordered word cooccurrence features matrix from them
SlotFiller
- Linear-Chain Conditional Random Fields
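5.5 Others
- A chatbot for the finance and legal domains (with some chit-chat capability)
- A dialogue system demo built on the latest version of rasa
- Building a Chinese natural language understanding system with RASA NLU
- crownpku/Awesome-Chinese-NLP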
Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting.