In this article, we'll learn how to build a text classification model using TensorFlow.js and Node.js. Text classification is a common task in natural language processing (NLP), the field concerned with analyzing and processing human language data.
We'll use a dataset of movie reviews from the Rotten Tomatoes website. The dataset contains the text of each review along with a label indicating whether the review is positive or negative. We'll use it to train a model that can read new reviews and predict whether they are positive or negative.
Before we begin, you'll need Node.js and npm installed.
First, we need to create a new Node.js project. Create a new directory for your project and initialize it with npm:
mkdir text-classification
cd text-classification
npm init -y
This creates a package.json file for your project. Next, we need to install the dependencies we'll be using: TensorFlow.js, a library for machine learning in JavaScript, and natural, which provides useful functions for working with human language data. (In Node.js you can also install @tensorflow/tfjs-node for faster, native-backed training; the code below works with either package.)
npm install --save @tensorflow/tfjs natural
With the dependencies installed, we can start coding. Create a new file called index.js in your project directory and add the following code:
const tf = require('@tensorflow/tfjs');
const natural = require('natural');
// TODO: Add code here
This imports TensorFlow.js and the natural library, which we'll use in the next sections.
Before we can train the model, we need to prepare the data. The movie review dataset lives in a file called reviews.csv, which you can download here; place it in your project directory.
Add the following code to index.js to load and prepare the data:
const fs = require('fs');
// Load the dataset. Each line of reviews.csv is assumed to hold a label
// (0 = negative, 1 = positive), a comma, and then the review text;
// adjust the parsing if your copy of the file is laid out differently.
const csv = fs.readFileSync('reviews.csv', 'utf8');
const rows = csv.trim().split('\n')
  .map(line => {
    const comma = line.indexOf(',');
    return {
      label: Number(line.slice(0, comma)),
      text: line.slice(comma + 1)
    };
  })
  .filter(row => !Number.isNaN(row.label)); // drops a header row if present
// Shuffle the data so the split below isn't ordered
tf.util.shuffle(rows);
// Split the data into training (80%) and test (20%) sets
const splitIndex = Math.floor(rows.length * 0.8);
const trainRows = rows.slice(0, splitIndex);
const testRows = rows.slice(splitIndex);
// Convert the labels to a one-hot encoding (two classes)
const trainLabels = tf.oneHot(trainRows.map(row => row.label), 2).toFloat();
const testLabels = tf.oneHot(testRows.map(row => row.label), 2).toFloat();
// Normalize a review: lowercase it, tokenize it into words,
// remove stopwords, and join the remaining tokens back into a string
const tokenizer = new natural.WordTokenizer();
function normalize(review) {
  const lowered = review.toLowerCase();
  const tokens = tokenizer.tokenize(lowered);
  const filtered = tokens.filter(token => !natural.stopwords.includes(token));
  return filtered.join(' ');
}
const trainDataNormalized = trainRows.map(row => normalize(row.text));
const testDataNormalized = testRows.map(row => normalize(row.text));
Let's look at what this code does:
- It loads the dataset of labeled movie reviews.
- It shuffles the data so the split that follows isn't ordered.
- It splits the data into training (80%) and test (20%) sets.
- It converts the 0/1 labels to a one-hot encoding, which the model's softmax output expects.
- It normalizes each review: lowercasing it, tokenizing it into words, removing common stopwords, and joining the remaining tokens back into a string.
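To make the normalization step concrete, here is a dependency-free sketch of the same pipeline. It uses a tiny hand-picked stopword list and a simple regex tokenizer instead of natural's full stopword list and WordTokenizer, so its output is only an approximation of what the real code produces:

```javascript
// Dependency-free sketch of the normalization pipeline.
// STOPWORDS is a tiny hand-picked list; the natural library ships a
// much larger one.
const STOPWORDS = new Set(['the', 'a', 'an', 'was', 'is', 'this', 'it']);

function normalizeSketch(review) {
  // lowercase, tokenize on word characters, drop stopwords, rejoin
  const tokens = review.toLowerCase().match(/[a-z0-9']+/g) || [];
  return tokens.filter(token => !STOPWORDS.has(token)).join(' ');
}

console.log(normalizeSketch('This movie was a GREAT surprise!'));
// → "movie great surprise"
```

Dropping stopwords like "the" and "was" shrinks each review to the words that are most likely to carry sentiment.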
Now that the data is ready, we can build the model. We'll use a bidirectional long short-term memory (BiLSTM) model, a type of recurrent neural network (RNN) that reads each sequence both forwards and backwards.
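To build some intuition for the bidirectional part, here is a toy recurrence (not an LSTM; a real LSTM keeps gated internal state). The point is only that running the same recurrence left-to-right and right-to-left produces different summaries of the sequence, so combining both gives the model context from each direction:

```javascript
// Toy recurrence: fold the token lengths into a single state value.
// Order matters, so the forward and backward passes end in different states.
function lastState(tokens) {
  return tokens.reduce((state, token) => (state * 31 + token.length) % 1e9, 0);
}

const toks = ['not', 'a', 'great', 'movie'];
const forward = lastState(toks);
const backward = lastState([...toks].reverse());
console.log(forward !== backward); // → true: direction changes the summary
```

A bidirectional LSTM concatenates the two directional states, so a word like "not" early in a review can influence the representation of words that come after it, and vice versa.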
Add the following code to index.js, below the data-preparation code, to build and train the model:
// Vectorize the normalized reviews. The embedding layer expects fixed-length
// sequences of integer word ids, so each review is hashed word-by-word into
// a vocabulary of 1000 buckets and padded (or truncated) to 100 tokens.
const VOCAB_SIZE = 1000;
const MAX_LEN = 100;
function vectorize(reviews) {
  const sequences = reviews.map(text => {
    const ids = text.split(' ').slice(0, MAX_LEN).map(word => {
      let hash = 0;
      for (let i = 0; i < word.length; i++) {
        hash = (hash * 31 + word.charCodeAt(i)) >>> 0;
      }
      return hash % VOCAB_SIZE;
    });
    while (ids.length < MAX_LEN) ids.push(0); // pad short reviews with 0
    return ids;
  });
  return tf.tensor2d(sequences, [reviews.length, MAX_LEN], 'int32');
}
// Build the model
const model = tf.sequential();
model.add(tf.layers.embedding({
  inputDim: VOCAB_SIZE,  // size of the hashed vocabulary
  outputDim: 32,         // dimension of each word vector
  inputLength: MAX_LEN   // tokens per review
}));
model.add(tf.layers.bidirectional({
  layer: tf.layers.lstm({ units: 32 })
}));
model.add(tf.layers.dense({
  units: 2,
  activation: 'softmax'
}));
model.compile({
  loss: 'categoricalCrossentropy',
  optimizer: 'adam',
  metrics: ['accuracy']
});
// Train the model
model.fit(vectorize(trainDataNormalized), trainLabels, {
  epochs: 5,
  validationData: [vectorize(testDataNormalized), testLabels]
}).then(() => {
  // Evaluate the model
  const results = model.evaluate(vectorize(testDataNormalized), testLabels);
  results.forEach(r => r.print()); // loss, then accuracy
});
This code does the following:
- It builds a sequential model: an embedding layer that maps word ids to dense vectors, a bidirectional LSTM layer that reads each sequence in both directions, and a dense softmax layer with two outputs (negative and positive).
- It compiles the model with categorical cross-entropy loss, the Adam optimizer, and accuracy as a metric.
- It trains the model for 5 epochs, using the test set as validation data.
- It evaluates the trained model on the test set and prints the results.
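One detail worth pausing on: an embedding layer consumes integer word ids, not raw strings. A simple way to get ids without building an explicit vocabulary is the hashing trick, where a deterministic hash maps every word into a fixed range. Here is a small sketch (the bucket count of 1000 matches the embedding layer's inputDim; collisions between distinct words are possible and are the price of this approach):

```javascript
// Hashing trick: deterministically map any word to an id in [0, VOCAB).
// Distinct words may collide in the same bucket; a larger VOCAB reduces this.
const VOCAB = 1000;

function wordId(word) {
  let hash = 0;
  for (let i = 0; i < word.length; i++) {
    hash = (hash * 31 + word.charCodeAt(i)) >>> 0; // unsigned 32-bit hash
  }
  return hash % VOCAB;
}

const ids = 'great movie'.split(' ').map(wordId);
console.log(ids.length); // → 2 (one id per word)
console.log(ids.every(id => id >= 0 && id < VOCAB)); // → true
console.log(wordId('great') === wordId('great')); // → true: deterministic
```

Because the hash is deterministic, the same word always lands in the same embedding row, at training time and at prediction time.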
Now that we have a trained model, we can use it to make predictions. Add the following code to index.js, at the end of the .then() callback after the evaluation step:
// Make a prediction: a new review must go through the same preprocessing
// (normalize + vectorize) as the training data before it reaches the model
const input = vectorize([normalize('this movie was great')]);
const prediction = model.predict(input);
prediction.print(); // probabilities for [negative, positive]
This code does the following:
- It passes a new review, "this movie was great", through the same preprocessing as the training data and then to the model.
- It prints the model's prediction: a probability for each of the two classes.
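The raw prediction is a pair of softmax probabilities, one per class. A small helper can turn that into a readable answer (the class order, index 0 for negative and index 1 for positive, is an assumption that must match how the labels were encoded):

```javascript
// Turn a softmax output like [0.1, 0.9] into a labeled prediction.
// Assumes index 0 = negative, index 1 = positive.
function interpret(probs) {
  const labels = ['negative', 'positive'];
  const best = probs.indexOf(Math.max(...probs));
  return { label: labels[best], confidence: probs[best] };
}

console.log(interpret([0.1, 0.9]));
// → { label: 'positive', confidence: 0.9 }
```

In TensorFlow.js you can get the plain probability array from a prediction tensor with `await prediction.data()` before passing it to a helper like this.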
In this article, we learned how to build a text classification model with TensorFlow.js and Node.js. We started by loading and preparing the data, then built and trained a model, and finally used the model to make predictions.
Text classification is a common task in natural language processing, and the same approach shown here (normalize the text, train on labeled examples, predict on new ones) carries over to many other text datasets.