Build a Classification Model in 30 Minutes
As the field of machine learning continues to advance, the demand for high-quality labeled data has become more important than ever. However, the process of data labeling can be time-consuming, which can significantly impact the quality of machine learning models. Fortunately, there's a solution: Datasaur, a powerful data labeling platform that allows you to label your datasets with ease and accuracy. But that's not all: with Datasaur, you can also take your labeled data and use it to build powerful machine learning models on Hugging Face.
The best part? These models can be seamlessly integrated back into Datasaur's labeling interface, allowing for automated and efficient labeling of new datasets. In this blog, we'll explore the exciting world of Datasaur and Hugging Face, and show you how they can work together to take your machine learning projects to the next level.
Ready to get started? In this step-by-step guide, we'll walk you through the entire process of using Datasaur and Hugging Face to create your own machine learning models. We'll cover how to do the following:
- upload your datasets to Datasaur
- use the labeled data to train a model on Hugging Face
- integrate the model back into Datasaur for automated labeling
So, let's dive in and see how Datasaur and Hugging Face can revolutionize the way you approach data labeling and machine learning.
Goal
- Label news data by category with Datasaur
- Train a model with Hugging Face AutoTrain
- Predict labels for unlabeled data with Datasaur ML-assisted labeling using the Hugging Face provider
Dataset
We retrieved our datasets from the following sources:
In this example, we use a raw dataset of 250 rows. The label set contains the following labels:
- business
- entertainment
- politics
- sports
- technology
Here is a copy of the unlabeled data; this is a copy of the labeled dataset.
Step by step
Labeling with Datasaur
1. Prepare your preferred dataset.
a. We will use the following dataset for this tutorial: Dataset link
2. Log in to your Datasaur account. We recommend you use the team workspace.
3. Go to your desired team workspace. Once you have stored your data from the previous step, you can create a project with the DOC type from the Datasaur project template. It will give you automated settings for a classification project.
4. To start the data labeling process, upload your data in the uploader.
5. After uploading your data, you can preview your dataset. If your data already has headers, you can convert the first row to a header using the available settings.
6. To set your labeler's task, you need to add a question set. The question set defines the goals for your labeling project.
7. When working on a data labeling project with a team, it's important to assign responsibilities clearly to ensure efficient collaboration. In Datasaur, you can assign team members as either labelers or reviewers based on their roles and responsibilities in the project.
8. This is the last step of project creation. The project is now ready to be launched.
9. You are now ready to annotate the data. Datasaur understands the importance of efficient labeling and offers an assisted labeling feature. You can find more details on Data programming and ML-Assisted Labeling.
10. After you have finished labeling your data in Datasaur, you can export it to Hugging Face's native format. Open the "File" menu and go to the Export submenu, then choose "Hugging Face" from the dropdown of export options. This starts the export, which produces a file in Hugging Face's format that you can use for training and deploying machine learning models.
11. Your dataset is now fully prepared and ready to be used in your machine learning project.
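Before uploading the export to AutoTrain, it can help to sanity-check it. The sketch below is only an illustration, assuming the export is a CSV file with a `Text` column and a `Category` column (the column names used in the AutoTrain mapping step later in this guide). The function name `check_labeled_rows` and the sample rows are made up for the example; the label set is the one listed in the Dataset section.

```python
import csv
import io
from collections import Counter

# Label set from the Dataset section of this tutorial.
LABELS = {"business", "entertainment", "politics", "sports", "technology"}


def check_labeled_rows(csv_file, text_col="Text", label_col="Category"):
    """Return (label counts, list of problems) for a labeled CSV file object."""
    counts = Counter()
    problems = []
    # start=2 because row 1 of the file is the header row.
    for row_no, row in enumerate(csv.DictReader(csv_file), start=2):
        text = (row.get(text_col) or "").strip()
        label = (row.get(label_col) or "").strip()
        if not text:
            problems.append(f"row {row_no}: empty text")
        if label not in LABELS:
            problems.append(f"row {row_no}: unexpected label {label!r}")
        else:
            counts[label] += 1
    return counts, problems


# Tiny made-up sample standing in for the real export (assumed CSV layout).
sample = io.StringIO(
    "Text,Category\n"
    "Shares rallied after the earnings call,business\n"
    "The striker scored twice in the final,sports\n"
    "New phone ships with a faster chip,tech\n"
)
counts, problems = check_labeled_rows(sample)
print(dict(counts))
print(problems)
```

To run it on a real file, pass `open("your_export.csv", newline="", encoding="utf-8")` instead of the in-memory sample, adjusting the column names if your export differs.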
Building model with Hugging Face AutoTrain
Now that you have successfully labeled your data with Datasaur and exported it in Hugging Face's format, the next step is to build a machine learning model with Hugging Face. By leveraging Hugging Face's AutoTrain feature, you can easily create a model in under an hour.
1. To get started with building your machine learning model, you need to go to the Hugging Face website and create a new project. This can be done by visiting the following link: https://ui.autotrain.huggingface.co/projects. Once you have arrived at this page, you will be prompted to create a new project.
2. Set up a new project with the following settings:
- Task : Text Classification (Multi-class)
- Model choice : Manual
- Selected Model : distilbert-base-uncased
3. The next step in the process is to upload your labeled data from Datasaur to Hugging Face.
4. After successfully uploading the new data, you need to set the split type and map the data columns into the project. In this case, we will use the Auto split type. Use the settings below.
- Select split type : Auto
- Data columns mapping
- text : Text
- target : Category
5. You now have everything you need for training. Now it is time to start the model training.
6. It is possible to run multiple trainings simultaneously, each with different hyperparameters. As a result, the performance of each training will vary, which enables us to select the best performing model among them.
The training should take around 10 minutes.
Progress
7. The trainings are now complete. You can view detailed metrics on this page.
8. To view the best model, click on the card. The card contains all the details about the model, including its performance metrics, hyperparameters, and other relevant information. By clicking on the card, you can gain a better understanding of how the model was built and how it can be used for your specific machine learning project.
9. You can also view the model on the Model Hub. In this case, we have a sample model that can be accessed through this link.
10. Voila! You now have a model you can use for inference.
You can use cURL to access this model:
$ curl -X POST -H "Authorization: Bearer YOUR_API_KEY" -H "Content-Type: application/json" -d '{"inputs": "I love AutoTrain"}' https://api-inference.huggingface.co/models/Saripudin/autotrain-bbc-news-classifier-3523995259
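If you would rather send the same HTTP request from Python, here is a minimal sketch using only the standard library. It mirrors the cURL command above (same URL, headers, and JSON body); the helper names `build_request` and `classify` are made up for this example, and `YOUR_API_KEY` is the same placeholder as in the cURL command.

```python
import json
import urllib.request

# Model URL taken from the cURL command above.
API_URL = (
    "https://api-inference.huggingface.co/models/"
    "Saripudin/autotrain-bbc-news-classifier-3523995259"
)


def build_request(text, api_key):
    """Build the same POST request that the cURL command sends."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def classify(text, api_key, timeout=30):
    """Send the request and return the decoded JSON response."""
    request = build_request(text, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("I love AutoTrain", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
# Uncomment with a real access token and network access:
# print(classify("I love AutoTrain", "YOUR_API_KEY"))
```

Building the request separately from sending it makes it easy to inspect what will be sent before spending an API call.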
Python API:
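# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
# use_auth_token=True reads the token saved by `huggingface-cli login`;
# newer transformers releases name this argument `token`.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("Saripudin/autotrain-bbc-news-classifier-3523995259", use_auth_token=True)
tokenizer = AutoTokenizer.from_pretrained("Saripudin/autotrain-bbc-news-classifier-3523995259", use_auth_token=True)
# Tokenize one example into PyTorch tensors and run a forward pass.
inputs = tokenizer("I love AutoTrain", return_tensors="pt")
outputs = model(**inputs)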
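The `outputs` object above holds raw scores (logits), one per class, rather than a category name. The sketch below shows one common way to turn such scores into a label and a probability using softmax. It is illustrative only: the logits are made up, and the index-to-label order in `ID2LABEL` is an assumption; with the real model you would read the mapping from `model.config.id2label`.

```python
import math

# Assumed index-to-label order; with a real model, use model.config.id2label.
ID2LABEL = {
    0: "business",
    1: "entertainment",
    2: "politics",
    3: "sports",
    4: "technology",
}


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits for one input; with the real model this would be
# outputs.logits[0].tolist().
label, prob = top_prediction([0.2, -1.3, 0.1, 3.4, -0.5])
print(label, round(prob, 3))
```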
(Optional) Use the model in Datasaur ML-Assisted Labeling
One of the key features of Datasaur is its ability to seamlessly integrate with external models to help with the labeling process. With Datasaur's extension, you can easily plug in external models and take advantage of their capabilities to label your project more efficiently and accurately. Please refer to our documentation on this page.
Below are the steps to use the model in a Datasaur project.
1. To use the project settings from a previous labeling project, clone the project by going to the corresponding project and selecting "Use Project Settings". Then, in the first step, upload this data. After updating the project settings accordingly, the project will be ready to launch.
2. To begin, open the project and navigate to the "Manage Extensions" menu, which can be accessed via the gear icon.
3. Set the Service provider as Hugging Face. We can use the model we trained earlier from here.
4. Next, we need to set the Model name and API Token. You can follow the instructions below to do so.
- Model Name: Saripudin/autotrain-bbc-news-classifier-3523995259
- API Token: Access tokens from HF account settings
5. To use ML-assisted labeling, click on "Predict Labels." The predicted labels will then automatically appear in the "Category" column.
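Conceptually, the "Predict Labels" step runs the model over each unlabeled row and writes the result into the "Category" column. The sketch below illustrates that idea only; it is not Datasaur's implementation. The function `fill_missing_labels` and the toy `keyword_model` are hypothetical stand-ins, and the choice to leave already-labeled rows untouched is an assumption made for the example.

```python
def fill_missing_labels(rows, predict, text_col="Text", label_col="Category"):
    """Fill empty label cells with model predictions; keep existing labels."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not mutated
        if not row.get(label_col):
            row[label_col] = predict(row[text_col])
        filled.append(row)
    return filled


def keyword_model(text):
    """Hypothetical stand-in for the trained classifier."""
    return "sports" if "match" in text.lower() else "business"


rows = [
    {"Text": "The match ended in a draw", "Category": ""},
    {"Text": "Parliament passed the bill", "Category": "politics"},
]
for row in fill_missing_labels(rows, keyword_model):
    print(row["Category"], "-", row["Text"])
```

In practice `predict` would wrap a call to the trained Hugging Face model instead of the keyword rule.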
Voila! In just 20 minutes, we can predict our labels with our own models! By utilizing Datasaur's labeling interface, along with Hugging Face's powerful machine learning capabilities, you can save valuable time and resources while achieving highly accurate results. Datasaur's intuitive and easy-to-use interface, along with its seamless integration with Hugging Face, makes the process of labeling data and building machine learning models an absolute breeze. Whether you're tackling a complex data labeling project or simply looking to streamline your machine learning workflow, Datasaur is the ultimate platform for your success.
Try it out for yourself and see how it can transform the way you work with data. Happy labeling!