Installing and Using Label Studio
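Label Studio is an open-source, flexible data labeling tool. It supports many data types (text, images, audio, video, time series, and more) and many labeling tasks (classification, object detection, named entity recognition, sentiment analysis, and more). Its core strength is customizability: labeling templates can be defined through the visual interface or with XML/HTML-style configuration code to fit the needs of each project.
It provides collaboration features for distributed team labeling, with built-in quality control and progress tracking to keep annotations consistent and efficient. Label Studio also works with mainstream machine learning frameworks and can export annotation results directly in formats such as JSON and CSV for downstream model training.
Being open source, cross-platform (it can be deployed locally or used in the cloud), and rich in integration options (for example, with tools such as MLflow and TensorBoard), Label Studio is a strong choice for researchers and companies that need data labeling. It handles both simple classification tasks and complex multimodal labeling.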
Installing Label Studio
Option 1: install directly with pip
conda create --name label_env python=3.9
conda activate label_env
pip install label-studio
export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
export LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/label-studio/data/localdata
label-studio start --port 8008  # start the server on a specific port
Option 2: run with Docker
docker pull heartexlabs/label-studio:latest
docker run -it -p 8080:8080 -v $(pwd)/mydata:/label-studio/data \
--env LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true \
--env LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/label-studio/data/localdata heartexlabs/label-studio:latest
# If you run into permission problems
chmod -R 777 mydata/
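Annotating Different Task Types in Label Studio
For the input format of each task type, see the official template documentation, for example: https://labelstud.io/templates/generative-pairwise-human-preference
Text
Text classification
Computer vision
Semantic segmentation
Audio
Speech recognition
Generative AI
RLHF
When importing data, make sure the data format matches the labeling interface. The format is as follows: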
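As a minimal sketch of what matching the labeling interface means: each imported task is a JSON object whose `data` keys correspond to the `$variables` referenced in the labeling config. The key names `prompt`, `answer1`, and `answer2` and the sample rows below are assumptions modeled on a pairwise-preference style template; substitute the variables your own config uses.

```python
import json


def build_tasks(rows):
    """Wrap raw rows in the task structure used for import: one {"data": {...}} per task."""
    return [
        {
            "data": {
                # Each key must match a $variable in the labeling config (assumed names).
                "prompt": row["prompt"],
                "answer1": row["first"],
                "answer2": row["second"],
            }
        }
        for row in rows
    ]


# Hypothetical sample rows, for illustration only.
rows = [
    {"prompt": "What is Label Studio?", "first": "A labeling tool.", "second": "A text editor."},
]
tasks = build_tasks(rows)
tasks_json = json.dumps(tasks, ensure_ascii=False, indent=2)
print(tasks_json)
```

Saving `tasks_json` to a `.json` file gives a file that can be uploaded through the import dialog, provided the keys line up with the project's labeling config.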
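Data Export
Label Studio can export annotation results in a variety of formats, including JSON and CSV.
If the data to export is large, the Export button in the UI may stop responding. In that case, export directly from the command line: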
label-studio export <project-id> <export-format> --export-path=<output-path>
or, with debug logging enabled:
DEBUG=1 LOG_LEVEL=DEBUG label-studio export <project-id> <export-format> --export-path=<output-path>
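Once the export is on disk, a short script can summarize it. The sketch below assumes the JSON export format, where each task carries an `annotations` list and each annotation a `result` list, and that the project is a classification project whose results hold `choices`. The inline `sample_export` is made up for illustration; in practice you would load the exported file with `json.load`.

```python
from collections import Counter


def count_choices(exported_tasks):
    """Count how often each classification choice appears across all annotations."""
    counts = Counter()
    for task in exported_tasks:
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                for choice in item.get("value", {}).get("choices", []):
                    counts[choice] += 1
    return counts


# Hypothetical export content, standing in for json.load(open(<output-path>)).
sample_export = [
    {"id": 1, "data": {"text": "Great product"},
     "annotations": [{"result": [{"type": "choices", "value": {"choices": ["Positive"]}}]}]},
    {"id": 2, "data": {"text": "Broke after a day"},
     "annotations": [{"result": [{"type": "choices", "value": {"choices": ["Negative"]}}]}]},
    {"id": 3, "data": {"text": "Works well"},
     "annotations": [{"result": [{"type": "choices", "value": {"choices": ["Positive"]}}]}]},
]
label_counts = count_choices(sample_export)
print(dict(label_counts))
```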
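Pre-annotation
Pre-annotation means running an existing model over the data to be labeled first, so that annotators only need to review and adjust the proposed labels. This saves labor, and the backend service can also be used to analyze model output and perform simple model fine-tuning.
- Import a local pre-annotation file
Format references: https://www.breezedeus.com/article/label-studio-20230621
https://labelstud.io/guide/predictions
For NER annotation, for example, first run a general-purpose NER model over the text, then upload data in the following format: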
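A minimal sketch of building such a file for NER, following the predictions format described in the guide linked above. The tag names `label` and `text` (used for `from_name`, `to_name`, and the data key) are assumptions and must match the names in your own labeling config. The entity spans are hard-coded here in place of real model output.

```python
import json


def to_preannotated_task(text, entities, model_version="ner-baseline"):
    """Build one task with a prediction from (start, end, label) character spans."""
    result = [
        {
            "id": f"ent-{index}",
            "from_name": "label",   # assumed name of the <Labels> tag
            "to_name": "text",      # assumed name of the <Text> tag
            "type": "labels",
            "value": {
                "start": start,
                "end": end,
                "text": text[start:end],
                "labels": [label],
            },
        }
        for index, (start, end, label) in enumerate(entities)
    ]
    return {
        "data": {"text": text},
        "predictions": [{"model_version": model_version, "result": result}],
    }


# Hypothetical model output: character offsets and entity types.
sample_text = "Alice moved to Paris."
task = to_preannotated_task(sample_text, [(0, 5, "PER"), (15, 20, "LOC")])
print(json.dumps([task], ensure_ascii=False, indent=2))
```

The printed list can be saved as a JSON file and imported like any other task file; the predictions then appear as editable suggestions when the task is opened.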
- Connect a model for real-time pre-annotation
- A user opens the task.
- Label Studio sends the request to ML backend.
- The ML backend responds with its prediction.
- The prediction is loaded into the Label Studio UI and shown to the annotator.
Reference project: https://github.com/HumanSignal/label-studio-ml-backend
Example: real-time pre-annotation for object detection
Environment setup
# Get the project
git clone https://github.com/HumanSignal/label-studio-ml-backend.git
# Enter the project and install it (Python >= 3.10.0)
cd label-studio-ml-backend
pip install -e . -i https://pypi.tuna.tsinghua.edu.cn/simple
# Create a new backend service
label-studio-ml create my_ml_backend  # my_ml_backend is a service name of your choice
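After creation, the service has the following layout:
The main file to modify is model.py, where the predict and fit methods need to be overridden.
The YOLOv8 weights must be downloaded separately.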
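Configuring the connection to Label Studio
For the backend service to read images from Label Studio for prediction, it needs an access token, obtained as follows:
Open Account & Settings in the Label Studio UI.
This gives two settings:
LS_URL = "http://127.0.0.1:8008"
LS_API_TOKEN = "536064d55cf9556caf3ff910efa2de13f40c3529"
Both can be set in model.py.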
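As an alternative to hard-coding the token in model.py, the two values can be read from the environment. This is a sketch, not something the backend requires: the environment variable names `LS_URL` and `LS_API_TOKEN` simply mirror the names above, and `load_ls_settings` is a hypothetical helper.

```python
import os


def load_ls_settings(env=os.environ):
    """Read the Label Studio URL and token from a mapping (the process environment by default)."""
    url = env.get("LS_URL", "http://127.0.0.1:8008")
    token = env.get("LS_API_TOKEN")
    if not token:
        raise RuntimeError("LS_API_TOKEN is not set; copy it from Account & Settings")
    # Strip a trailing slash so the URL can be joined with paths like /data/...
    return url.rstrip("/"), token


# Illustration with an explicit mapping instead of the real environment.
ls_url, ls_token = load_ls_settings({"LS_URL": "http://127.0.0.1:8008/", "LS_API_TOKEN": "<your-token>"})
print(ls_url)
```

Keeping the token out of the source file also keeps it out of version control and out of shared write-ups.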
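Writing the prediction code in model.py
Adapt model.py to your own task and install whatever dependencies it needs. This example requires pip install ultralytics.
The class must be named NewModel. If you rename it, update the import statement in _wsgi.py as well.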
"""
@version: python3.9
@author: hcb
@software: PyCharm
@file: model.py
@time: 2025/4/23 13:39
"""
from typing import List, Dict, Optional
from label_studio_ml.model import LabelStudioMLBase
from label_studio_ml.response import ModelResponse
from label_studio_ml.utils import get_single_tag_keys, get_local_path
import requests, os
from ultralytics import YOLO
from PIL import Image
from io import BytesIO

# Configure the Label Studio URL and access token!
LS_URL = "http://127.0.0.1:8008"
LS_API_TOKEN = "536064d55cf9556caf3ff910efa2de13f40c3529"


class NewModel(LabelStudioMLBase):
    """Custom ML Backend model"""

    def setup(self):
        """Configure any parameters of your model here"""
        self.set("model_version", "0.0.1")
        self.from_name, self.to_name, self.value, self.classes = get_single_tag_keys(
            self.parsed_label_config, 'RectangleLabels', 'Image')
        # Point this at the YOLO weights you downloaded yourself!
        self.model = YOLO("/data/huangchangbin/projects/label-studio-ml-backend/label-studio-ml-backend/my_ml_backend/yolov8n.pt")
        self.labels = self.model.names

    def predict(self, tasks: List[Dict], context: Optional[Dict] = None, **kwargs) -> ModelResponse:
        task = tasks[0]
        # Alternative: download the image over HTTP instead of resolving a local path.
        # header = {"Authorization": "Token " + LS_API_TOKEN}
        # image = Image.open(BytesIO(requests.get(
        #     LS_URL + task['data']['image'], headers=header).content))
        url = task['data']['image']
        print(f'url is: {url}')
        # Pass the token explicitly so the image can be fetched from Label Studio.
        image_path = self.get_local_path(
            url=url, ls_host=LS_URL, ls_access_token=LS_API_TOKEN, task_id=task['id'])
        print(f'image_path: {image_path}')
        image = Image.open(image_path)
        original_width, original_height = image.size

        predictions = []
        score = 0
        results = self.model.predict(image, conf=0.5)
        for result in results:
            for i, prediction in enumerate(result.boxes):
                # YOLO returns pixel corners; Label Studio expects percentages of the image size.
                xyxy = prediction.xyxy[0].tolist()
                predictions.append({
                    "id": str(i),
                    "from_name": self.from_name,
                    "to_name": self.to_name,
                    "type": "rectanglelabels",
                    "score": prediction.conf.item(),
                    "original_width": original_width,
                    "original_height": original_height,
                    "image_rotation": 0,
                    "value": {
                        "rotation": 0,
                        "x": xyxy[0] / original_width * 100,
                        "y": xyxy[1] / original_height * 100,
                        "width": (xyxy[2] - xyxy[0]) / original_width * 100,
                        "height": (xyxy[3] - xyxy[1]) / original_height * 100,
                        "rectanglelabels": [self.labels[int(prediction.cls.item())]]
                    }
                })
                score += prediction.conf.item()
        print(f"Prediction Score is {score:.3f}.")

        final_prediction = [{
            "result": predictions,
            # Mean confidence over all boxes; 0.0 when nothing was detected.
            "score": score / len(predictions) if predictions else 0.0,
            "model_version": "v8n"
        }]
        return ModelResponse(predictions=final_prediction)

    def fit(self, event, data, **kwargs):
        """
        This method is called each time an annotation is created or updated.
        You can run your logic here to update the model and persist it to the cache.
        It is not recommended to perform long-running operations here, as it will block the main thread.
        Instead, consider running a separate process or a thread (like RQ worker) to perform the training.
        :param event: event type can be ('ANNOTATION_CREATED', 'ANNOTATION_UPDATED', 'START_TRAINING')
        :param data: the payload received from the event (check [Webhook event reference](https://labelstud.io/guide/webhook_reference.html))
        """
        # use cache to retrieve the data from the previous fit() runs
        old_data = self.get('my_data')
        old_model_version = self.get('model_version')
        print(f'Old data: {old_data}')
        print(f'Old model version: {old_model_version}')

        # store new data to the cache
        self.set('my_data', 'my_new_data_value')
        self.set('model_version', 'my_new_model_version')
        print(f'New data: {self.get("my_data")}')
        print(f'New model version: {self.get("model_version")}')
        print('fit() completed successfully.')
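The coordinate handling in predict is the step most likely to go wrong, so here it is in isolation. YOLO reports a box as pixel corners (x1, y1, x2, y2), while a Label Studio rectangle is a top-left corner plus width and height, all as percentages of the image size. The helper name `xyxy_to_ls_percent` and the sample numbers are illustrative only.

```python
def xyxy_to_ls_percent(xyxy, image_width, image_height):
    """Convert pixel corners (x1, y1, x2, y2) to Label Studio percent coordinates."""
    x1, y1, x2, y2 = xyxy
    return {
        "x": x1 / image_width * 100,            # left edge, % of image width
        "y": y1 / image_height * 100,           # top edge, % of image height
        "width": (x2 - x1) / image_width * 100,
        "height": (y2 - y1) / image_height * 100,
    }


# A 200 x 200 pixel box whose top-left corner is at (100, 50) in a 1000 x 500 image.
box = xyxy_to_ls_percent([100, 50, 300, 250], image_width=1000, image_height=500)
print(box)
```

For this sample the box starts 10% in from the left and top edges and covers 20% of the width and 40% of the height.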
Starting the backend service
Go back to the parent directory of my_ml_backend, then run:
label-studio-ml start my_ml_backend -p 8009
Connecting to Label Studio
First create an object detection project in Label Studio:
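The backend above looks up a `RectangleLabels` tag attached to an `Image` tag, and it emits the YOLO class names as labels, so the project's labeling config should contain both tags and list matching label values. The sketch below checks that with the standard library. The config string is an assumed example of a bounding-box config, and `model_names` stands in for `self.model.names` with two COCO classes.

```python
import xml.etree.ElementTree as ET

# Assumed example of an object detection labeling config.
LABEL_CONFIG = """
<View>
  <Image name="image" value="$image"/>
  <RectangleLabels name="label" toName="image">
    <Label value="person"/>
    <Label value="car"/>
  </RectangleLabels>
</View>
"""


def rectangle_labels(config_xml):
    """Return the label values declared under every RectangleLabels tag."""
    root = ET.fromstring(config_xml.strip())
    return [
        label.get("value")
        for rect in root.iter("RectangleLabels")
        for label in rect.findall("Label")
    ]


# Stand-in for YOLO's id -> name mapping (two COCO classes).
model_names = {0: "person", 2: "car"}
declared = rectangle_labels(LABEL_CONFIG)
unknown_to_model = [name for name in declared if name not in model_names.values()]
print(declared, unknown_to_model)
```

Labels the model predicts but the config does not declare are worth checking in the same way, since predictions with undeclared labels may not display as intended.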
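Then set up the model for pre-annotation in the project's Model settings and enter the backend service URL (http://127.0.0.1:8009 for the port used when starting the backend). Validate and Save checks the connection; once validation passes, the setup is complete. If it fails, adjust according to the error log of the backend.
Open a task to be labeled; pre-annotation runs automatically and the detection boxes appear.
If no detections are shown, check the backend log.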
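Configuring Local Storage for Data
Scenario: you have a batch of already-labeled speech-to-text or object detection data and need to import it in bulk as pre-annotations. The data must be stored at a local path, and the imported JSON file must be able to reference the files at that path directly.
Case 1: everything runs on the local machine, without Docker:
Step 1:
Configure Label Studio: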
export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
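export LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/data/huangchangbin/label_studio/local_data  # your local file path
Step 2:
Start Label Studio.
Then configure Cloud Storage in the project Settings:
Then upload the data, such as images or audio, into this subdirectory. When importing data, the file paths must use the following format: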
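Files served from local storage are referenced as `/data/local-files/?d=<path>`, where `<path>` is relative to `LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT`. The sketch below derives that reference from an absolute file path. The helper name `local_files_url` and the `images/1.jpg` file are hypothetical; the document root is the one exported in Step 1.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def local_files_url(file_path, document_root):
    """Build the /data/local-files/?d=... reference for a file under the document root."""
    # relative_to raises ValueError when the file is not inside the document root.
    relative = PurePosixPath(file_path).relative_to(document_root)
    return "/data/local-files/?d=" + quote(str(relative))


DOCUMENT_ROOT = "/data/huangchangbin/label_studio/local_data"
image_url = local_files_url(DOCUMENT_ROOT + "/images/1.jpg", DOCUMENT_ROOT)
task = {"data": {"image": image_url}}  # "image" must match the $image variable in the config
print(task)
```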
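A storage directory for the annotation results can be added in the same way.
Case 2: Label Studio is deployed with Docker. Paths must match the paths inside the container.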
docker-compose.yml
services:
  nginx:
    image: heartexlabs/label-studio:latest
    restart: unless-stopped
    ports:
      - "8086:8085"
      - "8087:8086"
    depends_on:
      - app
    environment:
      - LABEL_STUDIO_HOST=http://10.10.185.1:8086/foo
    volumes:
      - /data/label_studio/mydata:/label-studio/data:rw
    command: nginx
  app:
    stdin_open: true
    tty: true
    image: heartexlabs/label-studio:latest
    restart: unless-stopped
    expose:
      - "8000"
    environment:
      - LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/label-studio/data/localfile
      - LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
      - LABEL_STUDIO_HOST=http://10.10.185.1:8086/foo
      - JSON_LOG=1
      - LOG_LEVEL=DEBUG
    volumes:
      - /data/label_studio/mydata:/label-studio/data:rw
    command: label-studio-uwsgi
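When configuring the storage, use the path inside the Docker container (create directories and upload files either through the mounted mydata directory on the host or from inside the container):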
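To make the host and container views concrete, the sketch below translates a host path under the mounted directory into the path Label Studio sees inside the container. The two mount points are taken from the volumes entry in the compose file; the helper `host_to_container` and the `localfile/images` subdirectory are illustrative.

```python
from pathlib import PurePosixPath

HOST_MOUNT = "/data/label_studio/mydata"   # left side of the volumes entry
CONTAINER_MOUNT = "/label-studio/data"     # right side of the volumes entry


def host_to_container(host_path, host_mount=HOST_MOUNT, container_mount=CONTAINER_MOUNT):
    """Map a path under the host mount to the same location inside the container."""
    relative = PurePosixPath(host_path).relative_to(host_mount)
    return str(PurePosixPath(container_mount) / relative)


# Files copied here on the host ...
host_dir = "/data/label_studio/mydata/localfile/images"
# ... are configured in the storage settings using this container path.
container_dir = host_to_container(host_dir)
print(container_dir)
```

The resulting path sits under `/label-studio/data/localfile`, which is the document root set for the app service in the compose file.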