
Long-Sequence Features for Generative Recommendation Models: Offline Storage

Contents

    • Examples of long-sequence features
      • 1. Event-level features
      • 2. Sequence-level features
        • Aggregation Features
        • Session-based Features
        • Temporal Order Features
      • 3. User-level features
      • 4. Interaction features (between user and item/context)
    • How to store long-term user behavior sequence features in offline data lake storage?
    • How to efficiently update the behavior_sequence field when a user has new behavior?
  • References

Examples of long-sequence features

For example, a user’s sequence could look like this:
[Electronics, Clothing, Books, Home & Kitchen, Electronics, Books, Electronics, Sports & Outdoors, Electronics]
The interactions could be further refined by adding the type of behavior (e.g., [Electronics:view, Clothing:click, Books:purchase, Home & Kitchen:add_to_cart, …]).

1. Event-level features

Categorical Encoding: Convert event types (e.g., “click”, “add to cart”, “purchase”, “view”) or item categories into numerical representations using techniques like one-hot encoding or embedding methods.
Temporal Features: Extract time-based features from timestamps, such as hour of day, day of the week, month, and time elapsed since previous interaction.
Interaction-Specific Features: Capture attributes specific to each interaction, like product price, rating, duration of video watched, etc.
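The three kinds of event-level features above can be sketched for a single event as follows. This is a minimal illustration, not a reference implementation: the EVENT_TYPES vocabulary, the dictionary keys, and the sample events are assumptions made for the example.

```python
from datetime import datetime, timezone

# Hypothetical vocabulary of event types; a real system would derive it from data.
EVENT_TYPES = ["view", "click", "add_to_cart", "purchase"]


def one_hot(event_type):
    """Return a one-hot vector over EVENT_TYPES (all zeros for unknown types)."""
    return [1 if event_type == t else 0 for t in EVENT_TYPES]


def event_features(event, prev_event=None):
    """Build a flat feature dict for one interaction event."""
    ts = event["timestamp"]
    return {
        # Categorical encoding of the event type.
        "action_one_hot": one_hot(event["action_type"]),
        # Temporal features extracted from the timestamp.
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),  # Monday = 0
        "month": ts.month,
        # Seconds since the previous interaction, or None for the first event.
        "seconds_since_prev": (
            (ts - prev_event["timestamp"]).total_seconds() if prev_event else None
        ),
        # Interaction-specific attribute.
        "price": event["price"],
    }


first = {"timestamp": datetime(2024, 5, 6, 9, 30, tzinfo=timezone.utc),
         "action_type": "view", "price": 19.9}
second = {"timestamp": datetime(2024, 5, 6, 9, 45, tzinfo=timezone.utc),
          "action_type": "purchase", "price": 19.9}
feats = event_features(second, prev_event=first)
```

One-hot encoding is used here only because it needs no training; with many item categories, a learned embedding table is usually the more practical choice.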

2. Sequence-level features

Aggregation Features

Count of specific events: Number of clicks, purchases, or searches in the past week.
Average value of numerical features: Average product price of items viewed or purchased.
Time-based statistics: Maximum, minimum, or average time between consecutive interactions.
Frequency of interactions: Number of interactions per hour or day.
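As a rough sketch, the aggregation features listed above could be computed over a trailing window like this. The seven-day window, the field names, and the sample events are assumptions for illustration; events are assumed to be sorted by time.

```python
from datetime import datetime, timedelta
from statistics import mean


def aggregate_features(events, now, window=timedelta(days=7)):
    """Summarize a time-ordered list of events over a trailing window."""
    recent = [e for e in events if now - e["timestamp"] <= window]

    # Count of specific events (clicks, purchases, ...).
    counts = {}
    for e in recent:
        counts[e["action_type"]] = counts.get(e["action_type"], 0) + 1

    # Gaps between consecutive interactions, in seconds.
    gaps = [
        (b["timestamp"] - a["timestamp"]).total_seconds()
        for a, b in zip(recent, recent[1:])
    ]

    return {
        "event_counts": counts,
        "avg_price": mean(e["price"] for e in recent) if recent else None,
        "max_gap_s": max(gaps) if gaps else None,
        "min_gap_s": min(gaps) if gaps else None,
        "avg_gap_s": mean(gaps) if gaps else None,
        # Interaction frequency, expressed as events per day of the window.
        "events_per_day": len(recent) / (window.total_seconds() / 86400),
    }


now = datetime(2024, 5, 8, 12, 0)
events = [
    # This first event is older than seven days and falls outside the window.
    {"timestamp": datetime(2024, 4, 20, 10, 0), "action_type": "view", "price": 5.0},
    {"timestamp": datetime(2024, 5, 6, 10, 0), "action_type": "view", "price": 10.0},
    {"timestamp": datetime(2024, 5, 6, 11, 0), "action_type": "click", "price": 20.0},
    {"timestamp": datetime(2024, 5, 7, 11, 0), "action_type": "purchase", "price": 30.0},
]
agg = aggregate_features(events, now)
```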

Session-based Features

Session length: Number of events or duration of the session.
Session activity type: Percentage of clicks, purchases, or searches within the session.
Sequence of items/events within a session: Representing the order of actions taken by the user, for example, viewing product A, then B, then adding B to the cart.
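Session features presuppose a way to cut the event stream into sessions. The sketch below assumes a common heuristic, starting a new session after 30 minutes of inactivity; that threshold, the field names, and the sample data are assumptions, not something the text above prescribes.

```python
from datetime import datetime, timedelta


def split_sessions(events, max_gap=timedelta(minutes=30)):
    """Split time-ordered events into sessions at gaps longer than max_gap."""
    sessions = []
    for e in events:
        if sessions and e["timestamp"] - sessions[-1][-1]["timestamp"] <= max_gap:
            sessions[-1].append(e)
        else:
            sessions.append([e])
    return sessions


def session_features(session):
    """Compute length, activity mix, and ordered steps for one non-empty session."""
    n = len(session)
    actions = [e["action_type"] for e in session]
    return {
        "num_events": n,
        "duration_s": (session[-1]["timestamp"] - session[0]["timestamp"]).total_seconds(),
        # Share of each action type within the session.
        "action_share": {a: actions.count(a) / n for a in set(actions)},
        # Ordered (item, action) steps, e.g. view A, view B, add B to cart.
        "ordered_steps": [(e["product_id"], e["action_type"]) for e in session],
    }


events = [
    {"timestamp": datetime(2024, 5, 6, 9, 0), "product_id": "A", "action_type": "view"},
    {"timestamp": datetime(2024, 5, 6, 9, 5), "product_id": "B", "action_type": "view"},
    {"timestamp": datetime(2024, 5, 6, 9, 7), "product_id": "B", "action_type": "add_to_cart"},
    {"timestamp": datetime(2024, 5, 6, 14, 0), "product_id": "C", "action_type": "view"},
]
sessions = split_sessions(events)
first_session = session_features(sessions[0])
```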

Temporal Order Features

Lag features: Features from previous interactions (e.g., the last item viewed, the type of the second-to-last event). GeeksforGeeks notes that lag features are a fundamental technique for time-series data.
Positional embeddings: Add positional information to sequence embeddings to capture the order of events.
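Both temporal order features can be illustrated briefly. The lag helper below pads with None when the history is too short. The positional function uses the fixed sinusoidal scheme known from Transformer models, which is only one option; learned position embeddings are an equally valid alternative. Function names and field names are assumptions for this sketch.

```python
import math


def lag_features(events, num_lags=2):
    """Fields of the last `num_lags` events, most recent first, padded with None."""
    feats = {}
    for k in range(1, num_lags + 1):
        e = events[-k] if len(events) >= k else None
        feats[f"lag{k}_product_id"] = e["product_id"] if e else None
        feats[f"lag{k}_action_type"] = e["action_type"] if e else None
    return feats


def sinusoidal_position(pos, dim):
    """Fixed sinusoidal encoding for one position; `dim` is assumed to be even."""
    vec = []
    for i in range(0, dim, 2):
        # Each sin/cos pair shares one frequency; frequencies decay with i.
        angle = pos / (10000 ** (i / dim))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec


history = [
    {"product_id": "A", "action_type": "view"},
    {"product_id": "B", "action_type": "click"},
]
lags = lag_features(history, num_lags=3)
position_zero = sinusoidal_position(0, 4)
```

The positional vector is typically added element-wise to the event embedding at the same position, so the model can distinguish otherwise identical events by where they occur in the sequence.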

3. User-level features

Long-Term Preference Features: Summarize user preferences over a long period:
Most frequently purchased categories: Top categories a user interacts with.
Overall spending patterns: Average purchase value, total purchases, etc.
Average interaction count: Average number of interactions per day or week.
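User Embeddings: Dense vectors learned from a user's full interaction history that summarize long-term preferences.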
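The long-term preference summaries can be sketched with a few lines of counting. The example reuses the category sequence from the start of this article; the prices, the "day" index, and the choice of averaging over active days are assumptions made for illustration.

```python
from collections import Counter


def user_profile(events, top_k=2):
    """Summarize long-term preferences from a user's full event history."""
    categories = Counter(e["category"] for e in events)
    purchases = [e["price"] for e in events if e["action_type"] == "purchase"]
    active_days = {e["day"] for e in events}
    return {
        "top_categories": [c for c, _ in categories.most_common(top_k)],
        "total_purchases": len(purchases),
        "avg_purchase_value": sum(purchases) / len(purchases) if purchases else None,
        # Average interaction count, here per day on which the user was active.
        "avg_events_per_active_day": len(events) / len(active_days) if active_days else 0.0,
    }


sequence = ["Electronics", "Clothing", "Books", "Home & Kitchen", "Electronics",
            "Books", "Electronics", "Sports & Outdoors", "Electronics"]
# Hypothetical events: Books interactions are purchases, three events per day.
events = [
    {"category": c, "action_type": "purchase" if c == "Books" else "view",
     "price": 10.0 * (i + 1), "day": i // 3}
    for i, c in enumerate(sequence)
]
profile = user_profile(events)
```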

4. Interaction features (between user and item/context)

User-Item Similarity: Calculate the similarity between the current item and previous items the user interacted with.
Time Since Last Interaction with Item: Capturing recency of interest in a particular item.
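A minimal sketch of both interaction features follows. It assumes item embeddings are already available, represents the user's history as the mean of those embeddings, and uses cosine similarity; all three choices, along with the toy vectors and epoch-second timestamps, are assumptions for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def seconds_since_last(events, product_id, now):
    """Seconds since the user's latest event on product_id, or None if never seen."""
    times = [e["timestamp"] for e in events if e["product_id"] == product_id]
    return now - max(times) if times else None


# Hypothetical 2-d item embeddings and epoch-second timestamps.
item_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
history = [
    {"product_id": "A", "timestamp": 1000},
    {"product_id": "B", "timestamp": 1600},
    {"product_id": "A", "timestamp": 2200},
]
history_vector = mean_vector([item_vectors[e["product_id"]] for e in history])
similarity = cosine(history_vector, item_vectors["C"])  # candidate item C
recency = seconds_since_last(history, "A", now=3000)
```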

How to store long-term user behavior sequence features in offline data lake storage?

  1. Schema design: see the schema below
  2. File formats: columnar formats (Parquet or ORC)
  3. Partitioning strategies: date-based partitioning, user ID/device ID partitioning
  4. Data ingestion and processing: batch ingestion, data enrichment and transformation
  5. Lifecycle management and cost optimization: retention policies
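To make the partitioning and retention items more concrete, here is a small sketch that combines a date partition with a hashed user-ID bucket and checks a partition against a retention window. The base path, the bucket count of 64, the 90-day retention, and the Hive-style `key=value` directory names are all assumptions chosen for illustration.

```python
import hashlib
from datetime import date

NUM_BUCKETS = 64  # assumed bucket count; tune to data volume


def user_bucket(user_id, num_buckets=NUM_BUCKETS):
    """Map a user ID to a stable bucket (the built-in hash() is not stable across runs)."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def partition_path(base, event_date, user_id):
    """Build a Hive-style partition directory: date first, then user bucket."""
    return f"{base}/dt={event_date.isoformat()}/bucket={user_bucket(user_id):02d}"


def is_expired(partition_date, today, retention_days=90):
    """True if a date partition is older than the retention window."""
    return (today - partition_date).days > retention_days


# Hypothetical base location of the behavior table.
path = partition_path("s3://example-lake/user_behavior", date(2024, 5, 6), "u-123")
expired = is_expired(date(2024, 1, 1), today=date(2024, 5, 6))
```

Hashing the user ID into a fixed number of buckets avoids creating one directory per user, which would produce far too many small files in a columnar store.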

Schema design:

user_id: string
name: string
gender: string
behavior_sequence: array<struct<timestamp: timestamp, category_id: int, action_type: string, product_id: string, price: double>>
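For readers who prefer to see the schema as application-side types, the following sketch mirrors it with Python dataclasses. The class names Behavior and UserRecord and the sample values are hypothetical; only the field names and types come from the schema above.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Behavior:
    """One element of behavior_sequence (the struct in the schema)."""
    timestamp: datetime
    category_id: int
    action_type: str
    product_id: str
    price: float


@dataclass
class UserRecord:
    """One row of the table: user attributes plus the nested behavior array."""
    user_id: str
    name: str
    gender: str
    behavior_sequence: List[Behavior] = field(default_factory=list)


record = UserRecord(
    user_id="u-1",
    name="Ada",
    gender="F",
    behavior_sequence=[
        Behavior(datetime(2024, 5, 6, 9, 30), 12, "view", "p-1", 19.9),
        Behavior(datetime(2024, 5, 6, 9, 45), 12, "purchase", "p-1", 19.9),
    ],
)
row = asdict(record)  # nested dicts and lists, ready for a columnar writer
```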

How to efficiently update the behavior_sequence field when a user has new behavior?

Merge Operations (Upserts/MERGE SQL): A single MERGE statement appends the new behavior to the behavior_sequence of an existing user_id row and inserts a new row for a user that is not yet in the table.

MERGE INTO target_delta_table AS target
USING source_data AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN
  UPDATE SET target.behavior_sequence = array_append(target.behavior_sequence, source.new_behavior)
WHEN NOT MATCHED THEN
  INSERT (user_id, name, behavior_sequence)
  VALUES (source.user_id, source.name, array(source.new_behavior))
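The semantics of the MERGE can be mimicked in memory to see what it should produce. The sketch below is an analogue in plain Python, not the engine's implementation. It also groups new behaviors by user before merging, because MERGE implementations such as Delta Lake's generally reject a batch in which several source rows match the same target row; treat that as something to verify for your engine. The max_len truncation is an optional extra that the SQL above does not perform, and all names here are hypothetical.

```python
def merge_behaviors(target, source_rows, max_len=None):
    """In-memory analogue of the MERGE: append for matched users, insert otherwise."""
    # Group new behaviors by user so each user row is touched exactly once.
    grouped = {}
    for row in source_rows:
        entry = grouped.setdefault(row["user_id"], {"name": row["name"], "new": []})
        entry["new"].append(row["new_behavior"])

    for user_id, entry in grouped.items():
        new = sorted(entry["new"], key=lambda b: b["timestamp"])
        if user_id in target:
            # WHEN MATCHED: append to the existing sequence.
            seq = target[user_id]["behavior_sequence"] + new
        else:
            # WHEN NOT MATCHED: insert a new user row.
            target[user_id] = {"user_id": user_id, "name": entry["name"],
                               "behavior_sequence": []}
            seq = new
        # Optional cap on sequence length (assumption, not part of the SQL).
        target[user_id]["behavior_sequence"] = seq[-max_len:] if max_len else seq
    return target


table = {
    "u1": {"user_id": "u1", "name": "Ada",
           "behavior_sequence": [{"timestamp": 1, "action_type": "view"}]},
}
batch = [
    {"user_id": "u1", "name": "Ada", "new_behavior": {"timestamp": 3, "action_type": "purchase"}},
    {"user_id": "u1", "name": "Ada", "new_behavior": {"timestamp": 2, "action_type": "click"}},
    {"user_id": "u2", "name": "Bo", "new_behavior": {"timestamp": 2, "action_type": "view"}},
]
merge_behaviors(table, batch)
```

In SQL, the same pre-grouping would mean collecting each user's new behaviors into one array in the source query and concatenating arrays in the UPDATE clause, rather than appending one element per source row.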

References

Conversation log with Google

