
Extracting domain names with Python

news 2025/9/19 11:39:44

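While building an AI workflow, I needed to extract domain names from user input. The options were Python or LLM-based parameter extraction. To keep LLM token consumption down, I went with Python, which is both faster and cheaper. User input varies widely: some users enter a full URL, some enter a numeric domain, and in the oddest case the domain is immediately followed by a month number, which caused it to be misrecognized. Below is the final version after several rounds of improvement. If you find a bug, please leave a comment.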

import re
from typing import Dict, Union


def main(arg1: str) -> Dict[str, Union[str, int]]:
    # 1. Corrected domain regex.
    #    (?![a-zA-Z.]) forbids a letter or "." directly after the match;
    #    digits, Chinese characters, and symbols are all allowed there.
    domain_pattern = re.compile(
        r"(?<![a-zA-Z0-9-])"  # must not be preceded by a domain character
        r"(?:[a-zA-Z0-9-]+\.)+"  # at least one "label." segment
        r"[a-zA-Z]{2,}"  # top-level domain: 2 or more letters
        r"(?![a-zA-Z.])"  # must not be followed by a letter or "."
    )

    # 2. Blacklist of common file extensions
    file_extensions = {
        ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".pdf",
        ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
        ".txt", ".zip", ".rar", ".php", ".html", ".htm",
        ".css", ".js", ".xml", ".json", ".sql", ".log",
    }

    # 3. Collect all candidate domains
    candidates = [m.group(0) for m in domain_pattern.finditer(arg1)]

    # 4. Filter
    domains = []
    for d in candidates:
        if any(d.lower().endswith(ext) for ext in file_extensions):
            continue
        if not any(ch.isalpha() for ch in d):  # must contain a letter
            continue
        if not any(ch.isalpha() for ch in d.split(".")[-1]):  # TLD must contain a letter
            continue
        domains.append(d)

    # 5. Deduplicate (keeping first-seen order) and return
    unique = list(dict.fromkeys(domains))
    return {"result": ",".join(unique), "domain_count": len(unique)}


print(main("rule.hx2.com9月份sdfsf.ask.shop/sdf.dfg.jpg可用性分析"))
# Output: {'result': 'rule.hx2.com,sdfsf.ask.shop', 'domain_count': 2}
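
To make the month-number problem concrete, here is a minimal sketch comparing the pattern above with a looser one. The `NAIVE` pattern and the sample strings are assumptions made up for illustration, not something from an earlier version of this code.

```python
import re

# Hypothetical naive pattern (an assumption for comparison):
# it lets the last label contain digits, so a trailing month number sticks to it.
NAIVE = re.compile(r"(?:[a-zA-Z0-9-]+\.)+[a-zA-Z0-9-]+")

# The same pattern used in main() above.
STRICT = re.compile(
    r"(?<![a-zA-Z0-9-])"     # must not be preceded by a domain character
    r"(?:[a-zA-Z0-9-]+\.)+"  # at least one "label." segment
    r"[a-zA-Z]{2,}"          # letters-only top-level domain
    r"(?![a-zA-Z.])"         # must not be followed by a letter or "."
)

samples = [
    "rule.hx2.com9月份",             # domain directly followed by a month number
    "https://www.example.com/path",  # full URL
    "123.com",                       # numeric domain
    "192.168.1.1",                   # IP address, no letters-only TLD
]

for text in samples:
    print(text, NAIVE.findall(text), STRICT.findall(text))

# For the first sample:
#   NAIVE  -> ['rule.hx2.com9']
#   STRICT -> ['rule.hx2.com']
```

Because the top-level domain is restricted to letters, the strict pattern stops before the `9`, and the trailing lookahead only rejects a following letter or ".", so the digit is allowed to sit right after the domain. The same restriction means a bare IP address such as `192.168.1.1` yields no match at all.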