Extracting Domain Names with Python
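While building an AI workflow, I needed to extract domain names from user input. This can be done either in Python or with the LLM's parameter extraction. To keep LLM token consumption as low as possible, I went with Python, which is both fast and cheap. User input comes in all shapes: some users enter a URL, some enter a domain containing digits, and the strangest case is a domain immediately followed by a numeric month (as in "rule.hx2.com9月份"), which used to be recognized incorrectly. Below is the final, improved version. If you find a bug, please leave a comment.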
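import re
from typing import Dict, Union


def main(arg1: str) -> Dict[str, Union[str, int]]:
    # 1. Domain regex. The trailing (?![a-zA-Z.]) forbids a letter or "."
    #    right after the match; digits, Chinese characters, and symbols are fine.
    domain_pattern = re.compile(
        r"(?<![a-zA-Z0-9-])"     # must not be preceded by a domain character
        r"(?:[a-zA-Z0-9-]+\.)+"  # at least one "label." segment
        r"[a-zA-Z]{2,}"          # top-level domain: 2 or more letters
        r"(?![a-zA-Z.])"         # must not be followed by a letter or "."
    )

    # 2. Blacklist of common file extensions
    file_extensions = {
        ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".pdf", ".doc", ".docx",
        ".xls", ".xlsx", ".ppt", ".pptx", ".txt", ".zip", ".rar", ".php",
        ".html", ".htm", ".css", ".js", ".xml", ".json", ".sql", ".log",
    }

    # 3. Collect all candidate domains
    candidates = [m.group(0) for m in domain_pattern.finditer(arg1)]

    # 4. Filter
    domains = []
    for d in candidates:
        if any(d.lower().endswith(ext) for ext in file_extensions):
            continue
        if not any(ch.isalpha() for ch in d):  # must contain a letter
            continue
        if not any(ch.isalpha() for ch in d.split(".")[-1]):  # TLD must contain a letter
            continue
        domains.append(d)

    # 5. Deduplicate (keeping first-seen order) and return
    unique = list(dict.fromkeys(domains))
    return {"result": ",".join(unique), "domain_count": len(unique)}


print(main("rule.hx2.com9月份sdfsf.ask.shop/sdf.dfg.jpg可用性分析"))
# rule.hx2.com is cut off before the month digit; sdf.dfg.jpg is dropped as a file name.
# Output: {'result': 'rule.hx2.com,sdfsf.ask.shop', 'domain_count': 2}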
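To see how the lookarounds behave on the kinds of input mentioned at the top (a URL, a domain with digits, a domain followed by a numeric month), here is a minimal sketch that runs only the regex step, before the file-extension filter. The helper name `find_candidates` and the sample strings are hypothetical, made up for illustration; only the pattern itself is taken from the code above.

```python
import re
from typing import List

# Same pattern as in main(), repeated so this snippet runs on its own.
DOMAIN_PATTERN = re.compile(
    r"(?<![a-zA-Z0-9-])"     # must not be preceded by a domain character
    r"(?:[a-zA-Z0-9-]+\.)+"  # at least one "label." segment
    r"[a-zA-Z]{2,}"          # top-level domain: 2 or more letters
    r"(?![a-zA-Z.])"         # must not be followed by a letter or "."
)


def find_candidates(text: str) -> List[str]:
    """Hypothetical helper: raw regex matches, with no extension filtering."""
    return [m.group(0) for m in DOMAIN_PATTERN.finditer(text)]


# Made-up sample inputs, one per case.
samples = [
    "https://www.example.com/path",  # URL: scheme and path are skipped
    "123.example.com",               # labels may contain digits
    "example.com9月份",               # digit right after the TLD ends the match
    "192.168.1.1",                   # no letters in the last label: no match
    "see example.com.",              # trailing "." blocks the match entirely
]

for s in samples:
    print(repr(s), "->", find_candidates(s))
```

The last sample points at a possible limitation rather than a confirmed bug report: because the lookahead rejects a following ".", a domain that ends a sentence with a period is not matched at all. Whether that matters depends on the input you expect.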