NLTK Library: Datasets 3 - Categorized and Tagged Corpora
1. Binary-classification corpora
These are mainly movie-related corpora dealing with sentiment (positive/negative, subjective/objective). There are two of them:
1.1 movie_reviews: IMDb movie reviews
IMDb (Internet Movie Database) is a widely used film database that provides ratings and user reviews for movies, TV series, and more.
- Size
2000 reviews labeled positive or negative, 1000 of each.
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt', ...]
[..., 'pos/cv992_11962.txt', 'pos/cv993_29737.txt', 'pos/cv994_12270.txt', 'pos/cv995_21821.txt', 'pos/cv996_11592.txt', 'pos/cv997_5046.txt', 'pos/cv998_14111.txt', 'pos/cv999_13106.txt']
- Labels
A binary classification problem: ['neg', 'pos']
- Review text
First negative review (neg/cv000_29416):
plot :
two teen couples go to a church party , drink and then drive .
they get into an accident .
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares .
what's the deal ?
watch the movie and " sorta " find out . . . critique :
a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package .
which is what makes this review an even harder one to write ,
since i generally applaud films which attempt to break the mold ,
mess with your head and such ( lost highway & memento ) ,
but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly .
they seem to have taken this pretty neat concept , but executed it terribly .
so what are the problems with the movie ?
well , its main problem is that it's simply too jumbled .
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea what's going on .
there are dreams , there are characters coming back from the dead , there are others who look like the dead ,
there are strange apparitions , there are disappearances , there are a looooot of chase scenes ,
there are tons of weird things that happen , and most of it is simply not explained .
now i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again ,
i get kind of fed up after a while , which is this film's biggest problem .
it's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes .
and do they make things entertaining , thrilling or even engaging , in the meantime ?
not really .
the sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point ,
so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining .
i guess the bottom line with movies like this is that you should always make sure that the audience is " into it " even before
they are given the secret password to enter your world of understanding .
i mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! !
okay , we get it . . . there
are people chasing her and we don't know who they are . do we really need to see it over and over again ?
how about giving us different scenes offering further insight into all of the strangeness going down in the movie ?
apparently , the studio took this film away from its director and chopped it up themselves , and it shows .
there might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess " the suits " decided that turning it into a music video with little edge ,
would make more sense .
the actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood .
but my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling .
overall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime ,
despite a pretty cool ending and explanation to all of the craziness that came before it .
oh , and by the way , this is not a horror or teen slasher flick . . . it's
just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids .
it also wrapped production two years ago and has been sitting on the shelves ever since .
whatever . . . skip
it ! where's joblo coming from ?
a nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 )
Second negative review (neg/cv001_19502):
damn that y2k bug .
it's got a head start in this movie starring jamie lee curtis and another baldwin brother ( william this time )
in a story regarding a crew of a tugboat that comes across a deserted russian tech ship that has a strangeness to it when they kick the power back on .
little do they know the power within . . .
going for the gore and bringing on a few action sequences here and there , virus still feels very empty , like a movie going for all flash and no substance .
we don't know why the crew was really out in the middle of nowhere , we don't know the origin of what took over the ship
( just that a big pink flashy thing hit the mir ) , and , of course , we don't know why donald sutherland is stumbling around drunkenly throughout .
here , it's just " hey , let's chase these people around with some robots " .
the acting is below average , even from the likes of curtis .
you're more likely to get a kick out of her work in halloween h20 .
sutherland is wasted and baldwin , well , he's acting like a baldwin , of course .
the real star here are stan winston's robot design , some schnazzy cgi , and the occasional good gore shot , like picking into someone's brain .
so , if robots and body parts really turn you on , here's your movie .
otherwise , it's pretty much a sunken ship of a movie .
The reviews vary widely in style and length.
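In this corpus the label of each review is encoded in the folder part of its file ID. A minimal sketch of recovering the labels and counting them follows. The four IDs are copied from the listing above, and `label_of` is a hypothetical helper, not part of NLTK; with the corpus downloaded, `movie_reviews.fileids()` would supply the full list instead.

```python
from collections import Counter

# A few file IDs copied from the listing above (stand-in for movie_reviews.fileids()).
file_ids = [
    "neg/cv000_29416.txt",
    "neg/cv001_19502.txt",
    "pos/cv998_14111.txt",
    "pos/cv999_13106.txt",
]


def label_of(file_id):
    """Return the category, i.e. the folder name in front of the first '/'."""
    return file_id.split("/", 1)[0]


# Count how many reviews fall under each label.
label_counts = Counter(label_of(file_id) for file_id in file_ids)
print(label_counts)
```

Run on the full list of file IDs, the same counting should report 1000 reviews per label, matching the size stated above.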
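1.2 subjectivity: movie plot summaries and reviews
A dataset for subjectivity analysis. The corpus consists of 5000 subjective sentences and 5000 objective sentences, and is designed for sentiment analysis and subjectivity classification tasks.
It comes from the paper "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts" by Bo Pang and Lillian Lee (ACL 2004).
- The 5000 subjective sentences come from movie reviews, which express the author's opinions, feelings, or judgments.
- The 5000 objective sentences come from movie plot summaries, which are usually factual descriptions without obvious personal sentiment or judgment.
Unlike movie_reviews, the samples are not split into folders; the sentences of a category are stored together in one text file.
Each sentence has been preprocessed so that words and punctuation marks are separated by spaces, and is parsed with WhitespaceTokenizer.
- First obj sample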
[‘the’, ‘movie’, ‘begins’, ‘in’, ‘the’, ‘past’, ‘where’, ‘a’, ‘young’, ‘boy’, ‘named’, ‘sam’, ‘attempts’, ‘to’, ‘save’, ‘celebi’, ‘from’, ‘a’, ‘hunter’, ‘.’]
[‘obj’, ‘subj’]
This uses the sents() method; raw() would instead return the text of all samples as one string.
- First subj sample
smart and alert , thirteen conversations about one thing is a small gem .
This uses the built-in string method join(), which concatenates the list elements into one string with a space between them: " ".join()
- Full code
from nltk.corpus import subjectivity
subj = subjectivity.sents(categories='subj')  # subjective sentences (fileids() and raw() are also available)
obj = subjectivity.sents(categories='obj')    # objective sentences
categories = subjectivity.categories()        # returns ['obj', 'subj']
print(len(subj), " ".join(subj[0]))
print(len(obj), obj[0])
print(categories)
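For a classifier, the tokenized sentences are usually paired with their category. The sketch below shows one way to do that with plain Python. The two sentences are the samples shown above, `to_labeled_examples` is a hypothetical helper, and the short lists stand in for `subjectivity.sents(categories=...)`.

```python
# Stand-ins for subjectivity.sents(categories='subj') and sents(categories='obj'),
# each holding only the first sample shown above.
subj_sents = [
    ["smart", "and", "alert", ",", "thirteen", "conversations", "about",
     "one", "thing", "is", "a", "small", "gem", "."],
]
obj_sents = [
    ["the", "movie", "begins", "in", "the", "past", "where", "a", "young",
     "boy", "named", "sam", "attempts", "to", "save", "celebi", "from",
     "a", "hunter", "."],
]


def to_labeled_examples(sents, label):
    """Join each token list into a string and pair it with its label."""
    return [(" ".join(tokens), label) for tokens in sents]


dataset = to_labeled_examples(subj_sents, "subj") + to_labeled_examples(obj_sents, "obj")
for text, label in dataset:
    print(label, "|", text)
```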
2. Multi-class corpora
The Reuters corpus (reuters), introduced in part 1, carries multiple news-topic labels:
[‘acq’, ‘alum’, ‘barley’, ‘bop’, ‘carcass’, ‘castor-oil’, ‘cocoa’, ‘coconut’, ‘coconut-oil’, ‘coffee’, ‘copper’, ‘copra-cake’, ‘corn’, ‘cotton’, ‘cotton-oil’, ‘cpi’, ‘cpu’, ‘crude’, ‘dfl’, ‘dlr’, ‘dmk’, ‘earn’, ‘fuel’, ‘gas’, ‘gnp’, ‘gold’, ‘grain’, ‘groundnut’, ‘groundnut-oil’, ‘heat’, ‘hog’, ‘housing’, ‘income’, ‘instal-debt’, ‘interest’, ‘ipi’, ‘iron-steel’, ‘jet’, ‘jobs’, ‘l-cattle’, ‘lead’, ‘lei’, ‘lin-oil’, ‘livestock’, ‘lumber’, ‘meal-feed’, ‘money-fx’, ‘money-supply’, ‘naphtha’, ‘nat-gas’, ‘nickel’, ‘nkr’, ‘nzdlr’, ‘oat’, ‘oilseed’, ‘orange’, ‘palladium’, ‘palm-oil’, ‘palmkernel’, ‘pet-chem’, ‘platinum’, ‘potato’, ‘propane’, ‘rand’, ‘rape-oil’, ‘rapeseed’, ‘reserves’, ‘retail’, ‘rice’, ‘rubber’, ‘rye’, ‘ship’, ‘silver’, ‘sorghum’, ‘soy-meal’, ‘soy-oil’, ‘soybean’, ‘strategic-metal’, ‘sugar’, ‘sun-meal’, ‘sun-oil’, ‘sunseed’, ‘tea’, ‘tin’, ‘trade’, ‘veg-oil’, ‘wheat’, ‘wpi’, ‘yen’, ‘zinc’]
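There are 90 topic labels, covering finance, commodities, trade, markets, currencies, and related areas.
Each news article can carry more than one label.
The code that prints the labels: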
from nltk.corpus import reuters
categories = reuters.categories()  # all category labels
print(categories)
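Because one article can carry several labels, a multi-label classifier usually needs each article's label set turned into a fixed-length 0/1 vector. A minimal sketch, assuming a small label vocabulary taken from the list above; the article-to-label assignments are invented for illustration, and `multi_hot` is a hypothetical helper.

```python
# A small subset of the Reuters labels listed above, in sorted order.
all_labels = ["corn", "crude", "grain", "trade", "wheat"]

# Invented label sets for three hypothetical articles (illustration only).
doc_labels = {
    "doc_a": ["grain", "wheat"],
    "doc_b": ["crude"],
    "doc_c": ["corn", "grain", "trade"],
}


def multi_hot(labels, vocabulary):
    """Return a 0/1 vector with a 1 at the position of every label present."""
    index = {label: i for i, label in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for label in labels:
        vector[index[label]] = 1
    return vector


encoded = {doc: multi_hot(labels, all_labels) for doc, labels in doc_labels.items()}
for doc, vector in encoded.items():
    print(doc, vector)
```

With the real corpus, the label list would come from `reuters.categories()` and the per-article labels from the corpus reader rather than from a hand-written dictionary.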
The other multi-class corpus is product_reviews_1 (product reviews):
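2.1 product_reviews_1: product overview
- Apex_AD2600_Progressive_scan_DVD player.txt
  - Progressive-scan DVD player: supports 480p high-resolution output, suitable for HDTV or HD-ready TVs.
  - Multi-format support: plays DVD, MP3 CD, WMA CD, and JPEG/Kodak picture CDs (for slideshows), with partial DVD-R support.
  - Full-screen fit feature (AFF): adapts 16:9 widescreen video to a 4:3 TV screen.
  - Price: about 39.99-69.99 USD in 2003-2004 (after discounts); a budget model.
  - Review overview:
    - Positive: users praise its multi-format playback and value for money, e.g. "plays almost any disc you put in".
    - Negative: poor remote control (sluggish, not universal), some DVDs (such as Disney movies) fail to play, poor durability (some units failed within months).
    - Sentiment keywords: "picture quality", "cheap", "remote doesn't work"
  - Summary: well received for its low price and format support, but criticized for reliability and remote-control problems.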
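- Canon_G3.txt
  - Digital camera: 4-megapixel sensor, suited to early-2000s photography needs.
  - Lens: high-quality Canon lens with optical zoom (about 4x).
  - Features: manual controls and RAW shooting in a compact body, aimed at advanced users, enthusiasts, and semi-professionals.
  - Review overview:
    - Positive: high image quality, flexible manual controls, versatile, e.g. "photos are sharp and the colors vivid".
    - Negative: steep learning curve for the manual settings; somewhat bulkier than point-and-shoot cameras.
    - Sentiment keywords: "great pictures", "slow focus", "battery life is bad"
  - Summary: well received for its professional features and compact design; suited to users who want high-quality photography.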
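- Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB.txt
  - Portable audio device (MP3 player), released in 2003 and aimed at demanding music listeners.
  - Storage: 40 GB hard drive, holding about 10,000 songs.
  - Audio: supports MP3, WMA, and other formats, with excellent sound quality.
  - Features: replaceable battery, large display, fast USB 2.0 transfer.
  - Review overview:
    - Positive: capacity and sound quality are praised, e.g. "stores lots of songs, and the sound is clear".
    - Negative: slow hard-drive start-up, complicated interface, occasional freezes, short battery life.
    - Sentiment keywords: "sound is awesome", "software sucks", "large capacity"
  - Summary: favored for its large capacity and sound quality, but criticized for complexity of use and reliability problems.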
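- Nikon_coolpix_4300.txt
  - Digital camera: 4-megapixel resolution, suited to everyday photography.
  - Lens: Nikon optical zoom lens (about 3x optical zoom).
  - Features: automatic and manual modes, compact design, easy to carry.
  - Target users: home users and photography beginners.
  - Review overview:
    - Positive: easy to use, good image quality, portable, e.g. "the camera is small and the photos look good".
    - Negative: weak low-light performance, fast battery drain, lack of advanced manual features.
    - Sentiment keywords: "excellent pictures", "a bit heavy", "menus are confusing"
  - Summary: popular with home users for its portability and ease of use, but only average in low light.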
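- Nokia_6610.txt
  - Feature phone (not a smartphone) with a color screen (128x128 pixels); a best-selling model with a mainstream design around 2003.
  - Features: SMS, MMS, FM radio, GPRS, Java games.
  - Design: classic candy-bar form, clean look, durable, built-in antenna.
  - Review overview:
    - Positive: stable signal, long battery life, durable build, e.g. "the battery lasts for days".
    - Negative: small screen, fairly basic feature set (e.g. no Bluetooth), mediocre keypad feel.
    - Sentiment keywords: "great signal", "buttons are too small", "classic Nokia build"
  - Summary: praised for durability and battery life, but the feature set is fairly simple; suited to basic communication needs.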
2.2 product_reviews_1: product labels and reviews
Reviews for 5 products
2.2.1 Apex_AD2600_Progressive_scan_DVD player.txt
- Review sentences: 740
- Labels
[‘1220’, ‘1600’, ‘aff’, ‘amazon’, ‘apex’, ‘audio’, ‘audio output’, ‘auto fit’, ‘build quality’, ‘button’, ‘case’, ‘cd’, ‘cd audio disc’, ‘code’, ‘color’, ‘color signal’, ‘customer service’, ‘customer support’, ‘design’, ‘different file’, ‘direction’, ‘disc’, ‘disk’, ‘disney movie’, ‘display’, ‘divx rip’, ‘door’, ‘dvd’, ‘dvd disc’, ‘dvd media’, ‘dvd player’, ‘external display’, ‘feature’, ‘finish’, ‘format’, ‘forward’, ‘freeze’, ‘freezing’, ‘heat’, ‘jpeg’, ‘jpeg picture’, ‘jpeg slideshow’, ‘layer dvd’, ‘line support’, ‘loading’, ‘look’, ‘machine’, ‘manual’, ‘media’, ‘menu’, ‘motor’, ‘mp3’, ‘mp3 filename’, ‘mpeg’, ‘mpeg1’, ‘no disc’, ‘noise’, ‘off button’, ‘onscreen display’, ‘output’, ‘p button’, ‘panel’, ‘panel button layout’, ‘picture’, ‘picture clarity’, ‘picture quality’, ‘play’, ‘player’, ‘power supply’, ‘price’, ‘product’, ‘progressive scan’, ‘progressive scan player’, ‘quality’, ‘r’, ‘read’, ‘recognize’, ‘reliability’, ‘remote’, ‘remote button’, ‘remote control’, ‘remote layout’, ‘rewind’, ‘run’, ‘screen’, ‘screw tip’, ‘service’, ‘set up’, ‘shipping’, ‘silver plate’, ‘size’, ‘smell’, ‘sound’, ‘speed’, ‘support’, ‘svcd’, ‘sync’, ‘tech support’, ‘technical support’, ‘unit’, ‘universal remote control’, ‘usage’, ‘use’, ‘user interface’, ‘vbr mp3 cd’, ‘vcd’, ‘video’, ‘video format’, ‘video output’, ‘video quality’, ‘weight’, ‘windows media’, ‘work’, ‘zoom’, ‘zoom mode’]
- Samples
1 repost from january 13 , 2004 with a better fit title .
2 does your apex dvd player only play dvd audio without video ?
3 or does it play audio and video but scrolling in black and white ?
4 before you try to return the player or waste hours calling apex tech support , or run the player over with your car ,
try these simple troubleshooting ideas first .
5 no picture :
...
734 however , i do n ' t know the dvd ' s performance on a heavy load of every - day viewing .
735 either way , can ' t go wrong with this price .
736 i am really impressed by this dvd player .
737 if it can fit in the drive bay , this dvd player will play it .
738 for instance , i made several back - ups of my dvd movies using dvd - r ( w ) and + r ( w ) and it plays the dvds .
739 no matter the format .
740 awesome !
2.2.2 Canon_G3.txt
- Review sentences: 597
- Labels
[‘4mp’, ‘4mp camera’, ‘4mp resolution’, ‘auto mode’, ‘auto setting’, ‘automode’, ‘battery’, ‘battery charging system’, ‘battery life’, ‘body’, ‘button’, ‘camera’, ‘canera’, ‘canon’, ‘canon g3’, ‘canon powershot g3’, ‘casing’, ‘color’, ‘compactflash’, ‘control’, ‘darn diopter adjustment dial’, ‘delay’, ‘depth’, ‘design’, ‘dial’, ‘digital camera’, ‘digital zoom’, ‘display’, ‘distortion’, ‘download’, ‘exposure control’, ‘external flash hot shoe’, ‘feature’, ‘feel’, ‘finish’, ‘flash’, ‘flash photo’, ‘focus’, ‘four megapixel’, ‘function’, ‘g3’, ‘grain’, ‘highlight’, ‘hot shoe flash’, ‘image’, ‘image quality’, ‘import’, ‘lag’, ‘lag time’, ‘lcd’, ‘learning’, ‘learning curve’, ‘lens’, ‘lens cap’, ‘lens cover’, ‘lense’, ‘lever’, ‘light auto correction’, ‘look’, ‘low light focus’, ‘macro’, ‘made’, ‘manual’, ‘manual function’, ‘manual mode’, ‘memory card’, ‘menu’, ‘metering option’, ‘night mode’, ‘noise’, ‘off button’, ‘optic’, ‘optical zoom’, ‘option’, ‘performance’, ‘photo’, ‘photo quality’, ‘picture’, ‘picture quality’, ‘price’, ‘print’, ‘product’, ‘quality’, ‘raw format’, ‘raw image’, ‘remote’, ‘service’, ‘shape’, ‘shoot’, ‘shot’, ‘size’, ‘software’, ‘speed’, ‘spot metering’, ‘stitch picture’, ‘strap’, ‘tiff format’, ‘unresponsiveness’, ‘use’, ‘viewfinder’, ‘weight’, ‘white balance’, ‘white offset’, ‘zoom’, ‘zooming lever’]
- Samples
1 i recently purchased the canon powershot g3 and am extremely satisfied with the purchase .
2 the camera is very easy to use , in fact on a recent trip this past week i was asked to take a picture of a vacationing elderly group .
3 after i took their picture with their camera , they offered to take a picture of us .
4 i just told them , press halfway , wait for the box to turn green and press the rest of the way .
...
593 even with these shortcomings , i still think it is the best digital camera available under $ 1200 .
594 definetely a great camera .
595 proven canon built quality and lens .
596 feels solid in hand .
597 rather heavy for point and shoot but a great camera for semi pros .
2.2.3 Creative_Labs_Nomad_Jukebox_Zen_Xtra_40GB.txt
- Review sentences: 1716
- Labels
[‘0’, ‘accessing file’, ‘accessory’, ‘affordability’, ‘alarm’, ‘appearance’, ‘audio’, ‘backlight’, ‘balance’, ‘battery’, ‘battery life’, ‘battery top’, ‘bookmakr’, ‘bookmark’, ‘break’, ‘buck’, ‘build’, ‘button’, ‘capacity’, ‘case’, ‘cd burner’, ‘cd rip’, ‘change’, ‘chinese name’, ‘click buttons’, ‘clip’, ‘clock’, ‘color’, ‘construction’, ‘control’, ‘cover’, ‘creative’, ‘creative product’, ‘customer support’, ‘customer support website’, ‘deal’, ‘delete’, ‘design’, ‘display’, ‘durability’, ‘earbud’, ‘earphone’, ‘eax’, ‘eax mode’, ‘enviromental audio’, ‘equalizer’, ‘equilizer’, ‘equipment’, ‘explorer’, ‘face plate’, ‘feature’, ‘feel’, ‘file limit’, ‘file transfer’, ‘finding’, ‘firewire’, ‘firmware’, ‘flip switch’, ‘fly wheel’, ‘flywheel’, ‘fm’, ‘fm receiver’, ‘folder’, ‘folder structure’, ‘freeze’, ‘freeze up’, ‘front cover’, ‘game’, ‘hard drive’, ‘headphone’, ‘headphone jack’, ‘id3’, ‘id3 tag’, ‘installation’, ‘instruction’, ‘interface’, ‘itunes’, ‘jog dial’, ‘lcd’, ‘leather case’, ‘leather pouch’, ‘line out jack’, ‘load’, ‘lock up’, ‘look’, ‘looking’, ‘manage’, ‘manual’, ‘mediasource’, ‘memory’, ‘menu’, ‘menue’, ‘mp3 player’, ‘music’, ‘musicmatch software’, ‘name’, ‘napster’, ‘navigation’, ‘navigation wheel’, ‘navigational system’, ‘nomad’, ‘nomad explorer’, ‘notmad’, ‘notmad software’, ‘online help’, ‘online music service’, ‘operate’, ‘option’, ‘panel’, ‘pause’, ‘pc compatibility’, ‘play’, ‘play mode’, ‘play option’, ‘playback quality’, ‘player’, ‘player hardware’, ‘playlist’, ‘plug and play’, ‘power output’, ‘price’, ‘product’, ‘program’, ‘quality’, ‘recharger’, ‘recognition’, ‘recording’, ‘remote’, ‘remove’, ‘rename’, ‘replacement battery’, ‘rip’, ‘rip cd’, ‘screen’, ‘screen saver’, ‘scroll’, ‘scroll wheel’, ‘set up’, ‘setup’, ‘shuffle’, ‘shuttle’, ‘signal to noise ratio’, ‘size’, ‘software’, ‘song speed’, ‘sorting’, ‘sound’, ‘sound option’, ‘sound quality’, ‘sound setting’, ‘stop button’, ‘stoppage’, ‘storage’, ‘storage capacity’, ‘style’, ‘support’, ‘switch’, 
‘sync’, ‘tag’, ‘the unit’, ‘thing’, ‘things’, ‘this’, ‘this item’, ‘this thing’, ‘top’, ‘transfer’, ‘transfter’, ‘unit’, ‘up face’, ‘uploading’, ‘usb recharge’, ‘use’, ‘user interface’, ‘value’, ‘voice recording’, ‘volume’, ‘volume range’, ‘wake up’, ‘warranty’, ‘weight’, ‘wheel’, ‘wma file’, ‘work’, ‘xtra’, ‘zen’, ‘zen xtra’, ‘zx’]
- Samples
1 this is an edited review , now that i have had time to use the device .
2 while , there are flaws with the machine , the xtra gets five stars because of its affordability .
3 it is the most bang - for - the - buck out there .
4 like it ' s predecessor , the quickly revised nx , this player boasts a decent size and weight ,
a relatively - intuitive navigational system that categorizes based on id3 tags , and excellent sound
( widely known to be better than ipod - not surprising considering the number of years creative has been in the audio peripheral business ) .
5 the xtra improves upon the zen nx with a larger , now - blue backlit screen , which is infinitely better .
6 further , the xtra doubles the maximum filecount capacity to 16000 mp3 .
...
1712 in that model the hard drive just died one morning before my class .
1713 it ' s nothing major , just a bad hard drive , any hard drive mp3 player can have that problem .
1714 so rule of thumb , no matter what you end up buying , get the extended warranty !
1715 it always pays off .
1716 hope i ' ve been of some help .
2.2.4 Nikon_coolpix_4300.txt
- Review sentences: 346
- Labels
[‘4mp’, ‘8mb’, ‘8mb card’, ‘accessory’, ‘audio’, ‘auto focus’, ‘auto mode’, ‘auto setting’, ‘autofocus’, ‘battery’, ‘battery life’, ‘camera’, ‘closeup mode’, ‘construction’, ‘continuous shot mode’, ‘control’, ‘customer service’, ‘delay’, ‘design’, ‘digital zoom’, ‘download’, ‘ease of use’, ‘feature’, ‘firewire’, ‘focus assist light’, ‘function’, ‘image’, ‘image download’, ‘indoor image’, ‘indoor picture’, ‘indoor shot’, ‘lcd’, ‘learn’, ‘lens cap’, ‘lense cap’, ‘macro’, ‘macro mode’, ‘manual’, ‘manual mode’, ‘memory card’, ‘menu’, ‘menu dial knob’, ‘movie’, ‘movie mode’, ‘nikon’, ‘nikon 4300’, ‘nikon support’, ‘online service’, ‘optic’, ‘optical setting’, ‘optical zoom’, ‘photo’, ‘photo quality’, ‘picture’, ‘picture quality’, ‘price’, ‘print’, ‘print quality’, ‘quality’, ‘rechargable battery’, ‘redeye’, ‘scene mode’, ‘servicing’, ‘size’, ‘software’, ‘sunset feature’, ‘system error’, ‘touchup’, ‘transfer’, ‘txt file’, ‘up shooting’, ‘use’, ‘viewfinder’, ‘weight’, ‘zoomed image’]
- Samples
1 this camera is perfect for an enthusiastic amateur photographer .
2 the pictures are razor - sharp , even in macro .
3 it is small enough to fit easily in a coat pocket or purse .
4 it is light enough to carry around all day without bother .
...
345 the same 4mp chip from the 4500 camera , plus a 3x zoom with the ability to expand upon that with extenders ,
great closeup mode , long lasting rechargable battery , etc etc .
346 in my opinion it ' s the best camera for the money if you ' re looking for something that ' s easy to use ,
small good for travel , and provides excellent , sharp images .
2.2.5 Nokia_6610.txt
- Review sentences: 547
- Labels
[‘application’, ‘background’, ‘backlight’, ‘battery’, ‘battery life’, ‘bluetooth’, ‘browsing’, ‘button’, ‘calendar’, ‘call’, ‘camera’, ‘color’, ‘color screen’, ‘command’, ‘construction’, ‘csr’, ‘customer rep’, ‘customer service’, ‘default ringtone’, ‘design’, ‘durability’, ‘ear’, ‘earpiece’, ‘ergonomics’, ‘feature’, ‘fm’, ‘fm radio’, ‘game’, ‘gprs’, ‘gsm’, ‘headphone jack’, ‘headset’, ‘headset jack’, ‘high speed internet’, ‘infrared’, ‘internet’, ‘key’, ‘key lock’, ‘keypad’, ‘layout’, ‘look’, ‘loud phone’, ‘memory’, ‘menu’, ‘menu option’, ‘menu options’, ‘message’, ‘mms’, ‘mobile’, ‘mobile reception’, ‘mobile service’, ‘network’, ‘nokia’, ‘operate’, ‘pc cable’, ‘pc suite’, ‘pc sync’, ‘phone’, ‘phone book’, ‘phone performance’, ‘picture’, ‘picture sharing’, ‘pim’, ‘plan’, ‘quality’, ‘radio’, ‘rate plan’, ‘reception’, ‘resolution’, ‘ring’, ‘ring tone’, ‘ringer’, ‘ringing tone’, ‘ringtone’, ‘screen’, ‘screensaver’, ‘service’, ‘signal’, ‘signal quality’, ‘size’, ‘software’, ‘sound’, ‘sound quality’, ‘sound volume’, ‘speaker’, ‘speaker phone’, ‘speakerphone’, ‘sprint’, ‘sprint customer service’, ‘sprint plan’, ‘sturdy’, ‘t customer service’, ‘tone’, ‘tune’, ‘use’, ‘user interface’, ‘vibrate setting’, ‘vibration’, ‘voice’, ‘voice dialing’, ‘voice quality’, ‘volume’, ‘volume control’, ‘volume key’, ‘wallpaper’, ‘warranty’, ‘web’, ‘weight’, ‘wireless telephone’, ‘work’, ‘zone’]
- Samples
1 i am a business user who heavily depend on mobile service .
2 there is much which has been said in other reviews about the features of this phone ,
it is a great phone , mine worked without any problems right out of the box .
3 just double check with customer service to ensure the number provided by amazon is for the city / exchange you wanted .
4 after several years of torture in the hands of at & t customer service i am delighted to drop them ,
and look forward to august 2004 when i will convert our other 3 family - phones from at & t to t - mobile !
...
544 it is crystal clear .
545 this is one of the nicest phones nokia has made .
546 i do recommend getting the data kit for those geeks .
547 there are a lot of cool websites with games and midi ringtones to download for free .
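The labels listed above are product features that the annotators attached to individual sentences. As far as I recall from the NLTK reader's documentation, the underlying files mark each sentence as `feature[+n]` or `feature[-n]` (n is the opinion strength), followed by `##` and the sentence text. The sketch below parses one line in that style with plain Python. Treat the sample line, its score, and the helper `parse_annotated_line` as assumptions for illustration rather than as the library's own implementation.

```python
import re

# One or more words followed by a signed strength such as [+3] or [-2].
FEATURE_RE = re.compile(r"((?:\w+\s)*\w+)\[([+-]\d)\]")


def parse_annotated_line(line):
    """Split 'feature[+n], ...##sentence' into ([(feature, score)], sentence)."""
    annotations, separator, sentence = line.partition("##")
    if not separator:
        # No '##' marker (e.g. a title line): return the text unchanged.
        return [], line.strip()
    features = [(name, int(score)) for name, score in FEATURE_RE.findall(annotations)]
    return features, sentence.strip()


# Sample line written in the assumed annotation style (score is illustrative).
sample = ("canon powershot g3[+3]##i recently purchased the canon powershot g3 "
          "and am extremely satisfied with the purchase .")
features, sentence = parse_annotated_line(sample)
print(features)
print(sentence)
```

With NLTK itself, the corpus reader exposes this information directly, so a hand-written parser like this is only useful for understanding the format.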
3. Syntactically annotated corpora
Corpora with part-of-speech annotations, suitable for training and testing POS taggers.
3.1 Brown
The Brown Corpus (whose first category is news) carries this kind of annotation. A simplified excerpt of its tagset:
Tag | Meaning |
---|---|
AT | Article |
NN | Noun |
JJ | Adjective |
VBD | Verb, past tense |
NP-TL | Proper noun in a title |
NN-TL | Noun in a title |
- Test code
from nltk.corpus import brown
tagged_sent = brown.tagged_sents()[0]  # first tagged sentence (Brown tagset)
print(tagged_sent[:10])
- Tagged output
[(‘The’, ‘AT’), (‘Fulton’, ‘NP-TL’), (‘County’, ‘NN-TL’), (‘Grand’, ‘JJ-TL’), (‘Jury’, ‘NN-TL’), (‘said’, ‘VBD’), (‘Friday’, ‘NR’), (‘an’, ‘AT’), (‘investigation’, ‘NN’), (‘of’, ‘IN’)]
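3.2 treebank: Penn Treebank (syntactic parsing)
The Treebank corpus is based mainly on Wall Street Journal (WSJ) articles. Files are named wsj_XXXX.mrg; the sample shipped with NLTK contains 199 files (wsj_0001.mrg to wsj_0199.mrg).
- Code to list the file names
from nltk.corpus import treebank
file_ids = treebank.fileids()
print(file_ids)
Output file names: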
[‘wsj_0001.mrg’, ‘wsj_0002.mrg’, ‘wsj_0003.mrg’, ‘wsj_0004.mrg’, ‘wsj_0005.mrg’, …, ‘wsj_0199.mrg’]
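- The Treebank corpus contains:
  - POS-tagged text: words with their POS tags.
  - Parsed sentences: the syntactic structure of each sentence, represented as a tree.
  - Raw text: unannotated sentences.
Each file contains annotations for several sentences. The full Penn Treebank WSJ section has about 1 million words; the sample included in NLTK is roughly a tenth of that (about 100,000 tokens). Parse trees can be nested, so processing them requires familiarity with recursion or tree traversal.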
3.2.1 POS tagging
NLTK uses the Penn Treebank POS tagset by default.
In Brown, articles are tagged AT; in the Penn Treebank they are grouped under DT (Determiner).
Tag | Meaning | Examples |
---|---|---|
CC | Coordinating conjunction | and, but, or, yet |
CD | Cardinal number | one, two, 1999 |
DT | Determiner | the, a, an |
EX | Existential there | there (There is …) |
FW | Foreign word | c'est, etc., esprit |
IN | Preposition or subordinating conjunction | in, of, like, although |
JJ | Adjective | big, beautiful, green |
JJR | Adjective, comparative | bigger, smaller |
JJS | Adjective, superlative | biggest, smallest |
LS | List item marker | 1), a), B. |
MD | Modal | can, will, should |
NN | Noun, singular | dog, year |
NNS | Noun, plural | dogs, years |
NNP | Proper noun, singular | John, London |
NNPS | Proper noun, plural | Smiths, Americans |
PDT | Predeterminer | all, both |
POS | Possessive ending | 's, ' |
PRP | Personal pronoun | I, you, he, she, it |
PRP$ | Possessive pronoun | my, your, his, her |
RB | Adverb | quickly, very, well |
RBR | Adverb, comparative | better, faster |
RBS | Adverb, superlative | best, fastest |
RP | Particle | up, off, out |
SYM | Symbol | $, %, + |
TO | to | to |
UH | Interjection | oh, wow, oops |
VB | Verb, base form | run, eat, be |
VBD | Verb, past tense | ran, ate, was |
VBG | Verb, gerund/present participle | running, being |
VBN | Verb, past participle | eaten, been |
VBP | Verb, non-3rd person singular present | run, eat (with subjects other than he/she/it) |
VBZ | Verb, 3rd person singular present | runs, eats, is |
WDT | Wh-determiner (precedes and modifies a noun, adjective-like use) | which, that |
WP | Wh-pronoun (stands for a person or thing, used directly as subject or object) | who, what |
WP$ | Possessive wh-pronoun (expresses possession and modifies a noun) | whose |
WRB | Wh-adverb (asks about place, time, reason, or manner) | where, when, why |
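Punctuation is mostly tagged with the symbol itself:
Symbol | POS tag | Notes |
---|---|---|
. | . | Period |
, | , | Comma |
: | : | Colon, semicolon, dash |
'' | '' | Closing quotation mark |
( ) | -LRB-, -RRB- | Left and right parentheses |
... | : or … (rare) | Ellipsis; it may appear as different tokens |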
- Test code
from nltk.corpus import treebank
tagged_words = treebank.tagged_words()  # all words with their POS tags
print(tagged_words[:10])                # first 10 tagged words
tagged_words_file = treebank.tagged_words(fileids='wsj_0002.mrg')  # tagged words of one file
print(tagged_words_file[:10])
[(‘Pierre’, ‘NNP’), (‘Vinken’, ‘NNP’), (‘,’, ‘,’), (‘61’, ‘CD’), (‘years’, ‘NNS’), (‘old’, ‘JJ’), (‘,’, ‘,’), (‘will’, ‘MD’), (‘join’, ‘VB’), (‘the’, ‘DT’)]
[(‘Rudolph’, ‘NNP’), (‘Agnew’, ‘NNP’), (‘,’, ‘,’), (‘55’, ‘CD’), (‘years’, ‘NNS’), (‘old’, ‘JJ’), (‘and’, ‘CC’), (‘former’, ‘JJ’), (‘chairman’, ‘NN’), (‘of’, ‘IN’)]
3.2.2 Parse trees
The parse trees are the core of Treebank and are stored as nltk.tree.Tree objects.
# all parse trees
parsed_sents = treebank.parsed_sents()
# print the first parse tree
print(parsed_sents[0])
# or draw it as a tree (Tree.draw() opens a GUI window and relies on tkinter)
parsed_sents[0].draw()
# parse trees of one file
parsed_sents_file = treebank.parsed_sents(fileids='wsj_0001.mrg')
print(parsed_sents_file[0])
# recover the plain sentence (word sequence)
tree = parsed_sents[0]
words = tree.leaves()
sentence = ' '.join(words)
print(sentence)
- The sentence without annotation:
‘Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .’
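- The tree as drawn in the GUI window: (figure not included)
- The tree as printed on the command line:
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
- Structure breakdown
1. (S …): the whole sentence (Sentence)
2. (NP-SBJ …): the subject noun phrase (Subject Noun Phrase)
  2.1 (NNP Pierre) (NNP Vinken): a personal name, proper nouns
  2.2 The subject contains a parenthetical; the commas mark its boundaries
  2.3 ADJP: the adjective phrase "61 years old"
  2.4 (CD 61) (NNS years): numeral + noun, forming a noun phrase
  2.5 (JJ old): the adjective "old"
3. (VP …): the predicate (Verb Phrase)
  3.1 (MD will): the modal verb "will"
  3.2 (VB join): the base-form verb "join"
  3.3 (NP the board): the direct object
  3.4 (PP-CLR as a nonexecutive director): a "closely related" prepositional phrase (CLR), here expressing the role
  3.5 (NP-TMP Nov. 29): a temporal noun phrase (TMP) serving as a time adverbial
4. (. .): the sentence-final period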
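To make the recursion behind the bracketed notation concrete, here is a small pure-Python sketch that parses the string into nested `(label, children)` tuples and collects the leaves. It is a simplified stand-in for what `nltk.tree.Tree` provides, not NLTK's implementation; `parse_bracketed` and `leaves` are hypothetical names, and the input is the tree printed above.

```python
import re

# Tokens are "(", ")", or any run of characters without whitespace/parentheses.
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")

TREE_TEXT = (
    "(S(NP-SBJ(NP (NNP Pierre) (NNP Vinken))(, ,)"
    "(ADJP (NP (CD 61) (NNS years)) (JJ old))(, ,))"
    "(VP(MD will)(VP(VB join)(NP (DT the) (NN board))"
    "(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))"
    "(NP-TMP (NNP Nov.) (CD 29))))(. .))"
)


def parse_bracketed(text):
    """Parse '(LABEL child ...)' into (label, [children]); leaves stay strings."""
    tokens = TOKEN_RE.findall(text)
    position = 0

    def parse_node():
        nonlocal position
        position += 1                      # skip "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(parse_node())   # nested subtree: recurse
            else:
                children.append(tokens[position])  # a word (leaf)
                position += 1
        position += 1                      # skip ")"
        return (label, children)

    return parse_node()


def leaves(node):
    """Collect the words of a tree from left to right by recursive traversal."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


tree = parse_bracketed(TREE_TEXT)
print(tree[0])                 # root label
print(" ".join(leaves(tree)))  # the plain sentence
```

The same two operations correspond to building a tree from its string form and calling `leaves()` on an NLTK tree, as in the code earlier in this section.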
3.2.3 Raw text
- Code to get the sentences
sents = treebank.sents()  # all sentences
or
sents_file = treebank.sents(fileids='wsj_0001.mrg')  # sentences of one file
Output
[‘Pierre’, ‘Vinken’, ‘,’, ‘61’, ‘years’, ‘old’, ‘,’, ‘will’, ‘join’, ‘the’, ‘board’, ‘as’, ‘a’, ‘nonexecutive’, ‘director’, ‘Nov.’, ‘29’, ‘.’]
- Code to count POS tag frequencies over all sentences
from collections import Counter
from nltk.corpus import treebank
tags = [tag for word, tag in treebank.tagged_words()]  # POS tags of all words
tag_freq = Counter(tags)
print(tag_freq.most_common(10))  # the 10 most common tags
Output
[(‘NN’, 13166), (‘IN’, 9857), (‘NNP’, 9410), (‘DT’, 8165), (‘-NONE-’, 6592),
(‘NNS’, 6047), (‘JJ’, 5834), (‘,’, 4886), (‘.’, 3874), (‘CD’, 3546)]
That is, singular nouns (NN) are the most frequent, followed by prepositions and subordinating conjunctions (IN).
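The `-NONE-` entry in this output is not a real part of speech: the Penn Treebank uses it for empty elements (traces) that have no surface word. A minimal sketch of excluding such tokens before counting follows; the short tagged list is invented for illustration and stands in for `treebank.tagged_words()`, and `tag_frequencies` is a hypothetical helper.

```python
from collections import Counter

# Invented sample in the (word, tag) format of treebank.tagged_words();
# "*-1" and "*T*-2" imitate trace tokens tagged -NONE-.
tagged_words = [
    ("the", "DT"), ("board", "NN"), ("*-1", "-NONE-"), ("to", "TO"),
    ("join", "VB"), ("the", "DT"), ("company", "NN"), ("*T*-2", "-NONE-"),
    (".", "."),
]


def tag_frequencies(tagged, skip=("-NONE-",)):
    """Count POS tags, ignoring the tags listed in `skip`."""
    return Counter(tag for _word, tag in tagged if tag not in skip)


tag_freq = tag_frequencies(tagged_words)
print(tag_freq.most_common(3))
```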
3.3 CoNLL2000
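This dataset comes from the CoNLL-2000 shared task (CoNLL: Computational Natural Language Learning) and targets shallow parsing, i.e. chunking (identifying phrase chunks). It can be used with shallow chunkers of all kinds, including rule-based ones and machine-learning models such as CRF, MaxEnt, or BiLSTM-CRF.
The data is converted from the Wall Street Journal (WSJ) portion of the Penn Treebank.
- Annotated with chunk tags (syntactic chunks)
- Each sentence is an nltk.Tree object
Chunk tags mark the phrase structure of a sentence. Each child node is a subtree representing a chunk (such as NP or VP). Compared with treebank, CoNLL-2000 does not annotate full clause structure; it only marks flat phrase chunks.
As a result, subordinating conjunctions tagged IN (such as if, when, that), coordinating conjunctions tagged CC (such as and, but), adverbs tagged RB, and similar words often neither form a PP (prepositional phrase) of their own nor belong to an NP or VP; they are then listed on their own, outside any chunk.
3.3.1 Triples
Each triple has three parts: the word, the POS tag, and the chunk tag.
- POS tags (Part-of-Speech Tags)
  - Definition: a POS tag gives the grammatical category of a word, such as noun, verb, or adjective.
  - Tagset: in the CoNLL-2000 corpus, POS tags follow the Penn Treebank tagset, for example:
    - NN: noun (singular)
    - IN: preposition
    - DT: determiner
    - VBZ: verb (3rd person singular)
    - RB: adverb
    - VBN: verb (past participle)
    - TO: to (as preposition or infinitive marker)
    - VB: verb (base form)
  - Purpose: POS tags describe the syntactic role of a word, independent of sentence structure.
- Chunk tags
  - Definition: a chunk tag indicates which phrase chunk a word belongs to, such as a noun phrase (NP), verb phrase (VP), or prepositional phrase (PP). Chunk boundaries are marked in IOB format.
  - IOB format:
    - B-XXX: beginning of a chunk, where XXX is the phrase type (such as NP, VP, PP).
    - I-XXX: inside a chunk, i.e. a word that continues the phrase.
    - O: the word does not belong to any chunk.
- Common chunk types
  - NP: noun phrase, e.g. "the pound".
  - VP: verb phrase, e.g. "is widely expected".
  - PP: prepositional phrase, e.g. "in" (in this flat scheme the PP chunk covers only the preposition).
Example: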
[('Confidence', 'NN', 'B-NP'), ('in', 'IN', 'B-PP'), ('the', 'DT', 'B-NP'),
('pound', 'NN', 'I-NP'), ('is', 'VBZ', 'B-VP'), ('widely', 'RB', 'I-VP'),
('expected', 'VBN', 'I-VP'), ('to', 'TO', 'I-VP'), ('take', 'VB', 'I-VP'),
('another', 'DT', 'B-NP')]
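Example analysis:
- ('Confidence', 'NN', 'B-NP'): B-NP marks "Confidence" as the beginning of a noun phrase.
- ('in', 'IN', 'B-PP'): B-PP marks "in" as the beginning of a prepositional phrase.
- ('pound', 'NN', 'I-NP'): I-NP marks "pound" as inside a noun phrase (the NP started by the preceding "the").
- ('is', 'VBZ', 'B-VP'): B-VP marks "is" as the beginning of a verb phrase.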
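The IOB rules above can be sketched as a small grouping function that turns the triples into chunks. This is a plain-Python illustration using the example triples from this section; `iob_to_chunks` is a hypothetical helper, not the routine NLTK uses internally.

```python
# Example triples (word, POS tag, chunk tag) copied from the section above.
triples = [
    ("Confidence", "NN", "B-NP"), ("in", "IN", "B-PP"), ("the", "DT", "B-NP"),
    ("pound", "NN", "I-NP"), ("is", "VBZ", "B-VP"), ("widely", "RB", "I-VP"),
    ("expected", "VBN", "I-VP"), ("to", "TO", "I-VP"), ("take", "VB", "I-VP"),
    ("another", "DT", "B-NP"),
]


def iob_to_chunks(tagged):
    """Group IOB-tagged words into a list of (chunk_type, [words])."""
    chunks = []
    for word, _pos, chunk_tag in tagged:
        if chunk_tag == "O":
            # Words outside any chunk are kept as single-word "O" entries.
            chunks.append(("O", [word]))
            continue
        prefix, chunk_type = chunk_tag.split("-", 1)
        continues_previous = (
            prefix == "I" and bool(chunks) and chunks[-1][0] == chunk_type
        )
        if continues_previous:
            chunks[-1][1].append(word)          # I-XXX: extend the open chunk
        else:
            chunks.append((chunk_type, [word]))  # B-XXX: start a new chunk
    return chunks


for chunk_type, words in iob_to_chunks(triples):
    print(chunk_type, " ".join(words))
```

The grouping reproduces the flat structure of the chunk tree shown in the next subsection: NP, PP, NP, VP, NP for the first ten words.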
3.3.2 Shallow structure
There is no deeper nesting such as clause structure here; the result resembles a flat parse tree. The test sentence is:
Sentence 1:
['Confidence', 'in', 'the', 'pound', 'is', 'widely', 'expected',
'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures',
'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',',
'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', "'s", 'near-record', 'deficits', '.']
The shallow chunk tree is:
(S(NP Confidence/NN)(PP in/IN)(NP the/DT pound/NN)(VP is/VBZ widely/RB expected/VBN to/TO take/VB)(NP another/DT sharp/JJ dive/NN)if/IN(NP trade/NN figures/NNS)(PP for/IN)(NP September/NNP),/,due/JJ(PP for/IN)(NP release/NN)(NP tomorrow/NN),/,(VP fail/VB to/TO show/VB)(NP a/DT substantial/JJ improvement/NN)(PP from/IN)(NP July/NNP and/CC August/NNP)(NP 's/POS near-record/JJ deficits/NNS)./.)
Or, for another sentence:
(S(NP Rockwell/NNP International/NNP Corp./NNP)(NP 's/POS Tulsa/NNP unit/NN)(VP said/VBD)(NP it/PRP)(VP signed/VBD)(NP a/DT tentative/JJ agreement/NN)(VP extending/VBG)(NP its/PRP$ contract/NN)(PP with/IN)(NP Boeing/NNP Co./NNP)(VP to/TO provide/VB)(NP structural/JJ parts/NNS)(PP for/IN)(NP Boeing/NNP)(NP 's/POS 747/CD jetliners/NNS)./.)
The format is:
(Chunk Word/POS)
- Code
from nltk.corpus import conll2000
# load the data
train_data = conll2000.chunked_sents('train.txt')
test_data = conll2000.chunked_sents('test.txt')
print(train_data[0])
3.4 Tag frequency statistics for the three corpora
Top 5 tags
- Brown
[(‘NN’, 152470), (‘IN’, 120557), (‘AT’, 97959), (‘JJ’, 64028), (‘.’, 60638)]
- Treebank
[(‘NN’, 13166), (‘IN’, 9857), (‘NNP’, 9410), (‘DT’, 8165), (‘-NONE-’, 6592)]
- CoNLL-2000
[(‘NN’, 36789), (‘IN’, 27835), (‘NNP’, 24690), (‘DT’, 22355), (‘NNS’, 16653)]
4. Tutorial code
All code used in this tutorial is in this folder:
- https://github.com/disanda/d_code/tree/master/4.nltk