A repo for crawling news and then grouping similar articles together.
For now it is written in Python, using the Scrapy framework for crawling and Jieba for Chinese word segmentation.
- Find a suitable Chinese word segmentation tool
- Split the JSON file into small files, each containing a single news article
- Read some articles about SVM (did not use SVM in the end, but TF-IDF and cosine similarity)
- Try to categorize different news
- Build another IDF dictionary from web news

- Categorize similar articles that appear on Sina and NetEase but not on Tencent
- Make a website that displays the results and shows the comments
- Improve categorization performance
- Make the website more beautiful!
- Rebuild this project in PHP
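
The TF-IDF and cosine similarity approach mentioned in the list can be sketched roughly as below. This is a minimal illustration and not the code in this repo: the function names, the smoothed IDF formula, the greedy grouping and the `threshold` value are all assumptions, and the whitespace tokenizer is only a stand-in for a Chinese segmenter such as Jieba.

```python
import math
from collections import Counter


def tokenize(text):
    # Stand-in tokenizer (assumption): splits on whitespace.
    # For Chinese text, a segmenter such as jieba.lcut(text) would go here.
    return text.lower().split()


def build_idf(docs_tokens):
    # Build an IDF dictionary from a list of tokenized documents.
    # Uses a smoothed IDF so every term gets a positive weight.
    n = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict: term -> weight.
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {t: (c / total) * idf.get(t, 0.0) for t, c in counts.items()}


def cosine_similarity(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(texts, threshold=0.5):
    # Greedy grouping (assumption): an article joins the first group whose
    # first member is at least `threshold` similar, else starts a new group.
    docs_tokens = [tokenize(t) for t in texts]
    idf = build_idf(docs_tokens)
    vectors = [tfidf_vector(tokens, idf) for tokens in docs_tokens]
    groups = []  # each group is a list of indices into `texts`
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine_similarity(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


if __name__ == "__main__":
    sample = [
        "stock market falls sharply today",
        "stock market falls sharply again",
        "football team wins final",
    ]
    print(group_similar(sample))
```

The threshold trades precision for recall: a higher value keeps groups tighter but may split reports of the same story that are worded differently across sites.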