hongxin001/news-combinator

News Crawler

A repo for crawling news and then combining similar articles. For now, it uses Python with the Scrapy framework for crawling and Jieba for Chinese word segmentation.
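
As a rough illustration of how similar news items might be matched, here is a minimal sketch of TF-IDF weighting with cosine similarity, the approach the TODO list says was used instead of SVM. This is not the repository's actual code: the function names, the smoothed IDF formula, and the sample tokens are assumptions, and each document is assumed to be already segmented into words (for example by Jieba).

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumption) keeps every weight positive.
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, already segmented headlines.
docs = [
    ["股市", "上涨", "科技"],
    ["股市", "上涨", "银行"],
    ["足球", "比赛", "冠军"],
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

Two articles could then be treated as the same story when their similarity exceeds a chosen threshold; the threshold value would need tuning on real data.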

TODO

  1. Find a proper Chinese segmentation tool
  2. Split a JSON file into small files, each containing a single news item
  3. Read some articles about SVM (SVM was not used in the end; TF-IDF and cosine similarity were used instead)
  4. Try to categorize different news
  5. Build another IDF dictionary from web news
  6. Categorize similar articles that appear on Sina and NetEase but not on Tencent
  7. Build a website that displays the results and shows the comments
  8. Improve categorization performance
  9. Make the website more beautiful!
  10. Rebuild this project in PHP
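
Item 2 above (one news item per file) could be sketched roughly as follows. This assumes the crawler output is a single JSON array of items; the function name `split_news_file` and the `news_00000.json` naming scheme are hypothetical and not taken from the repository.

```python
import json
from pathlib import Path


def split_news_file(source_path, out_dir):
    """Write each item of a JSON array to its own numbered file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(source_path, encoding="utf-8") as f:
        items = json.load(f)  # assumed to be a list of news items
    paths = []
    for index, item in enumerate(items):
        path = out_dir / f"news_{index:05d}.json"
        with open(path, "w", encoding="utf-8") as out:
            # ensure_ascii=False keeps Chinese text readable in the output.
            json.dump(item, out, ensure_ascii=False, indent=2)
        paths.append(path)
    return paths
```

Calling `split_news_file("news.json", "news_items")` would create one file per item under `news_items/`; both paths here are placeholders.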

Website

http://news.reetsee.com

About

Use crawlers to get news, combine the similar ones and display their comments from different websites

Languages

  • C++ 49.5%
  • PHP 17.4%
  • Python 12.1%
  • JavaScript 8.1%
  • HTML 7.0%
  • CSS 4.5%
  • Other 1.4%