floge/dmoz2mysql

 
 

Read and convert data from dmoz.org. Uses the structure.rdf.u8 and content.rdf.u8 files from http://rdf.dmoz.org as input, and writes them into a MySQL database and out as Turtle (TTL) files.

=== dmoz2mysql ===

PHP command-line scripts that convert the dmoz RDF files and import them into a MySQL database.

The dmoz RDF files use the outdated RDF/XML format, which is hard to parse, and on top of that they are non-standard. As a result, any standard triple store import tool chokes on them.

Downloaded from http://sourceforge.net/projects/dmoz2mysql/ as ZIP.

  • Version: 3.0
  • Author: Amir Salihefendic mailto:amix@amix.dk
  • Copyright: JFL Webcom http://www.webcom.dk

See README-dmoz2mysql.html for more info.

Edit config.php with your database info etc. Then call on the command line: php start_script.php

It is a PHP console script, not a PHP web page application.

=== mysql2ttl ===

Python scripts by Ben Bucksch that read from MySQL and export to Turtle (ttl) triple files.

Turtle (http://www.w3.org/TR/turtle/) ttl files are a standard format for large RDF / triple / LOD data dumps and can be imported easily into most triple store databases.

Edit config.py with your database info. Then call on the command line: python start.py
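
As a rough illustration of what such an export involves, the sketch below turns rows into Turtle statements. It is not the code in start.py: the row layout (category id, title), the `ex:` prefix, the `http://example.org/dmoz/` namespace and all function names are assumptions made for this example. The rows are hard-coded so that the sketch runs without a database.

```python
# Hypothetical sketch: serialize (category id, title) rows as Turtle.
# Prefix, namespace, and row layout are assumptions, not taken from mysql2ttl.

PREFIXES = (
    "@prefix ex: <http://example.org/dmoz/> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)


def escape_literal(text):
    """Escape a string for use inside a double-quoted Turtle literal."""
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )


def row_to_triple(cat_id, title):
    """Build one Turtle statement for a category row."""
    return 'ex:cat{} rdfs:label "{}" .'.format(int(cat_id), escape_literal(title))


def rows_to_turtle(rows):
    """Return a complete Turtle document for an iterable of (id, title) rows."""
    lines = [PREFIXES]
    lines.extend(row_to_triple(cat_id, title) for cat_id, title in rows)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Stand-in for rows fetched from MySQL.
    sample_rows = [(1, "Top"), (2, 'Arts "and" Crafts')]
    print(rows_to_turtle(sample_rows), end="")
```

The escaping step matters in practice: titles and descriptions often contain quotes or backslashes, and a single unescaped one makes the whole file unparseable for a triple store.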

Future:

  • I wish that the dmoz.org project would just offer ttl files for download instead of the malformed RDF/XML files. This would remove the need for mysql2ttl.
  • A smaller improvement would be to modify class_parse.php here to write out TTL directly instead of executing SQL INSERT queries. This shouldn't be hard, but I don't want to touch PHP :).
  • Until then, I hope this converter might be useful for somebody.

=== File sizes ===

rdf.dmoz.org download files:

  • 85M structure.rdf.u8.gz
  • 247M content.rdf.u8.gz
  • 331M total

extracted:

  • 886M structure.rdf.u8
  • 1.7G content.rdf.u8
  • 2.5G total

converter output:

  • 22M categories.ttl.gz
  • 26M category-hierarchy.ttl.gz
  • 36M links.ttl.gz
  • 207M link-titles.ttl.gz
  • 290M total

extracted:

  • 175M categories.ttl
  • 592M category-hierarchy.ttl
  • 246M links.ttl
  • 919M link-titles.ttl
  • 1.9G total
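
Given these sizes, it can be convenient to keep the output compressed and stream it rather than extract it. The following is a minimal sketch using Python's standard `gzip` module. The function names, the demo file name and the `ex:` statements are made up for illustration, and the statement count assumes one statement per line, which this README does not guarantee for the converter's output.

```python
import gzip
import os
import tempfile


def write_ttl_gz(path, lines):
    """Write Turtle lines straight into a gzip file; return the line count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
            count += 1
    return count


def count_statements(path):
    """Stream a .ttl.gz file and count non-empty lines that are not @prefix lines.

    Assumes one statement per line (an assumption of this sketch).
    """
    total = 0
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:  # reads incrementally, never loads the whole file
            stripped = line.strip()
            if stripped and not stripped.startswith("@prefix"):
                total += 1
    return total


if __name__ == "__main__":
    demo = [
        "@prefix ex: <http://example.org/dmoz/> .",
        "ex:cat1 ex:parent ex:cat0 .",
        "ex:cat2 ex:parent ex:cat1 .",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.ttl.gz")
        write_ttl_gz(path, demo)
        print(count_statements(path))
```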

About

Scripts by Amir Salihefendic and me to convert dmoz RDF to a MySQL database, and from MySQL to Turtle (ttl) triple files

Languages

  • PHP 86.0%
  • HTML 7.6%
  • Python 6.4%