Or any set of <page /> with <revision /> stored in an XML file. See required XML schema.
The expected outcome of this script is a Git repository with idempotent revisions.
- Convert MediaWiki wiki page history into a Git repository
- Create git commits per revision, with preserved author and date
- Dump every page into text files, organized by their URL (e.g. css/properties, css/properties/index.md), without any modifications
- Read data from MediaWiki’s recommended way of making backups (i.e. maintenance/dumpBackup.php)
- Get "reports" about the content: deleted pages, redirects, translations, number of revisions per page
- Harmonize titles and convert them into valid file names when they have characters such as :, (, ), @ in their URL (e.g. css/atrules/@viewport redirects to css/atrules/viewport and is served from the HTML file that would be generated from css/atrules/viewport/index.md)
- Create a list of rewrite rules to keep original URLs referring back to the harmonized file name
- Write history of deleted pages "underneath" history of current content
- Ability to run script from backed up XML file (i.e. once we have XML files, no need to run script on same server)
- Import metadata such as Categories, and list of authors into generated files
- Ability to detect if a page is a translation, create a file in the same folder with language name
- Keeps commit dates, but the commits aren’t in chronological order
- Commits follow this loop: loop through each page and, for each page, create a commit for each of its revisions.
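The last two points can be illustrated with a minimal sketch. It is not the workbench's PHP code: the `pages` data and the `plan_commits` helper are hypothetical stand-ins for what is read from the XML dump. It only shows why looping page by page keeps each commit's date while the overall commit order stops being chronological.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for pages read from the XML dump.
pages = [
    {"title": "css/properties", "revisions": [
        {"author": "Jdoe", "date": datetime(2013, 10, 24, 20, 33, 53, tzinfo=timezone.utc)},
        {"author": "Jdoe", "date": datetime(2014, 1, 5, 9, 0, 0, tzinfo=timezone.utc)},
    ]},
    {"title": "css/atrules/@viewport", "revisions": [
        {"author": "Cmills", "date": datetime(2012, 8, 22, 15, 56, 45, tzinfo=timezone.utc)},
    ]},
]


def plan_commits(pages):
    """Yield one planned commit per revision, page by page."""
    for page in pages:
        for revision in page["revisions"]:
            yield {
                "title": page["title"],
                "author": revision["author"],
                "date": revision["date"],
            }


commits = list(plan_commits(pages))
dates = [commit["date"] for commit in commits]
print(len(commits))            # one commit per revision
print(dates == sorted(dates))  # False: the second page starts earlier than the first
```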
- Make sure the folder mediawiki/ exists side by side with this repository and that you can use the MediaWiki instance.
mediawiki-conversion/
mediawiki/
//...
If you want an easy way, you could use MediaWiki-Vagrant and import your own database mysqldump into it.
- Make sure you run the following from where you can run PHP code and use the database, e.g. the MediaWiki-Vagrant VM.
- Run the dumpBackup make target. This should export the content and create a cache of all users.
make dumpBackup
- NOTE: From here on, the remaining steps do not need to run on the same machine as the one running MediaWiki.
MediaWiki isn’t required locally.
Make sure that a copy of your data is available in a running MediaWiki installation; we’ll use the API to get the parser to give us the generated HTML at the 3rd pass.
- Get a feel of your data
Run this command and you’ll know which pages are marked as deleted in history, the redirects, how the files will be named and so on. The output is very verbose, so you may want to send it to a file.
This command makes no external requests; it only reads data/users.json (from make dumpBackup earlier) and the dumpBackup XML file in data/dumps/main_full.xml.
mkdir reports
app/console mediawiki:summary > reports/summary.yml
You can review the summary of the WebPlatform Docs content that was in MediaWiki until 2015-07-28 in the reports/ directory of the webplatform/mediawiki-conversion repository.
If you want more details you can use the --display-author switch.
The option was added so we could commit the file without leaking our users’ email addresses.
More in Reports below.
- Create the errors/ directory
That’s where the script will create a file with the index counter number of each page for which we couldn’t get the MediaWiki API render action to give us HTML output at the 3rd pass.
mkdir errors
- Create the out/ directory
That’s where this script will create a new git repository and convert MediaWiki revisions into Git commits.
mkdir out
- Review the following to adapt to your installation
- lib/mediawiki.php:
  - MEDIAWIKI_API_ORIGIN to match the MediaWiki installation you are exporting from
  - COMMITER_ANONYMOUS_DOMAIN to match the domain name you want to use as your users’ email domain. (Useful to expose history and users, without giving away their real email addresses.)
- If you need to supersede a user, look at the WebPlatform\Importer\Commands\RunCommand class at the comment "Fix duplicates and merge them as only one", uncomment and adjust to suit your needs.
- Review TitleFilter and adapt the rules according to your content
Refer to Reports, at the URL parts variants report, where you may find possible file name conflicts.
- Run first pass
When you delete a document in MediaWiki, you can set a redirect. Instead of writing the history of the page at a location that will be deleted, we’ll write it at the "redirected" location.
This command makes no external requests; it only reads data/users.json (from make dumpBackup earlier) and the dumpBackup XML file in data/dumps/main_full.xml.
app/console mediawiki:run 1
At the end of the first pass, the out/ directory should look empty, with the history of all the deleted pages stored in a new git repository.
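As a rough illustration of the "redirected location" idea, here is a minimal sketch. It assumes the latest revision text of a redirected page uses MediaWiki's `#REDIRECT [[Target]]` wikitext syntax; the `history_location` helper is hypothetical and is not how the workbench itself is necessarily written.

```python
import re

# MediaWiki redirect pages start with "#REDIRECT [[Target]]" (case-insensitive).
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)


def history_location(title, latest_text):
    """Return the redirect target if the page redirects, else the page's own title."""
    match = REDIRECT_RE.match(latest_text)
    return match.group(1).strip() if match else title


print(history_location("sxsw_talk_proposal", "#REDIRECT [[WPD/sxsw_talk_proposal]]"))
print(history_location("css/properties", "Some '''text''' to import"))
```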
- Run second pass
Run through all history, except deleted documents, and write git commit history.
This command can take more than one hour to complete. It all depends on the number of wiki pages and revisions.
app/console mediawiki:run 2
- Run third pass
This is the most time-consuming pass. It makes a request to retrieve the HTML output of the latest revision of every wiki page through MediaWiki’s internal Parser API, see MediaWiki Parsing Wikitext.
At this pass you can resume at a given index and retry pages that didn’t work in a previous run.
While the two other passes commit every revision as a single commit, this one is intended to be ONE big commit containing ALL the conversion results.
Rather than risk losing terminal feedback, you can pipe the output into a log file.
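To give an idea of what such a request looks like, the sketch below builds a URL for MediaWiki's `action=parse` API module. The origin and the `/w/api.php` script path are placeholder assumptions (the real origin is whatever you set in MEDIAWIKI_API_ORIGIN), and `parse_url` is a hypothetical helper, not part of the workbench.

```python
from urllib.parse import urlencode

# Assumption: placeholder origin and script path; adjust to your installation.
API_ORIGIN = "https://wiki.example.org"
API_PATH = "/w/api.php"


def parse_url(title, origin=API_ORIGIN):
    """Build the URL asking MediaWiki to render the latest revision of a page."""
    query = urlencode({"action": "parse", "page": title, "format": "json"})
    return f"{origin}{API_PATH}?{query}"


print(parse_url("css/atrules/@viewport"))
```

Only the URL is built here; fetching it and handling failures is what the third pass spends its time on.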
First time 3rd pass
app/console mediawiki:run 3 > run.log
If everything went well, you should see nothing in the errors/ folder. If so, you are lucky!
Tail the progress in a separate terminal tab. Each run has an "index" specified; if you want to resume at a specific point, you can use that index value in --resume-at=n.
tail -f run.log
3rd pass was interrupted
This can happen if the machine running the process was suspended or lost network connectivity. You can resume at any point by passing to --resume-at=n the index at which the run was interrupted.
app/console mediawiki:run 3 --resume-at 2450 >> run.log
3rd pass completed, but we had errors
The most likely scenario.
Gather a comma-separated list of erroneous pages and run only them.
// for example
app/console mediawiki:run 3 --retry=1881,1898,1900,1902,1966,1999 >> run.log
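The selection logic behind these two switches can be sketched as follows. This is an assumption about the behavior, not the actual command code: in particular, whether --resume-at includes the given index is a guess, and `should_process` is a hypothetical helper.

```python
def should_process(index, resume_at=None, retry=None):
    """Decide whether the page at this index counter is handled in the current run."""
    if retry is not None:
        return index in retry          # --retry: only the listed indexes
    if resume_at is not None:
        return index >= resume_at      # --resume-at: assumed inclusive
    return True                        # first run: everything


indexes = range(1, 3001)  # hypothetical index counter values
resumed = [i for i in indexes if should_process(i, resume_at=2450)]
retried = [i for i in indexes if should_process(i, retry={1881, 1898})]
print(resumed[0], len(resumed))
print(retried)
```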
This repository has the reports generated during the migration of WebPlatform Docs content from MediaWiki, committed in the reports/ folder.
You can overwrite or delete them to keep a trace of your own migration.
They were committed in this repository to illustrate what this workbench produced during the migration.
This report shows the wiki documents that are directly on root. It helps to know which pages are at the top level before running the import.
// file reports/directly_on_root.txt
absolute unit
accessibility article ideas
Accessibility basics
Accessibility testing
// ...
This shows the wiki pages that have more than 100 edits.
// file reports/hundred_revs.txt
tutorials/Web Education Intro (105)
// ...
A summary of the content:
- Iterations: Number of wiki pages
- Content pages: Pages that still have content (i.e. not deleted)
- redirects: Pages that redirect to other pages (i.e. when deleted, the author asked to redirect)
// file reports/numbers.txt
Numbers:
- iterations: 5079
- redirects: 404
- translated: 101
- "content pages": 4662
- "not in a directory": 104
- "redirects for URL sanity": 1217
- "edits average": 7
- "edits median": 5
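The two "edits" figures are plain statistics over the number of revisions per page. A minimal sketch, using made-up revision counts rather than the real WebPlatform data, and assuming the report rounds to whole numbers:

```python
from statistics import mean, median

# Hypothetical revision counts per page; not the real migration data.
revs_per_page = {
    "tutorials/Web Education Intro": 105,
    "css/properties": 5,
    "css/atrules/@viewport": 3,
    "apis/appcache": 7,
    "after": 1,
}

counts = list(revs_per_page.values())
edits_average = round(mean(counts))
edits_median = median(counts)
print(edits_average, edits_median)
```

A single heavily edited page pulls the average up while barely moving the median, which is why the report lists both.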
Pages that were deleted and for which the author asked to redirect.
This will be useful for a web server 301 redirect map.
// file reports/redirects.txt
Redirects (from => to):
- "sxsw_talk_proposal": "WPD/sxsw_talk_proposal"
- "css/Properties/color": "css/properties/color"
// ...
All pages that had invalid filesystem characters (e.g. :, (, ), @) in their URL (e.g. css/atrules/@viewport), to make sure we don’t lose the original URL but serve the appropriate file.
// file reports/sanity_redirects.txt
URLs to return new Location (from => to):
- "tutorials/Web Education Intro": "tutorials/Web_Education_Intro"
- "concepts/programming/about javascript": "concepts/programming/about_javascript"
- "concepts/accessibility/accessibility basics": "concepts/accessibility/accessibility_basics"
// ...
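The entries above only differ by spaces becoming underscores, which can be sketched as below. This is a deliberately reduced illustration: the real TitleFilter class applies more rules (for :, (, ), @ and casing), and `normalize`, `output_file` and the `out/content` root used here are assumptions based on the examples shown in this document.

```python
def normalize(title):
    """Reduced normalization: only turn spaces into underscores."""
    return title.replace(" ", "_")


def output_file(title, root="out/content"):
    """File that would hold the page, following the <url>/index.md layout."""
    return f"{root}/{normalize(title)}/index.md"


titles = [
    "tutorials/Web Education Intro",
    "concepts/programming/about javascript",
    "css/properties",
]

# Only titles whose normalized form differs need a "from => to" entry.
sanity_redirects = {t: normalize(t) for t in titles if normalize(t) != t}
print(sanity_redirects)
print(output_file("tutorials/Web Education Intro"))
```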
Shows all pages, with the number of revisions and the date and message of each commit.
This report is generated through app/console mediawiki:summary, with the output redirected to this file.
# file reports/summary.yml
"tutorials/Web Education Intro":
  - normalized: tutorials/Web_Education_Intro
  - file: out/content/tutorials/Web_Education_Intro/index.md
  - revs: 105
  - revisions:
    - id: 1
      date: Tue, 29 May 2012 17:37:32 +0000
      message: Edited by MediaWiki default
    - id: 1059
      date: Wed, 22 Aug 2012 15:56:45 +0000
      message: Edited by Cmills
# ...
All URLs sorted (as much as PHP can sort URLs).
// file reports/url_all.txt
absolute unit
accessibility article ideas
Accessibility basics
Accessibility testing
after
alignment
apis
apis/ambient light
apis/appcache
// ...
A list of all URL components, only unique entries.
If you have collisions due to casing, you should review them in the URL parts variants report.
// file reports/url_parts.txt
0_n_Properties
1_9_Properties
3d_css
3d_graphics_and_effects
20thing_pageflip
a
abbr
abort
// ...
A list of all URL components, showing variants in casing that will create file name conflicts during conversion.
Not all of the entries in "reports/url_parts_variants.md" are problematic; you’ll have to review all your URLs and adapt your own copy of TitleFilter, see the WebPlatform/Importer/Filter/TitleFilter class.
More about this at Possible file name conflicts due to casing inconsistency.
// file reports/url_parts_variants.txt
All words that exists in an URL, and the different ways they are written (needs harmonizing!):
- css, CSS
- canvas_tutorial, Canvas_tutorial
- The_History_of_the_Web, The_history_of_the_Web, the_history_of_the_web
// ...
Beware of false positives. In the example above, we might have "css" in many parts of the URL, so we can’t just rewrite it in EVERY case. In this case, you’ll notice in the TitleFilter class that we rewrite explicitly in the following format: 'css\/cssom\/styleSheet';, 'css\/selectors';, etc.
You’ll have to adapt TitleFilter to suit your own content.
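The idea behind the variants report can be sketched as grouping every URL component by its lowercased form and keeping the groups that have more than one spelling. This is an illustration only: `casing_variants` is a hypothetical helper and the URLs are samples, not the actual report generator.

```python
from collections import defaultdict


def casing_variants(urls):
    """Map each lowercased URL component to its spellings, when there are several."""
    groups = defaultdict(set)
    for url in urls:
        for part in url.split("/"):
            groups[part.lower()].add(part)
    return {key: sorted(parts) for key, parts in groups.items() if len(parts) > 1}


urls = [
    "tutorials/canvas/canvas_tutorial",
    "tutorials/canvas/Canvas_tutorial/Basic_animations",
    "css/properties",
    "CSS/properties",
]
print(casing_variants(urls))
```

As noted above, a hit in this list is a candidate for review, not automatically a conflict.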
The NGINX redirects to put in place.
These will most likely need adjusting to suit the specifics of your own project.
// file reports/nginx_redirects.map
rewrite ^/wiki/css/selectors/pseudo-elements/\:\:after$ /css/selectors/pseudo-elements/after permanent;
rewrite ^/wiki/css/selectors/pseudo-classes/\:lang\(c\)$ /css/selectors/pseudo-classes/lang permanent;
rewrite ^/wiki/css/selectors/pseudo-classes/\:nth-child\(n\)$ /css/selectors/pseudo-classes/nth-child permanent;
rewrite ^/wiki/css/functions/skew\(\)$ /css/functions/skew permanent;
rewrite ^/wiki/html/attributes/background(\ |_)\(Body(\ |_)element\)$ /html/attributes/background_Body_element permanent;
// ...
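A rule of this shape can be produced from a "from => to" pair with a few lines. The sketch below is an assumption about the format only: `nginx_rewrite` is a hypothetical helper, it escapes just the characters seen in the rules above, and it does not generate the `(\ |_)` alternation used for spaces in the last example.

```python
import re


def nginx_rewrite(source, target, prefix="/wiki/"):
    """Format one permanent rewrite rule from an original URL to its harmonized one."""
    # Backslash-escape the characters escaped in the report: ":", "(", ")" and space.
    escaped = re.sub(r"([:() ])", r"\\\1", source)
    return f"rewrite ^{prefix}{escaped}$ /{target} permanent;"


print(nginx_rewrite("css/functions/skew()", "css/functions/skew"))
print(nginx_rewrite("css/selectors/pseudo-elements/::after",
                    "css/selectors/pseudo-elements/after"))
```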
Here’s a list of repositories that were created through this workbench.
- WebPlatform Docs content from MediaWiki into a git repository
Conflicts can be caused by folders being created with different casing.
For example, consider the following and notice how some titles have capital letters where others don’t:
- concepts/Internet and Web/The History of the Web
- concepts/Internet and Web/the history of the web/es
- concepts/Internet and Web/the history of the web/ja
- tutorials/canvas/canvas tutorial
- tutorials/canvas/Canvas tutorial/Applying styles and colors
- tutorials/canvas/Canvas tutorial/Basic animations
This conversion workbench is about creating files and folders; the list of titles above would therefore become:
concepts/
  - Internet_and_Web/
    - The_History_of_the_Web/
      - index.html
    - the_history_of_the_web/
      - es.html
      - ja.html
tutorials/
  - canvas/
    - canvas_tutorial/
      - index.html
    - Canvas_tutorial/
      - Applying_styles_and_colors/
        - index.html
Notice that we would end up, at the same directory level, with two folders that have almost the same name but different casing.
This is what the TitleFilter class is for.
Two files are required to run the workbench;
- data/dumps/main_full.xml with all the pages and revisions as described in XML Schema
- data/users.json with matching values from contributor XML node from XML Schema, as described in Users.json Schema.
The MediaWiki maintenance/dumpBackup script (see manual, export manual and xsd definition) produces the following XML schema, but this script doesn’t require MediaWiki at all.
In other words, if you can get an XML file with the same schema you can also use this script without changes.
Here are the essential pieces that this script expects, along with notes about where they matter in the context of this workbench.
Notice the <contributor /> XML node; you’ll have to make sure you also have the same values in data/users.json, see Users.json Schema.
<foo>
  <!-- The page XML node will be manipulated via the WebPlatform\ContentConverter\Model\MediaWikiDocument class -->
  <page>
    <!-- The URL of the page. This should be the exact string your CMS supports -->
    <title>tutorials/Web Education Intro</title>
    <!-- id isn’t essential, but we use it because it helps assess how the run is going -->
    <id>1</id>
    <!-- The revision XML node will be manipulated via the WebPlatform\ContentConverter\Model\MediaWikiRevision class -->
    <revision>
      <!-- same as the page id note above -->
      <id>39463</id>
      <!-- format is in explicit "Zulu" Time. -->
      <!-- To import this value in PHP, script does it like this:
           $date = new \DateTime($timestamp, new \DateTimeZone('Etc/UTC')); -->
      <timestamp>2013-10-24T20:33:53Z</timestamp>
      <!-- contributor XML node requires both username and id pair. The values must match in data/users.json -->
      <contributor>
        <username>Jdoe</username>
        <!-- id must be an integer. This workbench will typecast this node into an integer. -->
        <id>11</id>
      </contributor>
      <!-- comment can be any string you want. The commit message will strip off space, HTML code, and new lines -->
      <comment>Some optional edit comment</comment>
      <!-- The page content at that revision. Format isn’t important -->
      <text xml:space="preserve">Some '''text''' to import</text>
    </revision>
    <!-- more revision nodes go here -->
  </page>
  <!-- more page nodes go here -->
</foo>
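To check that an XML file from another source has the expected shape, a small reader such as the one below can help. It is a sketch, not the workbench's PHP model classes: `read_revisions` is a hypothetical helper, and it assumes the elements carry no XML namespace, as in the example above (an export with a namespaced root element would need namespace-aware lookups).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SAMPLE = """<foo>
  <page>
    <title>tutorials/Web Education Intro</title>
    <id>1</id>
    <revision>
      <id>39463</id>
      <timestamp>2013-10-24T20:33:53Z</timestamp>
      <contributor><username>Jdoe</username><id>11</id></contributor>
      <comment>Some optional
        edit comment</comment>
      <text xml:space="preserve">Some '''text''' to import</text>
    </revision>
  </page>
</foo>"""


def read_revisions(xml_text):
    """Yield one dict per revision with the fields the schema notes call essential."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        for rev in page.iter("revision"):
            stamp = datetime.strptime(rev.findtext("timestamp"), "%Y-%m-%dT%H:%M:%SZ")
            yield {
                "title": title,
                "date": stamp.replace(tzinfo=timezone.utc),      # "Zulu" time is UTC
                "user_id": int(rev.findtext("contributor/id")),  # typecast to integer
                "username": rev.findtext("contributor/username"),
                # Collapse whitespace and new lines, as a commit message would.
                "comment": " ".join((rev.findtext("comment") or "").split()),
                "text": rev.findtext("text"),
            }


revisions = list(read_revisions(SAMPLE))
print(revisions[0]["date"].isoformat(), revisions[0]["user_id"], revisions[0]["comment"])
```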
The origin of the data isn’t important but you have to make sure that it matches the values in the XML schema:
- "user_id" === //foo/page/revision/contributor/id. Note that the value is a string but the class WebPlatform\ContentConverter\Model\MediaWikiContributor will typecast it into an integer
- "user_name" === //foo/page/revision/contributor/username.
As for the email address, it isn’t required because we’ll create a git committer ID by concatenating the value of "user_name" AND the value you set in lib/mediawiki.php at the COMMITER_ANONYMOUS_DOMAIN constant (e.g. if COMMITER_ANONYMOUS_DOMAIN is set to "docs.webplatform.org", the commit author and committer will be Jdoe@docs.webplatform.org).
[
{
"user_email": "jdoe@example.org"
,"user_id": "11"
,"user_name": "Jdoe"
,"user_real_name": "John H. Doe"
,"user_email_authenticated": null
}
]
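The concatenation described above can be sketched in a few lines. This is an illustration under assumptions: `committers` is a hypothetical helper, the `Name <email>` identity layout is the usual git convention rather than something confirmed by the workbench code, and the domain constant mirrors the example value of COMMITER_ANONYMOUS_DOMAIN.

```python
import json

# Mirrors the example value of COMMITER_ANONYMOUS_DOMAIN in lib/mediawiki.php.
ANONYMOUS_DOMAIN = "docs.webplatform.org"

USERS_JSON = """[
  {
    "user_email": "jdoe@example.org",
    "user_id": "11",
    "user_name": "Jdoe",
    "user_real_name": "John H. Doe",
    "user_email_authenticated": null
  }
]"""


def committers(users_json, domain=ANONYMOUS_DOMAIN):
    """Map integer user ids to a git identity that hides the real email address."""
    identities = {}
    for user in json.loads(users_json):
        name = user["user_name"]
        # user_id is a string in the file; typecast it like the contributor id.
        identities[int(user["user_id"])] = f"{name} <{name}@{domain}>"
    return identities


print(committers(USERS_JSON)[11])
```

The real "user_email" value is read but never used, which is the point of the anonymous domain.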