Unbindery

A web app for crowdsourcing transcription, written in PHP and JavaScript. Licensed under the MIT license.

Dependencies

Twig (1.9.2 included)
sfYaml (included)
uploadify (2.1.4 included)
MediaElement.js (2.10.0 included)
ffmpeg (for audio transcription)

Installation

Clone this repository or unpack the files.
Create a database and user in MySQL.
Copy config.sample.yaml to config.yaml and edit it.
If you want to transcribe audio, install ffmpeg and set the path to it in config.yaml.
Create the directory htdocs/media and give your web server (Apache, nginx, etc.) write rights to it.
Set your web server to point to htdocs for the site's document root.
In your php.ini, set upload_max_filesize to something big enough (128M, etc.).
In your php.ini, set post_max_size to something big enough (128M, etc.). It should be at least as large as upload_max_filesize.
In your php.ini, set max_file_uploads to something big enough (200, etc.).
Go to /install in your browser and create an admin account.

Getting started

Project types and user roles

After installation, you'll end up on the dashboard. You can create a new project from here, or later from the Admin page. There are two project types:

System: These go under /projects/[slug]. Generally intended for installations dedicated to a single project (like the Mormon Texts Project).
User: These go under /users/[username]/projects/[slug]. Mostly intended for small private projects. (But they don't have to be private or small.)
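
As a rough illustration of the two URL patterns above, here is a minimal JavaScript sketch. The function name `projectPath` and its arguments are hypothetical and not part of Unbindery; only the two path shapes come from this README.

```javascript
// Hypothetical helper (not part of Unbindery): builds a project path
// from the two URL patterns described above.
function projectPath(type, slug, username) {
  if (type === "system") {
    return `/projects/${slug}`;
  }
  if (type === "user") {
    if (!username) {
      throw new Error("user projects need a username");
    }
    return `/users/${username}/projects/${slug}`;
  }
  throw new Error(`unknown project type: ${type}`);
}

console.log(projectPath("system", "my-book"));
console.log(projectPath("user", "my-book", "alice"));
```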

And there are three system roles for users:

User: proofers and reviewers
Creator: can create user projects
Admin: site admin, can create system projects and user projects

Creating a new project

Choose the project type you want, then click the button. You can now fill out the project title, visibility (public or private), and an optional description and language. Public projects are visible to all users on the Projects page. Private projects are invisible to all except for the people who are members of the project.

Click Create Project. You'll be taken to the project admin page, where you can edit the project in detail.

Project Members: This is for adding users to private projects, adding reviewers and admins, and removing people from projects. Type in the username, choose the role, and click Add. To remove a user, click the X to the right of their name.
Custom Item Fields: If you want to add extra fields for users to fill out while transcribing, this is the place. See the examples.
Status: Can be Pending, Active, or Completed. Only active projects show up on the projects page.
Workflow: For now, leave as @proofer, @proofer, @reviewer.
Characters: Space-separated list of characters to be put in the character pad (on the proofing page).
Download Template: The template for each item when downloading a final transcript. Variables can be included using double curly brackets. Available variables:
- {{ transcript }} -- the transcript itself (if there are reviews, this collates each item's reviews; if there are only proofs, this collates each item's proofs)
- {{ item.title }} -- the item title
- {{ item.id }} -- the item ID #
- {{ item.type }} -- the item type (page, audio, etc.)
- {{ item.status }} -- the item status
- {{ item.href }} -- the item href (URL for filename)
- {{ project.title }} -- the project title
- {{ project.public }} -- the project's visibility (public or private)
- {{ project.slug }} -- the project slug
- {{ project.language }} -- the project language
- {{ project.description }} -- the project description
- {{ project.owner }} -- the project owner (a username)
- {{ project.status }} -- the project status
- {{ project.guidelines }} -- the project guidelines
- {{ proofers }} -- a comma-separated list of the users who proofed this item
- {{ reviewers }} -- a comma-separated list of the users who reviewed this item
- {{ fields.field_slug }} -- the fields filled out by the user (the field slug is generated by taking the field name, lowercasing it, and replacing spaces with underscores -- for example, "Page Number" becomes "page_number")
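
To make the slug rule and the variable syntax concrete, here is a small JavaScript sketch. `fieldSlug` follows the rule stated above (lowercase, spaces to underscores). `renderTemplate` is a simplified stand-in written for illustration only: it is not how Unbindery renders download templates, and both function names are hypothetical.

```javascript
// Slug rule from the list above: lowercase the field name and
// replace each space with an underscore.
function fieldSlug(fieldName) {
  return fieldName.toLowerCase().replace(/ /g, "_");
}

// Illustrative stand-in only: substitutes {{ dotted.path }} variables
// with values looked up in a context object. Unknown variables become "".
function renderTemplate(template, context) {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match, path) => {
    const value = path
      .split(".")
      .reduce((obj, key) => (obj == null ? undefined : obj[key]), context);
    return value == null ? "" : String(value);
  });
}

const context = {
  transcript: "The text for page 1.",
  item: { title: "Page 1" },
  fields: { [fieldSlug("Page Number")]: "12" },
};

console.log(
  renderTemplate("{{ item.title }} (p. {{ fields.page_number }})\n{{ transcript }}", context)
);
```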

Adding items to a project

On the project admin page, click SELECT FILES and upload the files you'd like to add to the project.

Page: JPEG/PNG/GIF images
Audio: MP3 files

For images, one item will be added per file. For audio, each MP3 will be split into segments (the length of the segment is set in the config file), and one item will be added per segment.
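
The number of audio items follows from the MP3's duration and the configured segment length. The sketch below only shows that arithmetic. The actual splitting is done with ffmpeg, and the function name, the use of seconds as the unit, and the handling of the final short segment are assumptions.

```javascript
// Hypothetical sketch: list the [start, end) ranges that one MP3 would be
// cut into, given its duration and a segment length (both in seconds).
function segmentRanges(durationSeconds, segmentSeconds) {
  if (!(segmentSeconds > 0)) {
    throw new Error("segment length must be positive");
  }
  const ranges = [];
  for (let start = 0; start < durationSeconds; start += segmentSeconds) {
    // The last segment is assumed to be shorter when the duration
    // is not an exact multiple of the segment length.
    ranges.push({ start, end: Math.min(start + segmentSeconds, durationSeconds) });
  }
  return ranges;
}

// A 95-second file with 30-second segments yields one item per range.
console.log(segmentRanges(95, 30).length);
```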

Importing pre-existing transcripts

If you're doing OCR correction, the way to get the OCRed text into the project is by importing the transcript. Click the Import Transcript button.

On the right there's a list of the items that text will be imported for. If you want to get rid of any (because of blank pages or whatnot), click the X. This will only remove the items from the import procedure and won't delete them.

Paste your pre-existing text into the Transcript box. Then edit the template so it matches the transcript. The template is a regex that must contain a named group called text, which captures the transcript for one item per match. For example, take the following pre-existing transcript:

<page>The text for page 1.</page>
<page id="2">The text for page 2.</page>
<page id="3">The text for page 3.</page>

This template matches the above text:

<page.*?>(?P<text>.*?)<\/page>

There is a live preview so you can make sure things match up. The number of items generated in the preview needs to match the number of items in the right sidebar.
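
To check a template outside the app before pasting it in, something like the following JavaScript sketch can help. It is not Unbindery's import code. Note two assumptions: JavaScript writes the named group as `(?<text>...)` rather than the `(?P<text>...)` form shown above, and the `s` flag (so that `.` can span line breaks inside a page) is a choice made for this sketch.

```javascript
const transcript = [
  "<page>The text for page 1.</page>",
  '<page id="2">The text for page 2.</page>',
  '<page id="3">The text for page 3.</page>',
].join("\n");

// Same pattern as the template above, with the named group rewritten
// in JavaScript syntax. "g" finds every match; "s" lets "." match newlines.
const template = /<page.*?>(?<text>.*?)<\/page>/gs;

// Returns the captured "text" group of every match, one entry per item.
function splitTranscript(text, pattern) {
  return Array.from(text.matchAll(pattern), (m) => m.groups.text);
}

const items = splitTranscript(transcript, template);
const expectedItemCount = 3; // number of items listed in the right sidebar

console.log(items);
console.log(items.length === expectedItemCount ? "counts match" : "counts differ");
```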

Once everything's ready, click Import Transcript.

Proofing

Add yourself as a proofer (in the Project Members area), then click Save and Activate. (This is the same as changing the project status to Active and clicking Save Project.) Now click on the Dashboard link.

You should see your project in the Proofing Queue area. Click Get new item. This assigns the next available item to you and sends you to the proofing page.

Fill out the transcript. For page images, you can click on the image to set a highlight bar that helps you keep track of where you are.

Buttons on the proofing page:

Project Guidelines: show guidelines for the project (if any have been set up)
Characters: toggle a character pad where you can easily add characters that are hard to type
Save draft: save the current transcript as a draft and return to the dashboard
Finish: finish the current transcript and return to the dashboard
Finish & Continue: finish the current transcript and proof the next item

Reviewing

When the requisite number of proofs for an item has been finished, the item moves to the review queue. Reviewing is similar to proofing, except that a review combines the existing proofs and populates the transcript box with the differences between them.
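
One way to picture the hand-off between queues is to read the Workflow setting (@proofer, @proofer, @reviewer) as a list of required steps. The JavaScript sketch below works under that assumption. It is an interpretation for illustration, not Unbindery's actual logic, and the function names and queue labels are hypothetical.

```javascript
// Split a workflow string such as "@proofer, @proofer, @reviewer" into steps.
function parseWorkflow(workflow) {
  return workflow
    .split(",")
    .map((step) => step.trim())
    .filter((step) => step.length > 0);
}

// Assumption: each @proofer entry is one required proof and each
// @reviewer entry is one required review.
function queueFor(workflow, finishedProofs, finishedReviews) {
  const steps = parseWorkflow(workflow);
  const proofsNeeded = steps.filter((s) => s === "@proofer").length;
  const reviewsNeeded = steps.filter((s) => s === "@reviewer").length;
  if (finishedProofs < proofsNeeded) return "proofing";
  if (finishedReviews < reviewsNeeded) return "review";
  return "done";
}

const workflow = "@proofer, @proofer, @reviewer";
console.log(queueFor(workflow, 1, 0));
console.log(queueFor(workflow, 2, 0));
console.log(queueFor(workflow, 2, 1));
```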

Downloading the final transcript

After proofing and reviewing are done, return to the project admin page and click Download Transcript in the upper left to download the final transcript, using the download template as described above.

Customization

Each entry in config.yaml is explained in more detail below.

db: The database engine. Right now MySQL is the only option. To add a new database adapter, create a new directory in modules/db with the name of the adapter, create a DbYourAdapterName.class.php file in that directory, and change this db entry to point to it. The adapter will need to expose all the functions found in DbMySQL.class.php.
auth: The authentication engine. Right now Alibaba is the only option. To add a new authentication adapter, create a new file in modules/auth named AuthYourAdapterName.class.php.
database: database settings. Generally, you'll want to leave host as 127.0.0.1 (localhost), but you'll need to set database to the name of the database you've created, username to the database user's username, and password to the database user's password.
title: The page title for this installation of Unbindery. Shows up in the upper left of all pages.
app_url: Ignore the &SITEROOT at the beginning -- it's used so you don't have to define this again later on. Change http://path/to/unbindery to the URL you're hosting Unbindery at.
sys_path: The filesystem path to Unbindery.
admin_email: Email address for the site administrator.
language: Default is en. If you change this, make sure you have a corresponding file in the translations/ directory.
email_subject: A string that is prepended to any notification emails. Can be blank.
theme_cached: If you want the Twig templates cached, set this to true.
theme: The system theme. Default is core.
external_login: If you change the auth adapter to something else (like a Google account), set this to true.
allow_signup: Whether to show the signup link on the home page. Only applies if external_login is false.
download_template: The default download template for new projects.
alibaba: Configuration options for Alibaba. You shouldn't need to change anything here.
system_guidelines: Guidelines shown on every project page before the project-specific guidelines. Written in HTML.
private_key: Used for web service authentication. Not currently used.
devkeys: Used for web service authentication. Not currently used.
google_analytics: If you want to hook your installation of Unbindery up to Google Analytics, put your UA code here.
scoring: How many points users get for proofing and reviewing.
editors: A list of editor types (page and audio are the defaults), with css or js arrays for CSS/JS files to be included (paths are relative to htdocs/themes/[theme]/[css|js]/editors/[editor-type]).
uploaders: A list of item uploader module types. The corresponding PHP files are found in modules/uploaders. For example, the code for the Page uploader type is in modules/uploaders/PageUploader.class.php. Each uploader type has an extensions field which is an array of extensions this item type uploader handles. Additional options may also be set (as seen in the Audio uploader options).
notifications: A list of notifications, with an array of targets for each. Targets can be @user, @projectadmin, or @admin. If you add new notifications, make sure you add them to the translations file as well.

Adding new editor types

To add a new editor type (for example, one with XML tagging support), you need to do the following:

Add the editor type to config.yaml (see above).
Add the HTML for the editor at templates/core/editors/[editor-name].html
Add the CSS for the editor at htdocs/themes/core/css/editors/[editor-name]/[editor-name].css
Add the JavaScript for the editor at htdocs/themes/core/js/editors/[editor-name]/[editor-name].js

You can look at the existing editors to see how things are set up.
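
As a quick checklist of the files involved, this JavaScript sketch builds the three paths listed above for a given editor name. The function `editorPaths` is hypothetical, and parameterizing the theme name (including under templates/) is an assumption. The README only spells out the paths for the core theme.

```javascript
// Hypothetical helper: where the HTML, CSS, and JS for a new editor
// would live, following the path patterns listed above.
function editorPaths(editorName, theme = "core") {
  return {
    html: `templates/${theme}/editors/${editorName}.html`,
    css: `htdocs/themes/${theme}/css/editors/${editorName}/${editorName}.css`,
    js: `htdocs/themes/${theme}/js/editors/${editorName}/${editorName}.js`,
  };
}

// "xml" is an example editor name, echoing the XML tagging example above.
console.log(editorPaths("xml"));
```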

Adding new uploader types

To add a new item uploader type (for example, one that takes a PDF and splits it into individual pages and converts each page to a JPEG), you need to do the following:

Add the uploader type to config.yaml (see above).
Add the PHP for the uploader at modules/uploaders/[UploaderName]Uploader.class.php

An uploader extends the ItemTypeUploader class and defines the preprocess function, which takes a list of filenames ($filenames) and processes them. The uploader needs to set the following properties:

$this->files: the final array of files to be moved into the media directory. For example, the PDF splitter would create a new entry in $this->files for each page of the original PDF.
$this->itemData: an array of $item (itself an array of title, project_id, transcript, type, and href), used to add the items to the database. Needs to have the same number of items as $this->files.

See the audio uploader type for an example.
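
A real uploader is a PHP class, so the JavaScript below is only a model of the data contract described above: one entry in files and one matching entry in itemData per generated item. Everything specific here is an assumption for illustration, including the function signature, the pagesPerFile lookup (standing in for actual PDF splitting), and the output filename pattern.

```javascript
// Hypothetical model of what a PDF-splitting preprocess step would produce.
// pagesPerFile maps each filename to its page count; no PDFs are read here.
function preprocess(filenames, projectId, pagesPerFile) {
  const files = [];
  const itemData = [];
  for (const filename of filenames) {
    const base = filename.replace(/\.pdf$/i, "");
    const pageCount = pagesPerFile[filename] || 0;
    for (let page = 1; page <= pageCount; page++) {
      const href = `${base}-${page}.jpg`;
      files.push(href);
      // Each item carries the five keys named above.
      itemData.push({
        title: `${base} page ${page}`,
        project_id: projectId,
        transcript: "",
        type: "page",
        href,
      });
    }
  }
  return { files, itemData };
}

const result = preprocess(["book.pdf"], 7, { "book.pdf": 2 });
// The two arrays must stay the same length.
console.log(result.files.length === result.itemData.length);
```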

Acknowledgments

Thanks to Ryan Martinsen for his fork of Raymond Hill's FineDiff.
