Deployed independently of the Semantic MediaWiki project, this library contains a small collection of helper classes that sanitize text or string elements of arbitrary length, with the aim of improving search match confidence during query execution.
This library includes:
- Transliterator to help convert diacritics or Greek letters into a romanized version
- StopwordAnalyzer to manage a list of registered stopwords
- Tokenizer to split a text by common punctuation
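To make the three steps concrete, here is a minimal sketch in plain PHP. It does not use the library's classes: the function names (demoTransliterate, demoTokenize, demoRemoveStopwords), the character map, and the stopword list are all made up for illustration, and the library's actual implementation may differ.

```php
<?php
// Hypothetical helpers for illustration only; NOT part of the onoi/tesa API.

// Replace a few accented or Greek characters with romanized equivalents.
function demoTransliterate( $text ) {
	$map = array(
		'é' => 'e', 'ü' => 'u', 'ñ' => 'n',
		'α' => 'a', 'β' => 'b'
	);
	return strtr( $text, $map );
}

// Split a text on whitespace and common punctuation.
function demoTokenize( $text ) {
	return preg_split( '/[\s.,;:!?()"]+/u', $text, -1, PREG_SPLIT_NO_EMPTY );
}

// Drop tokens that appear in a (tiny, made-up) stopword list.
function demoRemoveStopwords( array $tokens, array $stopwords ) {
	return array_values( array_diff( $tokens, $stopwords ) );
}

$text = demoTransliterate( 'the café, and the señor!' );
$tokens = demoRemoveStopwords( demoTokenize( $text ), array( 'the', 'and' ) );

echo implode( ' ', $tokens ), "\n"; // cafe senor
```

Reducing a text to romanized, stopword-free tokens in this way is what lets a query for "cafe" match a stored "café".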
Requires PHP 5.3 / HHVM 3.5 or later.
The recommended way to install this library is to add the following dependency to your composer.json:
{
	"require": {
		"onoi/tesa": "~0.1"
	}
}
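use Onoi\Tesa\Sanitizer;
use Onoi\Tesa\Transliterator;

// $string holds the text to be sanitized
$sanitizer = new Sanitizer( $string );
$sanitizer->reduceLengthTo( '200' );
$sanitizer->toLowercase();

// Replace apostrophes and common URI scheme prefixes with an empty string
$sanitizer->replace(
	array( "'", "http://", "https://", "mailto:", "tel:" ),
	array( '' )
);

// Options are combined with a bitwise OR
$sanitizer->applyTransliteration(
	Transliterator::DIACRITICS | Transliterator::GREEK
);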
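The applyTransliteration call above passes two options joined by "|". As a rough sketch of how such option flags usually work in PHP, the example below combines and tests bit flags. The DemoFlags class and its numeric values are assumptions for illustration; the real Transliterator constants may use different values.

```php
<?php
// Hypothetical flag values for illustration; the real constants in
// Transliterator may use different numbers.
class DemoFlags {
	const NONE = 0;
	const DIACRITICS = 2;
	const GREEK = 4;
}

// Combine two options into one integer with bitwise OR.
$flags = DemoFlags::DIACRITICS | DemoFlags::GREEK;

// Test for an individual option with bitwise AND.
$hasDiacritics = ( $flags & DemoFlags::DIACRITICS ) === DemoFlags::DIACRITICS;
$hasGreek = ( $flags & DemoFlags::GREEK ) === DemoFlags::GREEK;

// A single option on its own does not include the other one.
$onlyGreek = DemoFlags::GREEK;
$greekHasDiacritics = ( $onlyGreek & DemoFlags::DIACRITICS ) === DemoFlags::DIACRITICS;

var_dump( $flags ); // int(6)
```

Because each option occupies its own bit, any combination fits in a single integer argument.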
use Onoi\Cache\CacheFactory;
use Onoi\Tesa\StopwordAnalyzer;

// Wrap a MediaWiki cache (here redis) to keep stopword list loading fast
$cacheFactory = new CacheFactory();
$cache = $cacheFactory->newMediaWikiCache( wfGetCache( 'redis' ) );

$stopwordAnalyzer = new StopwordAnalyzer( $cache );
$stopwordAnalyzer->loadListBy( StopwordAnalyzer::DEFAULT_STOPWORDLIST );
// Sanitize the lowercased text using the stopword analyzer
$stopwordAnalyzer = new StopwordAnalyzer( $cache );

$sanitizer = new Sanitizer( $string );
$sanitizer->toLowercase();

$string = $sanitizer->sanitizeBy(
	$stopwordAnalyzer
);
- It is recommended that the StopwordAnalyzer is invoked using a responsive cache provider (such as APC or redis) to minimize any latency when the stopword list is loaded.
- The diacritics conversion table used by the Transliterator has been copied from http://jsperf.com/latinize.
- The stopwords used by the StopwordAnalyzer have been collected from different sources; each json file identifies its origin.
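The caching recommendation in the first note can be illustrated with a small sketch. It is not the library's code: DemoStopwordList and its methods are hypothetical, and an in-memory array stands in for a real cache provider such as APC or redis. The point is only that the JSON source is decoded once and later lookups reuse the result.

```php
<?php
// Hypothetical sketch; class and method names are NOT from onoi/tesa or onoi/cache.
class DemoStopwordList {

	private $cache = array();

	// Counts how often the JSON source actually had to be decoded.
	public $loads = 0;

	// Return the stopwords for a language, decoding the JSON source only once.
	public function get( $languageCode, $json ) {
		if ( !isset( $this->cache[$languageCode] ) ) {
			$this->loads++;
			// Flip the list so lookups can use isset() instead of in_array().
			$this->cache[$languageCode] = array_flip( json_decode( $json, true ) );
		}
		return $this->cache[$languageCode];
	}

	public function isStopword( $languageCode, $json, $word ) {
		$list = $this->get( $languageCode, $json );
		return isset( $list[$word] );
	}
}

$json = '["a", "and", "the"]';
$list = new DemoStopwordList();

$first = $list->isStopword( 'en', $json, 'the' );   // true
$second = $list->isStopword( 'en', $json, 'wiki' ); // false

echo $list->loads, "\n"; // 1
```

With a persistent cache provider the same idea carries across requests, which is where the latency saving mentioned above would come from.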
If you want to contribute work to the project, please subscribe to the developers mailing list and have a look at the contribution guidelines. A list of people who have made contributions in the past can be found here.
The library provides unit tests that cover the core functionality and are normally run by the continuous integration platform. Tests can also be executed manually using the composer phpunit command from the root directory.
- 0.1.0 Initial release (2015-11-??)