Cross-Language Dataset

Description

This is a multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection. More precisely, it has the following characteristics:

it is multilingual: French, English and Spanish;
it provides cross-language alignment information at different granularities: document-level, sentence-level and chunk-level;
it is based on both parallel and comparable corpora;
it contains both human-translated and machine-translated text;
part of it has been altered (to make cross-language similarity detection harder) while the rest remains free of noise;
its documents were written by different types of authors, from average writers to professionals.

Characteristics

Sub-corpus	Alignment	Authors	Translations	Translators	Alteration
JRC-Acquis²	Parallel	Politicians	Human	Professional	No
Europarl¹	Parallel	Politicians	Human	Professional	No
Wikipedia²	Comparable	Anyone	-	-	Noise
PAN-PC-11³	Parallel	Professional authors	Human	Professional	Yes
APR (Amazon Product Reviews⁴)	Parallel	Anyone	Machine	Google Translate	No
Conference papers	Comparable	Computer scientists	Human	Computer scientists	Noise

Statistics

Sub-corpus	# Aligned documents	# Aligned sentences	# Aligned noun chunks
JRC-Acquis²	10,000	149,506	10,094
Europarl¹	9,431	475,834	25,603
Wikipedia²	10,000	4,792	132
PAN-PC-11³	2,920	88,977	1,360
APR (Amazon Product Reviews⁴)	6,000	23,235	2,603
Conference papers	35	1,304	272

For more statistics, see the STATS/ directory.

Repository description

In the Aligned_Documents_Sub_Corpus/ directory, you can find the dataset of parallel and comparable files aligned at document-level (one file represents one document).
In the Aligned_Sentences_Sub_Corpus/ directory, you can find the dataset of parallel and comparable files aligned at sentence-level (one line of a file represents one sentence).
In the Aligned_Chunks_Sub_Corpus/ directory, you can find the dataset of parallel and comparable files aligned at chunk-level (one line of a file represents one noun chunk).
In the STATS/ directory, you can find XLSX and HTML files with statistics on the dataset.
In the tools/ directory, you can find the files needed to rebuild the dataset from the pre-existing corpora.
In the Aligned_Documents_Sub_Corpus/Conference_papers/ directory, you can also find a pdf_conference_papers/ directory containing the original scientific papers in PDF format.
In the *_Sub_Corpus/PAN11/ sub-directories, you can also find a metadata/ directory containing additional information about the PAN-PC-11 alignments.
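As a minimal sketch of how the sentence-level and chunk-level files might be consumed, the Python function below reads two files in parallel and yields one pair per line. It assumes that line i of one file is aligned with line i of the other, and that files are UTF-8 encoded; the README states only that one line represents one sentence or noun chunk, so check the actual file layout first. The function name and any paths passed to it are hypothetical.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple, Union

PathLike = Union[str, Path]


def read_aligned_pairs(source_path: PathLike,
                       target_path: PathLike) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) units from two line-aligned text files.

    Assumption: line i of source_path is aligned with line i of target_path.
    """
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which lets us
            # detect files that do not have the same number of lines.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\r\n"), t.rstrip("\r\n")


# Usage (hypothetical file names, replace with real paths from the repository):
# for en_sentence, fr_sentence in read_aligned_pairs("doc.en.txt", "doc.fr.txt"):
#     print(en_sentence, "|||", fr_sentence)
```

Raising an error on a length mismatch, rather than silently truncating, makes a misaligned pair of files visible immediately.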

Tools directory

This directory contains the tools that we used to build the corpus. We also provide them in case anyone is interested in extending the corpus.

In the tools/chunking/ directory, you can find a script to extract noun chunks from a POS sequence from TreeTagger⁵.
In the tools/create_translations_dico/ directory, you can find a script to build a unigram translation dictionary for use with HunAlign⁶.
In the tools/create_verif_align/ directory, you can find a script to print and save the alignments in a readable format.
In the tools/enrich_dico_with_dbnary/ directory, you can find a script to enrich a unigram translation dictionary with DBNary⁷ entries.
In the tools/parse_APR_collection/ directory, you can find a script to parse the Webis-CLS-10⁴ corpus and extract the English-French pairs.
In the tools/parse_PAN_collection/ directory, you can find a script to parse the PAN-PC-11³ corpus and extract the English-Spanish pairs with metadata.
In the tools/parse_conf_papers_bibtex/ directory, you can find a script to parse the TALN BibTeX⁸, crawl the web and thus allow the construction of French-English conference paper pairs.
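To illustrate what extracting noun chunks from a POS sequence can look like, here is a small Python sketch. It is not the script from tools/chunking/: the chunk pattern (an optional determiner, any adjectives, then one or more nouns), the tag sets and the function names are all assumptions made for illustration. It also assumes tab-separated tagger output with the token in the first column and the POS tag in the second, and English tag names such as DT, JJ, NN and NP.

```python
from typing import Iterable, List, Tuple

# Assumed tag sets for English; adapt them to the tagset actually in use.
DETERMINERS = {"DT"}
ADJECTIVES = {"JJ", "JJR", "JJS"}
NOUNS = {"NN", "NNS", "NP", "NPS"}


def parse_tagged_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Turn 'token<TAB>tag[<TAB>lemma]' lines into (token, tag) pairs."""
    tagged = []
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) >= 2:
            tagged.append((parts[0], parts[1]))
    return tagged


def extract_noun_chunks(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Greedy scan for the pattern: determiner? adjective* noun+."""
    chunks: List[str] = []
    current: List[str] = []   # tokens of the candidate chunk
    has_noun = False          # a candidate only counts once it holds a noun
    for token, tag in tagged:
        if tag in NOUNS:
            current.append(token)
            has_noun = True
            continue
        # Any non-noun tag closes a candidate that already contains a noun.
        if has_noun:
            chunks.append(" ".join(current))
            current, has_noun = [], False
        if tag in DETERMINERS:
            current = [token]       # a determiner starts a new candidate
        elif tag in ADJECTIVES:
            current.append(token)   # adjectives extend the candidate
        else:
            current = []            # anything else discards the candidate
    if has_noun:
        chunks.append(" ".join(current))
    return chunks
```

For example, feeding it the tagged sequence for "The European Parliament adopted a new regulation ." (tags DT, JJ, NP, VVD, DT, JJ, NN, SENT) yields the chunks "The European Parliament" and "a new regulation".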

To manage the encoding of the files, we use the ForceUTF8⁹ class coded by Sebastián Grignoli.
To detect the language of a text, we use the PHP implementation¹⁰ by Nicholas Pisarro of the Cavnar and Trenkle (1994)¹¹ classification algorithm.
To query DBNary⁷, we use PHP class-interfaces¹².

If you have additional questions, please send them to me by email at jeremy.ferrero@imag.fr.

References, tools used and pre-existing collections

Europarl
Philipp Koehn (2005).
Europarl: A Parallel Corpus for Statistical Machine Translation.
In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86. AAMT.
url: http://opus.lingfil.uu.se/Europarl.php
CL-PL-09 (JRC-Acquis + Wikipedia)
Martin Potthast, Alberto Barrón-Cedeño, Benno Stein, and Paolo Rosso (2011).
Cross-Language Plagiarism Detection.
In Language Resources and Evaluation, volume 45, pages 45–62.
url: http://users.dsic.upv.es/grupos/nle/downloads.html
PAN-PC-11
Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso (2010).
An Evaluation Framework for Plagiarism Detection.
In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China, August 2010. Association for Computational Linguistics.
url: http://www.uni-weimar.de/en/media/chairs/webis/corpora/pan-pc-11/
Webis-CLS-10 (Amazon Product Reviews)
Peter Prettenhofer and Benno Stein (2010).
Cross-language text classification using structural correspondence learning.
In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1118-1127.
url: http://www.uni-weimar.de/en/media/chairs/webis/corpora/corpus-webis-cls-10/
TreeTagger
Helmut Schmid (1994).
Probabilistic Part-of-Speech Tagging Using Decision Trees.
In Proceedings of the International Conference on New Methods in Language Processing.
url: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
HunAlign
Dániel Varga, Péter Halácsy, Viktor Nagy, László Németh, András Kornai, and Viktor Trón (2005).
Parallel corpora for medium density languages.
In Recent Advances in Natural Language Processing (RANLP 2005), pages 590–596.
url: http://mokk.bme.hu/en/resources/hunalign/
licence: GNU LGPL version 2.1 or later
DBNary
Gilles Sérasset (2014).
DBnary: Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF.
To appear in Semantic Web Journal (special issue on Multilingual Linked Open Data).
url: http://kaiko.getalp.org/about-dbnary/
licence: Creative Commons Attribution-ShareAlike 3.0
TALN Archives
Florian Boudin (2013).
TALN Archives : a digital archive of French research articles in Natural Language Processing (TALN Archives : une archive numérique francophone des articles de recherche en Traitement Automatique de la Langue) [in French].
In Proceedings of TALN 2013 (Volume 2: Short Papers), pages 507–514.
url: https://github.com/boudinfl/taln-archives
licence: Creative Commons Attribution-NonCommercial 3.0
ForceUTF8
url: https://github.com/neitanod/forceutf8
licence: BSD
Text Language Detect
url: https://github.com/webmil/text-language-detect
licence: BSD
William B. Cavnar and John M. Trenkle (1994).
N-Gram-Based Text Categorization.
In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175.
DBNary PHP Interface
url: https://github.com/FerreroJeremy/DBNary-PHP-Interface
licence: Creative Commons Attribution-ShareAlike 4.0 International

Licence

This dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

For more details on the licences of the tools and pre-existing collections used, refer to LICENCE.md.
