sajib-hassan/solr-schema-dataimport-generator

 
 

# Solr Config Generator

These PHP scripts generate 3 config files:

  • generateSchema.php => schema.xml
  • generateDataimport.php => dataimport-config.xml
  • generatePHPAPIConfig.php => apiconfig.json (custom format)

All three are based on 1 central config.json, for use in the Solr core and the API.



## config.json

  1. Copy config/config.default.json to config/config.json, or create a new JSON file

The config.json consists of several required properties:

  • name : The name of the schema (not very important, but should match the Solr core name)
  • solrSchemaVersion : This should match the schema version currently in use, currently 1.5
  • fields : fields in core, detailed config
  • fieldTypes : field types in core, detailed config
  • dataSources : datasources for data-import, detailed config
  • entityQueries : entity definition & SQL query generator syntax, detailed config
  • searchOptions : search options per search action type, detailed config
  • dependencyVariables : dependency variables, for generating fields & queries, detailed config
{
	name: "outlets",
	solrSchemaVersion: "1.5",
	dependencyVariables: {...},
	dataSources: {...},
	entityQueries: {...},
	searchOptions: {...},
	fields: {...},
	fieldTypes: {...}
}
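
The list of required properties above can be checked mechanically before running the generators. The following is a minimal Python sketch, not the repository's PHP code; it assumes the real config.json is strict JSON (quoted keys), and `missing_keys` and the `sample` config are hypothetical names used only for illustration.

```python
import json

# Top-level keys that this README lists as required.
REQUIRED_KEYS = (
    "name",
    "solrSchemaVersion",
    "fields",
    "fieldTypes",
    "dataSources",
    "entityQueries",
    "searchOptions",
    "dependencyVariables",
)


def missing_keys(config):
    """Return the required top-level keys that are absent from config."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Hypothetical, deliberately incomplete config used only for illustration.
sample = json.loads('{"name": "outlets", "solrSchemaVersion": "1.5", "fields": {}}')
print(missing_keys(sample))
```

An empty list means every required property is present; the check says nothing about whether the nested values are valid.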

#### fields

fields consists of a list of field names, each with a config object. All parts of the config can be templated with dependencyVariables.

Each field config has a specific set of properties.

required:

  • dataSourceEntity : The name of the entity this field belongs to (see entityQueries)
  • dataSourceStatement : The SELECT part of the SQL query for this field. Can be a complex statement or just a field name (don't forget the table prefix to avoid ambiguity)
  • types : An array of fieldTypes for this field.
    • The field name is postfixed ([fieldname]_[key]) for each field type. The key is arbitrary, but the value should match an actual fieldType.
    • A special case is the key '_', which represents no postfix ([fieldname]) and should always be present
    • The values of the other types are copied with a copyField from the _ (source) field
  • searchOptions : Config for searching on this field (for apiconfig), see [field] searchOptions

optional:

***<= config.json***

fieldname_[lang] : {
	_comment: "Some demo field",
	uniqueKey: false,
	types: {
		_: "string",
		suggest: "text_[lang]_splitting",
		edge: "autocomplete_edge",
		ngram: "autocomplete_ngram",
		reverse: "text_general_reversed"
	},
	__dependency: {
		lang: "language_codes"
	},
	indexed: true,
	stored: true,
	required: true,
	multiValued: true,
    dataSourceStatement: "GROUP_CONCAT(DISTINCT kt_type.waarde_[lang_map] SEPARATOR '|')",
    dataSourceMultivaluedSeperator: "|",
	dataSourceStatementDependencyMapping: {
		lang: {
			en: "english",
			nl: "dutch",
			de: "deutsch"
		}
	},
	dataSourceEntity: "mainentity",
	searchOptions: {...}
}	
...
dependencyVariables: {
	language_codes: [
		"en",
		"nl",
		"de"
		]
	}

***=> schema.xml***

...
<!-- fieldname_en:  Some demo field -->
  <field name="fieldname_en" type="string" indexed="true" stored="true" required="true" multiValued="true"/>
  <field name="fieldname_en_suggest" type="text_en_splitting" indexed="true" stored="false" required="false" multiValued="true"/>
  <copyField source="fieldname_en" dest="fieldname_en_suggest"/>
  <field name="fieldname_en_edge" type="autocomplete_edge" indexed="true" stored="false" required="false" multiValued="true"/>
  <copyField source="fieldname_en" dest="fieldname_en_edge"/>
  <field name="fieldname_en_ngram" type="autocomplete_ngram" indexed="true" stored="false" required="false" multiValued="true"/>
  <copyField source="fieldname_en" dest="fieldname_en_ngram"/>
  <field name="fieldname_en_reverse" type="text_general_reversed" indexed="true" stored="false" required="false" multiValued="true"/>
  <copyField source="fieldname_en" dest="fieldname_en_reverse"/>
  ...
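
The naming rule for types (the '_' key is the source field, every other key becomes a postfix plus a copyField) can be sketched as follows. This is an illustrative Python sketch under the rules stated above, not the generator's actual PHP code; `expand_types` is a hypothetical helper name.

```python
def expand_types(fieldname, types):
    """Return (fields, copy_fields) derived from a field's `types` mapping."""
    if "_" not in types:
        raise ValueError("the '_' key (source field) must be present")
    fields = []       # (field name, fieldType) pairs
    copy_fields = []  # (source, dest) pairs
    for key, field_type in types.items():
        if key == "_":
            # '_' means no postfix: this is the source field itself.
            fields.append((fieldname, field_type))
        else:
            dest = f"{fieldname}_{key}"
            fields.append((dest, field_type))
            copy_fields.append((fieldname, dest))
    return fields, copy_fields


fields, copy_fields = expand_types(
    "fieldname_en",
    {"_": "string", "suggest": "text_en_splitting", "edge": "autocomplete_edge"},
)
for name, field_type in fields:
    print(name, field_type)
for source, dest in copy_fields:
    print(source, "->", dest)
```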

***=> dataimport-config.xml***

...
SELECT ... GROUP_CONCAT(DISTINCT kt_type.waarde_english SEPARATOR '|') as fieldname_en, ...
...

 <field column="fieldname_en" sourceColName="fieldname_en" splitBy="|"/>
 <field column="fieldname_nl" sourceColName="fieldname_nl" splitBy="|"/>
 <field column="fieldname_de" sourceColName="fieldname_de" splitBy="|"/>
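
How the templated field name and SQL statement turn into one aliased SELECT expression per language can be sketched like this. The sketch is in Python, handles only a single dependency per field, and assumes (from the example above) that a `[lang_map]` placeholder is filled from dataSourceStatementDependencyMapping while `[lang]` is filled with the raw value. `expand_field` is a hypothetical name, not a function from the repository.

```python
def expand_field(name, statement, dependency, mapping, variables):
    """Yield (column, 'statement as column') for each dependency value.

    Simplified: assumes exactly one entry in `dependency`.
    """
    placeholder, variable = next(iter(dependency.items()))
    for value in variables[variable]:
        # Fall back to the raw value when no mapping is configured (assumption).
        mapped = mapping.get(placeholder, {}).get(value, value)
        column = name.replace(f"[{placeholder}]", value)
        select = statement.replace(f"[{placeholder}_map]", mapped)
        select = select.replace(f"[{placeholder}]", value)
        yield column, f"{select} as {column}"


variables = {"language_codes": ["en", "nl", "de"]}
rows = list(expand_field(
    "fieldname_[lang]",
    "GROUP_CONCAT(DISTINCT kt_type.waarde_[lang_map] SEPARATOR '|')",
    {"lang": "language_codes"},
    {"lang": {"en": "english", "nl": "dutch", "de": "deutsch"}},
    variables,
))
for column, select in rows:
    print(select)
```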

##### [field] searchOptions

A field's search options are used in apiconfig.json for querying fields.

  • search : true/false, whether this field should be searched on (requires indexed: true); maps to Solr's qf parameter
  • return : true/false, whether this field should be returned (requires stored: true); maps to Solr's fl parameter
  • facet : whether the source field (not the copied fields) should be a facet; maps to Solr's facet parameters
  • actions : list of search actions in which this field should be considered. Has to correspond with keys in the global searchOptions
  • boost : base boost value for this field, which can be modified by each fieldType and by individual phrase queries; written as field^boost
  • fuzzy : Levenshtein distance to search for in the term, written as [term~fuzzy](https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Fuzzy%20Searches)
  • phraseField : true/false/int, whether the field should be used as a phrase field. If true, the factor is 1; if an int, it is used as a multiplier on boost; maps to Solr's pf parameter
  • phraseFieldBiGram : same as phraseField, for Solr's pf2 parameter
  • additional : array of custom additional actions. For now only stats.field & stats.facet are supported
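
The boost and fuzzy options above end up as small query-syntax fragments. A minimal Python sketch of those two fragments follows; the helper names are hypothetical and the sketch only shows the string forms (field^boost and term~fuzzy), not how the API assembles a full request.

```python
def boost_statement(field, boost):
    """Format a query-field entry in the field^boost form."""
    return f"{field}^{boost}"


def fuzzy_term(term, fuzzy):
    """Append ~fuzzy to a term; a false (or zero) fuzzy leaves the term as is."""
    return f"{term}~{fuzzy}" if fuzzy else term


# Field names and boosts taken from the apiconfig.json example in this README.
query_fields = " ".join(
    boost_statement(field, boost)
    for field, boost in [("fieldname_en", 10), ("fieldname_en_suggest", 8)]
)
print(query_fields)
print(fuzzy_term("outlet", 2), fuzzy_term("outlet", False))
```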

***<= config.json***

searchOptions: {
	search: true,
	return: true,
	actions: [
		"search",
		"autocomplete",
		"location"
	],
	facet: false,
	boost: 100,
	fuzzy: 4,
	phraseField: 4,
	phraseFieldBiGram: 3,
	phraseFieldTriGram: 2,
	additional: ["stats.field"]
}

***=> apiconfig.json***

fields{
...
,
      "fieldname_en": {
        "field": "fieldname_en",
        "fuzzy": 4,
        "main" : true
      },
      "fieldname_en_suggest": {
        "field": "fieldname_en_suggest",
        "fuzzy": false,
        "main" : false
      },
      "fieldname_en_edge": {
        "field": "fieldname_en_edge",
        "fuzzy": false,
        "main" : false
      },
      "fieldname_en_ngram": {
        "field": "fieldname_en_ngram",
        "fuzzy": false,
        "main" : false
      },
      "fieldname_en_reverse": {
        "field": "fieldname_en_reverse",
        "fuzzy": 2,
        "main" : false
      },    
,
queryFields: {             
...
	 "fieldname_en": {
        "field": "fieldname_en",
        "boost": 10,
        "statement": "fieldname_en^10"
      },
      "fieldname_en_suggest": {
        "field": "fieldname_en_suggest",
        "boost": 8,
        "statement": "fieldname_en_suggest^8"
      },
      ...
      

#### fieldTypes

FieldTypes are generated from the JSON into XML for schema.xml as is. Structure, fields and values are preserved. The exception to this rule is any fieldType property prefixed with __

  • __dependency : An optional key-value array that is used to generate multiple fieldTypes based on a dependency.
    • All field properties and names are searched for the string '[key]', which is linked to the name of the mapping declared in dependencyVariables
    • e.g. a fieldType named text_[lang]_splitting is mapped by a config {"lang"=>"language_codes"} to the array defined in dependencyVariables under the key "language_codes", generating a fieldType for each defined language code
  • __searchBoostFactor : float, a factor applied to the boost property of a field's searchOptions
  • __searchBoostValue : float, a value replacing the boost property of a field's searchOptions (if not false)
  • __searchFuzzyFactor : float, a factor applied to the fuzzy property of a field's searchOptions
  • __searchFuzzyValue : float, a value replacing the fuzzy property of a field's searchOptions (if not false)
  • __searchPhraseField : true/false, whether a field of this type can be considered as a phrase field
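
One possible reading of how the __search* modifiers combine with a field's own searchOptions is sketched below in Python. Two things are assumptions, not confirmed by this README: a Value key that is present replaces the field's property (so false switches fuzzy off, as the suggest field in the apiconfig.json example suggests), and a Value is applied before a Factor when both are set. `apply_type_modifiers` is a hypothetical name.

```python
def apply_type_modifiers(options, field_type):
    """Return the effective boost and fuzzy for one field/fieldType pair."""
    result = {"boost": options.get("boost"), "fuzzy": options.get("fuzzy")}
    for prop, prefix in (("boost", "__searchBoost"), ("fuzzy", "__searchFuzzy")):
        # Assumption: a present *Value key replaces the field's own setting.
        if f"{prefix}Value" in field_type:
            result[prop] = field_type[f"{prefix}Value"]
        factor = field_type.get(f"{prefix}Factor")
        current = result[prop]
        # A factor is only applied to real numbers (bool is excluded on purpose).
        is_number = isinstance(current, (int, float)) and not isinstance(current, bool)
        if factor is not None and is_number:
            result[prop] = current * factor
    return result


effective = apply_type_modifiers(
    {"boost": 10, "fuzzy": 4},
    {"__searchBoostFactor": 0.8, "__searchFuzzyValue": False},
)
print(effective)
```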

***<= config.json***

...
boolean: {
	class: "solr.BoolField",
	sortMissingLast: true,
	__searchPhraseField: false
},
...
text_[lang]_splitting: {
	class: "solr.TextField",
	positionIncrementGap: 100,
	autoGeneratePhraseQueries: true,
	__dependency: {
		lang: "language_codes"
	},
	analyzer: {
		index: {
			charFilter: [{
				class: "solr.MappingCharFilterFactory",
				mapping: "mapping-ISOLatin1Accent.txt"
			}],
			tokenizer: [{
				class: "solr.WhitespaceTokenizerFactory"
			}],
			filter: [{
				class: "solr.StopFilterFactory",
				ignoreCase: true,
				words: "lang/stopwords_[lang].txt"
			},{
				class: "solr.WordDelimiterFilterFactory",
				generateWordParts: 1,
				generateNumberParts: 1,
				catenateWords: 1,
				catenateNumbers: 1,
				catenateAll: 0,
				splitOnCaseChange: 1
			},{
				class: "solr.LowerCaseFilterFactory"
			},{
				class: "solr.PorterStemFilterFactory"
			}]},
		query: {
			charFilter: [{
				class: "solr.MappingCharFilterFactory",
				mapping: "mapping-ISOLatin1Accent.txt"
			}],
			tokenizer: [{
				class: "solr.WhitespaceTokenizerFactory"
			}],
			filter: [{
				class: "solr.SynonymFilterFactory",
				synonyms: "synonyms.txt",
				ignoreCase: true,
				expand: true
			},{
				class: "solr.StopFilterFactory",
				ignoreCase: true,
				words: "lang/stopwords_[lang].txt"
			},{
				class: "solr.WordDelimiterFilterFactory",
				generateWordParts: 1,
				generateNumberParts: 1,
				catenateWords: 0,
				catenateNumbers: 0,
				catenateAll: 0,
				splitOnCaseChange: 1
			},{
				class: "solr.LowerCaseFilterFactory"
			},{
				class: "solr.PorterStemFilterFactory"
			}]
		}
	},
	__searchBoostFactor: 0.8,
	__searchFuzzyValue: false
},
...

***=> schema.xml***

<!-- boolean -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>

<!-- text_en_splitting -->
  <fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>
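
The JSON-to-XML mapping shown above (plain properties become attributes, analyzer parts become child elements, and properties prefixed with __ are dropped) can be sketched with Python's standard xml.etree.ElementTree. This is an illustration of the mapping under those assumptions, not the repository's PHP generator; `field_type_to_xml` and `xml_value` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def xml_value(value):
    """Render JSON scalars the way they appear in schema.xml attributes."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def field_type_to_xml(name, config):
    """Build a <fieldType> element, skipping generator-only '__' properties."""
    element = ET.Element("fieldType", {"name": name})
    for key, value in config.items():
        if key.startswith("__"):
            continue
        if key == "analyzer":
            # analyzer: {index: {...}, query: {...}} -> <analyzer type="...">
            for analyzer_type, parts in value.items():
                analyzer = ET.SubElement(element, "analyzer", {"type": analyzer_type})
                for tag, entries in parts.items():
                    for entry in entries:
                        attributes = {k: xml_value(v) for k, v in entry.items()}
                        ET.SubElement(analyzer, tag, attributes)
        else:
            element.set(key, xml_value(value))
    return element


boolean = field_type_to_xml(
    "boolean",
    {"class": "solr.BoolField", "sortMissingLast": True, "__searchPhraseField": False},
)
print(ET.tostring(boolean, encoding="unicode"))
```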
  

#### dataSources

This section covers the configuration of data sources. solrconfig.xml has to be configured to use these data sources. So far only JDBC data sources have been tested, but in theory other kinds should work as well.

Each (JDBC) data source is identified by a key (for reference in entityQueries), and all of its properties are written as attributes in dataimport-config.xml.

e.g. for JDBC/MySQL (the only tested case)

***<= config.json***

db1: {
	type: "JdbcDataSource",
	driver: "com.mysql.jdbc.Driver",
	url: "jdbc:mysql://[host]/[db]",
	user: "[user]",
	password: "[password]",
	batchSize: -1
},

***=> dataimport-config.xml***

<dataSource name="db1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://host.com/db" user="uname" password="pwd" batchSize="-1"/>
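
The key-plus-attributes mapping for a data source is simple enough to sketch in a few lines of Python with xml.etree.ElementTree. `data_source_to_xml` is a hypothetical helper, and the connection values are the placeholder values from the example above, not real credentials.

```python
import xml.etree.ElementTree as ET


def data_source_to_xml(key, properties):
    """Build a <dataSource> element: the key becomes name, the rest attributes."""
    attributes = {"name": key}
    attributes.update({prop: str(value) for prop, value in properties.items()})
    return ET.Element("dataSource", attributes)


db1 = data_source_to_xml(
    "db1",
    {
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",
        "url": "jdbc:mysql://host.com/db",
        "user": "uname",
        "password": "pwd",
        "batchSize": -1,
    },
)
print(ET.tostring(db1, encoding="unicode"))
```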

#### entityQueries

Solr supports multiple entities in dataimport-config.xml that are saved to 1 record in the core. It is advised to do as much as possible with JOINs in your query, since performance is better overall, but exceptions can be made.

entityQueries consists of key-value pairs, where the key is the name/id of the entity and the value is its configuration.

EntityQuery configs have these properties:

  • dataSource : the key of the dataSource defined in dataSources. One dataSource can have multiple entities, but an entity only has 1 dataSource
  • transformers : List of Solr transformers to use
  • tables : List of custom sets of FROM / JOIN strings. Many queries use the same (groups of) tables, so you can combine them any way you like by referencing the table group label instead of repeating FROMs and JOINs over and over
    • The tables property is a list of key-value pairs, where the key is used as identifier and the value is a list of strings containing FROM & JOIN clauses
  • queries : The (optional) queries an entity can use according to the Solr documentation: query, deltaQuery, parentDeltaQuery, deletedPkQuery, deltaImportQuery. Each is an optional property of the queries object and defines which tables, filters and optionally fields should be queried
    • filter : a string with the WHERE clause (optionally also GROUP BY/HAVING) for this specific query. Use placeholders where necessary
    • tables : a list of the ids of the table groups to use
    • fields : (optional) By default all fields applicable to this entity are selected (dataSourceEntity property in fields). If this property is set to a list of field names, only those are queried
  • parentEntity : Optional. This property enables entity nesting (see the examples)
  • pk : If a parentEntity is defined, this forces the primary key
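
How a query is assembled from table groups, a filter and an optional field list can be sketched as follows. This is a simplified Python illustration of the properties described above, not the generator's PHP code; `build_query` and the `selects` mapping (field name to SELECT expression, standing in for each field's dataSourceStatement) are hypothetical.

```python
def build_query(entity, query_name, selects):
    """Assemble one SQL query from table groups, a filter and a field list."""
    query = entity["queries"][query_name]
    # Without an explicit `fields` list, every field of the entity is selected.
    names = query.get("fields", list(selects))
    columns = ", ".join(f"{selects[name]} as `{name}`" for name in names)
    tables = " ".join(
        line for group in query["tables"] for line in entity["tables"][group]
    )
    return f"SELECT {columns} {tables} {query['filter']};"


entity = {
    "tables": {"main": ["FROM `maintable` t1"]},
    "queries": {
        "deltaQuery": {
            "filter": "WHERE t1.`updatedon` > '${dih.last_index_time}'",
            "tables": ["main"],
            "fields": ["id"],
        }
    },
}
selects = {"id": "t1.`klantnr`", "status": "t1.`status`"}
print(build_query(entity, "deltaQuery", selects))
```

When the result is written into an XML attribute, characters such as > still need escaping (the example output in this section shows &gt;).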

***<= config.json***

entityQueries: {
	mainentity: {
		dataSource: "db1",
		transformers: [
			"RegexTransformer"
			],
		tables: {
			main: [
				"FROM `maintable` t1"
			],
			all: [
				"LEFT JOIN `subtable` subt_land ON t1.`land`=subt_land.`code`",
				"LEFT JOIN `subtable` subt_provincie ON t1.`provincie`=subt_provincie.`code`",
				"LEFT JOIN `subtable` subt_moeder ON t1.`scode`=subt_moeder.`code`",
				"LEFT JOIN `subtable` subt_scode ON t1.`scode`=subt_scode.`code`",
				"LEFT JOIN `subtable` subt_bcode ON t1.`bcode`=subt_bcode.`code`"
			]
		},
		queries: {
			query: {
				filter: "WHERE t1.`status` > 0 GROUP BY t1.`klantnr`",
				tables: [
					"main",
					"all"
				]
			},
			deltaImportQuery: {
				filter: "WHERE t1.`klantnr`='${dih.delta.id}' GROUP BY t1.`klantnr`",
				tables: [
					"main",
					"all"
				]
			},
			deltaQuery: {
				filter: "WHERE t1.`updatedon` > '${dih.last_index_time}'",
				tables: [
					"main"
				],
				fields: [
					"id"
				]
			}
		}
	},
	subentity: {
		dataSource: "db2",
		parentEntity: "mainentity",
		pk: "klantnr",
		tables: {
			main: [
				"FROM `t1_surrounding` t1s"
			]
		},
		queries: {
			query: {
				filter: "WHERE t1s.`klantnr`= '${mainentity.id}'",
				tables: [
					"main"	
				]
			}
		}
	}
	...

***=> dataimport-config.xml***

  <document>
    <entity name="mainentity" dataSource="db1" transformer="RegexTransformer" pk="id" query="SELECT 	
    	t1.`klantnr` as `id`,
    	t1.`status` as `status`,
    	...
    	FROM `maintable` t1 
    	LEFT JOIN `subtable` subt_land ON t1.`land`=subt_land.`code`
    	...
		WHERE t1.`status` &gt; 0 GROUP BY t1.`klantnr`;" 
	deltaImportQuery="SELECT 
		t1.`klantnr` as `id`,
    	t1.`status` as `status`,
    	...
    	FROM `maintable` t1 
    	LEFT JOIN `subtable` subt_land ON t1.`land`=subt_land.`code`
    	...
		WHERE t1.`klantnr`='${dih.delta.id}' GROUP BY t1.`klantnr`;
	" deltaQuery="SELECT 
		t1.`klantnr` as `id` 
		FROM `maintable` t1 
		WHERE t1.`updatedon` &gt; '${dih.last_index_time}';">
		<field ... />
		<field ... />
		...
		<entity name="subentity" dataSource="db2" transformer="RegexTransformer" pk="klantnr" query="SELECT...
	
												

#### searchOptions

This section defines global search options and the possible search actions to be performed by the API.

Each property of searchOptions is a key-value-pair, where the key is the identifier for the search action to be performed/handled by the api.

Every action has these attributes

***<= config.json***

searchOptions: {
	search: {
		type: "EDisMax",
		facets: true,
		options: {
			lowercaseOperators: true,
			stopwords: true,
			indent: true,
			stats: true
		},
		minimumMatch: "2<-1 5<80%",
		queryPhraseSlop: 2,
		phraseSlop: 2,
		phraseBiGramSlop: 2,
		phraseTriGramSlop: 2
},

***=> apiconfig.json***

 "search": {
    "type": "EDisMax",
    "facets": true,
    "options": {
      "lowercaseOperators": true,
      "stopwords": true,
      "indent": true,
      "stats": true,
      "stats.field": [
        "latitude",
        "longitude"
      ]
    },
    "minimumMatch": "2<-1 5<80%",
    "queryPhraseSlop": 2,
    "phraseSlop": 2,
    "phraseBiGramSlop": 2,
    "phraseTriGramSlop": 2,
    ...

#### dependencyVariables

The dependency variables are a construct to create many similar fields without copy/pasting. The idea is that a single field definition can be duplicated, for example to store different translations.

Every dependency variable is crossed with every other to create unique combinations, so 3 dependencies with 5, 8 and 3 values respectively will generate 120 base fields (multiplied by the number of subtypes) for a field that depends on all 3 dependencies.

In theory, one field can reuse the same dependency to create cross products of a single dependency.

dependencyVariables: {
	language_codes: [
		"en",
		"nl",
		"de",
		"fr",
		"es",
		"it",
		"pl"
	],
	"colors": [
		"red",
		"blue",
		"green"
	]
},
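
The cross product described in this section can be sketched with itertools.product. This is a Python illustration only; `expand_name` and the field name label_[lang]_[color] are hypothetical, and the sketch expands names only, not the rest of a field's config.

```python
from itertools import product


def expand_name(template, dependency, variables):
    """Return one concrete name per combination of dependency values."""
    placeholders = list(dependency)
    value_lists = [variables[dependency[p]] for p in placeholders]
    names = []
    for combination in product(*value_lists):
        name = template
        for placeholder, value in zip(placeholders, combination):
            name = name.replace(f"[{placeholder}]", value)
        names.append(name)
    return names


variables = {
    "language_codes": ["en", "nl", "de"],
    "colors": ["red", "blue", "green"],
}
names = expand_name(
    "label_[lang]_[color]",
    {"lang": "language_codes", "color": "colors"},
    variables,
)
print(len(names))  # 3 languages * 3 colors = 9
print(names[:3])
```

The same arithmetic gives the 120 base fields mentioned above: 5 * 8 * 3 combinations for three dependencies of those sizes.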
