LSCluster documentation

 

This software is intended for large-scale sequence analysis, but can also be used for any text-based search. It uses pattern matching, scanning the sequence(s) either from left to right or from right to left. The software does not use heuristics or substitution tables (use BLAST or Clustal for that).

 

Quickstart

 

Example 1: Alignment of a set of sequences to the first sequence in a list

 

  1. Assemble all query sequences, for example in column 1 of a spreadsheet. You can add identifiers such as gene names or accession numbers in column 2, but this is not required.

 

MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK        Homo sapiens

MADQLTEEQIAEFKEAGSLFDKDGDGTITTKELGTVMRSVGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEILEAFQGFDKDGNGFISAAELRHMMTNLGEKLTDEEVDEMIREADIDGDGQINYEEFVKMMMSK      Achlya klebsiana

MADQLSEEQISEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPSQAELEDMINEVDADGNGTIDFPEFLTMMARKMRDTDSEEEIKEAFKVFDKDGNGYISAAELRHVMTNLGEKLTDSEVDEMIREADVDGDGQINYEEFVKMMLSK        Agaricus bisporus

MADQLSEEQISEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMRDTDSEEEIKEAFKVFDKDGNGYISAAELRHVMTNLGEKLTDNEVDEMIREADVDGDGQINYEEFVKMMLSK        Pleurotus ostreatus

MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK        Pongo abelii

MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLSLMARKMKDTDTEEELIEAFKVFDRDGNGFISAAELRHVMTNLGEKLTDEEVDEMIREADVDGDGQINYEEFVKMMMAK       Prorocentrum minimum

MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEILEAFKVFDKDGNGFISAAELRHIMTNLGEKLTDEEVDEMIREADIDGDGQINYEEFVKMMMSK          Pythium splendens

MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGDGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGFISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVTMMTSK        Pyuridae sp.

MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK        Xenopus laevis

 

  2. Paste this into the edit window #1 (make sure you have selected the ‘Edit’ button)
  3. Select Output: show alignment results and show graphical alignment results
  4. Select Process: Align to first in cluster
  5. Press Run
  6. The output window will contain the graphical alignment results…

 

Origin Xen     MADQLTEEQI AEFKEAFSLF DKDGDGTITT KELGTVMRSL GQNPTEAELQ  50

Homo sapie     |||||||||| |||||||||| |||||||||| |||||||||| ||||||||||  50

Achlya kle     |||||||||| ||||||G||| |||||||||| |||||||||V ||||||||||  50

Agaricus b     |||||S|||| S||||||||| |||||||||| |||||||||| ||||SQ|||E  50

Pleurotus      |||||S|||| S||||||||| |||||||||| |||||||||| ||||||||||  50

Pongo abel     |||||||||| |||||||||| |||||||||| |||||||||| ||||||||||  50

Prorocentr     |||||||||| |||||||||| |||||||||| |||||||||| ||||||||||  50

Pythium sp     |||||||||| |||||||||| |||||||||| |||||||||| ||||||||||  50

Pyuridae s     |||||||||| |||||||||| |||||||||| |||||||||| ||||||||||  50

Xenopus la     |||||||||| |||||||||| |||||||||| |||||||||| ||||||||||  50

Zea mays       ||||||D||| |||||||||| ||||||C||| |||||||||| ||||||||||  50

 

Origin Xen     DMINEVDADG NGTIDFPEFL TMMARKMKDT DSEEEIREAF RVFDKDGNGY  100

Homo sapie     |||||||||| |||||||||| |||||||||| |||||||||| ||||||||||  100

Achlya kle     |||||||||| |||||||||| |||||||||| ||||||L||| QG|||||||F  100

Agaricus b     |||||||||| |||||||||| |||||||R|| ||||||K||| K|||||||||  100

Pleurotus      |||||||||| |||||||||| |||||||R|| ||||||K||| K|||||||||  100

Pongo abel     |||||||||| |||||||||| |||||||||| |||||||||| ||||||||||  100

Prorocentr     |||||||||| |||||||||| SL|||||||| |T|||LI||| K|||R||||F  100

Pythium sp     |||||||||| |||||||||| |||||||||| ||||||L||| K||||||||F  100

Pyuridae s     |||||||||| D||||||||| |||||||||| |||||||||| |||||||||F  100

Xenopus la     |||||||||| |||||||||| |||||||||| |||||||||| ||||||||||  100

Zea mays       |||||||||| ||||||||L| NL|||||||| |||||LK||| ||||||Q||F  100

 

Origin Xen     ISAAELRHVM TNLGEKLTDE EVDEMIREAD IDGDGQVNYE EFVQMMTAK  149

Homo sapie     |||||||||| |||||||||| |||||||||| |||||||||| |||||||||  149

Achlya kle     ||||||||M| |||||||||| |||||||||| ||||||I||| |||K||MS|  149

Agaricus b     |||||||||| |||||||||S |||||||||| V|||||I||| |||K||LS|  149

Pleurotus      |||||||||| |||||||||N |||||||||| V|||||I||| |||K||LS|  149

Pongo abel     |||||||||| |||||||||| |||||||||| |||||||||| |||||||||  149

Prorocentr     |||||||||| |||||||||| |||||||||| V|||||I||| |||K||M||  149

Pythium sp     ||||||||I| |||||||||| |||||||||| ||||||I||| |||K||MS|  149

Pyuridae s     |||||||||| |||||||||| |||||||||| |||||||||| |||T|||S|  149

Xenopus la     |||||||||| |||||||||| |||||||||| |||||||||| |||||||||  149

Zea mays       |||||||||| |||||||||| |||||||||| V|||||I||| |||KV|M||  149

 

… and the deprecated list of sequence differences:

 

name

Cluster

Subcluster

Splice

Variant

Info

score

Homo sapiens

 

 

parent

100

Achlya klebsiana

 

Variant|17|17|G|-|PADB//Variant|40|40|V|-|PADB//Variant|87|87|L|-|PADB//Variant|91|92|QG|-|PADB//Variant|100|100|F|-|PADB//Variant|109|109|M|-|PADB//Variant|137|137|I|-|PADB//Variant|144|144|K|-|PADB//Variant|147|148|MS|-|PADB//

identical

92.61745

Agaricus bisporus

 

Variant|6|6|S|-|PADB//Variant|11|11|S|-|PADB//Variant|45|46|SQ|-|PADB//Variant|50|50|E|-|PADB//Variant|78|78|R|-|PADB//Variant|87|87|K|-|PADB//Variant|91|91|K|-|PADB//Variant|120|120|S|-|PADB//Variant|131|131|V|-|PADB//Variant|137|137|I|-|PADB//Variant|144|144|K|-|PADB//Variant|147|148|LS|-|PADB//

identical

90.60403

Pleurotus ostreatus

 

Variant|6|6|S|-|PADB//Variant|11|11|S|-|PADB//Variant|78|78|R|-|PADB//Variant|87|87|K|-|PADB//Variant|91|91|K|-|PADB//Variant|120|120|N|-|PADB//Variant|131|131|V|-|PADB//Variant|137|137|I|-|PADB//Variant|144|144|K|-|PADB//Variant|147|148|LS|-|PADB//

identical

92.61745

Pongo abelii

 

 

parent

100

Prorocentrum minimum

 

Variant|71|72|SL|-|PADB//Variant|82|82|T|-|PADB//Variant|86|87|LI|-|PADB//Variant|91|91|K|-|PADB//Variant|95|95|R|-|PADB//Variant|100|100|F|-|PADB//Variant|131|131|V|-|PADB//Variant|137|137|I|-|PADB//Variant|144|144|K|-|PADB//Variant|147|147|M|-|PADB//

identical

91.94631

Pythium splendens

 

Variant|87|87|L|-|PADB//Variant|91|91|K|-|PADB//Variant|100|100|F|-|PADB//Variant|109|109|I|-|PADB//Variant|137|137|I|-|PADB//Variant|144|144|K|-|PADB//Variant|147|148|MS|-|PADB//

identical

94.63087

Pyuridae sp.

 

Variant|61|61|D|-|PADB//Variant|100|100|F|-|PADB//Variant|144|144|T|-|PADB//Variant|148|148|S|-|PADB//

identical

97.31544

Xenopus laevis

 

 

parent

100

Zea mays

 

Variant|7|7|D|-|PADB//Variant|27|27|C|-|PADB//Variant|69|69|L|-|PADB//Variant|71|72|NL|-|PADB//Variant|86|87|LK|-|PADB//Variant|97|97|Q|-|PADB//Variant|100|100|F|-|PADB//Variant|131|131|V|-|PADB//Variant|137|137|I|-|PADB//Variant|144|145|KV|-|PADB//Variant|147|147|M|-|PADB//

identical

90.60403

 

The name column holds the identifier of the sequence as provided in the original data. The cluster and subcluster columns are empty in this example. The splice and variant columns list the differences compared to the first sequence. The info column denotes to what level the sequences aligned, and the score column is the percentage identity compared to the first sequence.
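The scores in this example can be checked by hand from the variant list. The following is a minimal sketch, not the software's own code: it assumes that every position covered by a Variant record counts as one mismatch, that the parent is 149 residues long (as in the example above), and that no other penalties apply. The function names are hypothetical.

```python
def mismatched_positions(variant_field: str) -> int:
    """Count the residues covered by 'Variant|from|to|...' records."""
    total = 0
    for record in variant_field.split("//"):
        if not record:
            continue  # skip the empty piece after the final '//'
        fields = record.split("|")
        if fields[0] != "Variant":
            continue  # this sketch only looks at Variant records
        start, end = int(fields[1]), int(fields[2])
        total += end - start + 1  # positions are 1-based and inclusive
    return total


def identity_percent(variant_field: str, parent_length: int) -> float:
    """Percentage of parent positions not covered by a Variant record."""
    mismatches = mismatched_positions(variant_field)
    return 100.0 * (parent_length - mismatches) / parent_length


# Variant list of the 'Achlya klebsiana' row in Example 1.
achlya = (
    "Variant|17|17|G|-|PADB//Variant|40|40|V|-|PADB//Variant|87|87|L|-|PADB//"
    "Variant|91|92|QG|-|PADB//Variant|100|100|F|-|PADB//Variant|109|109|M|-|PADB//"
    "Variant|137|137|I|-|PADB//Variant|144|144|K|-|PADB//Variant|147|148|MS|-|PADB//"
)
print(round(identity_percent(achlya, 149), 5))  # 92.61745
```

Under these assumptions the 11 mismatched positions out of 149 give the 92.61745 reported for this row.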

 

 

Example 2: Clustering of a set of proteins to generate clusters

 

  1. Assemble all query sequences in a spreadsheet (column 1) and add appropriate identifiers such as gene names or accession numbers (column 2)
  2. Save this as a text file
  3. Select your text file in the ‘Files #1’ window (make sure you have selected the ‘File’ button)
  4. Select Output: cluster results
  5. Select Process: Cluster, no alignment
  6. Press Run
  7. The output window will contain the cluster results

 

row number  seq. name                 cluster no.  alignment position
1           Homo sapiens              CLUST000008  0+25+50+75+100
2           Colletotrichum trifolii   CLUST000004  0+25+50+75+100
3           Ctenopharyngodon idella   CLUST000008  0+25+50+75+100
4           Danio rerio               CLUST000008  0+25+50+75+100
5           Daucus carota             CLUST000008  0+25+50+75+100
6           Capsicum annuum           CLUST000008  0+25+50+75
7           Drosophila melanogaster   CLUST000008  0+25+50+75+100
8           Electrophorus electricus  CLUST000008  0+25+50+100
9           Epinephelus akaara        CLUST000008  0+25+50+75
10          Euglena gracilis          CLUST000009  100

 

 

 

Example 3: Matrix clustering (all against all) using a score (identity) cut-off of 95%

 

  1. Assemble all query sequences in a spreadsheet (column 1) and add appropriate identifiers such as gene names or accession numbers (column 2)
  2. Save this as a text file
  3. Select your text file in the ‘Files #1’ window (make sure you have selected the ‘File’ button)
  4. Select output:
  5. Select Process: Align using Matrix scoring
  6. Press Run
  7. The output window will contain the aligned cluster results

 

name                       Cluster  Subcluster          Splice  Variant                                                                                             Info       score     p(arent) name
Homo sapiens                        MATRIXcluster00001          Variant|100|100|Y|-|PADB//Variant|144|144|Q|-|PADB//Variant|148|148|A|-|PADB//                      identical  97.98658  Aplysia californica
Achlya klebsiana                    MATRIXcluster00008                                                                                                              parent     100       Achlya klebsiana
Agaricus bisporus                   MATRIXcluster00004          Variant|45|46|SQ|-|PADB//Variant|50|50|E|-|PADB//Variant|120|120|S|-|PADB//                         identical  97.31544  Pleurotus ostreatus
Alexandrium fundyense               MATRIXcluster00003                                                                                                              parent     100       Alexandrium fundyense
Anas platyrhynchos                  MATRIXcluster00001          Variant|100|100|Y|-|PADB//Variant|144|144|Q|-|PADB//Variant|148|148|A|-|PADB//                      identical  97.98658  Aplysia californica
Aplysia californica                 MATRIXcluster00001                                                                                                              parent     100       Aplysia californica
Arabidopsis thaliana                MATRIXcluster00002          Variant|8|8|E|-|PADB//Variant|75|75|K|-|PADB//Variant|123|123|E|-|PADB//Variant|145|145|I|-|PADB//  identical  97.31544  Bryonia dioica
Blastocladiella emersonii           MATRIXcluster00009                                                                                                              parent     100       Blastocladiella emersonii
Bos taurus                          MATRIXcluster00001          Variant|100|100|Y|-|PADB//Variant|144|144|Q|-|PADB//Variant|148|148|A|-|PADB//                      identical  97.98658  Aplysia californica
Branchiostoma floridae              MATRIXcluster00001                                                                                                              parent     100       Aplysia californica
Branchiostoma lanceolatum           MATRIXcluster00001                                                                                                              parent     100       Aplysia californica
Bryonia dioica                      MATRIXcluster00002                                                                                                              parent     100       Bryonia dioica
Caenorhabditis elegans              MATRIXcluster00001          Variant|148|148|T|-|PADB//                                                                          identical  99.32886  Aplysia californica
Candida albicans                    MATRIXcluster00010                                                                                                              parent     100       Candida albicans
Capsicum annuum                     MATRIXcluster00002          Variant|140|140|D|-|PADB//                                                                          identical  99.32886  Bryonia dioica
Chlamydomonas reinhardtii           MATRIXcluster00011                                                                                                              parent     100       Chlamydomonas reinhardtii

 

Note that the cut-off value of 95% is set as default and can be changed in the Settings menu.

 

Input and Output formats

 

Input

Order in a spreadsheet, by input type

pairwise (use for: alignments)

            column 1: sequence 1
            column 2: tag 1, such as name (optional)
            column 3: sequence 2
            column 4: tag 2, such as name (optional)

list (use for: clustering and alignments)

            row 1, column 1: sequence 1
            row 1, column 2: tag 1, such as name (optional)
            row 1, column 3: cluster number to define groups (optional)
            row 2, column 1: sequence 2
            row 2, column 2: tag 2, such as name (optional)
            row 2, column 3: cluster number to define groups (optional)

FASTA/own (use for: clustering and alignments)

            first line: ">" followed by a tag 1, such as a gene name (optional)
            on next line(s): sequence, which needs to end with "//"

de-merging (use for: re-generating sequences using the difference list)

            column 1: sequence
            column 2: deprecated splice info (in PADB format, see below)

re-merging (use for: automatically shifting a deprecated list from one molecule to another one by first de-merging the original set, followed by alignment to the new sequence and outputting the new deprecated list)

            column 1: sequence 1
            column 2: sequence 2
            column 3: tag 2, such as name
            column 4: deprecated splice info from the sequence 2 (in PADB format, see below)
            column 5: deprecated variant info from the sequence 2 (in PADB format, see below) (optional)

 

Input

Examples (one row per line, columns in spreadsheet order)

pairwise

            SLFDKDGDGTITTKELGTVMRSLGQNPTEAE  Q863567  DEMIREADIDGDGQVNYEEFVTM  F786
            ...

list

            SLFDKDGDGTITTKELGTVMRSLGQNPTEAE  Q863567  1
            DEMIREADIDGDGQVNYEEFVTM          F786     5
            ...

FASTA/own

            >Seq1
            SLFDKDGDGTITTKELGTVMRSLGQNPTEAE//
            ...

de-merging

            SLFDKDGDGTITTKELGTVMRSLGQNPTEAE  Splice|6|8|missing|in isoform B|PADB//Splice|1|11|S|in isoform Truncated|PADB//Variant|6|6|S|-|PADB//
            ...

re-merging

            SLFDKDGDGTITTKELGTVMRSLGQNPTEAE  DEMIREADIDGDGQVNYEEFVTM  F786  Splice|6|8|missing|in isoform B|PADB//Splice|1|11|S|in isoform Truncated|PADB//  Variant|6|6|S|-|PADB//
            ...

 

 

 

Output

Order in a spreadsheet, by input type and process used

pairwise, process: alignment

            column 1: tag 1
            column 2: tag 2
            column 3: splice info (deprecated)
            column 4: variant info (deprecated)
            column 5: alignment text (parent/identical/too divergent)
            column 6: score (% identity)

list, process: alignment

            column 1: tag 1
            column 2: cluster number
            column 3: subcluster / matrix cluster number
            column 4: splice info (deprecated)
            column 5: variant info (deprecated)
            column 6: alignment text (parent/identical/too divergent)
            column 7: score (% identity)

list, process: clustering

            column 1: row number
            column 2: tag
            column 3: cluster number
            column 4: alignment position

FASTA/own, process: alignment

            same as list-alignment

de-merging, process: de-merge

            column 1: 1 demerged sequence per line, first sequence is always the original sequence
            column 2: tag

re-merging, process: re-merge

            same as list-alignment

 

 

 

 

 

 

 

Output

Examples (one row per line, columns in spreadsheet order)

pairwise, process: alignment

            Spinacia oleracea  Homo sapiens  Splice|6|8|missing|in isoform Homo sapiens|PADB//  Variant|8|8|E|-|PADB//  identical  99.789

list, process: alignment

            Trypanosoma cruzi  CLUST000009  MATRIXcluster00001  Splice|6|8|missing|in isoform B|PADB//  Variant|109|109|V|-|PADB//  identical  99.3288591

list, process: clustering

            5  GAPDH#12  CLUST000003  0+20+40+100

FASTA/own, process: alignment

            Trypanosoma cruzi  CLUST000009  MATRIXcluster00001  Splice|6|8|missing|in isoform B|PADB//  Variant|109|109|V|-|PADB//

de-merging, process: de-merge

            MARKMKDTDSEEELKEAFRVF  Truncated

re-merging, process: re-merge

            SLFDKDGDGTITTKELGTVMRSLG  Splice|18|23|missing|in isoform F786|PADB//Splice|18|23|missing|in isoform Variant-1|PADB//  Variant|6|6|S|-|PADB//

 

 

 

 

 

 

Deprecated Splice and Variant annotation

 

Splice record, fields in order:

            Splice
            aa position from
            aa position to
            replaced by what (i.e. replacement sequence, or 'missing')
            in isoform xxx (xxx is the name or tag of the sequence)
            software source (LSCluster sets this as PADB)
            //

Variant record, fields in order:

            Variant
            aa position from
            aa position to
            replaced by what (i.e. replacement sequence, or 'missing')
            - (dash)
            software source (LSCluster sets this as PADB)
            //

 

note: the '|' character is always used as a separator in the output (see examples above).
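For downstream processing, an annotation field can be split into individual records. This is a minimal sketch, assuming that records are terminated by '//' and that neither '|' nor '//' occurs inside a tag; the `Annotation` type and `parse_annotations` function are hypothetical names, not part of LSCluster.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    kind: str         # "Splice" or "Variant"
    start: int        # aa position from (1-based)
    end: int          # aa position to (inclusive)
    replacement: str  # replacement sequence, or "missing"
    isoform: str      # "in isoform xxx" for Splice, "-" for Variant
    source: str       # software source, e.g. "PADB"


def parse_annotations(field: str) -> List[Annotation]:
    """Split a Splice/Variant field into one Annotation per record."""
    records = []
    for raw in field.split("//"):
        raw = raw.strip()
        if not raw:
            continue  # empty piece after the last '//'
        kind, start, end, replacement, isoform, source = raw.split("|")
        records.append(
            Annotation(kind, int(start), int(end), replacement, isoform, source)
        )
    return records


example = "Splice|6|8|missing|in isoform B|PADB//Variant|6|6|S|-|PADB//"
for record in parse_annotations(example):
    print(record.kind, record.start, record.end, record.replacement)
```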

 

 

Main window settings

 

Input

 

File

allows you to choose the file(s) holding the sequence information without loading/editing the information.

Edit

allows you to enter sequences/text (or load a file via 'File Open') and edit them prior to analysis.

Left hand side (#1)

defines the parent (or primary, or master) file you want to use. It is obligatory to have either a defined file or entries in the edit window for the software to function.

Right hand side (#2)

defines the child (or secondary) file you want to use (optional). The sequences under #1 are aligned or clustered against this file.

 

 

Output

 

Alignment results

choose whether you want to show and/or save alignment results. These are the aggregated alignment differences between sequences and differ from the graphical output. This option only applies for alignment processes.

Cluster results

choose whether you want to show and/or save the cluster results. This option applies only to Cluster procedures.

Graphical alignment results

choose whether you want to show and/or save the graphical representation of the alignment(s). Only applies to alignment processes. Note that only the alignment of one cluster can be graphically displayed.

Destination folder

define the working directory where the output will be saved to (if 'save' has been selected).

 

 

Output files are named (if you choose to automatically save them):

            'ALIGNnnn.txt' for alignments.

            'CLUSTERnnn.txt' for clustering.

            'VISALIGNnnn.txt' for graphical alignments.

            'MATRIXdump_all_nnn.txt' for Matrix alignment (will save the entire Matrix).

            'DEMERGEDnnn.txt' for de-merged sequences.

            'REMERGEDnnn.txt' for re-merged sequences.

            ('nnn' denotes a running number)

 

Process

 

Find identical

searches all sequences and identifies identical sequences. Note that tags (or names) are required! The input restrictions are: 1 sequence per row, up to 100,000 rows.

Find best alignment

uses all-against-all aligning and scoring to decide which is the best match. It will only report the best match for each sequence. Note: this search requires a name (or tag) for each sequence, but cluster numbers will be ignored.

Find best alignment, no inserts

runs all-against-all alignments to find the best alignments, but if an alignment produces an insert whose length differs from the maximal Variant size (see settings), that alignment is discarded and the search continues. Note: names (or tags) are required.

Cluster, no alignment

searches 1 or 2 files to define clusters. This is a more rapid way to pre-define clusters prior to alignment analysis (if required).

It is intended to compare sequences used for group-clustering or domain pattern matching or cross-species comparison. It uses 'in-string' searching and reports matches by cluster-numbering.

Accepted file formats: tabulated text with up to 100,000 entries (rows) and up to 2 columns for the master file, and up to 2 columns for the look-up (child) file with no upper limit for the number of sequences/rows (if searching with 2 files). Files do NOT need a specific extension (name), as long as they are readable as text.

Note: This process will not align sequences!

Cluster and align to first

first clusters the sequences, then aligns all sequences within each cluster to the first sequence of that cluster.

Cluster and align to longest

first clusters the sequences, then aligns all sequences within each cluster to the longest sequence of that cluster.

Cluster and align to longest or use Matrix scoring

first clusters the sequences, then aligns all sequences within each cluster to the longest sequence of that cluster. If any of those alignments has a score below the pre-set cut-off value, the whole cluster is analysed by Matrix scoring.

Align paired

aligns paired sequences in the following format:

sequence 1 <tab> parent tag (name; optional) <tab> sequence 2 <tab> child tag (name; optional)

The input restrictions are: 2 sequences per row, with up to 100,000 rows, minimum 1, no headers.
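A row in this paired format can be read as sketched below. This is an illustration only: it assumes that an omitted tag leaves an empty cell (so the column positions stay fixed), and `parse_paired_row` is a hypothetical helper name.

```python
def parse_paired_row(line: str):
    """Split one tab-separated row into (seq1, tag1, seq2, tag2).

    Missing or empty tags are returned as None.
    """
    cells = line.rstrip("\n").split("\t")
    cells += [""] * (4 - len(cells))  # pad short rows to 4 cells
    seq1, tag1, seq2, tag2 = cells[:4]
    return (seq1, tag1 or None, seq2, tag2 or None)


row = "SLFDKDGDGTITTKELGTVMRSLGQNPTEAE\tQ863567\tDEMIREADIDGDGQVNYEEFVTM\tF786"
print(parse_paired_row(row))
```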

Align to first in cluster

aligns sequences against the first one within each cluster. The input restrictions are: 1 sequence per row, up to 100,000 rows, minimum 2 rows. Sequences can contain numbers, spaces or other non-letters (all of these will be removed). Note: cluster numbers are optional; if none are defined, all sequences are treated as belonging to one cluster.

Alternative input is FASTA format: '>' followed by the title, then on the next row the sequence, which can span several rows, followed by '//' or another chosen end-denominator.

Align to longest in cluster

aligns pre-clustered sequences: all sequences in one cluster are compared to the longest sequence in the same cluster. If no cluster numbers are included, all sequences are treated as if they belong to one cluster. Input limits: 1 sequence per row, up to 100,000 rows, minimum 2.

Align using Matrix scoring

uses all-against-all alignment and scoring. Based on the sum of the scores it decides which is the best match for all alignments, removes this sub-cluster, and repeats this process until no further pairings are possible.

Note that this requires a pre-defined score cut-off!

If cluster numbers are defined then matrix searching will only be performed for each individual cluster. Input limits are: 1 sequence per row, 1000 rows maximum per cluster, 100,000 rows maximum.

Align to longest or use Matrix scoring

will first align all sequences to the longest sequence within each cluster. If any sequence alignment within a cluster is below the score cut-off point, then it will re-align the sequences in this cluster using the Matrix scoring method.

De-merge/Re-merge

demerge sequences previously aligned using any of the methods above, thereby re-creating the original sequences. The software will automatically detect whether it needs only to de-merge and/or re-merge. The output file will be called 'DEMERGEDnnn.txt' or 'REMERGEDnnn.txt' with 'nnn' being the original file name.
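The idea behind de-merging can be sketched as follows. This is not the software's implementation: it assumes 1-based inclusive positions (as in the examples above), treats 'missing' as a deletion, builds one sequence per isoform named in the Splice records, and ignores Variant records. The function names are hypothetical.

```python
def apply_record(parent: str, start: int, end: int, replacement: str) -> str:
    """Replace the 1-based inclusive range start..end of parent."""
    insert = "" if replacement == "missing" else replacement
    return parent[:start - 1] + insert + parent[end:]


def demerge(parent: str, annotation_field: str):
    """Return (sequence, tag) pairs: the parent first, then one per isoform."""
    prefix = "in isoform "
    by_isoform = {}
    for raw in annotation_field.split("//"):
        if not raw:
            continue
        kind, start, end, replacement, isoform, _source = raw.split("|")
        if kind != "Splice":
            continue  # assumption: only Splice records define isoforms
        name = isoform[len(prefix):] if isoform.startswith(prefix) else isoform
        by_isoform.setdefault(name, []).append((int(start), int(end), replacement))

    results = [(parent, "parent")]
    for name, edits in by_isoform.items():
        sequence = parent
        # Apply edits from right to left so earlier positions stay valid.
        for start, end, replacement in sorted(edits, reverse=True):
            sequence = apply_record(sequence, start, end, replacement)
        results.append((sequence, name))
    return results


parent = "SLFDKDGDGTITTKELGTVMRSLGQNPTEAE"
field = ("Splice|6|8|missing|in isoform B|PADB//"
         "Splice|1|11|S|in isoform Truncated|PADB//Variant|6|6|S|-|PADB//")
for sequence, tag in demerge(parent, field):
    print(sequence, tag)
```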

 

 

Adjusting settings

 

 

Input Settings

 

File format

Pairwise/List/FASTA/own/Auto-detect defines the file format of the original data (see main help for data and file formats). If in doubt choose 'auto-detect' to let the software decide what format the data is in.

Sequences are

is used to force sequences to the same (UPPER) case since the software deals with characters in a case-sensitive manner (i.e. 'N' is not 'n').

FASTA/own identifier at the beginning of a sequence

(only used if the file format is set to FASTA/own) lets you choose the first character used to mark a new sequence entry (needs to be immediately BEFORE the name/tag on the line ABOVE the actual sequence). This can also be used to define a non-standard file format with a different marker at the beginning of the sequence.

FASTA/own identifier at the end of a sequence

(only used if the file format is set to FASTA/own) lets you choose the last characters used to mark the end of a sequence (can be IMMEDIATELY after the sequence or on a line below; the marker length can be up to 128 characters long). This can also be used to define a non-standard file format with a different marker at the end of the sequence.
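The two markers described above can be illustrated with a small reader. This is a sketch under assumptions: the start marker is the first character(s) of the tag line, the end marker is non-empty and stands at the end of the last sequence line or on a line of its own, and `parse_fasta_own` is a hypothetical name.

```python
def parse_fasta_own(text: str, start_marker: str = ">", end_marker: str = "//"):
    """Return a list of (tag, sequence) pairs from FASTA/own formatted text."""
    entries = []
    tag = ""
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(start_marker):
            # A new entry begins; the tag follows the start marker.
            tag = line[len(start_marker):].strip()
            chunks = []
        elif line.endswith(end_marker):
            # The end marker closes the current sequence.
            chunks.append(line[:-len(end_marker)])
            entries.append((tag, "".join(chunks)))
            tag, chunks = "", []
        else:
            chunks.append(line)  # sequence continues on further lines
    return entries


text = """>Seq1
SLFDKDGDGTITTKEL
GTVMRSLGQNPTEAE//
>Seq2
DEMIREADIDGDGQVNYEEFVTM
//
"""
print(parse_fasta_own(text))
```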

 

 

Output settings

 

Sort Splice and Variant forms

lets you decide whether you want to split 'Variant' and 'Splice' information into two columns.

Maximum size of 'Variant'

defines the maximum length of a stretch that is not identical in both sequences and will still be termed 'Variant'.

 

 

Alignment settings

The software will first look for identical blocks in both sequences (from left to right or vice versa), then uses the defined cluster size to fill existing gaps. Then it will extend the matches using one of the 'gap filling' methods.

 

Direction

sets the direction of the pattern matching (either left to right, or right to left).

Minimal identical size

defines the minimal stretch of characters that must be identical in the sequences to be compared (minimum of 5).

Minimal gap-filling size

defines the minimal stretch of characters which will be used to fill unmatched gaps within the sequence comparisons. It has to be smaller than the block size.

Fill single gaps

defines whether existing gaps should be filled by character hopping (e.g. NxMxP). The options are to check only the first 2 characters of unmatched gaps (this omits both the left and right ends), or to run a full check within the unmatched stretch (either omitting both ends or including them).

Score cut-off value

defines the minimal score at which an alignment is taken into consideration (i.e. its score is not set to zero).

Score cut-off re-adjustment query

(only applicable for Matrix searching) lets you choose whether you want to re-adjust the score (only possible to a higher value than set before) after the full all-against-all alignment. It will also allow you to save the current all-against-all matrix. This can be useful for manual checking of alignments and deciding which one might be a more sensible pairing than the one suggested by the software.

 

 

Graphical alignment settings

 

Include titles

include names/tags of the sequences in the graphical output

Include sequence positions

include sequence positions at the end of each line

Include a gap every n'th position

will include a space after every n characters

Numbers of characters per line

choose how many characters should be displayed per line

Substitute identical characters with

select the character which is used to show that two aligned sequences do match in this position

Substitute non-identical characters with

select the character which is used to show that two aligned sequences do not match in this position
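How these options shape the graphical output can be sketched as below. This is an illustration, not the software's code: it assumes two sequences of equal length that are already aligned without gaps, truncates titles to 10 characters as seen in the Quickstart output, and shows the child's own character at mismatches when no substitute is chosen. `render_alignment` and its parameters are hypothetical names.

```python
def render_alignment(parent, child, label, group=10, per_line=50,
                     match_char="|", mismatch_char=None):
    """Return the graphical alignment lines for one child sequence."""
    lines = []
    for start in range(0, len(parent), per_line):
        p = parent[start:start + per_line]
        c = child[start:start + per_line]
        # Identical positions get match_char; others the substitute,
        # or the child's residue when no substitute is set.
        symbols = "".join(
            match_char if a == b else (mismatch_char or b)
            for a, b in zip(p, c)
        )
        # Insert a gap (space) after every 'group' characters.
        grouped = " ".join(
            symbols[i:i + group] for i in range(0, len(symbols), group)
        )
        # Title, alignment symbols, then the position reached on this line.
        lines.append(f"{label[:10]:<15}{grouped}  {start + len(p)}")
    return lines


for line in render_alignment("MADQLTEEQIAEFKEAFSLF", "MADQLSEEQISEFKEAFSLF",
                             "Agaricus bisporus"):
    print(line)
# Agaricus b     |||||S|||| S|||||||||  20
```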

 

 

 

Score settings

 

Include/show alignment scores

includes or excludes alignment scores in the output.

Gap penalty

is used to define the value of the gap penalty (usually 1), and can be any number. A gap is defined as an unmatched character in the query sequence.

Maximum number of Variants

is used to define a cut-off where the number of variants (as set above) can not exceed this value per 100 characters in the parent sequence (e.g. '10' would mean that a maximum of 10 variants are allowed per 100 amino acids in a protein sequence). If it does, then the score will be set to zero.

Scoring method

is used to define other gap penalties used for calculating the alignment score. The choice here is to (1) add an extra penalty if the length of unmatched sequences is not the same, (2) to exclude unmatched stretches at both ends (i.e. to exclude those in the calculations), (3) to allow inserted sequences (where there is no gap in the original/parent sequence), and (4) to ignore all small unmatched stretches which are defined by the Variant size.

 

 

Cluster settings

These settings relate to the part of the software which is intended to compare sequences used for group-clustering or domain pattern matching. It uses 'in-string' searching and reports matches by cluster-numbering.

 

Use values as

use either a fixed value for the search sequence (absolute) or calculate it based on the sequence length (%)

Match length to identify cluster

defines the length of the sequence to look for, based on the value set previously

Minimal length of match

defines the minimum length (absolute value) which will be searched. This prevents searching for small sequences which might occur often and thereby lead to clustering of (nearly) everything into one cluster.

Maximal length of match

defines the maximum length (absolute value) which will be searched. This value is different from the actual length of the sequence. This option is only relevant if 'Use values as' is set to percentage (%).

Start position to search

defines where to start in the query sequence (in percent). 0 is the start, 100 is the end

Search in steps of

defines how many times/which parts of the query sequence should be searched

 

 

Examples for these last two options are:

start            0               25            25     0         0
step             25              25            50     50        100
position (in %)  0+25+50+75+100  25+50+75+100  25+75  0+50+100  0(all)
total searches   5               4             2      3         1
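The positions in these examples follow from a simple rule, sketched here under assumptions: positions run from the start value in increments of the step up to 100, and a step of 100 is treated as a single search (shown above as '0(all)'). The function name is hypothetical, and the step is assumed to be a positive integer.

```python
def search_positions(start: int, step: int) -> list:
    """Return the search positions (in %) for a start/step pair."""
    if step >= 100:
        return [start]  # assumption: one search covering the whole sequence
    return list(range(start, 101, step))


for start, step in [(0, 25), (25, 25), (25, 50), (0, 50), (0, 100)]:
    positions = search_positions(start, step)
    print(start, step, "+".join(map(str, positions)), len(positions))
```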

 

 

Sliding scale for small sequences

defines whether a sliding scale should be applied for small sequences. If a sequence is shorter than three times the search length, the new search length for this sequence will be a third of its own sequence length, with a minimum as defined in 'Minimal length of match' (relevant only if 'Use values as' is set to 'absolute').

Cluster number aggregation

defines whether clusters are reported as one single index per sequence, or as multiple indices dependent on the instances where matches were found

Cluster number format

defines whether results should be displayed as running cluster numbers
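The sliding scale for small sequences described above amounts to a short rule. The sketch below assumes integer division when taking a third of the sequence length; the function and parameter names are hypothetical.

```python
def effective_search_length(sequence_length: int, search_length: int,
                            minimal_length: int) -> int:
    """Apply the sliding scale for sequences shorter than 3x the search length."""
    if sequence_length < 3 * search_length:
        # A third of the sequence, but never below the configured minimum.
        return max(sequence_length // 3, minimal_length)
    return search_length


print(effective_search_length(100, 20, 5))  # 20, long enough, unchanged
print(effective_search_length(30, 20, 5))   # 10, a third of the sequence
print(effective_search_length(9, 20, 5))    # 5, the minimum applies
```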

 

 

Source code is available upon request by sending us an email.