Types of Upload Files
The DURel system uses three specific types of files: Uses, Instances, and Annotations files. Please ensure all these files are in .tsv format, which means they should be tab-separated tables, and encoded in UTF-8.
Uses Files
Uses files contain the study data itself, i.e., the uses that you want to study and meta information about these uses. A Uses file consists of the following columns:
- lemma: This is the base form of a word that you would find in a dictionary. Lemmas should not contain spaces, commas, or hyphens. In case of a multi-word lemma, use an underscore (_). For instance, "distant_past" would be a proper representation of the multiword lemma "distant past". Important: Each file should contain only one lemma!
- pos: The part of speech, i.e., the grammatical role that the word plays in the sentence. "Noun" is an example of such a role, but you are not restricted to a particular tag set and can use any one you like. The same formatting restrictions as for the lemma apply to pos.
- date: Refers to the original date of the use. For instance, if a use is from a book published in 1953, the date should be represented as 1953. There are no restrictions with respect to the format, but you will only be able to filter by date if the format of this column is "yyyy", "yyyy-MM", "yyyy/MM", "yyyy-MM-dd", or "yyyy/MM/dd".
- grouping: An optional classification category for the word. You can leverage this for different operations like filtering, sampling, and visualization during your studies.
- identifier: This is a unique ID serving as the identifier for each record. Identifiers have to be unique within each project. To allow repeated uploads of the same project, DURel will assign an additional internal identifier on upload.
- description: Contains any additional details pertaining to the use instance.
- context: The use that will be annotated in the study.
- indexes_target_token: The character indexes of the target token, i.e., the use of the lemma, in the provided context. These indexes are used to highlight the token during the study. They should follow the pattern n:m, where n is the position of the first character of the token and m is the position just after its last character, counting from 0. For example, in the context "He ran a great distance.", the indexes_target_token would be 15:23 for the token "distance".
- indexes_target_sentence: The character indexes of the sentence that contains the target token, within the provided context. The same format and restrictions as for indexes_target_token apply here.
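The date formats that allow filtering can be checked before upload. The following is a minimal sketch, not part of DURel: the function name is hypothetical, and the check only tests the shape of the value (four-digit year, two-digit month and day, one consistent separator), not whether the month or day is a real calendar value.

```python
import re

# Shapes corresponding to "yyyy", "yyyy-MM", "yyyy/MM", "yyyy-MM-dd", "yyyy/MM/dd".
# The backreference \1 forces the same separator to be used throughout.
FILTERABLE_DATE = re.compile(r"[0-9]{4}(?:([-/])[0-9]{2}(?:\1[0-9]{2})?)?")


def is_filterable_date(value):
    """Return True if value has one of the five filterable shapes (hypothetical helper)."""
    return FILTERABLE_DATE.fullmatch(value) is not None


for candidate in ["1953", "1953-05", "1953/05/17", "1953-05/17", "May 1953"]:
    print(candidate, is_filterable_date(candidate))
```

Values that fail this check can still be uploaded, since the date column itself is unrestricted; they just cannot be used for date filtering.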
To help you get started with creating your own Uses files, a sample file is provided here.
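As an illustration of the columns above, the sketch below builds a one-row Uses table in memory and computes the n:m indexes from the context. It rests on several assumptions that are not confirmed here: that the file starts with a header row, that the columns appear in the order listed above, and that indexes count characters from 0 with an exclusive end (as in Python slicing). The identifier "use-001", the grouping value, and the helper names are made up for the example.

```python
USES_COLUMNS = [
    "lemma", "pos", "date", "grouping", "identifier", "description",
    "context", "indexes_target_token", "indexes_target_sentence",
]


def span_of(text, part):
    """Return 'n:m' for the first occurrence of part in text.

    Assumption: 0-based start, exclusive end (Python slice convention).
    """
    start = text.find(part)
    if start < 0:
        raise ValueError(f"{part!r} not found in context")
    return f"{start}:{start + len(part)}"


def uses_tsv(rows):
    """Serialize rows (dicts keyed by USES_COLUMNS) as tab-separated text."""
    lines = ["\t".join(USES_COLUMNS)]  # assumed header row
    for row in rows:
        fields = [row[column] for column in USES_COLUMNS]
        for field in fields:
            # A tab or line break inside a field would corrupt the table.
            if "\t" in field or "\n" in field:
                raise ValueError(f"field contains a tab or line break: {field!r}")
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


context = "The distant past is hard to imagine."
row = {
    "lemma": "distant_past",   # underscore for the multi-word lemma
    "pos": "Noun",
    "date": "1953",
    "grouping": "1",           # made-up category
    "identifier": "use-001",   # made-up identifier
    "description": "",
    "context": context,
    "indexes_target_token": span_of(context, "distant past"),
    "indexes_target_sentence": span_of(context, context),
}
tsv_text = uses_tsv([row])

# Write with an explicit encoding so the file is UTF-8 on every platform:
# with open("uses.tsv", "w", encoding="utf-8", newline="") as handle:
#     handle.write(tsv_text)
```

Comparing the result against the sample file is the safest way to confirm the header and index conventions before uploading real data.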
Instances Files
Instances files define the pairs of uses that will be annotated in the study. These files must contain the following columns:
- lemma: Similar to the Uses files, this is the base form of a word. Lemmas should not contain spaces, commas, or hyphens. Use an underscore (_) for multi-word lemmas. Important: Each file should contain only one lemma!
- identifier1: The unique identifier of the first use in a pair that you want to compare within the study. This identifier should match the corresponding use's identifier provided in the Uses files.
- identifier2: The unique identifier of the second use in a pair for comparison. Ensure this identifier matches the one provided for the respective use in the Uses files.
While creating the Instances file, make sure the mentioned columns are in the correct order. There is a sample file here.
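One way to produce such a file is to pair up identifiers taken from a Uses file. The sketch below pairs every use with every other use; this is only one possible pairing strategy, and the helper name, the header row, and the identifiers are assumptions made for the example.

```python
from itertools import combinations

INSTANCES_COLUMNS = ["lemma", "identifier1", "identifier2"]


def instances_tsv(lemma, identifiers):
    """Build tab-separated Instances text with all unordered pairs of identifiers."""
    if len(set(identifiers)) != len(identifiers):
        raise ValueError("identifiers must be unique")
    lines = ["\t".join(INSTANCES_COLUMNS)]  # assumed header row
    for first, second in combinations(identifiers, 2):
        lines.append("\t".join([lemma, first, second]))
    return "\n".join(lines) + "\n"


# Made-up identifiers; in practice they come from the identifier column of the Uses file.
instances_text = instances_tsv("distant_past", ["use-001", "use-002", "use-003"])
print(instances_text)
```

The number of pairs grows quadratically with the number of uses, so for larger Uses files a sample of the pairs is usually more practical than the full set.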
Annotations Files
Lastly, Annotations files list the annotations made by users, or gold annotations. Annotations files have the following structure:
- identifier1: The unique identifier of the first use in an annotated pair. This identifier should match the corresponding use's identifier provided in the Uses files.
- identifier2: The unique identifier of the second use in an annotated pair. This identifier should match the corresponding use's identifier provided in the Uses files.
- annotator: The username of the annotator who provided the annotation for the respective pair of uses. Note: Annotators must be added to the system before uploading a project.
- judgment: The judgment or decision given by the annotator for the pair of uses. This value must be parseable as a float. Decimals must be denoted with a period (.); whole numbers and negative numbers are acceptable.
- comment: Any comment or additional information provided by the annotator regarding the judgment or annotation.
- lemma: The base form of the word, exactly the same as in the Uses and Instances files.
- timestamp: The time and date when the annotation was made.
When preparing an Annotations file, ensure your file contains these columns in the right order for a seamless upload. A sample file is available here.
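A quick local check can catch the most common problems before upload. The sketch below is not DURel's own validation: it assumes a header row with the columns in the order listed above, and the function name, usernames, identifiers, and timestamp format are invented for the example. Note that Python's float() also accepts values such as "nan" or "1e3", so it is slightly more permissive than a plain decimal check.

```python
ANNOTATIONS_COLUMNS = [
    "identifier1", "identifier2", "annotator", "judgment",
    "comment", "lemma", "timestamp",
]


def check_annotations(tsv_text, known_identifiers):
    """Return a list of human-readable problems found in Annotations text."""
    lines = tsv_text.splitlines()
    if not lines:
        return ["file is empty"]
    problems = []
    if lines[0].split("\t") != ANNOTATIONS_COLUMNS:
        problems.append("line 1: unexpected header or column order")
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != len(ANNOTATIONS_COLUMNS):
            problems.append(f"line {number}: expected {len(ANNOTATIONS_COLUMNS)} columns")
            continue
        record = dict(zip(ANNOTATIONS_COLUMNS, fields))
        for key in ("identifier1", "identifier2"):
            if record[key] not in known_identifiers:
                problems.append(f"line {number}: {key} is not in the Uses file")
        try:
            float(record["judgment"])  # "3.5" parses, "3,5" does not
        except ValueError:
            problems.append(f"line {number}: judgment is not a float")
    return problems


header = "\t".join(ANNOTATIONS_COLUMNS)
good = "\t".join(["use-001", "use-002", "annotator1", "3.5", "", "distant_past", "2024-01-15 10:30:00"])
bad = "\t".join(["use-001", "use-002", "annotator1", "3,5", "", "distant_past", "2024-01-15 10:30:00"])

print(check_annotations(header + "\n" + good + "\n", {"use-001", "use-002"}))
print(check_annotations(header + "\n" + bad + "\n", {"use-001", "use-002"}))
```

The check cannot verify that annotators already exist in the system; that still has to be done in DURel before uploading the project.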