Grades responses on five attributes helpfulness, correctness, coherence, complexity and verbosity.
The Nemotron-4-340B-Reward is a multi-dimensional Reward Model that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs; Nemotron-4-340B-Reward consists of the Nemotron-4-340B-Base model and a linear layer that converts the final layer representation of the end-of-response token into five scalar values, each corresponding to a HelpSteer2 attribute.
It supports a context length of up to 4,096 tokens.
Given a conversation with multiple turns between user and assistant, it rates the following attributes (typically between 0 and 4) for every assistant turn.
Nonetheless, if you are only interested in using it as a conventional reward model that outputs a singular scalar, we recommend using the weights [0, 0, 0, 0, 0.3, 0.74, 0.46, 0.47, -0.33]
to do elementwise multiplication with the predicted attributes (which outputs 9 float values in line with Llama2-13B-SteerLM-RM but the first four are not trained or used)
Under the NVIDIA Open Model License, NVIDIA confirms:
Nemotron-4 340B Reward Model is a pretrained Reward Model intended for use in English Synthetic Data Generation and English Reinforcement Learning from AI Feedback (RLAIF).
Nemotron-4 340B-Reward can be used in the alignment stage to align pretrained models to human preferences. It can also be used in cases like Reward-Model-as-a-Judge.
Model Developer: NVIDIA
Model Input: Text only
Input Format: String
Model Output: Scalar Values (List of 9 Floats)
Output Format: Float
Model Dates: Nemotron-4-340B-Reward was trained between December 2023 and May 2024
Data Freshness: The pretraining data has a cutoff of June 2023
BF16 Inference:
You can use the model with NeMo Aligner following SteerLM training user guide.
python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \ rm_model_file=Nemotron-4-340B-Reward \ trainer.num_nodes=2 \ trainer.devices=8 \ ++model.tensor_model_parallel_size=8 \ ++model.pipeline_model_parallel_size=2 \ inference.micro_batch_size=2 \ inference.port=1424
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \ --input-file=data/oasst/train.jsonl \ --output-file=data/oasst/train_labeled.jsonl \ --port=1424
{ "conversations": [ {"value": <user_turn_1>, "from": "User", "label": None}, {"value": <assistant_turn_1>, "from": "Assistant", "label": <formatted_label_1>}, {"value": <user_turn_2>, "from": "User", "label": None}, {"value": <assistant_turn_2>, "from": "Assistant", "label": <formatted_label_2>}, ], "mask": "User" }
Ideally, each <formatted_label_n>
refers to the ground truth label for the assistant turn but if they are not available, we can also use helpfulness:4,correctness:4,coherence:4,complexity:2,verbosity:2
(i.e. defaulting to moderate complexity and verbosity, adjust if needed. or simply helpfulness:-1
. It must not be None
or an empty string.
Nemotron-4-340B-Reward is extended from Nemotron-4-340B-Base with an additional linear layer. It was trained with a global batch-size of 128.
Architecture Type: Transformer Decoder (auto-regressive language model)
Nemotron-4-340B-Reward is a pretrained Reward Model intended for use in English Synthetic Data Generation and English Reinforcement Learning from AI Feedback (RLAIF).
Nemotron-4-340B-Reward was trained for 2 epochs using the NVIDIA HelpSteer2 data. The HelpSteer2 dataset is a permissively licensed preference dataset (CC-by-4.0) with ten thousand English response pairs and can be found here.
Evaluated using RewardBench - as introduced in the paper RewardBench: Evaluating Reward Models for Language Modeling.
Overall | Chat | Chat-Hard | Safety | Reasoning |
---|---|---|---|---|
92.0 | 95.8 | 87.1 | 91.5 | 93.7 |
This model was trained using an English dataset, and as such its use is optimized for English language use cases. In order to extend this model to other language domains, fine-tuning will be required.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here. Please report security vulnerabilities or NVIDIA AI Concerns here.
If you find this model useful, please cite the following works
@misc{wang2024helpsteer2, title={HelpSteer2: Open-source dataset for training top-performing reward models}, author={Zhilin Wang and Yi Dong and Olivier Delalleau and Jiaqi Zeng and Gerald Shen and Daniel Egert and Jimmy J. Zhang and Makesh Narsimhan Sreedhar and Oleksii Kuchaiev}, year={2024}, eprint={2406.08673}, archivePrefix={arXiv}, primaryClass={id='cs.CL' full_name='Computation and Language' is_active=True alt_name='cmp-lg' in_archive='cs' is_general=False description='Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.'} }