This guide describes how to use the protocol buffer language to structure your protocol buffer data, including .proto file syntax and how to generate data access classes from your .proto files. It covers the proto3 version of the protocol buffers language: for information on the proto2 syntax, see the Proto2 Language Guide.
This is a reference guide – for a step-by-step example that uses many of the features described in this document, see the tutorial for your chosen language.
Defining A Message Type
First let’s look at a very simple example. Let’s say you want to define a search request message format, where each search request has a query string, the particular page of results you are interested in, and a number of results per page. Here’s the .proto file you use to define the message type.
syntax = "proto3";

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 results_per_page = 3;
}
- The first line of the file specifies that you’re using proto3 syntax: if you don’t do this the protocol buffer compiler will assume you are using proto2. This must be the first non-empty, non-comment line of the file.
- The SearchRequest message definition specifies three fields (name/value pairs), one for each piece of data that you want to include in this type of message. Each field has a name and a type.
Specifying Field Types
In the earlier example, all the fields are scalar types: two integers (page_number and results_per_page) and a string (query). You can also specify enumerations and composite types like other message types for your field.
Assigning Field Numbers
You must give each field in your message definition a number between 1 and 536,870,911 with the following restrictions:
- The given number must be unique among all fields for that message.
- Field numbers 19,000 to 19,999 are reserved for the Protocol Buffers implementation. The protocol buffer compiler will complain if you use one of these reserved field numbers in your message.
- You cannot use any previously reserved field numbers or any field numbers that have been allocated to extensions.
This number cannot be changed once your message type is in use because it identifies the field in the message wire format. “Changing” a field number is equivalent to deleting that field and creating a new field with the same type but a new number. See Deleting Fields for how to do this properly.
Field numbers should never be reused. Never take a field number out of the reserved list for reuse with a new field definition. See Consequences of Reusing Field Numbers.
You should use the field numbers 1 through 15 for the most-frequently-set fields. Lower field number values take less space in the wire format. For example, field numbers in the range 1 through 15 take one byte to encode. Field numbers in the range 16 through 2047 take two bytes. You can find out more about this in Protocol Buffer Encoding.
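The byte counts above follow from how a field's tag is encoded. As a minimal sketch (the helper names varint_size and tag_size are illustrative and not part of any protobuf library), the tag is the field number shifted left by three bits, combined with a 3-bit wire type, and written as a base-128 varint:

```python
def varint_size(value: int) -> int:
    """Return how many bytes a non-negative integer occupies as a base-128 varint."""
    size = 1
    while value >= 0x80:  # each varint byte carries 7 payload bits
        value >>= 7
        size += 1
    return size


def tag_size(field_number: int, wire_type: int = 0) -> int:
    """Size in bytes of a field tag (hypothetical helper, for illustration only)."""
    return varint_size((field_number << 3) | wire_type)


# Print the tag size at the boundaries mentioned in the text.
for number in (1, 15, 16, 2047, 2048):
    print(number, tag_size(number))
```

Because three bits of the first byte go to the wire type, only four bits remain for the field number, which is why the one-byte range ends at 15.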
Consequences of Reusing Field Numbers
Reusing a field number makes decoding wire-format messages ambiguous.
The protobuf wire format is lean and doesn’t provide a way to detect fields encoded using one definition and decoded using another.
Encoding a field using one definition and then decoding that same field with a different definition can lead to:
- Developer time lost to debugging
- A parse/merge error (best case scenario)
- Leaked PII/SPII
- Data corruption
Common causes of field number reuse:
- renumbering fields (sometimes done to achieve a more aesthetically pleasing number order for fields). Renumbering effectively deletes and re-adds all the fields involved in the renumbering, resulting in incompatible wire-format changes.
- deleting a field and not reserving the number to prevent future reuse.
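To make the ambiguity concrete, here is a hedged sketch using a hand-rolled varint codec rather than the protobuf library. It assumes, purely for illustration, that field number 2 was first defined as int32 and later reused as sint32. Both types share the varint wire type, so nothing in the bytes reveals the mismatch and the reader silently gets a different number.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def zigzag_decode(n: int) -> int:
    """Undo the ZigZag mapping used by sint32/sint64."""
    return (n >> 1) ^ -(n & 1)


# Writer: field 2 is an int32 holding 300 (wire type 0 = varint).
wire = encode_varint((2 << 3) | 0) + encode_varint(300)

# Reader: the tag still says "field 2, varint", so parsing succeeds either way.
tag, pos = decode_varint(wire)
raw, _ = decode_varint(wire, pos)
as_int32 = raw                  # interpretation under the old definition
as_sint32 = zigzag_decode(raw)  # interpretation under the reused definition
print(as_int32, as_sint32)
```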
The maximum field number is 29 bits wide rather than the more typical 32 bits because the three lowest bits of each field tag encode the wire type. For more on this, see the Encoding topic.
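The arithmetic behind the field number limit can be checked directly. This is a small sketch under the assumption that a tag fits in 32 bits with 3 of them taken by the wire type; the constant and function names are made up for illustration, and the check ignores per-message reserved numbers and extensions.

```python
WIRE_TYPE_BITS = 3
MAX_FIELD_NUMBER = (1 << (32 - WIRE_TYPE_BITS)) - 1  # 2**29 - 1
RESERVED_FOR_IMPLEMENTATION = range(19000, 20000)    # 19,000 to 19,999 inclusive


def is_usable_field_number(number: int) -> bool:
    """Check the global rules only: range limit and implementation-reserved block."""
    return (
        1 <= number <= MAX_FIELD_NUMBER
        and number not in RESERVED_FOR_IMPLEMENTATION
    )


# The largest field number combined with the largest 3-bit wire type
# value fills exactly 32 bits.
largest_tag = (MAX_FIELD_NUMBER << WIRE_TYPE_BITS) | 0b111
print(MAX_FIELD_NUMBER, largest_tag == 2**32 - 1)
```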
Specifying Field Labels
Message fields can be one of the following:
- optional: An optional field is in one of two possible states:
  - the field is set, and contains a value that was explicitly set or parsed from the wire. It will be serialized to the wire.
  - the field is unset, and will return the default value. It will not be serialized to the wire.
  You can check to see if the value was explicitly set.
- repeated: this field type can be repeated zero or more times in a well-formed message. The order of the repeated values will be preserved.
- map: this is a paired key/value field type. See Maps for more on this field type.
- If no explicit field label is applied, the default field label, called “implicit field presence,” is assumed. (You cannot explicitly set a field to this state.) A well-formed message can have zero or one of this field (but not more than one). You also cannot determine whether a field of this type was parsed from the wire. An implicit presence field will be serialized to the wire unless it is the default value. For more on this subject, see Field Presence.
In proto3, repeated fields of scalar numeric types use packed encoding by default. You can find out more about packed encoding in Protocol Buffer Encoding.
Well-formed Messages
The term “well-formed,” when applied to protobuf messages, refers to the bytes serialized/deserialized. The protoc parser validates that a given proto definition file is parseable.
In the case of optional fields that have more than one value, the protoc parser will accept the input, but only uses the last field. So, the “bytes” may not be “well-formed” but the resulting message would have only one value and would be “well-formed” (but would not round-trip to the same bytes).
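The last-value-wins behavior can be sketched with a toy parser. This is not the protobuf runtime: it is a hand-written loop that assumes every field uses the varint wire type, and all function names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(field_number: int, value: int) -> bytes:
    """Tag (wire type 0) followed by the varint value."""
    return encode_varint(field_number << 3) + encode_varint(value)


def parse_singular_varints(data: bytes) -> dict:
    """Toy parser: assumes every field is a singular varint field."""
    fields = {}
    pos = 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        value, pos = decode_varint(data, pos)
        fields[tag >> 3] = value  # a later occurrence overwrites an earlier one
    return fields


def serialize(fields: dict) -> bytes:
    return b"".join(encode_field(n, v) for n, v in sorted(fields.items()))


wire_in = encode_field(1, 7) + encode_field(1, 42)  # field 1 appears twice
message = parse_singular_varints(wire_in)
wire_out = serialize(message)
print(message, wire_in != wire_out)
```

The parsed message holds a single value for field 1, and re-serializing it yields fewer bytes than were read, which is the round-trip difference the text describes.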
Adding More Message Types
Multiple message types can be defined in a single .proto file. This is useful if you are defining multiple related messages – so, for example, if you wanted to define the reply message format that corresponds to your SearchRequest message type, you could add it to the same .proto:
message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 results_per_page = 3;
}

message SearchResponse {
  ...
}
Combining Messages leads to bloat: While multiple message types (such as message, enum, and service) can be defined in a single .proto file, it can also lead to dependency bloat when large numbers of messages with varying dependencies are defined in a single file. It’s recommended to include as few message types per .proto file as possible.
Adding Comments
To add comments to your .proto files, use C/C++-style // and /* ... */ syntax.
/* SearchRequest represents a search query, with pagination options to
 * indicate which results to include in the response. */
message SearchRequest {
  string query = 1;
  int32 page_number = 2;  // Which page number do we want?
  int32 results_per_page = 3;  // Number of results to return per page.
}
Deleting Fields
Deleting fields can cause serious problems if not done properly.
When you no longer need a field and all references have been deleted from client code, you may delete the field definition from the message. However, you must reserve the deleted field number. If you do not reserve the field number, it is possible for a developer to reuse that number in the future.
You should also reserve the field name to allow JSON and TextFormat encodings of your message to continue to parse.
Reserved Fields
If you update a message type by entirely deleting a field, or commenting it out, future developers can reuse the field number when making their own updates to the type. This can cause severe issues, as described in Consequences of Reusing Field Numbers.
To make sure this doesn’t happen, add your deleted field number to the reserved list. To make sure JSON and TextFormat instances of your message can still be parsed, also add the deleted field name to a reserved list.
The protocol buffer compiler will complain if any future developers try to use these reserved field numbers or names.
message Foo {
  reserved 2, 15, 9 to 11;
  reserved "foo", "bar";
}
Reserved field number ranges are inclusive (9 to 11 is the same as 9, 10, 11). Note that you can’t mix field names and field numbers in the same reserved statement.
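The inclusive range rule can be mirrored in a few lines. This sketch only models the check conceptually: the data layout and the is_reserved_number helper are assumptions for illustration, not how the protocol buffer compiler is implemented.

```python
# Mirrors the example statement: reserved 2, 15, 9 to 11;
# A tuple stands for an inclusive "low to high" range.
RESERVED_NUMBERS = [2, 15, (9, 11)]


def is_reserved_number(number: int, reserved=RESERVED_NUMBERS) -> bool:
    """Return True if number matches a single entry or falls in a range."""
    for entry in reserved:
        if isinstance(entry, tuple):
            low, high = entry
            if low <= number <= high:  # both endpoints are included
                return True
        elif entry == number:
            return True
    return False


print([n for n in range(1, 17) if is_reserved_number(n)])
```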
What’s Generated from Your .proto?
When you run the protocol buffer compiler on a .proto, the compiler generates the code in your chosen language you’ll need to work with the message types you’ve described in the file, including getting and setting field values, serializing your messages to an output stream, and parsing your messages from an input stream.
- For C++, the compiler generates a .h and .cc file from each .proto, with a class for each message type described in your file.
- For Java, the compiler generates a .java file with a class for each message type, as well as a special Builder class for creating message class instances.
- For Kotlin, in addition to the Java generated code, the compiler generates a .kt file for each message type with an improved Kotlin API. This includes a DSL that simplifies creating message instances, a nullable field accessor, and a copy function.
- Python is a little different: the Python compiler generates a module with a static descriptor of each message type in your .proto, which is then used with a metaclass to create the necessary Python data access class at runtime.
- For Go, the compiler generates a .pb.go file with a type for each message type in your file.
- For Ruby, the compiler generates a .rb file with a Ruby module containing your message types.
- For Objective-C, the compiler generates a pbobjc.h and pbobjc.m file from each .proto, with a class for each message type described in your file.
- For C#, the compiler generates a .cs file from each .proto, with a class for each message type described in your file.
- For PHP, the compiler generates a .php message file for each message type described in your file, and a .php metadata file for each .proto file you compile. The metadata file is used to load the valid message types into the descriptor pool.
- For Dart, the compiler generates a .pb.dart file with a class for each message type in your file.
You can find out more about using the APIs for each language by following the tutorial for your chosen language. For even more API details, see the relevant API reference.
Scalar Value Types
A scalar message field can have one of the following types – the table shows the type specified in the .proto file, and the corresponding type in the automatically generated class:
.proto Type | Notes | C++ Type | Java/Kotlin Type[1] | Python Type[3] | Go Type | Ruby Type | C# Type | PHP Type | Dart Type |
---|---|---|---|---|---|---|---|---|---|
double | | double | double | float | float64 | Float | double | float | double |
float | | float | float | float | float32 | Float | float | float | double |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long[4] | int64 | Bignum | long | integer/string[6] | Int64 |
uint32 | Uses variable-length encoding. | uint32 | int[2] | int/long[4] | uint32 | Fixnum or Bignum (as required) | uint | integer | int |
uint64 | Uses variable-length encoding. | uint64 | long[2] | int/long[4] | uint64 | Bignum | ulong | integer/string[6] | Int64 |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long[4] | int64 | Bignum | long | integer/string[6] | Int64 |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int[2] | int/long[4] | uint32 | Fixnum or Bignum (as required) | uint | integer | int |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long[2] | int/long[4] | uint64 | Bignum | ulong | integer/string[6] | Int64 |
sfixed32 | Always four bytes. | int32 | int | int | int32 | Fixnum or Bignum (as required) | int | integer | int |
sfixed64 | Always eight bytes. | int64 | long | int/long[4] | int64 | Bignum | long | integer/string[6] | Int64 |
bool | | bool | boolean | bool | bool | TrueClass/FalseClass | bool | boolean | bool |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text, and cannot be longer than 2^32. | string | String | str/unicode[5] | string | String (UTF-8) | string | string | String |
bytes | May contain any arbitrary sequence of bytes no longer than 2^32. | string | ByteString | str (Python 2), bytes (Python 3) | []byte | String (ASCII-8BIT) | ByteString | string | List |
[1] Kotlin uses the corresponding types from Java, even for unsigned types, to ensure compatibility in mixed Java/Kotlin codebases.
[2] In Java, unsigned 32-bit and 64-bit integers are represented using their signed counterparts, with the top bit simply being stored in the sign bit.
[3] In all cases, setting values to a field will perform type checking to make sure it is valid.
[4] 64-bit or unsigned 32-bit integers are always represented as long when decoded, but can be an int if an int is given when setting the field. In all cases, the value must fit in the type represented when set. See [2].
[5] Python strings are represented as unicode on decode but can be str if an ASCII string is given (this is subject to change).
[6] Integer is used on 64-bit machines and string is used on 32-bit machines.
You can find out more about how these types are encoded when you serialize your message in Protocol Buffer Encoding.
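The efficiency notes in the table can be checked with a short size calculation. The sketch below is hand-written for illustration (none of these helpers come from the protobuf library) and assumes the standard encodings: negative int32 values are sign-extended to 64 bits before varint encoding, and sint32 applies the ZigZag mapping first.

```python
def varint_size(value: int) -> int:
    """Bytes needed to encode a non-negative integer as a varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def int32_size(value: int) -> int:
    """Payload size of an int32: negatives are sign-extended to 64 bits."""
    return varint_size(value & 0xFFFFFFFFFFFFFFFF)


def zigzag32(value: int) -> int:
    """ZigZag mapping: 0, -1, 1, -2, ... become 0, 1, 2, 3, ..."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF


def sint32_size(value: int) -> int:
    """Payload size of a sint32 (ZigZag, then varint)."""
    return varint_size(zigzag32(value))


FIXED32_SIZE = 4  # fixed32 and sfixed32 always take four bytes

print("int32  -1:", int32_size(-1), "bytes")
print("sint32 -1:", sint32_size(-1), "bytes")
print("uint32 2**28 - 1:", varint_size(2**28 - 1), "bytes")
print("uint32 2**28:", varint_size(2**28), "bytes; fixed32:", FIXED32_SIZE)
```

A varint carries 7 bits per byte, so four bytes hold 28 bits. Values of 2^28 and above need a fifth varint byte, at which point the four-byte fixed32 encoding is smaller.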
Default Values
When a message is parsed, if the encoded message does not contain a particular implicit presence element, accessing the corresponding field in the parsed object returns the default value for that field. These defaults are type-specific:
- For strings, the default value is the empty string.
- For bytes, the default value is empty bytes.
- For bools, the default value is false.
- For numeric types, the default value is zero.
- For enums, the default value is the first defined enum value, which must be 0.
- For message fields, the field is not set. Its exact value is language-dependent. See the generated code guide for details.
The default value for repeated fields is empty (generally an empty list in the appropriate language).
Note that for scalar message fields, once a message is parsed there’s no way of telling whether a field was explicitly set to the default value (for example whether a boolean was set to false) or just not set at all: you should bear this in mind when defining your message types. For example, don’t have a boolean that switches on some behavior when set to false if you don’t want that behavior to also happen by default. Also note that if a scalar message field is set to its default, the value will not be serialized on the wire. If a float or double value is set to +0 it will not be serialized, but -0 is considered distinct and will be serialized.
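The +0 versus -0 distinction comes down to bit patterns. The following sketch is an assumption-laden illustration, not the protobuf runtime: serialize_implicit_double is a hypothetical helper that models an implicit-presence double field and only supports field numbers up to 15 so that the tag fits in one byte.

```python
import struct


def serialize_implicit_double(field_number: int, value: float) -> bytes:
    """Model of an implicit-presence double: skip the field when it is +0.0."""
    payload = struct.pack("<d", value)  # little-endian IEEE 754 double
    if payload == b"\x00" * 8:  # only +0.0 has an all-zero bit pattern
        return b""
    tag = (field_number << 3) | 1  # wire type 1 = 64-bit
    return bytes([tag]) + payload  # single-byte tag: assumes field_number <= 15


print(serialize_implicit_double(1, 0.0))   # nothing is written
print(serialize_implicit_double(1, -0.0))  # sign bit set, so it is written
```

Comparing the packed bytes rather than the float values matters here, because 0.0 == -0.0 is true in Python.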
See the generated code guide for your chosen language for more details about how defaults work in generated code.
Enumerations
When you’re defining a message type, you might want one of its fields to only have one of a predefined list of values. For example, let’s say you want to add a corpus field for each SearchRequest, where the corpus can be UNIVERSAL, WEB, IMAGES, LOCAL, NEWS, PRODUCTS or VIDEO. You can do this very simply by adding an enum to your message definition with a constant for each possible value.
In the following example we’ve added an enum called Corpus with all the possible values, and a field of type Corpus:
enum Corpus {
  CORPUS_UNSPECIFIED = 0;
  CORPUS_UNIVERSAL = 1;
  CORPUS_WEB = 2;
  CORPUS_IMAGES = 3;
  CORPUS_LOCAL = 4;
  CORPUS_NEWS = 5;
  CORPUS_PRODUCTS = 6;
  CORPUS_VIDEO = 7;
}

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 results_per_page = 3;
  Corpus corpus = 4;
}
As you can see, the Corpus enum’s first constant maps to zero: every enum definition must contain a constant that maps to zero as its first element. This is because:
- There must be a zero value, so that we can use 0 as a numeric default value.
- The zero value needs to be the first element, for compatibility with the proto2 semantics where the first enum value is the default unless a different value is explicitly specified.
You can define aliases by assigning the same value to different enum constants. To do this you need to set the allow_alias option to true. Otherwise, the protocol buffer compiler generates a warning message when aliases are found. Though all alias values are valid during deserialization, the first value is always used when serializing.
enum EnumAllowingAlias {
  option allow_alias = true;
  EAA_UNSPECIFIED = 0;
  EAA_STARTED = 1;
  EAA_RUNNING = 1;
  EAA_FINISHED = 2;
}

enum EnumNotAllowingAlias {
  ENAA_UNSPECIFIED = 0;
  ENAA_STARTED = 1;
  // ENAA_RUNNING = 1;  // Uncommenting this line will cause a warning message.
  ENAA_FINISHED = 2;
}
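As a loose analogy only, Python's standard enum module treats duplicate values the same way: the first name is canonical and later names become aliases. The class below uses Python's enum, not protobuf-generated code, and is meant just to illustrate the "first name wins" idea behind aliases.

```python
import enum


class EnumAllowingAlias(enum.IntEnum):
    EAA_UNSPECIFIED = 0
    EAA_STARTED = 1
    EAA_RUNNING = 1  # same value as EAA_STARTED, so this becomes an alias
    EAA_FINISHED = 2


# Looking up the value resolves to the first name that was defined with it.
print(EnumAllowingAlias(1).name)
# Both names refer to the same member object.
print(EnumAllowingAlias.EAA_RUNNING is EnumAllowingAlias.EAA_STARTED)
# Iteration lists canonical members only; aliases are skipped.
print([member.name for member in EnumAllowingAlias])
```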
Enumerator constants must be in the range of a 32-bit integer. Since enum values use varint encoding on the wire, negative values are inefficient and thus not recommended. You can define enums within a message definition, as in the earlier example, or outside – these enums can be reused in any message definition in your .proto file. You can also use an enum type declared in one message as the type of a field in a different message, using the syntax _MessageType_._EnumType_.
When you run the protocol buffer compiler on a .proto that uses an enum, the generated code will have a corresponding enum for Java, Kotlin, or C++, or a special EnumDescriptor class for Python that’s used to create a set of symbolic constants with integer values in the runtime-generated class.
Important
The generated code may be subject to language-specific limitations on the number of enumerators (low thousands for one language). Review the limitations for the languages you plan to use.
During deserialization, unrecognized enum values will be preserved in the message, though how this is represented when the message is deserialized is language-dependent. In languages that support open enum types with values outside the range of specified symbols, such as C++ and Go, the unknown enum value is simply stored as its underlying integer representation. In languages with closed enum types such as Java, a case in the enum is used to represent an unrecognized value, and the underlying integer can be accessed with special accessors. In either case, if the message is serialized the unrecognized value will still be serialized with the message.
Important
For information on how enums should work contrasted with how they currently work in different languages, see Enum Behavior.
For more information about how to work with message enums in your applications, see the generated code guide for your chosen language.
Reserved Values
If you update an enum type by entirely removing an enum entry, or commenting it out, future users can reuse the numeric value when making their own updates to the type. This can cause severe issues if they later load old versions of the same .proto, including data corruption, privacy bugs, and so on. One way to make sure this doesn’t happen is to specify that the numeric values (and/or names, which can also cause issues for JSON serialization) of your deleted entries are reserved. The protocol buffer compiler will complain if any future users try to use these identifiers. You can specify that your reserved numeric value range goes up to the maximum possible value using the max keyword.
enum Foo {
  reserved 2, 15, 9 to 11, 40 to max;
  reserved "FOO", "BAR";
}
Note that you can’t mix field names and numeric values in the same reserved statement.
Using Other Message Types
You can use other message types as field types. For example, let’s say you wanted to include Result messages in each SearchResponse message – to do this, you can define a Result message type in the same .proto and then specify a field of type Result in SearchResponse:
message SearchResponse {
  repeated Result results = 1;
}

message Result {
  string url = 1;
  string title = 2;
  repeated string snippets = 3;
}
Importing Definitions
In the earlier example, the Result message type is defined in the same file as SearchResponse – what if the message type you want to use as a field type is already defined in another .proto file?
You can use definitions from other .proto files by importing them. To import another .proto’s definitions, you add an import statement to the top of your file:
import "myproject/other_protos.proto";
By default, you can use definitions only from directly imported .proto files. However, sometimes you may need to move a .proto file to a new location. Instead of moving the .proto file directly and updating all the call sites in a single change, you can put a placeholder .proto file in the old location to forward all the imports to the new location using the import public notion.
Note that the public import functionality is not available in Java.
import public dependencies can be transitively relied upon by any code importing the proto containing the import public statement. For example:
// new.proto
// All definitions are moved here

// old.proto
// This is the proto that all clients are importing.
import public "new.proto";
import "other.proto";

// client.proto
import "old.proto";
// You use definitions from old.proto and new.proto, but not other.proto
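The visibility rule in this example can be modeled as a small graph walk. This is a conceptual sketch, not the compiler's implementation: the IMPORTS table and visible_files function are hypothetical, and each import is recorded as a (file name, is_public) pair.

```python
# Each file maps to its imports as (imported_file, is_public) pairs.
IMPORTS = {
    "new.proto": [],
    "other.proto": [],
    "old.proto": [("new.proto", True), ("other.proto", False)],
    "client.proto": [("old.proto", False)],
}


def visible_files(importer: str, imports=IMPORTS) -> set:
    """Files whose definitions the importer may use."""
    visible = set()
    # Direct imports are always visible, public or not.
    stack = [name for name, _ in imports[importer]]
    while stack:
        current = stack.pop()
        if current in visible:
            continue
        visible.add(current)
        # Beyond the direct imports, only "import public" edges are followed.
        stack.extend(name for name, is_public in imports[current] if is_public)
    return visible


print(sorted(visible_files("client.proto")))
```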
The protocol compiler searches for imported files in a set of directories specified on the protocol compiler command line using the -I/--proto_path flag. If no flag was given, it looks in the directory in which the compiler was invoked. In general you should set the --proto_path flag to the root of your project and use fully qualified names for all imports.
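The search described above can be approximated as follows. This is a simplified model and not protoc's actual code: resolve_import is a hypothetical function, the directories are assumed to be tried in the order given, and the exists parameter is injectable so the lookup can be exercised without touching the file system.

```python
import os


def resolve_import(import_name, proto_paths=None, exists=os.path.isfile):
    """Return the first matching path for an import, or None if not found."""
    # With no --proto_path flag, fall back to the invocation directory.
    search_dirs = proto_paths or ["."]
    for directory in search_dirs:
        candidate = os.path.join(directory, import_name)
        if exists(candidate):
            return candidate
    return None


# Usage with a fake file system: the import lives under "src" only.
fake_files = {os.path.join("src", "myproject/other_protos.proto")}
found = resolve_import(
    "myproject/other_protos.proto",
    proto_paths=["third_party", "src"],
    exists=fake_files.__contains__,
)
print(found)
```

Writing imports relative to the project root, as the text recommends, keeps the import string identical no matter which file contains it.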
Using proto2 Message Types
It’s possible to import proto2 message types and use them in your proto3 messages, and vice versa. However, proto2 enums cannot be used directly in proto3 syntax (it’s okay if an imported proto2 message uses them).
Nested Types
You can define and use message types inside other message types, as in the following example – here the Result message is defined inside the SearchResponse message:
message SearchResponse {
  message Result {
    string url = 1;
    string title = 2;
    repeated string snippets = 3;
  }
  repeated Result results = 1;
}
If you want to reuse this message type outside its parent message type, you refer to it as _Parent_._Type_:
message SomeOtherMessage {
  SearchResponse.Result result = 1;
}
You can nest messages as deeply as you like. In the example below, note that the two nested types named Inner are entirely independent, since they are defined within different messages:
message Outer {       // Level 0
  message MiddleAA {  // Level 1
    message Inner {   // Level 2
      int64 ival = 1;
      bool booly = 2;
    }
  }
  message MiddleBB {  // Level 1
    message Inner {   // Level 2
      int32 ival = 1;
      bool booly = 2;
    }
  }
}
Updating A Message Type
If an existing message type no longer meets all your needs – for example, you’d like the message format to have an extra field – but you’d still like to use code created with the old format, don’t worry! It’s very simple to update message types without breaking any of your existing code when you use the binary wire format.