By Joe Provino

jVoiceBridge is a Java-based software conference bridge. It uses the SIP and RTP protocols to send signaling and audio data, respectively. It is designed to be highly reliable and scalable, while still providing advanced features such as high-fidelity, stereo audio and individual mixes.

jVoiceBridge is currently used in a variety of applications, including conference calling and 3D virtual worlds. Since the voice bridge is based on the SIP standard, it interoperates with existing SIP components such as software phones, and VoIP to PSTN gateways. The jVoiceBridge package also includes a Java-based software phone which is optimized to work with the bridge.


This package includes the source for the jVoiceBridge distribution. jVoiceBridge consists of two main parts: the voice bridge and the softphone. There is also a monitoring tool called bridgemonitor. Also included is the voicelib, which is the audio interface for the Wonderland server.

The voice bridge is a server application which mixes audio for a variety of uses. The softphone is a client-side Java application that can be used to connect to the voice bridge in high-fidelity stereo. While the voice bridge and softphone are optimized to work together, any SIP-based software phone should work with the voice bridge.


Building jVoiceBridge requires the following software:

  • Java SE, version 1.5 or later
  • Apache Ant, version 1.6.5 or later

To build the voice bridge and softphone, expand the source bundle into a directory, which we will refer to as "<bridge_dir>". <bridge_dir> contains the top-level build files, as well as the following directories:

      <bridge_dir>/common:    shared libraries and code
      <bridge_dir>/stun:      implementation of RFC 3489
      <bridge_dir>/voip:      voice bridge
      <bridge_dir>/softphone: software phone implementation

Each of these directories contains an independent sub-project, with separate build scripts.

To compile a distribution, from <bridge_dir>, run the command:

      $ ant dist
This will create a distribution bundle in the directory <bridge_dir>/dist. This should include the softphone, bridge and doc directories.

The <bridge_dir>/dist/softphone directory contains the standalone softphone.jar file. This jar includes all classes and data needed to run the softphone.

The <bridge_dir>/dist/bridge directory contains all the jar files needed to run the bridge, and a run.xml file which can be used with ant to launch the bridge.

The <bridge_dir>/dist/doc directory contains documentation.


To run the voice bridge, from <bridge_dir> run:

      $ ant run-bridge
NOTE (Mac OS X only): If you're running the JDK 1.6 Developer Preview release, you'll need to comment out the following lines from <bridge_dir>/dist/bridge/run.xml:
      <!--jvmarg value="-XX:+UseParNewGC"/-->
      <!--jvmarg value="-XX:+UseConcMarkSweepGC"/-->
This will start up the voice bridge on the local host, port 5060. To change the port the bridge runs on, you can create a file called "" in <bridge_dir> and add the following line:
Once the bridge is running, you should be able to telnet to the bridge control port (port 6666) and type "help" to see a list of bridge commands.


To run the softphone, from <bridge_dir> run:

      $ ant run-softphone
This will launch the softphone.

By default, the softphone does not have a SIP registrar set, so it will not register properly. Generally, it is a good idea to use the voice bridge as the SIP registrar. To do this, select Settings in the softphone menu, then Configure. Click on sip in the Configuration window, scroll down to Registrar address and enter <registrar IP address>;sip-stun:<registrar port>. Click Save, then Close, and restart the softphone.

When the softphone registers correctly, you will see the current registered address in green at the bottom of the softphone. You can mouse over the address to see the full address if it's not already visible. When placing calls to the softphone, use this address.


The voice bridge can be remotely controlled using the bridge control port, by default port 6666. From any client, you can telnet to this port and get information and send commands to the bridge. For example, you can place a call with the following commands:

      conferenceId=Test
      phoneNumber=<phone number or SIP address>
      <blank line>
This will call the given phone number or SIP address and place the call in the conference called Test. Additional calls can be added to the conference by sending the same commands with different phone numbers. All of the calls will be in the same conference and able to hear each other. These are the minimal commands needed to put calls in a conference.
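The same exchange can be scripted. Below is a minimal Java sketch of a control-port client; the host, port, and SIP address are placeholders, and BridgeControl is an illustrative name, not a class shipped with the bridge:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Sketch of a client for the bridge control protocol: call-setup
// commands are plain ASCII lines, terminated by a blank line.
public class BridgeControl {

    // Build the minimal command sequence to place a call into a conference.
    public static String buildCallSetup(String conferenceId, String phoneNumber) {
        return "conferenceId=" + conferenceId + "\n"
             + "phoneNumber=" + phoneNumber + "\n"
             + "\n";   // the blank line tells the bridge the call setup is complete
    }

    public static void main(String[] args) throws Exception {
        // Connect to the bridge control port (default 6666) and place a call.
        // "localhost" and the SIP address below are placeholders.
        try (Socket s = new Socket("localhost", 6666);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()))) {
            out.print(buildCallSetup("Test", "sip:test@example.com"));
            out.flush();
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // bridge status messages
            }
        }
    }
}
```

The command reference later in this document describes the full set of commands a client can send this way.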

Some commands are acted upon immediately to set or get information. For example, to display information about all calls use the callInfo (or ci) command.

For detailed information about calls, use getCallStatus (or gcs).

When you are done issuing commands, you can close the telnet connection. By default this terminates the calls started on that connection; use the releaseCalls command described below to leave them connected.
A full description of the commands for the bridge does not currently exist, but there is a Java API to access the most common functionality. Note that these APIs do not cover all the features of the bridge.

The voice bridge API is in the com.sun.voip.client.connector package in the voip directory. This package is built into the bridge_connector.jar file, which is available in <bridge_dir>/dist/bridge.

JavaDoc for the bridge API is available in the <bridge_dir>/dist/doc/connector directory. There is a sample program that places calls using the bridge connector API in the <bridge_dir>/dist/doc/example directory, called PlaceCall.java.

To build the PlaceCall example, use a command like:

      $ javac -cp ../../bridge/bridge_connector.jar PlaceCall.java
To run it, use:
      $ java -cp ../../bridge/bridge_connector.jar:. PlaceCall <bridge host> <bridge port> <conference> <SIP address>
This will place a call to the given SIP address from the given bridge. For example, if your softphone has the address "sip:test@", you can place a call to that phone with:
      $ java PlaceCall localhost 6666 testConf sip:test@
This will create a conference called "testConf" and place a call to the softphone. If you call another phone with the same conference name, the two parties will be able to talk to each other.


The voice bridge can be configured to use a gateway for calls to the Public Switched Telephone Network (PSTN). To set the gateway when the bridge starts up, add the following line to the bridge properties file:

      voicebridge.sip.gateways=<gateway IP address>[, <gateway IP address> ...]
For a temporary change which will last until the voice bridge is restarted, telnet to the bridge control port and set the gateway list:
      telnet <bridge host> 6666
      voIPGateways=<gateway IP address>[, <gateway IP address> ...]
We have successfully used a Cisco 3745 SIP/VoIP gateway. An Asterisk box can also be used as a gateway.

There is also a hosting service provided by Junction Networks. If you use this service, you will need to configure your account so that Junction Networks knows the voice bridge IP address.

Once you've established an account, click on PSTN Gateway, then Account Login, then Configuration. Under SIP Configuration, click on Modify. Under TERMINATION, enter the IP address of the voice bridge and disable the SIP password. Under ORIGINATION, select "Use IP Address" for the SIP Origination Method and enter the IP address of the voice bridge.

If you want to have support for dial-in, request a phone number and have calls to that number forwarded to the voice bridge IP address and port number (usually 5060). Click Submit.

Now when the voice bridge is asked to place a call to a non-SIP address such as a regular phone number, the SIP INVITE will be sent to the gateway.


A call is the basic unit of audio. It is identified by a call id. A CallParticipant is used to hold the parameters for a call. There's one CallParticipant, one CallHandler and one ConferenceMember per call. The CallHandler handles the call setup and call status. The ConferenceMember deals with the call being part of a conference. It has a MemberSender and a MemberReceiver.
A conference is identified by a conference id. A conference has members which can talk to each other.
Input Treatment
An input treatment is a call which has no remote endpoint. Instead it uses a treatment as the source of its audio. The audio from an input treatment is handled like microphone input from any other call.
Each call has MixDescriptors which describe what audio the call should hear. A descriptor has information about the source of the audio and how loud and in what direction the call should hear the audio.

In a simple conference, there's a "common mix" to which the data from all calls in the conference are added. In order to not hear yourself when you talk, your own audio needs to be subtracted out of the common mix when the data is sent to you. So you have a descriptor for the common mix with a volume level of one and another descriptor for your own audio source with a volume level of minus 1.

But let's say you want to raise the volume of my audio because I speak too softly. This is what we call a private mix. You would need another descriptor with my audio source and a volume level of say .2. It's .2 because my audio is already in the common mix at a volume of 1. So when everything is added together, the volume you hear from me will be 1.2.

If there are a lot of private mixes, the common mix may not make any sense to maintain. We support this too. In that case, you only hear audio sources for which you have a private mix (i.e. a mix descriptor).

For example, if you and I want to hear each other, we'd each have a mix descriptor for the other, and that's it.
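The descriptor arithmetic above can be checked with a small sketch. The class and field names here are illustrative only, not the bridge's actual mixer types:

```java
import java.util.List;

// Illustrative model of mix descriptors: what a listener hears is the
// sum of (volume x source) over all of the listener's descriptors.
public class MixMath {

    record Descriptor(String source, double volume) {}

    // Effective volume at which `who` is heard by a listener with the
    // given descriptors. The common mix contains every source at volume 1.
    public static double effectiveVolume(List<Descriptor> descriptors, String who) {
        double v = 0;
        for (Descriptor d : descriptors) {
            if (d.source().equals("commonMix")) {
                v += d.volume() * 1.0;   // `who` is in the common mix at volume 1
            } else if (d.source().equals(who)) {
                v += d.volume();
            }
        }
        return v;
    }

    public static void main(String[] args) {
        // You hear the common mix at 1, subtract your own audio at -1,
        // and boost me with a private mix of +0.2.
        List<Descriptor> yours = List.of(
                new Descriptor("commonMix", 1.0),
                new Descriptor("you", -1.0),
                new Descriptor("me", 0.2));
        System.out.println(effectiveVolume(yours, "me"));    // 1.2: common mix plus private mix
        System.out.println(effectiveVolume(yours, "you"));   // 0.0: you don't hear yourself
    }
}
```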

A "treatment" is short for audio treatment which is simply an audio file that's played to a call or a conference. For example, if you were doing a two party call, the first party would be your phone number and the second party would be the number you want to call. You'd probably want to specify a couple of treatments to be played when you answer the phone and to be stopped when the other person answers. When you answer your phone, you'd probably want to hear "please wait while I connect your call", followed by a ringing sound. Or maybe just the ringing sound. These sounds are called treatments.
A whisper group is like a sub conference. Each member is talking in only one whisper group.


The following commands are not acted upon immediately but collected and used to set up a call. A blank line indicates the commands are complete and the call should be setup.

    callAnswerTimeout | to = <seconds>
This is the number of seconds to wait before terminating the call if the call is unanswered. The default is 90 seconds.
    callAnsweredTreatment | at = <treatment>
This is the audio played when the call is answered.
    callEndTreatment | et = <treatment>
This is the audio played when the call ends.
    callEstablishedTreatment | et = <treatment>
This is the audio played when the call is established. Normally a call is established when it is answered. However, if a joinConfirmationTimeout is specified, when the call is answered it is not established until the joinConfirmation key is pressed.
    callId | id = <String>
This is the call Id used for identifying and controlling a call. If it isn't specified, the bridge will pick a unique call Id using the bridge location String concatenated with a monotonically increasing number.
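As a sketch, the call Id scheme just described might look like the following; the separator character and the counter's starting value are assumptions, not the bridge's exact format:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the call Id scheme described above: the bridge location
// concatenated with a monotonically increasing number.
// (The "-" separator and the counter starting at 1 are assumptions.)
public class CallIdGenerator {
    private final String bridgeLocation;            // e.g. the 3-character bridge location
    private final AtomicInteger counter = new AtomicInteger();

    public CallIdGenerator(String bridgeLocation) {
        this.bridgeLocation = bridgeLocation;
    }

    public String nextCallId() {
        return bridgeLocation + "-" + counter.incrementAndGet();
    }

    public static void main(String[] args) {
        CallIdGenerator gen = new CallIdGenerator("BUR");  // hypothetical location
        System.out.println(gen.nextCallId());   // BUR-1
        System.out.println(gen.nextCallId());   // BUR-2
    }
}
```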
    conferenceId | c = <string>
This is the conference Id. All calls in the same conference will be able to hear each other.
    conferenceJoinTreatment | jt = <treatment>
This is the audio played to the entire conference when a new call joins the conference.
    conferenceLeaveTreatment | lt = <treatment>
This is the audio played to the entire conference when a call ends and leaves the conference.
    displayName | d = <string>
This is the information that will be displayed with the caller Id. If it's not specified, the conference id is used except for a two party call in which case the second party name or number is used.
    doNotRecord | dnr = true | false
If set to true the bridge will not record data from this call.
    directConferencing | dc = true | false
Defaults to true. TODO: Explain
    dtmfDetection | dtmf = true | false
Check for dtmf keys in voice data. Default is false.
    dtmfSuppression | ds = true | false
Suppress dtmf sounds from the conference so that others don't hear when someone presses a dtmf key. Default is true.
    duplicateCallLimit | dcl = <int>
This is the maximum number of calls to the same number. Default is 100.
    firstRtpPort = <int>
This is the lowest port number to use for RTP for call data. The default is 0 which means to let the system choose an appropriate port.
    encryptionKey | ek = <string>
This is the key used to encrypt the voice data sent to this call and to decrypt data from this call.
    encryptionAlgorithm | ea = <string>
This is the encryption algorithm used to encrypt / decrypt data for the call. The algorithm must be recognized by javax.crypto.Cipher.getInstance().
    forwardDataFrom | fdf = <callId>
Data from callId will be forwarded to this call. This is used when calls are on multiple bridges. Assume Jon's call is on bridge 1 and Joe's call is on bridge 2, and both calls are in the same conference. In order for Jon and Joe to hear each other, the two bridges have to send data to each other. One way to accomplish this is to set up virtual calls between the two bridges. Joe's data on bridge 2 is sent to call Id v-Joe on bridge 1, and Jon's data on bridge 1 is sent to v-Jon on bridge 2. The forwardDataFrom command is used to forward Jon's data to v-Jon and Joe's data to v-Joe.
    firstConferenceMemberTreatment | fm = <treatment>
This is the audio played when the first member joins a conference.
    handleSessionProgress | hsp = true | false
This is used to work around a bug. We have seen problems where our SIP/VoIP gateway responds to a SIP INVITE with only SESSION_PROGRESS and never sends a SIP OK. Setting handleSessionProgress to true means that the voice bridge should treat SESSION_PROGRESS like OK.
    ignoreTelephoneEvents | ite = true | false
When true, telephone events such as dtmf will be ignored.
    inputTreatment | it = <treatment>
An input treatment is a special call which plays an audio treatment to a conference. No data is ever sent to an input treatment. We have used this to play recordings in a virtual world.
    joinConfirmationKey | jck = <string 1 char dtmf key>
This is the key which must be pressed when a joinConfirmationTimeout is specified. The default is 1.
    joinConfirmationTimeout | jc = <seconds>
When a call is answered if a joinConfirmationTimeout is specified, then the join confirmation key must be pressed before the timeout, otherwise the call will be terminated. This is useful when a call is answered automatically by something like voicemail or call forwarding. In this case, the call is prevented from joining the conference.
    lastRtpPort = <int>
This is the highest RTP port number to be used for voice data.
    migrate = <existing callId> : <new Phone Number> | Id-<new callId>
This is used to transfer a call from one phone to another. For example, you may be in a meeting at work using your office phone but then need to leave work. Migrate lets you seamlessly transfer the call to your cell phone.
    migrateToBridge | mtb = <bridge host> : <bridge port> : <callId>
TODO: Describe!
    mute | m = true | false
If true, the call is muted as soon as it's established.
    muteWhisperGroup | mwg = true | false
If the call is whispering in a whisper group and that whisper group is muted, the call is muted.
    muteConference | mc = true | false
When set to true, the call will not hear any audio from the conference.
    name | n = <string>
The name associated with the call.
    phoneNumber | pn = <phone number>
This is the phone number to call. It can be a valid phone number or a SIP URI.
    phoneNumberLocation | pnl = <3 character phone number location>
We have used this to determine if the phone number needs a prefix or not for an internal extension.
    protocol | p = <signaling protocol>
The default protocol is SIP. NS (for non-signaling) is used for setting up input treatments.
    remoteCallId | rid = <remote callId>
If specified this is added to the SDP as an attribute a=callId:<remoteCall>
    remoteMediaInfo | rm = <String>
This is used with the non-signaling incoming call setup agent, NSIncomingCallAgent.
    voipGateway | vg = <ip address>
This is the SIP/VoIP gateway to use for this call. If not specified, the voipGateways for the system is used to pick a gateway.
    useConferenceReceiverThread | ucrt = true | false
    voiceDetection | v = true | false
When set to true, the audio data from the call will be analyzed to determine if a person is speaking. A STARTED_SPEAKING message is sent to the socket when it's determined that a person just started speaking. A STOPPED_SPEAKING message is sent when the person stops talking.
    voiceDetectionWhileMuted | vm = true | false
Enable speech detection even if the call is muted.
<treatment> is either
	file:<path to audio file> or
	dtmf:<numbers 0-9 # *> or
Enter ? to display current call parameters


addCallToWhisperGroup | acwg = <whisperGroupId> : <callId>
Add the call to the whisper group.
allowShortNames | asn = true | false
When set to true (the default), the name associated with the call can be used to identify the call. If false, only the call id can be used to identify the call.
bridgeLocation | bl = <3 character bridge location>
This is a 3 character identifier used as part of the call id when the bridge generates a call id.
callAnswerTimeout | cat = <seconds>
This is the default number of seconds in which a call will be terminated if not answered.
callAnsweredTreatment | at = <answer treatment>: <conferenceId>
This is the default audio treatment to be played when the call is answered.
cancel = <callId>
Terminate the call with specified call id. If the call id is 0, all calls will be terminated.
cancelMigration | cm = <callId>
Cancel a call migration in progress.
comfortNoiseType | cnt = 0 | 1 | 2
Comfort noise is white noise sent when there's no real audio data to send. This gives the receiver comfort, knowing that the connection is still alive. 0 disables sending comfort noise, 1 means to generate a packet with noise, 2 means to use the RTP comfort noise payload. Default is 2.
comfortNoiseLevel | cnl = <byte level>
This determines the comfort noise level. Default is 62.
commonMixDefault | cmd = true | false
The default for a whisper group is to have a common mix. This means that data received from each call is added to the common mix. When data is sent to a member, the data sent is the common mix minus the member's contribution to the common mix. If a whisper group has no common mix, then each member must have a private mix for the other members it wants to hear.
conferenceInfo | ci
Display information about all conferences and the members in each conference.
createConference | cc = <conferenceId>:PCM[U]|SPEEX/<sampleRate>/<channels>[:<displayName>]
Conferences are automatically created when the first member joins the conference and removed when the last member leaves. For automatically created conferences, the audio quality is PCMU/8000/1. createConference is used to create a conference with the specified audio quality. Once a conference is created, it persists until it is explicitly removed.
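The quality specifier above has the shape <codec>/<sampleRate>/<channels>. As a sketch, a minimal parser for it might look like this (ConferenceQuality is an illustrative name, not a bridge class):

```java
// Parse the createConference quality specifier described above,
// e.g. "PCMU/8000/1" (the default) or "SPEEX/32000/2".
public class ConferenceQuality {
    public final String codec;      // PCM, PCMU or SPEEX
    public final int sampleRate;
    public final int channels;

    public ConferenceQuality(String spec) {
        String[] parts = spec.split("/");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "expected codec/sampleRate/channels: " + spec);
        }
        codec = parts[0];
        sampleRate = Integer.parseInt(parts[1]);
        channels = Integer.parseInt(parts[2]);
    }

    public static void main(String[] args) {
        ConferenceQuality q = new ConferenceQuality("PCMU/8000/1");
        System.out.println(q.codec + " " + q.sampleRate + " " + q.channels);  // PCMU 8000 1
    }
}
```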
createWhisperGroup | cwg = <conferenceId> : <whisperGroupId> [: <attenuation factor>]
Create a whisper group in the conference. If the attenuation factor is 0, members in the group will hear each other only. If the attenuation is non-zero, members will hear the whisper group attenuated when they are talking in a different whisper group. Members talking in the whisper group will hear other whisper groups to which they belong attenuated.
endConference | ec = <conferenceId>
Terminate all of the calls in the conference.
deferMixing | dm = true | false
Default is false meaning that when a packet is received for a call it is put in the jitter buffer and the next packet to be taken from the jitter buffer is processed. If true, the next packet to be taken from the jitter buffer is deferred until it's time to send data to calls.
destroyWhisperGroup | dwg = <conferenceId> : <whisperGroup>
Remove a whisper group.
cnThresh | ct = <int>
This is a parameter of the speech detector. Default value is 50. It is the number of averages to test for determining whether someone is speaking or not. Changing this value changes the sensitivity of the speech detector.
dtmfSuppression | ds = true | false
Default is true meaning that when a dtmf key is detected, an attempt will be made to suppress the sound from being heard by other members.
Disconnect from the bridge and do not terminate any calls started on this connection.
doNotRecord | dnr = true | false [:<callId>]
Do not record data from this call.
prefixPhoneNumber | ppn = true | false
Default is true meaning that phone number prefixes such as 9 for an outside line will be prefixed.
Forces the log file buffers to be written to the file.
forcePrivateMix | fpm = true | false
Used for debugging.
forwardData | fd = <dest callId> : <src callId>

forwardDtmfKeys | fdtmf = true | false
This is used for two party calls where one party is a conference bridge that is to be joined to a conference.
Force a garbage collection.
getMixDescriptors | gmd [= <callId>]
Display the mix descriptors for a call. If the call id is omitted, then the mix descriptors for all calls are displayed.
Get bridge status. This shows the number of conferences and the number of calls.
help | h | ?
Display the list of commands.
incomingCallTreatment | ict = <treatment>
This is the default audio treatment to play for an incoming call.
incomingCallVoiceDetection | icvd = true | false
Default is true.
internationalPrefix | ip = <String>
This is the prefix to use for international calls. Default is 011.
localhostSecurity | lhs = true | false
Default is false meaning that a connection to the bridge can be made from anywhere. If true, connections to the bridge are only allowed from the machine on which the bridge is running. TODO: Extend this to allow a list of trusted hosts.
logLevel | l = [0 - 10]
Default log level is 3. Higher levels mean more log messages. 5 is often useful. Beyond 8 there are way too many messages!
longDistancePrefix = <String>
Phone number prefix for long distance calls. Default is 1.
maxJitterBufferSize | maxjb = <int> : <callId>
Default max jitter buffer size is 9 packets.
minJitterBufferSize | minjb = <int> : <callId>
Default min jitter buffer size is 3 packets.
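As an illustration of how the min and max sizes interact, here is a toy jitter buffer with the documented defaults. The drop policy for a full buffer is an assumption; the real bridge may discard the oldest packet instead:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative jitter buffer: playout does not start until minSize
// packets have arrived, and arrivals beyond maxSize are dropped.
public class JitterBuffer<T> {
    private final int minSize, maxSize;
    private final Deque<T> packets = new ArrayDeque<>();
    private boolean playing = false;

    public JitterBuffer(int minSize, int maxSize) {
        this.minSize = minSize;
        this.maxSize = maxSize;
    }

    public void add(T packet) {
        if (packets.size() >= maxSize) {
            return;                      // buffer full: drop the new packet (an assumption)
        }
        packets.add(packet);
        if (packets.size() >= minSize) {
            playing = true;              // enough packets buffered to start playout
        }
    }

    public T next() {
        if (!playing || packets.isEmpty()) {
            return null;                 // nothing to send yet
        }
        return packets.remove();
    }

    public static void main(String[] args) {
        JitterBuffer<Integer> jb = new JitterBuffer<>(3, 9);  // the documented defaults
        jb.add(1);
        System.out.println(jb.next());   // null: still below the 3-packet minimum
        jb.add(2);
        jb.add(3);
        System.out.println(jb.next());   // 1: playout has started
    }
}
```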
monitorCallStatus | mcs = true | false :<callId>
Monitor call status for a particular call.
monitorConferenceStatus | mcc = true | false :<conferenceId>
Monitor call status for all calls in a conference.
monitorIncomingCalls | mic = true | false
Monitor call status for incoming calls.
monitorOutgoingCalls | moc = true | false
Monitor call status for outgoing calls.
mute | m = true | false :<callId>
Mute a call.
muteWhisperGroup | mwg = true | false :<callId>
Mute the call in a particular whisper group.
muteConference | mc = true | false :<callId>
Mute the conference from the call. The call will not hear anything.
numberOfMembers | nm = <conferenceId>
Displays the number of calls in a conference.
outsideLinePrefix = <String>
Phone number prefix for an outside line. Default is 9. Two single or double quotes can be used to specify no outside line prefix.
pause | p = <int ms>
For debugging. Stops the receiver thread from receiving for the specified number of milliseconds.
pauseTreatmentToCall = <callId>
Pause the treatment to a call.
pauseTreatmentToConference = <conferenceId>
Pause the treatment to a conference.
voiceDetectionWhileMuted | vm = true | false [: <callId>]
When true, voice detection is enabled for the call even while it is muted.
playTreatmentToCall | ptc = <treatment> [:<callId>]
Play an audio treatment to the specified call.
playTreatmentToConference | pc = <treatment> [:<conferenceId>]
Play an audio treatment to all the members in the specified conference.
playTreatmentToAllConferences | pca = <treatment>
Play an audio treatment to all calls in all conferences.
packetLossConcealmentClass | plcc = <String plc class name> : <callId>
This is the class name to use for packetLossConcealment. The class must implement the interface com.sun.voip.Plc. Default is com.sun.voip.PlcCompress. The other implementation is com.sun.voip.PlcDuplicate.
powerThresholdLimit | ptl = <double>
This is a parameter for the speech detector. Default is 1.05. Changing this will change the speech detector sensitivity.
printStatistics | ps
Log statistics for all calls.
pmx = <frontBack -1 to 1> : <leftRight -1 to 1> :  <volume> : <callId> : <callId with private Mix>
Set a private mix for a call.
privateMix | pm = <volumes> : <callId> : <callId with private Mix>
recordConference | rc = true | false :<conferenceId> :<recording file path>
Record everything from the conference whisper group.
recordingDirectory | rd = <directory path>
Location of recording files
recordFromMember | rfm = true | false :<callId> :<recording file path>
Record data received from a call.
recordToMember | rtm = true | false :<callId> :<recording file path>
Record data sent to a particular call.
removeCallFromWhisperGroup | rcwg = <whisperGroupId> : <callId>
Remove call from whisper group.
releaseCalls = true | false
Release calls so that when this connection is closed the calls started from this connection won't be terminated.
removeConference | rconf = <conferenceId>
Remove a conference which was previously created.
restartInputTreatment | rit = <callId>
Restart an input treatment.
resumeTreatmentToCall = <callId>
Resume a paused treatment to a call.
resumeTreatmentToConference = <callId>
Resume a paused treatment to a conference.
rtpTimeout | rt = <seconds> 
This is the number of seconds by which either an RTP or RTCP packet must be received. If a packet is not received in this time, the call is terminated. The default is 330 seconds. This is useful if a softphone dies while it has a call in progress with the voice bridge.
sendSipUriToProxy = true | false
Default is false. If the sip URI can be resolved to a host and a port, the SIP INVITE will be sent directly to the host instead of the proxy.
conferenceJoinTreatment | jt = <join treatment>:<conferenceId>
This is the audio treatment to play when a call joins the conference.
showWhisperGroups | swg
Show information about all whisper groups.
setInputVolume | siv = <volume> : <callId>
Default is 1.0. This value is used to multiply each sample thus increasing or decreasing the volume from the call.
setOutputVolume | sov = <volume> : <callId>
Default is 1.0. This value is used to multiply each sample of audio sent to the call.
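Both volume settings work by multiplying each audio sample by the given factor. A sketch of this, assuming 16-bit samples; the clamping to the 16-bit range is an assumption:

```java
// Volume scaling as described above: each sample is multiplied by the
// volume factor. Results are clamped to the 16-bit sample range.
public class Volume {
    public static short[] scale(short[] samples, double volume) {
        short[] out = new short[samples.length];
        for (int i = 0; i < samples.length; i++) {
            double v = samples[i] * volume;
            if (v > Short.MAX_VALUE) v = Short.MAX_VALUE;   // avoid overflow distortion
            if (v < Short.MIN_VALUE) v = Short.MIN_VALUE;
            out[i] = (short) v;
        }
        return out;
    }

    public static void main(String[] args) {
        short[] louder = scale(new short[] {100, -100}, 2.0);
        System.out.println(louder[0] + " " + louder[1]);   // 200 -200
    }
}
```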
statistics | stat = <seconds>
Print bridge statistics periodically.

stopTreatmentToConference | sc = <conferenceId> : <treatment>
Stop playing a treatment to the members of a conference.
stopTreatmentToCall | stc = <callId> : <treatment>
Stop playing a treatment to a call.
synchronousMode = true | false
When true every command returns a status of success or failure.
defaultProtocol | dp = <protocol>
Default is SIP, which is also the only valid value.
defaultSipProxy | dsp = <ip address>
This is the host to which INVITE requests are sent when the bridge can't send them directly.
Write a message to the log and also send a packet to the discard port of the first voipGateway. This is solely for debugging.
traceCall = true | false : <callId>
Turn on more logging for an individual call.
transferCall | tc = <callId> : <conferenceId>
Transfer an incoming call to its target conference.
tuneableParameters | tp
Display all of the tuneable parameters
useSingleSender = true | false
Default is true. If false, a separate sender is used for each conference. If true, one sender is used for all conferences.
useTelephoneEvent = true | false
Use telephone event payload for dtmf keys if supported. Default is true.
voIPGateways | vgs = <ip address>[,<ip address>...]
This is the list of SIP / VoIP gateways used for calling to the PSTN. Each gateway is tried until one is successful.
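The failover behavior described here can be sketched as follows; the class name and gateway addresses are illustrative only:

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of gateway selection as described above: each configured
// SIP/VoIP gateway is tried in order until one accepts the call.
public class GatewayFailover {

    // Returns the first gateway for which placing the call succeeds,
    // or null if every gateway fails.
    public static String placeCall(List<String> gateways, Predicate<String> tryCall) {
        for (String gw : gateways) {
            if (tryCall.test(gw)) {
                return gw;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical gateways; pretend the first one is down.
        List<String> gateways = List.of("", "");
        String used = placeCall(gateways, gw -> gw.endsWith(".11"));
        System.out.println(used);   //
    }
}
```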
whisperGroupOptions | wgo = <conferenceId> : <whisperGroupId> [: Locked=t|f ]
[: transient=t|f]
[: attenuation=<double>]
[: noCommonMix=t|f]
Set whisper group options.
whisper | w = <whisperGroupId>: <callId>
Stop whispering in previous whisper group and start whispering in this whisper group.
whisperAttenuation | wa = <double>
Default whisper group attenuation.
writeThru | wt = true | false
When true the buffered log file will be flushed to the file every time a line is written.
<callId> is a String identifying the call
<whisperGroup> is a String identifying a whisper group


Initial audio problems are quite common due to the wide variety of audio hardware and operating systems. When using audio for the first time it is best to use a standard program to ensure audio is working.

Detailed instructions for audio setup can be found at*checkout*/lg3d-wonderland/src/classes/org/jdesktop/lg3d/wonderland/scenemanager/resources/audio.html

Once you are sure you have selected the right device and audio is working, it's time to try the softphone. Start the softphone as described in RUNNING THE SOFTPHONE. In the softphone window, click on Settings and select Test audio. Make sure the correct devices are selected for the mic and speaker, then click Test. If you hear the tone and your own voice, audio is working.

Here are some hints as to what to do when there is no audio.

Verify network connectivity

Make sure the machine running the softphone can communicate with the voice bridge.

Make sure the softphone has registered with the voice bridge.

There should be a green SIP address at the bottom of the softphone window. If the address is, the softphone will only work if the voice bridge is running on the same machine and also using as its IP address.

Make sure the softphone has connected to the bridge.

Line 1 in the softphone window should have both the red and green lights on.

Check the softphone log for errors.

If another program (such as another copy of the softphone) is already running and has the audio attached, there may be error messages indicating the audio system couldn't be started. Terminate all programs using the audio system and restart the softphone.

Check the voice bridge log for errors.

Look for IOExceptions and SIP exceptions which cause calls to END. A common problem is a bad IP address for either the bridge or the softphone. localhost ( can only be used if both the softphone and bridge use this address.

NAT / Firewall problems.

If the voice bridge is behind a firewall or NAT, then the SIP port (default 5060) and a range of RTP ports must be opened up for clients such as the softphone to use. If the softphone is behind a firewall or NAT, then it must be configured to use the voice bridge as the registrar. This can be done by passing the argument -r "<registrar IP address>;sip-stun:<registrar port>", or by going to Settings / Configure in the softphone window, selecting sip, setting the registrar address to <registrar IP address>;sip-stun:<registrar port>, and clicking Save and Close.

Verify that the bridge knows about the softphone call.

Use telnet to connect to the bridge. Display call information and mix descriptors using ci and gmd commands.

If audio is working but it sounds bad, try to determine the source of the problem. There are a number of things which could go wrong, and a number of tools to help identify the problem. The softphone can record the audio it sends and receives. The voice bridge can also record the audio it sends and receives. Both the softphone and the voice bridge can display graphs of the data received.

Packet loss.

If you're using a wireless network, there might not be enough bandwidth or packets may be getting dropped. In the softphone window, click on Call and select Performance Monitor. In the Performance Monitor window select Received Packets, Missing Packets, Average Receive Time, and Jitter. All of these graphs should be almost straight horizontal lines. Received Packets should be 25, Missing Packets should be 0, Average Receive Time should be 20, and Jitter should be a small constant. If you're seeing other values, you are likely having network problems.

Microphone Overrun

If someone else sounds choppy, it could be that they are overrunning their microphone. They should use Call / Record in the softphone to record the audio being read from their microphone. If that's bad, then they should go to Call / Settings / Media and set the microphone buffer size to a larger value such as 200 milliseconds. If that doesn't help, then run the bridge monitor in jvoicebridge/bridgemonitor. Enter the public address and ports for the bridge and press Monitor. In the right window, select the call for the softphone. In the window with the call's CallStatus, click Monitor. This will start the receive information graphs for the softphone.


Latency.

If there's a long delay between when someone speaks and when you hear them, try a simple latency test. Take turns counting: one person starts by saying "one" and as soon as the other person hears it they say "two", and so on. This gives a good idea of how long it takes for a message to reach you. If latency is high, try using a utility such as ping or traceroute to determine the actual latency.


Last updated 5/12/2008

This section describes the bridge's functionality as well as the current architecture and design.


The voice bridge is a Java application used to initiate phone calls and mix them together into meetings (a.k.a. conferences). Calls to the bridge use SIP and RTP, standard VoIP protocols described below. Call setup, control, and status are accessible over a control channel, which uses ASCII commands over a TCP socket.

The bridge application is written in 100% Java. It uses the NIST SIP Stack as a basis for the SIP implementation. On Solaris when run as root, it is scheduled in the Real-Time scheduling class to better meet timing requirements.

Below are two diagrams that describe the bridge functionality as well as the overall architecture:

Conference Mixing, The Basics#

The Voice Bridge handles a number of calls, each of which represents audio data received from and sent to a single source (e.g. a telephone or softphone). Each call is assigned to exactly one conference. In the simple case, each call receives audio from all the other calls in the same conference, so when Jon, Joe and Nicole are in a conference together, they each hear what any of the others say. The process of combining the audio data is called mixing.

The mixing itself is quite straightforward: audio is converted into a linear representation, and samples from every member are added together to create the output, which is then sent to each member minus the member's contribution. There are, however, a number of complicating factors. Mixing over IP means the bridge receives unreliable, packetized audio, and must account for latency, jitter and dropped packets. Bridge features such as private mixes and voice chats require more complicated mixing. Finally, the bridge receives a potentially continuous stream of data, and must do all of its work in real-time -- if it gets behind, the result is poor audio quality.
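The add-then-subtract scheme described above can be sketched in a few lines of Java. This is an illustrative fragment, not the bridge's actual code; the class and method names are invented for the example:

```java
// Illustrative sketch of conference mixing: sum every member's 16-bit
// linear samples into a common mix, then give each member the common
// mix minus its own contribution. Names are invented for this example.
public class MixSketch {

    // Clamp a mixed value into the 16-bit signed range to avoid wrap-around.
    static short clamp(int v) {
        if (v > Short.MAX_VALUE) return Short.MAX_VALUE;
        if (v < Short.MIN_VALUE) return Short.MIN_VALUE;
        return (short) v;
    }

    // Sum one 20ms frame from every member into a common mix.
    // Use int accumulators so the sum can exceed the 16-bit range.
    static int[] commonMix(short[][] frames) {
        int[] sum = new int[frames[0].length];
        for (short[] frame : frames)
            for (int i = 0; i < frame.length; i++)
                sum[i] += frame[i];
        return sum;
    }

    // The frame sent to one member: common mix minus that member's own input.
    static short[] outputFor(int[] commonMix, short[] ownFrame) {
        short[] out = new short[commonMix.length];
        for (int i = 0; i < out.length; i++)
            out[i] = clamp(commonMix[i] - ownFrame[i]);
        return out;
    }
}
```

With three members contributing samples 100, 200 and 300, the common mix sample is 600, and the first member receives 500 (the mix minus its own 100).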

Below, we discuss the protocols our bridge handles, as well as the key features that differentiate our bridge from others.

Protocols: Control, SIP and RTP#

The Control Protocol is a series of ASCII commands used to set up and control calls in progress. A complete list of commands can be found above.

An example of commands to place two calls in the same conference:

telnet escher 6666
conferenceId=Test Meeting
<blank line>
The bridge sends status back:
SIPDialer/1.0 100 INVITED CallId=...
SIPDialer/1.0 110 ANSWERED CallId=...
SIPDialer/1.0 200 ESTABLISHED...

conferenceId=Test Meeting
<blank line>

SIPDialer/1.0 100 INVITED CallId=...
SIPDialer/1.0 110 ANSWERED CallId=...
SIPDialer/1.0 200 ESTABLISHED...
The two calls are now in the same conference and each call can hear what the other is saying. More calls can be added to the conference by sending more commands to the bridge using the same conferenceId.

Session Initiation Protocol (SIP, RFC 3261) is used to initiate an audio connection between the voice bridge and other endpoints (called SIP user agents) such as a Cisco Gateway connected to the PSTN, a softphone, or a SIP IP phone. The result of a SIP negotiation is that both user agents know each other's UDP port number to which data should be sent. Both sides also agree on the sample rate, channels, and RTP payload number to use for sending RTP data.

Here is an example of a SIP exchange to initiate a call:

                  Voice Bridge                               SIP User agent
       INVITE (with SDP parameters) --------------------->
                                    <---------------------   OK (with SDP parameters)
       ACK                          --------------------->
                                    <==== RTP DATA ======>
       BYE                          --------------------->
                                    <---------------------   OK                                                                      
The Session Description Protocol (SDP, RFC 2327) sent with the INVITE contains lines such as these:
m=audio 62952 RTP/AVP  0 102 111

a=rtpmap:0 PCMU/8000/1
a=rtpmap:102 PCM/8000/2
a=rtpmap:111 PCMU/8000/2

The UDP port number at which the bridge will expect data for this call is 62952. The RTP payload types supported by the bridge are 0, 102, and 111. The rtpmap entries describe what each payload represents. The bridge also specifies that the preferred payload is 111, which is the payload for the sample rate and channels for the meeting. The default meeting parameters are PCMU/8000/2.

The other SIP User agent must pick one of the offered payloads to use in the SDP when it sends the OK. For example, it may reply with

m=audio 12345 RTP/AVP 111
which means the bridge must send data to UDP port 12345 for this call and use RTP payload 111.
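As a sketch, the port and payload list can be pulled out of an SDP m= line like this. This helper is illustrative only and is not part of the bridge's SDP handling:

```java
// Illustrative parser for an SDP media line such as
// "m=audio 62952 RTP/AVP 0 102 111". Not bridge code; a real SDP
// parser must handle the full grammar of RFC 2327.
public class SdpMediaLine {
    final int port;           // UDP port the sender expects data on
    final int[] payloadTypes; // offered RTP payload type numbers

    SdpMediaLine(String mLine) {
        // Tokens: "m=audio" <port> "RTP/AVP" <pt> [<pt> ...]
        String[] tokens = mLine.trim().split("\\s+");
        port = Integer.parseInt(tokens[1]);
        payloadTypes = new int[tokens.length - 3];
        for (int i = 3; i < tokens.length; i++)
            payloadTypes[i - 3] = Integer.parseInt(tokens[i]);
    }
}
```

Parsing the INVITE's media line from the example above yields port 62952 and payload types 0, 102 and 111; parsing the answer "m=audio 12345 RTP/AVP 111" yields the single chosen payload.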

Real-time Transport Protocol (RTP, RFC 1889) has a 12 byte header followed by audio data.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            contributing source (CSRC) identifiers             |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The payload type (PT) in the header indicates the type of data (e.g., 8000 ulaw, PCM/44100/2, etc.) There is a sequence number in the header which indicates the packet number in the stream. There is also a "timestamp" which indicates the starting sample number in this packet. For example, the first packet would have a timestamp of 0 and if there are 160 samples in the packet, the next packet would have a timestamp of 160.

An RTP packet generally contains 20ms of data. User agents are expected to send one data packet every 20ms (which is 50 packets per second). To give you an idea of packet sizes, for 8000 samples per second ulaw, each packet contains 160 8-bit samples + the 12 byte RTP header. For PCM/44100/2 there are 44100 / 50 * 2 16-bit samples which is 3528 bytes of data + the 12 byte RTP header.
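The packet-size arithmetic above can be checked with a small helper (illustrative only; the constant and method names are invented for this sketch):

```java
// Illustrative check of RTP payload sizes, assuming one 20ms packet
// every 20ms (50 packets per second). Not bridge code.
public class RtpSizes {
    static final int PACKETS_PER_SECOND = 50; // one packet per 20ms
    static final int RTP_HEADER_BYTES = 12;

    // Payload bytes per packet for a given sample rate, channel count,
    // and bytes per sample (1 for ulaw, 2 for 16-bit PCM).
    static int payloadBytes(int sampleRate, int channels, int bytesPerSample) {
        return (sampleRate / PACKETS_PER_SECOND) * channels * bytesPerSample;
    }
}
```

payloadBytes(8000, 1, 1) gives the 160 bytes of ulaw mentioned above, and payloadBytes(44100, 2, 2) gives 3528 bytes for PCM/44100/2; each packet then carries the additional 12-byte RTP header.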

Whisper Groups#

A whisper group is a subset of members in a conference. The members of the subset hear each other at full volume and the main conference attenuated. A member may be in more than one whisper group at a given time, but may only be talking in either the main conference or one whisper group. When a member is in multiple whisper groups and talking in one, the member hears all other voice data attenuated except for the whisper group the member is talking in.

Whisper groups are used as the basis of many features in Meeting Central, including voice chat, the waiting room, and dialing out.

Private mixes#

A private (a.k.a. customized) mix allows a member to adjust the volumes of other members as desired. With stereo this allows moving members around in the stereo field. This private mix is heard only by the one member who creates the mix -- everyone in the meeting can potentially have his own private mix for other members.

Codecs and Resampling#

We support the standard telephone quality of 8k ulaw encoding as well as some of our own defined encodings: PCM and PCMU encoding at sample rates 8k, 16k, 32k, 44.1k (cd-quality), and 48k, in either mono or stereo. We also support SPEEX encoding.

Each conference has a defined sample rate, which is the maximum sample rate for any call in the conference. Calls may be at lower sample rates, and will be resampled so they can be mixed with calls at the higher sample rate. A typical call might include several people on softphones at 16000/stereo and people on regular phones at 8k/mono.


Treatments#

The bridge has the ability to play audio clips, known as treatments, to all calls in a conference or to an individual call. These are used, for example, to prompt the user for input, or to signify that someone has joined or left the conference.

Speech and DTMF detection#

The bridge has a speech detector, which is used by Meeting Central to display speaking indicators. It also has a DTMF detector that can detect when you press a number on your telephone keypad, and determine which number was pressed.


Meetings (a.k.a. Conferences)#

A meeting has
  • String id
  • Single sender thread to schedule sending voice data every 20 ms
  • Single receiver thread to dispatch data from calls.
  • List of the members (calls) in the meeting.
  • Preferred sample rate and channels


Members#

A member is associated with a SIP user agent (phone call). A member is the object which joins a meeting. Members have private mix instructions and the socket used to send data to the user agent. The member is responsible for receiving an RTP stream of packets and for calculating what data should be sent to the call.

Receiver Thread#

Uses java.nio selectors. Waits for data to arrive from calls and dispatches it to the members of a meeting.

Mixer/Sender Thread#

Wakes up every 20ms, asks each member what data to send and sends data to each call. Uses available processors to divide up the work to send data to each call. (((Each thread is given a subset of the calls to handle. Perhaps it would be better to use one queue of work along with lock-free work stealing?)))

Basic mixing: the 20ms clock#

The meeting receiver thread waits for data to arrive from calls. Each packet represents data for a 20ms time period. When a packet arrives, the receiver thread dispatches the data to the appropriate member. The member decodes the packet to 16-bit signed linear data and resamples if necessary to match the sample rate for the meeting. The member appends the resulting packet to its private linked list of packets. The index of the packet in this linked list is used to index into a linked list for the common mix for the meeting. The member's data is mixed (by adding samples) into the appropriate common mix list element.

Every 20ms, the sender thread processes and removes the first element in every call's linked list plus the first element in the common mix. A packet is sent to each call containing the common mix, minus the call's contribution (if any) to the common mix.

Here is a graphical representation of the data to and from each call (A - F) when all the calls are in the main conference. Whisper groups could be shown by adjusting the appropriate arrows.

Whisper groups#

For private chats (a.k.a. whisper groups), mixing is more involved. This is accomplished by creating mix descriptors for each member in the whisper group, describing the audio from other members which should be sent to the member.

Private mixes#

Private mixes involve adjusting sample volumes by multiplying each sample by a factor to amplify (greater than 1) or attenuate (less than 1).

Private mixes are handled by each member when asked by the sender thread to calculate the data to send. The member adjusts the volumes of the other members as desired and creates a mix of data.
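The multiply-and-clamp described above can be sketched as follows (illustrative helper, not the bridge's code; names are invented for the example):

```java
// Illustrative per-sample gain for a private mix: multiply by a factor
// (greater than 1 amplifies, less than 1 attenuates) and clamp to the
// 16-bit signed sample range. Not bridge code.
public class Gain {
    static short apply(short sample, double factor) {
        long v = Math.round(sample * factor);
        if (v > Short.MAX_VALUE) return Short.MAX_VALUE;
        if (v < Short.MIN_VALUE) return Short.MIN_VALUE;
        return (short) v;
    }

    // Apply the same factor to a whole 20ms frame.
    static short[] apply(short[] frame, double factor) {
        short[] out = new short[frame.length];
        for (int i = 0; i < frame.length; i++)
            out[i] = apply(frame[i], factor);
        return out;
    }
}
```

Clamping matters: amplifying a sample of 30000 by 1.5 would overflow a 16-bit value, so the result is pinned to 32767 rather than wrapping around.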

Codecs and Resampling#

To upsample (go from a lower sample rate to a higher one), we interpolate to get a new sample between existing samples. This may involve looking across packet boundaries to avoid aliasing at the beginning and end of each packet. After upsampling, a low-pass filter is used to avoid adding high-frequency noise to the sample.

To downsample, we average a set of samples into one sample. This also involves looking across packet boundaries to avoid aliasing. Before downsampling, a low-pass filter is used to remove high-frequency noise that is another source of aliasing (heard as very slurred "s" sounds).
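For the common 2x case (e.g. 8k to 16k), linear interpolation looks roughly like this. This is a simplified sketch that deliberately ignores the packet-boundary handling and low-pass filtering discussed above:

```java
// Simplified 2x upsampling by linear interpolation: each output pair is
// the original sample followed by the midpoint to the next sample.
// A real resampler also interpolates across packet boundaries and
// applies a low-pass filter, as described in the text.
public class Upsample {
    static short[] upsample2x(short[] in) {
        short[] out = new short[in.length * 2];
        for (int i = 0; i < in.length; i++) {
            out[2 * i] = in[i];
            // The last sample is simply repeated here; a real
            // implementation would look into the next packet instead.
            int next = (i + 1 < in.length) ? in[i + 1] : in[i];
            out[2 * i + 1] = (short) ((in[i] + next) / 2);
        }
        return out;
    }
}
```

For example, the frame {0, 100} becomes {0, 50, 100, 100}: the interpolated 50 sits halfway between the two originals, and the final sample is duplicated for lack of a successor.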


Trying to get the mixer/sender thread to run at precise 20ms intervals is very difficult with standard Java. We strive to average 20ms. On Solaris we run in the Real-Time Scheduling class.

Garbage collection pause times are also a concern, but we have been able to choose the GC parameters so that the pause times don't interfere with the real-time requirements of the sender thread. We use ParNew with a small young generation of 3MB, and CMS for the old generation.


We have tested successfully with 400 calls at 8k/1 ulaw. The total sender thread cycle time must be less than 20ms. As the number of private mixes and whisper groups increases, more work needs to be done to calculate what data should be sent to each call.

It is possible to have more than one voice bridge, each with calls for the same conference. The user of the voice bridges is responsible for setting up virtual calls between the voice bridges so that private mixes can be set to route data for calls between the bridges. This is done in the voicelib, which is used by the Wonderland server.


We need to be able to quickly add new features to the bridge without changing or breaking existing functionality. Specifically, we would like to be able to easily add new commands, new codecs, and new forms of mixing. This will allow us to experiment more with the bridge without worrying about rearchitecting it every time.


The most common mixing is with all members in the main meeting. Things start to get more complex when using private mixes and whisper groups. There are objects to specify private mixes, and objects to specify whisper groups and mix descriptors to specify audio sources.

Private Mix Scalability#

This is an O(n²) problem. Each member is allowed to have a private mix for every other member, so in the worst case each member has to adjust the volumes of every other member.


There is potential contention between the sender and receiver threads. By taking a snapshot of the first element in each linked list involved with mixing while holding a lock, we can then process the data later without holding the lock. This turned out to improve scaling greatly. Previously, we held the lock while processing all of the data.

We have experimented with several designs for concurrency in the bridge, including managing threads by call, by conference, and in the bridge as a whole. We have also experimented with single, multiple, and pools of threads for dealing with both the input and output data from each conference. We would like to further explore how to optimize the use of threads and thread pools to handle input, mixing and output.

Control Channel API#

The control channel API is responsible for parsing and executing commands sent to the bridge. Currently, this is implemented by two classes:
  • RequestHandler reads input from the socket.
  • RequestParser parses the result.

The handler and parser are tightly coupled to each other and to the text-based input that the bridge currently supports. We have requests to provide different remote APIs, such as one based on XMPP and one based on JERI, but we cannot because of this coupling.

The collaboration server adapts the text-based API with a more object oriented approach:

  • Bridge represents the top-level connection to the voice bridge.
  • CallControl represents the actions for an individual call.


The mixer processes audio data from the calls in a conference, and produces output for each member of the conference every 20ms. The main classes associated with the mixer are shown in the diagram below:

The mixing process is divided into two separate tasks: receiving audio and sending audio. The next two sections describe these processes.

Receiving Audio#

Each call in a conference is associated with an RTP connection, which is differentiated by UDP port on the bridge. Each call periodically sends UDP packets, averaging one packet every 20ms. The actual interarrival time varies, for example some clients send two packets every 40ms. A client may send silence (packets of all 0s), or may send a comfort payload (an indication that no audio will be sent for a while), or may simply stop sending data. All these conditions must be treated as silence by the bridge.

Below is an interaction diagram describing how packets are handled as they are received:

An NIO-based SocketHandler receives data for all connections in a conference. When data is received, it is annotated with information about the Member that it came from and added to a queue of incoming packets to process. A pool of workers removes entries from this queue, and calls the audioReceived() method of the associated Member. The Member then processes the audio, converting it to a format appropriate for the current Mix, and performs additional functions such as speech and DTMF detection, all of which are encapsulated in an AudioProcessor. The Member then adds the data to its current Mix.

Sending Audio#

Every 20ms, the Bridge sends each member its audio data. This data is different for each member: even when everyone is hearing the same thing, each member's own contributions must be subtracted out so there is no echo when the member speaks. In addition, differing private mixes and whisper groups mean that each member hears different audio.

Below is an interaction diagram describing how packets are generated for each member every 20ms:

Every 20ms, the Timer takes a snapshot of each Mix in the conference, creating an immutable array of MixSnapshot objects. It then takes a snapshot of the set of Members in the conference, and adds each Member to a queue to send data to. A pool of workers removes the entries from this queue, and calls each Member's getAudio() method, passing in the MixSnapshots. The Member's getAudio() method performs the actual mixing for that member, taking into account the contributions for each member, as available from the MixSnapshot. Once the audio to send has been created, it is converted to a format appropriate to send using an AudioProcessor. The worker then constructs an RTP packet and sends it to the appropriate socket.


The Bridge object represents the bridge as a whole. Each conference in the bridge is represented as an instance of the ConferenceManager class. The Bridge object is primarily responsible for managing conferences, and includes methods to add and remove conferences, as well as manage the control channel.


The ConferenceManager object represents a single conference taking place in the Bridge. Each conference has a set of parameters, such as sample rate, associated with it, as well as Members and Mixes. Members represent the calls in a conference, and the conference manager lets you add and remove Members as calls are added to and removed from the conference. Mixes represent a subset of calls within a conference that all hear each other. Conferences have a "main" or "default" mix associated with them, and may create additional mixes such as waiting rooms or voice chats.


Every conference has a Timer, which is responsible for sending packets every 20ms. The Timer executes a loop every 20ms, first taking a snapshot of the current Mixes and Members, then enqueueing them for output by a pool of workers, and finally sleeping for the remainder of the 20ms. Java timers generally have 10ms resolution, so different implementations of Timer may use different strategies, such as calculating and correcting for mis-sleeps or using native methods to achieve a 20ms average send time. We have found that having a single sender thread for all conferences works well. This thread creates a list of work to do and dispatches as many threads as there are processors to do the actual sending.
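One way to approximate a 20ms average despite coarse timer resolution is to sleep toward absolute deadlines rather than for fixed intervals, so that oversleeping one tick shortens the next sleep instead of accumulating drift. A sketch of that strategy (not the bridge's actual Timer; the names are invented for the example):

```java
// Sketch of a drift-correcting 20ms tick loop. Deadlines are computed
// from the loop's start time, so any mis-sleep on one tick is corrected
// on the next. Not the bridge's actual Timer implementation.
public class TickLoop {
    static final long PERIOD_NANOS = 20_000_000L; // 20ms

    // Nanoseconds to sleep before tick n, given start time and "now".
    static long sleepNanos(long startNanos, long tick, long nowNanos) {
        long deadline = startNanos + tick * PERIOD_NANOS;
        return Math.max(0, deadline - nowNanos);
    }

    static void run(Runnable sendAll, long ticks) throws InterruptedException {
        long start = System.nanoTime();
        for (long tick = 1; tick <= ticks; tick++) {
            sendAll.run(); // snapshot mixes, enqueue members for sending
            long sleep = sleepNanos(start, tick, System.nanoTime());
            Thread.sleep(sleep / 1_000_000, (int) (sleep % 1_000_000));
        }
    }
}
```

If a tick's work plus sleep overshoots its deadline, sleepNanos() simply returns 0 for the next tick, pulling the loop back toward the 20ms average.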


The Member interface represents a single call within the conference. Each member is associated with an RTP connection, and stores the hostname and port where RTP data should be sent, as well as information about the negotiated RTP media information (sample rate, channels and encoding) for that connection.

Each Member is associated with one current Mix. This is the Mix that the member is speaking in, and all data received from the member will eventually be added to that Mix. The control channel notifies a Member that their Mix has changed using the setMix() method of the Member.

The Member maintains private information about how to create that Member's audio. The setPrivateMix() method is used by the control channel to adjust the volume level and location in the stereo field for other members of the conference. If no private mix is set for a given Member, their contribution is used directly. Similarly, the Member maintains a list of attenuations for each Mix in the conference, allowing the Member to individually adjust the volume level and location in the stereo space of a voice chat or the waiting room. The attenuation levels for the Member are set by the control channel using the setAttenuation() method.

Whenever a packet is received from the call associated with the member, the Member's audioReceived() method is called. The audioReceived() method is responsible for converting the received media into a format appropriate for the current Mix. This conversion is generally done through an AudioProcessor, as described below.

Every 20ms, the Member's getAudio() method is called to get the mixed audio for that Member. The Member uses the data in the MixSnapshots it is given, as the basis for generating their own private mix. Once the data is mixed, it is converted to an appropriate form for output, again using the AudioProcessor.


The Mix interface represents a group of Members in the conference who all hear the same audio. Each conference has one main Mix, and may have others to represent waiting rooms or voice chats. The Mix interface primarily represents the Members associated with the Mix. Internally, Mixes may be much more complicated, for example storing a pre-mixed value of all contributions that can be modified for each Member to create the Member's mix.

Every 20ms, the Timer calls the snapshot() method on each Mix. This generates an immutable MixSnapshot object, which can be used by the Members to construct their private mixes.


MixSnapshot is a tagging interface to represent the private data stored by the Mix at snapshot() time. The data is passed in to the Member, which can use it as appropriate to generate the Member's private mix.


A ConferenceFactory is responsible for creating Members and Mixes for the conference. The interfaces for Member and Mix specifically avoid any restrictions on data representation within the Mix and Member, or the implementation of the audioReceived() and getAudio() methods. This is intended to allow flexibility in trying out new data representations for mixing. The ConferenceFactory guarantees that the implementation class of the Member, Mix and MixSnapshot interfaces will be known ahead of time, so that Member and Mix objects can use methods outside the interface to store data and perform actual mixing.


One of the goals of this design is to allow experimenting with mixers with different data representations. Is the proposed architecture of ConferenceManager, Mix, Member, and ConferenceFactory a good abstraction? What problems will we likely run into if we try to replace the mixer?

  • How should we compose audio transforms?

We need to apply a number of audio transforms (filters, resampling, speech and DTMF detection) to each packet that comes in and goes out. How can we represent these transforms in a way that is composable and efficient?

List Driven Mixing#

The current bridge implementation maintains a list of descriptors for each member which describe the exact mix the member wants.

At any point in time, each member "knows" what data and volume level it wants from each other member.

With this approach, every member is talking in one and only one whisper group. The common mix is just another whisper group.

Each whisper group maintains its own mix for the members in the group.

A member starts out in the common mix so its descriptors would look like this:

  • common mix, volume level 1.0
  • member A, volume level -1.0

Minus 1 means to subtract out the data. This is typically used by a member to subtract out its own data so it doesn't hear an echo of what it's saying.

When the worker thread asks the member what data it wants, a mix is created using the descriptors above. In this case, it's the common mix minus the member's own data.

As another example, if member A has a private mix adjusting the volume for member B to 1.5, then member A's list has

  • common mix, volume level 1
  • member A, volume level -1
  • member B, volume level .5 (it's .5 because member B is already in the common mix at volume 1.0)
When this mix is created, it will have member B at the adjusted volume level.

Now let's assume that a whisper group is created such that when a member is talking in that whisper group, all other whisper groups are attenuated to .13. When the member belongs to the whisper group but isn't talking in it, the whisper group is attenuated to .13.

When member A joins whisper group W, the descriptors look like this:

  • common mix, volume level 1
  • member A, volume level -1
  • whisperGroup W, volume level .13

When A starts talking in W:

  • common mix, volume level .13
  • member A, volume level -1
  • whisperGroup W, volume level 1.0
Now let's add B to whisper group W, with B talking in that group. There's no change to A's descriptors because W already has B's data at volume level 1.0.

Now let's add a private mix for B with volume 1.5.

  • common mix, volume level .13
  • member A, volume level -1
  • whisperGroup W, volume level 1.0
  • member B, volume level .5

Now A stops talking in W and resumes talking in the common mix:

  • common mix, volume level 1.0
  • member A, volume level -1
  • whisperGroup W, volume level .13
  • member B, volume level .5 * .13

Sometimes it's necessary to play an audio message to either an individual member ("dial 1 to join your meeting") or to all the members of the conference (e.g., a join click when a new member joins). These audio messages are called treatments, and they too can be added to the mix easily with a descriptor for the treatment:

  • common mix, volume level 1.0
  • member A, volume level -1
  • whisperGroup W, volume level .13
  • member B, volume level .5 * .13
  • dialToJoinTreatment, volume 1.0

As you can see, the mix descriptors are updated when a private mix is created, when any member of a whisper group starts or stops talking, or when a treatment is added.

The advantage of this approach is that the descriptors are adjusted once and used until there are changes.

The mixing algorithm simply takes each descriptor and multiplies and adds the data to produce a mix.

For the simple case, only two descriptors are needed. For the most complex cases with lots of whisper groups and each member having a private mix for each other member, the number of descriptors will be the number of whisper groups plus the number of members.
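The descriptor walk can be sketched as a loop over (source, volume) pairs that multiplies and adds, exactly as described above. The class and field names here are invented for illustration and are not the bridge's actual types:

```java
// Illustrative mix-descriptor evaluation: each descriptor pairs an
// audio source's 20ms frame with a volume factor (1.0 = full,
// .13 = attenuated, -1.0 = subtract own contribution). The member's
// output is the clamped weighted sum. Names invented for this sketch.
public class DescriptorMix {
    static class Descriptor {
        final short[] frame;  // one 20ms frame from the source
        final double volume;  // scale factor for this source
        Descriptor(short[] frame, double volume) {
            this.frame = frame;
            this.volume = volume;
        }
    }

    static short[] mix(java.util.List<Descriptor> descriptors, int frameLen) {
        double[] sum = new double[frameLen];
        for (Descriptor d : descriptors)
            for (int i = 0; i < frameLen; i++)
                sum[i] += d.frame[i] * d.volume;
        short[] out = new short[frameLen];
        for (int i = 0; i < frameLen; i++) {
            long v = Math.round(sum[i]);
            out[i] = (short) Math.max(Short.MIN_VALUE,
                                      Math.min(Short.MAX_VALUE, v));
        }
        return out;
    }
}
```

For the simple two-descriptor case, if the common mix frame holds 1000 (including the member's own 600), the descriptors (common mix, 1.0) and (own data, -1.0) yield 400: the mix minus the member's echo.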

It is also possible to have a conference with no common mix. In this case, the mix descriptors for a member describe precisely what that member should hear. We use this with Project Wonderland, where only avatars in range of each other should hear the others' audio.

« This page (revision-6) was last changed on 07-Apr-2011 23:12 by Nicole Yankelovich