07-27-2014, 06:26 PM
Hey Guys,
Recently three of my older cards bricked (1 x HD 5970, 1 x HD 6990 and 1 x HD 7970), so I bought three HD 7970s.
The new cards are all GIGABYTE GV-7970C-3GD, which are factory-overclocked to 1000 MHz.
The problem is that when I run them in parallel, the OS hangs after 1-2 minutes.
I think the GPUs themselves are OK, because if I run them solo with -d 1, -d 2 and -d 3 the OS does not hang. It's only when I run at least two in parallel that the OS hangs, and it doesn't matter whether they run in a single oclHashcat instance or in multiple instances.
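For reference, this is roughly how I start the runs (a minimal sketch; hash mode, attack mode and the hashlist/wordlist names are placeholders, not my exact command line):
Code:
# -m/-a and the file names below are only placeholders
# one card at a time: works fine, no matter which device I pick
./oclHashcat64.bin -m 0 -a 0 -d 1 hashes.txt wordlist.txt
./oclHashcat64.bin -m 0 -a 0 -d 2 hashes.txt wordlist.txt
./oclHashcat64.bin -m 0 -a 0 -d 3 hashes.txt wordlist.txt

# two or more cards in one instance: OS hangs after 1-2 minutes
./oclHashcat64.bin -m 0 -a 0 -d 1,2 hashes.txt wordlist.txt
./oclHashcat64.bin -m 0 -a 0 hashes.txt wordlist.txt

# same hang with multiple parallel instances, one card each
./oclHashcat64.bin -m 0 -a 0 -d 1 hashes.txt wordlist.txt &
./oclHashcat64.bin -m 0 -a 0 -d 2 hashes.txt wordlist.txt &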
To sort out the problem I tried a lot of different scenarios, but now I am out of ideas.
First I tried the cards on two different systems that I have been using for quite some time with other GPUs:
1st:
- Intel i7-4770K
- ASUS Z87-Expert
- Ubuntu 14.04 LTS, 64 bit
2nd:
- Intel i7-4770K
- ASUS Z87-A
- Ubuntu 12.04 LTS, 64 bit
On both systems the behavior is exactly the same, and since I have used these systems with other cards before, my feeling is that there is no hardware defect in the boards, CPUs or RAM.
More information about the hardware:
- The original coolers have been removed and replaced with EK water cooling blocks. The cards are connected in series with a water cooling bridge, and the coolant flow works fine
- There are no extender cables/risers involved. All cards sit directly on the board
- All cards run headless; none of them is connected to a monitor
Heat: the GPUs run at ~40 °C when idle and rise to ~55 °C under load before the OS hangs. There is no specific temperature threshold that triggers the hang; it happens somewhere above 50 °C.
Power: each GPU has a dedicated 700 W power supply
Things I tried to change:
- Tried Catalyst 14.4 and 14.6 beta on both systems; always ran amdconfig --initial -f --adapter=all afterwards and rebooted (exact commands in the sketch after this list)
- Updated the mainboard BIOS to the latest version (1803)
- Updated the GPU BIOS on all cards to the latest version (F72)
- Manually switched the PCIe setting in the BIOS from x16 to x1
- Manually disabled ASPM in the BIOS
- Manually disabled all other power-management related settings in the BIOS
- Underclocked the cards to stock HD 7970 clocks (925/1375 MHz)
- Attached the original fans to the fan headers on the cards
- Swapped the GPU positions (1 to 2, 2 to 3, 3 to 1, etc.)
- Disabled the IOMMU on the kernel command line (see the sketch after this list)
- Blacklisted the mei and mei_me modules
- Tried with only two cards
- Tried both ALU-intensive and memory-intensive algorithms
- Bought a new mainboard (AMD AM3+ with 990FX chipset) to make sure it's not the Z87
- Flipped the onboard BIOS switch to "2" to use the F70 BIOS that ships as the default
- Switched back to "1" and tried to flash a reference HD 7970 BIOS. This nearly bricked the card, as it caused instant kernel reboots, so I flashed it back to F72, which is the latest version
- Installed a fresh Windows 7 (64 bit) and tried it on Windows
- Attached the crossfire bridges
- Removed the crossfire bridges
- dmesg didn't say anything useful
- The X11 log didn't say anything useful
- Replaced the PSUs with other ones
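For the driver re-init, the IOMMU and the module-blacklist items above, these are roughly the commands I used on Ubuntu (a sketch; the kernel parameter spelling and file paths are what I used here, other distributions may differ):
Code:
# after installing Catalyst: regenerate xorg.conf for all adapters, then reboot
amdconfig --initial -f --adapter=all
reboot

# disable the IOMMU via the kernel command line, then update grub and reboot
# (iommu=off is what I used; the exact parameter may differ per platform)
# in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=off"
update-grub

# blacklist the Intel MEI modules and rebuild the initramfs
cat >> /etc/modprobe.d/blacklist.conf << EOF
blacklist mei
blacklist mei_me
EOF
update-initramfs -u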
One thing to note: when I disable X11 (so that ADL can't work and oclHashcat cannot read temperatures etc.), it looks like this:
Quote:
Speed.GPU.#1...: 15891.9 MH/s
Speed.GPU.#2...: 0 H/s
Speed.GPU.#3...: 0 H/s
Speed.GPU.#*...: 15891.9 MH/s
... and when I then continuously press "s", it seems #1 continues to work ...
But when I have X11 enabled and temperatures are read, it always looks like this:
Quote:
[s]tatus [p]ause [r]esume [b]ypass [q]uit =>
Speed.GPU.#1...: 15873.6 MH/s
Speed.GPU.#2...: 15871.1 MH/s
Speed.GPU.#3...: 15880.4 MH/s
Speed.GPU.#*...: 47625.2 MH/s
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.......: 1567903186944/6634204312890625 (0.02%)
Skipped........: 0/1567903186944 (0.00%)
Rejected.......: 0/1567903186944 (0.00%)
HWMon.GPU.#1...: 98% Util, 41c Temp, 29% Fan
HWMon.GPU.#2...: 98% Util, 41c Temp, 29% Fan
HWMon.GPU.#3...: 98% Util, 42c Temp, 31% Fan
[s]tatus [p]ause [r]esume [b]ypass [q]uit =>
ERROR: Temperature limit on GPU 2 reached, aborting...
The system is completely frozen at this point.
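For completeness, this is roughly how I switch between the two cases (a sketch of my setup; lightdm is the display manager on my Ubuntu boxes, and the oclHashcat arguments are placeholders as above):
Code:
# X11 / ADL disabled: stop the display manager, then run oclHashcat
service lightdm stop
./oclHashcat64.bin -m 0 -a 0 hashes.txt wordlist.txt

# X11 / ADL enabled: start the display manager and point oclHashcat at it
# (lightdm is an assumption: the default DM on Ubuntu 12.04/14.04)
service lightdm start
export DISPLAY=:0
./oclHashcat64.bin -m 0 -a 0 hashes.txt wordlist.txt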
Another interesting thing: looking at the lspci output, the cards negotiate different PCIe link speeds/widths and ignore my manual x1 setting from the BIOS:
Code:
root@et:~# lspci -vv | grep -e "VGA " -e Width
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller])
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller])
LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller])
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
This is from Ubuntu 14.04.
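To check the negotiated link state per card, I just filter the lspci output for the bus IDs shown above (sketch):
Code:
# bus IDs taken from the lspci output above
for id in 01:00.0 02:00.0 05:00.0; do
  echo "== $id =="
  lspci -vv -s "$id" | grep -e LnkCap -e LnkSta
done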
---
Updated with the latest tests to keep the list above complete.