2021-09-18 01:20 CEST

View Issue Details Jump to Notes ]
IDProjectCategoryView StatusLast Update
0006431Spring engineGeneralpublic2021-01-04 05:11
ReporterFabriceFABS 
Assigned To 
PrioritynormalSeverityminorReproducibilityalways
StatusfeedbackResolutionreopened 
Product Version105.0 
Target VersionFixed in Version 
Summary0006431: memleak in unitsync?
DescriptionFYI, we worked with Bibim a couple of hours on it.
All is in the title... :
Unitsync fails randomly when starting spads multi instances

After some work, I/we've found several tests :

#### TEST 1 ####
If I got my regular ./spring/maps number in it (~ 500), on first instance of spads launch, it runs, for exemple [ACE]Been, no problem.
FYI [SPADS] Spring archives loading process took 16 minutes and 53 seconds to generate its first cache.

A second instance of spads, for exemple [ACE]Mettalic, fails on this step :
******************
NOTICE - [SPADS] Initializing SPADS 0.12.21
NOTICE - [SPADS] Loading Spring archives using unitsync library version 105.0
Processus arrêté (process ended)
******************
No log available, however, bibim modified code that permits to see that it fails during the scan of the maps archives.
We had to remove bilateral.sdz (1st fail «Processus arrêté») and on Eisland.sdz (2nd fail «Processus arrêté»).
We notice that finally this instance of [ACE]Metallic can be run without those 2 maps.
I opened those 2 maps, and didn't found actually anything wrong, to be checked.
Both are in "old" format : ie > only 3 files in it.

#### TEST 2 ####
./spring/maps contains only 6 maps :
Comet Catcher Redux, DeltaSiegeDry, Red Comet, Red Comet v1.3, Throne v5 & TheRockFinal.
Nothing wrong to run several instance, no problem.
However, I found that it stops a long time on the archives, on certains hosts, it seems frozen on this step, even if there are only 5 maps.

#### TEST 3 ####
Remove of .spring[Pool/Rapid/Packages] (mv to Temp)
Launching [ACE]Been => Unbelievable... Took 10s to scan maps, surely due that the cache already done above.
Next instance of spads failed the same.
Replace back [Pool/Rapid/Packages] into ./spring
All spads server has been !quit

#### TEST 4 ####
After this, run only [ACE]Been on 105.0 > OK
Another 105.0 instance is trying to launch on [ACE]Fusion but failed around this map archive scan :
...
mapName=4B_facility
mapName=4_bases_v3
mapName=Aberdeen3v3v3
mapName=Aberdeen6v6_Fix
Processus arrêté

#### TEST 5 ####
Run only [ACE]Been on 105.0 > OK
Archive cash already done and ok, scan took ~1 sec during spads launch.
cp of [ACE]Been/cache/105rel- to [ACE]*/cache/105rel-
Result : All other hosts runs correctly and launched fine and fast like [ACE]Been.

*** CONCLUSION ***
So... I know that #### TEST 5 #### is clearly not the solution, but it works.
There's something wrong with unitsync scans archives when there's already a spads's host running fine with 105.0 engine.

Sorry for my english, don't hesitate to ask me more if I'm not understandable enough or for some extra test.
Steps To Reproduce1. You need to run more than 1 instance of Spads to reproduce it,
2. Others instances that don't have archives in cache will fail randomly,
3. Others instance that already got the archive (#### TEST 5 ####) cache will run.
Additional InformationVery tired, time to sleep... :-O
TagsNo tags attached.
Checked infolog.txt for ErrorsIrrelevant
Attached Files

-Relationships
+Relationships

-Notes

~0020577

FabriceFABS (reporter)

--- 3 new additives tests which deserves all its importance ---

### TEST 6 ###
I stopped [ACE]Beep, and placed back Eisland.sd7 and bilateral.sdz in .spring/maps
(In fact I forget also to make my tests with those 2x problematics maps but I do it now)

When launching back Spads, it stops during map scanning @Eisland.sd7 with the floowing output :
***
...
mapName=EE-Deserted_Sinkhole-v02
mapName=Earatope_V3
mapName=Easter Bunny v2 Horse
mapName=Eisland
Processus arrêté
$
***
Same for bilateral.sdz.

### TEST 7 ###
During the file operation above while replacing those 2 problematics maps,
Spads currently running on others instances (105) detected new maps, and updates cache.
This generated a crash with the following :
***
...
mapName=EE-Deserted_Sinkhole-v02
mapName=Earatope_V3
mapName=Easter Bunny v2 Horse
mapName=Eisland
ERROR - [SPADS] Failed to load archives, undefined mods data (unitsync library crash ?)
NOTICE - [SPADS] AutoHost shutdown scheduled (reason: Unable to auto-reload Spring archives)
Can't use an undefined value as a subroutine reference at /Spads/SimpleEvent.pm line 536.
$
***
Some hosts crashed, some others seems ok

**/Spads/SimpleEvent.pm**
sub _checkSimpleSockets {
  if(%sockets) {
    my @pendingSockets=IO::Select->new(keys %sockets)->can_read($conf{timeSlice});
    foreach my $pendingSock (@pendingSockets) {
      &{$sockets{$pendingSock}}($pendingSock); <<<<<<<<< Line 536
    }
  }else{
    Time::HiRes::usleep($conf{timeSlice} * 1_000_000);
  }
}
************************

### TEST 8 ###
Bad maps above removed, spads map's cache directory renamed.
Running @ the same time [ACE]Been & [ACE]Censeur.
They are both generating map cache...
Cache done ~ 16 min, 2x hosts running well.

So the final issue is :
Some maps makes problem with Spring105 unitsync and make it crashes during the (RE)/Generation of the cache.

~0020578

abma (administrator)

can you provide unitsync.log after it "crashed" please? it should be in ~/.config/spring

yeah, because of the new release, the old cache isn't used any more.

~0020579

abma (administrator)

for the reference:

https://api.springfiles.com/?filename=bilateral.sdz
https://api.springfiles.com/?filename=eisland.sd7

~0020580

abma (administrator)

out of memory? apparmor?

-> check the ouput of dmesg

out of disk space?

-> check the output of "df"

where does the error message "Failed to load archives, undefined mods data (unitsync library crash ?)"

come from?

i can't find such a string in spads or springs git repos:

https://github.com/Yaribz/SPADS/search?q=Failed+to+load+archives

https://github.com/spring/spring/search?q=Failed+to+load+archives

~0020581

abma (administrator)

it was OOM:

[19:15:41] <[ACE]FabriceFABS> Jan 3 19:05:07 FABS-Serveur kernel: [ 2172.050413] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=perl,pid=70446,uid=1000
[19:15:41] <[ACE]FabriceFABS> Jan 3 19:05:07 FABS-Serveur kernel: [ 2172.050423] Out of memory: Killed process 70446 (perl) total-vm:354776kB, anon-rss:285496kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:660kB oom_score_adj:0

~0020582

FabriceFABS (reporter)

However, I allow myself to add a note to this diagnosis.
I did monitor seriously the processes, and return the following results :

- An instance of Spads consumes ~55MB of RAM without any map cache generate.
- An instance of Spads running unitsync for generate the map cache is not the same :
During the scan, Spads with Unitsync running uses between 80MB and upto 450MB peaks, but in general it stays around the use of a range of 150MB to 250MB during this process.
Once the map scan is performed with the generation of the cached map file from unitsync, Unitsync does not flush memory it used and keep using ~160MB, even after the caching work has been performed.

Only way to get back memory is to kill the instance of Spads and restart it.
This is not normal and may be problematic.
UnitSync won't have to run just after, so it will be ok for normal memory usage.

~0020583

FabriceFABS (reporter)

Have a loop of the map memory allocation once the unitsync process ended :
Heap : 140MB, normal use is 50MB on any other perl instance that didn't run unitsync.

~0020584

abma (administrator)

> Once the map scan is performed with the generation of the cached map file from unitsync, Unitsync does not flush memory it used and keep using ~160MB, even after the caching work has been performed.

as the peak usage is much higher than permanent usage when map files are scanned this IMHO is ok.

a quick lock into the source code:

https://github.com/Yaribz/SPADS/blob/1ab55e3c3598c7305d8b9d0d4ed1feefc52bb80d/src/getDefaultModOptions.pl#L376

PerlUnitSync::RemoveAllArchives();

isn't called before PerlUnitSync::UnInit() for the last "availableMods": this maybe cleans up the 100MB.


i had several times run unitsync compiled with -fsanitize=memory which shows memory leaks and the last time i remember had none.

as long as it doesn't permanent leak memory and memory usage increases over time, i see this as a minor issue. my time is way to limited to investigate it.

if you want to investigate it more: compile unitsync with CMAKE_BUILD_TYPE=DEBUG and CMAKE_CXX_FLAGS=-fsanitize=address and provide the output of "direct leaks".


+Notes

-Issue History
Date Modified Username Field Change
2021-01-03 04:26 FabriceFABS New Issue
2021-01-03 14:44 FabriceFABS Note Added: 0020577
2021-01-03 18:29 abma Note Added: 0020578
2021-01-03 18:41 abma Note Added: 0020579
2021-01-03 19:09 abma Assigned To => abma
2021-01-03 19:09 abma Status new => feedback
2021-01-03 19:09 abma Note Added: 0020580
2021-01-03 19:17 abma Status feedback => resolved
2021-01-03 19:17 abma Resolution open => fixed
2021-01-03 19:17 abma Note Added: 0020581
2021-01-04 00:26 FabriceFABS Status resolved => feedback
2021-01-04 00:26 FabriceFABS Resolution fixed => reopened
2021-01-04 00:26 FabriceFABS Note Added: 0020582
2021-01-04 00:33 FabriceFABS File Added: Capture.jpg
2021-01-04 00:33 FabriceFABS Note Added: 0020583
2021-01-04 00:33 FabriceFABS Status feedback => assigned
2021-01-04 05:10 abma Assigned To abma =>
2021-01-04 05:10 abma Status assigned => feedback
2021-01-04 05:10 abma Note Added: 0020584
2021-01-04 05:11 abma Summary Unitsync fails randomly when starting spads multi instances => memleak in unitsync?
2021-01-04 05:11 abma Checked infolog.txt for Errors Irrelevant => Irrelevant
2021-01-04 05:11 abma Priority high => normal
2021-01-04 05:11 abma Severity crash => minor
2021-01-04 05:11 abma Checked infolog.txt for Errors Irrelevant => Irrelevant
+Issue History