springfiles.com mirroring system

abma
Spring Developer
Posts: 3798
Joined: 01 Jun 2009, 00:08

springfiles.com mirroring system

Post by abma »

Edit:

https://github.com/springfiles/upq
is now used for mirroring files; all info below is deprecated:

How it currently works in short:

A user uploads a file, and Drupal (the content management system) creates an empty file.

The cron daemon runs a script which:

checks whether the empty file exists; if so, it deletes it and starts a script which extracts metadata from the uploaded file.
Then the file is uploaded to all mirrors. After an upload, the file's md5 is validated by a PHP script on the mirror server; if the file is valid, it is marked as valid in the database, and the xml-rpc API will return it as a result.

The cron script currently runs every 10 minutes, so uploaded files should be available on all mirrors (depending on upload speed) within at most 15 minutes.
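
For illustration, the cron logic is roughly this (a sketch; all paths and helper commands are invented, the real job is PHP code inside Drupal):

Code: Select all

#!/usr/bin/env python
# Rough sketch of the cron job described above. All paths and helper
# commands are invented for illustration.
import os
import subprocess

UPLOAD_DIR = "/var/www/uploads"    # hypothetical
MIRRORS = ["mirror1.example.org"]  # hypothetical

for name in os.listdir(UPLOAD_DIR):
    marker = os.path.join(UPLOAD_DIR, name)
    # Drupal leaves an empty file as a "new upload" marker
    if os.path.getsize(marker) != 0:
        continue
    os.remove(marker)
    # extract the metadata, then push to every mirror; a PHP script
    # on each mirror validates the md5 afterwards
    subprocess.call(["extract_metadata", name])
    for mirror in MIRRORS:
        subprocess.call(["upload_to_mirror", mirror, name])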

The modules for xml-rpc and mirroring are available on GitHub: https://github.com/springfiles (if someone is interested)

My personal wishlist:
- replace the PHP upload (within a Drupal cron job) with a standalone Python script, which could be reused more easily on other sites, too

- make something like an archive: old files are still recognized but no longer distributed to the other mirrors

- add some hooks to allow other services to be notified about new files

- improve the script that extracts the metadata: https://github.com/springfiles/searchap ... xporter.py so that it extracts all the data shown on http://mapinfo.adune.nl
Last edited by abma on 05 Apr 2012, 04:31, edited 1 time in total.
dansan
Server Owner & Developer
Posts: 1203
Joined: 29 May 2010, 23:40

Re: springfiles.com mirroring system

Post by dansan »

abma wrote: - replace the PHP upload (within a Drupal cron job) with a standalone Python script, which could be reused more easily on other sites, too
I could do that - I love Python :)
As an interface, I suggest the Python script take command-line arguments from Drupal and, for the return path, call a command-line PHP script that does the DB work. That way all the DB code stays in the webapp (in one place).
I'd implement a queue system, so that the number of simultaneous uploads is configurable.
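
A minimal sketch of what I mean (script names and arguments are invented):

Code: Select all

#!/usr/bin/env python
# Sketch of the proposed interface: Drupal calls this script with the
# file to mirror; results go back through a PHP CLI script so that
# all DB code stays in the webapp. Names are invented.
import subprocess
import sys

def upload_to_mirrors(path):
    return True  # placeholder for the actual upload logic

def main():
    fileid, path = sys.argv[1], sys.argv[2]
    ok = upload_to_mirrors(path)
    # report the result back to the webapp via a PHP CLI script
    subprocess.call(["php", "update_mirror_db.php", fileid,
                     "ok" if ok else "failed"])

if __name__ == "__main__":
    main()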
smoth
Posts: 22309
Joined: 13 Jan 2005, 00:46

Re: springfiles.com mirroring system

Post by smoth »

HEY DAN! glad you made it to the forums! Also good on ya for offering to help!
dansan
Server Owner & Developer
Posts: 1203
Joined: 29 May 2010, 23:40

Re: springfiles.com mirroring system

Post by dansan »

Hi smoth :)

I'm glad I can give something back / contribute.
abma wrote: - make something like an archive: old files are still recognized but no longer distributed to the other mirrors
* How can old and new files be distinguished?
* Why are old files uploaded to the mirrors anyway, if the cronjob only uploads new files?
* Maybe it's desirable to upload "the archive" to at least 1 mirror for backup reasons, but not to all?
abma
Spring Developer
Posts: 3798
Joined: 01 Jun 2009, 00:08

Re: springfiles.com mirroring system

Post by abma »

* How can old and new files be distinguished?
That should be done by hand, something similar to rapid:

You have a game with versions like Zero-K v0.9, Zero-K v0.91, ... and you can mark one version as stable and another as testing.
But this would require automatically detecting which game a file belongs to.

Rapid can already do that. (http://packages.springrts.com/)
* Why are old files uploaded to the mirrors anyway, if the cronjob only uploads new files?
Because there are no "old" files; files aren't currently marked as old or outdated.
Maybe it's desirable to upload "the archive" to at least 1 mirror for backup reasons, but not to all?
Yes, if it's possible to select these files, that could be done. I think it would be best for mirrors with large storage to do that; smaller mirrors would simply delete the file.
abma
Spring Developer
Posts: 3798
Joined: 01 Jun 2009, 00:08

Re: springfiles.com mirroring system

Post by abma »

dansan wrote: As an interface, I suggest the Python script take command-line arguments from Drupal and, for the return path, call a command-line PHP script that does the DB work. That way all the DB code stays in the webapp (in one place).
I'd implement a queue system, so that the number of simultaneous uploads is configurable.
My idea was to make a Python program that runs permanently and has something like a socket to be notified about new files.

When notified, the Python program runs the indexing process for the file, writes the data to the DB, and uploads the file(s).

The footprint of that listener should be really low, as the program only has to listen for notifications.
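
For illustration, such a listener could be as small as this (the socket path and the handler are invented):

Code: Select all

# Minimal sketch of the permanently running listener: it blocks on a
# Unix socket and receives one filename per connection. The socket
# path and the handler are invented.
import os
import socket

SOCK = "/tmp/upload-notify.sock"  # hypothetical

def handle_new_file(name):
    pass  # placeholder: index the file, write to DB, upload

if os.path.exists(SOCK):
    os.remove(SOCK)
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(SOCK)
srv.listen(1)
while True:
    conn, _ = srv.accept()
    filename = conn.recv(4096).strip()
    conn.close()
    handle_new_file(filename)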

When run by cron / without DB access, there are some disadvantages:

- cron has to poll for changes (polling is nearly always bad), and there will be a delay (cron can run at most every minute...)
- the program has to know usernames/passwords; passing them as command-line parameters is a bad idea, because all users on the host can see them, so I think direct DB access is fine
- what if something goes wrong during uploading: retry endlessly, or mark it as failed in the DB and retry after xx hours?

The current tables that store the mirror data are nearly independent of Drupal, so they could be reused easily. I also think that the general structure won't change; only some rows will be added.
Licho
Zero-K Developer
Posts: 3803
Joined: 19 May 2006, 19:13

Re: springfiles.com mirroring system

Post by Licho »

You could also use a simple rsync push: have a folder with "verified" files and simply rsync-push that to the remote server.

Rsync can use ssh keys for security (no need for password stuff).

There is an rsync server service for Windows too (DeltaCopy), so you get a universal multiplatform system without much effort, and you don't have to care about interrupted transfers or updating corrupted files.
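
For illustration, such a push could look like this (host, key, and paths are invented):

Code: Select all

# Sketch of the rsync push described above; host, key and paths are
# invented. The ssh key replaces any password handling.
import subprocess

subprocess.check_call([
    "rsync", "-av", "--partial",
    "-e", "ssh -i /home/mirror/.ssh/id_rsa",   # key-based auth
    "/srv/verified/",                          # local "verified" folder
    "mirror@mirror1.example.org:/srv/files/",  # remote mirror
])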
abma
Spring Developer
Posts: 3798
Joined: 01 Jun 2009, 00:08

Re: springfiles.com mirroring system

Post by abma »

Please don't start the discussion from the beginning. Plain rsync is imo too inflexible: we need something that triggers these uploads, and PHP on the website can't do that. I don't want an rsync job that runs regularly; it should only run when new files are waiting for upload, and retry when some uploads have failed. I also want something reusable. rsync would also mean uploading 30 GB at once for new mirrors, and it produces high load on re-runs as it scans through all files. :-/

The current system works very nicely, but it's difficult to debug/test/improve...

The idea I have is something like this:

Code: Select all

wait for new files
     call hooks for pre-upload
     call hooks for upload
     call hooks for post-upload
A hook is a configurable script which is called with some parameters (maybe fileid, filename, size).
Most of the effort will go into these hooks/scripts; the Python script itself should be very small and simple.

This hook could be a script that calls rsync, too.
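
A sketch of that hook mechanism (hook paths and the config layout are invented):

Code: Select all

# Sketch of the hook idea: each phase runs configurable external
# scripts with the file's id, name and size as parameters. The hook
# paths are invented.
import subprocess

HOOKS = {
    "pre-upload":  ["/etc/upq/hooks/extract_metadata"],
    "upload":      ["/etc/upq/hooks/ftp_upload"],  # could call rsync
    "post-upload": ["/etc/upq/hooks/verify_md5"],
}

def run_hooks(phase, fileid, filename, size):
    for hook in HOOKS[phase]:
        # a non-zero exit code raises and marks the job for a retry
        subprocess.check_call([hook, str(fileid), filename, str(size)])

def process_new_file(fileid, filename, size):
    for phase in ("pre-upload", "upload", "post-upload"):
        run_hooks(phase, fileid, filename, size)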
Licho
Zero-K Developer
Posts: 3803
Joined: 19 May 2006, 19:13

Re: springfiles.com mirroring system

Post by Licho »

I don't understand what this discussion is about then :)

Use an "uploading" service (can be in Python) + some IPC?
On Linux that could be a named pipe where you dump the file name + parameters to the process.
dansan
Server Owner & Developer
Posts: 1203
Joined: 29 May 2010, 23:40

Re: springfiles.com mirroring system

Post by dansan »

I proposed cmd line args as the communication method because I thought it's the most portable way.
What are passwords needed for? (The DB password is static and can be stored in a config file.)


<disclaimer>
Here comes a very long :!: text.... sorry about that.... I'm not a pro programmer, so I'm a little insecure about the approach :oops:, and I want to discuss this before jumping into the work. Also: time is precious for me, so I don't want to do unneeded things.
</disclaimer>

Because of the store-and-forward nature of the upload/mirror process, I had the idea to mimic my beloved Postfix a bit.

All SMTP servers must consider:
* at any time a new job can come in
* a job can result in lots of work
* while a job is running more jobs can arrive
* each job can result in multiple new jobs
* process count and bandwidth must be throttable
* a lot of things can go wrong
* data must never ever get lost
* after a (forceful) shutdown the server must continue where it left off (consistent and persistent states)

The Postfix model (as opposed to sendmail == monolithic):
* separate processes for different tasks (well maintainable code, processes can have diff. permissions, processes can run inside/outside a chroot, processes can run on diff. hosts)
* simple interfaces
* IPC over socket (transparent file or network)
* queues are used to control the server load
* full architecture overview: http://www.postfix.org/OVERVIEW.html

OK, so here is the queue model for the upload system:

Code: Select all

                  queue           queue
socket   \         |               |
  or      >-- incoming-mgr -- outgoing-mgr
cmd line /         |           |    |   |
               meta-extr      FTP  MD5  DB
Queues are directories that contain text files with the job info. The "data" is never passed around, as it is a file already stored on disk, and there is no need to move it (and if there is, then make it a job).

From left to right:
IMO it doesn't really matter if the input gets communicated to a socket of a running daemon or by cmd line call (starting a prog) that results in a program doing exactly the same :)
The result in both cases: a job description gets written to a file in the incoming-queue-directory. Then the incoming-queue-mgr gets active (socket call) and calls the appropriate incoming-hook: an external program.
The external program moves the job-file to its working-queue-directory (afaik filesystem moves are atomic) and does its job (metadata extraction, DB entry). Upon completion it exits with an exit code (success/failure), possibly appending info to the job-file before moving it back to the incoming-queue-directory.
The incoming-queue-mgr reacts according to the exit code:
success -> run next ext.prog or move job to outgoing-queue(-mgr)
failure -> move job to incoming-deferred-queue, contact admin / retry after x min / call SF error func

The outgoing-queue-mgr starts the external programs for FTP-transfer, md5 local-remote-diff, DB entry for the job in this order. It reacts according to the exit code much like the incoming-queue-mgr.

That makes 2 running daemons that know how many ext. progs they have started. When the system starts it checks all queues (incl. the working-queues of the ext. progs) for jobs, and advances those found.

I have never programmed such a thing, but I'd like to try. If it works, we get a generic, extensible, configurable, workflow-oriented job manager(TM).
The manager would be generic in that it does not care what kind of jobs it handles, and in that it has an input and an output to another mgr, or is the end of the line.
It'd be extensible in that more mgrs can be appended "to the right", and more ext. progs can be started in a defined order by each mgr.
It'd be configurable regarding success/failure actions and the number of concurrent jobs.
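
A sketch of one queue-mgr step under those rules (directory names are invented):

Code: Select all

# Sketch of one queue-mgr step: job files move atomically between
# queue directories, and the exit code decides the next queue.
# Directory names are invented.
import os
import subprocess

WORK = "queue/work"
OUTGOING = "queue/outgoing"
DEFERRED = "queue/deferred"

def advance(jobfile, ext_prog):
    name = os.path.basename(jobfile)
    workfile = os.path.join(WORK, name)
    os.rename(jobfile, workfile)  # atomic on the same filesystem
    rc = subprocess.call([ext_prog, workfile])
    dest = OUTGOING if rc == 0 else DEFERRED
    os.rename(workfile, os.path.join(dest, name))

A deferred job would later just be renamed back into the incoming queue for a retry.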


WOW... now I've written this super long posting and wonder if there is something like that already as OSS :roll: .
... I also wonder if it's not total overkill for "uploading a bunch of files" :mrgreen:
abma
Spring Developer
Posts: 3798
Joined: 01 Jun 2009, 00:08

Re: springfiles.com mirroring system

Post by abma »

I proposed cmd line args as the communication method because I thought it's the most portable way.
What are passwords needed for? (The DB password is static and can be stored in a config file.)
Only the config for the MySQL password + database is required.
OK, so here is the queue model for the upload system:

Code: Select all

                  queue           queue
socket   \         |               |
  or      >-- incoming-mgr -- outgoing-mgr
cmd line /         |           |    |   |
               meta-extr      FTP  MD5  DB
Queues are directories that contain text files with the job info. The "data" is never passed around, as it is a file already stored on disk, and there is no need to move it (and if there is, then make it a job).
I think using the database for that makes more sense, as the requirements for the program would be lower (only database access would be needed: no writing to the filesystem, database entries can't get broken like files, no parsing of files). But let's see what our requirements are:

- upload new files
- re-upload failed uploads
- move files into an archive folder on mirrors (or delete them, if the mirror has no archive)
- re-upload damaged/missing files
- verify files
- extract metadata
- notify other services/websites about new files
- import files already on mirrors (would mean a script that notifies about a new file)

Currently there are 3 tables that are relevant for uploads:
table files contains a list of all files currently on springfiles
table file_mirror contains the config for the mirrors (hostname, FTP access, ...)
table file_mirror_files contains meta-info about the files on the file mirrors
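
For illustration, finding the files that still need uploading would then be a single query (table names as above; the column names are guesses):

Code: Select all

# Sketch of a DB-driven job search. The table names are the ones
# listed above; the column names are guesses.
import MySQLdb

db = MySQLdb.connect(db="springfiles", read_default_file="~/.my.cnf")
cur = db.cursor()
cur.execute("""
    SELECT f.fid, m.mid
    FROM files f
    CROSS JOIN file_mirror m
    LEFT JOIN file_mirror_files mf
           ON mf.fid = f.fid AND mf.mid = m.mid
    WHERE mf.fid IS NULL  -- file not yet present on that mirror
""")
for fid, mid in cur.fetchall():
    pass  # queue an upload job for (fid, mid)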

From left to right:
IMO it doesn't really matter if the input gets communicated to a socket of a running daemon or by cmd line call (starting a prog) that results in a program doing exactly the same :)
The result in both cases: a job description gets written to a file in the incoming-queue-directory. Then the incoming-queue-mgr gets active (socket call) and calls the appropriate incoming-hook: an external program.
Yes, true; whether the job is in a queue-dir or a table... that's in general the same.
The external program moves the job-file to its working-queue-directory (afaik filesystem moves are atomic) and does its job (metadata extraction, DB entry). Upon completion it exits with an exit code (success/failure), possibly appending info to the job-file before moving it back to the incoming-queue-directory.
The incoming-queue-mgr reacts according to the exit code:
success -> run next ext.prog or move job to outgoing-queue(-mgr)
failure -> move job to incoming-deferred-queue, contact admin / retry after x min / call SF error func
I think this would mean the script has to return something like: temporary error / general failure, and depending on that, delete the job or retry later. In a database I would simply mark a hard-failed file/job as failed and ignore it (something like UPDATE file SET status=failure WHERE fileid=xy), and for a failed upload I would mark the file as not uploaded (so it is re-uploaded on the next upload run).
The outgoing-queue-mgr starts the external programs for FTP-transfer, md5 local-remote-diff, DB entry for the job in this order. It reacts according to the exit code much like the incoming-queue-mgr.
That should be the same code, yes (incoming, upload, post-upload hook).
That makes 2 running daemons that know how many ext. progs they have started. When the system starts it checks all queues (incl. the working-queues of the ext. progs) for jobs, and advances those found.
What are the 2 daemons? I think only one daemon is running; the other program isn't a daemon, only a script.
The manager would be generic in that it does not care what kind of jobs it handles, and in that it has an input and an output to another mgr, or is the end of the line.
It'd be extensible in that more mgrs can be appended "to the right", and more ext. progs can be started in a defined order by each mgr.
It'd be configurable regarding success/failure actions and the number of concurrent jobs.
Don't make it too generic; that could make it too complicated.
WOW... now I've written this super long posting and wonder if there is something like that already as OSS :roll: .
... I also wonder if it's not total overkill for "uploading a bunch of files" :mrgreen:
It's not only "uploading some files"; especially the metadata extraction + notifying other services/sites require something similar.

Are you sometimes in the lobby? That would simplify things a bit :-)
abma
Spring Developer
Posts: 3798
Joined: 01 Jun 2009, 00:08

Re: springfiles.com mirroring system

Post by abma »

Licho wrote: I don't understand what this discussion is about then :)
Where things are, and where they go...
Licho wrote: Use an "uploading" service (can be in Python) + some IPC?
On Linux that could be a named pipe where you dump the file name + parameters to the process.
Yes, exactly.
dansan
Server Owner & Developer
Posts: 1203
Joined: 29 May 2010, 23:40

Re: springfiles.com mirroring system

Post by dansan »

abma wrote: I think using the database for that makes more sense, as the requirements for the program would be lower (only database access would be needed: no writing to the filesystem, database entries can't get broken like files, no parsing of files)
You're right. I have done too much system administration, so IMO having a DB was a higher requirement than files/dirs, but from a programming point of view DBs simplify things a lot :)
abma wrote: - notify other services/websites about new files
What do you have in mind? Updating an RSS feed?
abma wrote: - import files already on mirrors (would mean a script that notifies about a new file)
Import from where to where? mirror -> SF?
abma wrote: Currently there are 3 tables that are relevant for uploads:
table files contains a list of all files currently on springfiles
table file_mirror contains the config for the mirrors (hostname, FTP access, ...)
table file_mirror_files contains meta-info about the files on the file mirrors
Can you mail me the SQL schema, or is it in some repo?
What are the 2 daemons? I think only one daemon is running; the other program isn't a daemon, only a script.
A script can be a daemon too, if it runs in the background.
The 2 daemons are the two queue-mgrs.
Don't make it too generic; that could make it too complicated.
I'll try to find a good middle ground. I guess this is where programming experience (which I'm missing) pays off a lot :)
It's not only "uploading some files"; especially the metadata extraction + notifying other services/sites require something similar.
Can I see the code that is currently used to extract metadata from uploaded files?
Are you sometimes in the lobby? That would simplify things a bit :-)
When I'm playing :)
I'll be there in a few hours and look for you.
dansan
Server Owner & Developer
Posts: 1203
Joined: 29 May 2010, 23:40

Re: springfiles.com mirroring system

Post by dansan »

I've committed my WIP to https://github.com/springfiles/upq
I'll produce some documentation tomorrow, as I'm tired now.

Ah wth... here is a good overview: there is a daemon running that listens on a socket (atm a file socket, but it could be a network socket too). It accepts cleartext commands in the form "cmd1 arg1 arg2" "cmd2 arg1 arg2"...
It then tries to find a module "jobs/Cmd1" by naming convention and, on success, queues the job and returns "OK bla bla".
Each Job type has its own queue that can have a configurable number of threads working on its jobs.
Each Job can have sub-"Task"s that do most of its work. Their use is not mandatory - my idea was to promote easy code sharing.
Jobs and Tasks are Python modules that are searched for in configurable directories and loaded dynamically when needed (quite simple in Python :).
The standard Python Queue was extended to make Jobs persistent in the MySQL DB, to withstand an (upq) server crash/restart (see upq.sql). After the server starts, it looks in the DB to see if any jobs have not finished, and restarts them. I'm having trouble maintaining a stable connection to MySQL, probably because the lib for that is not thread-safe... I'll have to find some sort of pooling solution, or locking, or I don't know... it has been driving me crazy for two days... weird stuff :)
There is a basic notification system that atm can send emails and log to syslog when a task finishes successfully or not.
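
For illustration, the naming-convention dispatch could look roughly like this (a sketch; the real code may differ in details):

Code: Select all

# Sketch of the naming-convention dispatch described above; the real
# upq code may differ in details.
def dispatch(line):
    parts = line.split()
    cmd, args = parts[0], parts[1:]
    # "cmd1 arg1 arg2" -> module jobs/cmd1.py, class Cmd1
    module = __import__("jobs." + cmd.lower(), fromlist=[cmd.lower()])
    job_class = getattr(module, cmd.capitalize())
    job = job_class(args)  # constructor signature is assumed
    job.enqueue()          # assumed: put it on its job-type queue
    return "OK " + cmd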
Missing pieces:
* Possibility to put a job "on hold" and schedule it to be retried (n times) later without blocking a thread.
* No useful job was written, as I don't have access to springfiles.com and don't even know the correct paths - just an example and some "templates/job ideas".
* bugs :)
* reread the ini-file on a signal (USR1 or something)
* atm notifications are configured per class of jobs, but maybe it would be better to set them up per job when connecting to the socket
* I'm not aware of most things that go on @sf.com, so probably a lot of features are missing :)

The whole point of the sw is to be able to easily write and schedule Jobs by just deriving from a class and editing an ini-file. So it should hopefully be quick/easy to migrate jobs to this.
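
Writing a job is then meant to look roughly like this (a sketch; the import path and the check()/run() interface are guesses):

Code: Select all

# Sketch of what "writing a job" is meant to look like; the import
# path and the base-class interface are guesses.
from upqjob import UpqJob  # hypothetical import path

class Hash(UpqJob):
    """Handles a "hash <filename>" command sent to the socket."""
    def check(self):
        # assumed: self.jobdata holds the parsed arguments
        return len(self.jobdata) == 1
    def run(self):
        pass  # hash the file and store the result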
abma
Spring Developer
Posts: 3798
Joined: 01 Jun 2009, 00:08

Re: springfiles.com mirroring system

Post by abma »

Looks very promising; I will have a look at it on the weekend, as I'm traveling the next few days.
abma
Spring Developer
Posts: 3798
Joined: 01 Jun 2009, 00:08

Re: springfiles.com mirroring system

Post by abma »

Hm, now I'm busier than expected :-/

Some questions/suggestions:
  • how to create a new job within a job?
  • why isn't "notify" a job (without notifications) itself?
  • the only thing I miss is something like a retry-task "in n minutes", but imo this could be done by a task itself, with slight modifications?
  • where to add the springfiles tasks? Make an additional repo for the tasks folder?
  • I know, springfiles is currently like a black box. But be happy that you can't see this ugly code. If you're still interested: nearly all code that is used for SF's mirroring is at https://github.com/springfiles/file_mirror

These are my planned jobs (+ subjobs, in order):


new file job
  • hash file
  • metadata extraction + verification (for example: can unitsync load the file?)
  • upload to mirrors
  • upload verification
  • notify other services about a new file:
    • auto-create a page on springfiles (existing but untested PHP script)
    • notify map|modinfo.adune.nl
    • notify plasmaservice
    • message within a channel
    • deprecate old files (archive)
    • ...
download job
  • download
  • call "new file job"
archive job
  • move file on remote site into archive
import files
  • search files on remote site
  • call "download job" for unknown files
Any suggestions about that?


Finally, I need more time ;-)
dansan
Server Owner & Developer
Posts: 1203
Joined: 29 May 2010, 23:40

Re: springfiles.com mirroring system

Post by dansan »

abma wrote: how to create a new job within a job?

Code: Select all

# create the job, check it, then run it:
uj = UpqJob(jobname, jobcfg, jobdata, paths)
uj.check()
uj.run()
The only problem is that with this __init__ you'd need "jobcfg"... either from Parseconfig (https://github.com/springfiles/upq/blob ... pq.py#L106), or you already have it if it's the same job... Hmm... that's a problem, I guess...

With every additional module I wrote, I got more and more the impression that it would have been a good idea to make the [ini] data globally available instead of passing info around in function arguments (bloating the interface), so it made it into the TODO -> https://github.com/springfiles/upq/blob/master/TODO#L12
Beware: when that is implemented, some interfaces will change.

Why do you want to create a job from a job? Maybe you can use Tasks for this: put your work-code into UpqTasks and use the UpqJob for control flow.
abma wrote: why isn't "notify" a job (without notifications) itself?
Didn't think of that - sounds like a good idea :)
abma wrote: the only thing I miss is something like a retry-task "in n minutes", but imo this could be done by a task itself, with slight modifications?
On my definitely-TODO-list :) https://github.com/springfiles/upq/blob/master/TODO#L2
abma wrote: where to add the springfiles tasks? Make an additional repo for the tasks folder?
I don't mind. upq is already under springfiles on GitHub, so project-specific code is to be expected. If someone contacts us because they found upq (how?) and want to use it at their place, we can still move the SF-specific code out of there. Or is there security-related code that should be kept out of sight?
abma wrote: I know, springfiles is currently like a black box. But be happy that you can't see this ugly code. If you're still interested: nearly all code that is used for SF's mirroring is at https://github.com/springfiles/file_mirror
I'm happy if you take care of the SF jobs, because I'd like to focus on the tasks in the TODO. Also: the MySQL connection is really unstable, making it a total showstopper. It makes me a bit angry: why is there such a great lib for PostgreSQL (psycopg) and not for MySQL?
abma wrote: These are my planned jobs (+ subjobs, in order):
[..]
• upload to mirrors
• upload verification
done -> tasks/remote_md5.py
abma wrote: notify other services about a new file:
  • auto-create a page on springfiles (existing but untested PHP script)
  • notify map|modinfo.adune.nl
OT: I'd love map|modinfo.adune.nl to have an RSS feed that lobby clients read to notify users about new maps/mods. (Also: a tournament announcement notice.)
abma wrote: • notify plasmaservice
  • message within a channel
  • deprecate old files (archive)
  • ...

abma wrote: download job
From where and what does sf.com download?
abma wrote: • download
  • call "new file job"
archive job
  • move the file on the remote site into the archive
import files
  • search for files on the remote site
How do files get there? I thought sf.com is the one and only source... What other places are there? (curious)
abma wrote: • call "download job" for unknown files

Any suggestions about that?


Finally, I need more time ;-)
The next things I'll do (when I find time...) will be the global [cfg] (because it'll change some interfaces) and stabilizing the MySQL connection.
abma
Spring Developer
Posts: 3798
Joined: 01 Jun 2009, 00:08

Re: springfiles.com mirroring system

Post by abma »

dansan wrote: Why do you want to create a job from a job? Maybe you can use Tasks for this: put your work-code into UpqTasks and use the UpqJob for control flow.
Because jobs have a setting for how many of them can run in parallel; this allows something like this:
(concurrent jobs)
hash-job 1
upload 2
notify 6

With tasks this currently looks impossible.
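
For illustration, that setting boils down to one queue per job type with its own worker count (a sketch; names are invented, numbers from the example above):

Code: Select all

# Sketch: one queue per job type, each with its own number of worker
# threads (the numbers from the example above). Names are invented.
# Python 2 era, hence the Queue module.
import Queue
import threading

CONCURRENCY = {"hash": 1, "upload": 2, "notify": 6}
queues = {}

def worker(q):
    while True:
        job = q.get()
        job.run()
        q.task_done()

for jobtype, nthreads in CONCURRENCY.items():
    q = queues[jobtype] = Queue.Queue()
    for _ in range(nthreads):
        t = threading.Thread(target=worker, args=(q,))
        t.daemon = True
        t.start()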


dansan wrote: Or is there security related code that should be kept out of sight?
Imo authentication should be done in the frontend... access to the socket should be granted via Unix access rights.
dansan wrote: It makes me a bit angry: why is there such a great lib for PostgreSQL (psycopg) and not for MySQL?
SQLAlchemy looks great, too (but was a bit complicated at the beginning).
dansan wrote: From where and what does sf.com download?
http://packages.springrts.com/builds/
dansan
Server Owner & Developer
Posts: 1203
Joined: 29 May 2010, 23:40

Re: springfiles.com mirroring system

Post by dansan »

abma wrote:
dansan wrote: Why do you want to create a job from a job? Maybe you can use Tasks for this: put your work-code into UpqTasks and use the UpqJob for control flow.
Because jobs have a setting for how many of them can run in parallel; this allows something like this:
(concurrent jobs)
hash-job 1
upload 2
notify 6

With tasks this currently looks impossible.
I see... I didn't think of parallelism inside a job, but it makes total sense if you want to upload 1 file to multiple mirrors and notify multiple news consumers at once.
Then all those subtasks must be UpqJobs, or UpqTasks would have to get the same infrastructure as UpqJobs - duplicating code -> let's drop the UpqTask class and make the subtasks UpqJobs.

abma wrote:
dansan wrote: Or is there security related code that should be kept out of sight?
Imo authentication should be done in the frontend... access to the socket should be granted via Unix access rights.
Clean, easy, and done :)
Oh no... the file permissions are 755 for the socket and 644 for the log... I'll change that to 750 and 640.
abma wrote:
dansan wrote: It makes me a bit angry: why is there such a great lib for PostgreSQL (psycopg) and not for MySQL?
SQLAlchemy looks great, too (but was a bit complicated at the beginning).
Very nice - happy you found (and already coded) that :)
abma
Spring Developer
Posts: 3798
Joined: 01 Jun 2009, 00:08

Re: springfiles.com mirroring system

Post by abma »

Games that are also on rapid are now removed from the mirrors if they have only one tag (!= stable / testing) in rapid and are older than 100 days. This should have freed up some space on the mirrors.

All files are still kept on springfiles.com.
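
The rule, as a sketch (field names are invented):

Code: Select all

# Sketch of the removal rule described above; field names invented.
def removable(tags, age_days):
    # exactly one rapid tag, which is neither "stable" nor "testing",
    # and the file is older than 100 days
    return (len(tags) == 1
            and tags[0] not in ("stable", "testing")
            and age_days > 100)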

(edit: damnit, bumped the wrong thread, sorry)