| Author |
Message |
IceManNCSU New User

Joined: 08 Mar 2010 Posts: 8
|
Posted: Wed Mar 10, 2010 2:26 am Post subject: Multi-Core Processing of Headers |
|
|
Hey guys, I recently purchased NMP for the primary role of importing headers from the top 500+ binary groups out there. I think that I have plenty of horsepower to run this app and disk storage to get a great start.
My system is as follows.
ASUS P5E3 Premium
Win 7 Ultimate 64bit
Intel Q9450 @ 3.0GHz Quad Core
8GB DDR3 (4x2GB) @ 1200MHz, 5GB dedicated to SQL server
250GB HD @ 7200rpm <-- My weakest link right now, will be soon replaced with a SAS solution. This is not my OS drive, this is dedicated to the DB files.
MS SQL Server 2008 Enterprise 64bit, Full-Text Searching Disabled for now.
Before I give my request, here are my results with 3.0.3.3 so far. Creating work-groups and adding groups to them is seamless. The program is very intuitive and well thought out. I was actually in the process of writing my own in C# to only grab headers and save them to a DB. But for the offering here at $30 it seems to be reasonably priced. Would be nice if the source/solution was available to us who purchased, but I can understand. I give two thumbs up to the developer with the DB layout. Since I plan to use the DB as the back-end of a web-service it will be very usable in its existing layout, with few modifications.
OK, so here's the one issue that I would like to see addressed in future releases. Since header processing is so near and dear to me, can we see more emphasis on utilizing all available cores/physical processors to complete grouping and importing of the headers? Specifically, spawn worker threads/processes to handle the incoming workloads. Maybe look into using the existing capabilities of the nVidia CUDA architecture?
One more thing, it would be nice to be able to cancel or abort the current activity in the lower task bar. I know I have had a few times I wanted to cancel the actions being applied to the DB and didn't want to close the app in order to kill the threads. This is just a thought.
Thanks!
Last edited by IceManNCSU on Wed Mar 10, 2010 5:50 pm; edited 1 time in total |
|
|
 |
administrator Developer


Joined: 24 Jul 2004 Posts: 4752 Location: King William, VA
|
Posted: Wed Mar 10, 2010 4:09 pm Post subject: |
|
|
| IceManNCSU wrote: | | Since header processing is so near and dear to me, can we see more emphasis on utilizing all available cores/physical processors to complete grouping and importing of the headers? Specifically, spawn worker threads/processes to handle the incoming workloads. |
If we were simply stuffing the headers in the database, spawning additional threads would make more sense. The problem is that we actually look for relationships between headers for multi-part messages and then do additional processing to test message completeness and, if you have it enabled, group messages together for easier browsing. These additional steps require that only a single process evaluate these header relationships. I could see a case for spawning a thread per Workgroup, since header relationships don't cross Workgroups, but it would only bring benefits if header counts were more or less evenly distributed amongst your Workgroups. The other potential downside is that the batch insert into the database does take measurable processing time and, with more than one thread doing this, there would be significant performance implications, particularly if you are trying to do anything else.
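To make the thread-per-Workgroup idea concrete, here is a minimal Python sketch (not NMP's actual code). It assumes headers can be reduced to made-up `(subject, part number)` pairs, and `process_workgroup` is a hypothetical stand-in for the real multi-part and completeness checks. Each Workgroup is owned by exactly one worker, so header relationships are still evaluated by a single thread.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def process_workgroup(name, headers):
    """Group the headers of ONE workgroup by subject.

    Hypothetical stand-in for multi-part detection: a single worker owns
    the whole workgroup, so no other thread touches these relationships.
    """
    parts_by_subject = defaultdict(list)
    for subject, part_no in headers:
        parts_by_subject[subject].append(part_no)
    return name, {s: sorted(p) for s, p in parts_by_subject.items()}


def process_all(workgroups):
    """Run one worker per workgroup; relationships never cross workgroups."""
    with ThreadPoolExecutor(max_workers=max(1, len(workgroups))) as pool:
        futures = [pool.submit(process_workgroup, n, h) for n, h in workgroups.items()]
        return dict(f.result() for f in futures)


# Made-up sample data: (subject, part number) pairs per workgroup.
SAMPLE = {
    "movies": [("clip.avi", 2), ("clip.avi", 1), ("other.avi", 1)],
    "music": [("song.mp3", 1)],
}
RESULT = process_all(SAMPLE)
```

The total run time is still bounded by the largest Workgroup, which is why evenly distributed header counts matter. In CPython specifically, the global interpreter lock would also limit CPU-bound gains from threads, so this only illustrates the partitioning.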
| IceManNCSU wrote: | | One more thing, it would be nice to be able to cancel or abort the current activity in the lower task bar. |
Agreed.
Regards |
|
|
 |
IceManNCSU New User

Joined: 08 Mar 2010 Posts: 8
|
Posted: Wed Mar 10, 2010 4:31 pm Post subject: |
|
|
Can you please advise me on the most efficient way to set up my workgroups?
As I stated in my original post, I basically want to archive headers and all the header info. From what I have looked at so far, the WGHEADER table contains all the main header info for all workgroups. If a header has multiple parts, then the additional parts/message IDs are located in the WGPARTSXX table for each workgroup.
| Quote: | | I could see a case for spawning a thread per Workgroup, since header relationships don't cross Workgroups, but it would only bring benefits if header counts were more or less evenly distributed amongst your Workgroups. |
From your reply it sounds as if I may see a performance increase by just placing one single group in a workgroup by itself. As it currently stands, I am placing about 20-25 groups per workgroup, then processing the workgroup for all headers, and eventually only 'new'. For the past few days I have just been doing groups with a max of around 1.5 million articles. So as I inch up to the big boys, it would be nice to have all my settings in place correctly. |
|
|
 |
administrator Developer


Joined: 24 Jul 2004 Posts: 4752 Location: King William, VA
|
Posted: Wed Mar 10, 2010 4:59 pm Post subject: |
|
|
I don't think that there is any efficiency to be gained by any particular Workgroup setup. The bottleneck will always be the database+HD+available memory.
For the record, I tend to group newsgroups together into Workgroups by content (for binaries) and topic (for text groups).
You haven't said how many days you plan to archive, but if you are hoping to store everything that is available on the news server, you will be disappointed. News servers are architected very differently from what you need as an end user. To do that, you'd be better off installing news server software on a box and subscribing to the news feeds you need to populate the groups you are interested in.
Since you seem to be a developer, there's no reason you couldn't separate your headers into different databases so that you can have a "live" database and an "archive" database, using a job to move the data from one to the other. You could even have a database for each Workgroup, if you wanted to. There are lots of options, but ultimately, there is no way I can see that you'll be able to store more than 3-5 days' worth of headers in a single database on a non-server for the 500 largest binary groups. You'll also be spending a lot of time simply managing/pruning your database. If you built a dedicated SQL Server with very fast HD storage you could store more, but my question would be, why would you want to?
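As a rough illustration of the "live" plus "archive" idea, the sketch below moves old rows from one database to the other in a single transaction. It uses Python's built-in SQLite as a stand-in for SQL Server or MySQL, and the `headers` table with its `message_id` and `posted_day` columns is hypothetical, not NMP's real schema.

```python
import sqlite3


def move_old_headers(conn, cutoff_day):
    """Copy rows older than cutoff_day into archive.headers, then delete them from live."""
    with conn:  # one transaction: both statements apply, or neither does
        conn.execute(
            "INSERT INTO archive.headers "
            "SELECT * FROM main.headers WHERE posted_day < ?",
            (cutoff_day,),
        )
        cur = conn.execute(
            "DELETE FROM main.headers WHERE posted_day < ?", (cutoff_day,)
        )
    return cur.rowcount


# Two in-memory databases: "main" plays the live DB, "archive" the archive DB.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS archive")
schema = "(message_id TEXT PRIMARY KEY, posted_day INTEGER)"
conn.execute("CREATE TABLE main.headers " + schema)
conn.execute("CREATE TABLE archive.headers " + schema)
conn.executemany(
    "INSERT INTO main.headers VALUES (?, ?)",
    [("<a@x>", 1), ("<b@x>", 5), ("<c@x>", 9)],  # made-up sample rows
)
conn.commit()

MOVED = move_old_headers(conn, cutoff_day=5)
```

On a real server the same copy-then-delete pair would more likely live in a scheduled job, with the cutoff expressed as a date rather than a day number.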
For binary groups, I only keep headers that I want to download. Every day, I delete everything else. Considering that nearly 80% of the headers are spam and/or viruses, I don't see any point in trying to store them long term. If I end up deleting something I want later, I use Binsearch or one of the other free Usenet search services to get an NZB file.
Regards |
|
|
 |
IceManNCSU New User

Joined: 08 Mar 2010 Posts: 8
|
Posted: Wed Mar 10, 2010 5:46 pm Post subject: |
|
|
Well, currently I already have about 110 groups fully indexed at ~575 days; I can't give the exact numbers right now due to my inability to remote back to the machine at home. But I think that the DB is at about 30-40GB... I cannot remember how many articles were in the DB at that time... One thing that I have done to improve my personal situation on my MS SQL server is change the logging type to 'Simple' from 'Full'. I found from watching the activity through Profiler and Activity Monitor that there were a lot of waits for LOGGING to finish before SQL Server would continue. I feel that this has increased my performance, if only a little; some is more than none.
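For anyone who wants to script that change, the 'Simple' setting corresponds to SQL Server's recovery model, which is set with `ALTER DATABASE ... SET RECOVERY`. The small Python helper below only builds the statement; `recovery_model_sql` is a hypothetical name and `NMP` is a placeholder database name, so substitute your own.

```python
def recovery_model_sql(database, model="SIMPLE"):
    """Build the T-SQL statement that switches a database's recovery model."""
    allowed = {"SIMPLE", "FULL", "BULK_LOGGED"}
    model = model.upper()
    if model not in allowed:
        raise ValueError(f"unknown recovery model: {model}")
    # Bracket-quote the identifier; a literal ] is escaped by doubling it.
    quoted = "[" + database.replace("]", "]]") + "]"
    return f"ALTER DATABASE {quoted} SET RECOVERY {model};"


# "NMP" is a placeholder database name, not necessarily the real one.
STATEMENT = recovery_model_sql("NMP")
print(STATEMENT)
```

The trade-off is that point-in-time restore from log backups is not available under the Simple model, which is probably acceptable for header data that can be downloaded again.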
I currently plan on storing MAX retention from Giganews. At least that is what I have been doing so far... But as I approach the big ones, like a.b.boneless, I will probably grab chunks to help keep the temp_headers folder from growing to several hundred thousand files awaiting processing. There seems to be another issue with performance with the creation of the temp_header files awaiting import to the DB. The more files in this folder, the longer 'importing' takes.
TIP: Stop & Disable "Windows Search" process. This will increase your disk performance and help streamline reads and writes.
Another nice feature would be to manage the list of header jobs awaiting import to the DB. I know that there is already a queue that manages the upcoming NNTP connection jobs, header update jobs, etc. But wouldn't it be nice to manage the waiting batch processing jobs post download too? This may be a sensitive issue, since I suspect that there are log/DB records to manage as well.
| Quote: | | Since you seem to be a developer, there's no reason you couldn't separate your headers into different databases so that you can have a "live" database and an "archive" database, using a job to move the data from one to the other. You could even have a database for each Workgroup, if you wanted to. There are lots of options, but ultimately, there is no way I can see that you'll be able to store more than 3-5 days' worth of headers in a single database on a non-server for the 500 largest binary groups. You'll also be spending a lot of time simply managing/pruning your database. If you built a dedicated SQL Server with very fast HD storage you could store more, but my question would be, why would you want to? |
Yes, I do intend on making this a 24/7 service that will be constantly rotating through groups looking for 'new' headers and removing old records (but I will use a SQL script to do this).
I will also be looking at different ways to move the data into a more production-friendly environment as time goes on. Right now I am mainly stressing my existing hardware to see what exactly I need to get. I knew that I should have asked for that SAS controller card for Christmas!
My main motivation for doing this is as follows. Since newzbin.com is an awesome service and it looks like legal action is pending against them, I thought that I would invest in my own system for searching USENET. I am very familiar with binsearch.info and what it can do. In fact, it's running off a modified version of the open-source IKBIN software. http://sourceforge.net/projects/ikbin/ But IKBIN in the currently published version is very flaky in handling LARGE binary groups. It doesn't appear to be an active project either. The only thing it has going for it is the front end component for searching the DB, and it could use work.
Yes, I am going to place a PHP/HTML front end on the DB for searching and creation of NZBs based on the newzbin.com format. I haven't decided if I will be making it publicly available yet, since my upload to the world is only 1.5Mbps (may be moving to 3.0Mbps, YaY!) and that wouldn't float more than 5 people browsing page to page. |
|
|
 |
administrator Developer


Joined: 24 Jul 2004 Posts: 4752 Location: King William, VA
|
Posted: Wed Mar 10, 2010 6:43 pm Post subject: |
|
|
| IceManNCSU wrote: | Well, currently I already have about 110 groups fully indexed at ~575 days; I can't give the exact numbers right now due to my inability to remote back to the machine at home. But I think that the DB is at about 30-40GB... |
Impressive. I haven't used SQL Server for NMP in a while, but I may switch over from MySQL just to get another taste of the dark side.
| IceManNCSU wrote: | | One thing that I have done to improve my personal situation on my MS SQL server is change the logging type to 'Simple' from 'Full'. |
That's a really good idea, and I think it actually gives you a bigger performance boost than you think.
| IceManNCSU wrote: | | There seems to be another issue with performance with the creation of the temp_header files awaiting import to the DB. The more files in this folder, the longer 'importing' takes. |
I think that is OS related as Windows is working harder to maintain the directory. You can move the files out of the folder and feed them back in later if you want to temporarily speed things up.
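A small script can do the moving for you. This is only a sketch under assumptions: `park_headers` and `feed_back` are hypothetical helpers, the folder names in the demo are placeholders rather than NMP's real paths, and it is probably safest to run it while no import is in progress.

```python
import shutil
import tempfile
from pathlib import Path


def park_headers(temp_dir, holding_dir, keep=1000, pattern="*.HDR"):
    """Move all but `keep` header files out of temp_dir into holding_dir."""
    holding = Path(holding_dir)
    holding.mkdir(parents=True, exist_ok=True)
    surplus = sorted(Path(temp_dir).glob(pattern))[keep:]
    for path in surplus:
        shutil.move(str(path), str(holding / path.name))
    return len(surplus)


def feed_back(holding_dir, temp_dir, batch=1000, pattern="*.HDR"):
    """Return up to `batch` parked files to temp_dir so they get imported."""
    files = sorted(Path(holding_dir).glob(pattern))[:batch]
    for path in files:
        shutil.move(str(path), str(Path(temp_dir) / path.name))
    return len(files)


# Demo against throwaway directories, not a real NMP install.
with tempfile.TemporaryDirectory() as root:
    temp_dir = Path(root) / "temp_headers"
    holding_dir = Path(root) / "holding"
    temp_dir.mkdir()
    for i in range(5):
        (temp_dir / f"batch{i}.HDR").write_text("placeholder")
    PARKED = park_headers(temp_dir, holding_dir, keep=2)
    LEFT_AFTER_PARK = len(list(temp_dir.glob("*.HDR")))
    RETURNED = feed_back(holding_dir, temp_dir, batch=2)
    LEFT_IN_HOLDING = len(list(holding_dir.glob("*.HDR")))
```

Keeping the holding folder on the same volume makes each move a cheap rename instead of a copy.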
| IceManNCSU wrote: | | Another nice feature would be to manage the list of header jobs awaiting import to the DB. I know that there is already a queue that manages the upcoming NNTP connection jobs, header update jobs, etc. But wouldn't it be nice to manage the waiting batch processing jobs post download too? This may be a sensitive issue, since I suspect that there are log/DB records to manage as well. |
There is no management of these batches at all. NMP processes any .HDR file it finds, in no particular order. The only way to manage them is to move the files. To provide any other management tools would require either recording the batches in the database, which I don't recommend, or building an in-memory queue that the header processing thread could read from while still allowing user interaction. This second option could be worth looking into, but the thread would have to be paused in order to interact with the queue.
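A minimal sketch of what such an in-memory queue could look like, in Python rather than NMP's own code. `BatchQueue` and its methods are hypothetical; the "pause" is modelled by a single lock, so the processing thread simply waits while a user edit is in progress.

```python
import threading
from collections import deque


class BatchQueue:
    """In-memory queue of pending batch names, guarded by a single lock."""

    def __init__(self, batches=()):
        self._items = deque(batches)
        self._lock = threading.Lock()

    def next_batch(self):
        """Called by the processing thread; blocks while a user edit holds the lock."""
        with self._lock:
            return self._items.popleft() if self._items else None

    def reorder(self, key):
        """Called from the UI side; the processing thread waits until this returns."""
        with self._lock:
            self._items = deque(sorted(self._items, key=key))

    def remove(self, name):
        """Drop one pending batch; returns False if it was not queued."""
        with self._lock:
            try:
                self._items.remove(name)
                return True
            except ValueError:
                return False


# Made-up batch names; user edits happen before the worker drains the queue.
QUEUE = BatchQueue(["b.HDR", "c.HDR", "a.HDR"])
REMOVED = QUEUE.remove("c.HDR")
QUEUE.reorder(key=str.lower)

PROCESSED = []


def worker():
    """Stand-in for the header processing thread."""
    while True:
        batch = QUEUE.next_batch()
        if batch is None:
            break
        PROCESSED.append(batch)


thread = threading.Thread(target=worker)
thread.start()
thread.join()
```

Because the lock is only held for the duration of one queue operation, the worker is paused between batches and never in the middle of one.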
Good luck with your project and keep us posted.
Regards |
|
|
 |
IceManNCSU New User

Joined: 08 Mar 2010 Posts: 8
|
Posted: Wed Mar 10, 2010 7:02 pm Post subject: |
|
|
| Quote: | | Impressive. I haven't used SQL Server for NMP in a while, but I may switch over from MySQL just to get another taste of the dark side. |
I usually use open-source software when possible. But since SQL Server is a very powerful piece of software, I tend to use the MS flavor over MySQL, especially since I have an MSDN license that includes development usage rights for almost all their software. So why not?
| Quote: | | Good luck with your project and keep us posted. |
Will do, and if I can help out with any future development, I come with C#, VB.NET, HTML and PHP skills.
Take care! |
|
|
 |