Fires of Heaven Guild Message Board  

Go Back   Fires of Heaven Guild Message Board > General forums > Development
User Name
Password
Or, use your gamerDNA username: (more...)
ForumSpy Register FAQ Members List Calendar Search Today's Posts Mark Forums Read

Reply
 
LinkBack Thread Tools Search this Thread Rate Thread Display Modes
Old 04-04-2008, 02:51 PM   #1 (permalink)
Grooverider
hildog
 
Grooverider's Avatar
 
Join Date: Aug 2002
Posts: 860
+0 Internets
database question..

Right, I have a large database with a lot of repeating data. The repeating data will be sorted but the database will still be extremely large - at the moment it's about 1.8m rows, but I guess that isn't that large.

one of the fields is called 'key' and is a repeating field that coudl be one of say.. 50,000 things.

'company' is a field that states the name of a company.

I run the following sql query:

SELECT distinct key from adata where company like "%MBNA%"

It then runs through the database of 1.8m entries and finds how many individual keys that the company MBNA is uniquely associated with.

The problem is this is for a website back end, where the response has to be quick. Currently this takes about 3 minutes to complete the query.. I assume it has to scan the entire 1.8m row database. Is there anyway to speed this up? The MySQL database does absolutely TONS of hard disk read/writes during the query. Ugh it's so annoying, I need to to be so much faster. Maybe if the whole database is in ram?

The database is a dedicated windows XP machine w/1 gb ram and a 2.8ghz processor. Would it be significantly faster if I installed CentOS?

Please help
Grooverider is offline   Reply With Quote
Old 04-04-2008, 03:08 PM   #2 (permalink)
Maio
Registered User
 
Join Date: Dec 2004
Posts: 65
-1 Internets
Create a table named `companies` and a column in `adata` named `company_id` that references the companies in `adata` as a foreign key. Then make the comparison on `companies` (which will have less records than `adata` and therefore fasten up the lookup).

SELECT DISTINCT `adata`.`key` FROM `adata` JOIN `companies` ON `adata`.`company_id` = `companies`.`id` WHERE `companies`.`name` LIKE '%MBNA%'
Maio is offline   Reply With Quote
Old 04-04-2008, 03:17 PM   #3 (permalink)
Grooverider
hildog
 
Grooverider's Avatar
 
Join Date: Aug 2002
Posts: 860
+0 Internets
Problem is, company's will be added in constantly. About 10 rows are being added to the adata table each second (every second of the day.. everyday etc)

Adata table is constantly getting bigger and bigger. Forever. New data being constantly added. The php front end is analyzing this data. Alot of it is repeating..

Sound bad?

I looked at the computer while ruinning a query.. 0% cpu use.. all hard disk swapping.
Grooverider is offline   Reply With Quote
Old 04-04-2008, 04:51 PM   #4 (permalink)
Phelps McManus
I'm dangerous!
 
Join Date: Jan 2002
Location: Atlanta
Posts: 891
I am not familiar with MySQL. Is there a field for each column of the table that lets you specify whether that column is indexed? A search on a non-indexed column, would be pretty slow.

Also, if that company column is indexed (it needs to be), then the database spends all that time indexing your string. Maio's solution would help a lot. You should almost always use string tables (two columns: primary key and the string) for repeating data.

I don't know how arbitrarily you came up with that 50,000 number for the data. It implies that they repeat an average of 30-35 times over 1.8m rows. If this is the case then you should really turn this into a relational database and create a 3rd table called 'CompanyData'. This table would have a foreign key for the 'Company' and 'adata' tables (so two columns total).

Unless a new company or adata item is created, all updates will be to this CompanyData table. Since it is just 2 integers, it should be fairly fast.
__________________
If God didn't want us to eat animals, why did He make them out of meat?
Phelps McManus is offline   Reply With Quote
Old 04-04-2008, 05:25 PM   #5 (permalink)
Grooverider
hildog
 
Grooverider's Avatar
 
Join Date: Aug 2002
Posts: 860
+0 Internets
I don't know anything about indexing (or databases really!!), but it's clear I need to entirely redesign the database. Anyone fancy talking to me about it on some kind of messenger program? :P

On the 30x repeat.. true! :/


Or I could keep asking questions in this thread.. i'll reply more concisely after Charlie Wilsons War finishes

Thanks for the help so far..
Grooverider is offline   Reply With Quote
Old 04-04-2008, 05:39 PM   #6 (permalink)
Fog
Registered User
 
Join Date: Feb 2006
Posts: 1,657
+5 Internets
This is a rude thing to ask, but I have to admit I'm curious as to how you got in charge of designing and administering this big corporate database of something if you don't know how to design databases.

I suggest you just head out to the bookstore and get a couple good database books if you want to learn this stuff. It's not a very good idea to design large and interconnected systems like database schemas with the mindset of hacking something together and then solving the problems that pop up one at a time.
Fog is offline   Reply With Quote
Old 04-04-2008, 05:48 PM   #7 (permalink)
Grooverider
hildog
 
Grooverider's Avatar
 
Join Date: Aug 2002
Posts: 860
+0 Internets
Hehe, it's not a big corporate database.. I don't know where you got that idea. It's a personal project but it must be made right as it could turn into something more.

The data comes from backend system coded in Visual Studio (something that I have a lot more experience in than this).

As for the book - I bought a SQL book today but I wanted some real human input.
Grooverider is offline   Reply With Quote
Old 04-04-2008, 08:39 PM   #8 (permalink)
Phelps McManus
I'm dangerous!
 
Join Date: Jan 2002
Location: Atlanta
Posts: 891
This scenario is pretty common. I am an engineer, not a programmer, but I got stuck with our billing database at my last company (~50 employees, $11mil/yr revenue). It was written in Access, bloated, and slow as molasses.

The company I work at now has ~1000 employees, ~500mil/yr revenue, and uses PeopleSoft (on MSSQL server). They must have a team of at least 10 contractors who work on it. Costs a goddamn fortune (millions per year) and is less user friendly than the Access database. We need cheat sheets to fill out expenses.

My neighbor is a DBA for the cable company here (Time Warner). Based on discussions with him and my past 2 jobs, I don't think most people in charge of their company's databases understand normalization and proper database design. There is just not enough talent to meet the demand, so a lot of people just get thrown at it because they are good with computers.

Anyway, shoot your questions on here and I can try my best to answer them. I am no DBA but I have a decent grasp of design.
__________________
If God didn't want us to eat animals, why did He make them out of meat?

Last edited by Phelps McManus : 04-04-2008 at 08:41 PM.
Phelps McManus is offline   Reply With Quote
Old 04-05-2008, 03:36 AM   #9 (permalink)
Grooverider
hildog
 
Grooverider's Avatar
 
Join Date: Aug 2002
Posts: 860
+0 Internets
OK, I appreciate this.. here goes. I think I need a complete redesign.

Currently I have two tables in a database, one called kwlist. The other one called adata.

Kwlist has two fields one is a list of unique words or phrases, the other is a primary key, incrementing integer.

Adata has multiple fields, one being adataid an incrementing primary key (integer). Then 'kw' which is a plain text version of a word from the kwlist. Would it make my database more efficient if I replaced the plaintext kw in adata with it's related ID in the KWlist table?

After this we have company name (often repeating), text data (repeating), more text data (repeating), Date added, Time added, a true/false field.

Problem with company name is that they are constantly being added so i'm not sure about the link table thing.

I have no idea how to make it so repeating data isn't added, perhaps add a field called 'last scanned' and if it's a duplicated - update 'last scanned' rather than add the new field?

Problem is, the thing adding all the data is a Visual Studio back end that simply pulls the data and pushes it into a MySQL query. It cannot make intelligent decisions unless I add another mySQL query to check a situation for example..

I hope this all made sense

At this moment i'd be pretty willing to pay someone for a database design because I think it's a real skill that probably takes a long time to develop.
Grooverider is offline   Reply With Quote
Old 04-05-2008, 05:24 AM   #10 (permalink)
mylek
Registered User
 
mylek's Avatar
 
Join Date: Jan 2006
Posts: 69
+4 Internets
It seems that you are overlooking some fundamental aspects of how a database should be structured. Your database should be broken up into multiple tables depending on expected usage. Any columns that you intend to search or sort by must be indexed for good read performance. Matching text as a means to look up data should be avoided and used only when necessary. Whenever possible instead use a primary key or unique id. In the case of your statement in the OP I believe that the company column will need to have a fulltext index to perform your query fastest.

It seems to me that you should have separate company table. Primary key being the company's unique id, and your key as one of its columns. When your program interacts with the database it would use the company's unique id to reference information. Using a primary key is the fastest possible way to get information from the database. (Your primary key should be some kind of integer)

Could be something like this:
[KEY_WORDS] id, word
[COMPANIES] id, some_key, key_word_id, name
[LOG] id, company_id, note1, note2, date_added, time_added, hidden

Assuming your program can know the company's ID when referencing it your statement could be this "SELECT some_key FROM companies WHERE id = '2'"

Or if you are associating multiple key words with each company:
[KEY_WORDS] id, word
[COMPANY_WORDS] id, key_word_id, company_id
[COMPANIES] id, some_key, name
[LOG] id, company_id, note1, note2, date_added, time_added, hidden

I'm guessing as to what you are intending to do so this may be totally off track.

Last edited by mylek : 04-05-2008 at 05:28 AM.
mylek is offline   Reply With Quote
Old 04-05-2008, 05:55 AM   #11 (permalink)
Grooverider
hildog
 
Grooverider's Avatar
 
Join Date: Aug 2002
Posts: 860
+0 Internets
Ok, so I construct the database like this:

[company list]
- {company id} - {company name in text}

[kw list]
- {kw id} - {word/phriase in text}

[log]
- {kw id, instead of kw list from text} - {company id, instead of company in text} - {various data} - {various data}

Now say I want to pull a list of phrases that the company is associated with in the database in the log. Would this be faster than my previous set up?

i.e. this time i'm doing a : search log for ID etc etc join the tables, or something?
Grooverider is offline   Reply With Quote
Old 04-05-2008, 06:32 AM   #12 (permalink)
Phelps McManus
I'm dangerous!
 
Join Date: Jan 2002
Location: Atlanta
Posts: 891
Quote:
Originally Posted by Grooverider View Post
Currently I have two tables in a database, one called kwlist. The other one called adata.
I recommend breaking up kwlist into Companies and Items. The Company table should be significantly smaller and therefore quicker to search than a combined string table with 50,000 other items (especially when using LIKE).

Quote:
Adata has multiple fields, one being adataid an incrementing primary key (integer). Then 'kw' which is a plain text version of a word from the kwlist. Would it make my database more efficient if I replaced the plaintext kw in adata with it's related ID in the KWlist table?
Absolutely use the id from the appropriate string table rather than the string itself. The time saved from comparing integers instead of strings will more than make up for the time incurred for performing a JOIN. It will also significantly reduce the size of the database.

Quote:
After this we have company name (often repeating), text data (repeating), more text data (repeating), Date added, Time added, a true/false field.
Like I said, replace the CompanyName column with a foreign key (companyID) into a Company string table. The other strings are OK as long as they aren't indexed or repeated often. Consider combining the Date and Time columns into a single DateTime column. Make sure this is stored as actual DateTime (8 bytes I think) data and not as a string.

Quote:
Problem with company name is that they are constantly being added so i'm not sure about the link table thing.
Unique company names are constantly being added? How many companies are in the database now? How often would new companies be added? I can't tell if you are confusing "adding a company" with "adding a new data record for an existing company".

Quote:
I have no idea how to make it so repeating data isn't added, perhaps add a field called 'last scanned' and if it's a duplicated - update 'last scanned' rather than add the new field?
When data gets submitted by the web front end, you check if the strings already exist in their respective string tables. If they don't, insert them. If they do exist, get their ID (or get the result of the insert). Then you use these IDs when you perform an insert on your main transaction table (called Log your later post).

This should eliminate redundant data without the need of a 'last scanned' field or anything like that.

Quote:
Problem is, the thing adding all the data is a Visual Studio back end that simply pulls the data and pushes it into a MySQL query. It cannot make intelligent decisions unless I add another mySQL query to check a situation for example..
Well, you can get fancy (and quickly leaving the area of my limited experience) and have a Stored Procedure in the database that accepts CompanyName, KWtext, various data, various data. This procedure will then do all of the work of replacing the text with IDs (inserting new ones as needed) and performing the insert (including grabbing the time stamp).

Quote:
At this moment i'd be pretty willing to pay someone for a database design because I think it's a real skill that probably takes a long time to develop.
Well this is a pretty simple database with only 3 tables, two of which are simply string tables. Imagine if these were Items and Companies that sold them and then you had Customers and Invoices of items they purchased. You can also start getting into Transactions (which can be rolled back or commited), Triggers, and a bunch of other crap that a good DBA should know.
__________________
If God didn't want us to eat animals, why did He make them out of meat?
Phelps McManus is offline   Reply With Quote
Old 04-05-2008, 06:42 AM   #13 (permalink)
Phelps McManus
I'm dangerous!
 
Join Date: Jan 2002
Location: Atlanta
Posts: 891
Quote:
Originally Posted by Grooverider View Post
i.e. this time i'm doing a : search log for ID etc etc join the tables, or something?
You would search the company table and then JOIN its ID into the log table.

SELECT Company.Name, KwList.Name, Log.Data1, Log.Data2, Log.Timestamp
FROM (Company JOIN Log ON (Company.ID = Log.CompanyID)) LEFT JOIN KwList ON (Log.KwListID = KwList.ID)
WHERE Company.Name LIKE "%MBNA%"

or maybe

SELECT Company.Name, KwList.Name, Log.Data1, Log.Data2, Log.Timestamp
FROM (Select * FROM Company WHERE Company.Name LIKE "%MBNA%") JOIN Log ON (Company.ID = Log.CompanyID)) LEFT JOIN KwList ON (Log.KwListID = KwList.ID)
__________________
If God didn't want us to eat animals, why did He make them out of meat?
Phelps McManus is offline   Reply With Quote
Old 04-05-2008, 11:04 AM   #14 (permalink)
Grooverider
hildog
 
Grooverider's Avatar
 
Join Date: Aug 2002
Posts: 860
+0 Internets
Just a quick reply, as I will really have to concentrate when I read your post. However - new companies are constantly being added.
Grooverider is offline   Reply With Quote
Old 04-11-2008, 03:18 PM   #15 (permalink)
Niceshot23
Registered User
 
Join Date: Oct 2004
Posts: 112
+0 Internets
select * from kwlist where kw in (select kw from adata where company like "%MBNA%")

since they are both not primary or foreign keys, then do not join the tables ... that will make it do a table scan which is the slowest querying you can do in sql ... this should retrieve the data you are looking for without having to do a redesign of the db ....


and im not too sure if mysql uses it ... but so that you dont keep inserting the same data, then use
"IF NOT EXISTS(select field1, field2 from table where field a = 'random' and b = 'random2')
insert into table
select 'random1', 'random2'"
Niceshot23 is offline   Reply With Quote
Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes Rate This Thread
Rate This Thread:

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is On
Trackbacks are On
Pingbacks are On
Refbacks are On
uberguilds network



All times are GMT -7. The time now is 01:32 AM.


Powered by vBulletin® Version 3.6.3
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.0.0 RC6