This is a script to download emails from Gmail and store them in a SQLite database for further analysis. I find it extremely useful to have all my emails in a database to run queries on them. For example, I can find out how many emails I received per sender, which emails take the most space, and which emails from which sender I never read.
Clone this repository: git clone https://github.com/marcboeker/gmail-to-sqlite.git.
Install the requirements: pip install -r requirements.txt
Open the OAuth consent screen and create a new consent screen. You only need to provide a name and contact data.
Next open Create OAuth client ID and create credentials for a Desktop app. Download the credentials file and save it under credentials.json in the root of this repository.
Run the script: python main.py sync --data-dir path/to/your/data where --<data-dir> is the path where all data is stored. This creates a SQLite database in <data-dir>/messages.db and stores the user credentials under <data-dir>/credentials.json.
After the script has finished, you can query the database using, for example, the sqlite3 command line tool: sqlite3 <data-dir>/messages.db.
You can run the script again to sync all new messages. Provide --full-sync to force a full sync. However, this will only update the read status, the labels, and the last indexed timestamp for existing messages.
usage: main.py [-h] [--data-dir DATA_DIR] [--update] {sync, sync-message}
Main commands:
sync Sync emails from Gmail to the database.
sync-message Sync a single message from Gmail to the database.
--data-dir DATA_DIR Path to the directory where all data is stored.
--full-sync Force a full sync.
--message-id MESSAGE_ID Sync only the message with the given message id.
CREATETABLEIF NOT EXISTS "messages" (
"id"INTEGERNOT NULLPRIMARY KEY, -- internal id"message_id"TEXTNOT NULL, -- Gmail message id"thread_id"TEXTNOT NULL, -- Gmail thread id"sender" JSON NOT NULL, -- Sender as JSON in the form {"name": "Foo Bar", "email": "[email protected]"}"recipients" JSON NOT NULL, -- JSON object: {-- "to": [{"email": "[email protected]", "name": "Foo Bar"}, ...],-- "cc": [{"email": "[email protected]", "name": "Foo Bar"}, ...],-- "bcc": [{"email": "[email protected]", "name": "Foo Bar"}, ...]--}"labels" JSON NOT NULL, -- JSON array: ["INBOX", "UNREAD", ...]"subject"TEXTNOT NULL, -- Subject of the email"body"TEXTNOT NULL, -- Extracted body either als HTML or plain text"size"INTEGERNOT NULL, -- Size reported by Gmail"timestamp" DATETIME NOT NULL, -- When the email was sent/received"is_read"INTEGERNOT NULL, -- 0=Unread, 1=Read"is_outgoing"INTEGERNOT NULL, -- 0=Incoming, 1=Outgoing"last_indexed" DATETIME NOT NULL-- Timestamp when the email was last seen on the server
);
Get the number of emails per sender
SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
GROUP BY sender->>'$.email'ORDER BY count DESC
Show the number of unread emails by sender
This is great to determine who is spamming you the most with uninteresting emails.
SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
WHERE is_read =0GROUP BY sender->>'$.email'ORDER BY count DESC
Get the number of emails for a specific period
For years: strftime('%Y', timestamp)
For months in a year: strftime('%m', timestamp)
For days in a month: strftime('%d', timestamp)
For weekdays: strftime('%w', timestamp)
For hours in a day: strftime('%H', timestamp)
SELECT strftime('%Y', timestamp) AS period, COUNT(*) AS count
FROM messages
GROUP BY period
ORDER BY count DESC
Find all newsletters and group them by sender
This is an amateurish way to find all newsletters and group them by sender. It's not perfect, but it's a start. You could also use
SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
WHERE body LIKE'%newsletter%'OR body LIKE'%unsubscribe%'GROUP BY sender->>'$.email'ORDER BY count DESC
Show who has sent the largest emails in MB
SELECT sender->>'$.email', sum(size)/1024/1024AS size
FROM messages
GROUP BY sender->>'$.email'ORDER BY size DESC
Count the number of emails that I have sent to myself
SELECTcount(*)
FROM messages
WHERE EXISTS (
SELECT1FROM json_each(messages.recipients->'$.to')
WHERE json_extract(value, '$.email') ='[email protected]'
)
AND sender->>'$.email'='[email protected]'
List the senders who have sent me the largest total volume of emails in megabytes
SELECT sender->>'$.email', sum(size)/1024/1024as total_size
FROM messages
WHERE is_outgoing=false
GROUP BY sender->>'$.email'ORDER BY total_size DESC
Detect deleted emails and mark them as deleted in the database.